Installation & Setup
The official Anthropic Python SDK is the standard way to integrate Claude into Python applications. It provides typed interfaces, automatic retries, and streaming support, and it handles the low-level HTTP details so you can focus on your application logic.
Install with pip: pip install anthropic. The SDK requires Python 3.8+ and has minimal dependencies. For production environments, pin the version in your requirements file, and check the Anthropic SDK changelog before upgrading major versions.
Authentication is handled via environment variable or explicit parameter. The SDK automatically reads ANTHROPIC_API_KEY from your environment — the recommended approach for production. Never hardcode API keys in source code. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) to inject the key at runtime, and keep separate keys for development, staging, and production.
```python
import anthropic

# SDK reads ANTHROPIC_API_KEY from the environment automatically
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarise this contract in 3 bullet points."}
    ],
)

print(message.content[0].text)
```
Synchronous vs Async Client
The SDK provides two client types: Anthropic (synchronous) and AsyncAnthropic (asynchronous). Choosing correctly has significant impact on your application's throughput and concurrency characteristics.
When to Use Synchronous
Use the synchronous Anthropic client for scripts, CLI tools, batch data pipelines (where you're processing one item at a time), and any code that runs in a sequential context without concurrency requirements. Simple and easy to debug — the call blocks until Claude responds.
When to Use Async
Use AsyncAnthropic in any web application or service handling multiple concurrent requests. FastAPI, Starlette, Django async views, and any asyncio-based application should use the async client. While your application awaits Claude's response (typically 2–10 seconds), it can process other incoming requests — dramatically improving throughput for user-facing applications.
```python
import asyncio

import anthropic

# Async client for web applications
async_client = anthropic.AsyncAnthropic()

async def process_document(doc_text: str) -> str:
    message = await async_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are a contract analysis assistant.",
        messages=[
            {"role": "user", "content": f"Review this document:\n\n{doc_text}"}
        ],
    )
    return message.content[0].text

# Process multiple documents concurrently
async def process_batch(documents: list[str]) -> list[str]:
    tasks = [process_document(doc) for doc in documents]
    return await asyncio.gather(*tasks)
```
Streaming Responses
Streaming allows your application to receive Claude's response token by token as it's generated, rather than waiting for the complete response. This transforms user-perceived latency for long-form generation from "wait 15 seconds, then see everything at once" to "start seeing the response within 1 second, it fills in progressively." For user-facing applications generating reports, summaries, or analysis, streaming is essential.
```python
# Streaming with a context manager (recommended)
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # Final message with usage data
    final_message = stream.get_final_message()

print(f"\nTokens used: {final_message.usage}")

# Async streaming for web applications
async def stream_response(prompt: str):
    async with async_client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # Yield to SSE or WebSocket
```
In production web applications, stream Claude's response directly to the client via Server-Sent Events (SSE) or WebSocket. FastAPI's StreamingResponse works well with the async streaming pattern — yield chunks from the async generator directly to the HTTP response. This gives users immediate feedback and significantly improves perceived performance for any generation task taking more than 2 seconds.
Prompt Caching in Production
Prompt caching is one of the most impactful optimisations available in the Python SDK. Mark static portions of your prompt as cacheable — they're stored between requests, dramatically reducing both cost and latency for subsequent calls that reuse the same prefix.
Caching works at the content block level: add "cache_control": {"type": "ephemeral"} to any content block you want cached. The block must be at least 1,024 tokens to qualify. Structure your prompts so static content (system prompt, reference documents, few-shot examples) comes first and is cached, while dynamic content (user query, variable data) comes after without caching.
```python
# Prompt caching example
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 2000+ tokens, static
            "cache_control": {"type": "ephemeral"},  # Cache this
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": REFERENCE_DOCUMENT,  # Static doc, cache it
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": user_query,  # Dynamic, not cached
                },
            ],
        }
    ],
)

# Check whether caching worked
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
```
In production, prompt caching provides 60–80% cost reduction and 30–50% latency improvement for applications with large, consistent system prompts or shared reference documents. The ROI is immediate and requires no architectural changes beyond adding the cache_control annotation.
Production Architecture Patterns
Beyond the SDK basics, production Claude integrations require several engineering best practices:
Client Singleton Pattern
Instantiate the SDK client once at application startup and share it across all request handlers. The SDK client manages an internal HTTP connection pool — creating a new client per request wastes resources and may exhaust file descriptors under load. Use dependency injection or a module-level singleton.
Retry Logic
The SDK has built-in retry logic for transient errors (rate limits, server errors) with exponential backoff. Configure the max_retries parameter (default 2) based on your latency tolerance. For background processing tasks, increase it to 5–6 retries. For user-facing requests, keep it lower (2–3) and surface a graceful error if retries are exhausted.
Timeout Configuration
Set explicit timeouts: timeout=httpx.Timeout(60.0, connect=5.0). A 60-second read timeout accommodates most long-form generation. For streaming, set longer timeouts. For classification tasks with short outputs, 30 seconds is sufficient. Never use default (unlimited) timeouts in production — they allow hung requests to accumulate and exhaust your connection pool.
Structured Output
For data extraction and classification, request JSON output and validate against a Pydantic schema. Claude returns valid JSON when explicitly instructed and given a schema. Validate all outputs before using them downstream — even well-designed prompts occasionally produce slightly malformed JSON that needs a retry.
Frequently Asked Questions
How do I install the Python SDK?
pip install anthropic. It supports Python 3.8+ and provides both synchronous (Anthropic) and asynchronous (AsyncAnthropic) interfaces. For production use, pin the SDK version in your requirements.txt. The SDK handles authentication, retries, and response parsing automatically.

How do I enable prompt caching?
Add "cache_control": {"type": "ephemeral"} to content blocks you want cached. The block must be at least 1,024 tokens to qualify. Structure prompts so static content (system prompt, reference docs, examples) comes first and is marked cacheable, while dynamic content (user query) comes after. Check cache_read_input_tokens in the response usage to confirm caching is working. Caching provides 60–80% cost reduction for large, consistent prompts.