Installation & Setup
The official Anthropic Python SDK is the standard way to integrate Claude into Python applications. It provides typed interfaces, automatic retries, and streaming support, and it handles the low-level HTTP details so you can focus on your application logic.
Install with pip: pip install anthropic. The SDK requires Python 3.8+ and has minimal dependencies. For production environments, pin the version in your requirements file, and check the Anthropic SDK changelog before upgrading major versions.
Authentication is handled via environment variable or explicit parameter. The SDK automatically reads ANTHROPIC_API_KEY from your environment — the recommended approach for production. Never hardcode API keys in source code. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) to inject the key at runtime, and keep separate keys for development, staging, and production.
```python
import anthropic

# SDK reads ANTHROPIC_API_KEY from the environment automatically
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarise this contract in 3 bullet points."}
    ],
)

print(message.content[0].text)
```
Synchronous vs Async Client
The SDK provides two client types: Anthropic (synchronous) and AsyncAnthropic (asynchronous). Choosing correctly has significant impact on your application's throughput and concurrency characteristics.
When to Use Synchronous
Use the synchronous Anthropic client for scripts, CLI tools, batch data pipelines (where you're processing one item at a time), and any code that runs in a sequential context without concurrency requirements. Simple and easy to debug — the call blocks until Claude responds.
When to Use Async
Use AsyncAnthropic in any web application or service handling multiple concurrent requests. FastAPI, Starlette, Django async views, and any asyncio-based application should use the async client. While your application awaits Claude's response (typically 2–10 seconds), it can process other incoming requests — dramatically improving throughput for user-facing applications.
```python
import asyncio

import anthropic

# Async client for web applications
async_client = anthropic.AsyncAnthropic()

async def process_document(doc_text: str) -> str:
    message = await async_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are a contract analysis assistant.",
        messages=[
            {"role": "user", "content": f"Review this document:\n\n{doc_text}"}
        ],
    )
    return message.content[0].text

# Process multiple documents concurrently
async def process_batch(documents: list[str]) -> list[str]:
    tasks = [process_document(doc) for doc in documents]
    return await asyncio.gather(*tasks)
```
Streaming Responses
Streaming allows your application to receive Claude's response token by token as it's generated, rather than waiting for the complete response. This transforms user-perceived latency for long-form generation from "wait 15 seconds, then see everything at once" to "start seeing the response within 1 second, it fills in progressively." For user-facing applications generating reports, summaries, or analysis, streaming is essential.
```python
# Streaming with a context manager (recommended)
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # Final message with usage data
    final_message = stream.get_final_message()

print(f"\nTokens used: {final_message.usage}")

# Async streaming for web applications
async def stream_response(prompt: str):
    async with async_client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # Yield to SSE or WebSocket
```
In production web applications, stream Claude's response directly to the client via Server-Sent Events (SSE) or WebSocket. FastAPI's StreamingResponse works well with the async streaming pattern — yield chunks from the async generator directly to the HTTP response. This gives users immediate feedback and significantly improves perceived performance for any generation task taking more than 2 seconds.
Prompt Caching in Production
Prompt caching is one of the most impactful optimisations available in the Python SDK. Mark static portions of your prompt as cacheable — they're stored between requests, dramatically reducing both cost and latency for subsequent calls that reuse the same prefix.
Caching works at the content block level: add "cache_control": {"type": "ephemeral"} to any content block you want cached. The block must be at least 1,024 tokens to qualify. Structure your prompts so static content (system prompt, reference documents, few-shot examples) comes first and is cached, while dynamic content (user query, variable data) comes after without caching.
```python
# Prompt caching example
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 2000+ tokens, static
            "cache_control": {"type": "ephemeral"},  # Cache this
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": REFERENCE_DOCUMENT,  # Static doc, cache it
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": user_query,  # Dynamic, not cached
                },
            ],
        }
    ],
)

# Check whether caching worked
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
```
In production, prompt caching provides 60–80% cost reduction and 30–50% latency improvement for applications with large, consistent system prompts or shared reference documents. The ROI is immediate and requires no architectural changes beyond adding the cache_control annotation.
Production Architecture Patterns
Beyond the SDK basics, production Claude integrations require several engineering best practices:
Client Singleton Pattern
Instantiate the SDK client once at application startup and share it across all request handlers. The SDK client manages an internal HTTP connection pool — creating a new client per request wastes resources and may exhaust file descriptors under load. Use dependency injection or a module-level singleton.
Retry Logic
The SDK has built-in retry logic for transient errors (rate limits, server errors) with exponential backoff. Configure the max_retries parameter (default 2) based on your latency tolerance. For background processing tasks, increase it to 5–6 retries. For user-facing requests, keep it lower (2–3) and surface a graceful error if retries are exhausted.
Timeout Configuration
Set explicit timeouts: timeout=httpx.Timeout(60.0, connect=5.0). A 60-second read timeout accommodates most long-form generation. For streaming, set longer timeouts. For classification tasks with short outputs, 30 seconds is sufficient. Never use default (unlimited) timeouts in production — they allow hung requests to accumulate and exhaust your connection pool.
Structured Output
For data extraction and classification, request JSON output and validate against a Pydantic schema. Claude returns valid JSON when explicitly instructed and given a schema. Validate all outputs before using them downstream — even well-designed prompts occasionally produce slightly malformed JSON that needs a retry.
Frequently Asked Questions
How do I install the Python SDK?
pip install anthropic. It supports Python 3.8+ and provides both synchronous (Anthropic) and asynchronous (AsyncAnthropic) interfaces. For production use, pin the SDK version in your requirements.txt. The SDK handles authentication, retries, and response parsing automatically.

How do I enable prompt caching?
Add "cache_control": {"type": "ephemeral"} to content blocks you want cached. The block must be at least 1,024 tokens to qualify. Structure prompts so static content (system prompt, reference docs, examples) comes first and is marked cacheable, while dynamic content (user query) comes after. Check cache_read_input_tokens in the response usage to confirm caching is working. Caching provides 60–80% cost reduction for large, consistent prompts.