API & Technical · March 20, 2026 · 12 min read

Claude Streaming API Guide: Real-Time Responses for Enterprise

Streaming transforms Claude from a slow batch processor into a responsive AI assistant. This guide covers everything from the SSE protocol to production architecture patterns — so your users see results in seconds, not minutes.


Why Streaming Matters for UX

Without streaming, your application sends a request to Claude and waits — sometimes 10, 20, or 30 seconds — for the complete response to arrive before displaying anything to the user. This creates a perception of extreme slowness even when Claude is working correctly. Users assume something is broken and abandon the interaction.

Streaming solves this by delivering Claude's response token by token as it's generated. The first words appear within 1–2 seconds. The user sees Claude "thinking" in real time. The experience feels dramatically faster even if the total generation time is identical.

Perceived-performance data from our enterprise deployments: teams that implement streaming consistently report a 62% reduction in user-reported "AI is slow" complaints even when actual generation time doesn't change. Users tolerate watching text appear; they don't tolerate staring at a loading spinner for 20 seconds.

Beyond perception, streaming enables a class of UX patterns that are impossible without it: "stop generating" buttons that let users interrupt Claude mid-response, progressive rendering of long documents that users can start reading immediately, and real-time status updates for complex multi-step workflows.

Understanding Server-Sent Events

Claude's streaming API uses Server-Sent Events (SSE) — a unidirectional HTTP protocol where the server pushes events to the client over a persistent connection. Unlike WebSockets, SSE is one-way (server to client only), which is all that's needed for streaming text.

The SSE protocol sends events in a specific text format over the HTTP response body. Each event has a type and data field. Claude's streaming API sends several event types in sequence:

```
# Example SSE stream from Claude API
event: message_start
data: {"type":"message_start","message":{"id":"msg_...","model":"claude-3-5-sonnet-20241022",...}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"The"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" key"}}

# ... many more delta events ...

event: message_stop
data: {"type":"message_stop"}
```

Your application accumulates the text_delta values from successive content_block_delta events to build the complete response. Each delta is typically 1–5 tokens — a few characters to a few words.
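A minimal sketch of that accumulation, assuming the raw SSE body has already arrived as text. In practice the SDKs do this parsing for you; `accumulate_text` and the sample payload here are illustrative only:

```python
import json

def accumulate_text(sse_body: str) -> str:
    """Accumulate text_delta payloads from a raw SSE response body.

    Simplified: a real SSE body arrives in network chunks and events are
    separated by blank lines; here we only inspect complete `data:` lines.
    """
    text = ""
    for line in sse_body.splitlines():
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                text += delta.get("text", "")
    return text

sample = """\
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"The"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" key"}}

event: message_stop
data: {"type":"message_stop"}
"""
print(accumulate_text(sample))  # The key
```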

Basic Streaming Implementation

Anthropic's Python and TypeScript SDKs handle the SSE protocol for you. Here's the minimal streaming implementation in Python:

```python
import anthropic

client = anthropic.Anthropic()

# Streaming with a context manager
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyze this contract..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Get the complete message after streaming
    message = stream.get_final_message()

print(f"\nTotal tokens: {message.usage.input_tokens + message.usage.output_tokens}")
```

The TypeScript equivalent for Node.js backends:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [{ role: "user", content: prompt }],
});

// Process each text delta
stream.on("text", (text) => {
  process.stdout.write(text); // Or: res.write(text) to proxy to an HTTP client
});

const finalMessage = await stream.finalMessage();
```


Backend Proxy Architecture

You must never call the Claude API directly from browser code — that would expose your API key. All Claude API calls, including streaming, route through a server you control. Your backend acts as a proxy that:

  1. Receives the request from your frontend
  2. Authenticates the user (ensure they're authorized to use Claude)
  3. Constructs the Claude API request (adding system prompt, context, etc.)
  4. Calls the Claude streaming API
  5. Proxies the SSE stream back to the browser
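Step 5, the proxying itself, can be sketched framework-agnostically as a generator that re-yields upstream text deltas as bytes for a chunked HTTP response. The `fake_deltas` iterable below stands in for the SDK's `stream.text_stream`:

```python
def proxy_stream(claude_text_stream):
    """Re-yield upstream text deltas as UTF-8 bytes for a chunked response."""
    for text in claude_text_stream:
        yield text.encode("utf-8")

# In FastAPI, for example, a generator like this could be wrapped in
# StreamingResponse(proxy_stream(stream.text_stream), media_type="text/plain")

fake_deltas = iter(["Streaming", " works"])
body = b"".join(proxy_stream(fake_deltas))
print(body.decode())  # Streaming works
```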

In Next.js (the most common enterprise React framework), using the App Router:

```typescript
// app/api/chat/route.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function POST(req: Request) {
  const { message } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      const claudeStream = client.messages.stream({
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 2048,
        system: "You are a helpful enterprise assistant.",
        messages: [{ role: "user", content: message }],
      });

      claudeStream.on("text", (text) => {
        controller.enqueue(encoder.encode(text));
      });

      await claudeStream.finalMessage();
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Transfer-Encoding": "chunked",
    },
  });
}
```

On the frontend, consume this stream with the Fetch API's ReadableStream interface:

```typescript
const response = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ message: userInput }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();
let accumulated = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  accumulated += decoder.decode(value, { stream: true });
  setDisplayText(accumulated); // Update React state with each chunk
}
```

UX Patterns for Streaming

Streaming unlocks UX patterns that make enterprise Claude applications feel polished and professional. The patterns that matter most:

Typing indicator before first token: There's always a delay between the user submitting a request and the first token arriving (typically 1–3 seconds). Show a typing indicator (animated dots or a blinking cursor) immediately on submit, before any text appears. This eliminates the perception of a frozen UI in that critical first window.

Progressive markdown rendering: If Claude's output will be displayed as formatted text, don't wait until streaming completes to render markdown. Incrementally parse and render as tokens arrive. Libraries like react-markdown support streaming input. Users reading a long document shouldn't have to wait for it to fully generate before seeing the formatted first paragraph.

Stop generation button: For long responses, let users stop generation if they've read enough or the response isn't what they wanted. This requires maintaining a reference to the AbortController you pass to the fetch call, then calling abort() when the user clicks Stop. Surface the partial response — don't clear it on stop.

Token counter: For power users in document-heavy workflows, showing a running token count as the stream progresses helps them understand context consumption and budget. This data is available in the stream's message_start event (input tokens) and final message (output tokens).
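A sketch of extracting those counts from parsed stream events, assuming dicts shaped like Claude's payloads: message_start carries input-token usage, and message_delta events carry the cumulative output-token count. The sample events are illustrative:

```python
def usage_from_events(events):
    """Track input/output token counts from parsed streaming events."""
    input_tokens = output_tokens = 0
    for event in events:
        if event.get("type") == "message_start":
            input_tokens = event["message"]["usage"]["input_tokens"]
        elif event.get("type") == "message_delta":
            # Each message_delta reports the cumulative output token count
            output_tokens = event["usage"]["output_tokens"]
    return input_tokens, output_tokens

sample_events = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 1200}}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hi"}},
    {"type": "message_delta", "usage": {"output_tokens": 42}},
]
print(usage_from_events(sample_events))  # (1200, 42)
```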

Auto-scroll with user override: Long streaming responses should auto-scroll to keep the latest text in view. But if the user scrolls up (they're reading earlier content), stop auto-scrolling. Resume auto-scroll when they scroll back to the bottom. This is the standard pattern in all major chat applications — users expect it.

Error Handling in Streaming

Streaming error handling is more complex than non-streaming because errors can occur at several points: before the stream starts, mid-stream, when the connection drops, or when a timeout fires.

Pre-stream errors: Standard HTTP errors (4xx, 5xx) returned before the stream opens. Handle with normal try/catch around the fetch call. Display a user-friendly error message and offer retry.

Mid-stream errors: Claude emits error events in the SSE stream itself. These appear as events with type error. The SDK's stream event handler includes an error event to catch these. Surface the partial response to the user with an error banner — don't discard what was generated before the error.

Connection drops: Network interruptions close the SSE connection mid-stream. Implement a heartbeat timeout: if no bytes arrive for 30 seconds, assume the connection has dropped and attempt to reconnect with exponential backoff. For user-facing apps, show "Connection interrupted — retrying..." rather than silently failing.
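Both pieces can be sketched in a few lines: a stall detector for the heartbeat timeout and jittered exponential backoff delays for reconnects. The constants (30-second stall, 5 retries) mirror the text but are tunable assumptions:

```python
import random
import time

class StallDetector:
    """Flags a dropped stream when no bytes arrive within `timeout` seconds."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.last_byte = time.monotonic()

    def bytes_received(self) -> None:
        self.last_byte = time.monotonic()

    def stalled(self) -> bool:
        return time.monotonic() - self.last_byte > self.timeout

def backoff_delays(retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield capped exponential backoff delays with up to 1s of jitter."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt) + random.uniform(0, 1.0)

# A reconnect loop sleeps for each delay before retrying the stream request,
# starting a fresh StallDetector once bytes begin arriving again.
print([round(d, 1) for d in backoff_delays()])
```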

Timeout configuration: Default HTTP timeouts (often 30–60 seconds) will abort Claude responses on long generations. Configure your HTTP client to use long timeouts for Claude calls — 120 seconds minimum, 300 seconds for complex documents. Claude streaming sends bytes continuously, so standard timeout mechanisms (which trigger if no bytes arrive) work correctly with streaming. But connection timeout settings (maximum total time) must be extended.
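With the Python SDK, for example, the timeout can be overridden at client construction or per request (a configuration sketch; the 300-second figure matches the guidance above, and the prompt is a placeholder):

```python
import anthropic

# Client-wide timeout override for long streaming generations
client = anthropic.Anthropic(timeout=300.0)

# Or per request, leaving the client default untouched
with client.with_options(timeout=300.0).messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Summarize this long filing..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```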

The full production architecture — including streaming with tool use, multi-turn conversation streaming, and streaming in serverless environments — is covered in the Claude API Enterprise Guide. Teams building sophisticated streaming applications should also review the function calling guide, which covers streaming with tool use (a more complex pattern where Claude streams, calls tools, and continues streaming).

Frequently Asked Questions

Claude Streaming API

When should I use streaming vs. non-streaming Claude API calls?

Use streaming for any user-facing interface where Claude generates text that will be displayed to a human. Streaming dramatically improves perceived performance — users see the first words within 1–2 seconds rather than waiting 10–30 seconds for the complete response. Use non-streaming for: batch processing where responses are processed programmatically (not displayed), simple classification tasks that return short responses, and any workflow where you need the complete response before taking an action.

How do I handle streaming errors in production?

Streaming errors can occur mid-stream (after the stream has started), which is different from non-streaming errors that return a clean HTTP error code. Implement: (1) Stream start timeout — if no tokens arrive within 10 seconds, the connection is likely dropped; (2) Mid-stream error events — Claude emits error events in the SSE stream; (3) Connection reset handling — reconnect with exponential backoff if the connection drops; (4) Partial response recovery — if the stream breaks mid-response, decide whether to retry the full request or surface a partial result to the user with an error message.

Can I stream Claude responses through my backend to a browser?

Yes — and this is the required architecture for security reasons (you must never expose your API key to the browser). The pattern: browser sends a request to your backend; backend calls Claude streaming API; backend proxies the SSE stream to the browser using chunked transfer encoding or WebSockets. Most frameworks have built-in support for this: Next.js Route Handlers support streaming, FastAPI supports SSE natively, and Express supports chunked responses. The added latency from the proxy hop is typically 10–30ms — negligible compared to the streaming UX benefit.

What is the typical first-token latency for Claude streaming?

First-token latency (time to first byte, TTFB) for Claude streaming typically ranges from 800ms to 3 seconds depending on: model (Haiku is fastest, Opus slowest), input token count (larger inputs take longer to process before the first output token), and current API load. Claude 3.5 Sonnet typically returns the first token in 1–2 seconds for standard-length inputs. This is the metric users perceive as 'how fast Claude starts responding' — even if the total generation time is 15 seconds, users find an 800ms TTFB acceptable because they see progress immediately.
