Apimart
Log inSign Up
Streaming Claude API — Key Features Explained

Streaming Claude API — Key Features Explained

How Claude API streaming works — SSE events, backend proxy setup, TTFT and cost control, dropped-stream recovery, and production reliability tips.

Tutorial

If you want Claude to feel fast, turn on streaming. Instead of waiting 15–25 seconds for one full reply, users can often see the first token in about 300–800 ms.

Here’s the short version:

  • I’d use SSE streaming when I want replies to appear as they’re generated
  • I’d keep Claude calls behind a backend proxy to protect API keys
  • I’d watch TTFT (time to first token), max_tokens, and disconnects to control UX and spend
  • I’d parse streamed events in order: message_start, content block events, message_delta, then message_stop
  • I’d buffer tool JSON fragments until the content block ends
  • I’d plan for dropped connections, because Claude does not resume streams natively

A few numbers stand out:

  • Haiku 4.5: about 410 ms TTFT
  • Sonnet 4.6: about 720 ms TTFT
  • Opus 4.7: about 980 ms TTFT
  • Default proxy timeouts of 30–60 seconds may cut off long outputs
  • Longer read timeouts of 120–300 seconds are often needed
  • Batch jobs can cost 50% less than standard token rates

In plain English: streaming changes delivery, not the model. Claude API sends text in small event chunks over a live connection, and my app rebuilds the final answer on screen. For chat, copilots, and assistants, that usually means a better user feel, fewer idle timeout issues, and more work on the client and backend sides.

Here’s what matters most:

AreaWhat I’d focus on
SpeedFirst-token time, prompt size, model choice
Setupstream: true, SSE handling, proxy buffering off
UXSpinner first, then token-by-token text, plus a Stop button
CostTrack input/output tokens, set max_tokens, cancel dead sessions
ReliabilityRetry only retryable errors, keep partial text, restart with a continuation prompt

So if I were setting this up, I’d think about streaming as a UX and infrastructure choice, not just an API flag.

Claude API Streaming: Model Speed, Cost & Key Metrics Compared
Claude API Streaming: Model Speed, Cost & Key Metrics Compared

Building with Claude API - Part 4 - Response Streaming

Claude

How Claude API Streaming Works at the Protocol Level

Turning on streaming doesn’t change the model itself. It changes how the output gets delivered. Instead of waiting for one complete response, Claude sends small chunks as they’re ready.

Which endpoint and request settings enable streaming

Streaming uses the same Claude Messages API endpoint, /v1/messages, as a normal request. The only difference is adding "stream": true to the request body [1][3]. That flag shifts the connection from a standard request-response flow to Server-Sent Events (SSE), where the server pushes chunks as they’re generated.

The official Python and TypeScript SDKs include stream helpers that handle event parsing and message assembly for you. When the stream ends, calling .get_final_message() in Python or .finalMessage() in TypeScript returns the completed response, along with token counts and the stop reason [1][4].

Those settings start the stream. The next piece is the event flow that comes back over the connection.

What events arrive during a stream

A stream follows a set sequence of named events. Each event carries the data your client needs to rebuild the full response the right way.

Event TypePurposeKey Data
message_startOpens the streamMessage ID, role, model, and input_tokens
content_block_startBegins a new content segmentBlock index and block type (text, tool_use, or thinking)
content_block_deltaSends partial contentBlock index and text_delta, input_json_delta, or thinking_delta
content_block_stopCloses a content segmentIndex of the completed block
message_deltaUpdates message-level stateCumulative output_tokens and stop_reason
message_stopEnds the streamFinal signal to close the connection
pingHeartbeatSent during model processing
errorReports a stream errorError type and message

This is the data your client buffers and shows in real time. For plain text output, append each text_delta to a local buffer as events arrive. That’s how the full response string comes together piece by piece. Tool-call inputs work a little differently: they arrive as input_json_delta fragments, so you should buffer them first and parse only after content_block_stop [1][6].

When APIMart is relevant for Claude streaming

GccAi

If Claude streaming sits inside a multi-model app, APIMart can give you one place to route Claude access and streaming through a unified LLM API. That tends to matter most when you want the same streaming flow across web and mobile clients.

How to Add Claude Streaming to Web and Mobile Apps

SSE, HTTP streaming, or WebSocket wrappers: which one to use

Once you understand how Claude sends streamed events, the next step is getting those events to web and mobile clients. The main question is simple: which transport fits your app best?

TransportSetup ComplexityBrowser FitConnection Behavior
Server-Sent Events (SSE)LowNative (EventSource/Fetch)Unidirectional; automatic reconnection
Plain HTTP StreamingMediumRequires ReadableStreamNo built-in event structure or reconnection
WebSocket WrapperHighRequires library/wrapperBidirectional; stateful; may be blocked by some corporate firewalls

For most browser-based apps, SSE is the default choice. It works well with most proxies and CDNs, and the browser support story is simple.

WebSockets can still make sense. If your app already depends on bidirectional, live communication, they may fit right in. But for a standard chat app, they often add more moving parts than they’re worth.

Why a backend proxy is usually the safest architecture

After you pick the transport, put the stream handling behind your server. Keep Claude requests behind a backend proxy to protect your API keys for various models [9][5].

That proxy layer is also the right place to:

  • inject system prompts
  • enforce per-user rate limits
  • log first-token and last-token timing

Set X-Accel-Buffering: no to stop proxy buffering [2][8]. Also connect the client’s abort signal to the stream. That way, if a user stops generation, the request is canceled right away instead of burning tokens on a response nobody will read.

One more gotcha: default timeouts of 30–60 seconds can cut off long responses [9][2]. In production, use read timeouts of about 120–300 seconds for longer generations.

What good client-side UX looks like during a live response

Once the stream is protected and relayed, the job shifts to the UI. This is where streaming either feels smooth or clunky.

Show a "Thinking..." indicator or spinner as soon as the user submits a prompt. That covers the first-token delay. As soon as the first content_block_delta arrives, remove the indicator and begin rendering text.

Then append each text_delta as it comes in so the response appears with a typewriter effect. To keep the interface from stuttering, batch updates with requestAnimationFrame so you don’t trigger too many re-renders. Auto-scroll should follow the response while it’s generating, but it should back off if the user scrolls up to read older content.

Always include a "Stop" button wired to AbortController so it can cancel the Fetch request. That should end the stream cleanly without wiping out text that’s already on the screen.

If the connection drops, keep the partial output visible. For recovery, hold on to that partial text and restart with a continuation prompt, since Claude does not resume a dropped stream natively [3][1].

How to Manage Latency, Cost, and Reliability in Production

Once the stream is live, production work comes down to three controls: latency, spend, and recovery.

How streaming changes perceived latency

The main UX metric here is Time to First Token (TTFT): the time until the first word shows up. In a streaming app, that first visible token shapes the whole feel of the product. If it appears fast, the app feels responsive. If it drags, users notice.

Benchmarks show clear gaps across Claude models: Haiku 4.5 reaches about 410 ms TTFT, Sonnet 4.6 lands around 720 ms, and Opus 4.7 comes in at roughly 980 ms [3]. The simple rule is to use the fastest model that still clears your quality bar.

Prompt size matters too. Bigger context windows can push TTFT toward 1–3 seconds [5]. So if your system prompt has extra instructions, old rules, or bloated examples, cutting them can make the app feel a lot snappier.

How to track token usage and control USD costs

Streaming and batch cost the same per token. The only thing that changes is when the output arrives. Token counts are included in the stream itself: the message_start event includes usage.input_tokens, and the message_delta event near the end includes the final cumulative usage.output_tokens [1][4]. Your backend should store that final usage data after the stream ends so billing stays accurate.

Set max_tokens on every request. It gives you a hard stop and keeps long generations from driving up costs [1][11]. You should also watch for client disconnects on the server side. If the user is gone and generation keeps running, you're still burning tokens for no reason [5][7].

For jobs that don't need live output, the math changes. Batch summarization, offline processing, and overnight report generation are good examples. In those cases, the Batch API offers a 50% discount on normal token rates [5][10].

Claude 3.5 Sonnet costs about $3.00 per 1 million input tokens and $15.00 per 1 million output tokens on standard streaming [8]. With the Batch API, async workloads cost half as much.

Those usage numbers also help with billing and rate-limit monitoring. For more strategies on managing high-volume requests, see our AI API cost tips.

How to prevent dropped streams and recover safely

Claude does not have a server-side resume feature [1][3]. If a stream drops, send the partial output in a new request and ask Claude to continue from the break point.

Only retry failures that are likely to clear on their own.

Error CodeTypeAction
429Rate limitRetry with backoff: 5s → 10s → 20s
529Server overloadedRetry after 30–60 seconds
408Connection timeoutReconnect immediately
4xxClient errorDo not retry; fix the request

After that, the final step is choosing which streaming features matter most.

Conclusion: Which Claude Streaming Features Matter Most

Once you’ve looked at latency, cost, and reliability, Claude streaming boils down to three things: perceived responsiveness, clear event signals, and solid error handling.

Streaming matters because first-token delivery makes Claude feel fast and responsive.

The SSE event flow gives you text deltas, usage counts, and error signals in real time [1][3].

Use a backend proxy to protect keys, turn off buffering, and deal with disconnects [2][8]. For multi-model apps, APIMart can centralize Claude streaming, logging, and billing.

In production, fast TTFT, usage tracking, and proxy resilience matter most.

FAQs

When should I use streaming instead of a normal Claude API response?

Use streaming when your app is user-facing. It sends tokens as they’re generated, so responses feel almost instant. That small shift can make the whole product feel faster and smoother.

Streaming works best for real-time chat, long answers, and agent workflows with tool calls. It also helps you avoid timeouts when outputs run long or token limits are high.

Skip it for short, simple requests or batch jobs where throughput matters more than latency.

What should I do if a Claude stream disconnects mid-response?

If a Claude stream disconnects, Server-Sent Events won't pick up where they left off on their own. Your app needs to handle that part.

When the connection drops, you can either retry the full request or show the partial output you've already saved. Use a try/except block to catch APIConnectionError or APIStatusError, and keep a reference to the content you've accumulated so far.

If you want the stream to continue with less friction, track the last event ID and manually replay the stream from that point.

How can I reduce streaming costs without hurting UX?

Focus on prompt tuning and efficient session handling. Set max_tokens hard limits, keep prompts short, and add an early cancel option so users can stop generation once they have what they need.

For batch workloads that don't need immediate back-and-forth, use non-streaming mode. For accurate cost tracking, wait for the final message_stop event instead of estimating from mid-stream chunks.

Ready to build?

Choose the model you want in the model marketplace

Try chat, image and video models in the APIMart model marketplace, and experience model capabilities quickly with one unified API.

Chat modelsImage modelsVideo models
Explore model marketplace