LLM Streaming Responses
debt(d7/e3/b3/t7)
Closest to 'only careful code review or runtime testing' (d7). No detection tools are specified in detection_hints. Misuse, such as a missing ob_flush() call, a missing X-Accel-Buffering header, or an unhandled stop reason, will not be caught by a compiler, linter, or standard SAST tool. These failures surface only at runtime: either chunks are silently buffered and the client sees nothing until the full response completes, or responses are silently truncated. Finding the problem takes end-to-end testing under real network conditions or careful code inspection.
Closest to 'simple parameterised fix' (e3). The quick_fix confirms the fix is a handful of header calls and a flush pattern after each token. It's more than a single-line patch because it requires coordinating Content-Type header, X-Accel-Buffering header, ob_flush() + flush() calls, and correct SSE event formatting—but it's confined to a single endpoint/component with no cross-cutting refactor needed.
Closest to 'localised tax' (b3). The streaming pattern applies only to web contexts (applies_to: web) and is scoped to specific AI-response endpoints. It introduces a persistent but contained tax: each streaming endpoint must maintain the flush discipline and header setup. The rest of the codebase is unaffected, making this a localised rather than system-wide burden.
Closest to 'serious trap' (t7). The misconception field identifies the primary trap explicitly: developers assume WebSockets are required and miss that SSE over plain HTTP is sufficient and simpler. Additionally, the common_mistakes list reveals multiple secondary traps that contradict reasonable assumptions—PHP/Nginx output buffering silently swallows chunks (the 'obvious' streaming code appears to work but delivers nothing), and proxy buffering requires a non-obvious vendor-specific header. These contradictions are serious because the failure mode is silent and the assumed solution (WebSockets) is more complex than necessary.
Also Known As
TL;DR
Explanation
Without streaming, a user waits 5–15 seconds staring at a spinner before any text appears. With streaming, text appears within ~200ms and continues flowing. On the server side, LLM APIs return server-sent events or chunked transfer encoding. The PHP SDK receives delta objects with partial text. The PHP backend must forward these to the browser either via SSE (text/event-stream) or by buffering and returning the full response. For real-time display, SSE from PHP to browser is the standard pattern. Each delta contains a small text fragment; accumulate them server-side for logging while streaming each to the client. Stop reasons (end_turn, max_tokens, tool_use) indicate why generation stopped.
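The forwarding loop described above can be sketched without any SDK. This is a minimal illustration, assuming events arrive as decoded arrays shaped like the Anthropic Messages streaming events (content_block_delta carrying delta.text, message_delta carrying delta.stop_reason). The helper names sseEvent() and relayDeltas() are hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: frame one JSON payload as a single SSE event.
// An SSE event is a "data:" line terminated by a blank line.
function sseEvent(array $payload): string
{
    return 'data: ' . json_encode($payload, JSON_THROW_ON_ERROR) . "\n\n";
}

// Hypothetical helper: relay each text fragment through $emit while
// accumulating the complete text server-side (for logging, for instance).
// Assumes each $event is a decoded array, not an SDK object.
function relayDeltas(iterable $events, callable $emit): array
{
    $fullText = '';
    $stopReason = null;

    foreach ($events as $event) {
        $type = $event['type'] ?? '';
        if ($type === 'content_block_delta') {
            $fragment = $event['delta']['text'] ?? '';
            if ($fragment !== '') {
                $fullText .= $fragment;
                $emit(sseEvent(['delta' => $fragment]));
            }
        } elseif ($type === 'message_delta') {
            // The stop reason arrives near the end of the stream.
            $stopReason = $event['delta']['stop_reason'] ?? $stopReason;
        }
    }

    // Tell the client the stream is finished and why it stopped.
    $emit(sseEvent(['done' => true, 'stop_reason' => $stopReason]));

    return ['text' => $fullText, 'stop_reason' => $stopReason];
}

// Usage with canned events; a real endpoint would echo and flush in $emit.
$sent = [];
$result = relayDeltas(
    [
        ['type' => 'content_block_delta', 'delta' => ['type' => 'text_delta', 'text' => 'Hel']],
        ['type' => 'content_block_delta', 'delta' => ['type' => 'text_delta', 'text' => 'lo']],
        ['type' => 'message_delta', 'delta' => ['stop_reason' => 'end_turn']],
    ],
    function (string $chunk) use (&$sent): void {
        $sent[] = $chunk;
    }
);
```

Passing the emit step in as a callable keeps the framing and accumulation logic testable without a web server. In a real endpoint the callable would echo the chunk and flush it.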
Common Misconception
Streaming LLM output to the browser requires WebSockets. It does not: SSE over plain HTTP is sufficient and simpler.
Why It Matters
Common Mistakes
- Forgetting ob_flush(). Output buffering in PHP (and Nginx) swallows chunks; call both ob_flush() and flush() after each SSE event, guarding ob_flush() with ob_get_level() because it raises a notice when no buffer is active.
- Not setting X-Accel-Buffering: no. Nginx buffers proxy responses by default; this header disables buffering for the streaming endpoint.
- Not handling the stream's stop reason. Always check it: max_tokens means the response was truncated at the token limit, and tool_use means the model stopped to request a tool call rather than finishing its answer.
- Logging the full response without accumulating from deltas. Each SSE event contains only a fragment; accumulate all deltas server-side if you need to log the complete response.
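The first three mistakes above can each be isolated in a small helper. The following is a sketch only, assuming PHP 8 (for match); flushChunk(), sseHeaders(), and describeStopReason() are hypothetical names introduced for illustration, and the stop reasons covered are just the three named in this entry.

```php
<?php
// Hypothetical helper: push the current chunk out of PHP right away.
function flushChunk(): void
{
    // ob_flush() raises a notice when no output buffer is active, so guard it.
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}

// Hypothetical helper: response headers for an SSE endpoint.
function sseHeaders(): array
{
    return [
        'Content-Type'      => 'text/event-stream',
        'Cache-Control'     => 'no-cache',
        'X-Accel-Buffering' => 'no', // asks Nginx not to buffer this response
    ];
}

// Hypothetical helper: interpret a stop reason from the stream.
// Anything not listed is treated as incomplete, as a cautious default.
function describeStopReason(?string $stopReason): array
{
    return match ($stopReason) {
        'end_turn'   => ['complete' => true,  'note' => 'model finished normally'],
        'max_tokens' => ['complete' => false, 'note' => 'truncated at the max_tokens limit'],
        'tool_use'   => ['complete' => false, 'note' => 'model is requesting a tool call'],
        default      => ['complete' => false, 'note' => 'reason not covered here or missing'],
    };
}
```

A streaming callback would call flushChunk() after every echoed event, and could include the result of describeStopReason() in its final event so the client can show a "response was cut off" notice when complete is false.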
Code Examples
<?php
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

// ❌ Buffering the full response before sending: the user waits 10+ seconds
Route::post('/chat', function (Request $request) {
    $response = Anthropic::messages()->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'messages' => [['role' => 'user', 'content' => $request->message]],
    ]);
    return response()->json(['text' => $response->content[0]->text]);
    // The user sees nothing until generation finishes, then gets the full text at once
});
<?php
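use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

// ✅ Streaming SSE response from PHP
Route::post('/chat', function (Request $request) {
    return response()->stream(function () use ($request) {
        $stream = Anthropic::messages()->stream([
            'model' => 'claude-sonnet-4-20250514',
            'max_tokens' => 1000,
            'messages' => [['role' => 'user', 'content' => $request->message]],
        ]);
        $fullText = '';     // Accumulate deltas for server-side logging
        $stopReason = null; // Why generation stopped (end_turn, max_tokens, tool_use)
        foreach ($stream as $event) {
            if ($event->type === 'content_block_delta') {
                $delta = $event->delta->text ?? '';
                $fullText .= $delta;
                echo 'data: ' . json_encode(['delta' => $delta]) . "\n\n";
                if (ob_get_level() > 0) {
                    ob_flush(); // Only valid while an output buffer is active
                }
                flush(); // Send chunk immediately
            } elseif ($event->type === 'message_delta') {
                // Assumes the SDK exposes the API's stop_reason on this event
                $stopReason = $event->delta->stop_reason ?? $stopReason;
            }
        }
        echo 'data: ' . json_encode(['done' => true, 'stop_reason' => $stopReason]) . "\n\n";
        if (ob_get_level() > 0) {
            ob_flush();
        }
        flush();
        logger()->info('LLM response', ['text' => $fullText, 'stop_reason' => $stopReason]);
    }, 200, [
        // Set headers here rather than with header() inside the callback
        'Content-Type' => 'text/event-stream',
        'Cache-Control' => 'no-cache',
        'X-Accel-Buffering' => 'no', // Disable Nginx buffering
    ]);
});
// Client: EventSource only issues GET requests, so for this POST route read the
// response body with fetch() and a stream reader, or expose a GET route and use
// new EventSource(...) with an onmessage handler that parses each data payload.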