LLM Context Window
debt(d8/e5/b7/t7)
Closest to 'silent in production until users hit it' (d9), scored d8. The detection_hints field explicitly states 'automated: no' and the code pattern (entire conversation history sent without truncation) doesn't trigger any linter or static analysis tool. Token costs grow unboundedly and the lost-in-the-middle degradation in model quality is invisible at the code level — it only surfaces as unexpected API bills or subtly wrong model outputs in production. Slightly below d9 because overflow errors can at least surface as runtime exceptions, giving some signal.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix mentions implementing a sliding window or summarisation strategy, which goes beyond a one-line patch. It requires designing and integrating a token-tracking mechanism, a conversation truncation or summarisation pipeline, and potentially a RAG retrieval layer — work that spans prompt management, API call logic, and possibly storage of conversation state across multiple components.
Closest to 'strong gravitational pull' (b7). The applies_to covers both web and cli contexts broadly. Every feature that involves LLM calls must be designed around context window constraints: conversation history management, RAG pipeline design, token budgeting, and cost control all flow from this choice. Any engineer adding new LLM-powered features must reason about context limits, making this a persistent cross-cutting architectural concern that shapes how the entire AI integration layer is structured.
Closest to 'serious trap: contradicts how a similar concept works elsewhere' (t7). The misconception field explicitly calls out the belief that 'a larger context window always means better performance', which is the canonical wrong assumption. Developers naturally assume more context means better results (analogous to more data giving better outcomes in ML), but the lost-in-the-middle problem means the opposite can be true. The truncation rule is also counter-intuitive: the obvious move is to drop the oldest content first, but that discards the system prompt, so the middle of the history is the safer place to cut. These are serious, non-obvious behavioral inversions.
Also Known As
TL;DR
Explanation
A context window is measured in tokens (roughly 3-4 characters of English text each). Claude 3.5: 200K tokens (~150K words). GPT-4 Turbo and GPT-4o: 128K (the original GPT-4 shipped with 8K and 32K variants). Smaller models: 4K-32K. Everything sent in a request is processed on every call, so longer prompts cost more and respond more slowly. Strategies for large contexts: chunking (split documents and process the pieces separately), RAG (retrieve only the relevant chunks), summarisation (compress older conversation history), and sliding windows (keep recent messages verbatim, summarise or drop older ones). Model quality also tends to degrade for information placed in the middle of very long contexts (the "lost in the middle" problem).
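The sliding-window strategy can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `estimateTokens()` and `trimHistory()` are hypothetical helpers, and the 4-characters-per-token figure is only a rough heuristic, so a real tokenizer will give different counts. Older messages are simply dropped here, where a fuller version might summarise them instead.

```php
<?php

// Rough heuristic only (assumption): ~4 characters per token.
function estimateTokens(string $text): int
{
    return (int) ceil(strlen($text) / 4);
}

/**
 * Keep as many of the most recent messages as fit within $budget tokens,
 * after reserving room for the system prompt.
 *
 * @param array<int, array{role: string, content: string}> $messages oldest first
 * @return array<int, array{role: string, content: string}> oldest first
 */
function trimHistory(string $systemPrompt, array $messages, int $budget): array
{
    // The system prompt is always kept, so it is charged against the budget first.
    $remaining = $budget - estimateTokens($systemPrompt);
    $kept = [];

    // Walk backwards from the newest message so recent turns win.
    for ($i = count($messages) - 1; $i >= 0; $i--) {
        $cost = estimateTokens($messages[$i]['content']);
        if ($cost > $remaining) {
            break; // everything older than this is dropped
        }
        $remaining -= $cost;
        array_unshift($kept, $messages[$i]);
    }

    return $kept;
}

// Usage: three ~100-token messages, but room for only two of them.
$history = [
    ['role' => 'user',      'content' => str_repeat('a', 400)],
    ['role' => 'assistant', 'content' => str_repeat('b', 400)],
    ['role' => 'user',      'content' => str_repeat('c', 400)],
];
$trimmed = trimHistory('You are a code reviewer.', $history, 250);
echo count($trimmed), " messages kept\n";
```

Because the window is cut purely by size, the kept history can begin with an assistant turn. Some chat APIs expect the first message to come from the user, so a production version may want to drop a leading assistant message as well.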
Common Misconception
Why It Matters
Common Mistakes
- Sending entire documents when only paragraphs are relevant — use RAG instead.
- Not tracking token usage — unexpected costs from large context on every call.
- Truncating from the start instead of the middle — recent messages and system prompt matter most.
- Ignoring the lost-in-the-middle problem — critical instructions buried in the middle of a long context may be ignored.
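Token tracking, the second mistake in the list above, can start out very small. The sketch below assumes the caller already has an input-token count for each request (many chat APIs report one in the response; Anthropic's Messages API, for instance, returns a `usage` object). `UsageLedger` is a hypothetical helper, and the default price of $0.003 per 1K input tokens is an illustrative figure; actual pricing depends on the model.

```php
<?php

// Hypothetical helper: accumulates input-token counts and estimates spend.
final class UsageLedger
{
    private int $inputTokens = 0;
    private int $calls = 0;

    // $pricePer1k is an assumed, illustrative price in dollars per 1K input tokens.
    public function __construct(private float $pricePer1k = 0.003)
    {
    }

    // Record the input-token count of one request.
    public function record(int $inputTokens): void
    {
        $this->inputTokens += $inputTokens;
        $this->calls++;
    }

    public function totalTokens(): int
    {
        return $this->inputTokens;
    }

    // Estimated spend in dollars across all recorded requests.
    public function estimatedCost(): float
    {
        return $this->inputTokens / 1000 * $this->pricePer1k;
    }

    // Average prompt size; a rising value is an early sign of unbounded history.
    public function averageTokensPerCall(): float
    {
        return $this->calls === 0 ? 0.0 : $this->inputTokens / $this->calls;
    }
}

// Usage: one very large prompt and one small, retrieval-based prompt.
$ledger = new UsageLedger();
$ledger->record(125000);
$ledger->record(5000);
printf("%d tokens, ~\$%.2f\n", $ledger->totalTokens(), $ledger->estimatedCost());
```

Logging the average per call alongside the total makes it easier to spot a conversation whose prompt grows on every turn before it shows up on the bill.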
Code Examples
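// Sending the entire codebase in every call is expensive.
// ($claude is assumed to be an already-configured API client.)
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/www/app/src', FilesystemIterator::SKIP_DOTS)
);
$allCode = '';
foreach ($files as $file) {
    // file_get_contents() does not expand globs, so walk the tree instead
    if ($file->getExtension() === 'php') {
        $allCode .= file_get_contents($file->getPathname()) . "\n";
    }
}
// $allCode is now ~500 KB
$response = $claude->messages->create([
    'messages' => [[
        'role' => 'user',
        'content' => $allCode . "\n\nFix the bug in UserController",
    ]],
]);
// Cost: ~125K tokens (500 KB at ~4 chars/token) * $0.003/1K = ~$0.375 per request

// RAG: retrieve only the relevant files.
// ($vectorDb is assumed to return rows that each have a 'content' field.)
$relevantFiles = $vectorDb->search('UserController bug', limit: 5);
$context = implode("\n\n", array_column($relevantFiles, 'content'));
$response = $claude->messages->create([
    'messages' => [[
        'role' => 'user',
        'content' => $context . "\n\nFix the bug in UserController",
    ]],
]);
// Cost: 5K tokens * $0.003/1K = $0.015 per request (~25x cheaper)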