AI Context Management
debt(d9/e5/b5/t7)
Closest to 'silent in production until users hit it' (d9). detection_hints.automated is 'no' and the only signal is a code_pattern regex for 'messages' => $history; truncation failures are silent, surfacing as wrong answers or mid-answer cutoffs invisible in the prompt text. No automated tooling catches degraded relevance.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix (token budget with response headroom plus trim/summarise oldest history) is more than a one-liner — it requires building budgeting and trimming logic, token counting, and retrieval-chunk selection that touches the assembly component.
Closest to 'persistent productivity tax' (b5). applies_to spans web, cli, and queue-worker contexts and the assembly logic shapes cost, latency, and quality on every request; unmanaged context becomes a recurring tax across all LLM-touching work streams, but it's confinable to the prompt-assembly layer rather than defining the whole system shape.
Closest to 'serious trap' (t7). The misconception ('bigger context windows mean just send everything') is the canonical wrong belief, and the lost-in-the-middle / silent-truncation behaviour contradicts the intuitive assumption that more context always helps — the obvious approach actively degrades results.
Also Known As
TL;DR
Explanation
AI context management is the discipline of deciding what information an LLM sees on each request. Because a model only knows what fits in its context window, the quality of an answer depends as much on what you put in front of it as on the prompt wording. Context typically combines several sources: a system prompt, conversation history, retrieved documents (RAG), tool outputs, and the user's current message. Each of these competes for a finite token budget, so management means prioritising, summarising, truncating, and ordering these pieces deliberately rather than dumping everything in.
Key techniques include sliding-window history (keep the last N turns), summarisation (compress old turns into a running summary), retrieval (pull only the chunks relevant to the current query), and structured slotting (reserve fixed budgets for system instructions, retrieved context, and history). Token counting is central: you must measure how many tokens each part consumes and reserve headroom for the model's response. Position matters too: models tend to attend more reliably to content at the start and end of the context than to content in the middle (the 'lost in the middle' effect), so critical instructions and the user query should not be buried.
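Structured slotting comes down to a small budget calculation, sketched below. The helper name allocateSlots, the slot names, the window size, and the percentages are all illustrative assumptions, not values taken from any particular model or library.

```php
<?php
declare(strict_types=1);

/**
 * Split a context window into fixed per-slot token budgets.
 * Hypothetical helper for illustration only.
 *
 * @param int               $window          total tokens the model accepts (assumed)
 * @param int               $responseReserve tokens held back for the model's answer
 * @param array<string,int> $percentages     share of the remaining budget per slot, in percent
 * @return array<string,int> token budget per slot
 */
function allocateSlots(int $window, int $responseReserve, array $percentages): array
{
    $available = $window - $responseReserve;
    if ($available <= 0) {
        throw new InvalidArgumentException('Response reserve leaves no room for input.');
    }
    if (array_sum($percentages) > 100) {
        throw new InvalidArgumentException('Slot percentages must not exceed 100.');
    }

    $slots = [];
    foreach ($percentages as $name => $percent) {
        // Integer division rounds down, so the slots can never exceed the budget
        $slots[$name] = intdiv($available * $percent, 100);
    }
    return $slots;
}

$slots = allocateSlots(100_000, 4_000, [
    'system'    => 10,
    'retrieved' => 40,
    'history'   => 50,
]);
```

With these assumed numbers, 96,000 tokens remain after the response reserve, and each slot gets a hard ceiling that the assembly code can enforce independently.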
In PHP applications, context management usually happens in an application service layer that assembles the request payload before calling the model API. You count tokens (often via an approximation or a tokenizer library), enforce a budget, drop or summarise the oldest history, inject retrieved passages, and log what was actually sent for debugging. Poor context management shows up as truncated answers, ignored instructions, runaway costs, and hallucinations, where the model invents details to fill in for content that was silently dropped. Good context management is the difference between a chatbot that remembers the conversation and stays cheap, and one that forgets, contradicts itself, or burns the token budget on irrelevant boilerplate.
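As a rough sketch of the counting and logging steps, the snippet below estimates tokens and builds a log line describing the payload. The four-bytes-per-token ratio is a rule-of-thumb assumption for English text, and estimateTokens and describePayload are hypothetical names; a tokenizer that matches your model will give accurate counts.

```php
<?php
declare(strict_types=1);

/**
 * Rough token estimate: about 4 bytes per token (an assumption, not exact).
 */
function estimateTokens(string $text): int
{
    // Round up so short strings never count as zero tokens
    return intdiv(strlen($text) + 3, 4);
}

/**
 * Summarise what is about to be sent so it can be logged.
 * The payload shape (system string + list of role/content messages)
 * mirrors the examples in this entry, not a specific client library.
 */
function describePayload(string $system, array $messages): array
{
    $systemTokens = estimateTokens($system);
    $messageTokens = 0;
    foreach ($messages as $message) {
        $messageTokens += estimateTokens($message['content']);
    }
    return [
        'system_tokens'  => $systemTokens,
        'message_count'  => count($messages),
        'message_tokens' => $messageTokens,
        'total_tokens'   => $systemTokens + $messageTokens,
    ];
}

$summary = describePayload('You are a helpful assistant.', [
    ['role' => 'user', 'content' => 'What is context management?'],
]);

// Hand this line to whatever logger the application already uses
$logLine = 'llm_context ' . json_encode($summary);
```

Logging counts rather than full prompt text keeps the logs small and makes it easy to spot requests that drift toward the budget.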
Common Misconception
Why It Matters
Common Mistakes
- Appending unbounded conversation history until the window overflows, at which point the request is rejected or, depending on the provider or framework, the oldest messages are silently dropped.
- Not counting tokens before sending, so requests fail or truncate unpredictably as conversations grow.
- Placing critical instructions in the middle of a large context where the model attends to them least.
- Including full documents instead of retrieving and inserting only the relevant chunks.
- Reserving no headroom for the response, causing the model to cut off mid-answer.
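Two of the mistakes above (inserting full documents and burying the important parts) can be avoided with a selection step and a deliberate ordering. The sketch below assumes each retrieved chunk arrives as an array with text, score, and tokens keys from whatever retriever is in use; the function names and the sample data are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Pick the highest-scoring chunks that fit within $budget tokens.
 * Assumed chunk shape: ['text' => string, 'score' => float, 'tokens' => int].
 */
function selectChunks(array $chunks, int $budget): array
{
    // Most relevant first
    usort($chunks, fn (array $a, array $b): int => $b['score'] <=> $a['score']);

    $selected = [];
    $used = 0;
    foreach ($chunks as $chunk) {
        if ($used + $chunk['tokens'] > $budget) {
            continue; // too large; a smaller, lower-ranked chunk may still fit
        }
        $used += $chunk['tokens'];
        $selected[] = $chunk;
    }
    return $selected;
}

/**
 * Place the retrieved passages first and the question last,
 * so the question is not buried in the middle of the context.
 */
function buildUserMessage(array $chunks, string $question): string
{
    $passages = array_map(fn (array $c): string => $c['text'], $chunks);
    return "Reference passages:\n" . implode("\n---\n", $passages)
        . "\n\nQuestion: " . $question;
}

// Made-up sample data
$chunks = [
    ['text' => 'Refunds take 5 days.',    'score' => 0.91, 'tokens' => 6],
    ['text' => 'Founded in a garage.',    'score' => 0.20, 'tokens' => 400],
    ['text' => 'Refunds need a receipt.', 'score' => 0.85, 'tokens' => 7],
];
$picked = selectChunks($chunks, 50);
$prompt = buildUserMessage($picked, 'How long do refunds take?');
```

Here the low-scoring 400-token chunk is skipped because it would blow the 50-token budget, while the two relevant passages are kept in relevance order.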
Avoid When
- Single-shot prompts with small fixed inputs where the whole payload trivially fits the window.
- Prototypes where the conversation length is bounded and cost is negligible.
When To Use
- Multi-turn chat where history grows unbounded across a session.
- RAG pipelines that inject retrieved documents alongside history and instructions.
- Any system where token cost, latency, or truncation failures are observable in production.
Code Examples
<?php
// Unbounded history - eventually overflows the window and silently truncates
class Chat {
    private array $history = [];

    public function ask(string $message, ClaudeClient $client): string {
        $this->history[] = ['role' => 'user', 'content' => $message];
        // Sends entire history every time, no token counting, no budget
        $response = $client->complete([
            'system' => $this->bigSystemPrompt(),
            'messages' => $this->history,
        ]);
        $this->history[] = ['role' => 'assistant', 'content' => $response];
        return $response;
    }
}
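<?php
// Budgeted context: reserve response headroom, trim oldest turns
class Chat {
    private array $history = [];
    private int $maxContextTokens = 100_000;
    private int $responseReserve = 4_000;

    public function ask(string $message, ClaudeClient $client, Tokenizer $tok): string {
        $this->history[] = ['role' => 'user', 'content' => $message];
        $system = $this->systemPrompt();
        // Input budget = window - response headroom - system prompt
        $budget = $this->maxContextTokens - $this->responseReserve - $tok->count($system);

        // Walk backwards from the newest turn, keeping turns while they fit
        $messages = [];
        $used = 0;
        foreach (array_reverse($this->history) as $turn) {
            $cost = $tok->count($turn['content']);
            if ($used + $cost > $budget) break;
            $used += $cost;
            $messages[] = $turn;
        }
        $messages = array_reverse($messages); // restore chronological order

        // Fail loudly rather than send a request without the current message
        if ($messages === []) {
            throw new LengthException('Current message does not fit the context budget.');
        }
        // Some chat APIs expect the first message to be a user turn,
        // so drop any assistant turn left at the front by trimming
        while ($messages[0]['role'] !== 'user') {
            array_shift($messages);
        }

        $response = $client->complete([
            'system' => $system,
            'messages' => $messages,
            'max_tokens' => $this->responseReserve,
        ]);
        $this->history[] = ['role' => 'assistant', 'content' => $response];
        return $response;
    }
}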
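The budgeted example discards old turns outright. A summarising variant might instead fold them into a running summary, roughly as sketched below. The function compactHistory and its parameters are hypothetical, and the summariser is passed in as a callable because in practice it would likely be another model call; the stand-in used here only joins strings.

```php
<?php
declare(strict_types=1);

/**
 * Keep the last $keep turns verbatim and fold everything older into a summary.
 *
 * @param array    $history   list of ['role' => string, 'content' => string]
 * @param string   $summary   running summary so far ('' if none)
 * @param int      $keep      number of recent turns to keep verbatim
 * @param callable $summarise fn(string $previousSummary, array $oldTurns): string
 * @return array{0: string, 1: array} new summary and trimmed history
 */
function compactHistory(array $history, string $summary, int $keep, callable $summarise): array
{
    $cut = count($history) - $keep;
    if ($cut <= 0) {
        return [$summary, $history]; // nothing old enough to compress
    }
    $old    = array_slice($history, 0, $cut);
    $recent = array_slice($history, $cut);
    return [$summarise($summary, $old), $recent];
}

// Stand-in summariser for demonstration only: concatenates the old turns.
$fakeSummarise = function (string $previous, array $turns): string {
    $lines = array_map(fn (array $t): string => $t['role'] . ': ' . $t['content'], $turns);
    return trim($previous . ' ' . implode(' | ', $lines));
};

$history = [
    ['role' => 'user',      'content' => 'My order is #123.'],
    ['role' => 'assistant', 'content' => 'Thanks, I found it.'],
    ['role' => 'user',      'content' => 'When will it ship?'],
];
[$summary, $history] = compactHistory($history, '', 1, $fakeSummarise);
```

The resulting summary would typically be injected near the system prompt, so facts such as the order number survive even after the original turns leave the window.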