AI Context Management

ai_ml PHP 8.0+ Intermediate

debt(d9/e5/b5/t7)

d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). detection_hints.automated is 'no' and the only signal is a code_pattern regex for 'messages' => $history; truncation failures are silent, surfacing as wrong answers or mid-answer cutoffs invisible in the prompt text. No automated tooling catches degraded relevance.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix (token budget with response headroom plus trim/summarise oldest history) is more than a one-liner — it requires building budgeting and trimming logic, token counting, and retrieval-chunk selection that touches the assembly component.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). applies_to spans web, cli, and queue-worker contexts and the assembly logic shapes cost, latency, and quality on every request; unmanaged context becomes a recurring tax across all LLM-touching work streams, but it's confinable to the prompt-assembly layer rather than defining the whole system shape.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception ('bigger context windows mean just send everything') is the canonical wrong belief, and the lost-in-the-middle / silent-truncation behaviour contradicts the intuitive assumption that more context always helps — the obvious approach actively degrades results.

About DEBT scoring → scored by claude-opus-4-8 · 2026-06-08 · reviewed by human

Also Known As

context window management, prompt assembly, context engineering, context budgeting

TL;DR

The practice of selecting, ordering, and trimming what goes into an LLM's context window to maximise relevance while staying under token limits.

Explanation

AI context management is the discipline of deciding what information an LLM sees on each request. Because a model only knows what fits in its context window, the quality of an answer depends as much on what you put in front of it as on the prompt wording. Context typically combines several sources: a system prompt, conversation history, retrieved documents (RAG), tool outputs, and the user's current message. Each of these competes for a finite token budget, so management means prioritising, summarising, truncating, and ordering these pieces deliberately rather than dumping everything in.
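As a rough illustration of combining these sources deliberately, the sketch below assembles a payload in the same `system` / `messages` shape used in the code examples on this page. The function name `assemblePayload`, the `[Passage N]` labels, and the `Question:` prefix are illustrative assumptions, not part of any particular API.

```php
<?php
declare(strict_types=1);

/**
 * Assemble a request payload from the usual context sources in a deliberate order.
 *
 * @param list<array{role: string, content: string}> $history  Earlier turns, oldest first.
 * @param list<string>                               $passages Retrieved passages, most relevant first.
 * @return array{system: string, messages: list<array{role: string, content: string}>}
 */
function assemblePayload(string $system, array $history, array $passages, string $question): array
{
    $parts = [];
    foreach ($passages as $i => $passage) {
        // Label each passage so the model (and your logs) can tell them apart.
        $parts[] = sprintf("[Passage %d]\n%s", $i + 1, $passage);
    }
    // The question goes last so it is not buried under reference material.
    $parts[] = 'Question: ' . $question;

    $messages = $history;
    $messages[] = ['role' => 'user', 'content' => implode("\n\n", $parts)];

    return ['system' => $system, 'messages' => $messages];
}

$payload = assemblePayload(
    'Answer using only the passages provided.',
    [
        ['role' => 'user', 'content' => 'Hi, I have a billing question.'],
        ['role' => 'assistant', 'content' => 'Sure, what would you like to know?'],
    ],
    ['Refunds are available within 30 days of purchase.'],
    'Can I still get a refund?'
);
```

This sketch does no budgeting at all; in practice each input would be trimmed to its share of the token budget before being passed in.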

Key techniques include sliding-window history (keep the last N turns), summarisation (compress old turns into a running summary), retrieval (pull only the chunks relevant to the current query), and structured slotting (reserve fixed budgets for system instructions, retrieved context, and history). Token counting is central: you must measure how many tokens each part consumes and reserve headroom for the model's response. Position matters too: models attend more reliably to content at the start and end of the context (the 'lost in the middle' effect), so critical instructions and the user query should not be buried in the middle.
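Structured slotting and token counting can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: `estimateTokens` uses a rough four-bytes-per-token heuristic (an approximation, not a real tokenizer), and `allocateSlots`, the slot names, and the percentages are hypothetical choices.

```php
<?php
declare(strict_types=1);

// Rough estimate: about 4 bytes per token for English text.
// This is an assumption; swap in a real tokenizer when accuracy matters.
function estimateTokens(string $text): int
{
    return (int) ceil(strlen($text) / 4);
}

/**
 * Split a context window into fixed slots after reserving response headroom.
 *
 * @param array<string, int> $percentages Share of the input budget per slot (must sum to 100 or less).
 * @return array<string, int> Token budget per slot.
 */
function allocateSlots(int $windowTokens, int $responseReserve, array $percentages): array
{
    $inputBudget = $windowTokens - $responseReserve;
    if ($inputBudget <= 0) {
        throw new InvalidArgumentException('Response reserve leaves no room for input.');
    }
    if (array_sum($percentages) > 100) {
        throw new InvalidArgumentException('Slot percentages exceed 100.');
    }

    $slots = [];
    foreach ($percentages as $name => $percent) {
        // Integer arithmetic avoids floating-point rounding surprises.
        $slots[$name] = intdiv($inputBudget * $percent, 100);
    }
    return $slots;
}

$slots = allocateSlots(100_000, 4_000, [
    'system'    => 10,
    'retrieved' => 50,
    'history'   => 40,
]);
```

Each part of the prompt is then trimmed to its own slot, so a long retrieved document cannot crowd out the system instructions or the recent history.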

In PHP applications, context management usually happens in an application service layer that assembles the request payload before calling the model API. You count tokens (often via an approximation or a tokenizer library), enforce a budget, drop or summarise the oldest history, inject retrieved passages, and log what was actually sent for debugging. Poor context management shows up as truncated answers, ignored instructions, runaway costs, and hallucinations, where the model invents details to fill in for content that was silently dropped. Good context management is the difference between a chatbot that remembers the conversation and stays cheap, and one that forgets, contradicts itself, or burns the token budget on irrelevant boilerplate.
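The "summarise the oldest history" step might look like the sketch below. Everything here is hypothetical: `compactHistory` is an illustrative name, and both `$countTokens` and `$summarise` are callables you would supply (for example, a tokenizer wrapper and a call to a cheaper model). For simplicity, the loop budgets against the old summary's size and does not re-measure the summary after it grows.

```php
<?php
declare(strict_types=1);

/**
 * Fold the oldest turns into a running summary until the history fits the budget.
 *
 * @param list<array{role: string, content: string}> $history     Oldest first.
 * @param callable(string): int                      $countTokens Hypothetical token counter.
 * @param callable(string, array): string            $summarise   Hypothetical summariser.
 * @return array{summary: string, history: array, evicted: int}
 */
function compactHistory(
    array $history,
    string $summary,
    int $budget,
    callable $countTokens,
    callable $summarise,
    int $keepRecent = 2
): array {
    $total = static fn (array $turns): int =>
        array_sum(array_map(static fn (array $t): int => $countTokens($t['content']), $turns));

    // Evict from the front (oldest) but always keep the most recent turns verbatim.
    $evicted = [];
    while (count($history) > $keepRecent && $total($history) + $countTokens($summary) > $budget) {
        $evicted[] = array_shift($history);
    }

    if ($evicted !== []) {
        $summary = $summarise($summary, $evicted);
    }

    return ['summary' => $summary, 'history' => $history, 'evicted' => count($evicted)];
}

// Stub callables for illustration only.
$countTokens = static fn (string $text): int => (int) ceil(strlen($text) / 4);
$summarise = static fn (string $summary, array $turns): string =>
    trim($summary . ' [' . count($turns) . ' earlier turns condensed]');

$result = compactHistory(
    [
        ['role' => 'user', 'content' => str_repeat('a', 40)],
        ['role' => 'assistant', 'content' => str_repeat('b', 40)],
        ['role' => 'user', 'content' => str_repeat('c', 40)],
    ],
    '',
    20,
    $countTokens,
    $summarise
);
```

The `evicted` count is returned so the caller can log how much history was compressed on each request, which makes silent context loss visible when debugging.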

Common Misconception

✗ Bigger context windows mean you no longer need to manage context, so you can just send everything. In practice, stuffing context degrades relevance (lost-in-the-middle), raises cost and latency, and increases hallucination risk; deliberate selection still beats brute-force inclusion.

Why It Matters

Context determines answer quality, cost, and latency on every request, so disciplined assembly directly affects user experience and bill size. Unmanaged context silently truncates history or documents, producing wrong answers that are hard to debug because the failure is invisible in the prompt text.

Common Mistakes

Appending unbounded conversation history until the window overflows and the oldest messages are silently dropped.
Not counting tokens before sending, so requests fail or truncate unpredictably under load.
Placing critical instructions in the middle of a large context where the model attends to them least.
Including full documents instead of retrieving and inserting only the relevant chunks.
Reserving no headroom for the response, causing the model to cut off mid-answer.
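
To illustrate the alternative to including full documents, the sketch below greedily picks the highest-scoring retrieved chunks that fit a token budget. It assumes each chunk already carries a relevance `score` from your retrieval step; `selectChunks` and the `$countTokens` callable are hypothetical names, and the byte-based counter in the usage is only an approximation.

```php
<?php
declare(strict_types=1);

/**
 * Greedily pick the highest-scoring chunks that fit a token budget.
 *
 * @param list<array{text: string, score: float}> $chunks      Retrieval results.
 * @param callable(string): int                   $countTokens Hypothetical token counter.
 * @return list<string> Selected chunk texts, best first.
 */
function selectChunks(array $chunks, int $budget, callable $countTokens): array
{
    // Highest relevance first.
    usort($chunks, static fn (array $a, array $b): int => $b['score'] <=> $a['score']);

    $selected = [];
    $used = 0;
    foreach ($chunks as $chunk) {
        $cost = $countTokens($chunk['text']);
        if ($used + $cost > $budget) {
            continue; // Skip a chunk that does not fit; a smaller one may still fit.
        }
        $used += $cost;
        $selected[] = $chunk['text'];
    }
    return $selected;
}

$passages = selectChunks(
    [
        ['text' => 'Refunds are available within 30 days of purchase.', 'score' => 0.91],
        ['text' => 'Our office is closed on public holidays.', 'score' => 0.22],
        ['text' => 'Refunds are issued to the original payment method.', 'score' => 0.87],
    ],
    30,
    static fn (string $text): int => (int) ceil(strlen($text) / 4)
);
```

Greedy selection is a simple heuristic rather than an optimal packing, but it keeps the most relevant material and guarantees the retrieved slot never exceeds its budget.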

Avoid When

Single-shot prompts with small fixed inputs where the whole payload trivially fits the window.
Prototypes where the conversation length is bounded and cost is negligible.

When To Use

Multi-turn chat where history grows unbounded across a session.
RAG pipelines that inject retrieved documents alongside history and instructions.
Any system where token cost, latency, or truncation failures are observable in production.

Code Examples

✗ Vulnerable

<?php
// Unbounded history - eventually overflows the window and silently truncates
class Chat {
    private array $history = [];

    public function ask(string $message, ClaudeClient $client): string {
        $this->history[] = ['role' => 'user', 'content' => $message];
        // Sends entire history every time, no token counting, no budget
        $response = $client->complete([
            'system'   => $this->bigSystemPrompt(),
            'messages' => $this->history,
        ]);
        $this->history[] = ['role' => 'assistant', 'content' => $response];
        return $response;
    }
}

✓ Fixed

<?php
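// Budgeted context: reserve response headroom, trim oldest turns.
// ClaudeClient and Tokenizer are application-defined types, and
// systemPrompt() is assumed to exist on this class.
class Chat {
    private array $history = [];
    private int $maxContextTokens = 100_000;
    private int $responseReserve = 4_000;

    public function ask(string $message, ClaudeClient $client, Tokenizer $tok): string {
        $system = $this->systemPrompt();
        $budget = $this->maxContextTokens - $this->responseReserve - $tok->count($system);

        // Fail loudly rather than send a request with no room for the new message
        if ($tok->count($message) > $budget) {
            throw new LengthException('Message alone exceeds the context budget.');
        }
        $this->history[] = ['role' => 'user', 'content' => $message];

        // Keep the most recent turns that fit the budget
        $messages = [];
        $used = 0;
        foreach (array_reverse($this->history) as $turn) {
            $cost = $tok->count($turn['content']);
            if ($used + $cost > $budget) {
                break;
            }
            $used += $cost;
            array_unshift($messages, $turn);
        }

        // Chat APIs commonly expect the window to open with a user turn,
        // so drop a leading assistant turn left behind by trimming
        while ($messages !== [] && $messages[0]['role'] !== 'user') {
            array_shift($messages);
        }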

        $response = $client->complete([
            'system'     => $system,
            'messages'   => $messages,
            'max_tokens' => $this->responseReserve,
        ]);
        $this->history[] = ['role' => 'assistant', 'content' => $response];
        return $response;
    }
}