Prompt Caching
debt(d7/e3/b3/t5)
Closest to 'only careful code review or runtime testing' (d7). No standard linter or SAST tool flags missing cache_control on repeated system prompts; detection requires inspecting response.usage.cache_read_input_tokens at runtime or reviewing the call pattern. detection_hints lists no specific tools.
Closest to 'simple parameterised fix' (e3). Per quick_fix, the remediation is reordering prompt blocks (static first, dynamic last) and adding a cache_control marker — slightly more than a one-line swap because it may require restructuring message assembly in one place.
Closest to 'localised tax' (b3). Applies to LLM-calling components; once the prompt-assembly helper is structured correctly, the rest of the codebase is unaffected. Doesn't reshape the system, but does impose ordering discipline on prompt construction.
Closest to 'notable trap' (t5). Per misconception, devs assume caching affects output quality; combined with documented gotchas (prefix-order requirement, minimum token thresholds, write-cost on first call, silent bypass on small prefixes), this is the kind of well-known footgun developers learn after getting burned in production.
Also Known As
TL;DR
Explanation
Prompt caching lets the provider (Anthropic, OpenAI, Google) reuse the computed key-value attention state for a prompt prefix across requests instead of reprocessing that prefix on every call. With Anthropic you mark the cacheable portion explicitly with cache_control; some providers also apply caching automatically. Cached input tokens are billed at a steep discount (Anthropic: ~10% of standard input cost on cache reads) and skip recomputation of the prefix, which dominates the compute cost in long-context applications. Trade-offs: the first request that writes the cache costs more than a non-cached request (Anthropic: ~125% of input cost), and caches expire after a TTL (Anthropic: 5 minutes by default, 1 hour with extended TTL beta). Caching only works when the cached prefix is byte-identical across requests: a single character difference invalidates it. Best fits: chatbots with long system prompts, document Q&A where the document is fixed, agents reusing tool definitions, batch evaluation jobs hitting the same prompt repeatedly. The pattern is: put static content (system prompt, tool definitions, large context) at the start of the prompt and mark it cached; put dynamic content (user message, retrieved RAG passages) at the end, uncached.
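The write premium and read discount above can be turned into a quick break-even check. The sketch below is illustrative only: the function name and constants are hypothetical, the multipliers are the approximate Anthropic figures quoted above (not authoritative pricing), and it assumes every call lands inside a single TTL window.

```php
<?php
declare(strict_types=1);

// Assumed multipliers, taken from the approximate figures quoted above.
// Check current provider pricing before relying on them.
const CACHE_WRITE_MULTIPLIER = 1.25; // first call writes the cache
const CACHE_READ_MULTIPLIER  = 0.10; // later calls read from it

/**
 * Cost of the shared prefix, in units of "standard input tokens",
 * for $calls requests that all fall inside one TTL window.
 * Hypothetical helper for illustration.
 */
function prefixCostInTokenUnits(int $prefixTokens, int $calls, bool $cached): float
{
    if ($calls <= 0) {
        return 0.0;
    }
    if (!$cached) {
        // Every call pays for the full prefix.
        return (float) ($prefixTokens * $calls);
    }
    // One cache write, then ($calls - 1) cache reads.
    return $prefixTokens * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * ($calls - 1));
}

$uncached = prefixCostInTokenUnits(8000, 10, false);
$cached   = prefixCostInTokenUnits(8000, 10, true);
printf("uncached=%.0f cached=%.0f\n", $uncached, $cached); // uncached=80000 cached=17200
```

Under these assumed multipliers a single call costs 25% more with caching, while a second call within the TTL already puts caching ahead (1.35 versus 2.0 prefix-units).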
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Putting dynamic content before static — caching requires a consistent prefix; user input must come last.
- Forgetting cache TTL — requests that arrive after the TTL has lapsed miss the cache and pay the write premium again; busy endpoints benefit most.
- Not measuring hit rate — production code can ship with caching configured but never hitting; check the usage block in API responses.
- Caching tiny prefixes — providers have minimum cacheable token counts (Anthropic: 1024 tokens for Sonnet); short prefixes silently bypass the cache.
- Treating cache writes as free — the first call that writes the cache costs more than a non-cached call; caching only pays off if the prefix is reused within the TTL window.
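To make the "measure hit rate" point concrete, here is a minimal sketch that aggregates usage blocks into a single ratio. The function name is hypothetical, and it assumes each usage record is an array with Anthropic-style fields (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`), where `input_tokens` counts only the uncached part of the prompt.

```php
<?php
declare(strict_types=1);

/**
 * Fraction of all prompt tokens that were served from cache.
 * Hypothetical helper; $usages is a list of arrays shaped like the
 * usage block of an API response.
 */
function cacheHitRate(array $usages): float
{
    $read = 0;
    $total = 0;
    foreach ($usages as $u) {
        $r = (int) ($u['cache_read_input_tokens'] ?? 0);     // served from cache
        $w = (int) ($u['cache_creation_input_tokens'] ?? 0); // written to cache
        $i = (int) ($u['input_tokens'] ?? 0);                // uncached input
        $read  += $r;
        $total += $r + $w + $i;
    }
    return $total === 0 ? 0.0 : (float) $read / $total;
}

// One cold call that writes the cache, then two warm calls that read it.
$rate = cacheHitRate([
    ['input_tokens' => 20, 'cache_creation_input_tokens' => 8000],
    ['input_tokens' => 20, 'cache_read_input_tokens' => 8000],
    ['input_tokens' => 20, 'cache_read_input_tokens' => 8000],
]);
printf("hit rate: %.3f\n", $rate);
```

A rate that stays at zero in production suggests the prefix is changing between calls, is below the minimum size, or is expiring before reuse.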
Avoid When
- Prefix is shorter than the provider's minimum cacheable size (typically ~1k tokens) — caching silently no-ops.
- Each call has a unique prefix (high-entropy system prompts) — cache will never hit and you pay the write premium.
- Calls are spaced longer than the cache TTL — every request becomes a cold write.
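The three conditions above can be folded into a single guard in the prompt-assembly path. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the defaults (1024 tokens, 300 seconds) mirror the Anthropic figures mentioned in this entry, and the caller is assumed to supply its own token count and traffic estimate.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether adding a cache marker is likely to help.
 * Hypothetical helper; the defaults are assumptions, not provider guarantees.
 */
function shouldMarkCacheable(
    int $prefixTokens,
    float $secondsBetweenCalls,
    bool $prefixIsStable,
    int $minCacheableTokens = 1024,
    int $ttlSeconds = 300
): bool {
    return $prefixIsStable                          // a unique prefix per call never hits
        && $prefixTokens >= $minCacheableTokens     // below the minimum, caching is bypassed
        && $secondsBetweenCalls < $ttlSeconds;      // slower traffic means every call is a cold write
}

var_dump(shouldMarkCacheable(8000, 30.0, true));  // bool(true)
var_dump(shouldMarkCacheable(500, 30.0, true));   // bool(false): prefix too short
var_dump(shouldMarkCacheable(8000, 600.0, true)); // bool(false): calls outlive the TTL
```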
When To Use
- Repeated calls within minutes that share a long static prefix (chatbots, agents, batch evals).
- Long-context document Q&A where the document is fixed across many user questions.
- RAG pipelines where the retrieval-augmented prefix is reused across follow-up turns.
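For the fixed-document Q&A case, one way to enforce the static-first ordering is to isolate it in a single request builder. The sketch below only builds the request array and does not call any API; the function name and sample strings are hypothetical, and the array shape follows the request format used in the code examples in this entry.

```php
<?php
declare(strict_types=1);

/**
 * Build a request with static content first and the question last.
 * Hypothetical helper for illustration.
 */
function buildDocumentQaRequest(string $instructions, string $document, string $question): array
{
    return [
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => [
            ['type' => 'text', 'text' => $instructions],
            [
                'type' => 'text',
                'text' => $document,
                // Marker on the last static block: everything up to and
                // including this block forms the cached prefix.
                'cache_control' => ['type' => 'ephemeral'],
            ],
        ],
        'messages' => [
            ['role' => 'user', 'content' => $question], // dynamic, stays uncached
        ],
    ];
}

$a = buildDocumentQaRequest('Answer from the document only.', 'document text here', 'What is the refund policy?');
$b = buildDocumentQaRequest('Answer from the document only.', 'document text here', 'Who signs the contract?');

// The static part must be identical across calls for the cache to hit.
var_dump($a['system'] === $b['system']); // bool(true)
```

Keeping timestamps, user names, and other per-request values out of `$instructions` and `$document` is what keeps the prefix identical between calls.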
Code Examples
// ❌ Long system prompt sent fresh every request — full input cost each call
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => $largeSystemPrompt, // 8000 tokens, charged in full every time
        'messages' => [[
            'role' => 'user',
            'content' => $question
        ]]
    ]);
}
// ✅ Mark the static system prompt as cached — billed at ~10% on cache hits
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => [
            [
                'type' => 'text',
                'text' => $largeSystemPrompt, // 8000 tokens
                'cache_control' => ['type' => 'ephemeral'] // mark cacheable
            ]
        ],
        'messages' => [[
            'role' => 'user',
            'content' => $question // dynamic — stays uncached at the end
        ]]
    ]);

    // Inspect cache hit/miss for cost monitoring
    $usage = $response->usage;
    error_log(sprintf(
        'cache_read=%d cache_write=%d input=%d output=%d',
        $usage->cache_read_input_tokens ?? 0,
        $usage->cache_creation_input_tokens ?? 0,
        $usage->input_tokens,
        $usage->output_tokens
    ));
}