Prompt Caching
Also Known As
TL;DR
Explanation
Prompt caching lets you reuse the computed key-value attention state for a shared prompt prefix across requests instead of reprocessing that prefix on every call. Anthropic requires you to mark the cacheable prefix explicitly with cache_control breakpoints; OpenAI and Google also apply prefix caching, largely automatically. Cached input tokens are billed at a steep discount (Anthropic: roughly 10% of the standard input price on cache reads) and skip the prefix's compute cost, which dominates in long-context applications. Trade-offs: the first request that writes the cache costs more than a non-cached request (Anthropic: roughly 125% of the input price), and caches expire after a TTL (Anthropic: 5 minutes by default, 1 hour with the extended-TTL beta). Caching only applies when the cached prefix is byte-identical across requests; a single changed character invalidates it. Best fits: chatbots with long system prompts, document Q&A where the document is fixed, agents reusing tool definitions, and batch evaluation jobs hitting the same prompt repeatedly. The pattern: put static content (system prompt, tool definitions, large context) at the start of the prompt and mark it cacheable; put dynamic content (user message, RAG retrieval) at the end, uncached.
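A quick back-of-envelope illustrates the break-even: with the Anthropic multipliers above (writes at roughly 1.25x and reads at roughly 0.10x the base input price), the cached path is already cheaper by the second request inside the TTL. The sketch below is illustrative only; the token count and per-million-token price are assumptions, not current pricing.

// Back-of-envelope break-even for a cached prefix (illustrative numbers, not real pricing).
$prefixTokens = 8000;           // size of the cached prefix
$inputPricePerMTok = 3.00;      // assumed base input price, USD per million tokens
$uncachedPerCall = $prefixTokens / 1_000_000 * $inputPricePerMTok;
$cacheWrite = $uncachedPerCall * 1.25; // first call pays the write premium
$cacheRead  = $uncachedPerCall * 0.10; // subsequent hits within the TTL

// Cost of n calls: uncached = n * $uncachedPerCall
//                  cached   = $cacheWrite + (n - 1) * $cacheRead
// 1.25 + 0.10(n - 1) < 1.00n  =>  n > ~1.28, so caching wins from the second call onward.
for ($n = 1; $n <= 5; $n++) {
    printf("n=%d uncached=%.4f cached=%.4f\n",
        $n,
        $n * $uncachedPerCall,
        $cacheWrite + ($n - 1) * $cacheRead
    );
}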
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Putting dynamic content before static — caching requires a consistent prefix; user input must come last.
- Forgetting the cache TTL — requests that arrive after the TTL has expired miss the cache and trigger a fresh cache write at the premium rate; steadily busy endpoints benefit most.
- Not measuring hit rate — production code can ship with caching configured but never hitting; check the usage block in API responses.
- Caching tiny prefixes — providers have minimum cacheable token counts (Anthropic: 1024 tokens for Sonnet); short prefixes silently bypass the cache.
- Treating cache writes as free — the first call that writes the cache costs more than a non-cached call; caching only pays off if the prefix is reused at least once more within the TTL window.
Avoid When
- Prefix is shorter than the provider's minimum cacheable size (typically ~1k tokens) — caching silently no-ops.
- Each call has a unique prefix (high-entropy system prompts) — cache will never hit and you pay the write premium.
- Calls are spaced further apart than the cache TTL — every request becomes a cold write.
When To Use
- Repeated calls within minutes that share a long static prefix (chatbots, agents, batch evals); for agents that reuse a large tool list, see the sketch after this list.
- Long-context document Q&A where the document is fixed across many user questions (a message-level caching sketch appears at the end of Code Examples).
- RAG pipelines where the retrieval-augmented prefix is reused across follow-up turns.
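For the agent case above, the same mechanism covers tool definitions: Anthropic lets you attach cache_control to the last tool so the whole tool block is cached as part of the prefix. A minimal sketch in the same client style as the examples below; the get_weather tool and the $task variable are hypothetical.

// Sketch: cache a large, static tool list in an agent loop.
// cache_control on the final tool marks the entire tools block as cacheable prefix.
$response = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 1000,
    'tools' => [
        // ...many stable tool definitions...
        [
            'name' => 'get_weather', // hypothetical tool
            'description' => 'Look up the current weather for a city.',
            'input_schema' => [
                'type' => 'object',
                'properties' => ['city' => ['type' => 'string']],
                'required' => ['city']
            ],
            'cache_control' => ['type' => 'ephemeral'] // cache all tool definitions
        ]
    ],
    'messages' => [[
        'role' => 'user',
        'content' => $task // dynamic per-step input stays uncached
    ]]
]);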
Code Examples
// ❌ Long system prompt sent fresh every request — full input cost each call
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => $largeSystemPrompt, // 8000 tokens, charged in full every time
        'messages' => [[
            'role' => 'user',
            'content' => $question
        ]]
    ]);
}
// ✅ Mark the static system prompt as cached — billed at ~10% on cache hits
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => [
            [
                'type' => 'text',
                'text' => $largeSystemPrompt, // 8000 tokens
                'cache_control' => ['type' => 'ephemeral'] // mark cacheable
            ]
        ],
        'messages' => [[
            'role' => 'user',
            'content' => $question // dynamic — stays uncached at the end
        ]]
    ]);

    // Inspect cache hit/miss for cost monitoring
    $usage = $response->usage;
    error_log(sprintf(
        'cache_read=%d cache_write=%d input=%d output=%d',
        $usage->cache_read_input_tokens ?? 0,
        $usage->cache_creation_input_tokens ?? 0,
        $usage->input_tokens,
        $usage->output_tokens
    ));
}
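For the document Q&A case under When To Use, the cache breakpoint can also sit on a content block inside the messages, so a large fixed document is cached while each question stays uncached. A minimal sketch in the same style as above; $largeDocumentText and the surrounding loop are assumptions.

// ✅ Cache a large fixed document inside the user turn; only the question changes
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'messages' => [[
            'role' => 'user',
            'content' => [
                [
                    'type' => 'text',
                    'text' => $largeDocumentText, // fixed document, byte-identical every call
                    'cache_control' => ['type' => 'ephemeral'] // caches everything up to this block
                ],
                [
                    'type' => 'text',
                    'text' => $question // dynamic; must come after the cached block
                ]
            ]
        ]]
    ]);
}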