
Prompt Caching

ai_ml Intermediate

Also Known As

prefix caching, cache control, prompt cache, KV cache reuse

TL;DR

API feature where a static prompt prefix (system instructions, large context) is cached server-side, dramatically reducing cost and latency on repeated calls that share the prefix.

Explanation

Prompt caching lets you mark portions of a prompt as cacheable so the provider (Anthropic, OpenAI, Google) reuses computed key-value attention state across requests instead of reprocessing the prefix every call. Cached input tokens are billed at a steep discount (Anthropic: ~10% of standard input cost on cache reads) and skip the compute cost of the prefix, which dominates for long-context applications.

Trade-offs: the first request that writes the cache costs more than a non-cached request (Anthropic: ~125% of input cost), and caches expire after a TTL (Anthropic: 5 minutes by default, 1 hour with the extended-TTL beta). Caching only works when the cached prefix is byte-identical across requests — a single character difference invalidates it.

Best fits: chatbots with long system prompts, document Q&A where the document is fixed, agents reusing tool definitions, and batch evaluation jobs hitting the same prompt repeatedly. The pattern is: put static content (system prompt, tool definitions, large context) at the start of the prompt and mark it cached; put dynamic content (user message, RAG retrieval) at the end, uncached.
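
To make the trade-off concrete, here is a rough break-even sketch in PHP. The multipliers come from the figures above (cache writes at ~125% of base input price, cache reads at ~10%); the token counts and per-million-token price are illustrative, not provider pricing.

// Break-even sketch: write premium vs. read discount, using the
// multipliers quoted above. All numbers below are illustrative.
$prefixTokens  = 8000;   // static system prompt / context
$dynamicTokens = 200;    // per-request user message
$pricePerMTok  = 3.00;   // hypothetical USD per million input tokens
$calls         = 50;     // calls sharing the prefix within the TTL

$cost = fn (int $tokens): float => $tokens / 1_000_000 * $pricePerMTok;

// Without caching: full prefix + dynamic part on every call.
$uncached = $calls * $cost($prefixTokens + $dynamicTokens);

// With caching: one write at 1.25x, remaining reads at 0.1x;
// the dynamic part is always billed in full.
$cached = $cost($prefixTokens) * 1.25
        + ($calls - 1) * $cost($prefixTokens) * 0.10
        + $calls * $cost($dynamicTokens);

printf(
    "uncached: \$%.4f  cached: \$%.4f  saving: %.0f%%\n",
    $uncached,
    $cached,
    100 * (1 - $cached / $uncached)
);
// Break-even is already at the second call: 1.25 + 0.10 < 2 x 1.00 on the prefix.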

Watch Out

Cache hits require byte-identical prefixes. Even reordering whitespace or version-stamping the system prompt with a timestamp will invalidate every cache entry.
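
A sketch of how easily this goes wrong in practice; $largeKnowledgeBase, $question and the date() call are illustrative:

// ❌ Every request gets a unique prefix, so no request ever hits the cache
$systemPrompt = "You are a support assistant.\n"
    . "Generated: " . date('c') . "\n"   // timestamp changes each call, so the prefix changes too
    . $largeKnowledgeBase;

// ✅ Keep the cached prefix byte-identical; pass per-request values outside it
$systemPrompt = "You are a support assistant.\n" . $largeKnowledgeBase;
$userMessage  = sprintf('[%s] %s', date('c'), $question);  // dynamic data goes last, uncached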

Common Misconception

Prompt caching improves output quality or reduces hallucination. It does not affect output at all — it is a pure cost and latency optimisation. The model behaves identically whether the prefix was cached or freshly computed.

Why It Matters

For PHP applications making many LLM calls with shared system prompts (chatbots, code review tools, batch processors), prompt caching can reduce input costs by ~90% on the cached portion and cut time-to-first-token by 50–80%. On long-context workloads it is the difference between economically viable and not.

Common Mistakes

  • Putting dynamic content before static — caching requires a consistent prefix; user input must come last.
  • Forgetting the cache TTL — requests spaced further apart than the TTL miss the cache and pay full price (plus the write premium); endpoints with frequent, clustered calls benefit most.
  • Not measuring hit rate — production code can ship with caching configured but never hitting; check the usage block in API responses (see the monitoring sketch after this list).
  • Caching tiny prefixes — providers have minimum cacheable token counts (Anthropic: 1024 tokens for Sonnet); short prefixes silently bypass the cache.
  • Treating cache writes as free — the first call that writes the cache costs more than a non-cached call; caching only pays off if the prefix is reused within the TTL window.
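
For the hit-rate point above, a minimal monitoring sketch. It relies only on the usage fields shown in the code example further down (cache_read_input_tokens, cache_creation_input_tokens, input_tokens); the helper class itself is hypothetical:

// Hypothetical helper: aggregate cache behaviour across many API responses.
final class CacheHitStats
{
    private int $hits = 0;
    private int $misses = 0;
    private int $cachedTokens = 0;
    private int $uncachedTokens = 0;

    public function record(object $usage): void
    {
        $read  = $usage->cache_read_input_tokens ?? 0;
        $write = $usage->cache_creation_input_tokens ?? 0;

        $read > 0 ? $this->hits++ : $this->misses++;
        $this->cachedTokens   += $read;
        $this->uncachedTokens += $write + $usage->input_tokens;
    }

    public function report(): string
    {
        $total = $this->hits + $this->misses;
        return sprintf(
            'hit rate: %.0f%% (%d/%d), cached tokens: %d, uncached tokens: %d',
            $total > 0 ? 100 * $this->hits / $total : 0,
            $this->hits,
            $total,
            $this->cachedTokens,
            $this->uncachedTokens
        );
    }
}

// After each call: $stats->record($response->usage); log $stats->report() periodically.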

Avoid When

  • Prefix is shorter than the provider's minimum cacheable size (typically ~1k tokens) — caching silently no-ops.
  • Each call has a unique prefix (high-entropy system prompts) — cache will never hit and you pay the write premium.
  • Calls are spaced longer than the cache TTL — every request becomes a cold write.

When To Use

  • Repeated calls within minutes that share a long static prefix (chatbots, agents, batch evals).
  • Long-context document Q&A where the document is fixed across many user questions.
  • RAG pipelines where the retrieval-augmented prefix is reused across follow-up turns.

Code Examples

💡 Note
First iteration writes the cache (small premium); subsequent iterations within 5 minutes hit the cache and pay ~10% of input cost on the prefix.
✗ Vulnerable
// ❌ Long system prompt sent fresh every request — full input cost each call
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model'    => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system'   => $largeSystemPrompt,  // 8000 tokens, charged in full every time
        'messages' => [[
            'role'    => 'user',
            'content' => $question
        ]]
    ]);
}
✓ Fixed
// ✅ Mark the static system prompt as cached — billed at ~10% on cache hits
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model'    => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system'   => [
            [
                'type' => 'text',
                'text' => $largeSystemPrompt,           // 8000 tokens
                'cache_control' => ['type' => 'ephemeral']  // mark cacheable
            ]
        ],
        'messages' => [[
            'role'    => 'user',
            'content' => $question  // dynamic — stays uncached at the end
        ]]
    ]);

    // Inspect cache hit/miss for cost monitoring
    $usage = $response->usage;
    error_log(sprintf(
        'cache_read=%d cache_write=%d input=%d output=%d',
        $usage->cache_read_input_tokens ?? 0,
        $usage->cache_creation_input_tokens ?? 0,
        $usage->input_tokens,
        $usage->output_tokens
    ));
}
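
The same pattern extends to tool definitions, which the explanation lists as a common reuse case for agents. A sketch assuming the same array-style client as above; the tool schema is illustrative. In Anthropic's API a cache_control marker caches everything up to and including that block, and tools precede the system prompt in the prefix, so marking the last tool caches the whole tools array:

// Agents reusing tool definitions: mark the last tool so the entire tools
// array is cached along with the system prompt.
$response = $client->messages->create([
    'model'      => 'claude-sonnet-4-20250514',
    'max_tokens' => 1000,
    'tools'      => [
        [
            'name'          => 'search_orders',                     // illustrative tool
            'description'   => 'Look up orders by customer email.',
            'input_schema'  => [
                'type'       => 'object',
                'properties' => ['email' => ['type' => 'string']],
                'required'   => ['email'],
            ],
            'cache_control' => ['type' => 'ephemeral'],  // caches all tools up to here
        ],
    ],
    'system'   => [[
        'type'          => 'text',
        'text'          => $largeSystemPrompt,
        'cache_control' => ['type' => 'ephemeral'],      // second breakpoint: tools + system
    ]],
    'messages' => [[
        'role'    => 'user',
        'content' => $question,
    ]],
]);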

Added 28 Apr 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Low
⚡ Quick Fix
Move static content (system prompt, tool defs, fixed context) to the start of the prompt and add cache_control: { type: 'ephemeral' } to that block. Verify hit rate via response.usage.cache_read_input_tokens.
📦 Applies To
web, cli, queue-worker
🔗 Prerequisites
🔍 Detection Hints
messages->create with large repeated 'system' field across a loop, no cache_control
Auto-detectable: ✓ Yes
🤖 AI Agent
Confidence: High · False Positives: Low · ✗ Manual fix · Fix: Low · Context: Function
