Prompt Caching
Also Known As
TL;DR
Explanation
Prompt caching lets you reuse the computed key-value attention state for a shared prompt prefix across requests instead of reprocessing that prefix on every call. Anthropic requires you to mark the cacheable prefix explicitly with cache_control breakpoints; OpenAI and Google also apply prefix caching, largely automatically. Cached input tokens are billed at a steep discount (Anthropic: roughly 10% of the standard input price on cache reads) and skip the prefix's compute cost, which dominates in long-context applications. Trade-offs: the first request that writes the cache costs more than a non-cached request (Anthropic: roughly 125% of the input price), and caches expire after a TTL (Anthropic: 5 minutes by default, 1 hour with the extended-TTL beta). Caching only applies when the cached prefix is byte-identical across requests; a single changed character invalidates it. Best fits: chatbots with long system prompts, document Q&A where the document is fixed, agents reusing tool definitions, and batch evaluation jobs hitting the same prompt repeatedly. The pattern: put static content (system prompt, tool definitions, large context) at the start of the prompt and mark it cacheable; put dynamic content (user message, RAG retrieval) at the end, uncached.
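A quick back-of-envelope illustrates the break-even: with the Anthropic multipliers above (writes at roughly 1.25x and reads at roughly 0.10x the base input price), the cached path is already cheaper by the second request inside the TTL. The sketch below is illustrative only; the token count and per-million-token price are assumptions, not current pricing.

// Back-of-envelope break-even for a cached prefix (illustrative numbers, not real pricing).
$prefixTokens = 8000;           // size of the cached prefix
$inputPricePerMTok = 3.00;      // assumed base input price, USD per million tokens
$uncachedPerCall = $prefixTokens / 1_000_000 * $inputPricePerMTok;
$cacheWrite = $uncachedPerCall * 1.25; // first call pays the write premium
$cacheRead  = $uncachedPerCall * 0.10; // subsequent hits within the TTL

// Cost of n calls: uncached = n * $uncachedPerCall
//                  cached   = $cacheWrite + (n - 1) * $cacheRead
// 1.25 + 0.10(n - 1) < 1.00n  =>  n > ~1.28, so caching wins from the second call onward.
for ($n = 1; $n <= 5; $n++) {
    printf("n=%d uncached=%.4f cached=%.4f\n",
        $n,
        $n * $uncachedPerCall,
        $cacheWrite + ($n - 1) * $cacheRead
    );
}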
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Putting dynamic content before static — caching requires a consistent prefix; user input must come last.
- Forgetting the cache TTL — requests that arrive after the TTL has expired miss the cache and trigger a fresh cache write at the premium rate; steadily busy endpoints benefit most.
- Not measuring hit rate — production code can ship with caching configured but never hitting; check the usage block in API responses.
- Caching tiny prefixes — providers have minimum cacheable token counts (Anthropic: 1024 tokens for Sonnet); short prefixes silently bypass the cache.
- Treating cache writes as free — the first call that writes the cache costs more than a non-cached call; caching only pays off if the prefix is reused at least once more within the TTL window.
Avoid When
- Prefix is shorter than the provider's minimum cacheable size (typically ~1k tokens) — caching silently no-ops.
- Each call has a unique prefix (high-entropy system prompts) — cache will never hit and you pay the write premium.
- Calls are spaced further apart than the cache TTL — every request becomes a cold write.
When To Use
- Repeated calls within minutes that share a long static prefix (chatbots, agents, batch evals); for agents that reuse a large tool list, see the sketch after this list.
- Long-context document Q&A where the document is fixed across many user questions (a message-level caching sketch appears at the end of Code Examples).
- RAG pipelines where the retrieval-augmented prefix is reused across follow-up turns.
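For the agent case above, the same mechanism covers tool definitions: Anthropic lets you attach cache_control to the last tool so the whole tool block is cached as part of the prefix. A minimal sketch in the same client style as the examples below; the get_weather tool and the $task variable are hypothetical.

// Sketch: cache a large, static tool list in an agent loop.
// cache_control on the final tool marks the entire tools block as cacheable prefix.
$response = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 1000,
    'tools' => [
        // ...many stable tool definitions...
        [
            'name' => 'get_weather', // hypothetical tool
            'description' => 'Look up the current weather for a city.',
            'input_schema' => [
                'type' => 'object',
                'properties' => ['city' => ['type' => 'string']],
                'required' => ['city']
            ],
            'cache_control' => ['type' => 'ephemeral'] // cache all tool definitions
        ]
    ],
    'messages' => [[
        'role' => 'user',
        'content' => $task // dynamic per-step input stays uncached
    ]]
]);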
Code Examples
// ❌ Long system prompt sent fresh every request — full input cost each call
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => $largeSystemPrompt, // 8000 tokens, charged in full every time
        'messages' => [[
            'role' => 'user',
            'content' => $question
        ]]
    ]);
}
// ✅ Mark the static system prompt as cached — billed at ~10% on cache hits
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system' => [
            [
                'type' => 'text',
                'text' => $largeSystemPrompt, // 8000 tokens
                'cache_control' => ['type' => 'ephemeral'] // mark cacheable
            ]
        ],
        'messages' => [[
            'role' => 'user',
            'content' => $question // dynamic — stays uncached at the end
        ]]
    ]);

    // Inspect cache hit/miss for cost monitoring
    $usage = $response->usage;
    error_log(sprintf(
        'cache_read=%d cache_write=%d input=%d output=%d',
        $usage->cache_read_input_tokens ?? 0,
        $usage->cache_creation_input_tokens ?? 0,
        $usage->input_tokens,
        $usage->output_tokens
    ));
}
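For the document Q&A case under When To Use, the cache breakpoint can also sit on a content block inside the messages, so a large fixed document is cached while each question stays uncached. A minimal sketch in the same style as above; $largeDocumentText and the surrounding loop are assumptions.

// ✅ Cache a large fixed document inside the user turn; only the question changes
foreach ($userQuestions as $question) {
    $response = $client->messages->create([
        'model' => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'messages' => [[
            'role' => 'user',
            'content' => [
                [
                    'type' => 'text',
                    'text' => $largeDocumentText, // fixed document, byte-identical every call
                    'cache_control' => ['type' => 'ephemeral'] // caches everything up to this block
                ],
                [
                    'type' => 'text',
                    'text' => $question // dynamic; must come after the cached block
                ]
            ]
        ]]
    ]);
}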