{
    "slug": "prompt_caching",
    "term": "Prompt Caching",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "API feature where a static prompt prefix (system instructions, large context) is cached server-side, dramatically reducing cost and latency on repeated calls that share the prefix.",
    "long": "Prompt caching lets you mark portions of a prompt as cacheable so the provider (Anthropic, OpenAI, Google) reuses computed key-value attention state across requests instead of reprocessing the prefix every call. Cached input tokens are billed at a steep discount (Anthropic: ~10% of standard input cost on cache reads) and skip the compute cost of the prefix, which dominates for long-context applications. Trade-offs: the first request that writes the cache costs more than a non-cached request (Anthropic: ~125% of input cost), and caches expire after a TTL (Anthropic: 5 minutes by default, 1 hour with extended TTL beta). Caching only works when the cached prefix is byte-identical across requests — a single character difference invalidates it. Best fits: chatbots with long system prompts, document Q&A where the document is fixed, agents reusing tool definitions, batch evaluation jobs hitting the same prompt repeatedly. The pattern is: put static content (system prompt, tool defs, large context) at the start of the prompt and mark it cached; put dynamic content (user message, RAG retrieval) at the end uncached.",
    "aliases": [
        "prefix caching",
        "cache control",
        "prompt cache",
        "KV cache reuse"
    ],
    "tags": [
        "optimization",
        "llm",
        "cost",
        "caching",
        "performance",
        "api",
        "ai"
    ],
    "misconception": "Prompt caching improves output quality or reduces hallucination. It does not affect output at all — it is a pure cost and latency optimization. The model behaves identically whether the prefix was cached or freshly computed.",
    "why_it_matters": "For PHP applications making many LLM calls with shared system prompts (chatbots, code review tools, batch processors), prompt caching can reduce input costs by ~90% on the cached portion and cut time-to-first-token by 50–80%. On long-context workloads it is the difference between economically viable and not.",
    "common_mistakes": [
        "Putting dynamic content before static — caching requires a consistent prefix; user input must come last.",
        "Forgetting the cache TTL — requests arriving after expiry miss the cache and pay the cache-write premium again; high-traffic endpoints benefit most.",
        "Not measuring hit rate — production code can ship with caching configured but never hitting; check the usage block in API responses.",
        "Caching tiny prefixes — providers have minimum cacheable token counts (Anthropic: 1024 tokens for Sonnet); short prefixes silently bypass the cache.",
        "Treating cache writes as free — the first call that writes the cache costs more than a non-cached call; caching only pays off if the prefix is reused multiple times within the TTL window."
    ],
    "when_to_use": [
        "Repeated calls within minutes that share a long static prefix (chatbots, agents, batch evals).",
        "Long-context document Q&A where the document is fixed across many user questions.",
        "RAG pipelines where the retrieval-augmented prefix is reused across follow-up turns."
    ],
    "avoid_when": [
        "Prefix is shorter than the provider's minimum cacheable size (typically ~1k tokens) — caching silently no-ops.",
        "Each call has a unique prefix (high-entropy system prompts) — cache will never hit and you pay the write premium.",
        "Calls are spaced longer than the cache TTL — every request becomes a cold write."
    ],
    "related": [
        "ai_cost_management",
        "ai_in_php",
        "llm_streaming",
        "large_language_models"
    ],
    "prerequisites": [
        "ai_in_php",
        "ai_cost_management"
    ],
    "refs": [
        "https://docs.claude.com/en/docs/build-with-claude/prompt-caching"
    ],
    "bad_code": "// ❌ Long system prompt sent fresh every request — full input cost each call\nforeach ($userQuestions as $question) {\n    $response = $client->messages->create([\n        'model'      => 'claude-sonnet-4-20250514',\n        'max_tokens' => 1000,\n        'system'     => $largeSystemPrompt,  // 8000 tokens, charged in full every time\n        'messages'   => [[\n            'role'    => 'user',\n            'content' => $question\n        ]]\n    ]);\n}",
    "good_code": "// ✅ Mark the static system prompt as cached — billed at ~10% on cache hits\nforeach ($userQuestions as $question) {\n    $response = $client->messages->create([\n        'model'      => 'claude-sonnet-4-20250514',\n        'max_tokens' => 1000,\n        'system'     => [\n            [\n                'type'          => 'text',\n                'text'          => $largeSystemPrompt,      // 8000 tokens\n                'cache_control' => ['type' => 'ephemeral']  // mark cacheable\n            ]\n        ],\n        'messages'   => [[\n            'role'    => 'user',\n            'content' => $question  // dynamic — stays uncached at the end\n        ]]\n    ]);\n\n    // Inspect cache hit/miss for cost monitoring\n    $usage = $response->usage;\n    error_log(sprintf(\n        'cache_read=%d cache_write=%d input=%d output=%d',\n        $usage->cache_read_input_tokens ?? 0,\n        $usage->cache_creation_input_tokens ?? 0,\n        $usage->input_tokens,\n        $usage->output_tokens\n    ));\n}",
    "example_note": "First iteration writes the cache (small premium); subsequent iterations within 5 minutes hit the cache and pay ~10% of input cost on the prefix.",
    "quick_fix": "Move static content (system prompt, tool defs, fixed context) to the start of the prompt and add cache_control: { type: 'ephemeral' } to that block. Verify hit rate via response.usage.cache_read_input_tokens.",
    "severity": "medium",
    "effort": "low",
    "created": "2026-04-28",
    "updated": "2026-04-28",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/prompt_caching",
        "html_url": "https://codeclaritylab.com/glossary/prompt_caching",
        "json_url": "https://codeclaritylab.com/glossary/prompt_caching.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Prompt Caching](https://codeclaritylab.com/glossary/prompt_caching) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/prompt_caching"
            }
        }
    }
}