Prompt Caching Explained: Anthropic, OpenAI, and the Math
Prompt caching is the single highest-leverage cost lever for most production LLM workloads in 2026. The idea is simple — reuse the prefill compute of a previously seen prompt prefix instead of recomputing it. The implementations are different across providers, and the math of when it pays off is worth understanding before you turn it on.
What is actually being cached
When an LLM processes a prompt, it goes through two phases:
Prefill is dominated by the prompt length. For a 50,000-token prompt + 200 tokens of output, prefill is the overwhelming majority of the cost. Prompt caching reuses the KV cache for a prompt prefix you have already processed, skipping the prefill for that prefix entirely.
Anthropic's prompt caching
Anthropic's implementation is explicit and tunable. You mark a cache_control breakpoint in your prompt; everything before it can be cached.
Pricing as documented (Anthropic prompt caching docs):
Breakeven math: with the 5-minute cache, one write at 1.25× plus one read at 0.10× costs 1.35× the price of just paying input twice (2.0×) — so breakeven is at the second cache hit. With the 1-hour cache, write is 2× and reads are 0.10× — so breakeven is at the third cache hit.
Up to 4 cache breakpoints per request are supported. Caches are scoped per workspace as of February 2026 (formerly per organization). (Anthropic docs)
OpenAI's prompt caching
OpenAI's implementation is automatic. You do not mark anything; the platform detects shared prefixes (≥1024 tokens) across recent requests and serves cache hits when available.
Pricing as reported (DigitalOcean blog, OpenRouter):
Cache TTL is reportedly several minutes (often 5–10 with extensions on use). [Unverified — exact policy can shift]
The behavioral difference: OpenAI is opt-out, Anthropic is opt-in. Anthropic gives you more control and a steeper discount on hits; OpenAI gives you a smaller discount with no work to claim it.
Designing prompts for caching
The win pattern: put the stable prefix first, the variable suffix last.
A canonical layout:
[1] System prompt + role definition ← cached
[2] Tool definitions ← cached
[3] Few-shot examples ← cached
[4] Knowledge base context (if static) ← cached
[--- cache breakpoint ---]
[5] User's variable input ← not cachedIf your conversation has a long history, append the new turn at the end and put the cache breakpoint between the stable history and the new turn. Each additional turn extends the cached prefix — the cache grows with the conversation.
When caching does not help (or hurts)
A quick decision check
Caching pays off when:
Practical setting: a customer-support agent with a 10K-token system prompt + tools + few-shot, hit 100 times per minute, will see 80–95% cost reduction on the cached portion of input. A one-off batch job that calls the same system prompt 5 times across an hour will not.
Cache hits in observability
If you log per request, capture:
cache_creation_input_tokens (Anthropic) / inferred from response (OpenAI)cache_read_input_tokensinput_tokens (uncached)output_tokensA simple cache-effectiveness metric: cache_read_tokens / (cache_read_tokens + input_tokens). Healthy values for a stable agent are 0.7–0.95. Below 0.3, your cache is mostly missing — investigate prefix stability.
Bottom line
Prompt caching is the closest thing to free money in 2026 LLM operations. On Anthropic, mark a breakpoint after your stable prefix; on OpenAI, structure prompts so stable content is at the front. With both, watch your cache hit rate as a first-class metric. Workloads that get this right routinely shave 50–90% off input cost; workloads that ignore it leave that money on the table.


