Skip to main content

Prompt Caching Explained: What Cuts Your Bill and What Breaks It

Inferbase Team14 min read

Long prompts have quietly become the norm. A production assistant carries a few thousand tokens of system instructions before the user ever types a word. A retrieval pipeline prepends pages of documents. An agent loop resends its entire tool catalog and working context on every step of a multi-step task. In all three cases the same large block of text travels to the model again and again, and on a standard pricing plan you pay full freight for it every single time.

Prompt caching is the mechanism that stops that repeated payment. It stores the model's intermediate computation for a stable prefix and replays it on later requests, so the repeated portion is billed at a fraction of the normal rate and skips most of its processing time. The savings are large, the read discount on current flagship models sits near 90 percent, but they are conditional. Caching matches on an exact prefix, which means a single stray token can quietly throw the whole discount away. This guide covers how prompt caching works, how the major providers differ, and the cache hygiene rules that decide whether you actually get the discount you think you are getting.

How prompt caching works

When a model processes a prompt, it first runs a prefill pass that reads every input token and builds a key-value (KV) cache, the internal representation the model attends to while generating output. Prefill is the expensive part of reading a long prompt, and its cost scales with the number of input tokens. Prompt caching exploits a simple observation: if two requests share an identical prefix, the KV cache for that prefix is also identical, so it can be computed once and reused.

The reused unit is a prefix, not an arbitrary substring. Providers hash the prompt from the very first token and look for the longest leading run they have already computed. Everything up to the first point of difference can be served from cache; everything after that point has to be computed fresh. This is why the position of your variable content matters so much, a topic the cache hygiene section returns to in detail. The same prefix-matching idea drives self-hosted caching too, where it usually goes by the name prefix caching.

Because the cached KV state is numerically identical to a freshly computed one, caching does not change model outputs. It only changes what you pay and how fast the response starts. If you want the deeper background on why the KV cache dominates long-context serving, our analysis of why GPU memory calculators get the KV cache wrong covers the memory mechanics, and our explainer on how context windows actually work covers why prompts grow large in the first place.

Two scenarios for a four-part prompt made of system prompt, tools, documents, and user query. In the first, the stable system, tools, and documents lead the prompt and are served from cache at about 90 percent off on the second request, while only the changed user query is recomputed at full price. In the second, a timestamp placed at the top of the prompt changes on every request, so the first point of difference is the very first token and the entire prompt, including the otherwise stable prefix, is recomputed at full price every time. The rule behind both cases is that caching matches on an exact prefix from the first token, so stable content belongs at the front and variable content at the end.

What gets cached, and what it costs

Three numbers describe any caching scheme: the discount on a cache read, the surcharge (if any) on a cache write, and the cost of keeping the entry alive. The read discount is the headline. Across the current flagship families it has converged near 90 percent, so cached input tokens cost roughly one-tenth of standard input tokens. The older figures of 50 and 75 percent that still circulate are now mostly legacy-model artifacts rather than the rate you will pay on a recent model.

The write side is where providers diverge. Most charge nothing extra to populate the cache, you simply pay the normal input rate on the first request and the discounted rate afterward. Anthropic is the notable exception: writing to its cache costs more than a normal input token, 25 percent more for the short retention window and 100 percent more for the extended one, which it then recovers through the 90 percent read discount on subsequent hits. That tradeoff only pays off if the cached prefix is read enough times to amortize the write premium, which makes write cost a real design input for low-reuse workloads.

Storage is the third dimension and it is usually free. The exception is an explicitly created cache, where some providers charge rent for the time the entry is held in memory. That converts caching from a pure discount into a small bet: you pay to hold the prefix, and you come out ahead only if it is read enough before it expires.

Prompt caching across providers

The table below summarizes the caching behavior of four major API providers and the dominant self-hosted engine. Treat it as a snapshot. Cache pricing, retention windows, and minimum lengths have all changed more than once over the past year, so verify any number against the linked provider documentation before you build a budget on it.

ProviderHow to enableCache-read discountCache-write costMinimum cacheable prefixRetention windowStorage cost
Anthropic ClaudeOpt-in via cache_control markers~90% (0.1x input)+25% (5-min) or +100% (1-hour)1,024 tokens (512 to 4,096, model-dependent)5 min default, 1 hour optionalNone
OpenAIAutomatic~90% on GPT-5 family (50% on legacy GPT-4o)None1,024 tokens, then 128-token steps5 to 10 min in-memory, up to 24 hours extendedNone
Google GeminiImplicit (default on recent models) or explicit (opt-in)~90%None2,048 to 4,096 tokens, model-dependentImplicit best-effort; explicit 1 hour defaultExplicit only: ~$1 to $4.50 per 1M tokens per hour
DeepSeekAutomatic, on-diskCache hit priced ~98% below a missNone~64-token unitHours to days, usage-based evictionNone
vLLM (self-hosted)Automatic prefix caching (default on)No per-token cost; skips prefill computeNone16-token blockHeld until evicted (LRU)GPU memory

Sources: Anthropic prompt caching, OpenAI prompt caching, Gemini context caching, DeepSeek KV cache, and vLLM automatic prefix caching. Figures as of June 2026.

A few patterns are worth reading out of the table. The first is the split between automatic and opt-in caching. OpenAI, DeepSeek, and self-hosted vLLM cache without any code change, and Gemini caches implicitly by default on recent models. Anthropic is the one provider that does nothing until you mark a block with cache_control, which is more work but also gives you explicit control over what is cached and at which retention. Automatic caching is convenient, but it does not absolve you of structuring the prompt well, because the cache still only matches on an identical leading prefix.

The second pattern is that "cached input tokens" can mean two slightly different billing models. Most providers expose caching as a discount applied to the cached fraction of an otherwise normal request. DeepSeek instead reports a cache-hit token count and a cache-miss token count separately, priced very differently, with a hit costing a small fraction of a miss. The effect is the same, you pay far less for repeated prefixes, but the accounting on your invoice looks different, which matters when you build cost dashboards.

The third pattern is that the explicit, opt-in caches come with a storage meter. Gemini's explicit cache charges per token per hour for the time the prefix is held, while its implicit cache and everyone else's automatic cache do not. That rent is small in absolute terms, but it inverts the usual logic: with an explicit cache you are paying to keep a prefix warm whether or not it is read, so it only makes sense for prefixes you are confident will be reused soon.

A worked example: where the savings actually land

Consider an agentic coding workflow with a stable prefix of roughly 8,000 tokens, a system prompt, a set of tool definitions, and a handful of reference files, that resends that prefix on every step of a 12-step task. Assume a model priced at $3 per million input tokens, and a representative 90 percent cache-read discount.

Without caching, the prefix alone costs 12 steps times 8,000 tokens, or 96,000 input tokens per task, billed in full. With caching, the first step writes the 8,000-token prefix once and the remaining 11 steps read it at one-tenth the rate. The effective input is roughly 8,000 tokens for the initial write plus 11 reads of about 800 effective tokens each, around 16,800 to 18,800 effective tokens depending on whether the provider charges a write premium. That is close to an 80 percent reduction on the repeated portion of the prompt, or about $0.29 down to roughly $0.06 per task at the assumed rate.

The number that turns a rounding error into a budget line is volume. The same 80 percent reduction across a few thousand agent tasks per day compounds into the largest single lever most teams have on input cost, second only to moving eligible work to a batch tier. The catch is that the worked example assumes every step after the first is a clean cache hit. In practice that assumption is exactly what breaks, which is the subject of the next section. For the broader set of cost levers beyond caching, our LLM cost optimization guide and our breakdown of the hidden costs of LLM APIs cover output ratios, rate-limit tiers, and batch processing.

Cache hygiene: why your prompt cache misses

A cache that never hits is worse than no cache, because you have done the structural work and still pay full price. Most cache misses trace back to a small set of avoidable mistakes. The governing rule behind nearly all of them is that caching matches on an exact prefix: providers state plainly that a cache hit requires identical prompt segments, so any difference invalidates the cache from the point of difference onward.

Anything dynamic in the prefix breaks every cache after it

This is the single most common failure. A timestamp, a request ID, a UUID, a per-user greeting, or a "today's date is" line placed inside the system prompt changes on every request, so the cache misses on every request even though 99 percent of the prompt is stable. Anthropic documents this failure by name, and it follows directly from the exact-prefix rule on every provider. Keep volatile values out of the cached region entirely, or push them to the very end of the prompt where they cannot poison the prefix ahead of them.

Static content must come first, dynamic content last

Because only the leading run up to the first difference can be reused, the ordering of your prompt is a cost decision, not just a style one. Put the stable material, system instructions, tool definitions, reference documents, at the top, and append the variable material, the current user query and any per-request context, at the bottom. OpenAI, Anthropic, and Google all give the same guidance for the same reason. A prompt that opens with the user's message and ends with a fixed instruction block has effectively disabled its own cache.

Tool and function definitions are part of the prefix

Tool definitions usually sit near the front of the prompt, which means editing or reordering them invalidates everything that follows. Anthropic places the cache prefix in a fixed order, tools first, then system, then messages, and notes that modifying the tool definitions invalidates the entire cache. If you generate your tool list dynamically, make sure it serializes deterministically, including the order of the tools and their fields, or you will rewrite the cache on every call without realizing it.

Non-deterministic serialization is a silent cache killer

Even with the right ordering, a prefix built from a dictionary or set can serialize in a different key order from one process to the next. The bytes differ, so the cache misses, even though the logical content is identical. This one is rarely called out in provider documentation because it is a consequence of the exact-match rule rather than a separate feature, but it is a real and frustrating cause of misses in production. Serialize cached structures with a stable key order and avoid embedding anything whose textual form is not reproducible.

A prefix below the minimum length never caches

Every API provider sets a minimum cacheable prefix, ranging from around 1,000 tokens to several thousand depending on the model. A prompt shorter than that threshold is silently never cached, with no error to tell you so. If your prompts hover near the boundary, confirm the threshold for your specific model rather than assuming a single global number, because the minimums vary by model even within one provider.

Crossing a model, region, or tenant boundary drops the cache

Caches are scoped. Switching to a different model, a different region, or a different privacy partition means there is no shared prefix to hit, so the first request to the new target pays full price again. This has a direct consequence for routing, covered below: a router that moves a cache-warm conversation to a different backend has thrown away the discount, even if the new backend is nominally cheaper per token.

A practical workflow is to log the cache-hit token counts that providers return and watch the hit rate, rather than assuming the cache is working. Some providers now also expose a diagnostic field that reports why a given request missed, which turns cache debugging from guesswork into a lookup.

Self-hosted caching with vLLM

If you serve your own models, caching is not a billing feature but a memory feature, and it is largely automatic. vLLM implements automatic prefix caching by hashing the KV cache at the block level, where a block is a small fixed run of tokens, 16 by default. Each block's hash chains from the blocks before it, so a new request can reuse any leading blocks whose hashes match and recompute only the divergent tail. On recent versions this is enabled by default.

There is no per-token charge for a self-hosted cache hit, but it is not free either. Cached KV blocks occupy GPU memory that would otherwise serve active requests, and vLLM evicts the least recently used blocks when it needs the space back. The design treats this as close to a free lunch, since reused blocks would have cost memory to compute anyway, but on a memory-constrained deployment the cache competes with batch capacity. The same block-boundary logic from the API providers applies: a hit requires an identical prefix aligned to block boundaries, so the cache hygiene rules above carry over almost unchanged to self-hosting.

Caching and routing: keeping the discount

Prompt caching and model routing interact in a way that is easy to miss. A cache lives on one specific backend, so the value of a warm prefix is tied to where the next request lands. Route a cache-warm conversation to a different provider to shave a few percent off the per-token rate, and you forfeit a 90 percent discount on the prefix to chase a much smaller one. The cheaper backend can easily produce the more expensive request.

The implication is that a routing layer has to be cache-aware, not just price-aware. Keeping a multi-turn conversation or an agent loop pinned to the backend that already holds its prefix is often worth more than any cross-provider price difference, and the decision should be visible rather than buried. This is one of the reasons our routing layer treats cache locality as a first-class signal and records why each request went where it did. For workloads that have not moved off direct provider APIs yet, our serverless inference and unified API handle the multi-provider plumbing while preserving cache affinity, and you can compare model behavior on your own prompts in the playground before committing.

Prompt caching is the rare optimization that improves cost and latency at the same time, with no effect on output quality. The discount is sitting there for almost every workload with a stable prefix. Whether you collect it comes down to two habits: structure the prompt so the stable part leads and never changes, and route requests so a warm cache stays warm.

Frequently asked questions

prompt cachingLLM costinferencecachinginfrastructure

Have thoughts on this article?

We would love to hear your feedback, questions, or experience with these topics. Reach out on social media or drop us a message.

Related Articles

Stay up to date

Get notified when we publish new articles on AI model selection, cost optimization, and infrastructure planning.

Start building with the right model.

Automatically route workloads to the right model for every task, every time.