
The Hidden Costs of LLM APIs: What Token Price Tables Don't Show

Inferbase Team · 10 min read

Every LLM provider publishes a pricing page with dollars per million tokens as the list rate. Most teams build their cost projections on those numbers and discover, usually one or two months into production, that the actual bill is 40-60% higher than the model predicted. The gap is not caused by a single miscalculation. It is the cumulative effect of four factors that provider pricing pages rarely emphasize, and that most internal forecasting spreadsheets omit entirely.

This post is not a pricing comparison. It is a framework for reasoning about LLM API costs in a way that survives contact with production workloads. We use illustrative numbers from publicly documented provider pricing as of April 2026, but the principles apply regardless of which vendor or model you choose. When we publish a full pricing comparison in a later post, it will be structured around these four factors rather than a flat $/M token ranking.

Why list prices are incomplete

The $/M token figure on a provider's landing page describes the cost of a single hypothetical request in which every token is freshly computed, the request succeeds on the first attempt, the model produces exactly the output length you expected, and you are paying the standard public tier. In practice, no production workload operates under all four of those assumptions simultaneously.

A more useful mental model is that the list rate establishes the floor, and four multipliers determine where your actual cost lands above that floor. The multipliers are not independent: a design decision that reduces one can amplify another. Reasoning about them separately makes it possible to optimize the largest contributors first rather than chasing savings in the wrong dimension.

Factor 1: Cached input tokens

Most production workloads send the same system prompt, tool definitions, or document context hundreds or thousands of times per day. Providers that implement prompt caching allow these repeated tokens to be stored and replayed at a dramatically reduced rate, and the size of that reduction varies substantially across vendors.

| Provider | Cached input discount | Notes |
| --- | --- | --- |
| Anthropic (Claude family) | Up to 90% off cached tokens | Explicit cache_control breakpoints; 5-minute default TTL |
| OpenAI (GPT-5 family) | 50% off cached tokens | Automatic for prompts ≥1024 tokens; 5-10 minute TTL |
| Google (Gemini family) | Automatic context caching | Pricing varies by model and cache duration |

The difference between a 50% and a 90% discount on the cached portion of a prompt becomes substantial at scale. An application with a 4,000-token system prompt sending 100,000 requests per day pays for roughly 400 million input tokens per day on that prompt alone, or about 12 billion per month. At Anthropic's 90% discount, that portion costs one-tenth of the uncached rate. At OpenAI's 50% discount, it costs one-half. The list input-token price of the two providers might be similar, but the effective rate for a cache-heavy application can differ by a factor of five.
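The arithmetic above can be parameterized as a quick sketch. The $3/M list input rate and the 30-day month are illustrative assumptions, not any provider's actual price:

```python
def cached_prompt_cost(prompt_tokens, requests_per_day, list_rate_per_m,
                       cache_discount, days=30):
    """Monthly cost of a repeated (cacheable) prompt portion, assuming
    every request hits the cache at the discounted rate."""
    total_tokens = prompt_tokens * requests_per_day * days
    discounted_rate = list_rate_per_m * (1 - cache_discount)
    return total_tokens / 1e6 * discounted_rate

# Illustrative: 4,000-token prompt, 100k requests/day, $3/M list input rate
uncached = cached_prompt_cost(4000, 100_000, 3.0, 0.0)    # no caching
deep_discount = cached_prompt_cost(4000, 100_000, 3.0, 0.90)  # ~1/10 of uncached
half_discount = cached_prompt_cost(4000, 100_000, 3.0, 0.50)  # 1/2 of uncached
```

The half-discount case costs five times the deep-discount case on the same token volume, which is the factor-of-five gap described above.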

The practical implication is that cache-friendly prompt design is not a nice-to-have optimization. Structuring prompts so that the stable portion (system instructions, tool definitions, reference documents) appears before the variable portion (the current user query) is often the single highest-leverage change a team can make to LLM spending. Our guide on LLM cost optimization strategies covers prompt structuring in detail.
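As a minimal sketch of that ordering, here is a request layout with the stable portion first and the variable portion last. The cache_control breakpoints follow Anthropic's documented format as one concrete example; other providers cache automatically based on prefix matching, but the same ordering principle applies:

```python
def build_request(system_prompt, tool_docs, user_query):
    """Cache-friendly layout: stable blocks (cached) before the
    variable user query (freshly computed every request)."""
    return {
        "system": [
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},  # stable: cacheable
            {"type": "text", "text": tool_docs,
             "cache_control": {"type": "ephemeral"}},  # stable: cacheable
        ],
        "messages": [
            {"role": "user", "content": user_query},   # variable: not cached
        ],
    }
```

Putting any per-request content (timestamps, user IDs) ahead of the stable blocks breaks the prefix match and forfeits the discount.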

Factor 2: Output-to-input token ratios

LLM pricing pages typically list two rates, one for input tokens and one for output tokens, with output priced at 3-5x the input rate. This asymmetry is not a rounding concern. For a meaningful fraction of production workloads, output tokens dominate the bill, and the dominance has grown sharper with the proliferation of reasoning-heavy models.

A traditional chat application might produce output tokens at roughly 10-30% of input volume. A reasoning-heavy workload using extended thinking modes can produce output tokens at 200-500% of input volume, because the model is allocating significant compute to internal chain-of-thought before emitting the visible response. On a model like Claude Opus 4.7, priced at $5 per million input tokens and $25 per million output tokens, a shift in output ratio from 20% to 200% changes the effective cost of a request by roughly 5.5x ($5 + 0.2 × $25 = $10 versus $5 + 2 × $25 = $55 per million input tokens) without any change to the input prompt.
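The effect is easy to check directly; this helper just restates the arithmetic using the rates quoted above:

```python
def cost_per_m_input(input_rate, output_rate, output_ratio):
    """Effective cost per million input tokens when the model emits
    output_ratio times as many output tokens as input tokens."""
    return input_rate + output_ratio * output_rate

chat = cost_per_m_input(5.0, 25.0, 0.20)       # $10 per M input tokens
reasoning = cost_per_m_input(5.0, 25.0, 2.00)  # $55 per M input tokens
```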

Teams that design their cost forecasts around input volume and then enable reasoning modes in production without revisiting the projection are the most common case of post-launch bill shock. The framing that works better in practice is to forecast output tokens directly based on the task type, treat input tokens as the supporting cost rather than the primary one for reasoning workloads, and verify the assumption empirically before committing a workload to production.

A practical rule of thumb: any workload that uses extended thinking, agentic tool use, or structured-output generation should be modeled with output tokens as the primary cost driver. The guide on choosing the right model includes a section on matching task types to cost profiles.

Factor 3: Rate-limit tiers and throughput tax

Provider pricing pages quote a per-token rate, but the rate at which you are allowed to consume tokens is a separate dimension that often carries its own cost. On most major providers, the default free or developer tier has strict limits on tokens-per-minute (TPM) and requests-per-minute (RPM) that are insufficient for any meaningful production workload. Unlocking higher throughput requires either moving to a paid tier with per-dollar spending commitments or qualifying for an enterprise arrangement.

The tax this imposes is twofold. The first is the direct cost of the tier itself: higher throughput tiers frequently have minimum spending commitments that a small or pre-production workload cannot justify, effectively pricing the workload out of using the better model. The second, less obvious cost is operational: workloads that run close to their rate-limit ceiling absorb latency variance (slower responses as requests queue behind limits) and failure modes (explicit 429 rate-limit errors that must be retried with backoff) that degrade user experience in ways that are not captured in the $/M token figure.

For comparison purposes across providers, the dimension to evaluate is the TPM available at your planned spending level, not just the list rate. A provider that is 10% cheaper per token but requires 3x the spending commitment to reach comparable throughput is only cheaper in the abstract. In production, the team either pays the higher commitment or accepts throughput ceilings that shape user experience and system architecture in costly ways.
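One way to make the throughput dimension concrete is to compute the wall-clock time a workload needs under a given TPM cap. Both tier limits below are hypothetical, not any provider's published numbers:

```python
def hours_to_process(total_tokens, tpm_limit):
    """Wall-clock hours to push a token volume through a
    tokens-per-minute ceiling, assuming the cap is fully utilized."""
    return total_tokens / tpm_limit / 60

daily_tokens = 2_000_000_000  # illustrative 2B-token/day workload

cheap_tier = hours_to_process(daily_tokens, 450_000)      # ~74 hours: cannot keep up
committed_tier = hours_to_process(daily_tokens, 2_000_000)  # ~17 hours: fits in a day
```

A tier whose processing time exceeds 24 hours for a daily workload is not merely slow; it is a hard architectural ceiling, regardless of how attractive the per-token rate looks.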

Factor 4: Reliability tax

Every major LLM provider publishes uptime numbers that sound favorable in marketing (three-nines, four-nines) but that map, in practice, to non-zero error rates in every production workload we have measured. Typical error rates from direct provider APIs range from roughly 0.5% to 3% depending on the provider, the region, the model, and whether the request falls within a capacity-constrained window (evenings in US time zones are particularly notable).

The reliability tax is the cost of the retries required to maintain a target end-to-end success rate. If your user-facing service must succeed 99.9% of the time, and the upstream provider succeeds 98% of the time, you need a retry policy that absorbs the remaining two percentage points of failure. Each retry is a second billable request for the same logical operation, and for workloads with large input contexts, the retry cost can be significant.

The compounding effect is what makes this tax easy to underestimate. A workload with a 2% first-attempt failure rate and a single retry has an effective cost multiplier of 1.02x, which sounds negligible. But the same workload routed through a second retry, or operating during a provider incident where the first-attempt failure rate climbs to 15-20%, can see effective cost multipliers of 1.3-1.5x for hours at a time. Teams that do not model this often end up overspending their projected budget during their first week in production simply because they did not account for retry amplification.
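Under the simplifying assumption that attempts fail independently and every failure is retried, the expected number of billable attempts per logical request is a short geometric series:

```python
def cost_multiplier(p_fail, max_retries):
    """Expected billable attempts per logical request: attempt i is
    made only if all i prior attempts failed, so the expectation is
    1 + p + p^2 + ... up to max_retries extra attempts."""
    return sum(p_fail ** i for i in range(max_retries + 1))

normal = cost_multiplier(0.02, 1)    # 1.02x: the negligible-looking case
incident = cost_multiplier(0.20, 3)  # ~1.25x during a degraded window
```

With more aggressive retry policies or correlated failures (a provider incident fails every attempt, not independent ones), the realized multiplier climbs above this idealized estimate, which is consistent with the 1.3-1.5x range observed above.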

Multi-provider routing with fallback, which has historically been associated with latency or availability concerns, increasingly turns out to be a cost optimization as well. When a primary provider is degraded, routing the retry to a healthy secondary provider rather than hammering the primary avoids both the latency cost and the duplicated billing.
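A minimal sketch of that routing pattern, with the provider callables and error classification left abstract (a real implementation would catch 429/5xx responses specifically and track per-provider health rather than rotating blindly):

```python
import random
import time

def call_with_fallback(providers, request, max_attempts=3, base_delay=0.5):
    """Try the primary provider first; on failure, route the retry to
    the next provider in the list instead of hammering a degraded one.
    `providers` is a list of callables that raise on error (hypothetical)."""
    last_err = None
    for attempt in range(max_attempts):
        provider = providers[attempt % len(providers)]
        try:
            return provider(request)
        except Exception as err:  # in practice: catch rate-limit/server errors only
            last_err = err
            # jittered exponential backoff before the next attempt
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise last_err
```

The key cost property is that a retry against a healthy secondary usually succeeds on the first try, so the effective multiplier stays near the independent-failure estimate even during a primary-provider incident.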

Two underused discounts most teams miss

The four factors above describe where costs hide. Before concluding, it is worth noting two substantial discounts that providers publish openly but that most teams do not use.

Batch API. Most major providers (OpenAI, Anthropic, Google) offer a batch processing tier at approximately 50% off list price, in exchange for up to 24-hour completion latency. Any workload that can tolerate that delay (overnight enrichment jobs, scheduled report generation, offline evaluation runs, fine-tuning data preparation) is paying twice the necessary rate if it runs through the standard synchronous API. The batch API is the single largest published discount in the LLM ecosystem, and the adoption rate remains surprisingly low.

Volume commitments. Enterprise tiers at most providers offer 20-40% discounts in exchange for annual spending commitments, often with additional throughput guarantees that reduce the rate-limit tax discussed above. For teams with predictable monthly spend above a certain threshold, the combination of batch processing and volume commitment can bring the effective rate 60-70% below the published list price.
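As a quick sanity check on how the two discounts stack, assuming they compound multiplicatively (contract structures vary by provider, so treat this as illustrative):

```python
def effective_rate(list_rate, batch_discount=0.0, volume_discount=0.0):
    """Effective $/M rate after stacking a batch discount and a volume
    commitment discount, assumed to compound multiplicatively."""
    return list_rate * (1 - batch_discount) * (1 - volume_discount)

# 50% batch tier plus a 30% volume commitment on a $10/M list rate
rate = effective_rate(10.0, batch_discount=0.50, volume_discount=0.30)
# -> $3.50/M, i.e. 65% below list, inside the 60-70% range cited above
```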

A decision framework

Different workloads are sensitive to different cost dimensions, and the optimization sequence that works for one workload can waste effort on another. A short framework for deciding where to focus:

| Workload type | Dominant cost factor | First optimization |
| --- | --- | --- |
| Chatbot / customer support | Cached input tokens | Cache-friendly prompt design; provider choice by cache discount depth |
| Reasoning / agents | Output-token ratio | Model selection; adaptive reasoning tiers over fixed-max |
| High-volume extraction | Rate-limit throughput | Tier planning; batch API where latency permits |
| Real-time / low-latency | Reliability tax | Multi-provider routing with fallback |
| Overnight enrichment | Missed batch discount | Move to batch API (single largest lever) |

The common thread across workload types is that the effective cost of an LLM API is determined by the interaction between workload characteristics and provider pricing mechanics, not by the list rate alone. Two teams using the same model at the same published rate can see monthly bills that differ by 3x or more based on how well their workload design aligns with the provider's discount structure.

What this framework sets up

Inferbase is an end-to-end AI inference platform. The four factors above are the same factors we evaluate when deciding which providers to support, how our routing layer makes fallback decisions, and which models to surface for which workload types. A pricing comparison across providers only becomes useful once these dimensions are factored in, which is why we have deferred publishing one until it can be structured around them.

If you are evaluating LLM providers for a production workload, the practical step beyond reading a comparison post is running the actual models against your actual prompts. Our playground lets you compare responses side-by-side across models without a commitment, and the unified inference API handles the multi-provider routing and fallback logic that keeps the reliability tax in check once you move to production.

Token price is the start of the cost conversation, not the end. Teams that build their forecasting around the four factors above, rather than around the $/M token list rate, consistently arrive at budgets that survive contact with production.

