
The Real Cost of Inference at Enterprise Scale: A 2026 Pricing Audit

Inferbase Team · 19 min read

A team running an agent in production on Claude Sonnet 4.6 with a 30,000-token system prompt sends 100,000 requests per month. The published rate of $3 per million input tokens predicts a monthly bill of $9,000. The actual bill, in our experience and in conversations with several enterprise teams over the last quarter, lands closer to $14,000, and sometimes $18,000. The gap is not measurement error. It reflects the cumulative effect of rate-limit tier surcharges, regional data-residency uplifts, retry amplification, tool-use side-charges, and the spread between cached and uncached input pricing, very little of which any public pricing comparison surfaces in a single place.

In April we published a framework for thinking about LLM API costs, and explicitly deferred the cross-provider comparison until it could be structured around the four factors that actually drive enterprise spend: caching mechanics, output-to-input ratios, rate-limit tier surcharges, and reliability tax. This post is that deferred comparison. It uses pricing data captured directly from primary-source provider documentation in the first week of May 2026, and applies the framework to real numbers across the three layers a typical enterprise stack actually touches: frontier proprietary APIs, open-source models on serverless hosts, and self-hosted GPU infrastructure.

The argument, stated up front, is that the headline $/M token rate has become the least informative part of an enterprise inference budget. Most of the variation that affects a real bill now lives in the surrounding mechanics: caching architecture, rate-limit tier ladders, region multipliers, context-tier surcharges, and tool-use side-charges. The rate card is still where the conversation starts, but it is no longer where it usefully ends.

Frontier proprietary pricing as of May 2026

The four major frontier vendors (OpenAI, Anthropic, Google, DeepSeek) have converged on a similar pricing structure. Each publishes a base input rate, a cache-read rate at roughly a tenth of the input rate (2% at DeepSeek), an output rate at 4-6x input (Gemini 2.5 Pro runs higher, DeepSeek lower), and a batch tier at 50% off for asynchronous workloads. Within that structure, the absolute numbers continue to shift quarterly, and the May 2026 snapshot below covers the frontier lineup at each vendor across the nano, small, and flagship tiers.

| Vendor | Model | Input $/M | Cache hit $/M | Output $/M | Batch input $/M | Batch output $/M |
|---|---|---|---|---|---|---|
| OpenAI | gpt-5.4 | $2.50 | $0.25 | $15.00 | $1.25 | $7.50 |
| OpenAI | gpt-5.4-mini | $0.75 | $0.075 | $4.50 | $0.375 | $2.25 |
| OpenAI | gpt-5.4-nano | $0.20 | $0.02 | $1.25 | $0.10 | $0.625 |
| Anthropic | Claude Opus 4.7 | $5.00 | $0.50 | $25.00 | $2.50 | $12.50 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | $1.50 | $7.50 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $0.10 | $5.00 | $0.50 | $2.50 |
| Google | Gemini 3.1 Pro Preview | $2.00 | $0.20 | $12.00 | $1.00 | $6.00 |
| Google | Gemini 2.5 Pro | $1.25 | $0.125 | $10.00 | $0.625 | $5.00 |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.01 | $0.40 | $0.05 | $0.20 |
| DeepSeek | v4-flash | $0.14 | $0.0028 | $0.28 | n/a | n/a |
| DeepSeek | v4-pro (promotional) | $0.435 | $0.003625 | $0.87 | n/a | n/a |

Three observations stand out from the table. First, the spread between the cheapest viable frontier model (Gemini 2.5 Flash-Lite at $0.10 input) and the most capable (Claude Opus 4.7 at $5 input; GPT-5.5 Pro, not shown in the table, at $30) now runs to 300x, larger than in any prior generation, and the smaller models have closed enough of the quality gap that a meaningful share of production workloads no longer requires the flagships at all.

Second, DeepSeek v4-flash at $0.14 input, $0.28 output is a structural break in pricing: a frontier-class model priced at OpenAI nano levels, with cache hits at roughly three-tenths of a cent per million. The promotional v4-pro pricing window through May 31 brings full reasoning capability to under $0.50/M input, which is the lowest published rate for a frontier-class reasoning model in the current ecosystem.

Third, Anthropic's caching architecture is strictly more granular than OpenAI's or Google's, with separate prices for 5-minute and 1-hour cache writes and an explicit cache-control mechanism. The result is a different optimization surface than the automatic caching the others provide, which becomes important once a workload is routed by cost.
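
The difference is easiest to see in code. Below is a minimal sketch of explicit cache control using the `anthropic` Python SDK, with the model id a hypothetical stand-in for Sonnet 4.6 and the long prefix elided; the 1-hour write tier mentioned above is priced separately, and its exact syntax should be checked against current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "<stable ~30K-token prefix shared by every request>"

response = client.messages.create(
    model="claude-sonnet-4-6",  # hypothetical id for the Sonnet 4.6 above
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the prefix cacheable: later requests that share this
            # exact prefix are billed at the cache-read rate, not full input.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed in the Q2 contract?"}],
)

# usage.cache_read_input_tokens vs. usage.input_tokens shows how much of
# the request actually hit the cache.
print(response.usage)
```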

The pricing table also obscures Anthropic's tokenizer change in Opus 4.7, which uses up to 35% more tokens for the same input text compared to earlier Claude models. The headline input rate held at $5/M from Opus 4.5 through 4.7, but the effective cost per unit of work increased commensurately with the token count, an adjustment that does not appear on any pricing page and that several teams have surfaced in private conversations as the reason their budgets overshot in late April.

The same model, nine times the price

The proprietary table compares different models. The more revealing comparison is the same model across hosting providers, which strips out capability differences and isolates infrastructure economics. Llama 3.3 70B is the canonical case, served by roughly twenty providers on hardware ranging from H100 SXM clusters to Groq LPU racks to AMD MI300X to bespoke FP8 quantization stacks.

| Provider | Input $/M | Output $/M | Throughput (tok/s) | p50 latency (s) |
|---|---|---|---|---|
| DeepInfra (Turbo, FP8) | $0.10 | $0.32 | n/a | n/a |
| Nebius Base | $0.13 | $0.40 | n/a | n/a |
| Novita | $0.14 | n/a | n/a | n/a |
| Lightning AI | n/a | $0.30 | n/a | n/a |
| Groq | $0.59 | $0.79 | 302.7 | 0.83 |
| Replicate | $0.65 | $2.75 | n/a | n/a |
| Together AI | ~$0.88 | ~$0.88 | n/a | n/a |
| Fireworks | $0.90 | $0.90 | n/a | 0.82 |
| Google Vertex | not published | not published | 148.3 | 0.66 |
| Amazon Bedrock | not published | not published | 141.7 | n/a |
| SambaNova | not published | not published | 293.7 | n/a |

Sources: Artificial Analysis Llama 3.3 70B provider comparison, individual provider pricing pages.

[Figure: input cost per million tokens for Llama 3.3 70B Instruct across seven providers, from $0.10 (DeepInfra) to $0.90 (Fireworks); the 9x spread is annotated, with Groq highlighted for its 302 tok/s throughput.]

The 9x input-price spread, $0.10 to $0.90 for identical model weights, is the most important number in this audit. It exists because providers operate on different cost structures: DeepInfra and Nebius run FP8-quantized deployments at thin margins, Groq buys throughput with purpose-built LPU hardware, and Together and Fireworks bundle the model with feature breadth (function calling, structured output, fine-tuning, dedicated capacity options) that the cheapest providers do not offer at their advertised price.

Throughput is the second axis, and it tracks price more closely than a naive cost comparison suggests. Groq's 302 tokens per second on Llama 3.3 70B is roughly 2x the throughput delivered by Vertex or Bedrock and 4-6x the throughput observed on the cheapest hosts (where throughput figures are not always published, but practical measurements cluster around 50-80 tok/s for the budget tier). For latency-sensitive interactive workloads, paying 6x for Groq versus DeepInfra is often the right call. For batch enrichment, the cost spread dominates and the throughput is irrelevant.
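
The tradeoff is easy to put in numbers. A back-of-envelope sketch using the table's rates, with 65 tok/s assumed for the budget tier (the midpoint of the 50-80 tok/s range just cited):

```python
# (input $/M, output $/M, output tok/s); DeepInfra throughput is an assumption
providers = {
    "groq":      (0.59, 0.79, 302.7),
    "deepinfra": (0.10, 0.32, 65.0),
}

IN_TOK, OUT_TOK = 2_000, 1_000  # one mid-sized interactive request

for name, (p_in, p_out, tps) in providers.items():
    cost = IN_TOK / 1e6 * p_in + OUT_TOK / 1e6 * p_out
    secs = OUT_TOK / tps  # generation time only; ignores prefill and queueing
    print(f"{name:10s} ${cost:.5f}/request, ~{secs:.1f}s to stream the output")

# groq       $0.00197/request, ~3.3s
# deepinfra  $0.00052/request, ~15.4s
# Interactive UX feels the 12-second gap; batch pipelines only feel the ~4x cost.
```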

A note on Bedrock and Vertex: neither publishes Llama 3.3 70B pricing on its public-facing comparison pages with the granularity of an independent host, and the actual rate visible inside the AWS or Google Cloud console is typically 20-35% above the cheapest comparable serverless option. The premium funds the integration value (IAM, VPC, audit logging, single bill) that regulated industries require, but it is rarely competitive on pure inference cost.

Rate-limit tiers as deferred cost

The published price assumes the team has earned the right to pay it. On Anthropic and OpenAI, that right is gated by a tier ladder calibrated to cumulative spend, not just current spending capacity. The Anthropic structure as of May 2026 illustrates the pattern.

| Tier | Cumulative credit purchase | Monthly spend cap | Sonnet 4.x RPM | Sonnet 4.x ITPM | Sonnet 4.x OTPM |
|---|---|---|---|---|---|
| Tier 1 | $5 | $100 | 50 | 30,000 | 8,000 |
| Tier 2 | $40 | $500 | 1,000 | 450,000 | 90,000 |
| Tier 3 | $200 | $1,000 | 2,000 | 800,000 | 160,000 |
| Tier 4 | $400 | $200,000 | 4,000 | 2,000,000 | 400,000 |
| Custom | sales contact | unlimited | custom | custom | custom |

A Tier 1 Anthropic account is capped at 30,000 input tokens per minute on Sonnet 4.6, which is roughly one 30K-context request per minute and nothing more. A team that has provisioned a Sonnet 4.6 deployment and pointed production traffic at it will hit 429 errors on the second concurrent request, regardless of how much it is willing to spend on the bill. Reaching Tier 4 requires $400 of cumulative credit purchase plus payment history, which translates to several weeks of production usage at moderate volume before the rate-limit ceiling stops being the bottleneck. OpenAI's tier system follows a similar logic with five tiers gated at $5, $40, $200, $1,000, and higher cumulative spend, though the exact threshold structure has shifted twice in the last twelve months.
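
Whether a projected workload fits under a given tier is a mechanical check. A small sketch with the ceilings from the table above (illustrative, not an Anthropic API):

```python
TIERS = {  # Sonnet 4.x ceilings from the tier table
    1: {"rpm": 50,    "itpm": 30_000,    "otpm": 8_000},
    2: {"rpm": 1_000, "itpm": 450_000,   "otpm": 90_000},
    3: {"rpm": 2_000, "itpm": 800_000,   "otpm": 160_000},
    4: {"rpm": 4_000, "itpm": 2_000_000, "otpm": 400_000},
}

def fits(tier: int, req_per_min: float, in_tok: int, out_tok: int) -> bool:
    """True if the projected load clears every ceiling at this tier."""
    t = TIERS[tier]
    return (req_per_min <= t["rpm"]
            and req_per_min * in_tok <= t["itpm"]
            and req_per_min * out_tok <= t["otpm"])

# A 30K-token prompt at 70 requests/minute needs 2.1M ITPM: over even the
# Tier 4 ceiling until caching pulls cache reads out of ITPM (discussed below).
print(fits(4, 70, 30_000, 1_000))  # False
```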

The latent cost is the friction. A team forecasting an LLM budget at $20,000 per month often discovers that the first $1,000 of that spend is paid simply to qualify for a tier where the rest of the spend is mechanically possible. For workloads with steady demand, this is a one-time tax that amortizes quickly. For bursty traffic patterns, or for workloads that must scale up ahead of a regulatory deadline, the tier ladder can become a hard blocker that no amount of budget can purchase around without an enterprise sales conversation.

The dimension that almost never appears in cost analyses is what counts toward the rate limit. On all current Claude 4.x models, cache-read tokens do not count against ITPM. A team at Tier 4 with a 2,000,000 ITPM ceiling and an 80% cache hit rate has an effective throughput of 10,000,000 input tokens per minute, with the cached 80% billed at 10% of the headline rate on top of that. The same architectural decision (structuring prompts so that 70-90% of the input is served from cache) relaxes the rate-limit constraint by roughly 3-10x while cutting the bill by roughly 3-5x.

There is no equivalent mechanism on OpenAI or Google, where cached and uncached tokens both consume rate-limit budget. For workloads on Anthropic, prompt-cache discipline is simultaneously a cost optimization and a throughput one, and the throughput effect is often the more consequential of the two for teams approaching their tier ceiling.
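
The combined effect is worth quantifying. A sketch of effective throughput and effective input rate at a given cache hit rate, using the Sonnet 4.6 rates from the first table:

```python
def effective(itpm_ceiling: int, hit_rate: float,
              in_rate: float = 3.00, cache_rate: float = 0.30):
    # Only cache misses consume ITPM on Claude 4.x, so input throughput
    # scales by 1 / (1 - hit_rate).
    throughput = itpm_ceiling / (1 - hit_rate)
    # Billing: misses at the full input rate, hits at the cache-read rate.
    eff_rate = (1 - hit_rate) * in_rate + hit_rate * cache_rate
    return throughput, eff_rate

tp, rate = effective(itpm_ceiling=2_000_000, hit_rate=0.80)
print(f"{tp:,.0f} effective ITPM at ${rate:.2f}/M input")
# -> 10,000,000 effective ITPM at $0.84/M input (3.6x below the $3 headline)
```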

Region surcharges, context surcharges, and tool side-charges

Three categories of side-charges materially change the effective bill and rarely appear in a top-line provider comparison.

Regional and data-residency uplifts. Anthropic's 1P API charges a 1.1x multiplier on every token category (input, output, cache reads, cache writes, batch) when the request specifies inference_geo: "us" for guaranteed US-only inference on Claude Opus 4.6 and newer models. Anthropic on Bedrock and Vertex applies the same 1.1x premium for regional or multi-region endpoints on Claude 4.5 and beyond. OpenAI's regional processing endpoints carry a similar 10% uplift. For a workload spending $30,000 per month with strict US data-residency requirements, the multiplier alone is $3,000.

Context-tier surcharges. Google Gemini 2.5 Pro and Gemini 3.1 Pro Preview both double their entire pricing structure (input, output, batch) for prompts above 200K tokens. A request with a 250K-token codebase context and 15K tokens of output costs effectively 2x the headline rate. Anthropic Opus 4.7, Opus 4.6, and Sonnet 4.6 take the opposite approach: the full 1M-token context window is billed at the standard rate with no surcharge, which makes Anthropic structurally cheaper than Gemini for long-context workloads despite Gemini's lower headline price. OpenAI does not surface a context-tier surcharge on its public pricing page.

Tool-use side-charges. On Anthropic, web search is billed at $10 per 1,000 searches in addition to standard token costs, code execution is free up to 1,550 hours per month per organization and then $0.05/hour per container, and the tool-use system prompt automatically adds 313-346 tokens to every request that uses tools. An agent making 3-5 tool calls per turn over thousands of turns per day routinely sees side-charges add 20-40% to the headline token bill, and that overhead does not appear in any top-line comparison. OpenAI charges separately for file search, web search, and code interpreter; Google bundles some tools into the Gemini API at no additional cost but charges for others. The shape of the agent's tool graph is therefore a primary cost driver in agentic workloads, alongside the model selection itself.
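
Stacked together, the side-charges are simple to model and easy to underestimate. A sketch of one agent turn on Sonnet 4.6, with the token counts and search count as illustrative assumptions:

```python
IN_RATE, OUT_RATE = 3.00, 15.00  # $/M, Sonnet 4.6 headline rates
REGION_MULT = 1.1                # inference_geo: "us" uplift
SEARCH_FEE = 10 / 1_000          # $10 per 1,000 web searches
TOOL_PROMPT_TOK = 330            # ~313-346 tokens auto-added per tool request

in_tok, out_tok, searches = 6_000, 2_000, 2  # assumed per-turn shape

token_cost = ((in_tok + TOOL_PROMPT_TOK) / 1e6 * IN_RATE
              + out_tok / 1e6 * OUT_RATE) * REGION_MULT
total = token_cost + searches * SEARCH_FEE

print(f"tokens ${token_cost:.4f} + search ${searches * SEARCH_FEE:.3f} "
      f"= ${total:.4f}/turn")
# Two searches add ~37% on top of the token bill for this turn shape.
```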

The self-hosted crossover

The decision to self-host LLM inference is sometimes framed as a margin question (capture the margin currently paid to providers) and sometimes as a control question (on-premise data residency). The economics rarely favor self-hosting for general-purpose workloads, and the crossover point is higher than most teams' first-pass calculation suggests. The May 2026 numbers are concrete enough to evaluate directly.

A 2x H100 80GB instance is sufficient to serve Llama 3.3 70B at FP16 with reasonable batching. Per-hour costs on that instance vary by roughly 4x across hosting providers.

| Provider | Form factor | $/hr | Monthly (24/7) |
|---|---|---|---|
| Runpod | 2x H100 PCIe | $3.98 | $2,866 |
| CUDO Compute | 2x H100 80GB | $4.50 | $3,240 |
| Lambda Labs | 2x H100 SXM | $8.58 | $6,178 |
| AWS (p5.4xlarge) | 2x H100, single instance | $13.76 | $9,907 |
| Azure (NCCads v5) | 2x H100 | $15.78 | $11,362 |

Sources: getdeploying.com H100 cloud comparison, individual provider pricing pages.

Even the cheapest option costs roughly $2,900 per month before any operational overhead. To compare against serverless, assume an even input/output mix and the cheapest comparable serverless rate ($0.10 input, $0.32 output on DeepInfra).

| Monthly volume | Serverless cost | vs. cheapest self-host | vs. AWS self-host |
|---|---|---|---|
| 50M tokens | ~$10.50 | self-host 273x more | self-host 943x more |
| 500M tokens | ~$105 | self-host 27x more | self-host 94x more |
| 5B tokens | ~$1,050 | self-host 2.7x more | self-host 9.4x more |
| 50B tokens | ~$10,500 | self-host 0.27x (3.7x cheaper) | self-host 0.94x (parity) |
| 500B tokens | ~$105,000 | self-host 0.027x (37x cheaper) | self-host 0.094x (10x cheaper) |

The crossover for a 70B-class open-source model on the cheapest neocloud is somewhere between 10 billion and 50 billion tokens per month of steady traffic. Below that threshold, self-hosting is mechanically more expensive than calling DeepInfra. The crossover for a hyperscaler-hosted self-deploy (AWS, Azure) is closer to 50-100B tokens per month, because the per-GPU rate is 3-4x higher.

[Figure: monthly cost vs. monthly token volume (log-log) for three lanes: DeepInfra serverless at $0.21/M blended, self-host Runpod 2x H100 PCIe at $2,866/mo flat, and self-host AWS 2x H100 at $9,907/mo flat; crossovers marked at ~13.6B (Runpod) and ~47B (AWS) tokens/month.]

What the table does not show is the operational cost. Engineering time to deploy, monitor, scale, and update a self-hosted inference stack runs roughly $10,000 to $30,000 per month in loaded engineering cost for a small team, and significantly more if reliability requirements demand on-call rotations and multi-region failover. Cold-start exposure on bursty traffic patterns can add another 20-40% in over-provisioning. Model updates require a redeploy every time the underlying weights change, which on the open-source frontier is roughly every 6-10 weeks.
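
The crossover arithmetic is a one-liner, and it moves a long way once the operational line item is added. A sketch using the blended DeepInfra rate and the monthly GPU costs from the table, with the ops figure an assumption at the low end of the range above:

```python
BLENDED_SERVERLESS = 0.21e-6  # $/token: ($0.10 in + $0.32 out) / 2, per M

def crossover_tokens(gpu_monthly: float, ops_monthly: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper."""
    return (gpu_monthly + ops_monthly) / BLENDED_SERVERLESS

print(f"Runpod, silicon only: {crossover_tokens(2_866):,.0f} tok/mo")
print(f"Runpod + $10K ops:    {crossover_tokens(2_866, 10_000):,.0f} tok/mo")
print(f"AWS + $10K ops:       {crossover_tokens(9_907, 10_000):,.0f} tok/mo")
# -> ~13.6B, ~61B, and ~95B tokens/month. The ops line, not the silicon,
#    is what pushes the crossover out of reach for mid-volume teams.
```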

A practical guideline that holds across most enterprise contexts: do not self-host below 1B tokens per month, regardless of how cheap the silicon looks. Above 50B tokens per month of steady traffic, the case strengthens, and above 500B tokens per month it becomes hard to justify not running a hybrid stack with self-hosted as the primary lane and serverless as the burst overflow. Between those bounds, the right architecture is almost always serverless with multi-provider routing, because the cost difference is dominated by operational expense rather than per-token rate.

How to actually estimate enterprise inference cost

The four-factor framework from the hidden costs post maps onto the data above in a specific way. A defensible enterprise budget produces an estimate in roughly this order.

Step 1: Compute the effective input rate after caching. Decompose the average request into stable prefix (system prompt, tools, retrieval context) and variable tail (user query). For a workload with a 4K-token system prompt and 200-token average user message, roughly 95% of input is reusable. At an 80% cache hit rate that converts to an effective input rate of 0.2 × base + 0.8 × (0.1 × base) = 0.28 × base. The headline rate has dropped by a factor of 3.5 before any other optimization.
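
In code, with the hit rate defined as the share of all input tokens served from cache and cache reads billed at a tenth of the input rate:

```python
def effective_input_rate(base: float, hit_rate: float,
                         cache_mult: float = 0.10) -> float:
    """$/M input after caching: misses at full rate, hits at the cache rate."""
    return base * ((1 - hit_rate) + hit_rate * cache_mult)

print(effective_input_rate(base=3.00, hit_rate=0.80))  # 0.84 = 0.28 x base
```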

Step 2: Multiply by the output ratio appropriate to the workload. Reasoning-heavy and agentic workloads run output-to-input ratios of 1:1 to 5:1, while extraction and classification workloads run closer to 1:10. The output rate is typically 4-6x the input rate, so for a workload with a 1:1 ratio, output dominates the bill by roughly 5x.

Step 3: Apply tier headroom and side-charge stack. Verify that the projected per-minute throughput fits within the rate-limit ceiling at the tier the team can realistically reach in the first month of operation. Add region multipliers (1.1x for US-only on Anthropic, comparable on Bedrock and Vertex), context-tier surcharges (2x for Gemini Pro above 200K), and tool-use overhead (20-40% on agentic workloads). The fully loaded rate is typically 1.4-1.6x the rate from Step 2.

Step 4: Compare across providers using the fully loaded rate, not the published rate. A provider that is 10% cheaper on the headline rate but 30% more expensive on the cache-read multiplier will lose to the more expensive provider for any cache-friendly workload. Reasoning workloads, long-context workloads, and tool-heavy agentic workloads each have a different optimal provider, and the answer is rarely the one with the cheapest landing-page price.
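
The four steps compose into one estimator. A sketch with May 2026 rates from the first table and the Step 3 multipliers as explicit defaults that a real budget should replace with measured values:

```python
def monthly_estimate(input_m_tok: float,        # input volume, M tokens/month
                     in_rate: float, out_rate: float,
                     hit_rate: float = 0.80,    # Step 1: caching
                     out_ratio: float = 1.0,    # Step 2: output:input ratio
                     region_mult: float = 1.1,  # Step 3: residency uplift
                     side_mult: float = 1.3) -> float:  # Step 3: tools, retries
    eff_in = in_rate * ((1 - hit_rate) + hit_rate * 0.10)
    base = input_m_tok * (eff_in + out_ratio * out_rate)
    return base * region_mult * side_mult       # Step 4 compares this number

RATES = {"Claude Sonnet 4.6": (3.00, 15.00),
         "gpt-5.4": (2.50, 15.00),
         "Gemini 2.5 Pro": (1.25, 10.00)}

for model, (i, o) in RATES.items():
    print(f"{model:18s} ${monthly_estimate(1_000, i, o):>9,.0f}/mo "
          f"at 1B input tokens")
```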

The framework remains the one from the April post; the May 2026 data above provides the calibration. A team that walks through the four steps with a real workload almost always arrives at a budget that is 30-50% lower than the naive forecast and within 10% of actual production spend, which is the level of accuracy that lets a finance team plan around AI infrastructure rather than continuously revise it.

What this means for inference architecture

Three structural conclusions emerge from the May 2026 data.

The first is that provider pricing has flattened enough at the headline level that the optimization surface has moved to the surrounding mechanics. A few years ago the dominant question was whether to use OpenAI or a competitor at all. In 2026 the dominant question is which mechanics each provider supports well (caching depth, batch tier availability, tool integration cost, rate-limit ceiling at your tier), and the answer is workload-dependent. A team that picks one provider and standardizes on it is paying for organizational simplicity at a cost that is now visible in the bill.

The second is that the OSS-hosted model market has bifurcated. Cheap, throughput-tier providers (DeepInfra, Nebius, Novita) operate near commodity economics for batch and high-volume workloads. Premium-tier providers (Together, Fireworks, Replicate) charge more but offer feature breadth that matters for specific workload types. Specialized hardware providers (Groq, SambaNova) deliver throughput that the others cannot at any price. The right answer is rarely a single provider, and routing logic that picks per-workload is increasingly a structural cost optimization, not a feature.
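
A sketch of what per-workload routing looks like at its simplest, with a hypothetical workload-to-provider table standing in for the live health checks and tier-headroom logic a production router would add:

```python
ROUTES = {
    # workload type -> (provider, rationale drawn from the sections above)
    "interactive": ("groq",      "300 tok/s; latency is the product"),
    "batch":       ("deepinfra", "cheapest $/token; throughput irrelevant"),
    "agentic":     ("fireworks", "function calling and structured output"),
}

def route(workload: str) -> str:
    provider, why = ROUTES[workload]
    return f"{workload} -> {provider} ({why})"

print(route("batch"))  # batch -> deepinfra (cheapest $/token; ...)
```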

The third is that self-hosting is a more specialized decision than the cost narrative suggests. The crossover threshold for a 70B-class model is 10-50B tokens per month of steady traffic, which describes a fairly narrow band of mid-stage SaaS companies. Below that, serverless wins decisively on cost-per-token. Above 100B tokens per month, the question stops being "should we self-host" and becomes "what hybrid architecture optimally splits steady traffic from burst." The middle band is where most strategic mistakes happen, in both directions: companies that self-host too early and companies that delay too long.

The pricing page has become a less useful document than it used to be for budgeting an enterprise workload. The rate-limit tier policies, the cache architecture documentation, and the side-charge schedules are where the economics of enterprise inference in 2026 actually play out, and a budget that is built without consulting those documents tends to miss its target by 30-50% in the first month of production.


Inferbase operates a unified inference API across the major proprietary and OSS-hosted providers, with routing logic that handles the cache, retry, and provider-selection mechanics described in this post. The playground supports side-by-side cost comparisons across providers for a given prompt without committing to any of them. For teams running an enterprise inference workload, the four-step estimation framework above applies regardless of whether the routing layer is built in-house or sourced; what matters is that the bill is being modeled at the level of the mechanics rather than the rate card.

The published price remains the obvious starting point for any cost forecast. It has also become a less reliable predictor of what an enterprise team will actually spend, and a forecast built around it alone has roughly even odds of being off by a factor of two over the next twelve months.
