
The Real Cost of Inference at Enterprise Scale: A 2026 Pricing Audit

Inferbase Team · 19 min read

A team running an agent in production on Claude Sonnet 4.6 with a 30,000-token system prompt sends 100,000 requests per month. The published rate of $3 per million input tokens predicts a monthly bill of $9,000. The actual bill, in our experience and in conversations with several enterprise teams over the last quarter, lands closer to $14,000, and sometimes $18,000. The gap is not measurement error. It reflects the cumulative effect of rate-limit tier surcharges, regional data-residency uplifts, retry amplification, tool-use side-charges, and the spread between cached and uncached input pricing, very little of which any public pricing comparison surfaces in a single place.

In April we published a framework for thinking about LLM API costs, and explicitly deferred the cross-provider comparison until it could be structured around the four factors that actually drive enterprise spend: caching mechanics, output-to-input ratios, rate-limit tier surcharges, and reliability tax. This post is that deferred comparison. It uses pricing data captured directly from primary-source provider documentation in the first week of May 2026, and applies the framework to real numbers across the three layers a typical enterprise stack actually touches: frontier proprietary APIs, open-source models on serverless hosts, and self-hosted GPU infrastructure.

The argument, stated up front, is that the headline $/M token rate has become the least informative part of an enterprise inference budget. Most of the variation that affects a real bill now lives in the surrounding mechanics: caching architecture, rate-limit tier ladders, region multipliers, context-tier surcharges, and tool-use side-charges. The rate card is still where the conversation starts, but it is no longer where it usefully ends.

Frontier proprietary pricing as of May 2026

The four major frontier vendors (OpenAI, Anthropic, Google, DeepSeek) have converged on a similar pricing structure. Each publishes a base input rate, a cache-read rate at roughly a tenth of the input rate (2% at DeepSeek), an output rate at 4-6x input (Gemini 2.5 Pro runs higher, DeepSeek lower), and a batch tier at 50% off for asynchronous workloads. Within that structure, the absolute numbers continue to shift quarterly, and the May 2026 snapshot below covers the frontier lineup at each vendor across the nano, small, and flagship tiers.

| Vendor | Model | Input $/M | Cache hit $/M | Output $/M | Batch input $/M | Batch output $/M |
|---|---|---|---|---|---|---|
| OpenAI | gpt-5.4 | $2.50 | $0.25 | $15.00 | $1.25 | $7.50 |
| OpenAI | gpt-5.4-mini | $0.75 | $0.075 | $4.50 | $0.375 | $2.25 |
| OpenAI | gpt-5.4-nano | $0.20 | $0.02 | $1.25 | $0.10 | $0.625 |
| Anthropic | Claude Opus 4.7 | $5.00 | $0.50 | $25.00 | $2.50 | $12.50 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | $1.50 | $7.50 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $0.10 | $5.00 | $0.50 | $2.50 |
| Google | Gemini 3.1 Pro Preview | $2.00 | $0.20 | $12.00 | $1.00 | $6.00 |
| Google | Gemini 2.5 Pro | $1.25 | $0.125 | $10.00 | $0.625 | $5.00 |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.01 | $0.40 | $0.05 | $0.20 |
| DeepSeek | v4-flash | $0.14 | $0.0028 | $0.28 | n/a | n/a |
| DeepSeek | v4-pro (promotional) | $0.435 | $0.003625 | $0.87 | n/a | n/a |

Three observations stand out from the table. First, the spread between the cheapest viable frontier model (Gemini 2.5 Flash-Lite at $0.10 input) and the most capable (Claude Opus 4.7 at $5 input; GPT-5.5 Pro, not shown in the table, at $30) now runs to 300x, larger than in any prior generation, and the smaller models have closed enough of the quality gap that a meaningful share of production workloads no longer requires the flagships at all.

Second, DeepSeek v4-flash at $0.14 input, $0.28 output is a structural break in pricing: a frontier-class model priced at OpenAI nano levels, with cache hits at roughly three-tenths of a cent per million. The promotional v4-pro pricing window through May 31 brings full reasoning capability to under $0.50/M input, which is the lowest published rate for a frontier-class reasoning model in the current ecosystem.

Third, Anthropic's caching architecture is strictly more granular than OpenAI's or Google's, with separate prices for 5-minute and 1-hour cache writes and an explicit cache-control mechanism. The result is a different optimization surface than the automatic caching the others provide, which becomes important once a workload is routed by cost.
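
The difference is easiest to see in code. Below is a minimal sketch of explicit cache control using the `anthropic` Python SDK, with the model id a hypothetical stand-in for Sonnet 4.6 and the long prefix elided; the 1-hour write tier mentioned above is priced separately, and its exact syntax should be checked against current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "<stable ~30K-token prefix shared by every request>"

response = client.messages.create(
    model="claude-sonnet-4-6",  # hypothetical id for the Sonnet 4.6 above
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the prefix cacheable: later requests that share this
            # exact prefix are billed at the cache-read rate, not full input.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed in the Q2 contract?"}],
)

# usage.cache_read_input_tokens vs. usage.input_tokens shows how much of
# the request actually hit the cache.
print(response.usage)
```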

The pricing table also obscures Anthropic's tokenizer change in Opus 4.7, which uses up to 35% more tokens for the same input text compared to earlier Claude models. The headline input rate held at $5/M from Opus 4.5 through 4.7, but the effective cost per unit of work increased commensurately with the token count, an adjustment that does not appear on any pricing page and that several teams have surfaced in private conversations as the reason their budgets overshot in late April.

The same model, nine times the price

The proprietary table compares different models. The more revealing comparison is the same model across hosting providers, which strips out capability differences and isolates infrastructure economics. Llama 3.3 70B is the canonical case, served by roughly twenty providers on hardware ranging from H100 SXM clusters to Groq LPU racks to AMD MI300X to bespoke FP8 quantization stacks.

| Provider | Input $/M | Output $/M | Throughput (tok/s) | p50 latency (s) |
|---|---|---|---|---|
| DeepInfra (Turbo, FP8) | $0.10 | $0.32 | n/a | n/a |
| Nebius Base | $0.13 | $0.40 | n/a | n/a |
| Novita | $0.14 | n/a | n/a | n/a |
| Lightning AI | n/a | $0.30 | n/a | n/a |
| Groq | $0.59 | $0.79 | 302.7 | 0.83 |
| Replicate | $0.65 | $2.75 | n/a | n/a |
| Together AI | ~$0.88 | ~$0.88 | n/a | n/a |
| Fireworks | $0.90 | $0.90 | n/a | 0.82 |
| Google Vertex | not published | not published | 148.3 | 0.66 |
| Amazon Bedrock | not published | not published | 141.7 | n/a |
| SambaNova | not published | not published | 293.7 | n/a |

Sources: Artificial Analysis Llama 3.3 70B provider comparison, individual provider pricing pages.

[Figure: input cost per million tokens for Llama 3.3 70B Instruct across seven providers, from $0.10 (DeepInfra) to $0.90 (Fireworks); the 9x spread is annotated, with Groq highlighted for its 302 tok/s throughput.]

The 9x input-price spread, $0.10 to $0.90 for identical model weights, is the most important number in this audit. It exists because providers operate on different cost structures: DeepInfra and Nebius run FP8-quantized deployments at thin margins, Groq buys throughput with purpose-built LPU hardware, and Together and Fireworks bundle the model with feature breadth (function calling, structured output, fine-tuning, dedicated capacity options) that the cheapest providers do not offer at their advertised price.

Throughput is the second axis, and it tracks price more closely than a naive cost comparison suggests. Groq's 302 tokens per second on Llama 3.3 70B is roughly 2x the throughput delivered by Vertex or Bedrock and 4-6x the throughput observed on the cheapest hosts (where throughput figures are not always published, but practical measurements cluster around 50-80 tok/s for the budget tier). For latency-sensitive interactive workloads, paying 6x for Groq versus DeepInfra is often the right call. For batch enrichment, the cost spread dominates and the throughput is irrelevant.
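
The tradeoff is easy to put in numbers. A back-of-envelope sketch using the table's rates, with 65 tok/s assumed for the budget tier (the midpoint of the 50-80 tok/s range just cited):

```python
# (input $/M, output $/M, output tok/s); DeepInfra throughput is an assumption
providers = {
    "groq":      (0.59, 0.79, 302.7),
    "deepinfra": (0.10, 0.32, 65.0),
}

IN_TOK, OUT_TOK = 2_000, 1_000  # one mid-sized interactive request

for name, (p_in, p_out, tps) in providers.items():
    cost = IN_TOK / 1e6 * p_in + OUT_TOK / 1e6 * p_out
    secs = OUT_TOK / tps  # generation time only; ignores prefill and queueing
    print(f"{name:10s} ${cost:.5f}/request, ~{secs:.1f}s to stream the output")

# groq       $0.00197/request, ~3.3s
# deepinfra  $0.00052/request, ~15.4s
# Interactive UX feels the 12-second gap; batch pipelines only feel the ~4x cost.
```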

A note on Bedrock and Vertex: neither publishes Llama 3.3 70B pricing on its public-facing comparison pages with the granularity of an independent host, and the actual rate visible inside the AWS or Google Cloud console is typically 20-35% above the cheapest comparable serverless option. The premium funds the integration value (IAM, VPC, audit logging, single bill) that regulated industries require, but it is rarely competitive on pure inference cost.

Rate-limit tiers as deferred cost

The published price assumes the team has earned the right to pay it. On Anthropic and OpenAI, that right is gated by a tier ladder calibrated to cumulative spend, not just current spending capacity. The Anthropic structure as of May 2026 illustrates the pattern.

| Tier | Cumulative credit purchase | Monthly spend cap | Sonnet 4.x RPM | Sonnet 4.x ITPM | Sonnet 4.x OTPM |
|---|---|---|---|---|---|
| Tier 1 | $5 | $100 | 50 | 30,000 | 8,000 |
| Tier 2 | $40 | $500 | 1,000 | 450,000 | 90,000 |
| Tier 3 | $200 | $1,000 | 2,000 | 800,000 | 160,000 |
| Tier 4 | $400 | $200,000 | 4,000 | 2,000,000 | 400,000 |
| Custom | sales contact | unlimited | custom | custom | custom |

A Tier 1 Anthropic account is capped at 30,000 input tokens per minute on Sonnet 4.6, which is roughly one 30K-context request per minute and nothing more. A team that has provisioned a Sonnet 4.6 deployment and pointed production traffic at it will hit 429 errors on the second concurrent request, regardless of how much it is willing to spend on the bill. Reaching Tier 4 requires $400 of cumulative credit purchase plus payment history, which translates to several weeks of production usage at moderate volume before the rate-limit ceiling stops being the bottleneck. OpenAI's tier system follows a similar logic with five tiers gated at $5, $40, $200, $1,000, and higher cumulative spend, though the exact threshold structure has shifted twice in the last twelve months.
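
Whether a projected workload fits under a given tier is a mechanical check. A small sketch with the ceilings from the table above (illustrative, not an Anthropic API):

```python
TIERS = {  # Sonnet 4.x ceilings from the tier table
    1: {"rpm": 50,    "itpm": 30_000,    "otpm": 8_000},
    2: {"rpm": 1_000, "itpm": 450_000,   "otpm": 90_000},
    3: {"rpm": 2_000, "itpm": 800_000,   "otpm": 160_000},
    4: {"rpm": 4_000, "itpm": 2_000_000, "otpm": 400_000},
}

def fits(tier: int, req_per_min: float, in_tok: int, out_tok: int) -> bool:
    """True if the projected load clears every ceiling at this tier."""
    t = TIERS[tier]
    return (req_per_min <= t["rpm"]
            and req_per_min * in_tok <= t["itpm"]
            and req_per_min * out_tok <= t["otpm"])

# A 30K-token prompt at 70 requests/minute needs 2.1M ITPM: over even the
# Tier 4 ceiling until caching pulls cache reads out of ITPM (discussed below).
print(fits(4, 70, 30_000, 1_000))  # False
```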

The latent cost is the friction. A team forecasting an LLM budget at $20,000 per month often discovers that the first $1,000 of that spend is paid simply to qualify for a tier where the rest of the spend is mechanically possible. For workloads with steady demand, this is a one-time tax that amortizes quickly. For bursty traffic patterns, or for workloads that must scale up ahead of a regulatory deadline, the tier ladder can become a hard blocker that no amount of budget can purchase around without an enterprise sales conversation.

The dimension that almost never appears in cost analyses is what counts toward the rate limit. On all current Claude 4.x models, cache-read tokens do not count against ITPM. A team at Tier 4 with a 2,000,000 ITPM ceiling and an 80% cache hit rate has an effective throughput of 10,000,000 input tokens per minute, with the cached 80% billed at 10% of the headline rate on top of that. The same architectural decision (structuring prompts so that 70-90% of the input is served from cache) relaxes the rate-limit constraint by roughly 3-10x while cutting the bill by roughly 3-5x.

There is no equivalent mechanism on OpenAI or Google, where cached and uncached tokens both consume rate-limit budget. For workloads on Anthropic, prompt-cache discipline is simultaneously a cost optimization and a throughput one, and the throughput effect is often the more consequential of the two for teams approaching their tier ceiling.
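
The combined effect is worth quantifying. A sketch of effective throughput and effective input rate at a given cache hit rate, using the Sonnet 4.6 rates from the first table:

```python
def effective(itpm_ceiling: int, hit_rate: float,
              in_rate: float = 3.00, cache_rate: float = 0.30):
    # Only cache misses consume ITPM on Claude 4.x, so input throughput
    # scales by 1 / (1 - hit_rate).
    throughput = itpm_ceiling / (1 - hit_rate)
    # Billing: misses at the full input rate, hits at the cache-read rate.
    eff_rate = (1 - hit_rate) * in_rate + hit_rate * cache_rate
    return throughput, eff_rate

tp, rate = effective(itpm_ceiling=2_000_000, hit_rate=0.80)
print(f"{tp:,.0f} effective ITPM at ${rate:.2f}/M input")
# -> 10,000,000 effective ITPM at $0.84/M input (3.6x below the $3 headline)
```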

Region surcharges, context surcharges, and tool side-charges

Three categories of side-charges materially change the effective bill and rarely appear in a top-line provider comparison.

Regional and data-residency uplifts. Anthropic's 1P API charges a 1.1x multiplier on every token category (input, output, cache reads, cache writes, batch) when the request specifies inference_geo: "us" for guaranteed US-only inference on Claude Opus 4.6 and newer models. Anthropic on Bedrock and Vertex applies the same 1.1x premium for regional or multi-region endpoints on Claude 4.5 and beyond. OpenAI's regional processing endpoints carry a similar 10% uplift. For a workload spending $30,000 per month with strict US data-residency requirements, the multiplier alone is $3,000.

Context-tier surcharges. Google Gemini 2.5 Pro and Gemini 3.1 Pro Preview both double their entire pricing structure (input, output, batch) for prompts above 200K tokens. A request with a 250K-token codebase context and 15K tokens of output costs effectively 2x the headline rate. Anthropic Opus 4.7, Opus 4.6, and Sonnet 4.6 take the opposite approach: the full 1M-token context window is billed at the standard rate with no surcharge, which makes Anthropic structurally cheaper than Gemini for long-context workloads despite Gemini's lower headline price. OpenAI does not surface a context-tier surcharge on its public pricing page.

Tool-use side-charges. On Anthropic, web search is billed at $10 per 1,000 searches in addition to standard token costs, code execution is free up to 1,550 hours per month per organization and then $0.05/hour per container, and the tool-use system prompt automatically adds 313-346 tokens to every request that uses tools. An agent making 3-5 tool calls per turn over thousands of turns per day routinely sees side-charges add 20-40% to the headline token bill, and that overhead does not appear in any top-line comparison. OpenAI charges separately for file search, web search, and code interpreter; Google bundles some tools into the Gemini API at no additional cost but charges for others. The shape of the agent's tool graph is therefore a primary cost driver in agentic workloads, alongside the model selection itself.
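
Stacked together, the side-charges are simple to model and easy to underestimate. A sketch of one agent turn on Sonnet 4.6, with the token counts and search count as illustrative assumptions:

```python
IN_RATE, OUT_RATE = 3.00, 15.00  # $/M, Sonnet 4.6 headline rates
REGION_MULT = 1.1                # inference_geo: "us" uplift
SEARCH_FEE = 10 / 1_000          # $10 per 1,000 web searches
TOOL_PROMPT_TOK = 330            # ~313-346 tokens auto-added per tool request

in_tok, out_tok, searches = 6_000, 2_000, 2  # assumed per-turn shape

token_cost = ((in_tok + TOOL_PROMPT_TOK) / 1e6 * IN_RATE
              + out_tok / 1e6 * OUT_RATE) * REGION_MULT
total = token_cost + searches * SEARCH_FEE

print(f"tokens ${token_cost:.4f} + search ${searches * SEARCH_FEE:.3f} "
      f"= ${total:.4f}/turn")
# Two searches add ~37% on top of the token bill for this turn shape.
```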

The self-hosted crossover

The decision to self-host LLM inference is sometimes framed as a margin question (capture the margin currently paid to providers) and sometimes as a control question (on-premise data residency). The economics rarely favor self-hosting for general-purpose workloads, and the crossover point is higher than most teams' first-pass calculation suggests. The May 2026 numbers are concrete enough to evaluate directly.

A 2x H100 80GB instance is sufficient to serve Llama 3.3 70B at FP16 with reasonable batching. Per-hour costs on that instance vary by roughly 4x across hosting providers.

| Provider | Form factor | $/hr | Monthly (24/7) |
|---|---|---|---|
| Runpod | 2x H100 PCIe | $3.98 | $2,866 |
| CUDO Compute | 2x H100 80GB | $4.50 | $3,240 |
| Lambda Labs | 2x H100 SXM | $8.58 | $6,178 |
| AWS (p5.4xlarge) | 2x H100, single instance | $13.76 | $9,907 |
| Azure (NCCads v5) | 2x H100 | $15.78 | $11,362 |

Sources: getdeploying.com H100 cloud comparison, individual provider pricing pages.

Even the cheapest option costs roughly $2,900 per month before any operational overhead. To compare against serverless, assume an even input/output mix and the cheapest comparable serverless rate ($0.10 input, $0.32 output on DeepInfra).

| Monthly volume | Serverless cost | vs. cheapest self-host | vs. AWS self-host |
|---|---|---|---|
| 50M tokens | ~$10.50 | self-host 273x more | self-host 943x more |
| 500M tokens | ~$105 | self-host 27x more | self-host 94x more |
| 5B tokens | ~$1,050 | self-host 2.7x more | self-host 9.4x more |
| 50B tokens | ~$10,500 | self-host 0.27x (3.7x cheaper) | self-host 0.94x (parity) |
| 500B tokens | ~$105,000 | self-host 0.027x (37x cheaper) | self-host 0.094x (10x cheaper) |

The crossover for a 70B-class open-source model on the cheapest neocloud is somewhere between 10 billion and 50 billion tokens per month of steady traffic. Below that threshold, self-hosting is mechanically more expensive than calling DeepInfra. The crossover for a hyperscaler-hosted self-deploy (AWS, Azure) is closer to 50-100B tokens per month, because the per-GPU rate is 3-4x higher.

[Figure: monthly cost vs. monthly token volume (log-log) for three lanes: DeepInfra serverless at $0.21/M blended, self-host Runpod 2x H100 PCIe at $2,866/mo flat, and self-host AWS 2x H100 at $9,907/mo flat; crossovers marked at ~13.6B (Runpod) and ~47B (AWS) tokens/month.]

What the table does not show is the operational cost. Engineering time to deploy, monitor, scale, and update a self-hosted inference stack runs roughly $10,000 to $30,000 per month in loaded engineering cost for a small team, and significantly more if reliability requirements demand on-call rotations and multi-region failover. Cold-start exposure on bursty traffic patterns can add another 20-40% in over-provisioning. Model updates require a redeploy every time the underlying weights change, which on the open-source frontier is roughly every 6-10 weeks.
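
The crossover arithmetic is a one-liner, and it moves a long way once the operational line item is added. A sketch using the blended DeepInfra rate and the monthly GPU costs from the table, with the ops figure an assumption at the low end of the range above:

```python
BLENDED_SERVERLESS = 0.21e-6  # $/token: ($0.10 in + $0.32 out) / 2, per M

def crossover_tokens(gpu_monthly: float, ops_monthly: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper."""
    return (gpu_monthly + ops_monthly) / BLENDED_SERVERLESS

print(f"Runpod, silicon only: {crossover_tokens(2_866):,.0f} tok/mo")
print(f"Runpod + $10K ops:    {crossover_tokens(2_866, 10_000):,.0f} tok/mo")
print(f"AWS + $10K ops:       {crossover_tokens(9_907, 10_000):,.0f} tok/mo")
# -> ~13.6B, ~61B, and ~95B tokens/month. The ops line, not the silicon,
#    is what pushes the crossover out of reach for mid-volume teams.
```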

A practical guideline that holds across most enterprise contexts: do not self-host below 1B tokens per month, regardless of how cheap the silicon looks. Above 50B tokens per month of steady traffic, the case strengthens, and above 500B tokens per month it becomes hard to justify not running a hybrid stack with self-hosted as the primary lane and serverless as the burst overflow. Between those bounds, the right architecture is almost always serverless with multi-provider routing, because the cost difference is dominated by operational expense rather than per-token rate.

How to actually estimate enterprise inference cost

The four-factor framework from the hidden costs post maps onto the data above in a specific way. A defensible enterprise budget produces an estimate in roughly this order.

Step 1: Compute the effective input rate after caching. Decompose the average request into stable prefix (system prompt, tools, retrieval context) and variable tail (user query). For a workload with a 4K-token system prompt and 200-token average user message, roughly 95% of input is reusable. At an 80% cache hit rate that converts to an effective input rate of 0.2 × base + 0.8 × (0.1 × base) = 0.28 × base. The headline rate has dropped by a factor of 3.5 before any other optimization.
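
In code, with the hit rate defined as the share of all input tokens served from cache and cache reads billed at a tenth of the input rate:

```python
def effective_input_rate(base: float, hit_rate: float,
                         cache_mult: float = 0.10) -> float:
    """$/M input after caching: misses at full rate, hits at the cache rate."""
    return base * ((1 - hit_rate) + hit_rate * cache_mult)

print(effective_input_rate(base=3.00, hit_rate=0.80))  # 0.84 = 0.28 x base
```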

Step 2: Multiply by the output ratio appropriate to the workload. Reasoning-heavy and agentic workloads run output-to-input ratios of 1:1 to 5:1, while extraction and classification workloads run closer to 1:10. The output rate is typically 4-6x the input rate, so for a workload with a 1:1 ratio, output dominates the bill by roughly 5x.

Step 3: Apply tier headroom and side-charge stack. Verify that the projected per-minute throughput fits within the rate-limit ceiling at the tier the team can realistically reach in the first month of operation. Add region multipliers (1.1x for US-only on Anthropic, comparable on Bedrock and Vertex), context-tier surcharges (2x for Gemini Pro above 200K), and tool-use overhead (20-40% on agentic workloads). The fully loaded rate is typically 1.4-1.6x the rate from Step 2.

Step 4: Compare across providers using the fully loaded rate, not the published rate. A provider that is 10% cheaper on the headline rate but 30% more expensive on the cache-read multiplier will lose to the more expensive provider for any cache-friendly workload. Reasoning workloads, long-context workloads, and tool-heavy agentic workloads each have a different optimal provider, and the answer is rarely the one with the cheapest landing-page price.
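
The four steps compose into one estimator. A sketch with May 2026 rates from the first table and the Step 3 multipliers as explicit defaults that a real budget should replace with measured values:

```python
def monthly_estimate(input_m_tok: float,        # input volume, M tokens/month
                     in_rate: float, out_rate: float,
                     hit_rate: float = 0.80,    # Step 1: caching
                     out_ratio: float = 1.0,    # Step 2: output:input ratio
                     region_mult: float = 1.1,  # Step 3: residency uplift
                     side_mult: float = 1.3) -> float:  # Step 3: tools, retries
    eff_in = in_rate * ((1 - hit_rate) + hit_rate * 0.10)
    base = input_m_tok * (eff_in + out_ratio * out_rate)
    return base * region_mult * side_mult       # Step 4 compares this number

RATES = {"Claude Sonnet 4.6": (3.00, 15.00),
         "gpt-5.4": (2.50, 15.00),
         "Gemini 2.5 Pro": (1.25, 10.00)}

for model, (i, o) in RATES.items():
    print(f"{model:18s} ${monthly_estimate(1_000, i, o):>9,.0f}/mo "
          f"at 1B input tokens")
```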

The framework remains the one from the April post; the May 2026 data above provides the calibration. A team that walks through the four steps with a real workload almost always arrives at a budget that is 30-50% lower than the naive forecast and within 10% of actual production spend, which is the level of accuracy that lets a finance team plan around AI infrastructure rather than continuously revise it.

What this means for inference architecture

Three structural conclusions emerge from the May 2026 data.

The first is that provider pricing has flattened enough at the headline level that the optimization surface has moved to the surrounding mechanics. A few years ago the dominant question was whether to use OpenAI or a competitor at all. In 2026 the dominant question is which mechanics each provider supports well (caching depth, batch tier availability, tool integration cost, rate-limit ceiling at your tier), and the answer is workload-dependent. A team that picks one provider and standardizes on it is paying for organizational simplicity at a cost that is now visible in the bill.

The second is that the OSS-hosted model market has bifurcated. Cheap, throughput-tier providers (DeepInfra, Nebius, Novita) operate near commodity economics for batch and high-volume workloads. Premium-tier providers (Together, Fireworks, Replicate) charge more but offer feature breadth that matters for specific workload types. Specialized hardware providers (Groq, SambaNova) deliver throughput that the others cannot at any price. The right answer is rarely a single provider, and routing logic that picks per-workload is increasingly a structural cost optimization, not a feature.
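
A sketch of what per-workload routing looks like at its simplest, with a hypothetical workload-to-provider table standing in for the live health checks and tier-headroom logic a production router would add:

```python
ROUTES = {
    # workload type -> (provider, rationale drawn from the sections above)
    "interactive": ("groq",      "300 tok/s; latency is the product"),
    "batch":       ("deepinfra", "cheapest $/token; throughput irrelevant"),
    "agentic":     ("fireworks", "function calling and structured output"),
}

def route(workload: str) -> str:
    provider, why = ROUTES[workload]
    return f"{workload} -> {provider} ({why})"

print(route("batch"))  # batch -> deepinfra (cheapest $/token; ...)
```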

The third is that self-hosting is a more specialized decision than the cost narrative suggests. The crossover threshold for a 70B-class model is 10-50B tokens per month of steady traffic, which describes a fairly narrow band of mid-stage SaaS companies. Below that, serverless wins decisively on cost-per-token. Above 100B tokens per month, the question stops being "should we self-host" and becomes "what hybrid architecture optimally splits steady traffic from burst." The middle band is where most strategic mistakes happen, in both directions: companies that self-host too early and companies that delay too long.

The pricing page has become a less useful document than it used to be for budgeting an enterprise workload. The rate-limit tier policies, the cache architecture documentation, and the side-charge schedules are where the economics of enterprise inference in 2026 actually play out, and a budget that is built without consulting those documents tends to miss its target by 30-50% in the first month of production.


Inferbase operates a unified inference API across the major proprietary and OSS-hosted providers, with routing logic that handles the cache, retry, and provider-selection mechanics described in this post. The playground supports side-by-side cost comparisons across providers for a given prompt without committing to any of them. For teams running an enterprise inference workload, the four-step estimation framework above applies regardless of whether the routing layer is built in-house or sourced; what matters is that the bill is being modeled at the level of the mechanics rather than the rate card.

The published price remains the obvious starting point for any cost forecast. It has also become a less reliable predictor of what an enterprise team will actually spend, and a forecast built around it alone has roughly even odds of being off by a factor of two over the next twelve months.
