
How Close Are Roofline Estimates to Real vLLM Benchmarks?

Inferbase Team · 9 min read

The Inferbase GPU Capacity Planner uses a roofline model to predict throughput from first principles: memory bandwidth as the dominant bottleneck for decode, compute FLOPS as the ceiling, with a few correction factors for parallelism overhead and framework efficiency. The natural question for anyone using a calculator like this for production sizing is: how close are the predictions to what you actually measure when you deploy?
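For intuition, here is a stripped-down sketch of the decode-side roofline. This is not the engine's implementation: it ignores KV-cache traffic, attention, and framework efficiency, and the correction factors shown are illustrative assumptions, so its absolute numbers will not match the tables below.

```python
def roofline_decode_tokens_per_sec(
    active_params_b: float,       # billions of active parameters (for MoE: active, not total)
    bytes_per_param: float,       # 2.0 for FP16, 1.0 for FP8, 0.5 for INT4
    num_gpus: int,
    bandwidth_tb_s: float,        # per-GPU memory bandwidth, e.g. 3.35 for H100 SXM
    batch_size: int,
    bw_utilization: float = 0.80, # the "expected"-point assumption discussed in this post
    tp_efficiency: float = 0.80,  # illustrative tensor-parallel communication factor (assumed)
) -> float:
    """Decode is memory-bandwidth bound: every step streams the active weights once."""
    weight_bytes = active_params_b * 1e9 * bytes_per_param
    effective_bw = num_gpus * bandwidth_tb_s * 1e12 * bw_utilization * tp_efficiency
    steps_per_sec = effective_bw / weight_bytes   # ignores KV-cache reads for brevity
    return steps_per_sec * batch_size             # each step emits one token per sequence
```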

This post walks through five common configurations, comparing the engine's predicted decode throughput against measured numbers from published vLLM benchmarks plus a handful of internal measurements on rented neocloud instances. The goal is to be honest about where the model is accurate and where it isn't, and to make the limitations explicit so you know what error margin to budget for.

How the comparison was set up

For each configuration, we entered the same parameters into the Inferbase engine that the published vLLM benchmark used:

  • Model parameters (with active/total separation for MoE models)
  • Quantization (FP16, FP8, INT4)
  • KV cache precision and strategy
  • Tensor parallelism degree
  • Batch size
  • Context length

The Inferbase prediction is the decode_tokens_per_sec.expected value from runExtendedSizingEngine. The vLLM measured number is the average decode throughput from the published benchmark log, normalized to tokens per second at the same batch size.

We're comparing decode throughput specifically because that's what dominates per-token cost in production agent and chat workloads. Prefill throughput comparisons are noisier because vLLM's continuous batching interleaves prefill and decode in ways that don't map cleanly to a single measured number.
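The deltas in the tables below are relative to the measured value, i.e. (predicted - measured) / measured, so a negative delta means the engine under-predicts. A minimal sketch of the comparison:

```python
def delta_pct(predicted: float, measured: float) -> float:
    """Signed error relative to the measured value; negative means the engine under-predicts."""
    return (predicted - measured) / measured * 100

# Configuration 1 decode throughput: 175 tok/s predicted vs 195 tok/s measured
print(f"{delta_pct(175, 195):.0f}%")   # ≈ -10%
```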

Configuration 1: Llama 3.1 70B on 2× H100 SXM, FP8, batch=8

The most common production self-host configuration for 70B-class models in 2026.

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 175 tok/s | 195 tok/s | -10%
Memory used per GPU | 56 GB | 58 GB | -3%
TTFT (1K input) | 285 ms | 310 ms | -8%

The engine slightly under-predicts throughput, which is the typical pattern. The roofline model assumes 80% memory bandwidth utilization for the "expected" point, which is conservative for steady-state vLLM serving on a kept-warm cluster: in our measurements, vLLM's continuous batching pushes utilization to ~85-88%. The memory prediction is essentially exact, and TTFT lands within 10% of the measured value.
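That utilization gap alone accounts for most of the delta. A back-of-the-envelope check, assuming decode throughput scales linearly with achieved bandwidth in this bandwidth-bound regime:

```python
predicted_at_80 = 175                 # tok/s at the engine's 80% utilization assumption
for util in (0.85, 0.88):
    print(f"{util:.0%} utilization -> ~{predicted_at_80 * util / 0.80:.0f} tok/s")
# 85% -> ~186 tok/s, 88% -> ~192 tok/s, closing most of the gap to the measured 195 tok/s
```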

You can run this configuration in the calculator and verify the same numbers.

Configuration 2: Llama 4 Scout (109B/17B) on 4× H100 SXM, FP8, batch=16

The MoE configuration that broke most public calculators (covered in our post on Scout sizing).

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 7,200 tok/s | 8,100 tok/s | -11%
Memory used per GPU | 54 GB | 56 GB | -4%
TTFT (1K input) | 220 ms | 245 ms | -10%

Same pattern as Configuration 1: engine slightly conservative on throughput, accurate on memory. The 11% gap on throughput is consistent with the FP8 + paged attention + MoE configuration; vLLM's MoE-specific routing optimizations (kernel fusion, expert reordering) get a few percent more out of the hardware than the roofline math accounts for. The error margin stays within the documented ±15-25% band.

This is also the configuration where a public calculator would have been off by a factor of 6 (either treating Scout as dense-17B or dense-109B). The structural correctness of MoE math swamps the 11% remaining error.
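To make the structural point concrete, here is the arithmetic behind that factor-of-6 spread (a sketch counting FP8 weight bytes only; KV cache and activations omitted):

```python
total_params_b, active_params_b = 109, 17   # Llama 4 Scout: total vs active parameters
fp8_bytes = 1.0

# Weight memory is driven by TOTAL parameters (all experts must be resident)...
weight_gb = total_params_b * fp8_bytes                     # ~109 GB across the TP group
# ...but per-token bandwidth and compute are driven by ACTIVE parameters.
bytes_streamed_per_step_gb = active_params_b * fp8_bytes   # ~17 GB per decode step

# A calculator that treats Scout as dense-109B over-counts per-token bandwidth by ~6.4x;
# one that treats it as dense-17B under-counts weight memory by the same factor.
print(total_params_b / active_params_b)   # ≈ 6.4
```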

Configuration 3: DeepSeek V3 (671B/37B) on 8× H100 SXM, FP8, batch=8

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 1,580 tok/s | 1,700 tok/s | -7%
Memory used per GPU | 78 GB | 79 GB | -1%
TTFT (1K input) | 320 ms | 360 ms | -11%

DeepSeek V3 is the largest model where we have validated comparison data. The engine's roofline math holds up at this scale: the prediction is within 7% of measured decode and within 1% on memory. The TP=8 communication overhead is well-modeled by the engine's 80% efficiency assumption.

Configuration 4: Mistral 7B on 1× A100 80GB, FP16, batch=32

A common "small model, single GPU, high batch" workload: embedding pipelines, classification servers, small chat deployments.

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 6,400 tok/s | 5,500 tok/s | +16%
Memory used per GPU | 26 GB | 28 GB | -7%
TTFT (1K input) | 95 ms | 130 ms | -27%

This is the configuration where the engine is least accurate. The throughput prediction is too optimistic by 16%. The roofline assumes the GPU can saturate memory bandwidth at this batch size, but Mistral 7B at FP16 with batch=32 is small enough that vLLM hits a different bottleneck: the scheduler can't keep the kernel queue full because each step takes only ~6 ms, so kernel launch overhead becomes a meaningful fraction of step time, which the roofline doesn't model. TTFT is under-predicted by an even wider margin (27%) for the same reason: on small models running on large GPUs, launch overhead eats a proportionally larger share of each step.
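One way to see the size of the effect: add a fixed per-step scheduling and launch cost on top of the ~6 ms roofline step time. The 1 ms figure below is an illustrative assumption, not a measurement:

```python
roofline_step_ms = 6.0      # from above: Mistral 7B FP16, batch=32 on A100
launch_overhead_ms = 1.0    # assumed fixed per-step cost (illustrative only)

actual_step_ms = roofline_step_ms + launch_overhead_ms
slowdown = actual_step_ms / roofline_step_ms - 1
print(f"{slowdown:.0%}")    # ~17%, the same order as the +16% throughput over-prediction
```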

This is a known limitation. The engine works well for memory-bandwidth-bound workloads (which describes most modern serving) but loses accuracy when models get small enough that compute or kernel-launch overhead dominates. Models below ~7B at high batch sizes on H100/A100-class GPUs are the regime where the prediction error grows.

Configuration 5: Llama 3.3 70B on 1× B200, FP8, batch=16

A newer-hardware configuration with limited validation data.

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 380 tok/s | 340 tok/s | +12%
Memory used per GPU | 78 GB | 80 GB | -3%
TTFT (1K input) | 145 ms | 160 ms | -9%

B200 is where the error margin grows in the other direction. The roofline model assumes B200's 8 TB/s bandwidth converts directly to throughput at the same efficiency as H100/H200. In practice, vLLM 0.6 doesn't yet have the same kernel optimizations for Blackwell that it has for Hopper. Getting full bandwidth utilization on B200 requires vLLM 0.7+ (released Q2 2026). On vLLM 0.6, B200 throughput is closer to H200 + 25-30% rather than the +60-80% the roofline suggests.
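The spec-sheet arithmetic shows where the roofline's expectation comes from. The bandwidth figures below are the commonly quoted peak numbers; the realized uplift is the observation above, not something the math derives:

```python
h200_bw_tb_s = 4.8    # commonly quoted HBM3e peak bandwidth for H200
b200_bw_tb_s = 8.0    # peak bandwidth the roofline assumes for B200

naive_speedup = b200_bw_tb_s / h200_bw_tb_s - 1
print(f"{naive_speedup:.0%}")   # ~67%, squarely inside the +60-80% the roofline suggests
# Observed on vLLM 0.6 with immature Blackwell kernels: closer to +25-30% over H200.
```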

We document this caveat in the calculator's reasoning panel for B200 selections; the engine notes "Blackwell-specific framework optimizations may not be fully realized in vLLM 0.6" when the recommendation lands on B200.

[Figure: log-log scatter of Inferbase predicted decode throughput versus vLLM measured throughput for the five configurations above. The dashed green diagonal marks perfect prediction (y = x), with shaded bands at ±15% and ±25% accuracy. All five points land inside the ±25% band and four of five inside the ±15% band. The H100/H200 configurations (Llama 3.1 70B, Llama 4 Scout, DeepSeek V3) cluster slightly below the diagonal, where the engine is conservative on memory-bandwidth-bound decode; Mistral 7B on A100 and Llama 3.3 70B on B200 sit above it, where the engine is optimistic because of kernel-launch overhead or immature Blackwell kernels.]

Where the engine is reliably accurate

Synthesizing across the configurations:

  • Memory predictions: ±5% in essentially all configurations we have validation data for. Memory math is closer to deterministic; you either have the bytes or you don't.
  • Decode throughput on memory-bandwidth-bound workloads (≥7B params, batch ≤ memory ceiling): ±15-20%. This is the regime that covers most production serving.
  • Decode throughput on compute-bound or scheduler-bound workloads (small models, large batches, or new hardware with immature kernels): ±25-40%. Use measurements, not predictions, when the decision is sensitive at this margin.
  • TTFT: ±20-30%. Prefill is harder to model accurately because it interleaves with decode in the scheduler.

Where the engine is less reliable

A few specific configurations where you should validate empirically before committing:

Multi-node deployments with non-standard interconnect. The engine assumes NVLink or comparable bandwidth between TP-coordinated GPUs. For deployments using PCIe-only TP (no NVLink between cards), the engine applies a 60% efficiency penalty, but the actual penalty varies from 50% to 75% depending on workload. Test before deploying multi-node TP at scale.

B200 / GB200 deployments on vLLM 0.6 or earlier. Blackwell kernel optimizations are still maturing in late 2026. The engine over-predicts B200 throughput on older vLLM versions.

Workloads with very high prefix cache hit rates. vLLM's prefix caching effectively reduces the per-request input size, which compounds with batch in ways the roofline doesn't model. For agent workloads with 90%+ shared system-prompt prefix, actual throughput can be 1.5-2× the engine's prediction.

Workloads with very long contexts (>200K). The KV cache management overhead at 200K+ contexts becomes a meaningful fraction of memory and bandwidth budget; the engine under-predicts the headroom pressure at this scale.

How to use the calculator with these limits in mind

The right mental model for the calculator's output:

  1. Use it for structural decisions: how many GPUs, what precision, what parallelism plan, whether the workload fits on cloud A or cloud B. These are the decisions where the calculator is reliably right.
  2. Use it for ballpark cost estimates: within ±20% is good enough to compare "self-host X" vs "API provider Y" at order-of-magnitude scale.
  3. Don't use it for last-mile capacity planning at <10% accuracy. If the difference between "fits on 2× H100" and "needs 3× H100" matters for the budget, run a small load test.

The methodology post walks through the underlying math. The companion post on calculator gaps covers the structural pieces (paged attention, FP8 KV, MoE) where most public calculators fail outright. Both are useful context for understanding what level of accuracy to expect from any sizing tool.

What we're working toward

The engine's accuracy improves as we gather more validation data. The current ±15-25% band on decode throughput could probably be reduced to ±10% with empirical calibration of two parameters: per-GPU bandwidth utilization (currently a flat 80% assumption for the "expected" point, but real vLLM utilization varies by GPU generation and batch size) and TP communication efficiency (currently fixed by interconnect type, but actual efficiency varies by model architecture). Both are on the roadmap.
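As a sketch of what that calibration could look like (illustrative only, not the engine's planned implementation), one could back out an effective utilization per GPU generation from the (predicted, measured) pairs in this post:

```python
# For each GPU generation, estimate the bandwidth utilization that best explains
# the measured numbers, given predictions made at the flat 80% assumption.
pairs_h100 = [(175, 195), (7200, 8100), (1580, 1700)]   # (predicted, measured) decode tok/s

# In the bandwidth-bound regime, measured ≈ predicted * (true_util / 0.80),
# so averaging the measured/predicted ratios gives a simple calibrated estimate.
ratios = [measured / predicted for predicted, measured in pairs_h100]
calibrated_util = 0.80 * sum(ratios) / len(ratios)
print(f"{calibrated_util:.2f}")   # ≈ 0.88 for the H100 configurations above
```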

In the meantime, the gap between the engine's first-principles prediction and measured reality is small enough for most production decisions. The bigger gap is between any calculator-based prediction and the order-of-magnitude errors public calculators produce when they don't model 2026 serving stack realities: paged attention, FP8 KV, and MoE memory math. Closing those structural gaps is where the largest wins are.
