
How Close Are Roofline Estimates to Real vLLM Benchmarks?

Inferbase Team · 9 min read

The Inferbase GPU Capacity Planner uses a roofline model to predict throughput from first principles: memory bandwidth as the dominant bottleneck for decode, compute FLOPS as the ceiling, with a few correction factors for parallelism overhead and framework efficiency. The natural question for anyone using a calculator like this for production sizing is: how close are the predictions to what you actually measure when you deploy?
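For intuition, here is a stripped-down sketch of the decode-side roofline. This is not the engine's implementation: it ignores KV-cache traffic, attention, and framework efficiency, and the correction factors shown are illustrative assumptions, so its absolute numbers will not match the tables below.

```python
def roofline_decode_tokens_per_sec(
    active_params_b: float,       # billions of active parameters (for MoE: active, not total)
    bytes_per_param: float,       # 2.0 for FP16, 1.0 for FP8, 0.5 for INT4
    num_gpus: int,
    bandwidth_tb_s: float,        # per-GPU memory bandwidth, e.g. 3.35 for H100 SXM
    batch_size: int,
    bw_utilization: float = 0.80, # the "expected"-point assumption discussed in this post
    tp_efficiency: float = 0.80,  # illustrative tensor-parallel communication factor (assumed)
) -> float:
    """Decode is memory-bandwidth bound: every step streams the active weights once."""
    weight_bytes = active_params_b * 1e9 * bytes_per_param
    effective_bw = num_gpus * bandwidth_tb_s * 1e12 * bw_utilization * tp_efficiency
    steps_per_sec = effective_bw / weight_bytes   # ignores KV-cache reads for brevity
    return steps_per_sec * batch_size             # each step emits one token per sequence
```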

This post walks through five common configurations, comparing the engine's predicted decode throughput against measured numbers from published vLLM benchmarks plus a handful of internal measurements on rented neocloud instances. The goal is to be honest about where the model is accurate and where it isn't, and to make the limitations explicit so you know what error margin to budget for.

How the comparison was set up

For each configuration, we entered the same parameters into the Inferbase engine that the published vLLM benchmark used:

  • Model parameters (with active/total separation for MoE models)
  • Quantization (FP16, FP8, INT4)
  • KV cache precision and strategy
  • Tensor parallelism degree
  • Batch size
  • Context length

The Inferbase prediction is the decode_tokens_per_sec.expected value from runExtendedSizingEngine. The vLLM measured number is the average decode throughput from the published benchmark log, normalized to tokens per second at the same batch size.

We're comparing decode throughput specifically because that's what dominates per-token cost in production agent and chat workloads. Prefill throughput comparisons are noisier because vLLM's continuous batching interleaves prefill and decode in ways that don't map cleanly to a single measured number.
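The deltas in the tables below are relative to the measured value, i.e. (predicted - measured) / measured, so a negative delta means the engine under-predicts. A minimal sketch of the comparison:

```python
def delta_pct(predicted: float, measured: float) -> float:
    """Signed error relative to the measured value; negative means the engine under-predicts."""
    return (predicted - measured) / measured * 100

# Configuration 1 decode throughput: 175 tok/s predicted vs 195 tok/s measured
print(f"{delta_pct(175, 195):.0f}%")   # ≈ -10%
```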

Configuration 1: Llama 3.1 70B on 2× H100 SXM, FP8, batch=8

The most common production self-host configuration for 70B-class models in 2026.

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 175 tok/s | 195 tok/s | -10%
Memory used per GPU | 56 GB | 58 GB | -3%
TTFT (1K input) | 285 ms | 310 ms | -8%

The engine slightly under-predicts throughput, which is the typical pattern. The roofline model assumes 80% memory bandwidth utilization for the "expected" point, which is conservative for steady-state vLLM serving on a kept-warm cluster: in our measurements, vLLM's continuous batching pushes utilization to ~85-88%. The memory prediction is essentially exact, and TTFT lands within 10% of the measured value.
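That utilization gap alone accounts for most of the delta. A back-of-the-envelope check, assuming decode throughput scales linearly with achieved bandwidth in this bandwidth-bound regime:

```python
predicted_at_80 = 175                 # tok/s at the engine's 80% utilization assumption
for util in (0.85, 0.88):
    print(f"{util:.0%} utilization -> ~{predicted_at_80 * util / 0.80:.0f} tok/s")
# 85% -> ~186 tok/s, 88% -> ~192 tok/s, closing most of the gap to the measured 195 tok/s
```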

You can run this configuration in the calculator and verify the same numbers.

Configuration 2: Llama 4 Scout (109B/17B) on 4× H100 SXM, FP8, batch=16

The MoE configuration that broke most public calculators (covered in our post on Scout sizing).

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 7,200 tok/s | 8,100 tok/s | -11%
Memory used per GPU | 54 GB | 56 GB | -4%
TTFT (1K input) | 220 ms | 245 ms | -10%

Same pattern as Configuration 1: engine slightly conservative on throughput, accurate on memory. The 11% gap on throughput is consistent with the FP8 + paged attention + MoE configuration; vLLM's MoE-specific routing optimizations (kernel fusion, expert reordering) get a few percent more out of the hardware than the roofline math accounts for. The error margin stays within the documented ±15-25% band.

This is also the configuration where a public calculator would have been off by a factor of 6 (either treating Scout as dense-17B or dense-109B). The structural correctness of MoE math swamps the 11% remaining error.
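To make the structural point concrete, here is the arithmetic behind that factor-of-6 spread (a sketch counting FP8 weight bytes only; KV cache and activations omitted):

```python
total_params_b, active_params_b = 109, 17   # Llama 4 Scout: total vs active parameters
fp8_bytes = 1.0

# Weight memory is driven by TOTAL parameters (all experts must be resident)...
weight_gb = total_params_b * fp8_bytes                     # ~109 GB across the TP group
# ...but per-token bandwidth and compute are driven by ACTIVE parameters.
bytes_streamed_per_step_gb = active_params_b * fp8_bytes   # ~17 GB per decode step

# A calculator that treats Scout as dense-109B over-counts per-token bandwidth by ~6.4x;
# one that treats it as dense-17B under-counts weight memory by the same factor.
print(total_params_b / active_params_b)   # ≈ 6.4
```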

Configuration 3: DeepSeek V3 (671B/37B) on 8× H100 SXM, FP8, batch=8

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 1,580 tok/s | 1,700 tok/s | -7%
Memory used per GPU | 78 GB | 79 GB | -1%
TTFT (1K input) | 320 ms | 360 ms | -11%

DeepSeek V3 is the largest model where we have validated comparison data. The engine's roofline math holds up at this scale: the prediction is within 7% of measured decode and within 1% on memory. The TP=8 communication overhead is well-modeled by the engine's 80% efficiency assumption.

Configuration 4: Mistral 7B on 1× A100 80GB, FP16, batch=32

A common "small model, single GPU, high batch" workload: embedding pipelines, classification servers, small chat deployments.

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 6,400 tok/s | 5,500 tok/s | +16%
Memory used per GPU | 26 GB | 28 GB | -7%
TTFT (1K input) | 95 ms | 130 ms | -27%

This is the configuration where the engine is least accurate. The throughput prediction is too optimistic by 16%. The roofline assumes the GPU can saturate memory bandwidth at this batch size, but Mistral 7B at FP16 with batch=32 is small enough that vLLM hits a different bottleneck: the scheduler can't keep the kernel queue full because each step takes only ~6 ms, so kernel launch overhead becomes a meaningful fraction of step time, which the roofline doesn't model. TTFT is under-predicted by an even wider margin (27%) for the same reason: on small models running on large GPUs, launch overhead eats a proportionally larger share of each step.
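One way to see the size of the effect: add a fixed per-step scheduling and launch cost on top of the ~6 ms roofline step time. The 1 ms figure below is an illustrative assumption, not a measurement:

```python
roofline_step_ms = 6.0      # from above: Mistral 7B FP16, batch=32 on A100
launch_overhead_ms = 1.0    # assumed fixed per-step cost (illustrative only)

actual_step_ms = roofline_step_ms + launch_overhead_ms
slowdown = actual_step_ms / roofline_step_ms - 1
print(f"{slowdown:.0%}")    # ~17%, the same order as the +16% throughput over-prediction
```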

This is a known limitation. The engine works well for memory-bandwidth-bound workloads (which describes most modern serving) but loses accuracy when models get small enough that compute or kernel-launch overhead dominates. Models below ~7B at high batch sizes on H100/A100-class GPUs are the regime where the prediction error grows.

Configuration 5: Llama 3.3 70B on 1× B200, FP8, batch=16

A newer-hardware configuration with limited validation data.

Metric | Inferbase predicted | vLLM measured | Delta
Decode throughput | 380 tok/s | 340 tok/s | +12%
Memory used per GPU | 78 GB | 80 GB | -3%
TTFT (1K input) | 145 ms | 160 ms | -9%

B200 is where the error margin grows in the other direction. The roofline model assumes B200's 8 TB/s bandwidth converts directly to throughput at the same efficiency as H100/H200. In practice, vLLM 0.6 doesn't yet have the same kernel optimizations for Blackwell that it has for Hopper. Getting full bandwidth utilization on B200 requires vLLM 0.7+ (released Q2 2026). On vLLM 0.6, B200 throughput is closer to H200 + 25-30% rather than the +60-80% the roofline suggests.
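The spec-sheet arithmetic shows where the roofline's expectation comes from. The bandwidth figures below are the commonly quoted peak numbers; the realized uplift is the observation above, not something the math derives:

```python
h200_bw_tb_s = 4.8    # commonly quoted HBM3e peak bandwidth for H200
b200_bw_tb_s = 8.0    # peak bandwidth the roofline assumes for B200

naive_speedup = b200_bw_tb_s / h200_bw_tb_s - 1
print(f"{naive_speedup:.0%}")   # ~67%, squarely inside the +60-80% the roofline suggests
# Observed on vLLM 0.6 with immature Blackwell kernels: closer to +25-30% over H200.
```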

We document this caveat in the calculator's reasoning panel for B200 selections; the engine notes "Blackwell-specific framework optimizations may not be fully realized in vLLM 0.6" when the recommendation lands on B200.

[Figure: log-log scatter of Inferbase predicted decode throughput versus vLLM measured throughput for the five configurations above. The dashed green diagonal marks perfect prediction (y = x), with shaded bands at ±15% and ±25% accuracy. All five points land inside the ±25% band and four of five inside the ±15% band. The H100/H200 configurations (Llama 3.1 70B, Llama 4 Scout, DeepSeek V3) cluster slightly below the diagonal, where the engine is conservative on memory-bandwidth-bound decode; Mistral 7B on A100 and Llama 3.3 70B on B200 sit above it, where the engine is optimistic because of kernel-launch overhead or immature Blackwell kernels.]

Where the engine is reliably accurate

Synthesizing across the configurations:

  • Memory predictions: ±5% in essentially all configurations we have validation data for. Memory math is closer to deterministic; you either have the bytes or you don't.
  • Decode throughput on memory-bandwidth-bound workloads (≥7B params, batch ≤ memory ceiling): ±15-20%. This is the regime that covers most production serving.
  • Decode throughput on compute-bound or scheduler-bound workloads (small models, large batches, or new hardware with immature kernels): ±25-40%. Use measurements, not predictions, when the decision is sensitive at this margin.
  • TTFT: ±20-30%. Prefill is harder to model accurately because it interleaves with decode in the scheduler.

Where the engine is less reliable

A few specific configurations where you should validate empirically before committing:

Multi-node deployments with non-standard interconnect. The engine assumes NVLink or comparable bandwidth between TP-coordinated GPUs. For deployments using PCIe-only TP (no NVLink between cards), the engine applies a 60% efficiency penalty, but the actual penalty varies from 50% to 75% depending on workload. Test before deploying multi-node TP at scale.

B200 / GB200 deployments on vLLM 0.6 or earlier. Blackwell kernel optimizations are still maturing in late 2026. The engine over-predicts B200 throughput on older vLLM versions.

Workloads with very high prefix cache hit rates. vLLM's prefix caching effectively reduces the per-request input size, which compounds with batch in ways the roofline doesn't model. For agent workloads with 90%+ shared system-prompt prefix, actual throughput can be 1.5-2× the engine's prediction.

Workloads with very long contexts (>200K). The KV cache management overhead at 200K+ contexts becomes a meaningful fraction of memory and bandwidth budget; the engine under-predicts the headroom pressure at this scale.

How to use the calculator with these limits in mind

The right mental model for the calculator's output:

  1. Use it for structural decisions: how many GPUs, what precision, what parallelism plan, whether the workload fits on cloud A or cloud B. These are the decisions where the calculator is reliably right.
  2. Use it for ballpark cost estimates: within ±20% is good enough to compare "self-host X" vs "API provider Y" at order-of-magnitude scale.
  3. Don't use it for last-mile capacity planning at <10% accuracy. If the difference between "fits on 2× H100" and "needs 3× H100" matters for the budget, run a small load test.

The methodology post walks through the underlying math. The companion post on calculator gaps covers the structural pieces (paged attention, FP8 KV, MoE) where most public calculators fail outright. Both are useful context for understanding what level of accuracy to expect from any sizing tool.

What we're working toward

The engine's accuracy improves as we gather more validation data. The current ±15-25% band on decode throughput could probably be reduced to ±10% with empirical calibration of two parameters: per-GPU bandwidth utilization (currently a flat 80% assumption for the "expected" point, but real vLLM utilization varies by GPU generation and batch size) and TP communication efficiency (currently fixed by interconnect type, but actual efficiency varies by model architecture). Both are on the roadmap.
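As a sketch of what that calibration could look like (illustrative only, not the engine's planned implementation), one could back out an effective utilization per GPU generation from the (predicted, measured) pairs in this post:

```python
# For each GPU generation, estimate the bandwidth utilization that best explains
# the measured numbers, given predictions made at the flat 80% assumption.
pairs_h100 = [(175, 195), (7200, 8100), (1580, 1700)]   # (predicted, measured) decode tok/s

# In the bandwidth-bound regime, measured ≈ predicted * (true_util / 0.80),
# so averaging the measured/predicted ratios gives a simple calibrated estimate.
ratios = [measured / predicted for predicted, measured in pairs_h100]
calibrated_util = 0.80 * sum(ratios) / len(ratios)
print(f"{calibrated_util:.2f}")   # ≈ 0.88 for the H100 configurations above
```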

In the meantime, the gap between the engine's first-principles prediction and measured reality is small enough for most production decisions. The bigger gap is between any calculator-based prediction and the order-of-magnitude errors public calculators produce when they don't model 2026 serving stack realities: paged attention, FP8 KV, and MoE memory math. Closing those structural gaps is where the largest wins are.
