The Inferbase GPU Capacity Planner uses a roofline model to predict throughput from first principles: memory bandwidth as the dominant bottleneck for decode, compute FLOPS as the ceiling, with a few correction factors for parallelism overhead and framework efficiency. The natural question for anyone using a calculator like this for production sizing is: how close are the predictions to what you actually measure when you deploy?
This post walks through five common configurations, comparing the engine's predicted decode throughput against measured numbers from published vLLM benchmarks plus a handful of internal measurements on rented neocloud instances. The goal is to be honest about where the model is accurate and where it isn't, and to make the limitations explicit so you know what error margin to budget for.
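For readers who want the decode-side roofline math made concrete, here is a minimal sketch of the kind of calculation the comparison relies on. The function, its parameter names, and the 80% defaults are illustrative simplifications for this post, not the engine's actual implementation:

```python
def decode_tokens_per_sec(
    weight_bytes: float,          # weight bytes streamed per decode step (post-quantization)
    kv_bytes_per_step: float,     # KV cache bytes read per decode step
    gpu_bandwidth_bytes: float,   # per-GPU HBM bandwidth in bytes/s
    num_gpus: int,
    batch_size: int,
    bandwidth_utilization: float = 0.80,  # the "expected" operating point
    tp_efficiency: float = 0.80,          # tensor-parallel communication overhead
) -> float:
    """Memory-bandwidth roofline for decode: each step streams the weights
    (plus the active KV cache) through HBM once, and every sequence in the
    batch produces one token per step."""
    effective_bw = gpu_bandwidth_bytes * num_gpus * bandwidth_utilization * tp_efficiency
    step_seconds = (weight_bytes + kv_bytes_per_step) / effective_bw
    return batch_size / step_seconds
```

This sketch omits the engine's additional correction factors, so it will not reproduce the tables below exactly; it only shows the shape of the bandwidth-bound math being validated.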
How the comparison was set up
For each configuration, we entered the same parameters into both the Inferbase engine and the published vLLM benchmark spec:
- Model parameters (with active/total separation for MoE models)
- Quantization (FP16, FP8, INT4)
- KV cache precision and strategy
- Tensor parallelism degree
- Batch size
- Context length
The Inferbase prediction is the decode_tokens_per_sec.expected value from runExtendedSizingEngine. The vLLM measured number is the average decode throughput from the published benchmark log, normalized to per-second tokens at the same batch size.
We're comparing decode throughput specifically because that's what dominates per-token cost in production agent and chat workloads. Prefill throughput comparisons are noisier because vLLM's continuous batching interleaves prefill and decode in ways that don't map cleanly to a single measured number.
Configuration 1: Llama 3.1 70B on 2× H100 SXM, FP8, batch=8
The most common production self-host configuration for 70B-class models in 2026.
| Metric | Inferbase predicted | vLLM measured | Delta |
|---|---|---|---|
| Decode throughput | 175 tok/s | 195 tok/s | -10% |
| Memory used per GPU | 56 GB | 58 GB | -3% |
| TTFT (1K input) | 285 ms | 310 ms | -8% |
The engine slightly under-predicts throughput, which is the typical pattern. The roofline model assumes 80% memory bandwidth utilization for the "expected" point, which is conservative for steady-state vLLM serving on a kept-warm cluster; in our measurements, vLLM's continuous batching pushes utilization to ~85-88%. The memory prediction lands within 3%, and TTFT within 10%.
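The utilization difference alone accounts for most of that throughput gap. A quick back-of-envelope check (plain arithmetic, not engine output):

```python
# In a bandwidth-bound regime, predicted throughput scales linearly with
# assumed bandwidth utilization. The engine's "expected" point uses 80%;
# steady-state vLLM measured closer to 86% here.
engine_utilization = 0.80
measured_utilization = 0.86

scaling = measured_utilization / engine_utilization  # ~1.075

# 175 tok/s predicted at 80% utilization implies ~188 tok/s at 86%,
# most of the way from the prediction to the measured 195 tok/s.
rescaled = 175 * scaling
print(f"{rescaled:.0f} tok/s")
```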
You can run this configuration in the calculator and verify the same numbers.
Configuration 2: Llama 4 Scout (109B/17B) on 4× H100 SXM, FP8, batch=16
The MoE configuration that broke most public calculators (covered in our post on Scout sizing).
| Metric | Inferbase predicted | vLLM measured | Delta |
|---|---|---|---|
| Decode throughput | 7,200 tok/s | 8,100 tok/s | -11% |
| Memory used per GPU | 54 GB | 56 GB | -4% |
| TTFT (1K input) | 220 ms | 245 ms | -10% |
Same pattern as Configuration 1: engine slightly conservative on throughput, accurate on memory. The 11% gap on throughput is consistent with the FP8 + paged attention + MoE configuration; vLLM's MoE-specific routing optimizations (kernel fusion, expert reordering) get a few percent more out of the hardware than the roofline math accounts for. The error margin stays within the documented ±15-25% band.
This is also the configuration where a public calculator would have been off by a factor of 6 (either treating Scout as dense-17B or dense-109B). The structural correctness of MoE math swamps the 11% remaining error.
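The structural point is that decode bandwidth cost follows active parameters while memory footprint follows total parameters. A hedged sketch of that distinction (illustrative helper names, not the engine's internals):

```python
def moe_decode_bytes_per_token(active_params: float, bytes_per_param: float) -> float:
    # Per decode step, only the routed experts' weights (plus shared layers)
    # stream through HBM. Simplification: "active params" covers both.
    return active_params * bytes_per_param

def moe_weight_memory(total_params: float, bytes_per_param: float) -> float:
    # All experts must stay resident in HBM, so footprint follows total params.
    return total_params * bytes_per_param

# Llama 4 Scout at FP8 (1 byte/param): 109B total, 17B active.
bandwidth_cost = moe_decode_bytes_per_token(17e9, 1.0)   # ~17 GB streamed per step
memory_cost = moe_weight_memory(109e9, 1.0)              # ~109 GB resident

# A calculator that treats Scout as dense-109B over-prices every decode step
# by ~6.4x; one that treats it as dense-17B under-sizes memory by the same factor.
ratio = memory_cost / bandwidth_cost
```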
Configuration 3: DeepSeek V3 (671B/37B) on 8× H100 SXM, FP8, batch=8
| Metric | Inferbase predicted | vLLM measured | Delta |
|---|---|---|---|
| Decode throughput | 1,580 tok/s | 1,700 tok/s | -7% |
| Memory used per GPU | 78 GB | 79 GB | -1% |
| TTFT (1K input) | 320 ms | 360 ms | -11% |
DeepSeek V3 is the largest model where we have validated comparison data. The engine's roofline math holds up at this scale: the prediction is within 7% of measured decode and within 1% on memory. The TP=8 communication overhead is well-modeled by the engine's 80% efficiency assumption.
Configuration 4: Mistral 7B on 1× A100 80GB, FP16, batch=32
A common "small model, single GPU, high batch" workload: embedding pipelines, classification servers, small chat deployments.
| Metric | Inferbase predicted | vLLM measured | Delta |
|---|---|---|---|
| Decode throughput | 6,400 tok/s | 5,500 tok/s | +16% |
| Memory used per GPU | 26 GB | 28 GB | -7% |
| TTFT (1K input) | 95 ms | 130 ms | -27% |
This is the configuration where the engine is least accurate. Throughput prediction is too optimistic by 16%. The roofline assumes the GPU can saturate memory bandwidth at this batch size, but Mistral 7B at FP16 batch=32 is small enough that vLLM hits a different bottleneck: the scheduler can't keep the kernel queue full because each step takes only ~6 ms. Kernel launch overhead becomes a meaningful fraction of step time, which the roofline doesn't model. TTFT is under-predicted even more (27%) for the same reason: small models on big GPUs spend a proportionally larger share of each step in launch overhead.
This is a known limitation. The engine works well for memory-bandwidth-bound workloads (which describes most modern serving) but loses accuracy when models get small enough that compute or kernel-launch overhead dominates. Models below ~7B at high batch sizes on H100/A100-class GPUs are the regime where the prediction error grows.
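One way to see why the roofline breaks down in this regime is to add a fixed per-step overhead term and watch its share of step time grow as the model shrinks. The 1.5 ms overhead below is an illustrative assumption, not a measured constant:

```python
def step_time_with_overhead(weight_bytes: float, effective_bw: float,
                            fixed_overhead_s: float = 0.0015):
    """Roofline step time plus a fixed per-step cost (scheduler work and
    kernel launches) that the pure roofline ignores. 1.5 ms is an assumed
    value for illustration. Returns (step_time, overhead_share)."""
    roofline = weight_bytes / effective_bw
    total = roofline + fixed_overhead_s
    return total, fixed_overhead_s / total

bw = 2.0e12 * 0.8  # A100 80GB: ~2 TB/s HBM at 80% utilization

# Mistral 7B at FP16: ~14 GB of weights streamed per step
_, small_share = step_time_with_overhead(14e9, bw)

# A 70B-class model at FP16 on the same GPU class: ~140 GB per step
_, large_share = step_time_with_overhead(140e9, bw)

# The fixed cost eats a much larger fraction of the small model's step,
# which is exactly where the roofline prediction turns optimistic.
print(f"7B overhead share: {small_share:.0%}, 70B share: {large_share:.0%}")
```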
Configuration 5: Llama 3.3 70B on 1× B200, FP8, batch=16
A newer-hardware configuration with limited validation data.
| Metric | Inferbase predicted | vLLM measured | Delta |
|---|---|---|---|
| Decode throughput | 380 tok/s | 340 tok/s | +12% |
| Memory used per GPU | 78 GB | 80 GB | -3% |
| TTFT (1K input) | 145 ms | 160 ms | -9% |
B200 is where the error margin grows in the other direction. The roofline model assumes B200's 8 TB/s bandwidth converts directly to throughput at the same efficiency as H100/H200. In practice, vLLM 0.6 doesn't yet have the same kernel optimizations for Blackwell that it has for Hopper. Getting full bandwidth utilization on B200 requires vLLM 0.7+ (released Q2 2026). On vLLM 0.6, B200 throughput is closer to H200 + 25-30% rather than the +60-80% the roofline suggests.
We document this caveat in the calculator's reasoning panel for B200 selections; the engine notes "Blackwell-specific framework optimizations may not be fully realized in vLLM 0.6" when the recommendation lands on B200.
Where the engine is reliably accurate
Synthesizing across the configurations:
- Memory predictions: ±5% in essentially all configurations we have validation data for. Memory math is closer to deterministic; you either have the bytes or you don't.
- Decode throughput on memory-bandwidth-bound workloads (≥7B params, batch sizes that keep the workload bandwidth-bound): ±15-20%. This is the regime that covers most production serving.
- Decode throughput on compute-bound or scheduler-bound workloads (small models, large batches, or new hardware with immature kernels): ±25-40%. Use measurements, not predictions, when the decision is sensitive at this margin.
- TTFT: ±20-30%. Prefill is harder to model accurately because it interleaves with decode in the scheduler.
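Translated into a planning rule of thumb, the bands above could be encoded like this. The thresholds mirror the list, but the helper itself is an illustrative sketch, not part of the engine:

```python
def decode_error_band(params_billion: float, mature_kernels: bool = True) -> float:
    """Rough half-width of the decode-throughput error band, as a fraction.
    Treat as guidance for budgeting margin, not a specification."""
    if not mature_kernels:        # new hardware with immature framework kernels
        return 0.40
    if params_billion < 7:        # small-model, scheduler/launch-overhead regime
        return 0.40
    return 0.20                   # memory-bandwidth-bound regime

band = decode_error_band(70)      # 70B-class on mature kernels: +/-20%
```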
Where the engine is less reliable
A few specific configurations where you should validate empirically before committing:
Multi-node deployments with non-standard interconnect. The engine assumes NVLink or comparable bandwidth between TP-coordinated GPUs. For deployments using PCIe-only TP (no NVLink between cards), the engine applies a 60% efficiency penalty, but the actual penalty varies from 50% to 75% depending on workload. Test before deploying multi-node TP at scale.
B200 / GB200 deployments on vLLM 0.6 or earlier. Blackwell kernel optimizations are still maturing in late 2026. The engine over-predicts B200 throughput on older vLLM versions.
Workloads with very high prefix cache hit rates. vLLM's prefix caching effectively reduces the per-request input size, which compounds with batch in ways the roofline doesn't model. For agent workloads with 90%+ shared system-prompt prefix, actual throughput can be 1.5-2× the engine's prediction.
Workloads with very long contexts (>200K). The KV cache management overhead at 200K+ contexts becomes a meaningful fraction of memory and bandwidth budget; the engine under-predicts the headroom pressure at this scale.
How to use the calculator with these limits in mind
The right mental model for the calculator's output:
- Use it for structural decisions: how many GPUs, what precision, what parallelism plan, whether the workload fits on cloud A or cloud B. These are the decisions where the calculator is reliably right.
- Use it for ballpark cost estimates: within ±20% is good enough to compare "self-host X" vs "API provider Y" at order-of-magnitude scale.
- Don't use it for last-mile capacity planning at <10% accuracy. If the difference between "fits on 2× H100" and "needs 3× H100" matters for the budget, run a small load test.
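In practice that means applying the error band before reading off a fit decision. A sketch with the ±20% band from above and hypothetical numbers:

```python
def fits_with_margin(predicted_tok_s: float, required_tok_s: float,
                     error_band: float = 0.20) -> bool:
    """Only trust a 'fits' verdict if the pessimistic edge of the error
    band still clears the requirement."""
    return predicted_tok_s * (1 - error_band) >= required_tok_s

# Hypothetical: the engine predicts 175 tok/s on 2x H100 and the workload
# needs 160. Nominally it fits, but the pessimistic edge (140 tok/s) does
# not -- this is exactly the case that warrants a small load test.
verdict = fits_with_margin(175, 160)
```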
The methodology post walks through the underlying math. The companion post on calculator gaps covers the structural pieces (paged attention, FP8 KV, MoE) where most public calculators fail outright. Both are useful context for understanding what level of accuracy to expect from any sizing tool.
What we're working toward
The engine's accuracy improves as we gather more validation data. The current ±15-25% band on decode throughput could probably be reduced to ±10% with empirical calibration of two parameters: per-GPU bandwidth utilization (currently a flat 80% assumption for the "expected" point, but real vLLM utilization varies by GPU generation and batch size) and TP communication efficiency (currently fixed by interconnect type, but actual efficiency varies by model architecture). Both are on the roadmap.
In the meantime, the gap between the engine's first-principles prediction and measured reality is small enough for most production decisions. The bigger gap is between any calculator-based prediction and the order-of-magnitude errors public calculators produce when they don't model 2026 serving stack realities: paged attention, FP8 KV, and MoE memory math. Closing those structural gaps is where the largest wins are.