
How Our GPU Capacity Planning Calculator Works

Inferbase Team · March 31, 2026 · 10 min read

Our GPU Capacity Planning tool estimates VRAM requirements, throughput, latency, and batch capacity for self-hosted AI model deployments. This document explains exactly how every number is calculated, what assumptions we make, and where estimates may differ from your real-world results.

1. Model Weight Memory

Formula:

Model Weights (GB) = Parameters × Bytes per Parameter / 1,073,741,824

This calculation is exact; no estimation is involved. Each parameter is stored at the selected precision:

| Precision | Bytes per Parameter | Example: 70B Model |
| --- | --- | --- |
| FP32 | 4 | 260.8 GB |
| FP16 / BF16 | 2 | 130.4 GB |
| FP8 | 1 | 65.2 GB |
| INT8 | 1 | 65.2 GB |
| INT4 | 0.5 | 32.6 GB |

For fine-tuning use cases, additional memory is added for optimizer states:

  • LoRA/QLoRA: +0.5 bytes per parameter (adapter weights + optimizer for ~5-10% of parameters)
  • Full fine-tuning: +8 bytes per parameter (Adam optimizer: 4 bytes for momentum + 4 bytes for variance, both in FP32)
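The formula above can be written as a few lines of Python. This is a sketch: the `weight_memory_gb` helper and its `finetune` argument are our names, not the tool's API; the per-parameter overheads follow the LoRA and full fine-tuning bullets above.

```python
def weight_memory_gb(params, bytes_per_param, finetune=None):
    """Model weight memory in GiB, plus per-parameter optimizer overhead for fine-tuning."""
    extra = {"lora": 0.5, "full": 8.0}.get(finetune, 0.0)  # extra bytes per parameter
    return params * (bytes_per_param + extra) / 2**30

# 70B model at FP16: 70e9 params x 2 bytes ≈ 130.4 GB
print(round(weight_memory_gb(70e9, 2), 1))
# Full fine-tuning adds Adam states: 70e9 x (2 + 8) bytes ≈ 651.9 GB
print(round(weight_memory_gb(70e9, 2, finetune="full"), 1))
```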

2. Architecture Estimation

When you select a model from our catalog, we know its parameter count. To calculate KV cache, we also need the model's internal architecture (layers, attention heads, head dimensions). We estimate these by interpolating between known architectures:

| Parameters | Layers | Heads | KV Heads (GQA) | Head Dim | Hidden Dim |
| --- | --- | --- | --- | --- | --- |
| 0.5B | 24 | 16 | 16 | 64 | 1,024 |
| 3B | 28 | 24 | 8 | 128 | 3,072 |
| 7B | 32 | 32 | 8 | 128 | 4,096 |
| 13B | 40 | 40 | 40 | 128 | 5,120 |
| 70B | 80 | 64 | 8 | 128 | 8,192 |
| 405B | 126 | 128 | 8 | 128 | 16,384 |

These anchors are based on published Llama, Qwen, and Mistral architectures. For models between anchors, we interpolate linearly. Most anchors above 3B use Grouped Query Attention (GQA) with 8 KV heads, which significantly reduces KV cache size compared to Multi-Head Attention; the 13B anchor (based on Llama 2 13B) predates GQA and keeps 40 KV heads.

Limitation: Non-standard architectures (Mixture of Experts models like Mixtral, or models with unusual head configurations) may have different actual values. The estimation is designed for dense transformer models.
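A rough sketch of the interpolation, using the anchor rows above. The exact scheme in the tool may differ; here we interpolate the layer count linearly and snap the discrete fields (heads, KV heads, head dim) to the nearer anchor.

```python
import bisect

# (params_in_B, layers, heads, kv_heads, head_dim) -- anchor rows from the table above
ANCHORS = [
    (0.5, 24, 16, 16, 64),
    (3, 28, 24, 8, 128),
    (7, 32, 32, 8, 128),
    (13, 40, 40, 40, 128),
    (70, 80, 64, 8, 128),
    (405, 126, 128, 8, 128),
]

def estimate_arch(params_b):
    """Estimate (layers, heads, kv_heads, head_dim) for a dense transformer."""
    sizes = [a[0] for a in ANCHORS]
    if params_b <= sizes[0]:
        return ANCHORS[0][1:]
    if params_b >= sizes[-1]:
        return ANCHORS[-1][1:]
    i = bisect.bisect_left(sizes, params_b)
    lo, hi = ANCHORS[i - 1], ANCHORS[i]
    t = (params_b - lo[0]) / (hi[0] - lo[0])
    layers = round(lo[1] + t * (hi[1] - lo[1]))  # linear interpolation
    nearer = hi if t >= 0.5 else lo              # snap discrete fields
    return (layers, *nearer[2:])

print(estimate_arch(7))   # matches the 7B anchor exactly
```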

3. KV Cache Memory

During inference, the model stores Key and Value tensors for every token in the context. This is the KV cache, and it determines how many concurrent requests your GPU can handle.

Formula:

KV Cache per Token (bytes) = 2 × Layers × KV Heads × Head Dim × Bytes per Element

The "2" accounts for both Key and Value tensors.

KV cache precision is independent of model quantization. By default, KV cache is stored in FP16 (2 bytes per element) even when the model weights are quantized to INT4. You can optionally select FP8 or INT8 KV cache precision to halve the KV cache size, if your framework supports it.

Per-request KV cache:

KV Cache per Request (GB) = KV Bytes per Token × Context Length / 1,073,741,824

Example: Llama 70B with 4,096 context length:

  • KV bytes per token = 2 × 80 layers × 8 KV heads × 128 head dim × 2 bytes = 327,680 bytes
  • Per request = 327,680 × 4,096 / 1,073,741,824 = 1.25 GB

This means each concurrent user adds ~1.25 GB of VRAM usage. At 128K context, this jumps to ~40 GB per request.

KV Cache Strategy

The KV cache strategy applies a memory efficiency multiplier:

| Strategy | Efficiency | Effect |
| --- | --- | --- |
| Standard | 1.0× | Pre-allocated contiguous blocks |
| PagedAttention | 0.85× | Virtual memory paging, ~15% savings |
| Prefix Caching | 0.70× | Shared prefixes across requests, ~30% savings |
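Putting the per-token formula, context length, KV precision, and strategy multiplier together (a sketch with names of our choosing; the worked Llama 70B example above falls out directly):

```python
def kv_cache_per_request_gb(layers, kv_heads, head_dim, context_len,
                            bytes_per_elem=2, strategy_mult=1.0):
    """KV cache per request in GiB; the leading 2 covers Key and Value tensors."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len * strategy_mult / 2**30

# Llama 70B, 4,096-token context, FP16 KV cache, Standard strategy
print(kv_cache_per_request_gb(80, 8, 128, 4096))  # 1.25
# Same request with FP8 KV cache (1 byte) and PagedAttention (0.85x)
print(kv_cache_per_request_gb(80, 8, 128, 4096, bytes_per_elem=1, strategy_mult=0.85))
```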

4. Activation Memory

Activation memory is the temporary VRAM used during the forward pass for intermediate computations (attention scores, FFN intermediates). For inference these allocations are transient: made during each decode step and freed immediately.

Formula:

Activation Memory (GB) = Model Weights (GB) × Activation Fraction

The activation fraction depends on the use case:

| Use Case | Fraction | Reasoning |
| --- | --- | --- |
| Inference | 3% | Transient per decode step, single token at a time |
| Batch Processing | 4% | Slightly higher due to larger effective batch |
| LoRA/QLoRA | 15% | Gradients for adapter parameters held during backward pass |
| Full Fine-tuning | 40% | Full gradients + backward pass activations for all layers |
| Embeddings | 2% | Forward pass only, no autoregressive generation |

Why a percentage and not an exact calculation? Exact activation memory depends on batch size, sequence length, attention implementation, and framework internals. Modern frameworks (vLLM, TGI) manage activations dynamically through CUDA's caching allocator. The percentage is a conservative upper bound for the steady-state overhead.

5. Framework Overhead

Each serving framework has a baseline memory footprint for CUDA contexts, runtime state, tokenizer caches, and internal data structures.

Formula:

Overhead (GB) = (Weights + KV Cache + Activations) × (Overhead Factor - 1)

We use the larger of the quantization and framework overhead factors:

| Framework | Overhead Factor | Notes |
| --- | --- | --- |
| vLLM | 1.02 | Well-optimized, minimal runtime overhead |
| TGI | 1.03 | Hugging Face runtime |
| TensorRT-LLM | 1.02 | NVIDIA optimized |
| SGLang | 1.02 | Lightweight runtime |
| Ollama | 1.08 | Higher overhead for simplicity features |

Why multiplicative instead of fixed? Real overhead has both fixed (CUDA context ~0.5 GB) and variable (memory fragmentation ~2-5%) components. At the model sizes people care about (7B+), the multiplicative approach approximates both components. For very small models (<1B), it slightly underestimates the true overhead.

6. Total VRAM and Usable Fraction

Total VRAM Needed = Weights + KV Cache (batched) + Activations + Overhead

Not all GPU VRAM is available for model serving. The OS, CUDA driver, and display manager reserve memory.

Usable VRAM = GPU VRAM × 0.88 (88%)

This is a conservative margin. vLLM's default gpu_memory_utilization is 0.95, but we use 0.88 to account for:

  • CUDA context and driver overhead (~2-3%)
  • Memory fragmentation (~3-5%)
  • Headroom for allocation spikes (~2-4%)
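Sections 4 through 6 combine into a short sketch (activation fractions and overhead factors come from the tables above; function names and the worked numbers are illustrative):

```python
def total_vram_gb(weights_gb, kv_batched_gb, activation_frac, overhead_factor):
    """Total VRAM: weights + batched KV cache + activations + framework overhead."""
    activations = weights_gb * activation_frac
    subtotal = weights_gb + kv_batched_gb + activations
    return subtotal * overhead_factor  # overhead = subtotal x (factor - 1)

def usable_vram_gb(gpu_vram_gb, usable_fraction=0.88):
    """VRAM left after OS, driver, and display-manager reservations."""
    return gpu_vram_gb * usable_fraction

# Llama 70B FP16 on vLLM: 130.4 GB weights, 8 requests x 1.25 GB KV,
# 3% inference activations, 1.02x framework overhead
print(round(total_vram_gb(130.4, 8 * 1.25, 0.03, 1.02), 1))  # 147.2
print(round(usable_vram_gb(80), 1))  # 70.4 of an 80 GB GPU
```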

7. Tensor Parallelism (TP) Determination

When a model doesn't fit on a single GPU, we split it across multiple GPUs using tensor parallelism.

Logic: Find the minimum TP degree (1, 2, 4, 8, 16) where the per-GPU memory fits:

Per-GPU Fixed Memory = (Model Weights + Activations) / TP Degree
Per-GPU KV Headroom  = Usable VRAM / Overhead - Per-GPU Fixed Memory
Max Batch Size       = KV Headroom / (KV per Request / TP Degree)

The overhead is factored into the VRAM budget upfront to prevent situations where the model fits but the batched KV cache plus overhead pushes it over.

TP with NVLink vs PCIe: When TP requires inter-GPU communication but the GPU lacks high-bandwidth interconnect (NVLink, Infinity Fabric), we apply a 60% bandwidth penalty. This reflects the real-world performance impact of TP over PCIe, where each layer requires an all-reduce synchronization that becomes a bottleneck on slow interconnects.

8. Batch Capacity

Formula:

KV Headroom (GB)   = Usable VRAM per GPU - Fixed Memory per GPU
Max Batch Size     = floor(KV Headroom / KV Cache per Request per GPU)
Effective Batch    = min(Max Batch, Concurrency)

This tells you how many concurrent requests the GPU can handle simultaneously. Each additional request in the batch requires its own KV cache allocation.
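The TP search and batch-capacity formulas from sections 7 and 8 can be sketched together. This is a simplified version under our own naming: `usable_gb` is the post-88% per-GPU figure from section 6, and the candidate TP degrees follow the list above.

```python
import math

def plan_tp(weights_gb, activations_gb, kv_per_request_gb,
            usable_gb, overhead_factor, concurrency):
    """Find the minimum TP degree whose per-GPU budget fits, then batch capacity."""
    for tp in (1, 2, 4, 8, 16):
        fixed = (weights_gb + activations_gb) / tp   # per-GPU weights + activations
        budget = usable_gb / overhead_factor         # reserve overhead up front
        kv_headroom = budget - fixed
        if kv_headroom <= 0:
            continue  # model does not fit at this TP degree
        max_batch = math.floor(kv_headroom / (kv_per_request_gb / tp))
        if max_batch >= 1:
            return tp, max_batch, min(max_batch, concurrency)
    return None  # does not fit even at TP=16

# Llama 70B FP16 (130.4 GB weights, ~3.9 GB activations, 1.25 GB KV/request)
# on 80 GB GPUs with 70.4 GB usable each, vLLM overhead, 32 requested concurrency
print(plan_tp(130.4, 3.9, 1.25, 70.4, 1.02, 32))
```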

9. Throughput: Roofline Model

We use the roofline model to estimate throughput. This is a physics-based calculation, not a benchmark extrapolation.

Decode Throughput (Autoregressive Generation)

Each decode step reads all model weights from GPU memory and produces one token per batch member.

Two regimes:

Memory-bound throughput = Batch Size × Bandwidth / Model Size per GPU
Compute-bound ceiling   = FLOPS / (2 × Parameters per GPU)
Actual throughput       = min(Memory-bound, Compute-bound)

  • Memory-bound (most common): Throughput scales linearly with batch size. The GPU spends most of its time reading weights, and batching amortizes this cost.
  • Compute-bound (very large batches): Throughput hits a ceiling determined by the GPU's floating-point operations per second.

The crossover batch size is FLOPS / Bandwidth. For A100: ~153. For H100: ~591. Most production workloads stay in the memory-bound regime.

Bandwidth utilization: We apply a practical utilization factor to theoretical peak bandwidth:

  • Pessimistic: 70% (worst case, fragmented access patterns)
  • Expected: 80% (typical for LLM inference)
  • Optimistic: 90% (best case, optimal memory access)

This gives the min/expected/max range in our estimates.

Example: Llama 8B FP16 on A100 80GB (single GPU, batch=1):

  • Bandwidth = 2,039 GB/s × 80% = 1,631 GB/s
  • Model size = 14.9 GB
  • Throughput = 1 × 1,631 / 14.9 = 109 tokens/sec
  • Real-world vLLM benchmark: 80-120 tokens/sec ✓
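As a sketch, the decode roofline from the example above in code (the 312 TFLOPS figure is A100's dense FP16 tensor throughput, our assumption; the function name is ours):

```python
def decode_throughput_tps(batch, bandwidth_gbs, model_size_gb,
                          flops_tflops, params_per_gpu, util=0.80):
    """Roofline decode throughput (tokens/sec): min of the two regimes."""
    memory_bound = batch * bandwidth_gbs * util / model_size_gb
    compute_bound = flops_tflops * 1e12 / (2 * params_per_gpu)
    return min(memory_bound, compute_bound)

# Llama 8B FP16 on A100 80GB (2,039 GB/s, 312 TFLOPS dense FP16), batch=1
print(round(decode_throughput_tps(1, 2039, 14.9, 312, 8e9)))  # 109
```

At batch=1000 the same call returns the 19,500 tokens/sec compute ceiling, illustrating the crossover.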

Prefill Throughput (Prompt Processing)

Prefill is compute-bound. All input tokens are processed in parallel:

Prefill throughput = FLOPS × TP / (2 × Total Parameters)

This represents how fast the GPU can process your input prompt before generation begins. With tensor parallelism, each GPU handles 1/TP of the computation, so total prefill throughput scales with TP.
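The prefill formula as code (same assumed A100 FLOPS figure as in the decode sketch):

```python
def prefill_throughput_tps(flops_tflops, tp, total_params):
    """Compute-bound prefill throughput (tokens/sec); scales with TP degree."""
    return flops_tflops * 1e12 * tp / (2 * total_params)

# A100 (312 TFLOPS dense FP16), single GPU, Llama 8B
print(prefill_throughput_tps(312, 1, 8e9))  # 19500.0
```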

10. Latency Estimation

Latency is derived from throughput, not from arbitrary multipliers.

Time to First Token (TTFT)

TTFT = Input Tokens / Prefill Throughput

This is how long the user waits before seeing the first generated token. Longer prompts = longer TTFT.

Time Between Tokens (TBT)

TBT = Concurrency / Decode Throughput

In continuous batching, the GPU's decode bandwidth is shared across all concurrent requests. Each user gets one token per decode step, but the step serves all users simultaneously.

Total Latency

Total = TTFT + Output Tokens × TBT

Percentiles

  • P50: Uses expected throughput (80% bandwidth utilization)
  • P95: Uses pessimistic throughput (70% bandwidth utilization)
  • P99: P95 × 1.3 (small additional variance for tail latency from scheduling, memory allocation, and GC pauses)

These are not true statistical percentiles. They represent the performance range based on bandwidth utilization variance. Real percentiles depend on request arrival patterns, sequence length distribution, and framework scheduling behavior.
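The latency formulas chained together (a sketch; the throughput inputs below are illustrative numbers, and in the tool they come from the roofline estimates):

```python
def latency_s(input_tokens, output_tokens, concurrency,
              prefill_tps, decode_tps):
    """Return (TTFT, TBT, total latency) in seconds from throughput estimates."""
    ttft = input_tokens / prefill_tps                 # prompt processing
    tbt = concurrency / decode_tps                    # decode bandwidth shared by all users
    return ttft, tbt, ttft + output_tokens * tbt

# 512-token prompt, 256 generated tokens, 8 concurrent users
p50 = latency_s(512, 256, 8, 19500, 876)   # expected throughput (80% util)
p95 = latency_s(512, 256, 8, 17062, 766)   # pessimistic throughput (70% util)
p99_total = p95[2] * 1.3                   # tail-latency multiplier from above
print(round(p50[2], 2), round(p99_total, 2))
```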

11. Scoring and Ranking

GPUs are ranked using multiple criteria:

Default (Best Fit):

  1. Meets latency target: GPUs that pass rank above those that fail
  2. Fewer GPUs: single GPU setups rank above multi-GPU (simpler to operate)
  3. Higher VRAM utilization: 85% utilization is a better fit than 20%
  4. Higher throughput: tiebreaker
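The default ranking can be expressed as a single sort key. The candidate tuples below are hypothetical, not the tool's actual schema:

```python
# Hypothetical candidates: (name, meets_latency_target, num_gpus, vram_utilization, tokens_per_sec)
candidates = [
    ("A100 80GB x2", True, 2, 0.85, 2400),
    ("H100 80GB x1", True, 1, 0.62, 3100),
    ("L40S 48GB x4", False, 4, 0.91, 1100),
]

# Latency pass first, then fewer GPUs, then higher utilization, then higher throughput
ranked = sorted(candidates, key=lambda c: (not c[1], c[2], -c[3], -c[4]))
print([name for name, *_ in ranked])  # H100 first: passes latency on a single GPU
```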

Alternative rankings (user-selectable):

  • Utilization: highest VRAM usage first
  • Lowest Latency: fastest response time
  • Throughput: highest tokens per second
  • Lowest Cost: cheapest per hour

Known Limitations

  1. Architecture estimation: We interpolate from known model architectures. Models with non-standard designs (MoE, very deep/shallow networks) may have different KV cache requirements.

  2. No data parallelism modeling: The tool calculates capacity for a single serving instance. For workloads that exceed a single instance's capacity, you need multiple replicas. The tool doesn't automatically recommend this yet.

  3. Throughput is theoretical: The roofline model gives the hardware's theoretical capability. Real-world throughput depends on framework implementation, CUDA kernel efficiency, attention algorithms, and many other factors. Expect 70-90% of our estimate in practice.

  4. Latency percentiles are approximate: True P95/P99 latency depends on queuing theory, request arrival patterns, and framework scheduling, factors we don't model. Our estimates are based on bandwidth utilization variance, not statistical distributions.

  5. No memory fragmentation modeling: CUDA memory fragmentation can reduce effective VRAM by 5-15% over time. Our 88% usable VRAM fraction partially accounts for this, but long-running services may see higher fragmentation.

  6. Pricing data is point-in-time: Cloud GPU pricing changes frequently. Our estimates use the latest data from our providers database but may not reflect current spot prices or reserved instance discounts.
