Our GPU Capacity Planning tool estimates VRAM requirements, throughput, latency, and batch capacity for self-hosted AI model deployments. This document explains exactly how every number is calculated, what assumptions we make, and where estimates may differ from your real-world results.
1. Model Weight Memory
Formula:
Model Weights (GB) = Parameters × Bytes per Parameter / 1,073,741,824
This calculation is exact; no estimation is involved. Each parameter is stored at the selected precision:
| Precision | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP32 | 4 | 260.8 GB |
| FP16 / BF16 | 2 | 130.4 GB |
| FP8 | 1 | 65.2 GB |
| INT8 | 1 | 65.2 GB |
| INT4 | 0.5 | 32.6 GB |
For fine-tuning use cases, additional memory is added for optimizer states:
- LoRA/QLoRA: +0.5 bytes per parameter (adapter weights + optimizer for ~5-10% of parameters)
- Full fine-tuning: +8 bytes per parameter (Adam optimizer: 4 bytes for momentum + 4 bytes for variance, both in FP32)
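The weight-memory formula and fine-tuning additions above can be sketched in a few lines (the function name and signature are illustrative, not the tool's actual API):

```python
def weight_memory_gb(params: float, bytes_per_param: float,
                     finetune_extra: float = 0.0) -> float:
    """Model weight memory in GB (GiB divisor 2**30 = 1,073,741,824).

    finetune_extra is the extra bytes per parameter from the list above:
    0.5 for LoRA/QLoRA, 8 for full fine-tuning with Adam.
    """
    return params * (bytes_per_param + finetune_extra) / 2**30

# 70B model at FP16 matches the table above.
print(round(weight_memory_gb(70e9, 2), 1))  # 130.4
```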
2. Architecture Estimation
When you select a model from our catalog, we know its parameter count. To calculate KV cache, we also need the model's internal architecture (layers, attention heads, head dimensions). We estimate these by interpolating between known architectures:
| Parameters | Layers | Heads | KV Heads (GQA) | Head Dim | Hidden Dim |
|---|---|---|---|---|---|
| 0.5B | 24 | 16 | 16 | 64 | 1,024 |
| 3B | 28 | 24 | 8 | 128 | 3,072 |
| 7B | 32 | 32 | 8 | 128 | 4,096 |
| 13B | 40 | 40 | 40 | 128 | 5,120 |
| 70B | 80 | 64 | 8 | 128 | 8,192 |
| 405B | 126 | 128 | 8 | 128 | 16,384 |
These anchors are based on published Llama, Qwen, and Mistral architectures. For models between anchors, we interpolate linearly. Most anchors at 3B and above use Grouped Query Attention (GQA) with 8 KV heads, which significantly reduces KV cache size compared to Multi-Head Attention; the 13B anchor (based on Llama 2 13B) is the exception and uses standard Multi-Head Attention, so its KV head count equals its head count.
Limitation: Non-standard architectures (Mixture of Experts models like Mixtral, or models with unusual head configurations) may have different actual values. The estimation is designed for dense transformer models.
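A hypothetical sketch of the anchor interpolation described above; the tool's actual rounding and edge handling may differ:

```python
# (params, layers, heads, kv_heads, head_dim, hidden_dim) from the table above.
ANCHORS = [
    (0.5e9,  24,  16, 16,  64,  1024),
    (3e9,    28,  24,  8, 128,  3072),
    (7e9,    32,  32,  8, 128,  4096),
    (13e9,   40,  40, 40, 128,  5120),
    (70e9,   80,  64,  8, 128,  8192),
    (405e9, 126, 128,  8, 128, 16384),
]

def estimate_architecture(params: float) -> dict:
    """Linearly interpolate architecture fields between the two nearest anchors."""
    fields = ("layers", "heads", "kv_heads", "head_dim", "hidden_dim")
    if params <= ANCHORS[0][0]:
        return dict(zip(fields, ANCHORS[0][1:]))
    if params >= ANCHORS[-1][0]:
        return dict(zip(fields, ANCHORS[-1][1:]))
    for (p0, *a0), (p1, *a1) in zip(ANCHORS, ANCHORS[1:]):
        if p0 <= params <= p1:
            t = (params - p0) / (p1 - p0)
            return {f: round(v0 + t * (v1 - v0))
                    for f, v0, v1 in zip(fields, a0, a1)}

# A 5B model lands halfway between the 3B and 7B anchors.
print(estimate_architecture(5e9)["layers"])  # 30
```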
3. KV Cache Memory
During inference, the model stores Key and Value tensors for every token in the context. This is the KV cache, and it determines how many concurrent requests your GPU can handle.
Formula:
KV Cache per Token (bytes) = 2 × Layers × KV Heads × Head Dim × Bytes per Element
The "2" accounts for both Key and Value tensors.
KV cache precision is independent of model quantization. By default, KV cache is stored in FP16 (2 bytes per element) even when the model weights are quantized to INT4. You can optionally select FP8 or INT8 KV cache precision to halve the KV cache size, if your framework supports it.
Per-request KV cache:
KV Cache per Request (GB) = KV Bytes per Token × Context Length / 1,073,741,824
Example: Llama 70B with 4,096 context length:
- KV bytes per token = 2 × 80 layers × 8 KV heads × 128 head dim × 2 bytes = 327,680 bytes
- Per request = 327,680 × 4,096 / 1,073,741,824 = 1.25 GB
This means each concurrent user adds ~1.25 GB of VRAM usage. At 128K context, this jumps to ~40 GB per request.
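The per-token and per-request formulas can be combined into one helper; the Llama 70B figures reproduce the worked example above (function name is illustrative):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Per-request KV cache in GB. The factor of 2 covers Key + Value."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len / 2**30

# Llama 70B, 4,096-token context, FP16 KV cache.
print(kv_cache_gb(80, 8, 128, 4096))  # 1.25
```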
KV Cache Strategy
The KV cache strategy applies a memory efficiency multiplier:
| Strategy | Efficiency | Effect |
|---|---|---|
| Standard | 1.0× | Pre-allocated contiguous blocks |
| PagedAttention | 0.85× | Virtual memory paging, ~15% savings |
| Prefix Caching | 0.70× | Shared prefixes across requests, ~30% savings |
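Applying the strategy multiplier is a straight scale of the KV cache total; a minimal sketch with assumed key names:

```python
# Efficiency multipliers from the table above; key names are illustrative.
KV_STRATEGY_EFFICIENCY = {
    "standard": 1.0,
    "paged_attention": 0.85,
    "prefix_caching": 0.70,
}

def effective_kv_gb(kv_gb: float, strategy: str = "standard") -> float:
    """Scale raw KV cache memory by the strategy's efficiency multiplier."""
    return kv_gb * KV_STRATEGY_EFFICIENCY[strategy]
```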
4. Activation Memory
Activation memory is the temporary VRAM used during the forward pass for intermediate computations (attention scores, FFN intermediates). For inference, these are transient: allocated during each step and freed immediately afterward.
Formula:
Activation Memory (GB) = Model Weights (GB) × Activation Fraction
The activation fraction depends on the use case:
| Use Case | Fraction | Reasoning |
|---|---|---|
| Inference | 3% | Transient per decode step, single token at a time |
| Batch Processing | 4% | Slightly higher due to larger effective batch |
| LoRA/QLoRA | 15% | Gradients for adapter parameters held during backward pass |
| Full Fine-tuning | 40% | Full gradients + backward pass activations for all layers |
| Embeddings | 2% | Forward pass only, no autoregressive generation |
Why a percentage and not an exact calculation? Exact activation memory depends on batch size, sequence length, attention implementation, and framework internals. Modern frameworks (vLLM, TGI) manage activations dynamically through CUDA's caching allocator. The percentage is a conservative upper bound for the steady-state overhead.
5. Framework Overhead
Each serving framework has a baseline memory footprint for CUDA contexts, runtime state, tokenizer caches, and internal data structures.
Formula:
Overhead (GB) = (Weights + KV Cache + Activations) × (Overhead Factor - 1)
We use the larger of the quantization overhead factor and the framework overhead factor:
| Framework | Overhead Factor | Notes |
|---|---|---|
| vLLM | 1.02 | Well-optimized, minimal runtime overhead |
| TGI | 1.03 | Hugging Face runtime |
| TensorRT-LLM | 1.02 | NVIDIA optimized |
| SGLang | 1.02 | Lightweight runtime |
| Ollama | 1.08 | Higher overhead for simplicity features |
Why multiplicative instead of fixed? Real overhead has both fixed (CUDA context ~0.5 GB) and variable (memory fragmentation ~2-5%) components. At the model sizes people care about (7B+), the multiplicative approach approximates both components. For very small models (<1B), this slightly underestimates.
6. Total VRAM and Usable Fraction
Total VRAM Needed = Weights + KV Cache (batched) + Activations + Overhead
Not all GPU VRAM is available for model serving. The OS, CUDA driver, and display manager reserve memory.
Usable VRAM = GPU VRAM × 0.88 (88%)
This is a conservative margin. vLLM's default gpu_memory_utilization is 0.95, but we use 0.88 to account for:
- CUDA context and driver overhead (~2-3%)
- Memory fragmentation (~3-5%)
- Headroom for allocation spikes (~2-4%)
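Sections 4-6 combine into a single fit check. A minimal sketch, with the activation fraction and overhead factor taken from the tables above; the Llama 8B FP16/vLLM example figures are assumptions for illustration:

```python
def total_vram_gb(weights_gb: float, kv_batched_gb: float,
                  activation_fraction: float, overhead_factor: float) -> float:
    """Total VRAM needed: weights + KV cache + activations + overhead."""
    activations = weights_gb * activation_fraction
    subtotal = weights_gb + kv_batched_gb + activations
    return subtotal * overhead_factor  # overhead = subtotal * (factor - 1)

def fits(total_gb: float, gpu_vram_gb: float,
         usable_fraction: float = 0.88) -> bool:
    """Apply the 88% usable-VRAM margin before comparing."""
    return total_gb <= gpu_vram_gb * usable_fraction

# Illustrative: Llama 8B FP16 (14.9 GB weights, 0.5 GB KV, 3% activations,
# vLLM 1.02 overhead) on an A100 80GB.
total = total_vram_gb(14.9, 0.5, 0.03, 1.02)
print(round(total, 2), fits(total, 80))  # 16.16 True
```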
7. Tensor Parallelism (TP) Determination
When a model doesn't fit on a single GPU, we split it across multiple GPUs using tensor parallelism.
Logic: Find the minimum TP degree (1, 2, 4, 8, 16) where the per-GPU memory fits:
Per-GPU Fixed Memory = (Model Weights + Activations) / TP Degree
Per-GPU KV Headroom = Usable VRAM / Overhead - Per-GPU Fixed Memory
Max Batch Size = KV Headroom / (KV per Request / TP Degree)
The overhead is factored into the VRAM budget upfront to prevent situations where the model fits but the batched KV cache plus overhead pushes it over.
TP with NVLink vs PCIe: When TP requires inter-GPU communication but the GPU lacks high-bandwidth interconnect (NVLink, Infinity Fabric), we apply a 60% bandwidth penalty. This reflects the real-world performance impact of TP over PCIe, where each layer requires an all-reduce synchronization that becomes a bottleneck on slow interconnects.
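The minimum-TP search described above can be sketched as follows; this is an approximation of the tool's logic under the stated formulas, with illustrative example figures:

```python
def min_tp_degree(weights_gb: float, activations_gb: float,
                  kv_per_request_gb: float, gpu_vram_gb: float,
                  overhead_factor: float, usable_fraction: float = 0.88,
                  degrees=(1, 2, 4, 8, 16)):
    """Smallest TP degree where per-GPU fixed memory fits and leaves
    KV headroom for at least one request (per the section 7 formulas)."""
    for tp in degrees:
        fixed = (weights_gb + activations_gb) / tp
        headroom = gpu_vram_gb * usable_fraction / overhead_factor - fixed
        if headroom >= kv_per_request_gb / tp:
            return tp
    return None  # does not fit even at the largest supported degree

# Illustrative: Llama 70B FP16 (130.4 GB weights) needs TP=2 on A100 80GB.
print(min_tp_degree(130.4, 3.9, 1.25, 80, 1.02))  # 2
```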
8. Batch Capacity
Formula:
KV Headroom (GB) = Usable VRAM per GPU - Fixed Memory per GPU
Max Batch Size = floor(KV Headroom / KV Cache per Request per GPU)
Effective Batch = min(Max Batch, Concurrency)
This tells you how many concurrent requests the GPU can handle simultaneously. Each additional request in the batch requires its own KV cache allocation.
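The batch-capacity formulas above, as a short sketch (the example figures are assumptions):

```python
import math

def batch_capacity(usable_vram_gb: float, fixed_gb: float,
                   kv_per_request_gb: float, concurrency: int):
    """Return (max batch, effective batch) per the formulas above."""
    headroom = usable_vram_gb - fixed_gb
    max_batch = max(0, math.floor(headroom / kv_per_request_gb))
    return max_batch, min(max_batch, concurrency)

# Illustrative: 70.4 GB usable, 16 GB fixed, 1.25 GB KV per request.
print(batch_capacity(70.4, 16.0, 1.25, 32))  # (43, 32)
```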
9. Throughput: Roofline Model
We use the roofline model to estimate throughput. This is a physics-based calculation, not a benchmark extrapolation.
Decode Throughput (Autoregressive Generation)
Each decode step reads all model weights from GPU memory and produces one token per batch member.
Two regimes:
Memory-bound throughput = Batch Size × Bandwidth / Model Size per GPU
Compute-bound ceiling = FLOPS / (2 × Parameters per GPU)
Actual throughput = min(Memory-bound, Compute-bound)
- Memory-bound (most common): Throughput scales linearly with batch size. The GPU spends most of its time reading weights, and batching amortizes this cost.
- Compute-bound (very large batches): Throughput hits a ceiling determined by the GPU's floating-point operations per second.
The crossover batch size is FLOPS / Bandwidth (exact for FP16 weights, where model size in bytes is 2 × parameters, so the 2× factors cancel). For A100: ~153. For H100: ~591. Most production workloads stay in the memory-bound regime.
Bandwidth utilization: We apply a practical utilization factor to theoretical peak bandwidth:
- Pessimistic: 70% (worst case, fragmented access patterns)
- Expected: 80% (typical for LLM inference)
- Optimistic: 90% (best case, optimal memory access)
This gives the min/expected/max range in our estimates.
Example: Llama 8B FP16 on A100 80GB (single GPU, batch=1):
- Bandwidth = 2,039 GB/s × 80% = 1,631 GB/s
- Model size = 14.9 GB
- Throughput = 1 × 1,631 / 14.9 = 109 tokens/sec
- Real-world vLLM benchmark: 80-120 tokens/sec ✓
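The decode roofline can be written directly from the two formulas above; the A100 bandwidth figure comes from the worked example, and 312 TFLOPS dense FP16 is an assumed spec for illustration:

```python
def decode_throughput(batch: int, bandwidth_gbs: float, model_size_gb: float,
                      flops: float, params_per_gpu: float,
                      util: float = 0.80) -> float:
    """Roofline decode throughput (tokens/sec): min of the memory-bound
    and compute-bound regimes, with a bandwidth utilization factor."""
    mem_bound = batch * bandwidth_gbs * util / model_size_gb
    compute_bound = flops / (2 * params_per_gpu)
    return min(mem_bound, compute_bound)

# Reproduces the Llama 8B / A100 example above (assumed 312e12 FP16 FLOPS).
print(round(decode_throughput(1, 2039, 14.9, 312e12, 8e9)))  # 109
```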
Prefill Throughput (Prompt Processing)
Prefill is compute-bound. All input tokens are processed in parallel:
Prefill throughput = FLOPS × TP / (2 × Total Parameters)
This represents how fast the GPU can process your input prompt before generation begins. With tensor parallelism, each GPU handles 1/TP of the computation, so total prefill throughput scales with TP.
10. Latency Estimation
Latency is derived from throughput, not from arbitrary multipliers.
Time to First Token (TTFT)
TTFT = Input Tokens / Prefill Throughput
This is how long the user waits before seeing the first generated token. Longer prompts = longer TTFT.
Time Between Tokens (TBT)
TBT = Concurrency / Decode Throughput
In continuous batching, the GPU's decode bandwidth is shared across all concurrent requests. Each user gets one token per decode step, but the step serves all users simultaneously.
Total Latency
Total = TTFT + Output Tokens × TBT
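The three latency formulas compose directly; a minimal sketch with hypothetical throughput figures:

```python
def latency_s(input_tokens: int, output_tokens: int,
              prefill_tps: float, decode_tps: float,
              concurrency: int = 1):
    """Return (TTFT, TBT, total latency) in seconds per the formulas above."""
    ttft = input_tokens / prefill_tps          # prompt processing
    tbt = concurrency / decode_tps             # shared decode bandwidth
    return ttft, tbt, ttft + output_tokens * tbt

# Illustrative: 512-token prompt, 256 generated tokens,
# 19,500 tok/s prefill, 109 tok/s decode, one user.
ttft, tbt, total = latency_s(512, 256, 19500, 109)
print(round(total, 2))
```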
Percentiles
- P50: Uses expected throughput (80% bandwidth utilization)
- P95: Uses pessimistic throughput (70% bandwidth utilization)
- P99: P95 × 1.3 (small additional variance for tail latency from scheduling, memory allocation, and GC pauses)
These are not true statistical percentiles. They represent the performance range based on bandwidth utilization variance. Real percentiles depend on request arrival patterns, sequence length distribution, and framework scheduling behavior.
11. Scoring and Ranking
GPUs are ranked using multiple criteria:
Default (Best Fit):
- Meets latency target: GPUs that pass rank above those that fail
- Fewer GPUs: single GPU setups rank above multi-GPU (simpler to operate)
- Higher VRAM utilization: 85% utilization is a better fit than 20%
- Higher throughput: tiebreaker
Alternative rankings (user-selectable):
- Utilization: highest VRAM usage first
- Lowest Latency: fastest response time
- Throughput: highest tokens per second
- Lowest Cost: cheapest per hour
Known Limitations
- Architecture estimation: We interpolate from known model architectures. Models with non-standard designs (MoE, very deep/shallow networks) may have different KV cache requirements.
- No data parallelism modeling: The tool calculates capacity for a single serving instance. For workloads that exceed a single instance's capacity, you need multiple replicas. The tool doesn't automatically recommend this yet.
- Throughput is theoretical: The roofline model gives the hardware's theoretical capability. Real-world throughput depends on framework implementation, CUDA kernel efficiency, attention algorithms, and many other factors. Expect 70-90% of our estimate in practice.
- Latency percentiles are approximate: True P95/P99 latency depends on queuing theory, request arrival patterns, and framework scheduling, factors we don't model. Our estimates are based on bandwidth utilization variance, not statistical distributions.
- No memory fragmentation modeling: CUDA memory fragmentation can reduce effective VRAM by 5-15% over time. Our 88% usable VRAM fraction partially accounts for this, but long-running services may see higher fragmentation.
- Pricing data is point-in-time: Cloud GPU pricing changes frequently. Our estimates use the latest data from our providers database but may not reflect current spot prices or reserved instance discounts.