
How Our GPU Capacity Planning Calculator Works

Inferbase Team · 14 min read

Deploying a self-hosted AI model requires answering a series of interdependent infrastructure questions: how much VRAM the model needs, how many concurrent requests a given GPU can sustain, what latency the end user will experience, and whether a single GPU suffices or the workload must be distributed across several. The memory budget constrains batch capacity, which in turn determines throughput and latency, so the questions cannot be answered in isolation. The GPU Capacity Planning tool is designed to solve them together.

This document provides a detailed explanation of every formula, assumption, and estimation method used by the calculator. Where our estimates diverge from real-world results, the reasons are stated explicitly. For a practical walkthrough of using the tool to size GPUs for specific models, see our GPU sizing guide for LLM inference.

1. Model weight memory

The memory required to store a model's parameters is deterministic. Each parameter occupies a fixed number of bytes determined by the selected precision, and the total is a straightforward multiplication:

Model Weights (GB) = Parameters x Bytes per Parameter / 1,073,741,824

| Precision | Bytes per Parameter | Example: 70B Model |
|---|---|---|
| FP32 | 4 | 260.8 GB |
| FP16 / BF16 | 2 | 130.4 GB |
| FP8 | 1 | 65.2 GB |
| INT8 | 1 | 65.2 GB |
| INT4 | 0.5 | 32.6 GB |

For fine-tuning use cases, the optimizer maintains additional state that must also reside in VRAM:

  • LoRA/QLoRA: +0.5 bytes per parameter (adapter weights and optimizer state for approximately 5-10% of parameters)
  • Full fine-tuning: +8 bytes per parameter (Adam optimizer requires 4 bytes for momentum and 4 bytes for variance, both stored in FP32)
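The weight formula and the fine-tuning additions above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's source: the function and dictionary names are hypothetical, and the fine-tuning surcharge is simply added to the bytes per parameter as the list above describes.

```python
GIB = 1_073_741_824  # divisor used in the formula above (2**30 bytes)

# Bytes per parameter, from the precision table above.
BYTES_PER_PARAM = {
    "fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
    "fp8": 1.0, "int8": 1.0, "int4": 0.5,
}

# Extra bytes per parameter for training state, from the list above.
FINETUNE_EXTRA_BYTES = {"none": 0.0, "lora": 0.5, "full": 8.0}


def weight_memory_gb(params: float, precision: str, finetune: str = "none") -> float:
    """Model weights plus optional fine-tuning state, in GB."""
    bytes_per_param = BYTES_PER_PARAM[precision] + FINETUNE_EXTRA_BYTES[finetune]
    return params * bytes_per_param / GIB


# Reproduce the 70B column of the table.
for precision in ("fp32", "fp16", "fp8", "int4"):
    print(precision, round(weight_memory_gb(70e9, precision), 1))
```

With these assumptions, full fine-tuning of a 70B FP16 model needs five times the inference weight memory (2 + 8 bytes per parameter instead of 2).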

2. Architecture estimation

The model's parameter count alone is not sufficient to calculate KV cache memory. That calculation also requires knowledge of the model's internal architecture: the number of layers, attention heads, head dimensions, and whether the model uses Grouped Query Attention. When a user selects a model from the Inferbase catalog, we know its parameter count but may not have its full architecture specification. In those cases, we estimate the architecture by interpolating between known reference points:

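| Parameters | Layers | Heads | KV Heads (GQA) | Head Dim | Hidden Dim |
|---|---|---|---|---|---|
| 0.5B | 24 | 16 | 16 | 64 | 1,024 |
| 3B | 28 | 24 | 8 | 128 | 3,072 |
| 7B | 32 | 32 | 8 | 128 | 4,096 |
| 13B | 40 | 40 | 40 | 128 | 5,120 |
| 70B | 80 | 64 | 8 | 128 | 8,192 |
| 405B | 126 | 128 | 8 | 128 | 16,384 |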

These anchor points are derived from published Llama, Qwen, and Mistral architectures. For models whose parameter count falls between two anchors, the tool interpolates linearly. Models at or above 3B parameters are generally assumed to use Grouped Query Attention (GQA) with 8 KV heads, which substantially reduces KV cache size compared to Multi-Head Attention. Note that the 13B anchor in the table is listed with 40 KV heads, one per attention head, which corresponds to Multi-Head Attention.

Limitation: This estimation is designed for dense transformer models. Non-standard architectures, such as Mixture of Experts models (Mixtral) or models with unusual head configurations, may have significantly different actual values.
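A rough sketch of the interpolation step follows. It assumes every field is interpolated linearly in parameter count and rounded to the nearest integer, and that sizes outside the table are clamped to the nearest anchor. The text above only says that the tool "interpolates linearly", so the rounding, the clamping, and the names `ANCHORS` and `estimate_architecture` are assumptions made for illustration.

```python
from bisect import bisect_left

# Anchor rows from the table above:
# (parameters, layers, heads, kv_heads, head_dim, hidden_dim)
ANCHORS = [
    (0.5e9, 24, 16, 16, 64, 1024),
    (3e9, 28, 24, 8, 128, 3072),
    (7e9, 32, 32, 8, 128, 4096),
    (13e9, 40, 40, 40, 128, 5120),
    (70e9, 80, 64, 8, 128, 8192),
    (405e9, 126, 128, 8, 128, 16384),
]
FIELDS = ("layers", "heads", "kv_heads", "head_dim", "hidden_dim")


def estimate_architecture(params: float) -> dict:
    """Linearly interpolate each field between the two surrounding anchors."""
    sizes = [row[0] for row in ANCHORS]
    # Assumption: clamp to the table's range instead of extrapolating.
    params = min(max(params, sizes[0]), sizes[-1])
    hi = max(1, bisect_left(sizes, params))
    lo_row, hi_row = ANCHORS[hi - 1], ANCHORS[hi]
    t = (params - lo_row[0]) / (hi_row[0] - lo_row[0])
    values = (round(a + t * (b - a)) for a, b in zip(lo_row[1:], hi_row[1:]))
    return dict(zip(FIELDS, values))


print(estimate_architecture(10e9))
```

A 10B model sits halfway between the 7B and 13B anchors, so this sketch returns 36 layers. It also returns 24 KV heads, a value that naive interpolation produces but few real models use, which illustrates why the estimate should be treated as approximate.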

3. KV cache memory

During autoregressive inference, the model stores Key and Value tensors for every token in the context window. This accumulated state is the KV cache, and its size is the primary factor determining how many concurrent requests a GPU can serve.

The per-token KV cache size is calculated as:

KV Cache per Token (bytes) = 2 x Layers x KV Heads x Head Dim x Bytes per Element

The factor of 2 accounts for both Key and Value tensors.

An important distinction: KV cache precision is independent of model weight quantization. By default, the KV cache is stored in FP16 (2 bytes per element) even when model weights are quantized to INT4. Users can optionally select FP8 or INT8 KV cache precision to halve the cache size, provided their serving framework supports it.

The total KV cache memory for a single request is then:

KV Cache per Request (GB) = KV Bytes per Token x Context Length / 1,073,741,824

To illustrate the scale of this cost, consider Llama 70B with a 4,096 token context length:

  • KV bytes per token = 2 x 80 layers x 8 KV heads x 128 head dim x 2 bytes = 327,680 bytes
  • Per request = 327,680 x 4,096 / 1,073,741,824 = 1.25 GB

Each concurrent user therefore adds approximately 1.25 GB of VRAM usage. At 128K context, this jumps to approximately 40 GB per request, which is why long-context workloads are particularly constrained by KV cache memory.
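The two KV cache formulas translate directly into code. The sketch below reproduces the Llama 70B figures; the function names are illustrative and not taken from the tool.

```python
GIB = 1_073_741_824  # 2**30 bytes, the divisor used in the formula above


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: float = 2.0) -> float:
    """Bytes of Key plus Value state stored per token (FP16 by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


def kv_gb_per_request(bytes_per_token: float, context_length: int) -> float:
    """KV cache for one request that fills the whole context window, in GB."""
    return bytes_per_token * context_length / GIB


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)                              # 327680.0
print(kv_gb_per_request(per_token, 4096))     # 1.25
print(kv_gb_per_request(per_token, 131072))   # 40.0
```

Passing `bytes_per_element=1.0` models an FP8 or INT8 KV cache and halves every figure.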

KV cache strategy

The serving framework's KV cache management strategy affects actual memory consumption. The tool applies a memory efficiency multiplier based on the selected strategy:

| Strategy | Efficiency | Effect |
|---|---|---|
| Standard | 1.0x | Pre-allocated contiguous blocks |
| PagedAttention | 0.85x | Virtual memory paging, approximately 15% savings |
| Prefix Caching | 0.70x | Shared prefixes across requests, approximately 30% savings |
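As a small illustration, the multiplier can be applied to the per-request KV cache size. Where exactly the tool applies the multiplier is not stated above, so scaling the per-request figure is an assumption, as are the dictionary keys and function name.

```python
# Efficiency multipliers from the strategy table above.
KV_STRATEGY_EFFICIENCY = {
    "standard": 1.0,
    "paged_attention": 0.85,
    "prefix_caching": 0.70,
}


def effective_kv_gb(kv_gb_per_request: float, strategy: str) -> float:
    """Per-request KV cache after the strategy's efficiency multiplier."""
    return kv_gb_per_request * KV_STRATEGY_EFFICIENCY[strategy]


# Llama 70B at 4,096 tokens (1.25 GB per request) under each strategy.
for name in KV_STRATEGY_EFFICIENCY:
    print(name, effective_kv_gb(1.25, name))
```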

4. Activation memory

Beyond model weights and the KV cache, the GPU must also allocate VRAM for activation memory: the intermediate tensors produced during the forward pass (attention scores, feed-forward network intermediates). For inference workloads, these allocations are transient, created during each decode step and freed immediately after.

The tool estimates activation memory as a fraction of model weight memory:

Activation Memory (GB) = Model Weights (GB) x Activation Fraction

The fraction varies by use case because different workloads hold different amounts of intermediate state:

| Use Case | Fraction | Reasoning |
|---|---|---|
| Inference | 3% | Transient per decode step, single token at a time |
| Batch Processing | 4% | Slightly higher due to larger effective batch |
| LoRA/QLoRA | 15% | Gradients for adapter parameters held during backward pass |
| Full Fine-tuning | 40% | Full gradients and backward pass activations for all layers |
| Embeddings | 2% | Forward pass only, no autoregressive generation |

The use of a percentage rather than an exact calculation is a deliberate simplification. Exact activation memory depends on batch size, sequence length, the specific attention implementation, and framework internals. Modern serving frameworks such as vLLM and TGI manage activations dynamically through CUDA's caching allocator, making the precise value difficult to predict from architecture parameters alone. The percentages represent conservative upper bounds for steady-state overhead.
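The activation estimate is a single lookup and multiplication. The sketch below uses hypothetical key names for the use cases in the table.

```python
# Activation fractions from the table above.
ACTIVATION_FRACTION = {
    "inference": 0.03,
    "batch_processing": 0.04,
    "lora": 0.15,
    "full_finetune": 0.40,
    "embeddings": 0.02,
}


def activation_memory_gb(weights_gb: float, use_case: str) -> float:
    """Activation memory as a fixed fraction of model weight memory."""
    return weights_gb * ACTIVATION_FRACTION[use_case]


# A 70B FP16 model (130.4 GB of weights) served for inference.
print(round(activation_memory_gb(130.4, "inference"), 3))
```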

5. Framework overhead

Each serving framework carries a baseline memory footprint for CUDA contexts, runtime state, tokenizer caches, and internal data structures. The tool models this as a multiplicative factor applied to the sum of all other memory components:

Overhead (GB) = (Weights + KV Cache + Activations) x (Overhead Factor - 1)

The overhead factor is the larger of the quantization overhead factor and the framework overhead factor. The framework factors are:

| Framework | Overhead Factor | Notes |
|---|---|---|
| vLLM | 1.02 | Well-optimized, minimal runtime overhead |
| TGI | 1.03 | Hugging Face runtime |
| TensorRT-LLM | 1.02 | NVIDIA optimized |
| SGLang | 1.02 | Lightweight runtime |
| Ollama | 1.08 | Higher overhead for user-facing simplicity features |

A multiplicative model was chosen over a fixed-overhead model because real-world overhead has both fixed and variable components. The CUDA context itself is approximately 0.5 GB (fixed), while memory fragmentation accounts for 2-5% (variable). At the model sizes where GPU capacity planning is most relevant (7B parameters and above), the multiplicative approach approximates both components adequately. For very small models under 1B parameters, this approach slightly underestimates overhead.
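The overhead formula can be sketched as follows. The quantization overhead factors are not listed in this document, so the `quant_factor` argument defaults to a placeholder of 1.0 (no extra overhead); that default and the function name are assumptions.

```python
# Framework overhead factors from the table above.
FRAMEWORK_OVERHEAD = {
    "vllm": 1.02,
    "tgi": 1.03,
    "tensorrt-llm": 1.02,
    "sglang": 1.02,
    "ollama": 1.08,
}


def overhead_gb(weights_gb: float, kv_gb: float, activations_gb: float,
                framework: str, quant_factor: float = 1.0) -> float:
    """Overhead as a multiplicative share of all other memory components."""
    # The larger of the framework factor and the quantization factor is used.
    factor = max(FRAMEWORK_OVERHEAD[framework], quant_factor)
    return (weights_gb + kv_gb + activations_gb) * (factor - 1.0)


# 70B FP16 weights, one 4,096-token request, inference activations, on vLLM.
print(round(overhead_gb(130.4, 1.25, 3.912, "vllm"), 2))
```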

6. Total VRAM and usable fraction

The total VRAM requirement is the sum of all components:

Total VRAM Needed = Weights + KV Cache (batched) + Activations + Overhead

However, not all of a GPU's advertised VRAM is available for model serving. The operating system, CUDA driver, and display manager each reserve memory. The tool accounts for this with a fixed utilization ceiling:

Usable VRAM = GPU VRAM x 0.88 (88%)

This is intentionally more conservative than vLLM's default gpu_memory_utilization of 0.90. The 12% margin accounts for three sources of unavailable memory:

  • CUDA context and driver overhead (approximately 2-3%)
  • Memory fragmentation (approximately 3-5%)
  • Headroom for allocation spikes during request bursts (approximately 2-4%)
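Putting the components together, a fit check might look like the sketch below. The helper names are hypothetical, and the example numbers for an 8B FP16 model (14.9 GB of weights, 0.5 GB of KV cache per 4,096-token request) are illustrative values computed with the formulas from earlier sections.

```python
USABLE_FRACTION = 0.88  # fixed utilization ceiling described above


def usable_vram_gb(gpu_vram_gb: float) -> float:
    """VRAM the planner treats as available for serving."""
    return gpu_vram_gb * USABLE_FRACTION


def total_vram_gb(weights_gb: float, kv_gb_per_request: float, batch: int,
                  activations_gb: float, overhead_factor: float) -> float:
    """Weights + batched KV cache + activations + multiplicative overhead."""
    base = weights_gb + kv_gb_per_request * batch + activations_gb
    # base + base * (factor - 1) simplifies to base * factor.
    return base * overhead_factor


def fits(total_gb: float, gpu_vram_gb: float) -> bool:
    return total_gb <= usable_vram_gb(gpu_vram_gb)


needed = total_vram_gb(14.9, 0.5, batch=8, activations_gb=0.447, overhead_factor=1.02)
print(round(needed, 2), round(usable_vram_gb(80.0), 1), fits(needed, 80.0))
```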

7. Tensor parallelism (TP) determination

When a model's memory requirements exceed what a single GPU can provide, the tool distributes the model across multiple GPUs using tensor parallelism. The goal is to find the minimum TP degree (from the set 1, 2, 4, 8, 16) at which the per-GPU memory budget is sufficient:

Per-GPU Fixed Memory = (Model Weights + Activations) / TP Degree
Per-GPU KV Headroom  = Usable VRAM / Overhead Factor - Per-GPU Fixed Memory
Max Batch Size       = floor(Per-GPU KV Headroom / (KV per Request / TP Degree))

The overhead factor is applied to the VRAM budget rather than the memory consumption. This prevents a scenario where the model weights and activations fit within the GPU, but the combined KV cache and framework overhead push total usage beyond the available VRAM.
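The search over TP degrees can be sketched as a simple loop. The text above does not define what counts as "sufficient", so this sketch assumes a configuration qualifies once at least `min_batch` requests fit. That criterion, the default overhead factor of 1.02, and the function names are assumptions.

```python
import math
from typing import Optional

TP_DEGREES = (1, 2, 4, 8, 16)


def max_batch_for_tp(weights_gb: float, activations_gb: float,
                     kv_gb_per_request: float, gpu_vram_gb: float, tp: int,
                     overhead_factor: float = 1.02,
                     usable_fraction: float = 0.88) -> int:
    """Largest batch that fits on each GPU at the given TP degree."""
    fixed = (weights_gb + activations_gb) / tp
    # The overhead factor shrinks the VRAM budget, as described above.
    headroom = gpu_vram_gb * usable_fraction / overhead_factor - fixed
    if headroom <= 0:
        return 0
    return math.floor(headroom / (kv_gb_per_request / tp))


def min_tp_degree(weights_gb: float, activations_gb: float,
                  kv_gb_per_request: float, gpu_vram_gb: float,
                  min_batch: int = 1) -> Optional[int]:
    """Smallest TP degree that serves at least min_batch requests, or None."""
    for tp in TP_DEGREES:
        batch = max_batch_for_tp(weights_gb, activations_gb,
                                 kv_gb_per_request, gpu_vram_gb, tp)
        if batch >= min_batch:
            return tp
    return None


# Llama 70B FP16 (130.4 GB weights, 3.912 GB activations, 1.25 GB KV) on 80 GB GPUs.
print(min_tp_degree(130.4, 3.912, 1.25, 80.0))
print(min_tp_degree(130.4, 3.912, 1.25, 80.0, min_batch=8))
```

Under these assumptions the 70B model fits at TP=2 but only with room for two concurrent requests; requiring a batch of 8 pushes the answer to TP=4.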

TP with NVLink vs PCIe: Tensor parallelism requires inter-GPU communication at every transformer layer (an all-reduce operation). When the selected GPU lacks a high-bandwidth interconnect such as NVLink or Infinity Fabric, the tool applies a 60% bandwidth penalty to throughput estimates. This reflects the real-world performance degradation of tensor parallelism over PCIe, where the all-reduce synchronization at each layer becomes a throughput bottleneck.

8. Batch capacity

Once the fixed memory components (weights, activations) are allocated, the remaining VRAM determines how many concurrent requests the GPU can serve. Each additional request in the batch requires its own KV cache allocation:

KV Headroom (GB)   = Usable VRAM per GPU / Overhead Factor - Fixed Memory per GPU
Max Batch Size     = floor(KV Headroom / KV Cache per Request per GPU)
Effective Batch    = min(Max Batch Size, Concurrency)
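A minimal sketch of the batch capacity step follows. The first argument is the per-GPU VRAM budget left for fixed memory and KV cache, however that budget was derived upstream; the function name and the example figures for an 8B FP16 model are illustrative.

```python
import math
from typing import Tuple


def batch_capacity(vram_budget_per_gpu_gb: float, fixed_memory_per_gpu_gb: float,
                   kv_per_request_per_gpu_gb: float,
                   concurrency: int) -> Tuple[int, int]:
    """Return (max batch size, effective batch size) for one GPU."""
    headroom = vram_budget_per_gpu_gb - fixed_memory_per_gpu_gb
    # No negative batch sizes when the fixed memory alone exceeds the budget.
    max_batch = max(0, math.floor(headroom / kv_per_request_per_gpu_gb))
    return max_batch, min(max_batch, concurrency)


# 70.4 GB budget, 15.347 GB of weights plus activations, 0.5 GB KV per request.
print(batch_capacity(70.4, 15.347, 0.5, concurrency=32))   # (110, 32)
```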

9. Throughput: roofline model

The tool estimates throughput using the roofline model, a physics-based analytical framework that bounds performance by the hardware's memory bandwidth and compute capacity. This approach derives estimates from the GPU's specifications rather than extrapolating from benchmarks, which makes the results hardware-portable but necessarily theoretical.

Decode throughput (autoregressive generation)

Each decode step reads all model weights from GPU memory and produces one token per batch member. Depending on the batch size, throughput is limited by one of two hardware constraints:

Memory-bound throughput = Batch Size x Bandwidth / Model Size per GPU
Compute-bound ceiling   = FLOPS / (2 x Parameters per GPU)
Actual throughput       = min(Memory-bound, Compute-bound)

  • Memory-bound (most common): throughput scales linearly with batch size because the GPU spends most of its time reading weights from HBM. Batching amortizes this memory access cost across multiple requests.
  • Compute-bound (very large batches): throughput reaches a ceiling determined by the GPU's floating-point operations per second.

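The crossover batch size, where the workload transitions from memory-bound to compute-bound, is the batch size at which the two expressions above are equal. For FP16 weights (2 bytes per parameter) this reduces to FLOPS / Bandwidth. For the A100 (312 TFLOPS, 2,039 GB/s), the crossover occurs at approximately 153 concurrent requests. The H100 figure of approximately 591 corresponds to a peak of roughly 1,979 TFLOPS over 3,350 GB/s, which is NVIDIA's sparsity-enabled FP16 rating; with the dense FP16 rating of roughly 989 TFLOPS, the crossover is closer to 295. Most production inference workloads operate well within the memory-bound regime.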

The tool applies a practical utilization factor to the GPU's theoretical peak bandwidth to account for real-world memory access patterns:

  • Pessimistic: 70% (fragmented access patterns, worst case)
  • Expected: 80% (typical for LLM inference)
  • Optimistic: 90% (optimal memory access, best case)

These three utilization levels produce the min/expected/max range reported in the tool's output.

Example: Llama 8B FP16 on A100 80GB (single GPU, batch size 1):

  • Effective bandwidth = 2,039 GB/s x 80% = 1,631 GB/s
  • Model size = 14.9 GB
  • Throughput = 1 x 1,631 / 14.9 = 109 tokens/sec
  • Real-world vLLM benchmarks for this configuration report 80-120 tokens/sec
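The decode roofline and the three utilization levels can be sketched as below. The 2,039 GB/s bandwidth and 14.9 GB model size come from the example above; the 312 TFLOPS figure is the A100's dense FP16 rating and the 8e9 parameter count is a round number, both supplied here as assumptions for the compute ceiling.

```python
def decode_throughput(batch: int, bandwidth_gb_s: float, model_gb_per_gpu: float,
                      flops: float, params_per_gpu: float,
                      utilization: float = 0.80) -> float:
    """Roofline estimate of decode tokens/sec across the whole batch."""
    # Only the bandwidth term is scaled by the utilization factor.
    memory_bound = batch * bandwidth_gb_s * utilization / model_gb_per_gpu
    compute_bound = flops / (2 * params_per_gpu)
    return min(memory_bound, compute_bound)


A100 = dict(bandwidth_gb_s=2039.0, model_gb_per_gpu=14.9,
            flops=312e12, params_per_gpu=8e9)

# Pessimistic, expected, and optimistic estimates at batch size 1.
for label, util in (("min", 0.70), ("expected", 0.80), ("max", 0.90)):
    print(label, round(decode_throughput(1, utilization=util, **A100), 1))
```

At batch size 200 the same call returns the compute-bound ceiling (19,500 tokens/sec under these inputs) rather than the memory-bound value.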

Prefill throughput (prompt processing)

Unlike decode, prefill is compute-bound because all input tokens are processed in parallel through a single forward pass:

Prefill throughput = FLOPS x TP / (2 x Total Parameters)

This represents how quickly the GPU processes the input prompt before autoregressive generation begins. With tensor parallelism, each GPU handles 1/TP of the computation, so total prefill throughput scales linearly with the TP degree.
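The prefill formula is a one-liner. The sketch below uses the A100's dense FP16 rating of 312 TFLOPS as an assumed input; the function name is illustrative.

```python
def prefill_throughput(flops_per_gpu: float, tp: int, total_params: float) -> float:
    """Compute-bound prompt processing rate in tokens/sec."""
    # Roughly 2 FLOPs per parameter per token for a forward pass.
    return flops_per_gpu * tp / (2 * total_params)


print(prefill_throughput(312e12, tp=1, total_params=8e9))            # 8B on one GPU
print(round(prefill_throughput(312e12, tp=2, total_params=70e9)))    # 70B at TP=2
```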

10. Latency estimation

Latency estimates are derived directly from the throughput calculations described above, not from empirical multipliers or heuristics.

Time to first token (TTFT)

TTFT = Input Tokens / Prefill Throughput

TTFT measures how long the user waits before the first generated token appears. It is determined entirely by prefill throughput and input length: longer prompts produce proportionally longer TTFT.

Time between tokens (TBT)

TBT = Concurrency / Decode Throughput

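Under continuous batching, the GPU's decode bandwidth is shared across all concurrent requests. Each user receives one token per decode step, and the decode step serves all batched users simultaneously. While concurrency fits within the effective batch size and the workload remains memory-bound, decode throughput grows with the batch and TBT stays roughly constant. Once concurrency exceeds the maximum batch size, or throughput reaches the compute-bound ceiling, TBT increases for each individual user.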

Total latency

Total = TTFT + Output Tokens x TBT
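The three latency formulas chain together as shown below. The throughput inputs (19,500 tokens/sec prefill, 109.5 tokens/sec decode) are illustrative values in line with the single-GPU 8B example from the throughput section; the function names are hypothetical.

```python
def ttft_s(input_tokens: int, prefill_tps: float) -> float:
    """Time to first token, in seconds."""
    return input_tokens / prefill_tps


def tbt_s(concurrency: int, decode_tps: float) -> float:
    """Time between tokens for one user, in seconds."""
    return concurrency / decode_tps


def total_latency_s(input_tokens: int, output_tokens: int, prefill_tps: float,
                    concurrency: int, decode_tps: float) -> float:
    """End-to-end latency: prefill wait plus all decode steps."""
    return ttft_s(input_tokens, prefill_tps) + output_tokens * tbt_s(concurrency, decode_tps)


# 1,024-token prompt, 256 generated tokens, a single user.
print(round(ttft_s(1024, 19500.0) * 1000, 1), "ms TTFT")
print(round(tbt_s(1, 109.5) * 1000, 1), "ms TBT")
print(round(total_latency_s(1024, 256, 19500.0, 1, 109.5), 2), "s total")
```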

Percentiles

The tool reports three latency levels, each based on a different bandwidth utilization assumption:

  • P50: expected throughput (80% bandwidth utilization)
  • P95: pessimistic throughput (70% bandwidth utilization)
  • P99: P95 x 1.3 (additional variance for tail latency from scheduling delays, memory allocation, and garbage collection pauses)

These are not true statistical percentiles derived from a distribution of observed latencies. They represent the performance range that arises from bandwidth utilization variance alone. True percentiles would require modeling request arrival patterns, sequence length distributions, and framework-specific scheduling behavior, none of which the tool attempts.
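The three reported levels can be derived from any latency function that takes a bandwidth utilization as input. In the sketch below, `latency_at` is a hypothetical callable standing in for the full latency calculation, and the example models decode time only for 256 tokens on the 8B configuration used earlier.

```python
from typing import Callable, Dict

P50_UTILIZATION = 0.80   # expected bandwidth utilization
P95_UTILIZATION = 0.70   # pessimistic bandwidth utilization
TAIL_MULTIPLIER = 1.3    # extra variance applied on top of P95


def latency_levels(latency_at: Callable[[float], float]) -> Dict[str, float]:
    """Map a utilization-to-latency function onto the three reported levels."""
    p50 = latency_at(P50_UTILIZATION)
    p95 = latency_at(P95_UTILIZATION)
    return {"p50": p50, "p95": p95, "p99": p95 * TAIL_MULTIPLIER}


def decode_only_latency(utilization: float) -> float:
    """Seconds to decode 256 tokens from a 14.9 GB model at 2,039 GB/s peak."""
    return 256 * 14.9 / (2039.0 * utilization)


print({k: round(v, 2) for k, v in latency_levels(decode_only_latency).items()})
```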

11. Scoring and ranking

The tool ranks GPU configurations using a multi-criteria scoring system.

The default ranking ("best fit") applies the following priority order:

  1. Meets latency target: configurations that satisfy the user's latency constraint rank above those that do not
  2. Fewer GPUs: single-GPU configurations rank above multi-GPU setups, reflecting the operational simplicity of fewer devices
  3. Higher VRAM utilization: a configuration using 85% of available VRAM is a better fit than one using 20%, as it indicates the GPU is well-matched to the workload
  4. Higher throughput: used as a tiebreaker when other criteria are equal

Users can also select alternative ranking criteria:

  • Utilization: highest VRAM usage first
  • Lowest latency: fastest response time
  • Throughput: highest tokens per second
  • Lowest cost: cheapest per hour
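One way to express the default "best fit" order is a lexicographic sort key, sketched below. Whether the tool uses a strict priority sort or a weighted score is not stated above, so the strict ordering is an assumption, and the `Config` fields and example entries are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Config:
    name: str
    meets_latency: bool
    gpu_count: int
    vram_utilization: float  # 0.0 to 1.0
    throughput: float        # tokens per second


def best_fit_key(c: Config) -> Tuple[bool, int, float, float]:
    """Sort key following the four priorities in the list above."""
    # False sorts before True, so configurations meeting the target come first.
    return (not c.meets_latency, c.gpu_count, -c.vram_utilization, -c.throughput)


def rank_best_fit(configs: List[Config]) -> List[Config]:
    return sorted(configs, key=best_fit_key)


candidates = [
    Config("8x GPU A", True, 8, 0.50, 900.0),
    Config("1x GPU B", True, 1, 0.85, 100.0),
    Config("1x GPU C", False, 1, 0.90, 500.0),
    Config("1x GPU D", True, 1, 0.20, 300.0),
]
print([c.name for c in rank_best_fit(candidates)])
```

The alternative criteria would swap in a different key, for example `lambda c: -c.throughput` for the throughput ranking.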

Known limitations

  1. Architecture estimation: The tool interpolates architectural parameters from a set of known model architectures. Models with non-standard designs (Mixture of Experts, unusually deep or shallow networks) may have KV cache requirements that differ significantly from the interpolated values.

  2. No data parallelism modeling: The tool calculates capacity for a single serving instance. Workloads that exceed a single instance's capacity require multiple replicas, but the tool does not model or recommend replica counts.

  3. Throughput is theoretical: The roofline model provides the hardware's theoretical capability. Real-world throughput depends on framework implementation quality, CUDA kernel efficiency, the specific attention algorithm used, and numerous other runtime factors. In practice, expect 70-90% of the tool's estimate.

  4. Latency percentiles are approximate: True P95/P99 latency is a function of queuing theory, request arrival patterns, and framework scheduling, none of which the tool models. The reported percentiles reflect bandwidth utilization variance only.

  5. No memory fragmentation modeling: CUDA memory fragmentation can reduce effective VRAM by 5-15% over the lifetime of a long-running service. The 88% usable VRAM fraction partially accounts for this, but services that have been running for extended periods may experience higher fragmentation than the model assumes.

  6. Pricing data is point-in-time: Cloud GPU pricing changes frequently. The tool uses the latest available data from the Inferbase providers database, but these figures may not reflect current spot prices or reserved instance discounts.


Try the GPU Capacity Planning Calculator with your own model parameters. For a practical guide on choosing GPUs for production inference, see GPU Sizing Guide for LLM Inference.
