Running large language models in production requires careful hardware planning. Under-provision and your service crawls; over-provision and you're burning money. This guide helps you get GPU sizing right the first time.
## Understanding GPU Memory Requirements
The most critical factor in GPU selection is VRAM — the memory available on the GPU. LLMs must fit their parameters into VRAM to run efficiently.
### The Basic Formula
A rough estimate for model memory requirements:
Memory (GB) ≈ Parameters (B) × Bytes per Parameter
For different precisions:
| Precision | Bytes per Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 52 GB | 280 GB |
| FP16/BF16 | 2 | 14 GB | 26 GB | 140 GB |
| INT8 | 1 | 7 GB | 13 GB | 70 GB |
| INT4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |
💡 Tip: These numbers cover model weights only. You also need memory for the KV cache, activations, and framework overhead — typically 20-40% more.
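The formula and table above can be expressed as a small helper. This is a minimal sketch; the function name and the `overhead` parameter are illustrative, not from any particular framework:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 0.0) -> float:
    """Weights-only VRAM estimate.

    `overhead` adds fractional headroom for KV cache, activations, and
    framework buffers (e.g. 0.3 for 30%).
    """
    return params_billion * bytes_per_param * (1.0 + overhead)

# A 70B model in FP16, weights only:
weight_memory_gb(70, 2)                  # 140.0 GB
# The same model with 30% serving overhead:
weight_memory_gb(70, 2, overhead=0.3)    # 182.0 GB
```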
### KV Cache: The Hidden Memory Consumer
For each concurrent request, the model maintains a key-value cache that grows with sequence length:
```python
def estimate_kv_cache_gb(
    num_layers: int,
    hidden_size: int,
    num_attention_heads: int,
    num_kv_heads: int,
    sequence_length: int,
    batch_size: int,
    precision_bytes: int = 2,
) -> float:
    """Estimate KV cache memory in GB."""
    # Head dimension is derived from the attention head count; with
    # grouped-query attention (GQA), num_kv_heads < num_attention_heads,
    # so dividing hidden_size by num_kv_heads would overestimate the cache.
    head_dim = hidden_size // num_attention_heads
    cache_per_token = (
        2  # key + value
        * num_layers
        * num_kv_heads
        * head_dim
        * precision_bytes
    )
    total_bytes = cache_per_token * sequence_length * batch_size
    return total_bytes / (1024 ** 3)
```
For a 70B-class model such as Llama 3 70B (80 layers, 8 KV heads, head dim 128), a 32K context at batch size 8 puts the KV cache alone at roughly 80 GB, about 10 GB per request.
## Choosing the Right GPU

### Consumer vs. Data Center GPUs
For production inference, data center GPUs are strongly recommended:
| GPU | VRAM | Use Case |
|---|---|---|
| NVIDIA A100 80GB | 80 GB | Standard production inference |
| NVIDIA H100 SXM | 80 GB | High-throughput, fastest inference |
| NVIDIA H200 SXM | 141 GB | Large models without sharding |
| AMD MI300X | 192 GB | Cost-effective for very large models |
| NVIDIA A10G | 24 GB | Small models, budget deployments |
### Single GPU vs. Multi-GPU
If your model doesn't fit on a single GPU, you'll need tensor parallelism:
- 7B-13B models: Single A100 or H100
- 34B-70B models: 2-4 GPUs with tensor parallelism
- 70B+ models: 4-8 GPUs, consider pipeline parallelism
⚠️ Warning: Multi-GPU setups add latency from inter-GPU communication. NVLink-connected GPUs (like H100 SXM) minimize this overhead. PCIe connections can become a bottleneck.
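A rough feasibility check for a tensor-parallel deployment can be sketched as follows. This is a simplified model that assumes weights and KV cache shard evenly across GPUs; the function name and the default `overhead_frac` are illustrative:

```python
def fits_on_gpus(weight_gb: float, kv_cache_gb: float, num_gpus: int,
                 vram_per_gpu_gb: float, overhead_frac: float = 0.2) -> bool:
    """True if a tensor-parallel deployment plausibly fits in VRAM.

    Assumes weights and KV cache shard roughly evenly across GPUs and
    reserves `overhead_frac` headroom for activations and buffers.
    """
    per_gpu_gb = (weight_gb + kv_cache_gb) / num_gpus
    return per_gpu_gb * (1.0 + overhead_frac) <= vram_per_gpu_gb

# A 70B FP16 model (140 GB weights, 20 GB KV cache) on A100 80GB GPUs:
fits_on_gpus(140, 20, num_gpus=2, vram_per_gpu_gb=80)  # False: needs 96 GB/GPU
fits_on_gpus(140, 20, num_gpus=4, vram_per_gpu_gb=80)  # True: needs 48 GB/GPU
```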
## Cost Optimization Strategies

### 1. Quantization
Reducing precision is the most impactful optimization:
- FP16 → INT8: ~50% memory reduction, minimal quality loss
- FP16 → INT4 (GPTQ/AWQ): ~75% memory reduction, slight quality loss
- FP16 → GGUF Q4_K_M: Good balance for CPU+GPU hybrid inference
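The savings in the list above follow directly from bytes per parameter; a quick sanity check (the function name is illustrative):

```python
def memory_reduction(bytes_from: float, bytes_to: float) -> float:
    """Fraction of weight memory saved when changing precision."""
    return 1.0 - bytes_to / bytes_from

memory_reduction(2, 1)    # FP16 -> INT8: 0.50 (50% saved)
memory_reduction(2, 0.5)  # FP16 -> INT4: 0.75 (75% saved)
```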
### 2. Right-Size Your Context Window
Don't allocate 128K context if your average request uses 2K. Configure your serving framework to limit max sequence length.
### 3. Batching
Dynamic batching dramatically improves throughput:
Throughput improvement with batching:

```
Batch  1:   100 tokens/sec  (baseline)
Batch  8:   650 tokens/sec  (6.5x)
Batch 32:  1800 tokens/sec  (18x)
```
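Batching also drives down cost per generated token. Assuming a hypothetical $4/hour GPU, the throughput figures above translate as follows (a sketch; the hourly rate and function name are illustrative):

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M generated tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

cost_per_million_tokens(4.0, 100)    # batch 1:  ~$11.11 per 1M tokens
cost_per_million_tokens(4.0, 1800)   # batch 32: ~$0.62 per 1M tokens
```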
### 4. Spot/Preemptible Instances
For non-latency-critical workloads, spot instances can reduce GPU costs by 60-70%.
## Quick Sizing Reference
For common models running in FP16 with typical serving overhead:
| Model | Min VRAM | Recommended GPU | Monthly Cost (Cloud) |
|---|---|---|---|
| Llama 3 8B | 18 GB | 1× A10G | ~$400 |
| Mistral 7B | 16 GB | 1× A10G | ~$400 |
| Llama 3 70B | 150 GB | 2× A100 80GB | ~$4,000 |
| Mixtral 8×7B | 90 GB | 2× A100 80GB | ~$4,000 |
| Llama 3 70B (INT4) | 40 GB | 1× A100 80GB | ~$2,000 |
## Next Steps
- Estimate your requirements using the formulas above
- Use our GPU sizing calculator for instant recommendations
- Compare GPU options in our infrastructure catalog
- Start small — launch with quantized models and scale up based on quality needs
The right GPU setup balances performance, cost, and headroom for growth. Don't over-invest upfront — modern serving frameworks make it easy to scale horizontally as demand grows.