GPU Sizing Guide for LLM Inference in Production

Inferbase Team · February 28, 2026 · 4 min read

Running large language models in production requires careful hardware planning. Under-provision and your service crawls; over-provision and you're burning money. This guide helps you get GPU sizing right the first time.

Understanding GPU Memory Requirements

The most critical factor in GPU selection is VRAM — the memory available on the GPU. LLMs must fit their parameters into VRAM to run efficiently.

The Basic Formula

A rough estimate for model memory requirements:

Memory (GB) ≈ Parameters (B) × Bytes per Parameter

For different precisions:

| Precision | Bytes per Param | 7B Model | 13B Model | 70B Model |
|-----------|-----------------|----------|-----------|-----------|
| FP32      | 4               | 28 GB    | 52 GB     | 280 GB    |
| FP16/BF16 | 2               | 14 GB    | 26 GB     | 140 GB    |
| INT8      | 1               | 7 GB     | 13 GB     | 70 GB     |
| INT4      | 0.5             | 3.5 GB   | 6.5 GB    | 35 GB     |

💡 Tip: These numbers cover model weights only. You also need memory for the KV cache, activations, and framework overhead — typically 20-40% more.
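The weights formula plus that overhead margin can be wrapped in a small helper. This is a rough planning sketch; the 30% default overhead is an assumption sitting in the middle of the 20-40% range above:

```python
def estimate_model_memory_gb(
    params_billions: float,
    bytes_per_param: float,
    overhead_fraction: float = 0.3,  # assumed margin for KV cache, activations, framework
) -> float:
    """Rough VRAM estimate: weights plus a flat serving-overhead margin."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead_fraction)

# 70B model in FP16: 140 GB of weights, ~182 GB with 30% overhead
print(round(estimate_model_memory_gb(70, 2), 1))
```

At long contexts or large batch sizes the KV cache can blow well past a flat margin, which is why it deserves its own estimate below.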

KV Cache: The Hidden Memory Consumer

For each concurrent request, the model maintains a key-value cache that grows with sequence length:

```python
def estimate_kv_cache_gb(
    num_layers: int,
    hidden_size: int,
    num_attention_heads: int,
    num_kv_heads: int,
    sequence_length: int,
    batch_size: int,
    precision_bytes: int = 2,  # FP16/BF16
) -> float:
    """Estimate KV cache memory in GB."""
    # Head dim comes from the attention heads; with grouped-query
    # attention (GQA), num_kv_heads is smaller than num_attention_heads.
    head_dim = hidden_size // num_attention_heads
    cache_per_token = (
        2  # key + value
        * num_layers
        * num_kv_heads
        * head_dim
        * precision_bytes
    )
    total_bytes = cache_per_token * sequence_length * batch_size
    return total_bytes / (1024 ** 3)
```

For a 70B-class model serving a batch of 8 requests at 32K context, the KV cache alone can consume 80 GB or more.
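Plugging in Llama-3-70B-shaped numbers makes the point concrete. The architecture values here are assumptions for illustration (80 layers, 8 KV heads, head dim 128, FP16):

```python
# KV cache per token: 2 (K+V) × layers × kv_heads × head_dim × bytes
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # 327,680 bytes (~320 KB)

# Batch of 8 requests, each with a 32K context
tokens = 8 * 32_768
kv_cache_gb = per_token * tokens / (1024 ** 3)
print(round(kv_cache_gb, 1))  # → 80.0
```

Roughly 320 KB per token sounds small until you multiply by a quarter-million in-flight tokens.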

Choosing the Right GPU

Consumer vs. Data Center GPUs

For production inference, data center GPUs are strongly recommended:

| GPU              | VRAM   | Use Case                             |
|------------------|--------|--------------------------------------|
| NVIDIA A100 80GB | 80 GB  | Standard production inference        |
| NVIDIA H100 SXM  | 80 GB  | High-throughput, fastest inference   |
| NVIDIA H200 SXM  | 141 GB | Large models without sharding        |
| AMD MI300X       | 192 GB | Cost-effective for very large models |
| NVIDIA A10G      | 24 GB  | Small models, budget deployments     |

Single GPU vs. Multi-GPU

Model size determines how many GPUs you need; a model that doesn't fit on a single GPU requires tensor parallelism:

  • 7B-13B models: Single A100 or H100
  • 34B-70B models: 2-4 GPUs with tensor parallelism
  • 70B+ models: 4-8 GPUs, consider pipeline parallelism

⚠️ Warning: Multi-GPU setups add latency from inter-GPU communication. NVLink-connected GPUs (like H100 SXM) minimize this overhead. PCIe connections can become a bottleneck.

Cost Optimization Strategies

1. Quantization

Reducing precision is the most impactful optimization:

  • FP16 → INT8: ~50% memory reduction, minimal quality loss
  • FP16 → INT4 (GPTQ/AWQ): ~75% memory reduction, slight quality loss
  • FP16 → GGUF Q4_K_M: Good balance for CPU+GPU hybrid inference

2. Right-Size Your Context Window

Don't allocate 128K context if your average request uses 2K. Configure your serving framework to limit max sequence length.
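The savings from capping context are easy to quantify, since worst-case KV cache scales linearly with the max sequence length. A sketch for an 8B-class model, using assumed architecture values (32 layers, 8 KV heads, head dim 128, FP16) at batch size 16:

```python
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # bytes per cached token

def worst_case_cache_gb(max_len: int, batch: int) -> float:
    """KV cache if every request in the batch fills the context cap."""
    return per_token * max_len * batch / (1024 ** 3)

print(round(worst_case_cache_gb(128_000, 16), 1))  # 128K cap
print(round(worst_case_cache_gb(8_000, 16), 1))    # 8K cap, same batch
```

Dropping the cap from 128K to 8K shrinks the worst-case reservation 16×, freeing VRAM for larger batches instead.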

3. Batching

Dynamic batching dramatically improves throughput:

Throughput improvement with batching:
Batch 1:   100 tokens/sec
Batch 8:   650 tokens/sec  (6.5x)
Batch 32:  1800 tokens/sec (18x)
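Those throughput gains translate directly into cost per token. Using the figures above and a hypothetical $4/hour GPU (an assumed rate, for illustration only):

```python
gpu_cost_per_hour = 4.00  # assumed hourly rate for illustration

for batch, tokens_per_sec in [(1, 100), (8, 650), (32, 1800)]:
    # Dollars per million tokens = hourly cost / tokens generated per hour
    cost_per_million = gpu_cost_per_hour / (tokens_per_sec * 3600) * 1_000_000
    print(f"batch {batch:>2}: ${cost_per_million:.2f} per 1M tokens")
```

The same GPU goes from roughly $11 to under $1 per million tokens — batching is usually the cheapest optimization you can ship.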

4. Spot/Preemptible Instances

For non-latency-critical workloads, spot instances can reduce GPU costs by 60-70%.

Quick Sizing Reference

For common models running in FP16 with typical serving overhead:

| Model              | Min VRAM | Recommended GPU | Monthly Cost (Cloud) |
|--------------------|----------|-----------------|----------------------|
| Llama 3 8B         | 18 GB    | 1× A10G         | ~$400                |
| Mistral 7B         | 16 GB    | 1× A10G         | ~$400                |
| Llama 3 70B        | 150 GB   | 2× A100 80GB    | ~$4,000              |
| Mixtral 8×7B       | 90 GB    | 2× A100 80GB    | ~$4,000              |
| Llama 3 70B (INT4) | 40 GB    | 1× A100 80GB    | ~$2,000              |

Next Steps

  1. Estimate your requirements using the formulas above
  2. Use our GPU sizing calculator for instant recommendations
  3. Compare GPU options in our infrastructure catalog
  4. Start small — launch with quantized models and scale up based on quality needs

The right GPU setup balances performance, cost, and headroom for growth. Don't over-invest upfront — modern serving frameworks make it easy to scale horizontally as demand grows.

Tags: GPU, inference, infrastructure, cost optimization
