Llama 4 Scout is the first MoE release that a meaningful number of enterprise teams are actually trying to serve themselves. Meta dropped a 109B-total / 17B-active model with a 1M context window, the open weights are real, and the throughput-per-dollar story sounds appealing on the rate cards of the providers that already host it. The sizing question that comes up almost immediately ("how many H100s do I need?") turns out to have a different answer than every public GPU calculator gives you, and the gap matters.
The short version: Scout looks like a 17B model when you measure compute and like a 109B model when you measure memory, and getting that decoupling right is the difference between a deployment that works and a deployment that OOMs on the first request. The longer version is what this post is about. We'll walk through the actual VRAM budget, the throughput math, the parallelism plan that fits, and the cost of running Scout in production at three traffic tiers.
How Mixture-of-Experts changes the sizing math
A dense transformer reads every parameter for every token. A 70B dense model reads 70B parameters per token, and the memory bandwidth required to do that 60-100 times per second is the bottleneck on H100-class GPUs. The roofline math is symmetric: weights read = compute consumed, and both scale with the parameter count.
Mixture-of-Experts breaks that symmetry. Scout has 109B total parameters but only 17B active per token: a router decides which of the 16 experts in each MoE layer handle a given token's forward pass. The experts that aren't picked sit in VRAM doing nothing on that particular token, but they have to be there in case the router picks them on the next token. So:
- Memory holds total params. All 109B parameters must be resident in VRAM. At FP16 (2 bytes/param) that's 203 GB before you account for KV cache, activations, or runtime overhead. At FP8 it drops to ~110 GB.
- Compute reads active params per token. Throughput, prefill speed, and per-token latency all scale with the 17B active count, not the 109B total.
The practical implication: Scout uses the memory budget of a 109B model and the compute budget of a 17B model. Sizing tools that conflate the two (and most do) get one of these dimensions wildly wrong.
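To make the decoupling concrete, here's a back-of-the-envelope sketch in Python. It's a rough sizing sketch, not the planner's actual engine: it assumes the standard 2 bytes/param for FP16 and 1 byte/param for FP8, and ignores KV cache and runtime overhead, which the next section adds back in.

```python
# Memory follows total params; compute follows active params.
TOTAL_PARAMS = 109e9    # every expert must be resident in VRAM
ACTIVE_PARAMS = 17e9    # parameters actually read/multiplied per token

def weight_gib(params, bytes_per_param):
    """Resident weight memory in GiB; routed or not, all experts are loaded."""
    return params * bytes_per_param / 2**30

def decode_flops_per_token(active_params):
    """~2 FLOPs per active parameter per generated token (multiply + add)."""
    return 2 * active_params

print(weight_gib(TOTAL_PARAMS, 2))       # ~203 GiB of FP16 weights (the table below)
print(weight_gib(TOTAL_PARAMS, 1))       # ~102 GiB at FP8
print(weight_gib(ACTIVE_PARAMS, 2))      # ~32 GiB -- the "dense-17B" trap: no room for the other experts
print(decode_flops_per_token(ACTIVE_PARAMS) / 1e9)  # ~34 GFLOPs/token, same as a dense 17B model
```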
You can see the difference in the GPU Capacity Planner. Open the Mixture-of-Experts panel in the Model step and pick the Llama 4 Scout preset; the engine routes the 109B figure to memory and the 17B figure to compute automatically. Toggle the active-params field off (treating Scout as dense-109B) and the projected throughput drops to roughly 1/6 of what the model actually delivers. Toggle the total down to 17B (treating Scout as dense-17B) and the VRAM budget falls below 50 GB, a config that will OOM the moment the runtime tries to load the other 92B parameters.
The VRAM budget at a glance
The component breakdown for Llama 4 Scout at FP16 with 8K context, 16 concurrent requests:
| Component | Size |
|---|---|
| Model weights (109B × 2 bytes) | 203.0 GB |
| KV cache per request (8K tokens × 32 layers × 8 KV heads × 128 head_dim × 2 bytes) | 0.5 GB |
| KV cache total (× 16 batch) | 8.0 GB |
| Activation memory (~3% of weights) | 6.1 GB |
| Runtime overhead (vLLM, ~2%) | 4.3 GB |
| Total per cluster | ~221 GB |
That doesn't fit on a single 80 GB H100. Spread the model across multiple GPUs via tensor parallelism and the per-GPU memory scales down as 1/TP for weights and KV cache (a quick sketch follows the list below). The minimum viable configurations:
- 4× H100 SXM 80 GB at FP16: ~55 GB per GPU. Comfortable headroom. The default starting point.
- 2× H200 141 GB at FP16: ~110 GB per GPU. Also comfortable, fewer GPUs to coordinate.
- 1× B200 192 GB at FP8: single GPU, ~110 GB used (FP16 weights alone are 203 GB and wouldn't fit). If you have access.
- 2× H100 SXM at FP8: ~55 GB per GPU. Halved precision halves the weight memory; FP8 quality loss on Scout is small enough that this is the right config for most production workloads on Hopper.
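A minimal sketch of the per-GPU arithmetic behind those configurations, assuming the ~221 GB cluster total from the table and an even split of every component across TP ranks (a simplification; in practice activations and some runtime overhead are partly replicated per rank):

```python
CLUSTER_TOTAL_GB = 221  # FP16 total from the table above

def per_gpu_gb(total_gb, tp):
    # Tensor parallelism shards weights and KV cache across ranks,
    # so the per-GPU footprint is roughly total / TP.
    return total_gb / tp

print(per_gpu_gb(CLUSTER_TOTAL_GB, 4))      # ~55 GB -> 4x H100 80 GB with headroom
print(per_gpu_gb(CLUSTER_TOTAL_GB, 2))      # ~110 GB -> 2x H200 141 GB
print(per_gpu_gb(CLUSTER_TOTAL_GB / 2, 2))  # ~55 GB -> 2x H100 at FP8 (weight memory roughly halved)
```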
KV cache scales with context length and batch size. Going from 8K to 32K context quadruples the KV portion (from 8 GB to 32 GB at batch=16); pushing the batch to 64 quadruples it again, to 128 GB. Long-context, high-concurrency workloads pressure the headroom hard, and the practical batch ceiling at full 1M context is uncomfortably low. The engine warns you when you're approaching the limit.
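The scaling is linear in both knobs, so you can extrapolate straight from the table's 8 GB baseline. A hedged sketch (assumes plain linear growth, no prefix caching or sharing savings):

```python
def kv_pool_gb(context_tokens, batch, baseline_gb=8.0, base_context=8192, base_batch=16):
    # KV cache grows linearly with context length and with the number of
    # concurrent sequences; baseline is the 8 GB @ 8K context / batch=16 row above.
    return baseline_gb * (context_tokens / base_context) * (batch / base_batch)

print(kv_pool_gb(32_768, 16))    # 32 GB: 4x the context
print(kv_pool_gb(32_768, 64))    # 128 GB: 4x the batch on top; no longer fits beside FP16 weights on 4x H100
print(kv_pool_gb(1_000_000, 2))  # ~122 GB: even two full-context requests dominate the budget
```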
Throughput: what to actually expect
The roofline model says decode throughput per GPU equals min(batch × bandwidth / weight_bytes, FLOPS / (2 × params_per_GPU)). For Scout, the relevant weight_bytes is the active 17B portion (only those weights are streamed per decode step), and the relevant params_per_GPU is also the active count, not the total.
On 4× H100 SXM with vLLM and batch=16:
- Memory-bound decode throughput for the cluster: 16 × 3.35 TB/s × 0.8 / (17 GB of active weights ÷ 4 GPUs) ≈ 10,000 tok/s before TP communication overhead. (This treats the 17B active parameters as ~17 GB streamed per step, i.e. roughly one byte per parameter; at FP16 the active weights are 34 GB and the bound halves.)
- TP=4 communication penalty over NVLink: ~15-20%.
- Net per-cluster decode throughput: ~7,500-8,500 tok/s aggregate.
At an average response length of 500 tokens, that's roughly 15-17 requests per second sustained. P95 latency on a 1000-token input + 500-token output lands around 1.5-2.0 seconds end-to-end at this load.
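The same chain as a small script, with the assumptions spelled out: the bandwidth efficiency, the ~17 GB of streamed active weights, and the NVLink penalty are the knobs to adjust for your own setup.

```python
# Decode roofline sketch for 4x H100 SXM, batch=16 (memory-bound regime).
HBM_BW_GBPS = 3350          # H100 SXM HBM3 bandwidth, GB/s
BW_EFFICIENCY = 0.8         # achievable fraction of peak
TP = 4
ACTIVE_WEIGHT_GB = 17.0     # bytes streamed per decode step (~1 byte per active param; use 34 for FP16)
BATCH = 16
TP_PENALTY = (0.15, 0.20)   # NVLink all-reduce overhead

step_s = (ACTIVE_WEIGHT_GB / TP) / (HBM_BW_GBPS * BW_EFFICIENCY)  # time to stream one weight shard
cluster_tok_s = BATCH / step_s                                    # ~10,000 tok/s before comms
net_tok_s = [cluster_tok_s * (1 - p) for p in TP_PENALTY]         # ~8,000-8,600, in line with the range above
req_per_s = [t / 500 for t in net_tok_s]                          # ~16-17 req/s at 500-token responses
print(cluster_tok_s, net_tok_s, req_per_s)
```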
The headline that matters: per-token decode throughput on Scout matches a dense 17B model on the same hardware. You aren't paying compute for the parameters you don't activate. This is what makes MoE economics work, since the cost-per-token approaches a 17B baseline, not the 109B baseline you'd guess from the model card.
Cost across three traffic tiers
Combining the 4× H100 SXM config with current per-hour rates from our pricing audit:
| Provider | $/hr (4× H100 SXM) | $/mo (24/7) | $/M tokens at 7,500 tok/s |
|---|---|---|---|
| Runpod (Secure Cloud) | $11.96 | $8,610 | $0.44 |
| Lambda Cloud | $17.16 | $12,355 | $0.64 |
| CoreWeave | $20.00 | $14,400 | $0.74 |
| AWS p5.48xlarge (half of the 8-GPU instance) | ~$24.00 | $17,280 | $0.89 |
At sustained production load, that's $0.44-$0.89 per million output tokens self-hosted. For comparison, Bedrock serves Scout at $0.17/M blended, and Groq runs Llama 4 Scout at $0.11 input / $0.34 output. Self-hosting only beats those rates when you have enough steady traffic to justify paying for the GPU 24/7, typically above 30B tokens per month per cluster. Below that threshold, calling Bedrock or Groq is strictly cheaper, and you skip the operational burden.
The crossover math is the same one we covered in the self-host vs serverless TCO analysis, just specialized to Scout's per-cluster cost. The general rule holds: serverless MoE inference is shockingly cheap at low-to-medium volume, and self-hosting only makes sense when your traffic floor justifies a kept-warm cluster.
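For completeness, the $/M-token column and the crossover logic reduce to two one-liners. The break-even threshold is sensitive to your input/output mix and to whichever blended API rate you compare against, so treat it as directional; the rates and throughput below are assumptions matching the table above.

```python
def dollars_per_m_tokens(cluster_per_hour, tok_per_s):
    # $/hr divided by tokens generated per hour, scaled to millions of tokens.
    return cluster_per_hour / (tok_per_s * 3600) * 1e6

def breakeven_tokens_per_month_b(cluster_per_month, api_rate_per_m):
    # Monthly volume (billions of tokens) at which the 24/7 cluster bill
    # equals the API bill at the given $/M rate.
    return cluster_per_month / api_rate_per_m / 1e3

print(dollars_per_m_tokens(11.96, 7500))  # ~$0.44, the Runpod row above
print(dollars_per_m_tokens(24.00, 7500))  # ~$0.89, the AWS row
```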
What teams get wrong
Three specific failure patterns surface repeatedly:
Treating Scout as dense-17B in the sizing tool. The most common mistake. The provisioning team picks a sizing config that fits "17B at FP16" (roughly 35 GB) and ships a single-H100 deployment. The deployment never gets past weight loading: the actual footprint is roughly 6× what was provisioned, so the runtime OOMs before serving a single request. Fix: always specify both total and active parameters when sizing MoE models.
Skipping FP8 on Hopper. Scout's quality at FP8 is nearly indistinguishable from FP16 on standard benchmarks (within 1-2 points on MMLU-Pro), but the memory savings let you halve your GPU count or double your batch capacity. Teams that default to FP16 because it's familiar are over-provisioning by 2× without realizing it.
Running with default vLLM batch ceilings on long-context. The 1M context window is real but pressures KV cache hard. vLLM sizes its KV block pool from whatever memory is left after the weights load, and a handful of simultaneous long-context requests can exhaust that pool and force other batched requests to be preempted. For production deployments using >32K context, set explicit --max-num-seqs and --max-model-len flags to bound the worst case.
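A hedged example of what that looks like through vLLM's Python entry point. The model id and the exact limits are illustrative assumptions, not a tested production config; the same knobs exist as --max-model-len, --max-num-seqs, and --quantization on the serve CLI.

```python
from vllm import LLM, SamplingParams

# Illustrative settings: FP8 weights on 4 Hopper GPUs, context capped well below
# the model's maximum, and an explicit sequence ceiling so a few long-context
# requests can't preempt the rest of the batch.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF repo id
    tensor_parallel_size=4,
    quantization="fp8",
    max_model_len=32_768,         # bound the per-request KV footprint
    max_num_seqs=16,              # bound concurrency so the KV pool can't be exhausted
    gpu_memory_utilization=0.90,  # leave headroom for activations and CUDA graphs
)

out = llm.generate(
    ["Summarize the sizing tradeoffs of MoE inference."],
    SamplingParams(max_tokens=500),
)
```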
When self-hosting Scout is the right call
Self-hosting Llama 4 Scout makes economic sense when:
- Traffic floor justifies 24/7 utilization (typically 30B+ output tokens per month sustained, not just peak hours).
- Data residency requirements rule out US-region API providers like Bedrock or Groq.
- Latency tail requirements are tighter than what API providers offer (sub-100ms TTFT is realistic on a kept-warm self-host, harder on shared serverless infrastructure).
- Custom fine-tuning is part of the workflow. You can't run a fine-tuned Scout on Bedrock without a custom-model deployment that costs more than self-hosting.
If none of those apply, the API path is faster to ship, cheaper at most realistic volumes, and operationally simpler. Use the GPU Capacity Planner to model your specific volume and latency targets against both lanes before committing to either.
For a deeper look at how the calculator's roofline model works, see our methodology post. For the broader context on inference economics in 2026, the pricing audit covers the same self-host vs serverless decision across the full frontier model lineup.