Llama 3.3 70B Sizing Across H100, H200, and B200

Inferbase Team · 7 min read

Llama 3.3 70B is the workhorse of the open-weight frontier in 2026. It's the model most teams actually ship in production for non-reasoning workloads: chat, RAG, summarization, classification, structured extraction. The sizing question has gotten more nuanced now that three NVIDIA datacenter generations coexist in pricing tables: H100 (Hopper, 80 GB), H200 (Hopper refresh, 141 GB), and B200 (Blackwell, 192 GB). Each delivers different VRAM headroom and different throughput per dollar.

This post walks through how Llama 3.3 70B actually performs on each tier, with the numbers we see in production deployments and the GPU Capacity Planner. The goal is to make the "which tier do I pick?" question quantitative.

The memory budget across the three GPU generations

Llama 3.3 70B at FP16 occupies ~140 GB of weights, which puts it in an awkward spot for the H100 80GB ecosystem: it doesn't fit on a single GPU and needs 2× with tensor parallelism. FP8 halves that to ~70 GB, which technically fits on a single H100 80GB but leaves almost nothing for KV cache under long-context production load. INT4 drops the weights to ~35 GB, single-GPU friendly but with a significant quality cost.

The viable production configurations:

| Configuration | Weights | KV cache (batch 16, 8K ctx) | Total | Headroom |
|---|---|---|---|---|
| 1× H100 SXM 80GB at INT4 | 35 GB | 8 GB | ~45 GB | ~25 GB |
| 1× H100 SXM 80GB at FP8 | 70 GB | 8 GB | ~78 GB | ~2 GB (too tight) |
| 2× H100 SXM 80GB at FP8 (TP=2) | 35 GB/GPU | 4 GB/GPU | ~40 GB/GPU | comfortable |
| 1× H200 SXM 141GB at FP8 | 70 GB | 8 GB | ~78 GB | ~63 GB (comfortable) |
| 1× B200 192GB at FP8 | 70 GB | 8 GB | ~78 GB | ~114 GB (over-provisioned) |
| 2× H100 SXM 80GB at FP16 (TP=2) | 70 GB/GPU | 4 GB/GPU | ~75 GB/GPU | ~5 GB (tight) |

A few patterns jump out. Single-H100 FP8 looks fine on paper but leaves no room for batch growth or long context: the moment a 32K-context request arrives, you're out of KV memory. The clean single-GPU configuration for FP8 production load is the H200, which has enough headroom to absorb both batch growth and context extension without re-architecting.

For FP16 production, 2× H100 is the cheapest viable option but tight on KV. Bumping to 2× H200 or 1× B200 buys real headroom.
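If you want to reproduce the arithmetic behind the table, a minimal sketch follows. The layer/head geometry is from the published Llama 3.3 70B config; the byte widths and overheads are illustrative assumptions, and the KV figure here is a worst case, which is why it lands above the table's estimate.

```python
# Back-of-envelope memory math behind the table above. The architecture
# constants (80 layers, 8 KV heads, 128 head dim) come from the published
# Llama 3.3 70B config; everything else is an illustrative assumption.

N_PARAMS = 70.6e9                       # Llama 3.3 70B parameter count
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def weights_gb(bytes_per_param: float) -> float:
    """Weight footprint, before a few GB of runtime/engine overhead."""
    return N_PARAMS * bytes_per_param / 1e9

def kv_gb(batch: int, ctx: int, kv_bytes: int = 1) -> float:
    """Worst-case KV cache: K and V, per layer, per KV head, every slot full."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * kv_bytes
    return batch * ctx * per_token / 1e9

print(f"FP16 weights: ~{weights_gb(2.0):.0f} GB")   # ~141 GB: needs 2 GPUs
print(f"FP8  weights: ~{weights_gb(1.0):.0f} GB")   # ~71 GB: tight on one 80 GB card
print(f"INT4 weights: ~{weights_gb(0.5):.0f} GB")   # ~35 GB: single-GPU friendly

# Worst case at batch 16 x 8K with an FP8 KV cache (~21 GB). With paged
# attention the resident footprint is usually much smaller, since most
# sequences sit well below their maximum context at any given moment.
print(f"KV cache, batch 16 x 8K, worst case: ~{kv_gb(16, 8192):.0f} GB")
```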

Throughput tier-by-tier

The roofline math says decode throughput on Llama 3.3 70B is bandwidth-bound at production batch sizes: decode tokens per second per request ≈ (per-GPU bandwidth × achieved utilization) ÷ model bytes per GPU, and aggregate throughput is that figure multiplied by batch size.

| GPU | Memory bandwidth | FP16 TFLOPS | Decode (FP8, batch=8) | Decode (FP8, batch=16) |
|---|---|---|---|---|
| H100 SXM 80GB | 3.35 TB/s | 989 | ~85-110 tok/s | ~140-180 tok/s |
| H200 SXM 141GB | 4.8 TB/s | 989 | ~120-160 tok/s | ~200-260 tok/s |
| B200 192GB | 8.0 TB/s | 2,250 | ~200-280 tok/s | ~340-440 tok/s |

These are aggregate decode throughputs per GPU, with vLLM 0.6+ at FP8 quantization. The exact numbers depend on batch composition, prefix cache hit rate, and TP topology, but the relative scaling tracks bandwidth almost linearly. H200 delivers ~1.4× H100, B200 delivers ~2.0-2.4× H100.
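To sanity-check those ranges against the roofline formula above, here is a minimal sketch. The bandwidth figures are from the table; the 0.3 effective-utilization factor is an assumption chosen to be representative of real engines (it absorbs TP traffic, scheduler gaps, and imperfect overlap), so the outputs are estimates rather than benchmarks.

```python
# Decode roofline: each step streams the FP8 weights plus the resident KV
# cache from HBM and emits one token per sequence in the batch. The 0.3
# effective bandwidth utilization is an assumption, not a measurement.

WEIGHT_BYTES = 70e9                      # Llama 3.3 70B at FP8
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128    # K+V x layers x KV heads x head dim, FP8

def decode_tok_s(bw_bytes_s: float, batch: int, ctx: int, util: float = 0.3) -> float:
    bytes_per_step = WEIGHT_BYTES + batch * ctx * KV_BYTES_PER_TOKEN
    steps_per_s = bw_bytes_s * util / bytes_per_step
    return steps_per_s * batch           # one token per sequence per step

for name, bw in [("H100 SXM", 3.35e12), ("H200 SXM", 4.8e12), ("B200", 8.0e12)]:
    print(f"{name}: ~{decode_tok_s(bw, 8, 8192):.0f} tok/s (batch 8), "
          f"~{decode_tok_s(bw, 16, 8192):.0f} tok/s (batch 16)")
```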

Figure: Llama 3.3 70B decode throughput per GPU at batch 8 / batch 16: H100 SXM ~95/160 tok/s, H200 SXM ~140/230 tok/s, B200 ~240/390 tok/s. The scaling tracks the memory-bandwidth ratios, since decode at this model size is bandwidth-bound.

Prefill is compute-bound rather than bandwidth-bound, so the gap widens for prompt-heavy workloads. B200's FP16 compute is roughly 2.3× that of the H100 SXM (2,250 vs 989 TFLOPS), which translates to ~2-2.5× faster prefill on long inputs.
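A quick way to see the prefill gap is the 2-FLOPs-per-parameter-per-token rule of thumb. The MFU figure below is an assumption, so treat the absolute latencies as rough; the ratio between generations is the robust part.

```python
# Prefill roofline: forward-pass compute is roughly 2 FLOPs per parameter per
# prompt token. The 0.5 MFU (model FLOPs utilization) is an assumption.

N_PARAMS = 70.6e9

def prefill_seconds(prompt_tokens: int, peak_tflops: float, mfu: float = 0.5) -> float:
    flops = 2 * N_PARAMS * prompt_tokens
    return flops / (peak_tflops * 1e12 * mfu)

for name, tflops in [("H100/H200", 989), ("B200", 2250)]:
    print(f"{name}: ~{prefill_seconds(32_768, tflops):.1f} s to prefill a 32K prompt")
```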

Cost per million tokens

Combining the throughput numbers with current per-hour pricing from our pricing audit:

| Configuration | $/hr (cheapest) | Decode tok/s | $/M output tokens |
|---|---|---|---|
| 1× H100 PCIe at INT4 | $1.99 (Runpod) | ~70 | $7.90 |
| 2× H100 SXM at FP8 | $5.98 (Runpod) | ~280 | $5.93 |
| 1× H200 SXM at FP8 | $3.99 (Runpod) | ~140 | $7.92 |
| 2× H200 SXM at FP8 | $7.98 (Runpod) | ~280 | $7.92 |
| 1× B200 at FP8 | $5.49 (Runpod) | ~240 | $6.36 |

These are theoretical minimums at full 24/7 utilization with batch=8. Real-world utilization is typically 30-60% (load varies with traffic), so practical costs run 1.7-3× these numbers.
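The column is simple to reproduce. A minimal sketch, using the 2× H100 row as the example; dropping utilization below 100% shows how idle capacity inflates the effective rate:

```python
# $/M output tokens = hourly price / tokens produced per hour, adjusted for
# how much of each hour the GPUs spend doing useful decode work.

def usd_per_million_tokens(hourly_usd: float, decode_tok_s: float,
                           utilization: float = 1.0) -> float:
    tokens_per_hour = decode_tok_s * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1e6

print(f"2x H100 FP8, 24/7:         ${usd_per_million_tokens(5.98, 280):.2f}/M")
print(f"2x H100 FP8, 40% utilized: ${usd_per_million_tokens(5.98, 280, 0.4):.2f}/M")
```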

For comparison, the cheapest serverless rate for Llama 3.3 70B is DeepInfra at $0.10 input / $0.32 output per million tokens, roughly $0.21 blended. Self-hosting on the cheapest 2× H100 config at full utilization comes out to ~$5.93 per million output tokens, which is 18× more expensive than DeepInfra.

The crossover point: self-hosting beats DeepInfra economically only above ~25-30 billion output tokens per month per cluster, which is the territory of mid-stage SaaS with heavy AI usage or large-scale code automation pipelines. Below that volume, the API path is strictly cheaper and operationally simpler.

Which tier makes sense for which workload

A practical decision framework based on workload characteristics:

Latency-insensitive, high-volume batch processing.

  • Best fit: 2× H100 SXM at FP8. Cheapest tier that fits without quality compromise.
  • Skip H200 unless your context routinely pushes 32K+; the memory headroom isn't worth the price premium.
  • Skip B200 entirely; you're not pushing throughput hard enough to justify the rate.

Customer-facing chat, target sub-2s P95 latency.

  • Best fit: 1× H200 SXM at FP8. A single GPU keeps TP communication overhead out of the latency tail, and the 141 GB card's ~63 GB of headroom absorbs context surprises gracefully.
  • 2× H100 SXM works, but tail latency is slightly worse due to TP all-reduce overhead.

Mixed agent workload, prompt-heavy with long contexts (32K+).

  • Best fit: 1× B200 at FP8. The 192 GB VRAM accommodates 64K+ contexts at production batch without re-architecting. Higher prefill throughput matters when you have 8K-32K input on every request.
  • 2× H200 at FP8 is the workable alternative if B200 isn't available.

Latency-tolerant, embedding-class workload.

  • Best fit: 1× H100 SXM 80GB at INT4. Cheapest possible config; quality loss is acceptable for embedding-tier tasks.
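The framework above condenses into a toy routing function. The branch conditions and labels are editorial shorthand for the bullets, not output from the capacity planner:

```python
# Toy version of the decision framework above; thresholds are editorial
# shorthand, not planner output.

def recommend_tier(quality_critical: bool, long_context: bool,
                   latency_sensitive: bool) -> str:
    if not quality_critical:
        return "1x H100 SXM 80GB, INT4"        # cheapest; quality loss acceptable
    if long_context:                            # 32K+ prompts, prefill-heavy agents
        return "1x B200, FP8 (or 2x H200, FP8)"
    if latency_sensitive:                       # customer-facing chat, sub-2s P95
        return "1x H200, FP8"
    return "2x H100 SXM, FP8"                   # high-volume batch processing
```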

You can model these directly in the calculator: set the workload preset to match your traffic profile and the deployment target to cloud, and the engine produces ranked configurations with cost and latency numbers for each option.

What to actually run

For most production teams shipping Llama 3.3 70B in 2026:

  1. Start serverless on DeepInfra or Together until you're consistently above $1,500-3,000/month spend.
  2. Migrate to 1× H200 SXM at FP8 if your workload is single-GPU friendly and latency matters.
  3. Migrate to 2× H100 SXM at FP8 if you need cheaper per-token cost and can tolerate slightly higher tail latency.
  4. Migrate to B200 only when you outgrow H200, typically when context lengths push past 64K or when batch size needs to scale beyond 16.
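For the self-hosted steps, a minimal sketch of the launch with vLLM's offline API (vLLM 0.6+). The model ID is the official Hugging Face repo; the remaining parameter values are illustrative starting points, not tuned settings.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    quantization="fp8",             # on-the-fly FP8 weight quantization
    tensor_parallel_size=1,         # single H200; set to 2 for the 2x H100 path
    max_model_len=32768,            # cap context to protect KV headroom
    gpu_memory_utilization=0.92,
)

outputs = llm.generate(
    ["Summarize the trade-offs between H200 and B200 for a 70B model."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```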

There's no scenario where FP16 makes sense on production-grade Hopper or Blackwell hardware for 70B-class models in 2026. The FP8 quality loss is small enough, and the memory savings are large enough, that FP16 sits strictly worse on the cost-quality Pareto frontier. Reserve FP16 for evaluation runs where reproducibility against the published model card matters.

For the cost decisions surrounding all of this, the enterprise inference pricing audit covers the broader landscape including the 9× spread on Llama 3.3 70B across hosting providers. The methodology post walks through how the calculator's roofline math actually works.
