
Self-Hosting DeepSeek V3: What It Actually Costs

Inferbase Team · 7 min read

DeepSeek V3 is the most aggressive MoE release in the open-weight frontier so far: 671B total parameters with 37B active per token, trained natively in FP8, and delivering reasoning quality competitive with Claude Opus and GPT-5 on most published benchmarks. The pricing on DeepSeek's own API is the cheapest frontier-class rate on the market right now, so the obvious question is whether self-hosting can possibly compete.

Spoiler: usually not. But the full math is worth walking through, because the answer flips for specific use cases (sustained high-volume traffic, data residency constraints, fine-tuned variants), and getting the sizing wrong on either lane is expensive.

The memory wall

DeepSeek V3's parameter count puts it in a different sizing regime than Llama 4 Scout or Mixtral. At FP16, the model weights alone require:

671B × 2 bytes/param / (1024^3) ≈ 1,249 GB

At FP16 the weights alone exceed an 8× H100 node (640 GB) or an 8× H200 node (1,128 GB). But FP16 is the wrong target anyway: the model was designed around FP8, and DeepSeek publishes the weights natively in FP8, at half the FP16 footprint:

671B × 1 byte/param / (1024^3) ≈ 625 GB

That fits, but only just. The viable configurations:

Configuration      | Per-GPU usage at FP8 | Headroom
8× H100 SXM 80 GB  | ~78 GB               | ~2 GB (uncomfortable)
8× H100 NVL 96 GB  | ~78 GB               | ~18 GB (comfortable)
5× H200 141 GB     | ~125 GB              | ~16 GB (workable)
4× B200 192 GB     | ~156 GB              | ~36 GB (comfortable)
8× MI300X 192 GB   | ~78 GB               | ~114 GB (over-provisioned)
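To sanity-check the table, the per-GPU figures are just the FP8 weight footprint split evenly across the GPUs in each configuration. A minimal sketch in Python, assuming an even weight split and ignoring KV cache, activations, and runtime overhead:

```python
# Weight-memory sizing for DeepSeek V3 (671B params), split evenly across GPUs.
# Assumes FP8 = 1 byte/param; ignores KV cache, activations, framework overhead.
GiB = 1024**3

def weight_gib(total_params: float, bytes_per_param: float) -> float:
    return total_params * bytes_per_param / GiB

configs = [
    ("8x H100 SXM 80 GB", 8, 80),
    ("8x H100 NVL 96 GB", 8, 96),
    ("5x H200 141 GB",    5, 141),
    ("4x B200 192 GB",    4, 192),
    ("8x MI300X 192 GB",  8, 192),
]

fp8_total = weight_gib(671e9, 1.0)                      # ~625 GiB
print(f"FP16 weights: {weight_gib(671e9, 2.0):,.0f} GiB")
print(f"FP8 weights:  {fp8_total:,.0f} GiB")
for name, n_gpus, vram in configs:
    per_gpu = fp8_total / n_gpus
    print(f"{name:20s} per-GPU {per_gpu:6.1f} GiB, headroom {vram - per_gpu:6.1f} GiB")
```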

The 8× H100 SXM 80GB configuration is the cheapest entry point but leaves almost no headroom for KV cache. At 8K context with batch=8, you need an additional ~32 GB of KV cache distributed across the cluster. The math works out to "fits under tight constraints," and production deployments routinely tip over the edge once long-context requests start arriving.

The realistic recommendation for a fresh deployment: 8× H100 NVL 96 GB or 4× B200 192GB. The marginal cost over H100 SXM 80GB pays for itself within weeks in operational sanity (no OOM at 32K+ context, ability to push batch size up for higher throughput).

You can model this directly in the GPU Capacity Planner. Set Mixture-of-Experts to the DeepSeek V3 preset, and the engine routes the 671B figure to memory and the 37B figure to compute. Toggle quantization between FP8 and FP16 to see the GPU count shift.

Throughput and the active-params reality

The compute math tracks the 37B active parameter count, not the 671B total. On the roofline model:

Aggregate decode throughput ≈ batch × per-GPU bandwidth / (active_params_per_GPU × bytes_per_param)

For an 8× H100 SXM cluster with TP=8, FP8 weights, batch=8:

8 × 3.35 TB/s / (37B / 8 × 1 byte) ≈ 5,800 tok/s aggregate per cluster

That's roughly 725 tok/s per GPU, which compares favorably with dense 70B serving on the same hardware (~150 tok/s per GPU). The MoE structure means each decode step only reads 37B-worth of parameters, even though all 671B sit in VRAM.
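The same roofline arithmetic as a short sketch, treating decode as purely bandwidth-bound on the 37B active parameters; real deployments land below this ceiling because of TP communication, KV cache traffic, and sub-peak HBM efficiency:

```python
# Idealized bandwidth-bound decode roofline for a tensor-parallel MoE deployment.
# Ignores KV cache reads, TP all-reduce latency, and expert-routing overhead.
def decode_roofline_tok_s(active_params: float, bytes_per_param: float,
                          n_gpus: int, hbm_bw_tb_s: float, batch: int) -> float:
    bytes_per_gpu_per_step = active_params * bytes_per_param / n_gpus
    step_time_s = bytes_per_gpu_per_step / (hbm_bw_tb_s * 1e12)
    return batch / step_time_s            # aggregate tokens/s across the cluster

agg = decode_roofline_tok_s(37e9, 1.0, n_gpus=8, hbm_bw_tb_s=3.35, batch=8)
print(f"aggregate ~{agg:,.0f} tok/s, per-GPU ~{agg/8:,.0f} tok/s")
# -> ~5,800 tok/s aggregate, ~725 tok/s per GPU on 8x H100 SXM at FP8
```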

In practice, vLLM 0.6+ delivers 1,400-1,800 tok/s aggregate decode throughput on the standard 8× H100 cluster for V3. That's lower than the theoretical ceiling because of TP communication overhead, KV cache movement, and the routing decision adding a small per-token serialization point. At batch=16 the aggregate climbs to ~2,500 tok/s; at batch=32 KV cache pressure becomes the binding constraint.

Per-request latency at typical production load:

Workload                | TTFT    | Total (1K input + 500 output)
Single request, batch=1 | ~150 ms | ~3.5 s
Concurrent, batch=8     | ~250 ms | ~5.5 s
Concurrent, batch=16    | ~400 ms | ~8.0 s

These numbers improve with FP8 KV cache (halves the per-request KV overhead), prefix caching for shared system prompts, and B200's higher bandwidth. They're roughly representative of well-tuned vLLM serving on the H100 cluster.

The cost conversation

Combining the 8× H100 cluster with current pricing:

Provider              | $/hr (8× H100 SXM) | $/mo (24/7) | $/M tokens at 1,800 tok/s
Runpod (Secure Cloud) | $23.92             | $17,222     | $3.65
Lambda Cloud          | $34.32             | $24,710     | $5.24
CoreWeave             | $40.00             | $28,800     | $6.10
AWS p5.48xlarge       | $48.00             | $34,560     | $7.32
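The $/M-token column follows directly from the hourly rate and a sustained decode rate. A minimal sketch, assuming a flat 720-hour month and the ~1,800 tok/s aggregate figure from above (it reproduces the table to within rounding):

```python
# Per-million-token cost of a dedicated cluster at a sustained decode rate.
# Assumes 24/7 utilization (720 h/month); idle hours raise the effective cost.
HOURS_PER_MONTH = 720

def cluster_cost(rate_per_hr: float, tok_per_s: float) -> tuple[float, float]:
    monthly = rate_per_hr * HOURS_PER_MONTH
    tokens_per_month = tok_per_s * 3600 * HOURS_PER_MONTH
    return monthly, monthly / (tokens_per_month / 1e6)   # ($/mo, $/M tokens)

for provider, rate in [("Runpod", 23.92), ("Lambda", 34.32),
                       ("CoreWeave", 40.00), ("AWS p5", 48.00)]:
    monthly, per_m = cluster_cost(rate, tok_per_s=1800)
    print(f"{provider:10s} ${monthly:>9,.0f}/mo  ${per_m:.2f}/M tokens")
```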

For comparison, the DeepSeek API charges:

  • v4-flash: $0.14 input / $0.28 output → blended ~$0.21 per million tokens
  • v4-pro (promotional): $0.435 / $0.87 → blended ~$0.65
  • v4-pro (post-promo): $1.74 / $3.48 → blended ~$2.61

[Figure: monthly cost vs. token volume (log-log). The DeepSeek API at $0.21 per million tokens rises diagonally against two flat self-host lines: 8× H100 SXM on Runpod at ~$17K/month and on AWS at ~$35K/month. Crossover points sit at roughly 82B tokens/month (Runpod) and 165B tokens/month (AWS); below those volumes the API is cheaper, above them self-hosting wins.]

So self-hosting V3 on the cheapest neocloud comes out roughly 17× more expensive per token than the v4-flash API and about 5× more than the promotional v4-pro rate, even at full utilization; at typical traffic levels the gap only widens. Even after the v4-pro promo expires, the API beats self-hosting on the cheapest hardware tier. Self-hosting only catches up with the API economically when:

  • Steady-state output volume exceeds ~50B tokens per month per cluster. Below that, the GPUs sit idle for portions of the day and the per-token cost balloons.
  • Latency requirements demand a kept-warm deployment. API providers add 100-300 ms over a hot self-hosted endpoint.
  • Compliance / data residency rules out the DeepSeek-hosted API entirely.
  • You're running a fine-tuned variant that the public API doesn't serve.

If none of those apply, paying DeepSeek $0.21-0.65 per million tokens is strictly cheaper than self-hosting at $3-7 per million on equivalent hardware. The gap widens further once you include engineering overhead: keeping an 8-GPU vLLM cluster healthy, monitored, and scaled runs roughly $10-30K per month of loaded engineering cost on a small team.
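The crossover points in the chart above fall straight out of dividing the flat monthly cluster cost by the API's per-token rate, treating self-hosted tokens as free once the cluster is paid for. A sketch using the Runpod and AWS figures from the table and the $0.21/M v4-flash rate:

```python
# Break-even monthly token volume: flat cluster cost vs. pay-per-token API.
# Ignores engineering overhead, which pushes the break-even point even higher.
def breakeven_tokens(cluster_cost_per_month: float, api_price_per_m: float) -> float:
    return cluster_cost_per_month / api_price_per_m * 1e6   # tokens/month

for name, monthly in [("Runpod 8x H100", 17_222), ("AWS p5.48xlarge", 34_560)]:
    tokens = breakeven_tokens(monthly, api_price_per_m=0.21)
    print(f"{name:16s} break-even ~{tokens/1e9:.0f}B tokens/month")
# -> roughly 82B (Runpod) and 165B (AWS) against the $0.21/M v4-flash rate
```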

Where the math flips

Three workload patterns where self-hosting V3 makes economic sense:

High-volume code generation pipelines. A continuous batch-inference workload pushing 100B+ tokens through V3 per month (typical of large-scale code refactoring, automated PR generation, or codebase-wide analysis) moves the per-token cost on a self-hosted cluster below the DeepSeek API. The example: a 200-engineer org running V3 as the primary code agent at 4-5 prompts per developer per minute averages 80-120B tokens/month, which puts self-hosting roughly at parity with the API.
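A back-of-envelope for that example, sketched below; the per-interaction token count is the load-bearing assumption (agentic code workflows routinely carry 10K+ tokens of repository context per call), not a published figure:

```python
# Monthly token volume for the hypothetical 200-engineer code-agent workload.
# tokens_per_interaction is an assumed average (prompt context + completion).
engineers = 200
prompts_per_min = 4.5            # mid-point of the 4-5 prompts/min figure above
active_hours_per_day = 8
workdays_per_month = 21
tokens_per_interaction = 11_000  # assumption: context-heavy agent calls

interactions = engineers * prompts_per_min * 60 * active_hours_per_day * workdays_per_month
tokens = interactions * tokens_per_interaction
print(f"~{interactions/1e6:.1f}M interactions, ~{tokens/1e9:.0f}B tokens/month")
# lands near the middle of the 80-120B/month range under these assumptions
```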

Regulated workloads. In healthcare, defense, and financial-services contexts, the cost of sending data to a third-party API is measured in compliance and reputational risk rather than dollars, and the calculus changes entirely. Self-hosting becomes a compliance prerequisite, not a cost optimization.

Fine-tuned variants. The DeepSeek API doesn't serve customer fine-tunes (yet). If your evaluation shows that a fine-tuned V3 outperforms the base model by 5-10 points on your domain benchmarks, the only path to production is self-hosting the tuned weights.

For everyone else: ship on the DeepSeek API, monitor your spend, and revisit self-hosting if your monthly bill crosses ~$15K and your traffic profile is steady enough to justify a kept-warm cluster.

What this looks like in the calculator

Open the GPU Capacity Planner with the DeepSeek V3 preset selected and the heavy workload preset chosen. The recommended configuration should land on either 8× H100 SXM at FP8 or 5× H200 SXM at FP8, depending on your cloud provider filter. The reasoning panel will explain why specific GPUs were filtered out (memory headroom, latency target). The cost panel will show the monthly TCO across providers.

Compare that against the API path by running the same prompt-output mix through the pricing audit's framework: for most realistic traffic profiles, the API number is significantly smaller. That's the data-driven version of the recommendation above.

For deeper reading on the underlying calculations, see the methodology post. For the broader inference-cost picture, the hidden costs of LLM APIs covers the four factors that determine actual production spend across both lanes.


Tags: DeepSeek V3, Mixture-of-Experts, self-hosting, GPU sizing, inference cost

