How much GPU memory does Gemma 4 31B require?

Gemma 4 31B requires approximately 62 GB in FP16 for weights alone, fitting on a single A100 80GB or H100 80GB. With INT4 quantization, the weight footprint drops to roughly 16 GB, making it feasible on consumer GPUs like the RTX 4090 (24 GB). Add 20-30% for KV cache and runtime overhead depending on your context length and batch size.

What is the difference between Gemma 4 26B A4B and the 31B dense model?

The 26B A4B is a Mixture-of-Experts (MoE) model with 26 billion total parameters but only 4 billion active per token. This means it achieves close to the 31B's quality while using significantly less compute per inference. The trade-off is that the full 26B still needs to reside in memory, so VRAM requirements are similar, but throughput is higher because less computation happens per token.

Can I run Gemma 4 on a consumer GPU?

Yes, depending on the variant. The E2B model (5.1 GB total) runs comfortably on any modern GPU or even CPU. The E4B (8 GB total) fits on a 16 GB GPU. The 26B MoE and 31B dense models require quantization (INT4/INT8) to run on consumer hardware like the RTX 4090 or RTX 5090.

How does Gemma 4 compare to Llama 4 and other open-source models?

Gemma 4 31B scores 89.2% on AIME 2026 and 85.2% on MMLU Pro, placing it competitively with much larger models. The 26B MoE variant is particularly notable, achieving near-31B quality with only 4B active parameters. Compared to Llama 4 Scout (17B active / 109B total), Gemma 4's MoE is more memory-efficient while the dense 31B offers stronger per-parameter performance.

Google Gemma 4: Architecture, GPU Requirements, and What It Means for Open-Source AI

Google announced Gemma 4 on April 2, 2026, positioning it as "byte for byte, the most capable open models" with a focus on advanced reasoning and agentic workflows. The release includes four models spanning from 2.3 billion active parameters up to 31 billion, each targeting a different deployment scenario, and all available on Hugging Face under Apache 2.0. The performance numbers are strong, but the more interesting story is in the specific engineering decisions Google made around memory efficiency, multimodal input handling, and the Mixture-of-Experts design that allows near-frontier quality from 4 billion active parameters.

The Gemma 4 model family

Gemma 4 is not a single model but a family of four, each designed for a different point on the capability-efficiency frontier. The naming convention reflects this: the two smaller models use an "E" prefix (for "efficient"), while the two larger models are identified by their parameter counts.

Model	Total Params	Active Params	Context	Modalities	License
E2B	5.1B	2.3B	128K	Text, Image, Video, Audio	Apache 2.0
E4B	8B	4.5B	128K	Text, Image, Video, Audio	Apache 2.0
26B A4B	26B (MoE)	4B	256K	Text, Image, Video	Apache 2.0
31B	31B (Dense)	31B	256K	Text, Image, Video	Apache 2.0

All four models ship under Apache 2.0, which means no usage restrictions, no registration requirements, and full commercial use. Both base and instruction-tuned (IT) variants are available for each model.

The distinction between total and active parameters matters for understanding both memory requirements and inference costs. The E2B model has 5.1 billion total parameters (including embeddings) but only 2.3 billion are active during any given forward pass. The 26B MoE model takes this further: 26 billion parameters reside in memory, but only 4 billion activate per token, which means inference compute scales with the active count while memory scales with the total.

Architecture: what makes Gemma 4 different

Several architectural decisions in Gemma 4 diverge from the standard transformer recipe, and they have direct implications for how these models behave in production.

Per-Layer Embeddings (PLE)

The most distinctive feature is Per-Layer Embeddings, a mechanism where each decoder layer receives its own conditioning vector derived from the input tokens. In a standard transformer, the embedding layer produces a single representation that flows through all layers unchanged (aside from residual connections). PLE adds a secondary embedding table that provides per-layer residual signals, effectively giving each layer a different "view" of the input tokens.

The practical consequence is better parameter efficiency. Each layer can specialize more aggressively because it receives layer-specific conditioning, which means a 31B model with PLE can extract more capability per parameter than a 31B model without it. The overhead is modest: the secondary embedding table adds parameters but minimal inference latency.

Alternating attention: local and global

Gemma 4 alternates between two attention patterns across its layers. Local layers use sliding-window attention with a window of 512 or 1024 tokens, which keeps memory and compute bounded regardless of sequence length. Global layers use full-context attention with proportional RoPE positional encoding, which enables the model to attend to the entire 128K or 256K context window.

This hybrid approach is what makes 256K context feasible on hardware that could not handle full attention at that length. The local layers handle most of the computation cheaply, while the global layers (which are fewer in number) handle the long-range dependencies that matter for retrieval and reasoning over long documents.

Shared KV cache

In the larger models, the last N decoder layers share their key-value states rather than maintaining independent caches. This reduces the memory footprint of the KV cache and the compute required to fill it, which directly translates to higher throughput at longer context lengths. For production deployments running batch inference with 128K+ contexts, this is a meaningful optimization.

Mixture-of-Experts (26B A4B)

The 26B variant uses a Mixture-of-Experts architecture where each token is routed to a subset of expert networks. With 26 billion total parameters but only 4 billion active per token, the model achieves quality that approaches the 31B dense model while requiring significantly less compute per forward pass.

The important caveat for deployment planning is that MoE models require the full parameter set in memory even though only a fraction activates per token. You need enough VRAM to hold 26 billion parameters, but the inference speed is closer to what you would expect from a 4 billion parameter model. This makes MoE architectures particularly attractive for throughput-sensitive deployments where memory is available but per-token latency or cost is the binding constraint.

GPU memory requirements

Getting GPU sizing right for Gemma 4 requires accounting for model weights, KV cache, and runtime overhead. The weight footprint depends entirely on quantization precision, while the KV cache scales with context length and batch size.

Weight memory by precision

Model	FP16	INT8	INT4
E2B (5.1B)	10.2 GB	5.1 GB	2.6 GB
E4B (8B)	16 GB	8 GB	4 GB
26B A4B (MoE)	52 GB	26 GB	13 GB
31B (Dense)	62 GB	31 GB	15.5 GB

These are weight-only figures. In production, add 20-40% for KV cache, activations, and framework overhead depending on your context length and concurrency. If you need a more precise estimate for your specific deployment, our GPU sizing calculator handles the full memory budget calculation including KV cache and batch size.

Hardware recommendations

Model	Minimum viable	Recommended production
E2B	Any modern CPU, mobile SoC	Edge devices, Raspberry Pi 5, mobile
E4B	RTX 3060 12GB (INT4)	RTX 4070 Ti, T4, consumer GPU
26B A4B	RTX 4090 24GB (INT4)	A100 40GB, L40S
31B	RTX 4090 24GB (INT4, tight)	A100 80GB, H100

The E2B and E4B models are explicitly designed for on-device deployment, and they deliver on that promise. The E2B in INT4 requires roughly 2.6 GB for weights, which leaves headroom for KV cache even on a 4 GB mobile GPU. The E4B in INT4 fits comfortably on a 16 GB consumer GPU with room for moderate context lengths.

The 26B MoE and 31B dense models are more demanding. In FP16, the 31B requires a single A100 80GB or H100. With INT4 quantization, both can fit on an RTX 4090 (24 GB), though KV cache at long context lengths will push memory limits. For production serving with concurrent requests and long contexts, A100 80GB or H100 hardware is the practical minimum for either of the larger models.

For a detailed walkthrough of GPU memory calculations including KV cache sizing and multi-GPU setups, see our GPU sizing guide for LLM inference.

Benchmark performance

The benchmark numbers contextualize where Gemma 4 sits relative to other open-weight and proprietary models. The standout result is the 26B MoE: it comes within a few percentage points of the 31B dense model across nearly every benchmark while activating only 4 billion parameters per token.

Reasoning and knowledge

Benchmark	31B	26B A4B	E4B	E2B
MMLU Pro	85.2%	82.6%	69.4%	60.0%
AIME 2026	89.2%	88.3%	42.5%	37.5%
GPQA Diamond	84.3%	82.3%	58.6%	43.4%
BigBench Hard	74.4%	64.8%	33.1%	21.9%

The 31B model's 89.2% on AIME 2026 is particularly notable for a sub-40B open model. The MoE variant trails by less than 1 percentage point on AIME while using a fraction of the compute per token.

Coding

Benchmark	31B	26B A4B	E4B	E2B
LiveCodeBench v6	80.0%	77.1%	52.0%	44.0%
Codeforces Elo	2150	1718	940	633

A Codeforces Elo of 2150 from the 31B model puts it in competitive territory with models that have significantly more parameters. The MoE variant at 1718 demonstrates that the expert routing is effective for code generation tasks, though the gap to the dense model is wider here than on reasoning benchmarks.

Vision and multimodal

Benchmark	31B	26B A4B	E4B	E2B
MMMU Pro	76.9%	73.8%	52.6%	44.2%
MATH-Vision	85.6%	82.4%	59.5%	52.4%
OmniDocBench (lower = better)	0.131	0.149	0.181	0.290

All four models support vision input natively, with variable aspect ratio encoding that adjusts the token budget based on image content. The vision encoder uses learned 2D positions with multidimensional RoPE, which handles diverse image sizes without the distortion that comes from forced resizing.

Multimodal capabilities

Gemma 4's multimodal support goes beyond basic image understanding. The models handle text, images, video, and (for the smaller variants) audio through a unified architecture.

Capability	E2B	E4B	26B A4B	31B
Text	Yes	Yes	Yes	Yes
Images	Yes	Yes	Yes	Yes
Video	Yes	Yes	Yes	Yes
Audio	Yes	Yes	No	No

Audio support in the E2B and E4B models uses a USM-style conformer encoder, the same architecture Google used in Gemma 3n. This is significant for on-device applications that need speech understanding without a separate ASR pipeline.

The vision encoder supports configurable token budgets (70, 140, 280, 560, or 1120 tokens per image), which gives deployment teams fine-grained control over the trade-off between visual understanding quality and inference cost. Lower token budgets process images faster and consume less KV cache memory, which matters when processing many images or video frames in a single request.

Notable built-in capabilities include:

Object detection with native JSON bounding box output
GUI element detection for UI automation tasks
Document understanding and OCR
Video comprehension with multi-frame reasoning
Function calling with multimodal inputs (e.g., analyzing an image and calling a tool based on what it sees)
Extended thinking mode with configurable reasoning budgets up to 4,000 tokens

Agentic workflows and tool use

Google's framing of Gemma 4 around "agentic workflows" is not just marketing. The models ship with native function calling that works across modalities, and the extended thinking mode gives them a structured way to reason before acting. This combination is what separates a model that can answer questions from one that can operate as part of a multi-step pipeline.

The function calling implementation follows a tool-use pattern where the model receives a set of tool definitions (as JSON schemas), reasons about which tool to invoke, and returns structured JSON with the function name and arguments. What Gemma 4 adds is the ability to ground tool selection in multimodal context. A model can analyze an image, decide it needs to call a weather API based on a location visible in the photo, and return the structured call, all in a single inference pass.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

inputs = processor.apply_chat_template(
    messages,
    tools=tools,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,
)

The enable_thinking=True flag activates extended thinking, which allocates up to 4,000 tokens for internal reasoning before the model produces its response. This is particularly useful for agentic tasks where the model needs to plan a sequence of actions or decide between multiple tools. The thinking tokens are generated but can be hidden from the final output, so the user sees only the result while the model benefits from the intermediate reasoning.

For teams building agent frameworks or multi-step pipelines, the practical implication is that Gemma 4 can serve as the reasoning core of an agent without relying on prompt engineering tricks to simulate tool use. The function calling and thinking capabilities are part of the model's training, not bolted on through system prompts.

Deployment options

Gemma 4 has deployment paths ranging from browser-based WebGPU inference (E2B) to data center serving (31B).

Hugging Face Transformers is the reference implementation:

from transformers import AutoModelForMultimodalLM, AutoProcessor

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-31b-it",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("google/gemma-4-31b-it")

llama.cpp provides quantized inference for local and server deployments:

llama-server -hf ggml-org/gemma-4-E2B-it-GGUF

MLX targets Apple Silicon with TurboQuant support for roughly 4x memory reduction:

mlx_vlm.generate \
  --model google/gemma-4-E4B-it \
  --image image.jpg \
  --prompt "Describe this image"

Additional deployment paths include ONNX for cross-platform inference, mistral.rs for Rust-native serving, and transformers.js for browser-based inference via WebGPU.

Where each model fits

The four models target fundamentally different deployment scenarios, and choosing between them is less about which is "best" and more about which constraints dominate your use case.

E2B (2.3B active) is the right choice when the model must run on-device: mobile applications, edge inference, embedded systems, or scenarios where data cannot leave the device. Its audio support makes it particularly suitable for voice-enabled applications that need local processing.

E4B (4.5B active) occupies the space between on-device and cloud. It runs on consumer GPUs and handles multimodal tasks with audio support, making it a strong default for applications that have access to a single GPU but need more capability than the E2B provides.

26B A4B (4B active, MoE) is the efficiency play. If you have the memory budget for 26B parameters but need throughput closer to a 4B model, the MoE architecture delivers near-31B quality at a fraction of the per-token compute. This is the model to choose when you are serving high volumes of requests and inference cost per token is your primary concern.

31B (dense) is the quality ceiling. When you need the best possible output from an open-weight model under 40B parameters and have the hardware to support it, the 31B dense model is the straightforward choice. Its 256K context window and strong reasoning benchmarks make it suitable for complex tasks that require extended context and precise outputs.

What this means for the open-source landscape

Gemma 4 continues the trend of open-weight models closing the gap with proprietary offerings, but the MoE variant is where the release gets strategically interesting. A model that achieves 88.3% on AIME 2026 with only 4 billion active parameters per token changes the economics of self-hosted inference in a concrete way: the compute cost per token drops by an order of magnitude while quality remains competitive with models that are 8-10x larger in active parameters.

The Apache 2.0 licensing removes the usage restrictions that complicate deployment of some other open-weight models. There are no registration requirements, no commercial use limitations, and no output attribution requirements, which makes Gemma 4 a viable foundation for production applications without legal overhead.

For teams evaluating whether to self-host or use API access, the GPU memory requirements and cost implications of each variant are worth calculating precisely. Our model comparison tool and GPU sizing calculator can help you run those numbers against your specific workload and hardware constraints.