Gemma 4 is the 2026 generation of Google DeepMind's open-weights model family, comprising the E2B and E4B small multimodal variants, the 26B-A4B mixture-of-experts variant, and the 31B dense variant, the family's largest and most capable release. All variants support text and image input; the E2B and E4B small models additionally support audio, and the 31B variant adds video. As of May 2026, Gemma 4 31B IT sits in the leading group of mid-scale open-weights multimodal models on reasoning, coding, and vision benchmarks, and its Apache 2.0 license places it in the broadly permissive release tier alongside the Qwen 3.6 and DeepSeek V4 families.
At a glance
- Lab: Google DeepMind.
- Released: Early 2026, following the Gemma 3 generation. Training data cutoff of January 2025.
- Modality: Multimodal across the family. All sizes support text and image input. E2B and E4B small variants add audio. The 31B variant adds video (via frame processing).
- Open weights: Yes. Apache 2.0 license, including commercial use.
- Architecture: Family of four principal variants: E2B (2.3 billion effective parameters, dense, audio capable, 128K context), E4B (4.5 billion effective, dense, audio capable, 128K context), 26B-A4B (25.2 billion total parameters in a mixture-of-experts configuration with 3.8 billion active per token, 256K context), and 31B (30.7 billion dense parameters, 256K context, video input). The 31B variant uses 60 layers, hybrid sliding-window-and-global attention with a 1,024-token sliding window, and a 262,000-token vocabulary. The vision encoder is approximately 550 million parameters and integrated into the same model.
- Context window: 128K tokens on E2B and E4B; 256K tokens on 26B-A4B and 31B. Variable image resolution with configurable visual token budget (70, 140, 280, 560, or 1,120 tokens per image).
- Pricing: Open weights, free to self-host. Hosted-inference pricing through Fireworks AI, Together AI, Google AI Studio, and Vertex AI varies by provider.
- Distribution channels: Hugging Face, Google AI Studio, Vertex AI, Kaggle, Together AI, HuggingChat, vLLM, SGLang, llama.cpp quantisations, Ollama, Docker Model Runner, Google Colab notebooks.
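The configurable visual token budget noted above (70, 140, 280, 560, or 1,120 tokens per image) lends itself to a simple selection helper. The sketch below is illustrative only: the tier list comes from this profile, but the helper name, the OCR-versus-general mapping, and the half-the-context cap are assumptions, not documented behaviour.

```python
# Documented visual-token-budget tiers (tokens per image).
BUDGETS = (70, 140, 280, 560, 1120)

def pick_visual_budget(needs_ocr: bool, image_count: int,
                       context_limit: int = 256_000) -> int:
    """Pick a per-image token budget: prefer high resolution for OCR,
    but never let image tokens exceed half the context window
    (the half-context cap is an illustrative heuristic)."""
    preferred = 1120 if needs_ocr else 280
    fitting = [b for b in BUDGETS if b * image_count <= context_limit // 2]
    if not fitting:
        raise ValueError("too many images for the context window")
    # Use the preferred tier if it fits; otherwise the largest that does.
    return preferred if preferred in fitting else max(fitting)
```

For example, a 200-page document scan at the 1,120-token tier would overflow half the 256K context, so the helper steps down to the 560-token tier.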
Origins
Gemma 4 follows the Gemma 3 generation released in 2025 and represents the fourth major iteration in Google's open-weights model lineage. The family is positioned as the open-weights complement to the closed-weights Gemini frontier line: Gemma releases share underlying research with the Gemini models but ship at smaller scales with permissive licensing for self-hosting and fine-tuning.
The 2026 release consolidates multimodality across the family. Where Gemma 3 had separate text-only and vision-capable variants, every Gemma 4 SKU is multimodal by default, with the small E2B and E4B models additionally supporting audio (a capability that previously required separate Audio-PaLM or Gemini-Live integration). The 31B variant adds video input via frame extraction.
The four-SKU family structure (E2B small, E4B small, 26B-A4B mid-size MoE, 31B dense) targets a wider deployment band than previous Gemma generations. The E2B and E4B variants are sized for on-device deployment on laptops, single-GPU workstations, and high-end consumer hardware. The 26B-A4B variant offers mid-size capability at an MoE inference-cost profile, activating only 3.8 billion parameters per token. The 31B variant is the family's flagship for production inference servers at single-GPU 80-gigabyte scale.
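The deployment bands can be sanity-checked with back-of-envelope arithmetic: at bf16 precision (2 bytes per parameter), weight memory alone is roughly twice the parameter count in gigabytes. The parameter counts below are the ones quoted in this profile; KV cache and activation memory are ignored.

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (default bf16: 2 bytes/param)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

variants = {
    "E2B (2.3B dense)": 2.3,
    "E4B (4.5B dense)": 4.5,
    "26B-A4B (25.2B total MoE)": 25.2,
    "31B (30.7B dense)": 30.7,
}
for name, p in variants.items():
    print(f"{name}: ~{weight_gb(p):.0f} GB bf16 weights")
```

The 31B variant lands at roughly 61 GB of bf16 weights, which is consistent with the single-GPU 80-gigabyte deployment target once KV cache headroom is accounted for.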
The training data cutoff of January 2025 places the family roughly a year behind the most recent frontier-model knowledge horizons. The training mix includes web documents in 140-plus languages, code repositories, mathematics texts, and a diverse image collection, with explicit filtering for CSAM, sensitive data, and quality-and-safety screening. The model card cites major safety improvements over Gemma 3 across all content-safety categories.
Capabilities
The Gemma 4 31B IT capability profile spans reasoning, coding, multimodal understanding, and multilingual generation.
Reasoning and mathematics are the headline capabilities. The model includes a built-in thinking mode via <|think|> tokens, configurable on a per-request basis. AIME 2026 (the high-difficulty mathematics olympiad benchmark) reports 89.2 percent without tools, consistent with the leading group of mid-scale open-weights models on the most recent reasoning evaluations. GPQA Diamond reports 84.3 percent, and BigBench Extra Hard 74.4 percent.
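The per-request thinking toggle can be sketched as prompt assembly. The <|think|> control token is the documented mechanism; the turn-delimiter template and the helper below are assumptions for illustration, not the official chat template.

```python
def build_prompt(user_text: str, enable_thinking: bool = True) -> str:
    """Assemble a single-turn prompt, optionally cueing a reasoning trace.
    The <start_of_turn>/<end_of_turn> delimiters are assumed, following
    earlier Gemma chat-template conventions."""
    turn = (f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
            f"<start_of_turn>model\n")
    if enable_thinking:
        turn += "<|think|>"  # cue the model to emit reasoning before the answer
    return turn
```

In practice the serving stack's chat template would handle this; the sketch just shows where the control token sits relative to the model turn.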
Coding capability is anchored by LiveCodeBench v6 at 80.0 percent and a Codeforces Elo rating of 2,150. The Codeforces figure in particular places Gemma 4 31B in the competitive-programming tier where the model can solve a meaningful fraction of contest problems unaided. The Apache 2.0 license makes this coding capability available for code- and developer-tool integration without per-token licensing constraints.
Multimodal understanding spans image, video (on 31B), and audio (on E2B and E4B). MMMU Pro reports 76.9 percent. MATH-Vision reports 85.6 percent. The variable image resolution feature (with token budgets of 70 to 1,120 per image) lets developers trade vision context quality against inference cost; document and OCR workflows benefit from the 1,120-token-per-image high-resolution mode.
Long-context capability is supported by the 256K context window and the MRCR v2 8-needle 128K long-context benchmark at 66.4 percent. The hybrid sliding-window-and-global attention architecture, combined with the proportional RoPE (p-RoPE) positional encoding and unified key-value memory, is the technical mechanism that produces the long-context efficiency profile.
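The hybrid attention pattern can be illustrated with a toy visibility predicate: most layers attend only within the 1,024-token sliding window, while periodic global layers attend over the full causal prefix. The six-layer global interval used here is an illustrative assumption, not a documented figure.

```python
WINDOW = 1024  # documented sliding-window size in tokens

def can_attend(layer: int, q: int, k: int, global_every: int = 6) -> bool:
    """True if query position q may attend to key position k at this layer.
    Every `global_every`-th layer is global (assumed ratio); the rest
    are local sliding-window layers."""
    if k > q:                       # causal mask: never attend to the future
        return False
    if layer % global_every == 0:   # global layer: full causal prefix
        return True
    return q - k < WINDOW           # local layer: within the sliding window
```

The efficiency argument is visible in the predicate: local layers bound KV memory per layer to the window size, while the sparse global layers preserve access to the distant prefix for tasks like 8-needle retrieval.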
Multilingual capability spans 35-plus languages out of the box, with the underlying pre-training covering 140-plus languages. The expanded vocabulary (262,000 tokens) is sized to handle the broader multilingual support at competitive tokenisation efficiency.
The tool-calling capability includes native structured function calling for agentic workflows. The recommended modality ordering (images and audio before text) is documented in the model card for optimal multimodal-task performance.
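A structured function-calling request that honours the images-before-text ordering might look like the following. The field names follow the common OpenAI-compatible request shape that third-party serving stacks expose, not a Gemma-specific schema, and the lookup_part tool is hypothetical.

```python
def build_request(image_url: str, question: str) -> dict:
    """Assemble a chat request: media content parts first, text last,
    with one declared tool (all names here are illustrative)."""
    return {
        "model": "google/gemma-4-31b-it",
        "messages": [{
            "role": "user",
            "content": [
                # Recommended ordering: images/audio before text.
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "lookup_part",
                "description": "Look up a part number seen in the image.",
                "parameters": {
                    "type": "object",
                    "properties": {"part_number": {"type": "string"}},
                    "required": ["part_number"],
                },
            },
        }],
    }
```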
Benchmarks and standing
Gemma 4 31B IT reports the following benchmark positions at release:
- MMLU Pro: 85.2 percent
- AIME 2026 (no tools): 89.2 percent
- LiveCodeBench v6: 80.0 percent
- Codeforces Elo rating: 2,150
- GPQA Diamond: 84.3 percent
- BigBench Extra Hard: 74.4 percent
- MMMU Pro (vision): 76.9 percent
- MATH-Vision: 85.6 percent
- MRCR v2 8-needle 128K (long-context): 66.4 percent
The combined profile places Gemma 4 31B in the top tier of mid-scale open-weights models across reasoning, coding, multimodal, and long-context axes as of May 2026. The MMLU Pro figure of 85.2 percent matches the Qwen 3.6-35B-A3B figure, jointly the current ceiling for open-weights models at this scale. The Codeforces Elo rating of 2,150 is among the highest reported for any open-weights model at this parameter count.
Comparison against the larger MoE peers (DeepSeek V4 Pro, Qwen 3.6-35B-A3B, GLM-5.1) is mixed: Gemma 4 31B's dense architecture produces higher per-active-parameter capability but at materially higher inference cost than the sparse-active counterparts. The competitive question for developers is whether the dense efficiency profile or the MoE inference-cost profile better fits the production deployment target.
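The dense-versus-sparse cost question reduces to simple arithmetic under the standard rule of thumb of roughly 2 FLOPs per active parameter per decoded token; the figures below are the active-parameter counts quoted in this profile.

```python
def decode_flops(active_params_billion: float) -> float:
    """Approximate FLOPs to decode one token (~2 x active params)."""
    return 2 * active_params_billion * 1e9

dense_31b = decode_flops(30.7)  # dense: all 30.7B parameters active
moe_26b = decode_flops(3.8)     # MoE: 3.8B of 25.2B parameters active
ratio = dense_31b / moe_26b
print(f"dense/MoE per-token compute ratio: ~{ratio:.1f}x")
# → dense/MoE per-token compute ratio: ~8.1x
```

An eight-fold per-token compute gap is the concrete form of the trade-off described above: the dense 31B buys uniform capability at materially higher serving cost.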
Benchmark leadership is point-in-time. The open-weights category turns over on roughly a quarterly cadence, and the next major Gemma, Llama, Qwen, and DeepSeek releases are expected through the second half of 2026.
Access and pricing
Gemma 4 ships as open-weights under Apache 2.0, permitting both research and commercial use without per-token licensing. The principal access channels:
- Hugging Face Hub at google/gemma-4-31B-it and sibling repositories for the E2B, E4B, and 26B-A4B variants. Direct download for self-hosting.
- Google AI Studio provides the first-party hosted-inference surface for prototyping and lightweight production use.
- Vertex AI offers the enterprise-grade hosted endpoint on Google Cloud Platform.
- Fireworks AI hosts the 31B IT variant in the multimodal-models catalog with the full 256K (262,144-token) context.
- Together AI and HuggingChat offer additional third-party hosted inference.
- Local deployment: vLLM, SGLang, Docker Model Runner, llama.cpp quantisations, Ollama. The Docker Model Runner integration is the lowest-friction local-deployment path for non-specialist developers.
- Notebooks: Google Colab and Kaggle integration for research and educational use.
Recommended sampling parameters from the model card: temperature 1.0, top-p 0.95, top-k 64 across use cases. Thinking mode is on by default and can be toggled per request via the enable_thinking parameter. Variable image resolution should be set higher (560 or 1,120 tokens per image) for OCR and document-parsing workloads.
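Assembled into a request payload, the recommended settings might look like this. The temperature, top-p, top-k values and the enable_thinking name come from the model-card guidance above; routing top_k and enable_thinking through an extra_body field is an assumption about an OpenAI-compatible serving stack (as vLLM-style servers allow), not a documented API.

```python
def sampling_payload(prompt: str, thinking: bool = True) -> dict:
    """Chat request using the model card's recommended sampling settings."""
    return {
        "model": "google/gemma-4-31b-it",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # recommended across use cases
        "top_p": 0.95,
        # Non-standard fields passed through to the backend (assumed route).
        "extra_body": {"top_k": 64, "enable_thinking": thinking},
    }
```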
Comparison
- Qwen 3.6 (Alibaba Qwen). The closest open-weights peer at comparable scale. Qwen 3.6-35B-A3B's mixture-of-experts architecture gives it lower inference cost per token; Gemma 4 31B's dense architecture gives it more uniform capability per active parameter. Both are Apache 2.0 licensed and both target the mid-scale multimodal market.
- DeepSeek V4 (DeepSeek). DeepSeek V4 Pro is materially larger (1.6 trillion total parameters, 49 billion active) and targets the open-weights frontier-capability tier. Gemma 4 31B targets a smaller deployment-cost band; the comparison is more about deployment-target fit than head-to-head capability.
- GLM-5.1 (Z.ai). GLM-5.1 is positioned around agentic engineering; Gemma 4 31B is positioned around general-purpose reasoning and multimodal understanding. The agent-focused vs general-purpose split is the principal competitive axis.
- Llama 4 (Meta AI). Llama 4 is the Western open-weights flagship from a different lab. The comparison favours Llama on research-ecosystem depth and Gemma on out-of-the-box multimodal capability.
- Kimi K2.6 (Moonshot AI). Another Chinese-origin open-weights peer with a different reasoning-versus-coding emphasis.
The competitive picture is unusually crowded in the mid-scale open-weights category as of May 2026. Developer-deployment choice often turns on the existing-infrastructure-and-tooling lock-in (Hugging Face, vLLM, llama.cpp quantisation availability, integration with the specific cloud) rather than on raw benchmark differences.
Outlook
Open questions for the next 6 to 18 months:
- Successor cadence and family expansion. Whether Gemma 5 arrives on the same approximately-annual cadence, and whether the family expands further at the small on-device end or beyond 31B at the production end, are the central roadmap questions.
- Knowledge-horizon update. The January 2025 training cutoff places Gemma 4 roughly a year behind the most recent frontier-model knowledge horizons. A mid-2026 refresh with a more recent cutoff would materially close that gap.
- Vertex AI integration depth. Google's enterprise-AI strategy has progressively unified Gemini and Gemma surfaces within Vertex AI. The depth of the enterprise-grade tooling for Gemma 4 (custom fine-tuning, scheduled batch inference, agent integration) will indicate whether the Gemma line is treated as a research-and-developer release or as a full enterprise-deployment product.
- Audio-modality expansion to larger variants. Only the E2B and E4B small models currently support audio. Whether the 31B variant or a future flagship adds audio input is an open question.
- Independent leaderboard reproduction. Several of the benchmark numbers (AIME 2026, MMMU Pro, Codeforces Elo) are first-party reports. Independent reproduction on the standard open-weights leaderboards will indicate whether the position holds against the most current peer releases.
Sources
- Hugging Face: Gemma 4 31B IT. Primary model card with architecture, benchmark, and distribution details.
- Fireworks AI: model catalog. Hosted-inference availability for Gemma 4 31B IT.
- Companion profile: Google DeepMind for the broader Google AI research context including the Gemini line.
- Companion model: Qwen 3.6 for the closest open-weights multimodal peer.
- Companion model: DeepSeek V4 for the frontier-tier open-weights peer.
- Companion model: GLM-5.1 for the agentic-engineering-focused open-weights peer.