Zamba2

Zamba2 is a family of open-weights hybrid Mamba2-transformer language models from Zyphra at 1.2B, 2.7B, and 7B sizes, with shared attention layers and parameter-sharing techniques that reduce time-to-first-token and key-value cache memory.

Zamba2 is a family of open-weights hybrid Mamba2-transformer language models developed by Zyphra and released through 2024 in three sizes: Zamba2-1.2B, Zamba2-2.7B, and Zamba2-7B. The architecture combines a backbone of Mamba2 state-space-model layers interleaved with shared transformer attention blocks in an alternating pattern, augmented with LoRA projectors that allow per-layer specialization at minimal parameter overhead. The Zamba2 suite reports up to 50 percent reductions in time-to-first-token and 6 times reductions in key-value cache memory relative to comparable transformer models, with state-of-the-art benchmark performance against peer 7-billion-parameter open-weights models including Mistral-7B, Gemma-7B, and Llama-3-8B at the November 2024 release.

At a glance

  • Lab: Zyphra
  • Released: Zamba2-2.7B (mid-2024); Zamba2-7B (Q3 2024); Zamba2-1.2B (Q3 2024). Technical report posted to arXiv on November 22, 2024.
  • Modality: Text. Base models and instruction-tuned variants for general-purpose language modeling and dialogue.
  • Open weights: Yes. All three sizes released open-weights through the Zyphra organization on Hugging Face under permissive licensing for research and commercial use.
  • Context window: 4,096 tokens at the native pretraining length, with extension techniques applied to the larger variants.
  • Pricing: Free for self-hosted deployment under the open-weights license. Hosted access through community providers (such as Together AI and other inference platforms).
  • Distribution channels: Zyphra organization on Hugging Face, Zyphra/Zamba2 on GitHub, and integration with the Hugging Face Transformers library.

Origins

Zyphra was founded in 2021 in San Francisco by Krithik Puthalath, Beren Millidge, Tomás Figliolia, and Danny Martinelli. The company's research direction was oriented around next-generation neural network architectures, parameter sharing, and continual learning. The Zamba family of small efficient language models was the principal public research output of the 2021 to 2024 period.

The first Zamba model (Zamba1-7B) was published in early 2024 and introduced the core architectural idea: a hybrid backbone of Mamba state-space-model layers (the Mamba1 variant at the time) interleaved with shared transformer attention blocks. The shared transformer block is invoked from multiple positions in the network, with parameter sharing reducing total parameter count without sacrificing the attention capability that Mamba layers do not provide directly.

Zamba2 was the successor architecture, released through 2024 with three sizes (1.2B, 2.7B, and 7.4B parameters) and several refinements over Zamba1. The Mamba1 blocks were replaced with Mamba2 blocks, which offer improved compute efficiency and stronger empirical performance per parameter. The single shared attention block of Zamba1 was replaced with two shared attention blocks interleaved in an ABAB pattern throughout the network, providing more attention coverage at modest parameter cost. A LoRA projector was applied to each shared MLP block, allowing the network to specialize the MLPs at each invocation of the shared layer across depth, with depth-specialization added at minimal increase in total parameter count.
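The shared-block interleaving described above can be sketched as a layer schedule. The block names, interleave period, and layer count below are illustrative stand-ins, not values from the released Zamba2 configs; the point is only that two shared attention blocks cover the whole depth in an ABAB pattern:

```python
# Illustrative sketch of a Zamba2-style layer schedule: a backbone of
# Mamba2 blocks with two shared attention blocks ("A" and "B") invoked
# in an alternating ABAB pattern. The interleave period and block names
# are hypothetical, not taken from the released model configs.
def layer_schedule(num_mamba_layers: int, attn_every: int = 6) -> list:
    schedule = []
    shared = ["attn_A", "attn_B"]  # only two distinct attention blocks exist
    next_shared = 0
    for i in range(num_mamba_layers):
        schedule.append(f"mamba2_{i}")    # each Mamba2 block has its own weights
        if (i + 1) % attn_every == 0:     # periodically invoke a shared block
            schedule.append(shared[next_shared % 2])
            next_shared += 1
    return schedule

sched = layer_schedule(18, attn_every=6)
# However deep the network, only two distinct attention blocks appear:
assert {s for s in sched if s.startswith("attn")} == {"attn_A", "attn_B"}
```

Each invocation of `attn_A` or `attn_B` reuses the same weights; in the real architecture a per-invocation LoRA projector restores depth-specific behavior.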

The Zamba2 series builds on the initial Zamba1-7B work, with optimization across architecture, training and annealing datasets, and training run length up to three trillion tokens. The 7.4B-parameter variant was trained for two trillion tokens rather than the full three trillion target, due to compute and time limitations characterized in the technical report.

The June 2025 Series A of $100 million at a $1 billion post-money valuation, led by Jaan Tallinn, was the company's most consequential public-facing milestone. Tallinn's prior history of leading early rounds for Google DeepMind and Anthropic gave the round an unusual provenance signal. The October 2025 IBM Cloud and AMD partnership provided dedicated training infrastructure for subsequent model releases.

Capabilities

Zamba2 is built for efficient inference at small-to-mid parameter scales. Three capability features distinguish it from peer open-weights models in the same parameter class.

The first is the hybrid architecture itself. The Mamba2-transformer hybrid combines the long-sequence efficiency of state-space-model layers (Mamba2 layers process sequences with linear time-and-memory cost) with the attention coverage that transformer layers provide. The interleaved ABAB pattern of shared attention blocks gives the network attention reasoning at strategic depth points without paying the quadratic compute and memory cost of pure-transformer architectures.

The second is parameter sharing through LoRA-decorated shared blocks. Two shared transformer attention blocks are reused at multiple network depths, with LoRA projectors providing per-invocation specialization. The shared-plus-LoRA design reduces total parameter count compared to a stacked-transformer baseline of equivalent depth, with the Zamba2 technical report characterizing the trade-off as favorable on the standard small-model benchmark suite.
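A back-of-envelope comparison makes the parameter trade-off concrete. The hidden size, LoRA rank, invocation count, and the assumption of two adapted weight matrices per invocation are all illustrative, not Zamba2's published configuration:

```python
# Back-of-envelope parameter accounting: N independent transformer blocks
# versus one shared block plus per-invocation LoRA projectors. The hidden
# size, rank, invocation count, and "two adapted matrices per invocation"
# are illustrative assumptions, not Zamba2's published configuration.
def full_block_params(d_model: int) -> int:
    # Rough transformer block: QKV + output projection (~4*d^2) plus a
    # 4x-expanded MLP (~8*d^2); biases and norms ignored.
    return 12 * d_model * d_model

def lora_params(d_model: int, rank: int, adapted_matrices: int = 2) -> int:
    # Each LoRA pair adds A (d x r) and B (r x d) = 2*d*r parameters.
    return adapted_matrices * 2 * d_model * rank

d, r, invocations = 4096, 128, 12
independent = invocations * full_block_params(d)
shared_lora = full_block_params(d) + invocations * lora_params(d, r)
print(f"independent blocks: {independent / 1e9:.2f}B parameters")
print(f"shared + LoRA:      {shared_lora / 1e9:.2f}B parameters")
```

In this toy setting the shared-plus-LoRA design carries roughly a tenth of the parameters of twelve independent blocks, which is the shape of the trade-off the technical report characterizes as favorable.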

The third is the inference performance profile. Zyphra reports Zamba2 achieves up to 50 percent time-to-first-token reduction versus comparable pure-transformer models, and a 6 times reduction in key-value cache memory at long context due to the SSM-based architecture. The KV cache reduction matters most for inference cost economics in deployment, where memory consumption per active concurrent request is a principal constraint on throughput.
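The KV cache claim can be sanity-checked with rough arithmetic, assuming fp16 K and V tensors. The layer counts and head dimensions below are hypothetical, chosen only to show that cache size scales with the number of attention invocations rather than total network depth:

```python
# Rough KV-cache sizing in fp16 (2 bytes per element). K and V are each
# [batch, kv_heads, seq_len, head_dim] per attention layer; Mamba2 layers
# keep a fixed-size state and cache no KV. All dimensions are hypothetical.
def kv_cache_bytes(attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    # Factor of 2 for the two cached tensors (K and V) per attention layer.
    return 2 * attn_layers * batch * n_kv_heads * seq_len * head_dim * dtype_bytes

seq = 16_384
pure_transformer = kv_cache_bytes(attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq)
# A hybrid caches KV only at its handful of shared-attention invocations:
hybrid = kv_cache_bytes(attn_layers=6, n_kv_heads=8, head_dim=128, seq_len=seq)
print(f"pure transformer: {pure_transformer / 2**30:.2f} GiB")
print(f"hybrid:           {hybrid / 2**30:.2f} GiB ({pure_transformer / hybrid:.1f}x smaller)")
```

With these made-up dimensions the hybrid caches KV for 6 layers instead of 32, a roughly 5x reduction per concurrent request, in the same ballpark as the reported 6x figure.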

The Zamba2 family includes both base models and instruction-tuned variants. The Zamba2-Instruct variants apply post-training (supervised fine-tuning and preference optimization) to produce dialogue-capable assistants suitable for direct conversational use.

Benchmarks and standing

Zamba2-7B, the flagship variant of the family, reported state-of-the-art benchmark performance against the leading 7-billion-parameter open-weights models at its November 2024 release. The principal comparison set included Mistral-7B, Gemma-7B, and Llama-3-8B. Zyphra attributed significant improvements in factual knowledge recall and reasoning, as measured by MMLU and ARC, to its dataset and training methodology.

On MMLU, Zamba2-7B reported scores competitive with or exceeding the top of the 7B class at the November 2024 measurement; on ARC, the AI2 Reasoning Challenge benchmark, it reported a similar relative position.

The Zamba2-2.7B variant has been positioned as a competitive entry in the small-model class against Microsoft Phi-2 and Phi-3, Qwen-2.5 sub-3B variants, and Llama-3.2-3B. The family targets the per-dollar inference economics segment of the open-weights ecosystem, where Mistral, Microsoft Phi, Qwen, and DeepSeek small variants all compete.

The standard horizontal language model benchmarks include the Artificial Analysis Intelligence Index (which generally treats Zamba2 as competitive within the small-model tier rather than as a frontier-tier entry), MMLU, ARC, GSM8K, and HumanEval. Direct head-to-head benchmark publications versus Mistral and Llama on shared evaluation infrastructure are documented in the Zamba2 technical report.

Benchmark leadership in the small-and-efficient language model category rotates rapidly. Zamba2's positioning is representative of the late-2024 open-weights small-model landscape; subsequent releases from peer labs (including Phi-4 from Microsoft, Qwen-3 small variants, and Mistral small variants) have continued to compress performance differences across the segment.

Access and pricing

Zamba2 weights are distributed through the Zyphra organization on Hugging Face, with separate model repositories for each size and for base versus instruction-tuned variants. The base repositories are Zyphra/Zamba2-7B, Zyphra/Zamba2-2.7B, and Zyphra/Zamba2-1.2B, with instruction-tuned counterparts such as Zyphra/Zamba2-1.2B-instruct for each size.

Self-hosted deployment of the open-weights variants is free under the model license. Inference cost depends on the user's own hardware. The PyTorch reference implementation is at Zyphra/Zamba2 on GitHub, and the model is also supported through the Hugging Face Transformers library directly.
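A minimal self-hosted inference sketch via the Transformers library follows. It assumes a transformers release with Zamba2 support and enough memory for the chosen checkpoint; the prompt and generation arguments are illustrative, and the model card on Hugging Face should be treated as the authoritative usage reference:

```python
# Minimal self-hosted inference sketch via Hugging Face Transformers.
# Assumes a transformers version with Zamba2 support and sufficient
# GPU/CPU memory; consult the model card for authoritative usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-2.7B"  # base model; instruct variants also exist
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("State-space models process sequences",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For dialogue use, the instruction-tuned repositories with the tokenizer's chat template are the more appropriate starting point than the base model shown here.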

The 7B variant is feasible to deploy on consumer-grade GPU configurations, with the 2.7B and 1.2B variants targetable at edge and on-device deployments. The Mamba2 architecture's lower KV cache footprint matters most when running with long contexts or many concurrent requests, where the memory advantage relative to standard transformers is most pronounced.

Zyphra has not published a hosted API for the Zamba2 family directly. Hosted inference is available through third-party providers including Together AI and adjacent inference platforms.

Comparison

Direct competitors and adjacent open-weights small language models:

  • Mistral-7B and Mistral 7B Instruct (Mistral AI). The principal European open-weights small-model peer at the 7B scale. Mistral-7B was an early benchmark for the open-weights small-model category. Zamba2-7B reports leading benchmark performance versus Mistral-7B on standard evaluations at the November 2024 release.
  • Llama-3-8B (Meta AI / FAIR). The principal US open-weights small-model peer. Llama-3-8B has broader community ecosystem and longer-running tooling support; Zamba2-7B claims hybrid-architecture efficiency advantages at competitive benchmark performance.
  • Gemma-7B (Google DeepMind). The Google open-weights small-model peer. Adjacent benchmark positioning with different distribution channels.
  • Phi-3 and Phi-4 (Microsoft). The Microsoft small-language-model line. Phi targets dense small models with curated training data; Zamba2 targets hybrid-architecture efficiency. Phi-4 has reached the upper tier of small-model benchmark performance through 2025 and 2026.
  • Qwen-2.5 small variants (Alibaba Qwen). Chinese open-weights peers at the 1.5B, 3B, and 7B scales. Apache 2.0 licensing matches Zamba2 distribution posture.
  • DeepSeek small variants (DeepSeek). Chinese open-weights peers including DeepSeek-V2-Lite at 16B total / 2.4B active. Sparse-MoE alternative to Zamba2's dense-with-shared-blocks design.
  • Arcee AFM-4.5B (Arcee AI). Direct enterprise-small-model peer at adjacent parameter scale. Arcee positions on enterprise compliance and data residency rather than architectural efficiency.

Zamba2's distinctive position among 2024-vintage open-weights small language models rests on three things: the hybrid Mamba2-transformer architecture with shared attention blocks and LoRA-decorated MLPs, the time-to-first-token and KV cache memory advantages that the architecture enables, and competitive benchmark performance on the standard small-model evaluation suite at release.

Outlook

Open questions for Zamba2 over the next 6 to 18 months:

  • Successor model cadence. Zamba2 was Zyphra's principal public model line through 2024. Subsequent releases (a Zamba3 or comparable successor) on the AMD-and-IBM training infrastructure announced October 2025 will signal the company's research throughput.
  • Maia agent integration. Zyphra is building Maia, a general-purpose superagent for knowledge-work productivity. Whether Zamba models serve as the underlying backbone of Maia, or whether Maia uses larger frontier-tier models from peer labs, will affect the strategic role of the Zamba family.
  • Hybrid architecture mainstream adoption. Mamba2 and SSM-transformer hybrid architectures remain a niche category against the dominant pure-transformer baseline. Whether the Zamba2 efficiency claims propagate into broader adoption (or whether peer labs catch up via alternative efficiency techniques) is the principal architectural question.
  • Benchmark cadence against peer small models. Mistral, Microsoft Phi, Qwen, and DeepSeek all release small-model variants on rapid cadences. Zamba2's relative position has been characterized as competitive at release; sustained competitive standing through 2026 depends on continued releases.
  • Commercial-grade deployment evidence. Self-hosted deployment of Zamba2 has been documented in the open-weights community. Named enterprise reference customers and adoption depth are watchable signals for the family's commercial traction.

About the author
Nextomoro

nextomoro tracks progress for AI research labs, models, and what's next.

AI Research Lab Intelligence
