Kimi K2.5

Kimi K2.5 is Moonshot AI's January 2026 open-weights multimodal flagship, a 1-trillion-parameter mixture-of-experts model with 32 billion active parameters, native image and video understanding, and a 256K context window under a modified MIT license.

Kimi K2.5 is the January 2026 generation of Moonshot AI's open-weights Kimi model family, a 1-trillion-parameter mixture-of-experts language model with 32 billion active parameters per token, a native 256K context window, and integrated vision support across image and video input. The model is the first natively multimodal entry in the Kimi K-line, introducing the MoonViT vision encoder alongside the language backbone and adding agent-swarm coordination as the strategic differentiator for long-horizon task workflows. As of May 2026, Kimi K2.5 sits in the leading tier of Chinese-origin open-weights frontier models alongside DeepSeek V4, Qwen 3.6, and GLM-5.1, with subsequent K2.6 micro-versions extending the family's reach into the second quarter of 2026.

At a glance

  • Lab: Moonshot AI.
  • Released: January 29, 2026.
  • Modality: Native multimodal. Text, image, and video input through an integrated 400-million-parameter MoonViT vision encoder. Text output.
  • Open weights: Yes. Modified MIT license.
  • Architecture: Sparse mixture-of-experts. 1 trillion total parameters with 32 billion active per token. 384 total experts, 8 selected experts plus 1 shared expert per token, 61 layers (including one dense layer), Multi-head Latent Attention (MLA), 160,000-token vocabulary. (A routing sketch follows this list.)
  • Context window: 256,000 tokens.
  • Pricing: Open weights, free to self-host. Hosted-inference pricing through the Moonshot platform and third-party providers (including Fireworks AI) varies by provider and tier; consistent per-token figures had not been published across providers at the time of writing.
  • Distribution channels: Hugging Face, the Moonshot platform API at platform.moonshot.ai, the consumer Kimi chat interface at kimi.com, the Kimi Code IDE at kimi.com/code, and third-party hosted-inference providers.
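
The expert configuration above follows the now-standard sparse-MoE routing pattern: a router scores the experts for each token, the top 8 are activated and their outputs weighted, and a shared expert runs on every token. The snippet below is a minimal illustrative sketch of that pattern only, not Moonshot's implementation; the hidden sizes are deliberately small placeholders.

```python
# Illustrative sketch of the routing pattern described above: 384 routed
# experts, top-8 selection per token, plus one always-on shared expert.
# NOT Moonshot's implementation; dimensions are small placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=384, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The shared expert runs on every token, alongside the routed experts.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = self.router(x)                           # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # normalise over the selected experts
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                        # naive loops for clarity
            for k in range(self.top_k):
                routed[t] = routed[t] + weights[t, k] * self.experts[int(idx[t, k])](x[t])
        return self.shared_expert(x) + routed

layer = TinyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64]); only 9 of 384 experts ran per token
```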

Origins

Kimi K2.5 follows the Kimi K2 release of mid-2025 and represents the third major generation in the K-line of frontier-grade Moonshot AI models. Moonshot AI is one of the principal Chinese frontier-model labs, alongside DeepSeek, Alibaba Qwen, and Z.ai, and the K2.5 release maintains the cadence of approximately quarterly major releases that the Chinese open-weights cohort established across 2025 and 2026.

The architectural direction in K2.5 represents two significant generational changes from K2. First, the move to native multimodality: where K2 was a text-only model, K2.5 integrates the 400-million-parameter MoonViT vision encoder during pre-training, enabling image and video input without the post-hoc adapter approach that earlier text-first models used. The model card frames K2.5 as "pre-trained on vision-language tokens" rather than fine-tuned for vision, which is the structural design choice that distinguishes the family's multimodal capability from text-first peers that added vision later.

The second generational change is the scale increase to the 1-trillion-total / 32-billion-active configuration. The earlier K2 generation operated at a smaller total-parameter ceiling; K2.5's 1-trillion-total scale places it in the same capacity tier as DeepSeek V4 Pro (1.6 trillion total). The 32-billion-active count gives K2.5 a per-token inference cost between the smaller active-parameter peers (Qwen 3.6 at 3 billion, MiniMax M2 at 10 billion) and the larger DeepSeek V4 Pro (49 billion).

The agent-swarm coordination capability is the third distinctive element. The K2.5 model card introduces multi-agent task decomposition as a first-class capability: rather than single-agent execution against a tool stack, K2.5 is designed to spawn coordinated sub-agents that work in parallel on different aspects of a complex task. The BrowseComp and WideSearch benchmark configurations both report "Agent Swarm" mode scores that materially exceed single-agent performance.

The release places Moonshot AI at the visual-agent end of the Chinese open-weights spectrum, complementing DeepSeek's text-frontier focus, Alibaba Qwen's broad-multimodal capability across the 3.x generations, and Z.ai's agentic-engineering specialisation. The publicly cited technical paper (arXiv 2602.02276, titled "Kimi K2.5: Visual Agentic Intelligence") names the strategic positioning directly.

Capabilities

The Kimi K2.5 capability profile spans four principal axes: visual understanding, agent-swarm coordination, coding (including code generation from visual specifications), and reasoning.

Visual understanding is the headline differentiator for the K-line's first multimodal flagship. On MMMU-Pro, the model reports 78.5 percent. On MathVision, 84.2 percent. On OCRBench, 92.3 percent (a leading position among open-weights models). On VideoMME and LongVideoBench, 87.4 and 79.8 percent respectively. The video benchmarks in particular reflect the native pre-training approach: rather than processing video as a sequence of independently encoded frames, K2.5 operates on a temporally coherent token stream that preserves cross-frame state.

Agent-swarm coordination is the second principal capability. On BrowseComp in the agent-swarm configuration, K2.5 reports 78.4 percent. On WideSearch in the same configuration, 79.0 percent. The agent-swarm mode is positioned as the structural answer to the long-horizon-task scaling problem: complex tasks decompose into parallel sub-agent workstreams, each of which operates with a focused context window, and the results compose into the final user-facing output. The model card identifies the agent-swarm capability as appropriate for complex research, multi-step browsing, and long-document synthesis tasks.
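
The published material describes the swarm behaviour at the capability level rather than as an open protocol, so the following is only a conceptual sketch of the fan-out/fan-in pattern the model card describes: a planner call decomposes the task, sub-agents run in parallel on focused contexts, and a final call composes the results. The `call_model` helper and the prompts are hypothetical placeholders, not Moonshot's first-party swarm tooling.

```python
# Conceptual sketch of the agent-swarm pattern described above. `call_model`
# is a stand-in for any chat-completion request; every name here is hypothetical.
import asyncio

async def call_model(system: str, user: str) -> str:
    """Placeholder for a chat-completion call (e.g. via an OpenAI-compatible client)."""
    await asyncio.sleep(0)  # a network round-trip would happen here
    return f"[model output for: {user[:40]}...]"

async def run_subagent(subtask: str) -> str:
    # Each sub-agent sees only its own subtask, keeping its context window focused.
    return await call_model("You are a research sub-agent.", subtask)

async def agent_swarm(task: str) -> str:
    plan = await call_model("Decompose the task into independent subtasks, one per line.", task)
    subtasks = [line for line in plan.splitlines() if line.strip()]
    # Fan out: sub-agents execute in parallel rather than sequentially.
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # Fan in: a final call composes the sub-agent outputs into the user-facing answer.
    return await call_model("Synthesise these findings into one answer.", "\n\n".join(results))

if __name__ == "__main__":
    print(asyncio.run(agent_swarm("Survey recent open-weights multimodal model releases.")))
```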

Coding capability is anchored by SWE-Bench Verified at 76.8 percent and the harder SWE-Bench Pro at 50.7 percent, LiveCodeBench at 85.0 percent, and Terminal Bench 2.0 at 50.8 percent. The distinctive coding capability for K2.5 is what the model card calls "coding with vision": the ability to generate code from visual specifications such as UI designs, hand-drawn wireframes, or video workflows demonstrating the desired behaviour. This capability is enabled by the native multimodal pre-training and is positioned as a distinct workflow from text-only code generation.
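
As a rough illustration of the coding-with-vision workflow, the sketch below sends a wireframe image to an OpenAI-compatible chat endpoint (the Access and pricing section below notes that the Moonshot platform exposes one) and asks for an implementation. The base URL, model identifier, and image-input support over that route are assumptions rather than confirmed parameters.

```python
# Hedged sketch: generating code from a UI mock-up over an OpenAI-compatible
# endpoint. Base URL, model id, and image support on this route are assumptions.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.moonshot.ai/v1",  # assumption: platform endpoint path
    api_key="YOUR_API_KEY",
)

with open("wireframe.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kimi-k2.5",   # assumption: hosted model identifier
    temperature=0.6,     # instant-mode setting per the model card
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Implement this screen as a single React component."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```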

Reasoning and knowledge capability is anchored by AIME 2025 at 96.1 percent (one of the higher published figures for any open-weights model on this benchmark at the time of release), GPQA-Diamond at 87.6 percent, MMLU-Pro at 87.1 percent, and Humanity's Last Exam (HLE-Full with tools) at 50.2 percent. The dual-mode operation distinguishes K2.5 from peers: a thinking mode (recommended temperature 1.0) that emits extended reasoning traces before responding, and an instant mode (recommended temperature 0.6) optimised for fast-path responses.

Long-context capability is supported by the 256K context window and reflected in the LongBench v2 score of 61.0 percent and the AA-LCR (long-context reasoning) score of 70.0 percent.

Benchmarks and standing

Kimi K2.5 reports the following benchmark positions at release:

  • AIME 2025: 96.1 percent
  • GPQA-Diamond: 87.6 percent
  • MMLU-Pro: 87.1 percent
  • HLE-Full (with tools): 50.2 percent
  • MMMU-Pro: 78.5 percent
  • MathVision: 84.2 percent
  • OCRBench: 92.3 percent
  • VideoMME: 87.4 percent
  • LongVideoBench: 79.8 percent
  • SWE-Bench Verified: 76.8 percent
  • SWE-Bench Pro: 50.7 percent
  • LiveCodeBench: 85.0 percent
  • Terminal Bench 2.0: 50.8 percent
  • LongBench v2: 61.0 percent
  • AA-LCR (long-context reasoning): 70.0 percent
  • BrowseComp (Agent Swarm): 78.4 percent
  • WideSearch (Agent Swarm): 79.0 percent

The combined profile places Kimi K2.5 in the top tier of open-weights frontier-grade models across reasoning, coding, multimodal, and agent axes at release. The AIME 2025 score of 96.1 percent is among the highest published for any open-weights model on that benchmark, and the OCRBench score of 92.3 percent leads the open-weights cohort at the time of release. The agent-swarm BrowseComp and WideSearch results are first-party configurations and represent a different evaluation mode than the single-agent figures peers typically report; the magnitude is genuine but the comparison is not strictly apples-to-apples against single-agent baselines.

Benchmark leadership is point-in-time. The subsequent Kimi K2.6 refresh extends the family's position into the second quarter of 2026, and the next major Moonshot release is expected in the second half of 2026.

Access and pricing

Kimi K2.5 ships under a modified MIT license, permitting research and commercial use. Distribution channels:

  • Hugging Face Hub as the primary open-weights release.
  • Moonshot platform API at platform.moonshot.ai, with OpenAI-compatible and Anthropic-compatible endpoints.
  • Kimi consumer chat at kimi.com.
  • Kimi Code IDE at kimi.com/code, the company's first-party developer surface optimised for the coding-with-vision capabilities.
  • Fireworks AI hosts the K2.5 family in its multimodal-models catalog; the Fireworks entry uses the 262K-context configuration, with an FP8-quantised variant as the principal hosted form.
  • Local deployment quantisations: native INT4 quantisation support (the same method used for Kimi K2 Thinking), with llama.cpp, Ollama, LM Studio, and Jan integration for consumer-scale local inference.
  • Deployment frameworks: vLLM, SGLang, KTransformers (a minimal vLLM sketch follows this list). Minimum transformers version 4.57.1 for full feature support.
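
The following is a minimal vLLM offline-inference sketch, illustrative of the framework support listed above rather than a validated deployment recipe. The Hugging Face repo id, tensor-parallel degree, and hardware footprint are assumptions; a 1-trillion-parameter MoE realistically requires a multi-GPU node and the quantised weights noted in the list.

```python
# Minimal vLLM sketch for self-hosted inference. Repo id and parallelism
# settings are assumptions, not confirmed values from the release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2.5",   # assumption: the actual repo id may differ
    tensor_parallel_size=8,         # assumption: sized for a multi-GPU node
    trust_remote_code=True,
)

# Thinking-mode sampling at the recommended temperature (see the paragraph below).
params = SamplingParams(temperature=1.0, max_tokens=1024)
outputs = llm.generate(["Summarise the trade-offs of sparse mixture-of-experts models."], params)
print(outputs[0].outputs[0].text)
```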

Recommended sampling parameters: thinking mode uses temperature 1.0; instant mode uses temperature 0.6. The thinking content (within <think> blocks) must be preserved across multi-turn conversations for the model to retain its reasoning context.
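
A hedged sketch of that multi-turn pattern follows, using the OpenAI-compatible endpoint and placeholder identifiers: thinking mode runs at the recommended temperature of 1.0, and the assistant reply is appended to the history verbatim, <think> block included, so the reasoning context carries into the next turn.

```python
# Hedged sketch of the multi-turn pattern described above. Base URL and
# model id are assumptions; the key point is keeping the <think> content.
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_API_KEY")
MODEL = "kimi-k2.5"  # assumption: hosted model identifier

messages = [{"role": "user", "content": "Plan a benchmark suite for long-context retrieval."}]
first = client.chat.completions.create(model=MODEL, temperature=1.0, messages=messages)

# Preserve the reply verbatim, <think> block included, rather than stripping it.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now trim the plan to a one-week effort."})

second = client.chat.completions.create(model=MODEL, temperature=1.0, messages=messages)
print(second.choices[0].message.content)
```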

Comparison

  • Kimi K2.6 (Moonshot AI). The direct successor in the K-line, sitting alongside K2.5 in third-party catalogs and representing the subsequent micro-version refresh.
  • DeepSeek V4 (DeepSeek). The principal Chinese frontier-tier open-weights peer at comparable scale. DeepSeek V4 Pro is materially larger (1.6 trillion total, 49 billion active) but text-only. The multimodal-versus-text distinction is one of the principal competitive axes.
  • Qwen 3.6 (Alibaba Qwen). The Alibaba open-weights peer at smaller scale but with comparable multimodal capability. The agent-swarm capability is the principal differentiator for Kimi K2.5.
  • GLM-5.1 (Z.ai). The Z.ai open-weights peer focused on agentic engineering. K2.5 is broader in capability surface; GLM-5.1 is more focused on the agent-engineering specialisation.
  • MiniMax M2 (MiniMax). Another Chinese open-weights peer at smaller scale with text-only modality. K2.5 leads on multimodal capability; M2 leads on Artificial Analysis composite ranking among models in its scale band.

The Chinese open-weights frontier-tier set in May 2026 is unusually crowded, with each principal lab differentiating on a distinct axis (DeepSeek on capacity ceiling, Qwen on broad multimodal capability, Z.ai on agentic engineering, Moonshot on visual-agent capability, MiniMax on inference economics).

Outlook

Open questions for the next 6 to 18 months:

  • Coding-with-vision adoption. Generating code directly from UI designs, wireframes, and workflow videos is a distinctive product surface. Whether Kimi Code IDE adoption scales materially, and whether other IDE vendors integrate the capability through the Moonshot API, will indicate its practical traction.
  • Agent-swarm framework adoption. The agent-swarm coordination is currently a model-level capability with first-party tooling. Whether the community builds adapter layers (LangChain, LlamaIndex, AutoGen, CrewAI integrations) for the Kimi agent-swarm primitives will indicate ecosystem traction.
  • Successor cadence. The K2.6 micro-version followed K2.5 within roughly three months. Whether the cadence continues at this pace and what scale or capability jump appears next is the central roadmap question.
  • Independent multimodal-benchmark reproduction. The MMMU-Pro, MathVision, OCRBench, VideoMME, and LongVideoBench scores are first-party reports. Independent reproductions on the standard open multimodal leaderboards will determine whether the headline positions hold against the most-current peer releases.
  • Long-context utilisation. The 256K context window is large but smaller than peers (Qwen 3.6 at 1M with YaRN). Whether Moonshot extends the context further in a successor variant, and whether the long-context evaluations show usable retrieval and reasoning across the full window, are open questions.
