DeepSeek V3

DeepSeek V3 is an open-weights large language model released by DeepSeek on December 26, 2024, built on a mixture-of-experts architecture with 671 billion total parameters and 37 billion activated per token. The model is distributed under the MIT license through Hugging Face and GitHub, served via the DeepSeek API at api.deepseek.com, and powers the consumer chat product at chat.deepseek.com. As of its release, DeepSeek V3 was the most capable open-weights large language model on standard educational and coding benchmarks, and the public reporting of its approximately $5.6 million training cost reframed industry assumptions about the capital required to train a frontier-tier model.

At a glance

  • Lab: DeepSeek
  • Released: December 26, 2024 (initial V3); March 24, 2025 (V3-0324 update); September 29, 2025 (V3.2-Exp)
  • Modality: Text
  • Open weights: Yes. Weights and inference code released on Hugging Face and GitHub under the MIT license (V3-0324 and later updates use full MIT terms; the original December 2024 release used a custom DeepSeek License with similar permissive terms).
  • Context window: 128,000 tokens
  • Pricing: $0.40 per million input tokens, $0.89 per million output tokens (V3 December 2024); cache-hit input pricing reduced to approximately $0.07 per million tokens. Pricing has shifted across V3 sub-versions; V3.2 is offered at $0.28 per million input and $0.42 per million output tokens.
  • Distribution channels: deepseek-ai/DeepSeek-V3 on Hugging Face, GitHub deepseek-ai/DeepSeek-V3, DeepSeek API, consumer chat at chat.deepseek.com

Origins

DeepSeek V3 is the third major version of the DeepSeek language model family, following DeepSeek-LLM (a dense 67-billion-parameter line in late 2023) and DeepSeek V2 (a 236-billion-parameter mixture-of-experts model in May 2024). The V2 release introduced two architectural ideas that became foundational for V3: Multi-head Latent Attention (MLA), a compressed attention variant that reduces the memory footprint of the key-value cache at long context, and the DeepSeekMoE routing scheme, which uses fine-grained expert specialization rather than the coarser MoE layouts then standard in open-weights releases.

V3 scaled both ideas. The model was pre-trained on 14.8 trillion tokens of multilingual data, with 671 billion total parameters distributed across 256 routed experts and 1 shared expert per layer; 8 experts are selected per token at inference, producing the 37 billion active-parameter figure. Training used 2,048 Nvidia H800 GPUs, the export-control-compliant variant of the H100 sold into China, connected via NVLink within nodes and InfiniBand between them. DeepSeek's reported total training cost of approximately $5.576 million covered 2.788 million H800 GPU hours across pre-training, context extension, and fine-tuning. The figure assumed a $2 per GPU-hour rental cost and excluded hardware capital costs, prior research, and post-training reinforcement-learning expenditures, but it was nonetheless an order of magnitude below the typical reported budgets for comparable proprietary frontier models.
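
The arithmetic behind the headline numbers is easy to check. The back-of-the-envelope sketch below uses per-layer dimensions as reported for the public model configuration (treat them here as assumptions); it counts expert parameters only, ignoring attention, embeddings, and the dense early layers, which is why its totals land somewhat below the headline figures.

```python
# Back-of-the-envelope check of DeepSeek V3's parameter and cost figures.
# Dimensions below are assumptions drawn from the public model config;
# attention, embedding, and dense-layer parameters are ignored.

HIDDEN = 7168        # model width
MOE_INTER = 2048     # per-expert FFN intermediate size
N_ROUTED = 256       # routed experts per MoE layer
N_SHARED = 1         # always-on shared expert per MoE layer
TOP_K = 8            # routed experts selected per token
MOE_LAYERS = 58      # MoE layers (the first 3 of 61 layers are dense)

# A SwiGLU-style expert has gate, up, and down projections.
params_per_expert = 3 * HIDDEN * MOE_INTER

total_experts = MOE_LAYERS * (N_ROUTED + N_SHARED) * params_per_expert
active_experts = MOE_LAYERS * (TOP_K + N_SHARED) * params_per_expert

print(f"expert params, total:  {total_experts / 1e9:.0f}B")   # ~656B of 671B
print(f"expert params, active: {active_experts / 1e9:.1f}B")  # ~23B of 37B

# Reported training cost: 2.788M H800 GPU-hours at an assumed $2/hour.
print(f"training cost: ${2.788e6 * 2 / 1e6:.3f}M")            # $5.576M
```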

The technical report documented several engineering choices the company credited for the cost figure. Mixed-precision arithmetic used 8-bit floating point (FP8) for most matrix multiplications, with fine-grained per-tile scaling and higher-precision accumulation reserved for numerically sensitive operations. An auxiliary-loss-free load-balancing strategy replaced the standard MoE auxiliary loss. A multi-token-prediction training objective produced a denser learning signal per gradient step. The DualPipe pipeline-parallelism scheme overlapped forward and backward computation with communication, raising hardware utilization on H800 clusters, which lack the full NVLink bandwidth of Nvidia's flagship chips.
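
A toy illustration of the fine-grained scaling idea behind the FP8 recipe: quantizing each small tile of a tensor with its own scale keeps a single outlier from wrecking the precision of everything around it. The snippet below is a NumPy simulation, not real FP8 arithmetic; the 128-wide tiles and the E4M3 maximum are the only details taken from the report, and the integer rounding is a stand-in for the actual 8-bit cast.

```python
import numpy as np

FP8_MAX = 448.0  # largest representable magnitude in the E4M3 format
TILE = 128       # each 1x128 activation tile gets its own scale

def quantize_tiles(x: np.ndarray):
    """Quantize each 1xTILE tile of a (rows, cols) matrix independently."""
    rows, cols = x.shape
    tiles = x.reshape(rows, cols // TILE, TILE)
    scales = np.abs(tiles).max(axis=-1, keepdims=True) / FP8_MAX
    q = np.round(tiles / scales)  # stand-in for the real E4M3 cast
    return q, scales

def dequantize_tiles(q, scales, shape):
    return (q * scales).reshape(shape)

x = np.random.default_rng(0).normal(size=(4, 256)).astype(np.float32)
x[0, 0] = 80.0  # inject a large outlier into one tile
q, s = quantize_tiles(x)
err = np.abs(dequantize_tiles(q, s, x.shape) - x).max()
print(f"max round-trip error: {err:.4f}")  # small; the outlier only
                                           # coarsens its own tile
```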

Initial reception in the open-weights research community was strong. The broader market reaction came one month later with the January 20, 2025 release of DeepSeek-R1, the reasoning model trained with reinforcement learning on V3, which produced the late-January 2025 selloff in US AI hardware stocks colloquially known as the "DeepSeek shock." V3's role in that episode was central: R1 was built on it, and the V3 training-cost figure anchored the broader argument that the cost of frontier-tier capability had been compressed by Chinese open-weights research more aggressively than US incumbents had assumed.

The V3 line received a substantial mid-cycle update on March 24, 2025, designated DeepSeek-V3-0324, which raised the parameter count slightly (to approximately 685 billion) and improved benchmark scores across reasoning, coding, and instruction following. A further update, DeepSeek-V3.2-Exp, followed on September 29, 2025, introducing sparse attention and further price reductions. V3 remained the active production line until the April 2026 preview release of DeepSeek V4.

Capabilities

DeepSeek V3 handles general-purpose text generation, multi-turn dialogue, instruction following, code generation, mathematical reasoning, and document analysis. The model was trained on a multilingual corpus with substantial Chinese and English coverage, and on Chinese-language tasks it consistently leads the open-weights tier.

The mixture-of-experts configuration is the dominant architectural feature. Of 671 billion total parameters, 37 billion are activated per token, which keeps inference compute manageable while maintaining effective capacity comparable to substantially larger dense models. DeepSeekMoE routing uses 256 fine-grained routed experts plus 1 shared expert per layer, with 8 experts selected per token. The fine-grained design encourages greater expert specialization than coarser MoE configurations.
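
The sketch below illustrates the routing pattern with toy sizes standing in for V3's 256 experts and top-8 selection. It is a simplification, not V3's implementation; among other differences, V3 normalizes sigmoid affinities over the selected experts rather than applying a softmax.

```python
import numpy as np

N_EXPERTS, TOP_K, D = 16, 2, 32  # toy stand-ins for 256, 8, and 7168

rng = np.random.default_rng(0)
router_w = rng.normal(size=(D, N_EXPERTS))
experts = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(N_EXPERTS)]
shared_expert = rng.normal(size=(D, D)) / np.sqrt(D)

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router_w              # token's affinity to each expert
    top = np.argsort(scores)[-TOP_K:]  # keep only the TOP_K best experts
    gates = np.exp(scores[top])
    gates /= gates.sum()               # normalized gate weights
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return out + x @ shared_expert     # the shared expert always fires

y = moe_layer(rng.normal(size=D))
print(y.shape)  # (32,): only TOP_K + 1 of the 17 experts did any work
```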

Multi-head Latent Attention compresses the key-value cache through a low-rank projection, reducing memory consumption at long context relative to standard multi-head attention. The mechanism makes the 128,000-token context window tractable on practical hardware and is a structural enabler of the model's economics for both training and inference.
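
A minimal sketch of the compression step, with illustrative sizes rather than V3's: per-head keys and values are never cached; a single low-rank latent per token is cached and expanded when attention runs. The real mechanism also carries a small decoupled positional (RoPE) component that this sketch omits.

```python
import numpy as np

D, N_HEADS, HEAD_DIM, D_LATENT = 1024, 16, 64, 128  # illustrative sizes

rng = np.random.default_rng(0)
w_down = rng.normal(size=(D, D_LATENT)) / np.sqrt(D)      # compress
w_up_k = rng.normal(size=(D_LATENT, N_HEADS * HEAD_DIM))  # expand to K
w_up_v = rng.normal(size=(D_LATENT, N_HEADS * HEAD_DIM))  # expand to V

x = rng.normal(size=(4096, D))  # 4,096 tokens of context
latent_cache = x @ w_down       # the only per-token state MLA stores

# Keys and values are reconstructed from the latent at attention time:
k = latent_cache @ w_up_k
v = latent_cache @ w_up_v

mha_floats = 2 * N_HEADS * HEAD_DIM  # floats/token for a standard KV cache
mla_floats = D_LATENT                # floats/token for the MLA latent
print(f"cache shrink: {mha_floats / mla_floats:.0f}x")  # 16x at these sizes
```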

The auxiliary-loss-free load-balancing strategy is a consequential training innovation. Standard MoE training uses an auxiliary loss term to encourage even distribution of tokens across experts, which interferes with the primary language-modeling gradient. V3's bias-based routing adjustment achieves load balance without adding loss terms.
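
A toy simulation of the mechanism, with an assumed update speed: the bias shifts which experts get selected but never enters the gate weights or the loss, and after each batch it is nudged against whichever experts were overloaded.

```python
import numpy as np

N_EXPERTS, TOP_K, GAMMA = 16, 2, 0.001  # GAMMA is an assumed update speed
bias = np.zeros(N_EXPERTS)

def select_experts(scores: np.ndarray) -> np.ndarray:
    """Pick TOP_K experts per token; the bias affects selection only.
    Gate weights (not shown) would still come from the raw scores."""
    return np.argsort(scores + bias, axis=-1)[:, -TOP_K:]

def update_bias(selected: np.ndarray) -> None:
    """After each batch, penalize overloaded experts and boost idle ones."""
    global bias
    load = np.bincount(selected.ravel(), minlength=N_EXPERTS)
    bias -= GAMMA * np.sign(load - load.mean())

rng = np.random.default_rng(0)
for _ in range(100):  # simulate 100 training batches of 512 tokens
    update_bias(select_experts(rng.normal(size=(512, N_EXPERTS))))
print(bias.round(3))  # experts the router favored have been biased down
```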

V3 is not a reasoning model in the chain-of-thought sense. Extended reasoning capability was introduced through R1, which applied reinforcement learning on top of V3 to produce chain-of-thought traces. V3 itself produces direct responses without extended internal computation, which keeps inference latency low but limits performance on the most reasoning-intensive benchmarks relative to dedicated reasoning models from the same period.

Benchmarks and standing

At release in December 2024, DeepSeek V3 scored 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA Diamond, placing it ahead of every other open-weights model on these axes and within range of the leading closed-weights models of the same period (GPT-4o and Claude 3.5 Sonnet). On HumanEval the model scored 65.2; on MATH, 61.6; on GSM8K, 89.3; on BBH, 87.5.

The Artificial Analysis Intelligence Index placed the December 2024 release at 16, in the upper range of the open-weights tier but below the closed-weights frontier. The index is a composite metric and has evolved methodologically since 2024.

The March 2025 V3-0324 update raised most reported benchmarks meaningfully. AIME 2025 climbed from 70.0 to 87.5, GPQA from 71.5 to 81.0, LiveCodeBench v6 from 63.5 to 73.3, and Aider from 57.0 to 71.6. The V3.2-Exp variant later in 2025 reported SWE-bench Verified scores in the 72 to 74 range and LiveCodeBench scores in the 80s.

On LMArena's general human-preference leaderboard, V3 entered the top tier of open-weights models at release and held a strong position through the first half of 2025 before being overtaken by successors and by competing open-weights frontier releases (Kimi K2, Qwen 3). The combination of MoE efficiency, MIT licensing, and competitive coding scores made V3 a default open-weights choice for developer-facing applications through 2025. Benchmark leadership rotates rapidly; V3's positions are representative of the December 2024 through late 2025 landscape.

Access and pricing

DeepSeek V3 weights are distributed through deepseek-ai/DeepSeek-V3 on Hugging Face and GitHub deepseek-ai/DeepSeek-V3. The original December 2024 release used a custom DeepSeek License with permissive commercial terms; the March 2025 V3-0324 release moved the line to standard MIT.

Hosted API access is available at api.deepseek.com with OpenAI-compatible endpoints. Pricing for the December 2024 V3 release was $0.40 per million input tokens and $0.89 per million output tokens, with cache-hit reduction to approximately $0.07 per million input tokens. V3.2-Exp is priced at $0.28 per million input tokens and $0.42 per million output tokens, reflecting both efficiency gains from sparse attention and the broader Chinese-LLM price compression of late 2024 and 2025.
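
Because the endpoints are OpenAI-compatible, the standard openai Python client works with only a base-URL change; the model name "deepseek-chat" routes to the current V3-line model. A minimal sketch, with cost arithmetic at the December 2024 list prices quoted above:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY",  # supply your own key
                base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize MLA in one sentence."}],
)
print(resp.choices[0].message.content)

# Estimate the request cost at the December 2024 list prices.
usage = resp.usage
cost = usage.prompt_tokens * 0.40e-6 + usage.completion_tokens * 0.89e-6
print(f"request cost: ${cost:.6f}")
```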

The consumer chat product at chat.deepseek.com provides free public access to V3 and successor models, including R1 reasoning mode. The DeepSeek mobile applications became among the most-downloaded AI apps globally following the January 2025 R1 release.

For self-hosted deployment, the V3 weight set requires multi-GPU server configurations. The 37-billion active-parameter inference profile is tractable on standard tooling including vLLM and SGLang, though full-precision deployment of the 671-billion-parameter weight set requires substantial GPU memory.
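
A minimal self-hosting sketch with vLLM follows; the tensor-parallel degree is illustrative and must match the GPUs actually available, and quantized weights are typically needed to fit the full checkpoint on smaller clusters.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,   # spread the weight set across 8 GPUs
    trust_remote_code=True,   # the repo ships custom model code
)
outputs = llm.generate(["Explain top-k expert routing briefly."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```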

Comparison

Direct competitors to DeepSeek V3 across its 2024 to 2025 active period:

  • GPT-4o (OpenAI). Closed-weights frontier model from the same period as V3. On MMLU and GPQA, V3 was within a few points of GPT-4o at release; on HumanEval and MATH, GPT-4o held a modest edge. GPT-4o was API-only at substantially higher per-token prices, while V3 offered open-weights deployment alongside the hosted API at much lower cost.
  • Claude 3.5 Sonnet (Anthropic). Anthropic's mainline closed-weights model from mid-2024. V3 reported MMLU scores comparable to Claude 3.5 Sonnet and trailed slightly on coding-specific benchmarks. Claude's advantage was on safety-tuned task performance and US-origin enterprise certifications; V3's advantage was open-weights distribution and aggressive pricing.
  • Llama 4 (Meta AI). The primary US open-weights peer at the frontier tier. Llama 4 Maverick uses a 400-billion-parameter MoE with 17 billion active parameters; V3 uses 671 billion total with 37 billion active. On English-language coding benchmarks the two are competitive; V3 leads on Chinese-language tasks where Llama's training data is less comprehensive. The choice between them turns on geographic supply-chain and regulatory considerations.
  • Qwen 3 (Alibaba Qwen). The other major Chinese open-weights line released through 2025. Qwen 3 covers a broader variant spectrum with multimodal sub-lines for vision, audio, coding, and math; V3's distinguishing position is the larger total parameter count and pure-text focus. Apache 2.0 licensing on most Qwen 3 variants matches V3-0324's MIT.
  • Kimi K2 (Moonshot AI). The third major Chinese open-weights frontier release, launched July 2025 with 1 trillion total parameters and 32 billion active. Kimi K2's coding-specific benchmarks led DeepSeek V3 at the K2 release date, reflecting Moonshot's emphasis on agentic coding. V3's broader benchmark profile and earlier release date positioned it as the more widely deployed of the two through late 2025.
  • DeepSeek V4 (DeepSeek). V3's successor within the same lab, released in preview April 2026 with 1.6 trillion total parameters, 49 billion active, a 1-million-token context window, and Huawei Ascend integration. V4 supersedes V3 as DeepSeek's flagship.

V3's distinctive position among 2024 to 2025 open-weights models: leading composite benchmark scores at release, MoE efficiency that kept inference economical, MIT licensing on the V3-0324 update onward, and a training cost figure that reframed the cost-of-frontier-capability narrative for the open-weights tier through 2025.

Outlook

Open questions for DeepSeek V3 over the next 6 to 18 months:

  • Migration to V4. The practical question for V3 deployments is migration timing. V4's higher benchmark scores and 1-million-token context window are material upgrades, but V3 remains substantially less expensive on per-token API pricing and runs on more conservative GPU configurations for self-hosting.
  • The training-cost narrative. The $5.6 million figure was a snapshot in time and excluded several real cost components (research, hardware capital, post-training). Whether subsequent models in the DeepSeek line continue to compress reported training costs at a similar rate remains a contested point in industry coverage.
  • US export-control trajectory. V3 was trained on Nvidia H800 GPUs sold into China before the late-2023 export-control tightening. V4 has shifted to Huawei Ascend silicon. The forward path for V3 successors and the broader Chinese frontier-AI ecosystem depends on the trajectory of US export controls through 2026 and 2027.
  • Commercial enterprise adoption. V3's combination of cost and capability has driven adoption in cost-sensitive deployments and developer-facing applications. The extent to which it penetrates regulated-sector enterprise accounts, where the Chinese-origin supply chain and data-handling questions apply, will continue to be shaped by policy decisions in the US, EU, and allied jurisdictions.
