Phi-4

Phi-4 is Microsoft AI's 14-billion-parameter open-weights language model, released in December 2024 as the fourth generation in Microsoft Research's Phi family of small language models. It is available on Hugging Face under the MIT license and on Azure AI Foundry as a managed endpoint, and is capable of instruction following, multi-turn dialogue, mathematical reasoning, and code generation. As of April 2026, Phi-4 remains the reference point for small-model reasoning quality in the open-weights category, and its family of derivatives -- Phi-4-mini, Phi-4-multimodal, and the Phi-4-reasoning series -- has extended the original model's reach into multimodal, edge, and chain-of-thought use cases.

At a glance

  • Lab: Microsoft AI
  • Released: December 2024 (Azure AI Foundry); January 2025 (Hugging Face MIT release)
  • Modality: Text
  • Open weights: Yes; MIT license. No usage restrictions.
  • Context window: 16,384 tokens (extended during midtraining from a 4K default)
  • Pricing: Free for self-hosting; per-token pricing on Azure AI Foundry and partner inference platforms
  • Distribution channels: Hugging Face microsoft/phi-4, Azure AI Foundry, GitHub Models, Ollama

Origins

The Phi lineage begins with a bet that data quality, not parameter count, is the binding constraint on small-model capability.

Phi-1 was published in June 2023 as a 1.3-billion-parameter code model trained on what the team called "textbook-quality" synthetic data and curated web content. The paper introducing it, "Textbooks Are All You Need," argued that a model trained on a small but carefully chosen corpus could match much larger models on narrow benchmarks. Phi-1 reached 50.6% pass@1 on HumanEval with a fraction of the parameters of comparable code models.

Phi-1.5 followed in September 2023 with the same 1.3-billion-parameter count but broader coverage of common-sense reasoning in natural language. Phi-2, announced at Microsoft Ignite in December 2023, scaled to 2.7 billion parameters and demonstrated that the synthetic-data thesis generalized beyond code: on reasoning and language benchmarks, Phi-2 outperformed models with up to five times more parameters.

Phi-3 arrived in April 2024, with the 3.8-billion-parameter Phi-3-mini launching first, followed by Phi-3-small (7B) and Phi-3-medium (14B) at Microsoft Build in May. Phi-3-mini was designed to run on a smartphone; Phi-3-medium, at 14 billion parameters, matched the size Phi-4 would later reuse for a substantially improved model.

The research effort behind the Phi family was led by Sébastien Bubeck, then Microsoft's vice president of generative AI research. Bubeck departed Microsoft for OpenAI in late 2024, shortly before Phi-4's release.

Phi-4 was announced on December 12, 2024, initially available through Azure AI Foundry under a Microsoft Research License Agreement. Microsoft released the weights on Hugging Face under the permissive MIT license in January 2025. Unlike the Phi-3 generation, which used primarily curated web and code data, Phi-4 treats synthetic data as a first-class training ingredient throughout pretraining and midtraining -- not just fine-tuning. The technical report describes a pipeline combining multi-agent prompting (using multiple language models to collaboratively generate and critique training examples), self-revision workflows (where a model iterates on its own outputs), and instruction reversal (deriving question-answer pairs by working backward from answers). Organic web content and code still make up the bulk of the token count, but the diversity and density of the synthetic content drive the model's performance on structured reasoning tasks.
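The synthetic-data techniques named above can be made concrete with a small sketch. The `llm()` stub below stands in for a real model call, and the prompt wording and helper names are illustrative assumptions, not Microsoft's actual pipeline; instruction reversal then looks roughly like:

```python
# Minimal sketch of "instruction reversal": derive a (question, answer)
# training pair by working backward from a known answer, e.g. an existing
# worked solution or code snippet.

def llm(prompt: str) -> str:
    # Stub: a real pipeline would query a language model here.
    return "Write a Python function that returns the nth Fibonacci number."

def reverse_instruction(answer: str) -> dict:
    """Work backward from an answer to the instruction it satisfies."""
    question = llm(
        "Here is a solution:\n" + answer +
        "\n\nWrite the instruction or question this solution answers."
    )
    # Downstream, pairs like this are typically filtered or critiqued by
    # other models before entering the training mix.
    return {"question": question, "answer": answer}

solution = (
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a"
)
pair = reverse_instruction(solution)
```

The same skeleton extends to the multi-agent and self-revision workflows: one model proposes, another critiques, and only pairs that survive the filter are kept.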

The Phi-4 family expanded through 2025. In February, Microsoft released Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B), the latter capable of processing text, image, and audio inputs simultaneously. In April and May 2025, Microsoft released Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning, a set of models fine-tuned and reinforcement-learned specifically for mathematical and scientific chain-of-thought reasoning.

Capabilities

Phi-4's primary strength is structured reasoning. On tasks requiring multi-step deduction, mathematical problem-solving, and scientific question answering, it consistently performs above the level one would predict from its parameter count. The model's training on dense synthetic data -- problems designed to require careful intermediate steps -- produces a distinctive profile: stronger on STEM reasoning than on factual recall, stronger on analytical tasks than on knowledge-intensive ones requiring broad world coverage.

Instruction following and multi-turn dialogue work well at typical task lengths within the 16K context window. Code generation is solid for common languages and patterns. The model is notably weaker on tasks that depend heavily on recency or long-tail world knowledge, which is a predictable tradeoff given how little of its training was drawn from broad web crawls compared to larger models.

The 16K context window is sufficient for most document-summarization and retrieval-augmented generation (RAG) use cases but is shorter than the long-context windows offered by several competitor models. Microsoft extended the context from 4K to 16K during a midtraining phase using adjusted rotary positional embedding (RoPE) parameters.
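The effect of adjusting RoPE parameters can be sketched numerically. The specific numbers below (a head dimension of 128 and a base increase from 10,000 to 250,000) are illustrative assumptions, not Phi-4's published values; the point is that a larger base lowers every rotary frequency, so positional angles grow more slowly and distant tokens remain distinguishable at longer contexts:

```python
import math

def rope_inv_freq(base: float, head_dim: int) -> list:
    # Inverse frequency for each rotary dimension pair i: base^(-2i/d).
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

short_ctx = rope_inv_freq(10_000.0, 128)   # illustrative pretraining base
long_ctx = rope_inv_freq(250_000.0, 128)   # larger base for midtraining

# Every frequency shrinks (or stays equal, for the first pair) when the
# base grows, stretching the positional encoding over more tokens.
assert all(l <= s for s, l in zip(short_ctx, long_ctx))

# Longest wavelength (in tokens) of the slowest-rotating pair:
wavelength = 2 * math.pi / long_ctx[-1]
```

In practice the base change is paired with continued training on long sequences so the model learns to use the stretched positions.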

Phi-4's relatively small size is a genuine practical advantage. At 14 billion parameters, the weights occupy roughly 28 GB at 16-bit precision, fitting on a single 48 GB-class workstation GPU, and common 8-bit and 4-bit quantizations bring the model within reach of a single 24 GB consumer GPU -- a footprint many frontier-class models cannot approach. Quantized variants distributed through llama.cpp run on Apple Silicon Macs and high-end consumer hardware. This deployment profile makes Phi-4 attractive for local inference, on-premises enterprise deployment, and resource-constrained environments.
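The hardware arithmetic behind this deployment profile is a one-liner. The sketch below counts weight memory only; real deployments add KV-cache and activation overhead, so treat these figures as lower bounds:

```python
# Back-of-envelope weight memory for a 14B-parameter model at common
# precisions (decimal gigabytes, weights only).

PARAMS = 14e9

def weight_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)  # ~28 GB: a 48 GB-class GPU, or two 24 GB cards
int8 = weight_gb(8)   # ~14 GB: a single 24 GB consumer GPU
q4 = weight_gb(4)     # ~7 GB: consumer GPUs and Apple Silicon
```

This is why 4-bit GGUF quantizations of Phi-4 are the common choice for local inference on consumer hardware.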

Benchmarks and standing

Phi-4's benchmark profile at launch in December 2024 was notable for its math and reasoning results relative to model size.

On MATH (competition mathematics), Phi-4 scores approximately 80%, exceeding GPT-4o -- the teacher model that generated much of its synthetic training data -- at the time of release and outperforming models significantly larger than 14B parameters. On GPQA Diamond (graduate-level scientific reasoning across biology, chemistry, and physics), Phi-4 scores approximately 56%, ahead of GPT-4o-mini and Llama 3.1 70B at the time of initial evaluation. These two benchmarks are the model's strongest showing relative to peers.

On MMLU (broad knowledge and reasoning), Phi-4 reaches approximately 84%, competitive with models two to three times larger. On HumanEval (Python code completion), Phi-4 reaches the high-70s percentage range, well ahead of similarly sized general-purpose models such as Llama 3.1 8B and Mistral 7B but below the leading code-specialized models.

The model's benchmark weaknesses are knowledge-retrieval tasks, long-context comprehension, and multilingual coverage -- areas where the large-crawl training regimes of models like Qwen 2.5 or Llama 3.1 give them an advantage through sheer data breadth.

The Phi-4-reasoning and Phi-4-reasoning-plus variants released in April 2025 significantly extend the family's reach on mathematical benchmarks. Phi-4-reasoning-plus outperforms DeepSeek-R1-Distill-Llama-70B on AIME 2025 (advanced competition mathematics) and approaches the performance of the full DeepSeek-R1 model at a fraction of the inference cost.

Benchmark figures reflect data available through April 2026. Scores shift as evaluation methodologies change and new models enter the comparison set.

Access and pricing

Phi-4 weights are freely available at microsoft/phi-4 on Hugging Face under the MIT license. There are no usage restrictions; the license permits commercial use, fine-tuning, and redistribution without royalty.

For self-hosting, standard inference stacks support Phi-4. Ollama packages the model for single-command local deployment. llama.cpp and GGUF-format quantizations are available from community repositories for deployment on Apple Silicon and consumer GPUs. vLLM and TGI (Text Generation Inference) support Phi-4 for production inference servers.

For hosted access, Azure AI Foundry offers Phi-4 as a managed endpoint with per-token pricing under Microsoft's standard AI services pricing. GitHub Models provides free-tier access for development and evaluation. OpenRouter aggregates multiple Phi-4 inference providers with unified billing.

The Phi-4 family members -- Phi-4-mini, Phi-4-multimodal, Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning -- are all available on Hugging Face under MIT licenses and on Azure AI Foundry. The multimodal and mini variants are additionally available on edge deployment platforms including Ollama and ONNX Runtime for on-device inference.

Comparison

Direct competitors to Phi-4 in the small open-weights text category, as of April 2026:

  • Phi-3 (Microsoft AI). Phi-4's direct predecessor. Phi-3-medium (14B) shares the same parameter count but trains with less synthetic data and achieves lower scores on math and reasoning benchmarks. Phi-4 is a clean upgrade for most use cases; Phi-3 retains value mainly in fine-tuned derivative ecosystems built around it before Phi-4's release.
  • Llama 4 Scout (Meta AI). Llama 4 Scout uses a mixture-of-experts architecture with 109B total parameters and 17B active, which puts it in a different size and memory category than Phi-4 despite the similar active-parameter count. Scout is stronger on factual knowledge, multilingual tasks, and long-context retrieval. Phi-4 is stronger on structured reasoning and math at lower hardware requirements.
  • Mistral 7B / Mistral Small (Mistral AI). Mistral 7B is the perennial compact baseline. Phi-4 consistently outperforms Mistral 7B on reasoning and math benchmarks by a substantial margin; Mistral 7B holds an advantage on throughput at equal hardware cost due to its smaller parameter count. Mistral Small (22B) is a closer competitor but at greater deployment cost.
  • Qwen 2.5 7B (Alibaba). Qwen 2.5 7B shows strong multilingual coverage and competitive coding performance, particularly for languages other than English. Phi-4 is stronger on mathematical reasoning; Qwen 2.5 is stronger on knowledge breadth and multilingual tasks. Qwen 2.5 14B is a closer size-matched comparison and is more competitive with Phi-4 on reasoning benchmarks than the 7B variant.

Phi-4's distinctive position is the combination of MIT license (no commercial restrictions, unlike the Llama Community License), focused reasoning quality, and the full family of derivative variants that extend it to multimodal and chain-of-thought use cases.

Outlook

Open questions for Phi-4 and the Phi family over the next 6 to 18 months:

  • Phi-5 timeline and design. Whether Microsoft Research continues the Phi naming convention and the synthetic-data thesis at the next scale step is an open question following Bubeck's departure. The Phi-4-reasoning variants suggest the team is extending the current generation rather than rushing to a new base model.
  • Edge deployment competition. Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B) compete directly with Qwen 2.5 3B, Gemma 3, and Llama 3.2 3B in the on-device tier. Which model becomes the default embedded in enterprise edge devices and mobile applications is not resolved.
  • The synthetic-data ceiling. Phi-4's training approach is explicitly a data-quality bet over a data-quantity bet. Whether that approach scales beyond 14B parameters -- or whether it hits diminishing returns -- will shape Microsoft Research's roadmap beyond Phi-4.
  • Reasoning model competition. The Phi-4-reasoning-plus model competes with DeepSeek-R1 distillation and other small reasoning models in a fast-moving category. Benchmark leadership here is measured in months.
  • Microsoft's closed-model pivot. The launch of the MAI frontier model series in early 2026 signals that Microsoft is investing in closed in-house models alongside the Phi open-weights line. How resources are allocated between the two tracks will affect the pace of Phi development going forward.

About the author
Nextomoro

nextomoro tracks progress for AI research labs, models, and what's next.
