Hermes 4

Hermes 4 is an open-weights large language model family released by Nous Research in August 2025, built on the Meta Llama 3.1 base in 70-billion-parameter and 405-billion-parameter variants. It introduces a hybrid reasoning mode in which the model emits explicit <think>...</think> segments when it decides to deliberate, supporting both fast direct generation and deeper step-by-step inference at the user's option. As of late April 2026, Hermes 4 sits among the leading openly-licensed reasoning models: it leads on RefusalBench, Nous Research's own measure of unconstrained answering, and competes with closed-source frontier models on mathematics and reasoning benchmarks.

At a glance

  • Lab: Nous Research
  • Released: August 2025 (Hermes 4 Technical Report, arXiv 2508.18255)
  • Modality: Text
  • Open weights: Yes. Distributed under the Llama 3.1 Community License Agreement, which permits broad commercial and non-commercial use including derivative works. Weights and configurations are available on Hugging Face under the NousResearch organization.
  • Context window: 128,000 tokens (inherits the Llama 3.1 base context length)
  • Pricing: Free for self-hosting. Hosted inference available at multiple price points: OpenRouter routes Hermes 4 405B at approximately $0.30 per million input tokens and $1.20 per million output tokens; the Nous Portal offers direct API access; community-hosted endpoints are available across Together AI, Fireworks AI, and other inference providers.
  • Distribution channels: Nous Portal, Hugging Face NousResearch organization, OpenRouter, Together AI, Fireworks AI, and self-hosted via vLLM, TGI, or llama.cpp.

Origins

Hermes 4 sits in a six-release lineage that traces back to Nous Research's first open-source fine-tunes in 2023. The Nous-Capybara 7B fine-tune in August 2023 and the Hermes 7B fine-tune in October 2023 established the brand as a leading source of high-quality openly-licensed instruction-tuned models built on third-party base weights. Hermes 1 through 3 progressively scaled the recipe across Llama 2 (7B, 13B, 70B) and Llama 3 base models, with each generation refining post-training data quality, instruction-following depth, and reasoning fidelity.

The Hermes 4 release in August 2025 was Nous Research's most ambitious post-training run to date. The post-training corpus expanded from approximately 1 million samples and 1.2 billion tokens used in Hermes 3 to roughly 5 million samples and 60 billion tokens for Hermes 4, blended across reasoning and non-reasoning data. The reasoning corpus emphasized verified-trace data: synthetic chain-of-thought generations checked for mathematical correctness, code-execution validity, and logical consistency before inclusion in the training mix.

The technical report (Teknium et al., arXiv 2508.18255) documented the post-training pipeline in detail. A base supervised fine-tuning stage on the expanded corpus established the core capability profile. A hybrid reasoning extension then taught the model to distinguish thinking from response generation via explicit segment tags. A final preference optimization stage tuned for instruction-following and minimal response refusal completed the run. The stated design goal was a model that "does what you tell it" without the heavy refusal scaffolding present in the major closed-frontier APIs.

The release was distributed quietly relative to peer launches: no embargoed press cycle, no partner integration pre-announcements. Hermes 4 weights were uploaded to Hugging Face on a Monday in August 2025, with technical-report and benchmark numbers following over the next 48 hours. Industry coverage characterized this distribution pattern as deliberate, mirroring Nous Research's positioning as an open-source-first organization rather than a press-cycle-first one.

Capabilities

Hermes 4 is a hybrid reasoning model in the architecture sense first popularized by OpenAI's o-series and subsequently adopted in closed-source form by Anthropic and Google DeepMind. The model's training enables it to selectively engage internal reasoning before producing a final answer, with the reasoning emitted in explicit <think> tags that the application surface can choose to display, hide, or route separately.

Hybrid reasoning is exposed through two modes. Fast mode generates a final response directly without an internal reasoning phase, suitable for low-latency conversational applications. Reasoning mode generates explicit <think>...</think> blocks containing intermediate reasoning steps before producing the final response, suitable for math, coding, multi-step inference, and verification-heavy workloads. The user toggles between modes through the system prompt or by including specific control tokens.
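The reasoning-mode output format described above can be handled with a small amount of application code. The sketch below, which assumes the model emits at most one <think>...</think> block before the final answer (the tag convention named in this article), separates the reasoning segment from the user-facing response so an application can display, hide, or route it:

```python
import re

# Pattern for a Hermes 4-style reasoning segment. Assumes at most one
# <think>...</think> block precedes the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) from a raw completion.

    In fast mode there is no reasoning block, so the first element
    comes back empty.
    """
    match = THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()  # fast mode: no deliberation emitted
    reasoning = match.group(1).strip()
    answer = THINK_RE.sub("", raw, count=1).strip()
    return reasoning, answer

completion = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(completion)
print(answer)  # -> The answer is 4.
```

An application surface might log the reasoning segment for auditing while showing only the final answer to end users.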

The 405-billion-parameter variant inherits the Llama 3.1 transformer architecture: a decoder-only design with grouped-query attention and standard rotary positional embeddings. The 70-billion-parameter variant trades capability for inference cost and runs comfortably on a single 8-GPU server in FP8 quantization. An FP8-quantized variant of the 405B is also distributed, enabling single-server inference for organizations with appropriate hardware.

A distinguishing feature in commercial coverage has been the model's low refusal rate. Nous Research published RefusalBench, a custom benchmark designed to measure how often a language model declines to answer questions that fall outside its hard safety boundaries. Hermes 4 405B scored 57.1 percent on RefusalBench in reasoning mode, substantially higher than GPT-4o (17.7 percent) and Claude Sonnet 4 (17 percent) on the same benchmark. The framing in the Hermes 4 launch documentation positions this as a feature: a model that follows instructions without refusing to engage on edge-case prompts.

Multi-turn dialogue, document analysis, code generation, and mathematical reasoning all sit within the standard capability profile inherited from the Llama 3.1 base and refined by Hermes 4 post-training. Strict format faithfulness (JSON output, structured tags, schema-constrained generation) was an explicit Hermes 4 training target and is one of the model's stronger performance areas.
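Format faithfulness of the kind described above is typically enforced on the consuming side as well. A minimal validation sketch, in which the expected keys are illustrative rather than taken from the Hermes 4 release documentation:

```python
import json

def validate_output(raw: str, required_keys: set[str]) -> dict:
    """Parse model output as JSON and check required top-level keys.

    Raises json.JSONDecodeError on malformed JSON and ValueError on
    missing keys, so callers can retry or fall back.
    """
    data = json.loads(raw)
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical schema-constrained response from the model:
sample = '{"answer": "4", "confidence": 0.97}'
print(validate_output(sample, {"answer", "confidence"}))
```

A strict-format training target reduces, but does not eliminate, the need for this kind of check in production pipelines.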

Benchmarks and standing

Hermes 4 405B sits in the upper tier of openly-licensed reasoning models on mathematics and reasoning benchmarks. On MATH-500, Hermes 4 405B scored 96.3 percent in reasoning mode, near saturation of the benchmark and competitive with the closed-source frontier. On AIME 2024 (American Invitational Mathematics Examination, the qualifying test for the USA Math Olympiad), Hermes 4 405B scored 81.9 percent in reasoning mode, placing it among the leading openly-licensed reasoning models on this difficult mathematics benchmark.

On the Artificial Analysis Intelligence Index, Hermes 4 405B sits in the mid-30s on the composite score, behind the frontier closed-source models (GPT-5.5 at 60.24, Claude Opus 4.7 at 57.28, Gemini 3.1 Pro at 57.18) but in the same range as comparable open-weights peers. On HumanEval+ (function-completion coding), Hermes 4 405B scored in the mid-80s, competitive with peers in the open-weights tier.

The RefusalBench result described above (57.1 percent in reasoning mode) is the benchmark Hermes 4 most distinctively leads on, and is constructed by Nous Research. Independent validation of the RefusalBench scoring methodology has been limited; coverage of this benchmark has emphasized the directional signal (Hermes 4 refuses fewer requests than peers) rather than the precise numeric ranking.

Benchmark positions are point-in-time and shift on the scale of weeks given the release cadence in 2026. The frontier-tier closed-source models lead Hermes 4 on most benchmarks. Hermes 4's competitive position is strongest in the openly-licensed tier, on mathematics and reasoning, and on instruction-following metrics where minimal refusal is a desired property.

Access and pricing

Hermes 4 weights are distributed at the NousResearch organization on Hugging Face. Both the 70B and 405B variants are available, along with FP8-quantized versions of the 405B for single-server inference. The Llama 3.1 Community License governs use; commercial deployment is permitted without a separate Meta agreement at any scale below the license's 700-million-monthly-active-users threshold.

Hosted inference is available through multiple channels. The Nous Portal provides direct API access at Nous Research's own pricing. OpenRouter routes Hermes 4 across multiple inference providers at approximately $0.30 per million input tokens and $1.20 per million output tokens for the 405B variant. Together AI, Fireworks AI, and several other community inference providers also host Hermes 4 endpoints at varying price points.
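Using the OpenRouter-style rates quoted above for the 405B variant (approximately $0.30 per million input tokens and $1.20 per million output tokens), a back-of-envelope cost sketch:

```python
# Approximate OpenRouter rates for Hermes 4 405B, as quoted in this
# article; actual provider pricing varies.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 1.20 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A reasoning-mode request with a long <think> trace is output-heavy:
cost = request_cost(input_tokens=2_000, output_tokens=8_000)
print(f"${cost:.4f}")  # -> prints $0.0102
```

Because reasoning mode inflates output token counts, the output rate dominates cost for deliberation-heavy workloads.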

For self-hosted deployment, Hermes 4 405B in FP8 fits on a single 8-GPU H100 or H200 server using vLLM or TGI inference frameworks. Hermes 4 70B fits comfortably on smaller multi-GPU configurations. The release page includes recommended inference configurations and prompt formats for both reasoning and fast modes.
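A deployment sketch for the single-server FP8 configuration described above, using vLLM's OpenAI-compatible server. The model ID and flag values here are an assumption based on typical vLLM usage, not taken from the Hermes 4 release page; consult that page for the recommended configuration:

```shell
# Hypothetical serve command for the FP8 405B checkpoint on an
# 8-GPU H100/H200 node (model ID and flags are illustrative).
vllm serve NousResearch/Hermes-4-405B-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 128000
```

The served endpoint then accepts OpenAI-compatible chat-completion requests, with reasoning mode toggled through the system prompt as described earlier.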

Comparison

Direct competitors to Hermes 4 405B in the openly-licensed reasoning-model tier as of April 2026:

  • Llama 4 (Meta AI). The closest base-model peer. Llama 4 Maverick is Meta's flagship openly-licensed reasoning model with 109 billion parameters in MoE configuration. Llama 4 leads Hermes 4 on the Artificial Analysis Intelligence Index composite score, but Hermes 4 leads on RefusalBench and on direct instruction-following metrics where minimal refusal is a desired property.
  • DeepSeek V4 (DeepSeek). The current open-weights frontier leader on the Intelligence Index composite. DeepSeek V4 Pro at 1.6 trillion total parameters with 49 billion active sits at rank 8 on the Index. Hermes 4 405B trails on the composite but matches or exceeds DeepSeek V4 on some specific reasoning benchmarks where Nous Research's post-training corpus emphasized verified-trace data.
  • Qwen 3 (Alibaba). Alibaba's openly-licensed flagship at the 235-billion-parameter scale. Strong multilingual capability across Asian languages where Hermes 4 trails (Hermes 4 inherits Llama 3.1's English-leaning training distribution). On English-language reasoning benchmarks, Hermes 4 405B and Qwen 3 are broadly comparable.
  • gpt-oss (OpenAI). OpenAI's openly-licensed reasoning release in the 120-billion-parameter range. gpt-oss targets a smaller deployment footprint than Hermes 4 405B and is positioned by OpenAI for cost-sensitive open-source deployments. Hermes 4 405B leads on the benchmark composite at the cost of larger inference footprint.

Hermes 4's distinctive position in this competitive set is its low-refusal, instruction-faithful posture for organizations that want a frontier-capable openly-licensed model without the heavy guardrails present in major closed-frontier APIs. Nous Research's brand as an open-source-first organization differentiates the model commercially from peer Llama-derived releases.

Outlook

Open questions for Hermes 4 and the Nous Research model lineage over the next 6 to 18 months:

  • Hermes 5 timeline and base model. Whether Hermes 5 continues building on Llama 3.1 derivatives, jumps to Llama 4 base weights, or adopts a different open-weight base entirely. Llama 4 has been available openly since early 2026; a Hermes 5 release on the Llama 4 base could close the Intelligence Index composite gap relative to DeepSeek V4 and other Chinese open-weight peers.
  • Native pretraining versus continued fine-tuning. Nous Research's $50 million Series A in April 2025 funded a hardware buildout that could in principle support native model pretraining rather than fine-tuning third-party bases. Whether the next Hermes generation continues the fine-tuning approach or shifts to native pretraining is a watchable signal about the company's research direction.
  • The DisTrO decentralized training network. Nous Research's Solana-based DisTrO network is an architectural bet on decentralized training across heterogeneous compute. Whether DisTrO produces a Hermes-line model trained substantially on the network rather than centralized hardware is one of the most-watched research questions in the open-source AI ecosystem.
  • The Forge agentic platform integration. Nous Research ships Forge, an agentic-AI development platform, alongside its model releases. Whether Forge integrates Hermes 4 reasoning mode as a default backend, and whether agentic capability becomes a Hermes 5 design target, is open.
  • Refusal-rate trajectory across the open-source ecosystem. Hermes 4 is positioned at the high-instruction-faithfulness end of the open-weights spectrum. Whether peer open-weight labs (Meta, DeepSeek, Alibaba, Mistral) move toward similar refusal postures, hold the line, or move in the opposite direction will shape the competitive distinction Hermes 4 currently holds on RefusalBench.

About the author
Nextomoro

AI Research Lab Intelligence

nextomoro tracks progress for AI research labs, models, and what's next.