DIAMOND

DIAMOND is a diffusion-based world model from the founding research team of General Intuition. It simulates interactive game environments by predicting pixel-level next frames conditioned on player actions, and was published as a NeurIPS 2024 spotlight.

DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a diffusion-based world model and reinforcement learning agent training framework developed by Eloi Alonso, Adam Jelley, and Vincent Micheli, the research team that subsequently co-founded General Intuition. The model trains a diffusion model to predict the next frame of a game environment conditioned on the previous frames and the agent's actions, then trains a reinforcement learning agent inside the resulting learned simulator. The DIAMOND paper, "Diffusion for World Modeling: Visual Details Matter in Atari," was published as a NeurIPS 2024 spotlight and reported a mean human-normalized score of 1.46 on the Atari 100k benchmark, a new state of the art at the time among agents trained entirely within a learned world model.

At a glance

  • Lab: General Intuition (research originated at the University of Geneva and the University of Edinburgh prior to the company's founding)
  • Released: Paper preprint posted to arXiv on May 20, 2024 (arXiv:2405.12399). NeurIPS 2024 spotlight presentation. Counter-Strike: Global Offensive playable demo released subsequently.
  • Modality: World model. Inputs include sequences of game frames and action specifications. Outputs are predicted next frames and agent policies trained inside the learned simulator.
  • Open weights: Yes, partial. The DIAMOND research code and trained agents are open source on GitHub at eloialonso/diamond. The General Intuition commercial successor models are closed.
  • Context window: Not applicable in the language model sense. The model handles sequences of game frames as visual conditioning, with sequence length determined by the architecture.
  • Pricing: Free for the open source research codebase. Commercial successor models from General Intuition are not publicly released.
  • Distribution channels: eloialonso/diamond on GitHub, diamond-wm.github.io, and arXiv:2405.12399.

Origins

DIAMOND was developed by Eloi Alonso (lead author), Adam Jelley, and Vincent Micheli, with co-authors including Anssi Kanervisto and François Fleuret. The research originated in the academic period preceding General Intuition's founding, with Alonso, Jelley, and Micheli later co-founding General Intuition in 2024 alongside Pim de Witte (founder of the Medal video-game-clip-sharing platform). The General Intuition seed round of $134 million, announced October 2025 and led by Khosla Ventures and General Catalyst with Raine participating, was characterized in industry coverage as one of the largest seed rounds in AI history at the time.

The DIAMOND research thesis was that the dominant world-model approach in published research had been to encode environments as compact discrete latent tokens, and that the latent-token compression discarded visual details that mattered for reinforcement learning agent performance. The DIAMOND alternative was to use a diffusion model directly, predicting full-resolution next-frame pixels rather than discrete latent codes. The hypothesis was that visual fidelity in the world model would translate into better agent policies trained inside the simulator.

The Atari 100k benchmark was the principal experimental setting. Atari 100k is a sample-efficient reinforcement learning benchmark in which agents are evaluated on Atari games after only 100,000 frames (about two hours of gameplay) of training data. The benchmark is designed to test data efficiency rather than asymptotic performance, and the "mean human-normalized score" metric expresses agent performance relative to expert human players. A score of 1.0 indicates expert human-level performance averaged across the games in the suite.
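The human-normalized score can be made concrete with a short sketch. The standard formula normalizes an agent's raw game score between a random policy (0.0) and a human expert (1.0); the per-game raw scores below are hypothetical illustrations, not DIAMOND's actual numbers:

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Human-normalized score as used by Atari 100k:
    0.0 = random-policy level, 1.0 = expert human level."""
    return (agent - random) / (human - random)

# Hypothetical per-game raw scores (agent, random policy, human expert).
games = [
    (1500.0, 200.0, 1200.0),   # agent above human -> per-game score > 1.0
    (450.0,  100.0, 800.0),    # agent below human -> per-game score < 1.0
]

scores = [human_normalized_score(a, r, h) for a, r, h in games]
mean_hns = sum(scores) / len(scores)
print(round(mean_hns, 2))  # → 0.9
```

The benchmark's headline metric is this mean taken across the 26-game suite; a single strong game can pull the mean above 1.0 even when some individual games stay below it.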

DIAMOND reported a mean human-normalized score of 1.46 on Atari 100k, a new state of the art among agents trained entirely within a learned world model. The result was published at NeurIPS 2024 as a spotlight presentation, one of the conference's higher-tier acceptance categories.

A subsequent demonstration extended the DIAMOND approach to Counter-Strike: Global Offensive, training the diffusion world model on 87 hours of CS:GO gameplay video and producing a fully interactive, playable neural game engine. The CS:GO demo runs at approximately 10 frames per second on a single Nvidia RTX 3090 GPU. The CS:GO release was structurally distinct from the Atari 100k research result: it demonstrated DIAMOND as a standalone neural game engine, with the diffusion world model providing the playable simulation directly to a human player rather than serving as a training environment for a reinforcement learning agent.

Capabilities

DIAMOND is built specifically for visual world modeling in interactive environments. Three capabilities distinguish it from peer world-model approaches.

The first is full-resolution diffusion-based next-frame prediction. Most published world models compress the environment into a compact latent representation before predicting transitions (discrete tokens in IRIS, categorical latents in DreamerV3, learned latent states in MuZero). DIAMOND instead uses a diffusion model to predict the next frame at full resolution, conditioned on the previous frames and the agent's action. The diffusion approach preserves visual details that compressed-latent approaches discard, and the DIAMOND paper argues that these details matter for downstream RL agent performance.
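The shape of such a denoising training objective can be caricatured in a few lines: corrupt the target frame with noise at a sampled level, then ask a network conditioned on past frames and the action to recover the clean frame. Everything below is an illustrative stand-in, not the paper's architecture: the flattened toy frame sizes, the log-normal noise schedule, and the single linear layer playing the role of the denoiser are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; real frames are full-resolution images).
FRAME = 16        # flattened frame size
CONTEXT = 4       # number of conditioning past frames
ACTIONS = 6       # discrete action vocabulary

# Stand-in "denoiser": one linear map from
# [noisy next frame | context frames | one-hot action | noise level] -> clean frame.
in_dim = FRAME + CONTEXT * FRAME + ACTIONS + 1
W = rng.normal(0.0, 0.01, size=(in_dim, FRAME))

def denoise(noisy, context, action_onehot, sigma):
    x = np.concatenate([noisy, context.ravel(), action_onehot, [sigma]])
    return x @ W

def training_loss(clean_next, context, action_onehot):
    """One denoising step: noise the target frame at a sampled level,
    predict the clean frame, score the prediction with MSE."""
    sigma = float(np.exp(rng.normal(-1.2, 1.2)))   # log-normal noise-level sampling
    noisy = clean_next + sigma * rng.normal(size=FRAME)
    pred = denoise(noisy, context, action_onehot, sigma)
    return float(np.mean((pred - clean_next) ** 2))

# Fake transition: context frames, an action, and the true next frame.
context = rng.normal(size=(CONTEXT, FRAME))
action = np.eye(ACTIONS)[2]
next_frame = rng.normal(size=FRAME)
loss = training_loss(next_frame, context, action)
print(loss >= 0.0)  # → True
```

The key structural point the sketch preserves is that the action and the recent frames enter as conditioning, so the same network learns action-dependent dynamics rather than unconditional video generation.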

The second is interactive playability. The DIAMOND CS:GO demonstration showed that the diffusion world model is sufficiently coherent to operate as an interactive game engine that a human player can play directly. The interactive capability is structurally distinct from world-model-as-training-environment use, because human-player interaction is far less forgiving of inconsistencies than RL-agent training inside the same simulator. The demo runs at 10 fps on consumer hardware (Nvidia RTX 3090), which is below the 30-to-60 fps standard for production game engines but high enough for playable interaction.
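The interactive loop implied by that demo has a simple structure: each tick, read the player's action, sample a new frame by iteratively denoising from noise conditioned on a sliding window of recent frames, and append the result to the window. The sketch below shows only that control flow; the toy sizes, the three-step noise schedule, and the placeholder `denoiser` standing in for a trained network are assumptions for illustration.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(1)
FRAME, CONTEXT, STEPS = 16, 4, 3   # toy sizes; STEPS = denoising steps per frame

def denoiser(x, context, action, sigma):
    # Placeholder for a trained diffusion denoiser network.
    return x * 0.5

def sample_next_frame(context, action):
    """Generate one frame: start from pure noise, run a few denoising
    steps conditioned on recent frames and the player's action."""
    x = rng.normal(size=FRAME)
    for sigma in np.linspace(1.0, 0.1, STEPS):
        x = denoiser(x, context, action, sigma)
    return x

frames = deque([np.zeros(FRAME)] * CONTEXT, maxlen=CONTEXT)
for action in [0, 1, 0]:                 # one player input per tick
    frame = sample_next_frame(np.stack(frames), action)
    frames.append(frame)                 # sliding context window

print(len(frames))  # → 4
```

The per-frame cost is roughly the denoiser cost times the number of denoising steps, which is why keeping the step count small matters for reaching interactive frame rates on a single GPU.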

The third is data efficiency. The CS:GO demonstration trained on 87 hours of gameplay footage, characterized in coverage as approximately 0.5 percent of the data required for comparable projects such as GameNGen. The data efficiency claim relates to the diffusion approach's ability to extract usable simulation dynamics from comparatively small video corpora.

The reported limitations of the CS:GO demo include the lack of explicit physics modeling (players can jump infinitely because the model does not enforce gravity or collision), and degradation when the human player deviates from commonly used paths in the training data. The deviation collapse is a characteristic failure mode of learned world models trained on a fixed gameplay distribution.

Pipeline and evaluation

DIAMOND's evaluation framework focuses on world-model and reinforcement learning benchmarks rather than horizontal language model leaderboards.

On Atari 100k, the principal sample-efficient reinforcement learning benchmark, DIAMOND reported a mean human-normalized score of 1.46 across the standard 26-game evaluation suite, a new state of the art among agents trained entirely within a world model. For comparison, the contemporaneous IRIS and DreamerV3 baselines reported lower mean scores on the same benchmark.

For the Counter-Strike: Global Offensive interactive demonstration, the principal evaluation metrics are interactive frame rate (10 fps on a single RTX 3090) and qualitative coherence of the playable simulation. The CS:GO demo is not a standard benchmark in the sample-efficient RL sense, but it demonstrates the world model's capacity to operate as a standalone neural game engine.

The standard horizontal language model benchmarks (Artificial Analysis Intelligence Index, LMArena, GPQA Diamond, AIME, SWE-bench Verified) do not apply to world modeling. The relevant benchmarks for the world-model category are Atari 100k, Atari 200k, ProcGen, and adjacent reinforcement learning sample-efficiency suites, plus qualitative interactive demonstrations on commercial-grade game environments.

Industry coverage of DIAMOND has consistently characterized the model as one of the principal published world-model results of 2024, alongside Google DeepMind's Genie line, the Decart Oasis interactive demo, and the broader spatial-AI research at Meta AI / FAIR (JEPA) and World Labs.

Access and availability

DIAMOND research code, trained agents, and playable world models are released open source through the GitHub repository at eloialonso/diamond under a permissive open-source license. Commercial deployment of the DIAMOND research code is not the principal distribution channel; organizations interested in commercial deployment work directly with General Intuition on the company's closed-source successor models.

The General Intuition commercial successor models are not publicly released as of May 2026. The company's strategic premise is that the Medal gameplay-video corpus (approximately two billion gaming videos per year contributed by approximately 10 million Medal monthly active users) supplies a uniquely scaled training resource for spatial-temporal reasoning models, with applications targeting gaming, search-and-rescue drones, and other embodied-AI domains. The DIAMOND research lineage anchors the technical credibility of the company's commercial model program.

The CS:GO playable demo and the underlying trained world model are available through the diamond-wm.github.io project page for researchers interested in reproducing the interactive demonstration.

Comparison

Direct competitors and adjacent world model systems:

  • Genie (Google DeepMind). Frontier-lab world model line. Genie 2 demonstrated playable game-environment generation at higher quality and resolution than DIAMOND CS:GO, with substantially greater compute investment.
  • Oasis (Decart). Israeli real-time interactive world-model startup. Oasis demonstrates Minecraft-style interactive simulation at production-grade frame rates, with commercial positioning around real-time interactive world models.
  • JEPA (Meta AI / FAIR). Joint Embedding Predictive Architecture research line under Yann LeCun. Different architectural approach (predictive embeddings rather than pixel-level diffusion) but overlapping research thesis on world models for embodied AI.
  • DreamerV3 (Google DeepMind). The principal latent-state world model in published research at the time of DIAMOND's release, and DIAMOND's central comparison point on the Atari 100k benchmark; DIAMOND reports superior sample efficiency.
  • IRIS (academic). Another latent-token world model. Adjacent published baseline; DIAMOND outperforms on Atari 100k.
  • GameNGen (Google DeepMind). Diffusion-based interactive Doom simulator from Google researchers. Direct architectural peer at the diffusion-game-engine framing; DIAMOND CS:GO claimed comparable quality from substantially less training data.
  • World Labs Marble (World Labs). The Fei-Fei Li-led spatial intelligence company's product. Commercial spatial-AI peer with different research framing (spatial-3D world generation rather than interactive game-environment simulation).

DIAMOND's distinctive position among 2024 vintage world model systems: full-resolution diffusion-based next-frame prediction, a state-of-the-art mean human-normalized score among agents trained entirely within a world model on Atari 100k, the playable CS:GO neural game engine demonstration on consumer hardware, and the open source research distribution that anchors the General Intuition founding team's research credibility.

Outlook

Open questions for DIAMOND and the General Intuition successor program over the next 6 to 18 months:

  • Commercial successor model release. General Intuition's seed round funded development of foundation-scale spatial-temporal reasoning models trained on the Medal gameplay corpus. Whether the successor models extend the DIAMOND approach to multi-game and cross-environment generalization, and the cadence of public capability disclosures, are watchable signals.
  • Embodied-AI transfer evidence. General Intuition has positioned drones (search-and-rescue) as a flagship commercial application. Whether spatial reasoning trained on gameplay video transfers cleanly to physical-world embodied tasks is the central open research and commercial question.
  • Open source DIAMOND extensions. The published DIAMOND code remains a research-community reference implementation. Whether subsequent open releases from the General Intuition team (or from academic reproduction) extend the architecture to additional environments and to longer training horizons will affect the community research surface.
  • Competitive dynamics with Genie and Decart. Frontier-lab world models from Google DeepMind (Genie, Genie 2, and successor lines) and commercial peers including Decart's Oasis operate at greater compute scale or with different commercial positioning. DIAMOND-derived successor models will be evaluated against these references.
  • Frame rate and resolution scaling. The CS:GO demo runs at 10 fps. Whether the diffusion-world-model approach scales to production-grade frame rates and resolutions is the principal technical scaling question for interactive applications.

About the author
Nextomoro

nextomoro tracks progress for AI research labs, models, and what's next.

AI Research Lab Intelligence
