DIAMOND

Diffusion as a Model of Environment Dreams

Alonso et al., NeurIPS 2024
Presented by Manish Mathai

May 1, 2026

World Modeling - Counter-Strike: Global Offensive

CS:GO - game engine
CS:GO - world model

Look ma, No Game Engine!

Today

  1. What is a world model? The 2018 idea.
  2. What DIAMOND changed. Pixel-space diffusion, three denoising steps.
  3. Where it wins. Atari 100k headline, visual detail, CS:GO scaling.
  4. Where it breaks. Live failure modes on CS:GO.

What is a world model?

A world model is a learned, action-conditioned simulator of an environment.

\[x_{t+1} \;\sim\; M_\theta(x_t,\, a_t)\]

  • \(M_\theta\): the world model with learned parameters \(\theta\).
  • \(x_t\): the current observation (a frame, a sensor reading).
  • \(a_t\): the action (the system’s control input).

Choose an action \(a\) and evaluate \(M_\theta(x_t, a)\). Sample one next frame, then repeat. The resulting sequence is the predicted trajectory, aka the rollout.
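The sample-then-repeat loop above can be sketched in a few lines. Everything here is a toy stand-in, not any particular world model's API: `model` plays the role of \(M_\theta\), `policy` picks actions, and observations are scalars just so the loop runs.

```python
import numpy as np

def rollout(model, x0, policy, horizon, rng):
    """Sample a predicted trajectory: repeatedly draw x_{t+1} ~ M_theta(x_t, a_t).

    `model` and `policy` are placeholders for whatever world model and
    controller are in use.
    """
    xs, x = [x0], x0
    for _ in range(horizon):
        a = policy(x)
        x = model(x, a, rng)   # sample one next observation
        xs.append(x)           # the sequence is the rollout
    return xs

# Toy stand-ins: observations are scalars, the "model" adds the action plus noise.
rng = np.random.default_rng(0)
toy_model = lambda x, a, rng: x + a + 0.01 * rng.standard_normal()
traj = rollout(toy_model, 0.0, lambda x: 1.0, horizon=5, rng=rng)
```

The stochasticity matters: \(M_\theta\) defines a distribution over next frames, so two rollouts from the same start can diverge.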

World Models, 2018: V, M, C

“Recurrent World Models Facilitate Policy Evolution” (Ha & Schmidhuber, NeurIPS 2018)

VizDoom TakeCover: agent trained entirely inside the dream and deployed to the real game.

Model        Job                  Params
V (VAE)      frame to latent      4.4M
M (MDN-RNN)  predict next latent  1.7M
C (linear)   pick next action     1,088

Most of the model is perception and memory. The policy on top is tiny.

Source: worldmodels.github.io

Training V, M, C

Three steps, each independent:

  1. Collect. 10,000 rollouts from a random policy in CarRacing.
  2. Train V and M. V (VAE) fits on frames only, no rewards. M (MDN-RNN) predicts the next latent given current latent, action, and hidden state. Output is a mixture distribution, not a point estimate.
  3. Evolve C. CMA-ES, no backprop. The policy is small enough for evolutionary search.

Representation first, policy second.

V: frames in, reconstructions out.

C: linear map from \((z, h)\).

Learning inside a dream: VizDoom

Can agents learn inside of their own dreams?

  • Task: survive fireballs in VizDoom TakeCover
  • Same V/M/C structure as CarRacing
  • M predicts the next latent and death:

\[P(z_{t+1}, d_{t+1} \mid a_t, z_t, h_t)\]

  • C trains inside the dream environment
  • Reward: survival time, up to 2100 steps
  • The learned policy is deployed back to real VizDoom

Real VizDoom frame and VAE reconstruction.

Agent training inside the DoomRNN dream environment.

Dreams are only faithful while memory holds.

In a dream, the model can only remember what its memory carries forward.

  • CarRacing is forgiving. The road stays in view.
  • VizDoom is harder. M must track monsters, fireballs, and death across many steps.
  • Object permanence is the stress test. Turn around, hide something, revisit a place: the model has to remember.

The catch: if training did not demand remembering, the model probably will not.

DIAMOND: the paper we are here to discuss

DIffusion As a Model Of eNvironment Dreams

Alonso, Jelley, Micheli, Kanervisto, Storkey, Pearce, Fleuret, NeurIPS 2024 Spotlight


DIAMOND replaces latent dynamics with a conditional diffusion world model over frames.

  • World model: \(D_\theta\) predicts the next frame from recent frames and actions
  • Memory: no RNN inside \(D_\theta\), just frame stacking
  • Training loop: collect real data, fit \(D_\theta\), train the agent inside the generated environment
  • Why it matters: image-space rollouts can preserve visual details that discrete latents may lose

Diffusion for dynamics modeling, not image generation

A diffusion model for image generation learns:

\[\text{noise} \;\longmapsto\; \text{a plausible image}\]

DIAMOND’s world model learns:

\[(x_{\le t},\; a_{\le t},\; \epsilon) \;\longmapsto\; x_{t+1}\]


  • \(x_{\le t}\): recent frames. What the agent has seen so far.
  • \(a_{\le t}\): recent actions. What the agent did.
  • \(\epsilon\): noise. Diffusion starts from noise and denoises toward the next frame.
  • \(x_{t+1}\): next frame. This becomes part of the history for the next step.

Two clocks inside one rollout

  • Environment time \(t\) runs forward. The game advances one generated frame at a time.
  • Denoising time \(\tau\) runs backward. Each next frame is refined from noise to a clean observation.
  • History is the condition. \(D_\theta\) sees recent frames and actions at every denoising step.
  • Rollout is autoregressive. The generated frame becomes part of the history for the next environment step.
  • Why this is hard. A small visual error can feed back into the next condition and compound over time.

Paper Figure 1. The vertical sweep is the cost paid for each generated game step.
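The two nested clocks can be made concrete with a sketch. All names here are placeholders under the slide's assumptions, not the paper's code: `denoiser` stands for \(D_\theta\), `sample_step` for one step of the sampler.

```python
import numpy as np

def dream_rollout(denoiser, sample_step, policy, history, horizon, n_denoise=3):
    """Two clocks in one rollout. The outer loop advances environment time t;
    the inner loop runs denoising time tau backward, from noise to a frame.

    `denoiser`, `sample_step`, and `policy` are placeholders.
    """
    frames, actions = list(history), []
    for t in range(horizon):                       # environment time: forward
        a = policy(frames[-1])
        x = np.random.default_rng(t).standard_normal(np.shape(frames[-1]))  # start from noise
        for tau in reversed(range(n_denoise)):     # denoising time: backward
            x = sample_step(denoiser, x, tau, frames, actions + [a])
        frames.append(x)       # autoregressive: the new frame joins the history
        actions.append(a)
    return frames

# Toy stand-ins so the two loops run end to end.
toy_denoiser = lambda x, tau, frames, acts: frames[-1]          # "predict last frame"
toy_step = lambda den, x, tau, frames, acts: 0.5 * (x + den(x, tau, frames, acts))
out = dream_rollout(toy_denoiser, toy_step, lambda f: 0, [np.zeros(4)], horizon=3)
```

The feedback loop is visible in the last two lines of the outer loop: whatever the inner loop produces, flaws included, becomes part of the next condition.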

Architecture: U-Net with frame and action conditioning

2D U-Net: image-to-image with skip connections.

The U-Net denoises the next frame. Three conditioning paths feed it:

  • Past frames as channels. The last \(L\) frames concatenated with the noisy next frame.
  • Actions through AdaGN. A small MLP maps recent actions to per-block scale and bias for group-norm layers.
  • Noise level \(\tau\) through AdaGN. Same path, with \(\tau\) telling the denoiser how much noise remains.
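The AdaGN path above can be sketched as follows. This is a minimal NumPy rendering of the idea, not the paper's implementation: a conditioning vector (here standing in for the action-and-noise-level embedding) is mapped to a per-channel scale and bias applied after group normalization.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Plain group norm over a (C, H, W) feature map."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    g = (g - g.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(
        g.var(axis=(1, 2, 3), keepdims=True) + eps)
    return g.reshape(c, h, w)

def adagn(x, cond, W, b, groups=4):
    """Adaptive group norm: a small linear map (standing in for the MLP)
    turns the conditioning vector into per-channel scale and bias."""
    scale, bias = np.split(W @ cond + b, 2)
    y = group_norm(x, groups)
    return (1 + scale)[:, None, None] * y + bias[:, None, None]

rng = np.random.default_rng(0)
C, COND = 8, 6
x = rng.standard_normal((C, 16, 16))
W, b = 0.1 * rng.standard_normal((2 * C, COND)), np.zeros(2 * C)
y = adagn(x, rng.standard_normal(COND), W, b)
```

This is why the conditioning is cheap: actions and \(\tau\) never enter as extra spatial inputs, only as a handful of scale/bias values per block.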

What is \(D_\theta\) trained to do?

For each replay segment:

  1. Take the real next frame \(x_{t+1}^{[0]}\)
  2. Sample a noise level \(\sigma(\tau)\)
  3. Add Gaussian noise: \(x_{t+1}^{[\tau]} \sim \mathcal{N}(x_{t+1}^{[0]}, \sigma(\tau)^2 I)\)
  4. Condition on recent frames and actions
  5. Train \(D_\theta\) to recover the clean next frame

\[\hat{x}_{t+1}^{[0]} = D_\theta(x_{t+1}^{[\tau]}, \tau, x_{t-L+1:t}^{[0]}, a_{t-L+1:t})\]

\[\mathcal{L}(\theta)=\|\hat{x}_{t+1}^{[0]} - x_{t+1}^{[0]}\|_2^2\]

Inside \(D_\theta\): no RNN, no latent state, no cross-attention. Memory is the frame stack.
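The five training steps and the L2 loss above fit in one function. A minimal sketch under stated assumptions: `D` is a placeholder for the conditioned denoiser \(D_\theta\), and the log-normal noise schedule constants are illustrative, not the paper's values.

```python
import numpy as np

def sample_sigma(rng, p_mean=-0.4, p_std=1.2):
    """Draw a noise level sigma(tau); log-normal schedule, constants illustrative."""
    return float(np.exp(p_mean + p_std * rng.standard_normal()))

def training_step(D, batch, rng):
    """One step of the objective above: noise the real next frame, condition
    on history, regress back to the clean frame."""
    frames, actions, x_next = batch                 # steps 1 and 4: data + conditioning
    sigma = sample_sigma(rng)                       # step 2: sample a noise level
    x_noisy = x_next + sigma * rng.standard_normal(x_next.shape)  # step 3: add noise
    x_hat = D(x_noisy, sigma, frames, actions)      # step 5: predict the clean frame
    return float(np.mean((x_hat - x_next) ** 2))    # L(theta), the L2 loss

rng = np.random.default_rng(0)
batch = (np.zeros((4, 8, 8)), np.zeros(4), np.zeros((8, 8)))
oracle = lambda xn, s, f, a: np.zeros_like(xn)   # perfect denoiser for this toy batch
copycat = lambda xn, s, f, a: xn                 # just echoes the noisy input
```

Note the target is the clean frame itself, not the noise: that choice is exactly where the EDM-vs-DDPM discussion on the next slide picks up.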

EDM gives stable few-step sampling

Stable Diffusion uses dozens of denoising steps. IRIS uses 16. DIAMOND uses 3.

Why EDM, not DDPM?

DDPM trains the network to predict the added noise. At high noise that target degenerates: the input is almost pure noise, so the network can nearly copy its input and still score well, giving a poor estimate at exactly the step where sampling begins. With few steps, those early errors compound.

EDM (Karras 2022) interpolates the target with noise level: clean frame at high noise, residual correction at low noise.

Same U-Net, same data, different parameterization.

DDPM: drift worsens as \(n\) (denoising steps per frame) shrinks.
EDM: stable across the same \(n\) range, even out to \(t = 1000\) environment steps. Paper Figure 3.
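The interpolation EDM uses can be written down directly. These are the preconditioning weights from Karras et al. 2022, \(D(x;\sigma) = c_{\text{skip}}(\sigma)\,x + c_{\text{out}}(\sigma)\,F_\theta(\cdot)\); the `SIGMA_DATA` constant is the EDM default, which may differ from DIAMOND's setting.

```python
import numpy as np

SIGMA_DATA = 0.5   # data standard deviation; EDM's default constant

def c_skip(sigma):
    """Weight on the noisy input inside D = c_skip * x + c_out * F(...)."""
    return SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)

def c_out(sigma):
    """Weight on the network output F_theta."""
    return sigma * SIGMA_DATA / np.sqrt(sigma**2 + SIGMA_DATA**2)

# High noise: c_skip -> 0, the network must predict the clean frame itself.
# Low noise:  c_skip -> 1, the network only supplies a small residual correction.
high, low = c_skip(80.0), c_skip(1e-3)
```

This is the "clean frame at high noise, residual correction at low noise" behavior in formula form: the network's effective target slides with \(\sigma\), with no change to the U-Net or the data.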

Why not one denoising step?

Paper Figure 4. Single-step (top row) vs multi-step (bottom row) denoising in Boxing.

Top row: \(n = 1\). Single-step denoising averages over possible futures for the unpredictable black player. Blurry interpolation.

Bottom row: \(n = 3\). DIAMOND’s default. The sampler drives toward one mode of the posterior. Crisp frame.

Why the white player stays crisp. The policy controls the white player, so its action is known to \(D_\theta\).

Training the agent inside the dream

DIAMOND’s outer loop:

  1. Collect. The current policy acts in the real environment. Record frames, actions, rewards, terminations.
  2. Fit \(D_\theta\) on everything collected so far. EDM loss, 3 denoising steps per frame.
  3. Fit \(R_\psi\), a small reward and termination model from the same frame history.
  4. Dream. Roll out imagined trajectories inside \((D_\theta, R_\psi)\).
  5. Train the policy on the dream. A CNN-LSTM actor-critic is updated by policy gradient.
  6. Repeat from step 1 with the updated policy.

Only step 1 touches reality. Everything else lives in the dream.
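The six-step outer loop reads naturally as code. Every name below is a stub, not the paper's API; the point is only the control flow, with `collect` as the single contact with reality.

```python
def train_diamond(n_epochs):
    """DIAMOND's outer loop with every stage stubbed out.
    Only `collect` touches the real environment."""
    data, policy = [], init_policy()
    for _ in range(n_epochs):
        data += collect(policy)                 # 1. act in the real environment
        D = fit_world_model(data)               # 2. EDM loss, 3 denoising steps
        R = fit_reward_model(data)              # 3. reward + termination model
        dreams = [imagine(D, R, policy) for _ in range(4)]   # 4. dream
        policy = policy_gradient(policy, dreams)             # 5. update actor-critic
    return policy                               # 6. loop back with the new policy

# Trivial stand-ins so the loop runs.
init_policy = lambda: 0
collect = lambda p: [p]
fit_world_model = lambda d: len(d)
fit_reward_model = lambda d: len(d)
imagine = lambda D, R, p: D + R
policy_gradient = lambda p, dreams: p + 1
final = train_diamond(3)
```

One structural point the sketch makes visible: the world model is refit on the growing dataset every epoch, so the dream improves alongside the policy that explores it.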

Atari 100k: the headline number

Mean human-normalized score. Agents are trained entirely inside their own world model.

Method            Mean HNS
SimPLe (2019)     0.332
TWM (2023)        0.956
IRIS (2023)       1.046
DreamerV3 (2023)  1.097
STORM (2023)      1.266
DIAMOND (2024)    1.459

Paper Table 1. Human score \(=1.0\), random \(=0.0\). 5 seeds per game, 26 games.

Paper Figure 2. Stratified bootstrap confidence intervals for mean and IQM (interquartile mean: middle 50% of scores) on Atari 100k. DIAMOND in blue.
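The two metrics on this slide are simple to define precisely. A sketch with toy numbers (not scores from the paper):

```python
import numpy as np

def hns(score, random_score, human_score):
    """Human-normalized score: random play -> 0.0, human play -> 1.0."""
    return (score - random_score) / (human_score - random_score)

def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of sorted scores,
    more outlier-robust than the plain mean."""
    xs = np.sort(np.asarray(scores, dtype=float))
    k = len(xs) // 4
    return float(xs[k:len(xs) - k].mean())

# Toy numbers, not from the paper: halfway between random and human is HNS 0.5.
example_hns = hns(score=150.0, random_score=50.0, human_score=250.0)
```

IQM is reported alongside the mean because a single runaway game (HNS in the tens) can dominate a 26-game mean; trimming the top and bottom quartiles removes that leverage.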

Where DIAMOND wins

  • Superhuman on 11 of 26 games, new best among agents trained entirely in a world model
  • 100k interactions per game \(\approx\) 2 hours of human play; unrestricted agents often get 500x that budget
  • The gains are not uniform across the benchmark.
  • Stronger where small visual details matter: Asterix enemies vs rewards, Breakout brick layout, RoadRunner reward dots
  • Weaker on some games such as Frostbite and Hero

The paper’s bet: modeling frames directly in pixel space preserves details that tokenizers can erase.

Visual details: DIAMOND vs IRIS

Every other top Atari 100k world model builds a discrete bottleneck first:

Method            Internal representation
IRIS (2023)       VQ-VAE image tokens
DreamerV3 (2023)  categorical latents
STORM (2023)      categorical tokens
DIAMOND (2024)    raw pixel frames

The bet. Image-space diffusion preserves small visual details that discrete bottlenecks may round off.

The paper’s evidence. IRIS rollouts flicker on tiny details; DIAMOND rollouts stay more consistent.

IRIS: enemies turn into rewards, bricks flicker, score digits wobble.
DIAMOND: same games, fewer flagged inconsistencies. Paper Figure 5.

Why frame stacking matters

Frame stacking is not just the simple option. In their tests, it won.

  • Paper’s limitation: frame stacking is a minimal memory mechanism
  • Why it is cheap: history is fed as pixels, so \(D_\theta\) stays close to a 2D U-Net
  • Appendix M comparison: frame stacking beat cross-attention in 3D rollouts
  • CS:GO FVD: 34.8 for frame stack vs 81.4 for cross-attention
  • Sample rate: 7.4 Hz for frame stack vs 2.5 Hz for cross-attention
  • Tradeoff: state older than the buffer has no explicit place to live inside \(D_\theta\)

Cheap and effective inside the buffer. Past the buffer, there is no architectural memory.
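The tradeoff in the last bullet is the whole data structure. A minimal sketch (names are illustrative, not the paper's code): frame stacking is just a fixed-depth FIFO, and eviction is the memory limit.

```python
from collections import deque

class FrameStack:
    """Frame stacking as the world model's only memory: a fixed-depth FIFO.
    Anything older than `depth` frames has no place to live."""
    def __init__(self, depth, blank):
        self.buf = deque([blank] * depth, maxlen=depth)

    def push(self, frame):
        self.buf.append(frame)        # the oldest frame silently falls off

    def condition(self):
        return list(self.buf)         # the history D_theta gets to see

stack = FrameStack(depth=4, blank=None)
for frame_id in range(10):            # integers stand in for frames here
    stack.push(frame_id)
```

After ten pushes only the last four frames survive, which is exactly the object-permanence failure the CS:GO demos exercise: anything that scrolled out of the buffer is gone.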

Scaling: from Atari to Counter-Strike

What grew:

  • Data: 87 hours of human Dust II gameplay
  • Training setup: no RL agent, no collection loop, fit offline on static data
  • Params: 4.4M \(\to\) 381M
  • Resolution: dynamics at \(56 \times 30\), upsampler at \(280 \times 150\)
  • Hardware: 12 days on a single RTX 4090 to train. ~10 frames per second on an RTX 3090 at inference.

Paper Figure 6. Frames captured from people playing with keyboard and mouse inside DIAMOND’s CS:GO world model.

87 hours is roughly 0.5% of the data GameNGen used for its DOOM neural engine.

Scaling: what stayed the same

  • Same 2D U-Net backbone
  • Same frame-stack conditioning
  • Same action-modulated normalization
  • Same EDM loss
  • Still 3 denoising steps for the dynamics model
  • Upsampler uses 10 denoising steps for visual quality
  • CS:GO experiment is world model only: no trained CS:GO agent

DIAMOND gets a playable 3D world model from the cheap version of this bet.

What the paper says can break

  • Discrete controls only: Atari evaluation is on discrete-action environments
  • Memory is minimal: frame stacking only remembers the recent buffer
  • Longer-term memory is future work: the authors point to transformer-style context
  • Reward and termination are separate: \(R_\psi\) is not integrated into the diffusion model
  • CS:GO failure examples: losing visibility can produce new weapons or map regions, and repeated jumps can become possible


Scaling may improve visual quality. It does not automatically give the world model persistent state.

Back to the dream: CS:GO, deliberately broken

Same playable world model we opened with. Runs on the authors’ released checkpoint at diamond-wm.github.io.

Same architecture used for Atari.

  • Same U-Net
  • Same frame-stack memory
  • Same 3-denoising-step sampler
  • 3D rendering instead of 2D

Same memory limit.


Same underlying cause: finite memory + long-horizon dependence = object-amnesia.

Three live CS:GO stress tests

  1. Walk up to a wall and stare at it. When the rest of the scene rolls out of the frame stack, what comes back will not be what was there.
  2. Sprint down a corridor and look back. The corridor behind me is a different corridor.
  3. Find a weapon on the ground, look away, look back. Different weapon. Sometimes no weapon.


The goal is not to find a weird edge case. It is to run an object-permanence test in a 3D environment.

What DIAMOND showed us

What it showed:

  • Diffusion trained with the EDM objective and sampled at 3 steps is a viable action-conditioned dynamics model
  • Operating directly in image space preserves pixel-level details that discrete-token world models lose
  • The same architecture scales from 4.4M on 2D Atari to 381M on a 3D first-person shooter, on a single consumer GPU, on 87 hours of data
  • New best Atari 100k among agents trained entirely in a world model (HNS \(1.46\))

What DIAMOND did not show

What it did not show, and the paper agrees:

  • Long-horizon state persistence beyond the frame-stack depth
  • How to integrate a structured memory mechanism into a pixel-space diffusion loop
  • How to make the world model faithful enough to train a CS:GO-level agent in the dream. The CS:GO experiment has no agent, only the world model
  • A path to continuous-control domains where state is even higher-dimensional than pixels


DIAMOND is a working proof-of-concept for one design axis (tokenizer-free) and an honest flag for one open problem (memory). Both halves are real.