DIAMOND

Diffusion as a Model of Environment Dreams

Alonso et al., NeurIPS 2024
Presented by Manish Mathai

May 1, 2026

World Modeling - Counter-Strike: Global Offensive

CS:GO - game engine
CS:GO - world model

Look ma, No Game Engine!

Today

  1. What is a world model? The 2018 idea.
  2. What DIAMOND changed. Pixel-space diffusion, three denoising steps.
  3. Where it wins. Atari 100k headline, visual detail, CS:GO scaling.
  4. Where it breaks. Live failure modes on CS:GO.

What is a world model?

A world model is a learned, action-conditioned simulator of an environment.

\[x_{t+1} \;\sim\; M_\theta(x_t,\, a_t)\]

  • \(M_\theta\): the world model with learned parameters \(\theta\).
  • \(x_t\): the current observation (a frame, a sensor reading).
  • \(a_t\): the action (the system’s control input).

Choose an action \(a\) and evaluate \(M_\theta(x_t, a)\). Sample one next frame, then repeat. The resulting sequence is the predicted trajectory, aka the rollout.
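The sample-then-repeat loop above can be sketched in a few lines. Everything here is a toy stand-in, not any particular world model's API: `model` plays the role of \(M_\theta\), `policy` picks actions, and observations are scalars just so the loop runs.

```python
import numpy as np

def rollout(model, x0, policy, horizon, rng):
    """Sample a predicted trajectory: repeatedly draw x_{t+1} ~ M_theta(x_t, a_t).

    `model` and `policy` are placeholders for whatever world model and
    controller are in use.
    """
    xs, x = [x0], x0
    for _ in range(horizon):
        a = policy(x)
        x = model(x, a, rng)   # sample one next observation
        xs.append(x)           # the sequence is the rollout
    return xs

# Toy stand-ins: observations are scalars, the "model" adds the action plus noise.
rng = np.random.default_rng(0)
toy_model = lambda x, a, rng: x + a + 0.01 * rng.standard_normal()
traj = rollout(toy_model, 0.0, lambda x: 1.0, horizon=5, rng=rng)
```

The stochasticity matters: \(M_\theta\) defines a distribution over next frames, so two rollouts from the same start can diverge.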

World Models, 2018: V, M, C

“Recurrent World Models Facilitate Policy Evolution” (Ha & Schmidhuber, NeurIPS 2018)

VizDoom TakeCover: agent trained entirely inside the dream and deployed to the real game.

Model        Job                  Params
V (VAE)      frame to latent      4.4M
M (MDN-RNN)  predict next latent  1.7M
C (linear)   pick next action     1,088

Most of the model is perception and memory. The policy on top is tiny.

Source: worldmodels.github.io

Training V, M, C

Three steps, each independent:

  1. Collect. 10,000 rollouts from a random policy in CarRacing.
  2. Train V and M. V (VAE) fits on frames only, no rewards. M (MDN-RNN) predicts the next latent given current latent, action, and hidden state. Output is a mixture distribution, not a point estimate.
  3. Evolve C. CMA-ES, no backprop. The policy is small enough for evolutionary search.

Representation first, policy second.

V: frames in, reconstructions out.

C: linear map from \((z, h)\).

Learning inside a dream: VizDoom

Can agents learn inside of their own dreams?

  • Task: survive fireballs in VizDoom TakeCover
  • Same V/M/C structure as CarRacing
  • M predicts the next latent and death:

\[P(z_{t+1}, d_{t+1} \mid a_t, z_t, h_t)\]

  • C trains inside the dream environment
  • Reward: survival time, up to 2100 steps
  • The learned policy is deployed back to real VizDoom

Real VizDoom frame and VAE reconstruction.

Agent training inside the DoomRNN dream environment.

Dreams are only faithful while memory holds.

In a dream, the model can only remember what its memory carries forward.

  • CarRacing is forgiving. The road stays in view.
  • VizDoom is harder. M must track monsters, fireballs, and death across many steps.
  • Object permanence is the stress test. Turn around, hide something, revisit a place: the model has to remember.

The catch: if training did not demand remembering, the model probably will not.

DIAMOND: the paper we are here to discuss

DIffusion As a Model Of eNvironment Dreams

Alonso, Jelley, Micheli, Kanervisto, Storkey, Pearce, Fleuret, NeurIPS 2024 Spotlight


DIAMOND replaces latent dynamics with a conditional diffusion world model over frames.

  • World model: \(D_\theta\) predicts the next frame from recent frames and actions
  • Memory: no RNN inside \(D_\theta\), just frame stacking
  • Training loop: collect real data, fit \(D_\theta\), train the agent inside the generated environment
  • Why it matters: image-space rollouts can preserve visual details that discrete latents may lose

Diffusion for dynamics modeling, not image generation

A diffusion model for image generation learns:

\[\text{noise} \;\longmapsto\; \text{a plausible image}\]

DIAMOND’s world model learns:

\[(x_{\le t},\; a_{\le t},\; \epsilon) \;\longmapsto\; x_{t+1}\]


  • \(x_{\le t}\): recent frames. What the agent has seen so far.
  • \(a_{\le t}\): recent actions. What the agent did.
  • \(\epsilon\): noise. Diffusion starts from noise and denoises toward the next frame.
  • \(x_{t+1}\): next frame. This becomes part of the history for the next step.

Two clocks inside one rollout

  • Environment time \(t\) runs forward. The game advances one generated frame at a time.
  • Denoising time \(\tau\) runs backward. Each next frame is refined from noise to a clean observation.
  • History is the condition. \(D_\theta\) sees recent frames and actions at every denoising step.
  • Rollout is autoregressive. The generated frame becomes part of the history for the next environment step.
  • Why this is hard. A small visual error can feed back into the next condition and compound over time.

Paper Figure 1. The vertical sweep is the cost paid for each generated game step.
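The two nested clocks can be made concrete with a sketch. All names here are placeholders under the slide's assumptions, not the paper's code: `denoiser` stands for \(D_\theta\), `sample_step` for one step of the sampler.

```python
import numpy as np

def dream_rollout(denoiser, sample_step, policy, history, horizon, n_denoise=3):
    """Two clocks in one rollout. The outer loop advances environment time t;
    the inner loop runs denoising time tau backward, from noise to a frame.

    `denoiser`, `sample_step`, and `policy` are placeholders.
    """
    frames, actions = list(history), []
    for t in range(horizon):                       # environment time: forward
        a = policy(frames[-1])
        x = np.random.default_rng(t).standard_normal(np.shape(frames[-1]))  # start from noise
        for tau in reversed(range(n_denoise)):     # denoising time: backward
            x = sample_step(denoiser, x, tau, frames, actions + [a])
        frames.append(x)       # autoregressive: the new frame joins the history
        actions.append(a)
    return frames

# Toy stand-ins so the two loops run end to end.
toy_denoiser = lambda x, tau, frames, acts: frames[-1]          # "predict last frame"
toy_step = lambda den, x, tau, frames, acts: 0.5 * (x + den(x, tau, frames, acts))
out = dream_rollout(toy_denoiser, toy_step, lambda f: 0, [np.zeros(4)], horizon=3)
```

The feedback loop is visible in the last two lines of the outer loop: whatever the inner loop produces, flaws included, becomes part of the next condition.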

Architecture: U-Net with frame and action conditioning

2D U-Net: image-to-image with skip connections.

The U-Net denoises the next frame. Three conditioning paths feed it:

  • Past frames as channels. The last \(L\) frames concatenated with the noisy next frame.
  • Actions through AdaGN. A small MLP maps recent actions to per-block scale and bias for group-norm layers.
  • Noise level \(\tau\) through AdaGN. Same path, with \(\tau\) telling the denoiser how much noise remains.
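The AdaGN path above can be sketched as follows. This is a minimal NumPy rendering of the idea, not the paper's implementation: a conditioning vector (here standing in for the action-and-noise-level embedding) is mapped to a per-channel scale and bias applied after group normalization.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Plain group norm over a (C, H, W) feature map."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    g = (g - g.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(
        g.var(axis=(1, 2, 3), keepdims=True) + eps)
    return g.reshape(c, h, w)

def adagn(x, cond, W, b, groups=4):
    """Adaptive group norm: a small linear map (standing in for the MLP)
    turns the conditioning vector into per-channel scale and bias."""
    scale, bias = np.split(W @ cond + b, 2)
    y = group_norm(x, groups)
    return (1 + scale)[:, None, None] * y + bias[:, None, None]

rng = np.random.default_rng(0)
C, COND = 8, 6
x = rng.standard_normal((C, 16, 16))
W, b = 0.1 * rng.standard_normal((2 * C, COND)), np.zeros(2 * C)
y = adagn(x, rng.standard_normal(COND), W, b)
```

This is why the conditioning is cheap: actions and \(\tau\) never enter as extra spatial inputs, only as a handful of scale/bias values per block.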

What is \(D_\theta\) trained to do?

For each replay segment:

  1. Take the real next frame \(x_{t+1}^{[0]}\)
  2. Sample a noise level \(\sigma(\tau)\)
  3. Add Gaussian noise: \(x_{t+1}^{[\tau]} \sim \mathcal{N}(x_{t+1}^{[0]}, \sigma(\tau)^2 I)\)
  4. Condition on recent frames and actions
  5. Train \(D_\theta\) to recover the clean next frame

\[\hat{x}_{t+1}^{[0]} = D_\theta(x_{t+1}^{[\tau]}, \tau, x_{t-L+1:t}^{[0]}, a_{t-L+1:t})\]

\[\mathcal{L}(\theta)=\|\hat{x}_{t+1}^{[0]} - x_{t+1}^{[0]}\|_2^2\]

Inside \(D_\theta\): no RNN, no latent state, no cross-attention. Memory is the frame stack.
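The five training steps and the L2 loss above fit in one function. A minimal sketch under stated assumptions: `D` is a placeholder for the conditioned denoiser \(D_\theta\), and the log-normal noise schedule constants are illustrative, not the paper's values.

```python
import numpy as np

def sample_sigma(rng, p_mean=-0.4, p_std=1.2):
    """Draw a noise level sigma(tau); log-normal schedule, constants illustrative."""
    return float(np.exp(p_mean + p_std * rng.standard_normal()))

def training_step(D, batch, rng):
    """One step of the objective above: noise the real next frame, condition
    on history, regress back to the clean frame."""
    frames, actions, x_next = batch                 # steps 1 and 4: data + conditioning
    sigma = sample_sigma(rng)                       # step 2: sample a noise level
    x_noisy = x_next + sigma * rng.standard_normal(x_next.shape)  # step 3: add noise
    x_hat = D(x_noisy, sigma, frames, actions)      # step 5: predict the clean frame
    return float(np.mean((x_hat - x_next) ** 2))    # L(theta), the L2 loss

rng = np.random.default_rng(0)
batch = (np.zeros((4, 8, 8)), np.zeros(4), np.zeros((8, 8)))
oracle = lambda xn, s, f, a: np.zeros_like(xn)   # perfect denoiser for this toy batch
copycat = lambda xn, s, f, a: xn                 # just echoes the noisy input
```

Note the target is the clean frame itself, not the noise: that choice is exactly where the EDM-vs-DDPM discussion on the next slide picks up.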

EDM gives stable few-step sampling

Stable Diffusion uses dozens of denoising steps. IRIS uses 16. DIAMOND uses 3.

Why EDM, not DDPM?

DDPM trains the network to predict the added noise. At high noise that target degenerates: the input is almost pure noise, so the network can nearly copy its input and still score well, giving a poor estimate at exactly the step where sampling begins. With few steps, those early errors compound.

EDM (Karras 2022) interpolates the target with noise level: clean frame at high noise, residual correction at low noise.

Same U-Net, same data, different parameterization.

DDPM: drift worsens as \(n\) (denoising steps per frame) shrinks.
EDM: stable across the same \(n\) range, even out to \(t = 1000\) environment steps. Paper Figure 3.
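The interpolation EDM uses can be written down directly. These are the preconditioning weights from Karras et al. 2022, \(D(x;\sigma) = c_{\text{skip}}(\sigma)\,x + c_{\text{out}}(\sigma)\,F_\theta(\cdot)\); the `SIGMA_DATA` constant is the EDM default, which may differ from DIAMOND's setting.

```python
import numpy as np

SIGMA_DATA = 0.5   # data standard deviation; EDM's default constant

def c_skip(sigma):
    """Weight on the noisy input inside D = c_skip * x + c_out * F(...)."""
    return SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)

def c_out(sigma):
    """Weight on the network output F_theta."""
    return sigma * SIGMA_DATA / np.sqrt(sigma**2 + SIGMA_DATA**2)

# High noise: c_skip -> 0, the network must predict the clean frame itself.
# Low noise:  c_skip -> 1, the network only supplies a small residual correction.
high, low = c_skip(80.0), c_skip(1e-3)
```

This is the "clean frame at high noise, residual correction at low noise" behavior in formula form: the network's effective target slides with \(\sigma\), with no change to the U-Net or the data.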

Why not one denoising step?

Paper Figure 4. Single-step (top row) vs multi-step (bottom row) denoising in Boxing.

Top row: \(n = 1\). Single-step denoising averages over possible futures for the unpredictable black player. Blurry interpolation.

Bottom row: \(n = 3\). DIAMOND’s default. The sampler drives toward one mode of the posterior. Crisp frame.

Why the white player stays crisp. The policy controls the white player, so its action is known to \(D_\theta\).

Training the agent inside the dream

DIAMOND’s outer loop:

  1. Collect. The current policy acts in the real environment. Record frames, actions, rewards, terminations.
  2. Fit \(D_\theta\) on everything collected so far. EDM loss, 3 denoising steps per frame.
  3. Fit \(R_\psi\), a small reward and termination model from the same frame history.
  4. Dream. Roll out imagined trajectories inside \((D_\theta, R_\psi)\).
  5. Train the policy on the dream. A CNN-LSTM actor-critic is updated by policy gradient.
  6. Repeat from step 1 with the updated policy.

Only step 1 touches reality. Everything else lives in the dream.
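The six-step outer loop reads naturally as code. Every name below is a stub, not the paper's API; the point is only the control flow, with `collect` as the single contact with reality.

```python
def train_diamond(n_epochs):
    """DIAMOND's outer loop with every stage stubbed out.
    Only `collect` touches the real environment."""
    data, policy = [], init_policy()
    for _ in range(n_epochs):
        data += collect(policy)                 # 1. act in the real environment
        D = fit_world_model(data)               # 2. EDM loss, 3 denoising steps
        R = fit_reward_model(data)              # 3. reward + termination model
        dreams = [imagine(D, R, policy) for _ in range(4)]   # 4. dream
        policy = policy_gradient(policy, dreams)             # 5. update actor-critic
    return policy                               # 6. loop back with the new policy

# Trivial stand-ins so the loop runs.
init_policy = lambda: 0
collect = lambda p: [p]
fit_world_model = lambda d: len(d)
fit_reward_model = lambda d: len(d)
imagine = lambda D, R, p: D + R
policy_gradient = lambda p, dreams: p + 1
final = train_diamond(3)
```

One structural point the sketch makes visible: the world model is refit on the growing dataset every epoch, so the dream improves alongside the policy that explores it.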

Atari 100k: the headline number

Mean human-normalized score. Agents are trained entirely inside their own world model.

Method            Mean HNS
SimPLe (2019)     0.332
TWM (2023)        0.956
IRIS (2023)       1.046
DreamerV3 (2023)  1.097
STORM (2023)      1.266
DIAMOND (2024)    1.459

Paper Table 1. Human score \(=1.0\), random \(=0.0\). 5 seeds per game, 26 games.

Paper Figure 2. Stratified bootstrap confidence intervals for mean and IQM (interquartile mean: middle 50% of scores) on Atari 100k. DIAMOND in blue.
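The two metrics on this slide are simple to define precisely. A sketch with toy numbers (not scores from the paper):

```python
import numpy as np

def hns(score, random_score, human_score):
    """Human-normalized score: random play -> 0.0, human play -> 1.0."""
    return (score - random_score) / (human_score - random_score)

def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of sorted scores,
    more outlier-robust than the plain mean."""
    xs = np.sort(np.asarray(scores, dtype=float))
    k = len(xs) // 4
    return float(xs[k:len(xs) - k].mean())

# Toy numbers, not from the paper: halfway between random and human is HNS 0.5.
example_hns = hns(score=150.0, random_score=50.0, human_score=250.0)
```

IQM is reported alongside the mean because a single runaway game (HNS in the tens) can dominate a 26-game mean; trimming the top and bottom quartiles removes that leverage.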

Where DIAMOND wins

  • Superhuman on 11 of 26 games, new best among agents trained entirely in a world model
  • 100k interactions per game \(\approx\) 2 hours of human play; unrestricted agents often get 500x that budget
  • The gains are not uniform across the benchmark.
  • Stronger where small visual details matter: Asterix enemies vs rewards, Breakout brick layout, RoadRunner reward dots
  • Weaker on some games such as Frostbite and Hero

The paper’s bet: modeling frames directly in pixel space preserves details that tokenizers can erase.

Visual details: DIAMOND vs IRIS

Every other top Atari 100k world model builds a discrete bottleneck first:

Method            Internal representation
IRIS (2023)       VQ-VAE image tokens
DreamerV3 (2023)  categorical latents
STORM (2023)      categorical tokens
DIAMOND (2024)    raw pixel frames

The bet. Image-space diffusion preserves small visual details that discrete bottlenecks may round off.

The paper’s evidence. IRIS rollouts flicker on tiny details; DIAMOND rollouts stay more consistent.

IRIS: enemies turn into rewards, bricks flicker, score digits wobble.
DIAMOND: same games, fewer flagged inconsistencies. Paper Figure 5.

Why frame stacking matters

Frame stacking is not just the simple option. In their tests, it won.

  • Paper’s limitation: frame stacking is a minimal memory mechanism
  • Why it is cheap: history is fed as pixels, so \(D_\theta\) stays close to a 2D U-Net
  • Appendix M comparison: frame stacking beat cross-attention in 3D rollouts
  • CS:GO FVD: 34.8 for frame stack vs 81.4 for cross-attention
  • Sample rate: 7.4 Hz for frame stack vs 2.5 Hz for cross-attention
  • Tradeoff: state older than the buffer has no explicit place to live inside \(D_\theta\)

Cheap and effective inside the buffer. Past the buffer, there is no architectural memory.
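The tradeoff in the last bullet is the whole data structure. A minimal sketch (names are illustrative, not the paper's code): frame stacking is just a fixed-depth FIFO, and eviction is the memory limit.

```python
from collections import deque

class FrameStack:
    """Frame stacking as the world model's only memory: a fixed-depth FIFO.
    Anything older than `depth` frames has no place to live."""
    def __init__(self, depth, blank):
        self.buf = deque([blank] * depth, maxlen=depth)

    def push(self, frame):
        self.buf.append(frame)        # the oldest frame silently falls off

    def condition(self):
        return list(self.buf)         # the history D_theta gets to see

stack = FrameStack(depth=4, blank=None)
for frame_id in range(10):            # integers stand in for frames here
    stack.push(frame_id)
```

After ten pushes only the last four frames survive, which is exactly the object-permanence failure the CS:GO demos exercise: anything that scrolled out of the buffer is gone.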

Scaling: from Atari to Counter-Strike

What grew:

  • Data: 87 hours of human Dust II gameplay
  • Training setup: no RL agent, no collection loop, fit offline on static data
  • Params: 4.4M \(\to\) 381M
  • Resolution: dynamics at \(56 \times 30\), upsampler at \(280 \times 150\)
  • Hardware: 12 days on a single RTX 4090 to train. ~10 frames per second on an RTX 3090 at inference.

Paper Figure 6. Frames captured from people playing with keyboard and mouse inside DIAMOND’s CS:GO world model.

87 hours is roughly 0.5% of the data GameNGen used for its DOOM neural engine.

Scaling: what stayed the same

  • Same 2D U-Net backbone
  • Same frame-stack conditioning
  • Same action-modulated normalization
  • Same EDM loss
  • Still 3 denoising steps for the dynamics model
  • Upsampler uses 10 denoising steps for visual quality
  • CS:GO experiment is world model only: no trained CS:GO agent

DIAMOND gets a playable 3D world model from the cheap version of this bet.

What the paper says can break

  • Discrete controls only: Atari evaluation is on discrete-action environments
  • Memory is minimal: frame stacking only remembers the recent buffer
  • Longer-term memory is future work: the authors point to transformer-style context
  • Reward and termination are separate: \(R_\psi\) is not integrated into the diffusion model
  • CS:GO failure examples: losing visibility can produce new weapons or map regions, and repeated jumps can become possible


Scaling may improve visual quality. It does not automatically give the world model persistent state.

Back to the dream: CS:GO, deliberately broken

Same playable world model we opened with. Runs on the authors’ released checkpoint at diamond-wm.github.io.

Same architecture used for Atari.

  • Same U-Net
  • Same frame-stack memory
  • Same 3-denoising-step sampler
  • 3D rendering instead of 2D

Same memory limit.


Same underlying cause: finite memory + long-horizon dependence = object-amnesia.

Three live CS:GO stress tests

  1. Walk up to a wall and stare at it. When the rest of the scene rolls out of the frame stack, what comes back will not be what was there.
  2. Sprint down a corridor and look back. The corridor behind me is a different corridor.
  3. Find a weapon on the ground, look away, look back. Different weapon. Sometimes no weapon.


The goal is not to find a weird edge case. It is to run an object-permanence test in a 3D environment.

What DIAMOND showed us

What it showed:

  • Diffusion trained with the EDM objective and sampled at 3 steps is a viable action-conditioned dynamics model
  • Operating directly in image space preserves pixel-level details that discrete-token world models lose
  • The same architecture scales from 4.4M on 2D Atari to 381M on a 3D first-person shooter, on a single consumer GPU, on 87 hours of data
  • New best Atari 100k among agents trained entirely in a world model (HNS \(1.46\))

What DIAMOND did not show

What it did not show, and the paper agrees:

  • Long-horizon state persistence beyond the frame-stack depth
  • How to integrate a structured memory mechanism into a pixel-space diffusion loop
  • How to make the world model faithful enough to train a CS:GO-level agent in the dream. The CS:GO experiment has no agent, only the world model
  • A path to continuous-control domains where state is even higher-dimensional than pixels


DIAMOND is a working proof-of-concept for one design axis (tokenizer-free) and an honest flag for one open problem (memory). Both halves are real.