Alonso et al., NeurIPS 2024
Presented by Manish Mathai
May 1, 2026
World Modeling - Counter-Strike: Global Offensive
CS:GO - game engine
CS:GO - world model
Look ma, No Game Engine!
Today
What is a world model? The 2018 idea.
What DIAMOND changed. Pixel-space diffusion, three denoising steps.
Where it wins. Atari 100k headline, visual detail, CS:GO scaling.
Where it breaks. Live failure modes on CS:GO.
What is a world model?
A world model is a learned, action-conditioned simulator of an environment.
\[x_{t+1} \;\sim\; M_\theta(x_t,\, a_t)\]
\(M_\theta\): the world model with learned parameters \(\theta\).
\(x_t\): the current observation (a frame, a sensor reading).
\(a_t\): the action (the system’s control input).
Choose an action \(a\) and evaluate \(M_\theta(x_t, a)\). Sample one next frame, then repeat. The resulting sequence is the predicted trajectory, aka the rollout.
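A minimal sketch of that loop, assuming a hypothetical `model.sample(x, a)` that draws one next frame from \(M_\theta\):

```python
# Minimal rollout sketch. `model`, `policy`, and `x0` are hypothetical
# stand-ins; model.sample(x, a) draws one next frame from M_theta(x, a).
def rollout(model, policy, x0, horizon):
    x, trajectory = x0, [x0]
    for _ in range(horizon):
        a = policy(x)           # choose an action
        x = model.sample(x, a)  # sample one next frame
        trajectory.append(x)
    return trajectory           # the predicted trajectory, aka the rollout
```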
Collect. 10,000 rollouts from a random policy in CarRacing.
Train V and M. V (VAE) fits on frames only, no rewards. M (MDN-RNN) predicts the next latent given current latent, action, and hidden state. Output is a mixture distribution, not a point estimate.
Evolve C. CMA-ES, no backprop. The policy is small enough for evolutionary search.
Representation first, policy second.
V: frames in, reconstructions out.
M: latent and action in, a mixture over the next latent out.
C: linear map from \((z, h)\) to the action.
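C really is that small. A sketch in numpy; W_c and b_c are hypothetical names for the parameters CMA-ES searches over:

```python
import numpy as np

# The controller C is a single linear map from the concatenated (z, h)
# to the action, squashed to the valid range. W_c and b_c (hypothetical
# names) are the only parameters; CMA-ES evolves them with no backprop.
def controller(z, h, W_c, b_c):
    return np.tanh(W_c @ np.concatenate([z, h]) + b_c)
```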
Learning inside a dream: VizDoom
Can agents learn inside of their own dreams?
Task: survive fireballs in VizDoom TakeCover
Same V/M/C structure as CarRacing
M predicts the next latent and death:
\[P(z_{t+1}, d_{t+1} \mid a_t, z_t, h_t)\]
C trains inside the dream environment
Reward: survival time, up to 2100 steps
The learned policy is deployed back to real VizDoom
Real VizDoom frame and VAE reconstruction.
Agent training inside the DoomRNN dream environment.
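A sketch of one dream episode, assuming a hypothetical M.step that samples from \(P(z_{t+1}, d_{t+1} \mid a_t, z_t, h_t)\):

```python
# Dream-episode sketch for TakeCover. M.step is a hypothetical interface
# to the MDN-RNN: it samples the next latent, a death probability, and
# the next hidden state. The reward C is evolved on is survival time.
def dream_episode(M, C, z0, h0, max_steps=2100):
    z, h, survived = z0, h0, 0
    for _ in range(max_steps):
        a = C(z, h)
        z, p_death, h = M.step(z, a, h)
        if p_death > 0.5:   # the dream declares the agent dead
            break
        survived += 1
    return survived         # survival time, capped at 2100 steps
```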
Dreams are only faithful while memory holds.
In a dream, the model can only remember what its memory carries forward.
CarRacing is forgiving. The road stays in view.
VizDoom is harder. M must track monsters, fireballs, and death across many steps.
Object permanence is the stress test. Turn around, hide something, revisit a place: the model has to remember.
The catch: if training did not demand remembering, the model probably will not.
DDPM trains the network to predict the added noise. At high noise that target degenerates: the input is almost pure noise, so the network can score well by nearly copying it. The implied clean-frame estimate is then poor exactly where sampling starts, and over an autoregressive rollout those errors compound.
EDM (Karras 2022) interpolates the target with noise level: clean frame at high noise, residual correction at low noise.
Same U-Net, same data, different parameterization.
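Concretely, EDM wraps the same network \(F_\theta\) in noise-dependent scalings (Karras 2022 preconditioning; \(\sigma_{\text{data}}\) is the data standard deviation):

\[
D_\theta(x;\sigma) \;=\; c_{\text{skip}}(\sigma)\,x \;+\; c_{\text{out}}(\sigma)\,F_\theta\big(c_{\text{in}}(\sigma)\,x;\; c_{\text{noise}}(\sigma)\big),
\qquad
c_{\text{skip}}(\sigma) = \frac{\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}
\]

At high \(\sigma\), \(c_{\text{skip}} \to 0\) and the network must produce the clean frame outright; at low \(\sigma\), \(c_{\text{skip}} \to 1\) and \(F_\theta\) contributes only a small residual correction.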
DDPM: drift worsens as \(n\) (denoising steps per frame) shrinks.
EDM: stable across the same \(n\) range, even at rollouts of \(t = 1000\) frames. Paper Figure 3.
Why not one denoising step?
Paper Figure 4. Single-step (top row) vs multi-step (bottom row) denoising in Boxing.
Top row: \(n = 1\). Single-step denoising averages over possible futures for the unpredictable black player. Blurry interpolation.
Bottom row: \(n = 3\). DIAMOND’s default. The sampler drives toward one mode of the posterior. Crisp frame.
Why the white player stays crisp. The policy controls the white player, so its action is known to \(D_\theta\).
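In sampler terms, \(n\) denoising steps is a short Euler loop over a decreasing noise schedule. A sketch with a hypothetical denoiser D(x, sigma, cond), where cond carries the stacked past frames and the action:

```python
import numpy as np

# EDM-style Euler sampler sketch. D(x, sigma, cond) is a hypothetical
# denoiser that returns the predicted clean frame at noise level sigma.
# n denoising steps means len(sigmas) == n + 1, ending at sigma = 0.
def sample_frame(D, cond, shape, sigmas, rng):
    x = rng.standard_normal(shape) * sigmas[0]  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - D(x, sigma, cond)) / sigma     # ODE drift toward the clean frame
        x = x + (sigma_next - sigma) * d        # one Euler step
    return x
```

With \(n = 1\) the single step lands on the denoiser's posterior mean, the blurry average above; with \(n = 3\) the later steps pull the sample toward one mode.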
Training the agent inside the dream
DIAMOND’s outer loop:
Collect. The current policy acts in the real environment. Record frames, actions, rewards, terminations.
Fit \(D_\theta\) on everything collected so far. EDM loss, 3 denoising steps per frame.
Fit \(R_\psi\), a small reward and termination model from the same frame history.
Dream. Roll out imagined trajectories inside \((D_\theta, R_\psi)\).
Train the policy on the dream. A CNN-LSTM actor-critic is updated by policy gradient.
Repeat from step 1 with the updated policy.
Only step 1 touches reality. Everything else lives in the dream.
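The loop in sketch form; every helper name is a hypothetical stand-in for the paper's components, and only collect touches the real environment:

```python
# DIAMOND outer-loop sketch. D is the diffusion dynamics model, R the
# small reward/termination model, pi the CNN-LSTM actor-critic. The
# helper functions are hypothetical stand-ins, named for the steps above.
def diamond_outer_loop(env, pi, D, R, buffer, iterations):
    for _ in range(iterations):
        buffer += collect(env, pi)          # 1. act in reality, record everything
        fit_dynamics(D, buffer)             # 2. EDM loss, 3 denoising steps per frame
        fit_reward_termination(R, buffer)   # 3. rewards and episode ends
        dreams = imagine(D, R, pi)          # 4. rollouts inside the world model
        update_policy(pi, dreams)           # 5. policy-gradient update on the dream
    return pi
```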
Atari 100k: the headline number
Mean human-normalized score. Agents are trained entirely inside their own world model.
Method | Mean HNS
SimPLe (2019) | 0.332
TWM (2023) | 0.956
IRIS (2023) | 1.046
DreamerV3 (2023) | 1.097
STORM (2023) | 1.266
DIAMOND (2024) | 1.459
Paper Table 1. Human score \(=1.0\), random \(=0.0\). 5 seeds per game, 26 games.
Paper Figure 2. Stratified bootstrap confidence intervals for mean and IQM (interquartile mean: middle 50% of scores) on Atari 100k. DIAMOND in blue.
Where DIAMOND wins
Superhuman on 11 of 26 games, new best among agents trained entirely in a world model
100k interactions per game \(\approx\) 2 hours of human play (arithmetic below); unrestricted agents often get 500x that budget
The gains are not uniform across the benchmark.
Stronger where small visual details matter: Asterix enemies vs rewards, Breakout brick layout, RoadRunner reward dots
Weaker on some games such as Frostbite and Hero
The paper’s bet: modeling frames directly in pixel space preserves details that tokenizers can erase.
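Where the 2-hour figure comes from, assuming the standard Atari frame skip of 4 at 60 fps:

\[
100{,}000 \text{ actions} \times 4\ \tfrac{\text{frames}}{\text{action}} \,\div\, 60\ \tfrac{\text{frames}}{\text{s}} \;\approx\; 6{,}700\ \text{s} \;\approx\; 1.9\ \text{h}
\]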
Visual details: DIAMOND vs IRIS
Every other top Atari 100k world model builds a discrete bottleneck first:
Method | Internal representation
IRIS (2023) | VQ-VAE image tokens
DreamerV3 (2023) | categorical latents
STORM (2023) | categorical tokens
DIAMOND (2024) | raw pixel frames
The bet. Image-space diffusion preserves small visual details that discrete bottlenecks may round off.
The paper’s evidence. IRIS rollouts flicker on tiny details; DIAMOND rollouts stay more consistent.
IRIS: enemies turn into rewards, bricks flicker, score digits wobble.
DIAMOND: same games, fewer flagged inconsistencies. Paper Figure 5.
Why frame stacking matters
Frame stacking is not just the simple option. In their tests, it won.
Paper’s limitation: frame stacking is a minimal memory mechanism
Why it is cheap: history is fed as pixels, so \(D_\theta\) stays close to a 2D U-Net
Appendix M comparison: frame stacking beat cross-attention in 3D rollouts
CS:GO FVD (Fréchet Video Distance; lower is better): 34.8 for frame stack vs 81.4 for cross-attention
Sample rate: 7.4 Hz for frame stack vs 2.5 Hz for cross-attention
Tradeoff: state older than the buffer has no explicit place to live inside \(D_\theta\)
Cheap and effective inside the buffer. Past the buffer, there is no architectural memory.
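What frame stacking looks like at the network boundary: the last L frames ride along as extra input channels, and anything older never reaches \(D_\theta\) at all. A numpy sketch, with L as an illustrative buffer length:

```python
import numpy as np

# Frame-stack conditioning sketch. Each frame has shape (C, H, W); the
# last L observed frames are concatenated channel-wise with the noisy
# frame being denoised, so D_theta stays a plain 2D U-Net over
# (C * (L + 1), H, W). Frames older than L are simply invisible to it.
def build_unet_input(noisy_frame, frame_buffer, L=4):
    history = np.concatenate(frame_buffer[-L:], axis=0)    # (C*L, H, W)
    return np.concatenate([noisy_frame, history], axis=0)  # (C*(L+1), H, W)
```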
Scaling: from Atari to Counter-Strike
What grew:
Data: 87 hours of human Dust II gameplay
Training setup: no RL agent, no collection loop, fit offline on static data
Params: 4.4M \(\to\) 381M
Resolution: dynamics at \(56 \times 30\), upsampler at \(280 \times 150\)
Hardware: 12 days on a single RTX 4090 to train. ~10 frames per second on an RTX 3090 at inference.
Paper Figure 6. Frames captured from people playing with keyboard and mouse inside DIAMOND’s CS:GO world model.
87 hours is roughly 0.5% of the data GameNGen used for its DOOM neural engine.
Scaling: what stayed the same
Same 2D U-Net backbone
Same frame-stack conditioning
Same action-modulated normalization
Same EDM loss
Still 3 denoising steps for the dynamics model
Upsampler uses 10 denoising steps for visual quality
CS:GO experiment is world model only: no trained CS:GO agent
DIAMOND gets a playable 3D world model from the cheap version of this bet.
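The two-model pipeline in one sketch, with hypothetical sample wrappers around the two diffusion models (how the upsampler is conditioned is simplified here):

```python
# CS:GO two-model sketch (hypothetical interfaces). The dynamics model
# predicts the next low-res frame from the frame stack and the action;
# a second diffusion model then upsamples it for display.
def next_csgo_frame(dynamics, upsampler, low_res_stack, action, sample):
    low = sample(dynamics, cond=(low_res_stack, action), n_steps=3)   # 56 x 30
    high = sample(upsampler, cond=low, n_steps=10)                    # 280 x 150
    return low, high  # low feeds the next frame stack; high is what you see
```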
What the paper says can break
Discrete controls only: Atari evaluation is on discrete-action environments
Memory is minimal: frame stacking only remembers the recent buffer
Longer-term memory is future work: the authors point to transformer-style context
Reward and termination are separate: \(R_\psi\) is not integrated into the diffusion model
CS:GO failure examples: losing visibility can produce new weapons or map regions, and repeated jumps can become possible
Scaling may improve visual quality. It does not automatically give the world model persistent state.
Back to the dream: CS:GO, deliberately broken
Same playable world model we opened with. Runs on the authors’ released checkpoint at diamond-wm.github.io.
Same architecture used for Atari.
Same U-Net
Same frame-stack memory
Same 3-denoising-step sampler
3D rendering instead of 2D
Same memory limit.
Same underlying cause: finite memory + long-horizon dependence = object-amnesia.
Three live CS:GO stress tests
Walk up to a wall and stare at it. When the rest of the scene rolls out of the frame stack, what comes back will not be what was there.
Sprint down a corridor and look back. The corridor behind you is a different corridor.
Find a weapon on the ground, look away, look back. Different weapon. Sometimes no weapon.
The goal is not to find a weird edge case. It is to run an object-permanence test in a 3D environment.
What DIAMOND showed us
What it showed:
Diffusion trained with the EDM objective and sampled at 3 steps is a viable action-conditioned dynamics model
Operating directly in image space preserves pixel-level details that discrete-token world models lose
The same architecture scales from 4.4M on 2D Atari to 381M on a 3D first-person shooter, on a single consumer GPU, on 87 hours of data
New best Atari 100k among agents trained entirely in a world model (HNS \(1.46\))
What DIAMOND did not show
What it did not show, and the paper agrees:
Long-horizon state persistence beyond the frame-stack depth
How to integrate a structured memory mechanism into a pixel-space diffusion loop
How to make the world model faithful enough to train a CS:GO-level agent in the dream. The CS:GO experiment has no agent, only the world model
A path to continuous-control domains where state is even higher-dimensional than pixels
DIAMOND is a working proof-of-concept for one design axis (tokenizer-free) and an honest flag for one open problem (memory). Both halves are real.