23 World Models
What if an AI could close its eyes and imagine the future? Not merely predict the next word in a sequence, but construct a rich internal simulation of how the world works: what happens when a ball is thrown, when a car turns a corner, when a robot reaches for a cup. This is the ambition behind world models, AI systems that learn internal representations of environment dynamics and use them to predict, plan, and reason.
World models represent a fundamentally different paradigm from the pattern-matching and sequence prediction that dominate current AI. A classifier looks at data and assigns labels. A generative model looks at patterns and produces similar patterns. A world model simulates. It takes the current state of the world and an action, and predicts what will happen next. This capacity for mental simulation is what allows biological organisms, including humans, to plan ahead, avoid danger, and imagine counterfactuals. Many researchers believe it is a prerequisite for achieving truly intelligent machines.
LeCun argues that current autoregressive LLMs, despite their impressive language abilities, lack the capacity for genuine understanding because they do not build world models (LeCun 2022). On this view, an LLM can describe what happens when you drop a glass, but it does not maintain an internal simulation of gravity, fragility, and shattering. The argument is that learning world models is the key missing piece on the path to human-level AI.
23.1 What is a World Model?
The concept draws from cognitive science. Humans maintain internal models of their environment: you can close your eyes and imagine walking through your house, predict where a thrown ball will land, or mentally rehearse a conversation before having it. These internal simulations allow planning without trial and error.
Formally, a world model is a learned function \(f\) that, given the current state \(s_t\) and an action \(a_t\), predicts the next state: \[\hat{s}_{t+1} = f(s_t, a_t)\]
In deep learning, world models are typically neural networks trained on sequences of observations (images, sensor readings, game states) and actions. Once trained, the model can “hallucinate” future trajectories by repeatedly applying \(f\), starting from the current state and a sequence of planned actions. The agent can evaluate these imagined trajectories to select the best course of action, all without touching the real world.
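The idea of repeatedly applying \(f\) can be made concrete with a minimal sketch. Here `toy_dynamics` is a hand-written stand-in for a trained network (an assumption for illustration only, as are the state layout and the time step `dt`); the point is the rollout loop, which feeds each predicted state back in as the next input.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]  # (position, velocity): a toy state, assumed for illustration


def toy_dynamics(state: State, action: float, dt: float = 0.1) -> State:
    """Stand-in for a learned model f(s_t, a_t): a point mass pushed by a force."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def imagine(f: Callable[[State, float], State], s0: State, actions: List[float]) -> List[State]:
    """Roll the model forward, feeding each prediction back in as the next input."""
    trajectory = [s0]
    for action in actions:
        trajectory.append(f(trajectory[-1], action))
    return trajectory


# Three imagined steps from rest: push, push, coast.
trajectory = imagine(toy_dynamics, (0.0, 0.0), [1.0, 1.0, 0.0])
```

Because each step consumes the previous prediction rather than a real observation, any error in \(f\) compounds over the rollout, which is one reason imagined horizons are usually kept short.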
A language model predicts the next token given previous tokens. A world model predicts the next state given the current state and an action. The key distinction is that world models are grounded in dynamics: they model cause and effect, not just statistical co-occurrence of symbols. This grounding is what makes them useful for planning and control.
23.1.1 Why World Models Matter
World models address several fundamental limitations of model-free approaches:
- Sample efficiency: A model-free RL agent often needs millions of environment interactions to learn. A world model agent can fit its model to a limited number of real interactions and then practice “in imagination,” dramatically reducing the number of real-world episodes needed.
- Safety: Before executing a risky action, the agent can simulate its consequences. If the imagined outcome is catastrophic (crashing into a wall, dropping a fragile object), the agent can choose a different action without ever experiencing the failure.
- Transfer and generalization: A good world model captures the underlying physics and dynamics of an environment, not just superficial patterns. This enables transfer to novel situations that share the same dynamics but differ in appearance.
- Planning: World models enable planning by imagination. The agent can search over possible action sequences, simulate each one, evaluate the outcomes, and choose the best sequence. This is fundamentally more powerful than reactive, stimulus-response behavior.
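Planning by imagination can be sketched in a few lines. The example below is a simplification under stated assumptions: `toy_model` is a hand-written stand-in for a learned world model, the action set is a hypothetical three-way choice, and the search is exhaustive, which is only feasible for very short horizons. Practical planners sample or optimize over action sequences instead.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def toy_model(position: float, action: float) -> float:
    """Stand-in for a learned world model: the action shifts a 1D position."""
    return position + action


def plan(
    model: Callable[[float, float], float],
    start: float,
    goal: float,
    horizon: int = 4,
    choices: Sequence[float] = (-1.0, 0.0, 1.0),
) -> Tuple[Tuple[float, ...], float]:
    """Simulate every action sequence in the model and keep the best one."""
    best_cost, best_plan = float("inf"), ()
    for candidate in product(choices, repeat=horizon):
        position = start
        for action in candidate:
            position = model(position, action)  # imagined step, nothing is executed
        cost = abs(position - goal)             # distance to the goal after the rollout
        if cost < best_cost:
            best_cost, best_plan = cost, candidate
    return best_plan, best_cost


best_plan, best_cost = plan(toy_model, start=0.0, goal=3.0)
```

Only the chosen plan (or just its first action, in a receding-horizon scheme) would then be executed in the real environment.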
23.2 Joint-Embedding Predictive Architectures (JEPA)
A central question in world model design is: what should the model predict? The naive answer is “pixels”: given the current video frame and an action, predict the next video frame. But pixel-level prediction is enormously wasteful. Most pixels in a frame are irrelevant (background, static objects), and forcing the model to predict them consumes capacity that could be spent on the semantically important aspects of the scene.
LeCun's Joint-Embedding Predictive Architecture (JEPA) (LeCun 2022) proposes an elegant alternative: predict in representation space, not pixel space. Instead of generating the raw next frame, the model predicts the embedding of the next observation.
23.2.1 Architecture
JEPA consists of three components:
- An encoder \(f_\theta\) that maps observations \(x\) and \(y\) to embeddings \(f_\theta(x)\) and \(f_\theta(y)\).
- A predictor \(g_\phi\) that takes the embedding of \(x\) and an optional conditioning variable \(z\) and predicts the embedding of \(y\): \(\hat{f}_\theta(y) = g_\phi(f_\theta(x), z)\).
- A target encoder (an exponential moving average of the main encoder) that produces the target embeddings. Crucially, no gradient flows through the target encoder. Combined with the slowly moving average weights, this asymmetry helps prevent representation collapse (the trivial solution where the encoder maps everything to the same embedding).
The training loss encourages the predicted embedding to match the target embedding: \(g_\phi(f_\theta(x), z) \approx \bar{f}_\theta(y)\), where \(\bar{f}_\theta\) is the target encoder.
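A minimal sketch of the two mechanics described above, the prediction loss in embedding space and the moving-average update of the target encoder, is shown below. Everything here is a toy assumption: the "encoder" and "predictor" are elementwise scalings rather than neural networks, the conditioning variable \(z\) is omitted, the squared-error distance is one possible choice, and the momentum value is arbitrary.

```python
from typing import List

Vector = List[float]


def encode(weights: Vector, x: Vector) -> Vector:
    """Toy encoder: scales each input feature by a weight (stand-in for a network)."""
    return [w * xi for w, xi in zip(weights, x)]


def predict(pred_weights: Vector, embedding: Vector) -> Vector:
    """Toy predictor g: maps the embedding of x to a guess at the embedding of y."""
    return [p * e for p, e in zip(pred_weights, embedding)]


def jepa_loss(online_w: Vector, pred_w: Vector, target_w: Vector, x: Vector, y: Vector) -> float:
    """Mean squared error between predicted and target embeddings."""
    predicted = predict(pred_w, encode(online_w, x))
    target = encode(target_w, y)  # treated as a constant: no gradient would flow here
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def ema_update(target_w: Vector, online_w: Vector, momentum: float = 0.99) -> Vector:
    """Move the target encoder a small step toward the online encoder."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target_w, online_w)]


loss = jepa_loss([1.0, 2.0], [1.0, 1.0], [1.0, 2.0], x=[1.0, 1.0], y=[1.0, 1.0])
new_target = ema_update([0.0, 0.0], [1.0, 2.0], momentum=0.9)
```

In a real training loop, only the online encoder and predictor would receive gradient updates from the loss; the target weights change solely through `ema_update`.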
Consider predicting what happens when you push a cup across a table. A pixel-level predictor must render every detail: the exact wood grain of the table, the reflection on the cup, the lighting. A JEPA predictor only needs to capture the semantics: “the cup moved 5cm to the right.” By operating in representation space, JEPA focuses on what matters and ignores irrelevant detail.
23.2.2 Avoiding Representation Collapse
A major challenge in self-supervised learning is representation collapse: the encoder might learn to map all inputs to the same embedding, which trivially minimizes the prediction loss but learns nothing useful. JEPA addresses this through the target encoder mechanism (EMA update) and by carefully designing the masking and prediction tasks to require non-trivial predictions.
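One simple way to see what collapse looks like is to measure how much embeddings vary across a batch. The check below is an illustrative diagnostic and an assumption of this sketch, not part of the JEPA method itself: if every input maps to the same vector, the per-dimension variance is zero.

```python
from statistics import pvariance
from typing import List


def embedding_variance(embeddings: List[List[float]]) -> List[float]:
    """Population variance of each embedding dimension across a batch."""
    return [pvariance(dimension) for dimension in zip(*embeddings)]


# A collapsed encoder returns the same embedding for every input.
collapsed = [[0.5, -1.0], [0.5, -1.0], [0.5, -1.0]]
# A healthier encoder spreads inputs out in embedding space.
healthy = [[0.0, 1.0], [1.0, -1.0], [2.0, 0.0]]

collapsed_variance = embedding_variance(collapsed)
healthy_variance = embedding_variance(healthy)
```

Tracking a statistic like this during training gives an early warning: variances drifting toward zero suggest the prediction loss is being minimized trivially.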
23.3 I-JEPA: Learning from Images
I-JEPA (Image-based Joint-Embedding Predictive Architecture) (Assran et al. 2023) applies the JEPA framework to static images. The goal is to learn rich visual representations without reconstruction, data augmentation, or pixel-level losses.
23.3.1 How It Works
Given an image, I-JEPA:
- Splits the image into a grid of patches (following the Vision Transformer approach).
- Masks one or more large, contiguous blocks of patches (not random individual patches).
- Passes the visible patches through the encoder to produce embeddings.
- Uses the predictor (a small transformer) to predict the embeddings of the masked patches, conditioned on the visible patch embeddings and the positional information of the masked regions.
- Compares the predictions against embeddings produced by the target encoder.
I-JEPA deliberately masks large contiguous regions, not random patches. Random patch masking can be solved by local interpolation (predicting a missing patch from its neighbors using texture). Large contiguous masks force the model to reason about high-level semantic content: “What object is behind this mask? What is the overall scene structure?” This design choice is what drives I-JEPA toward learning semantic, rather than textural, representations.
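The block masking described above can be sketched on a patch grid. This is a simplified illustration: it samples a single target block of fixed size, whereas the actual method samples several blocks with varying scale and aspect ratio. The grid size (14 by 14) and block size (6 by 6) are assumptions chosen for the example.

```python
import random
from typing import List, Set, Tuple

Patch = Tuple[int, int]  # (row, column) index on the patch grid


def sample_block_mask(grid_h: int, grid_w: int, block_h: int, block_w: int,
                      rng: random.Random) -> Set[Patch]:
    """Pick one contiguous rectangle of patches to hide."""
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {(r, c) for r in range(top, top + block_h)
            for c in range(left, left + block_w)}


def split_patches(grid_h: int, grid_w: int,
                  masked: Set[Patch]) -> Tuple[List[Patch], List[Patch]]:
    """Separate patches into visible context (encoder input) and prediction targets."""
    all_patches = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    visible = [p for p in all_patches if p not in masked]
    targets = [p for p in all_patches if p in masked]
    return visible, targets


rng = random.Random(0)
masked = sample_block_mask(14, 14, 6, 6, rng)
visible, targets = split_patches(14, 14, masked)
```

The predictor would receive the embeddings of `visible` plus the positions in `targets`, and be scored against the target encoder's embeddings at those positions.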
23.3.2 Results and Significance
I-JEPA achieves strong performance on ImageNet linear probing and downstream transfer tasks, competitive with view-invariance methods (like DINO) and masked image modeling (like MAE), but without requiring any hand-crafted data augmentation. This is significant because data augmentation introduces inductive biases (invariance to crops, flips, color jitter) that may not generalize to all domains. I-JEPA instead relies on the masked prediction task itself to shape its representations.
23.4 V-JEPA: Learning from Video
V-JEPA (Video JEPA) (Bardes et al. 2024) extends the framework to video, making it a true world model that captures temporal dynamics. If I-JEPA learns about the structure of static scenes, V-JEPA learns about how the world changes over time.
23.4.1 Spatiotemporal Masking
Given a video, V-JEPA masks large spatiotemporal tubes: contiguous regions that span multiple frames. This forces the model to predict not just what an object looks like, but how it moves, where it goes, and what happens when it interacts with other objects.
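A spatiotemporal tube can be sketched by repeating one spatial block across frames. The frame count and block coordinates below are hypothetical values for illustration; the actual masking strategy involves additional choices (such as multiple blocks) that are not reproduced here.

```python
from typing import Set, Tuple

Patch = Tuple[int, int]      # (row, column) on the spatial patch grid
Token = Tuple[int, int, int]  # (frame, row, column)


def tube_mask(n_frames: int, block: Set[Patch]) -> Set[Token]:
    """Repeat one spatial block across every frame so the mask forms a tube."""
    return {(t, r, c) for t in range(n_frames) for (r, c) in block}


# A 4x6 block of patches, hidden in all 8 frames of a clip.
block = {(r, c) for r in range(2, 6) for c in range(3, 9)}
tube = tube_mask(n_frames=8, block=block)
```

Because the same region is hidden in every frame, the model cannot recover the content by copying it from a neighboring frame and has to infer it from the surrounding context.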
The model uses a Vision Transformer encoder applied to spatiotemporal patches, each covering a small pixel block across a few consecutive frames. The predictor takes the visible spatiotemporal patch embeddings and predicts the embeddings of the masked tubes. The target encoder provides the target embeddings.
23.4.2 Emergent Physical Understanding
Evaluations of V-JEPA suggest an emergent sensitivity to physical concepts that it was never explicitly taught:
- Object permanence: The model correctly predicts that an object hidden behind another object still exists and will reappear.
- Contact dynamics: It understands that a ball hitting a wall will bounce, not pass through.
- Simple physics: The model captures gravity, momentum, and basic collision dynamics.
V-JEPA reports strong results on video understanding benchmarks (Kinetics-400, Something-Something v2), particularly when the pre-trained encoder is kept frozen, without any pixel-level reconstruction, text supervision, or labeled data during pre-training.
V-JEPA's emergent physical reasoning is one of the most exciting results in world model research. The model was trained only to predict masked video embeddings, yet it developed an implicit understanding of physics. This suggests that learning to predict “what comes next” in a video, when done at the right level of abstraction (representations, not pixels), naturally leads to physical understanding.
23.5 Google Genie and Genie 2
While JEPA learns world models through self-supervised prediction, Google DeepMind's Genie project takes a different approach: learning interactive world models that can be played.
23.5.1 Genie 1: The Foundation
Genie 1 (Bruce et al. 2024), published at ICML 2024, was the first generative interactive environment model. Trained on roughly 30,000 hours of unlabeled 2D platformer gameplay video, filtered from more than 200,000 hours of publicly available internet footage, Genie learned to generate playable 2D worlds from a single image prompt, despite never being given action labels during training.
The architecture consists of three components:
- A video tokenizer that converts raw video frames into a sequence of discrete tokens using a VQ-VAE (Vector Quantized Variational Autoencoder).
- A latent action model that infers what action was taken between consecutive frames, even though no action labels exist in the training data. The model discovers a compact set of latent actions (move left, jump, etc.) purely from observing state transitions.
- A dynamics model (a spatial-temporal transformer) that, given the current frame tokens and a latent action, predicts the next frame's tokens.
Genie 1's latent action model is one of its most elegant contributions. By training on unlabeled video, the model must discover what actions exist from the data alone. Given two consecutive frames, it learns to infer a compact latent code representing the action that caused the transition. At generation time, a user can map keyboard inputs to these discovered latent actions, making the generated world interactive. The model figured out “left,” “right,” and “jump” without ever being told those concepts exist.
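The latent action idea can be illustrated with a deliberately crude toy. In the sketch below a "frame" is just a 1D position, and the codebook of latent actions is hard-coded as a stand-in for one learned from unlabeled video; both are assumptions made for illustration and do not reflect how Genie's latent action model is actually built or trained. What the sketch shows is the inference step: assign each observed transition to its nearest latent action code, then bind keys to the discovered codes.

```python
from typing import Dict, List

# Hypothetical codebook: each entry is the frame-to-frame change a latent action explains.
CODEBOOK: List[float] = [-1.0, 0.0, 1.0]


def infer_latent_action(frame_t: float, frame_next: float, codebook: List[float]) -> int:
    """Return the index of the latent action that best explains the transition."""
    delta = frame_next - frame_t
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - delta))


# Unlabeled "video": a sequence of positions with slightly noisy motion.
positions = [0.0, 1.0, 2.1, 1.9, 1.9]
latent_actions = [
    infer_latent_action(a, b, CODEBOOK) for a, b in zip(positions, positions[1:])
]

# At play time, a user binds keys to the discovered codes after seeing what each does.
key_to_latent: Dict[str, int] = {"left": 0, "stay": 1, "right": 2}
```

No action label appears anywhere in the data; the codes acquire meaning only through the transitions they explain.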
What made Genie 1 particularly useful for the research community is that its architecture was described in enough detail to reproduce. DeepMind did not release the full 11B parameter model or its weights, but the published design allowed researchers to train their own smaller variants on custom datasets, from Atari-style games to simple physics simulations.
23.5.2 Genie 2: Scaling to 3D
Genie 2 (Google DeepMind 2024) scaled the approach to photorealistic 3D environments. Where Genie 1 generated simple 2D platformer worlds, Genie 2 generates consistent, playable 3D environments from a single image. Given one starting frame, it simulates how the world would change in response to keyboard and mouse actions, generating each new frame conditioned on the frames and actions so far.
Genie 2's capabilities are remarkable:
- Diverse 3D worlds: It generates environments with consistent geometry, lighting, and physics, from indoor rooms to outdoor landscapes.
- Long-horizon generation: It can sustain coherent simulation for up to about a minute, maintaining spatial consistency as the virtual camera moves through the scene.
- Image conditioning: Given a single photograph (even a hand-drawn sketch), Genie 2 creates an interactive 3D environment faithful to the input scene.
- Agent training and evaluation: Agents can be placed inside Genie 2's imagined worlds and evaluated there, pointing toward training RL agents without building a hand-crafted simulator.
Genie 2 demonstrates a powerful idea: if your world model is good enough, you do not need a hand-crafted simulator. You can simply describe or sketch the environment you want, let the world model generate it, and train your RL agent inside that generated world. This dramatically lowers the barrier to building training environments for embodied AI.
23.5.3 How Genie Differs from Video Generation
While Genie 2 generates video frames, it differs fundamentally from models like Sora (Chapter 7). Video generation models produce a fixed sequence of frames from a text prompt. Genie 2 generates frames reactively in response to user actions, maintaining a consistent internal state. It is not generating a movie; it is simulating a world.
23.6 Training Your Own World Model
One of the most exciting aspects of world model research is that it is accessible to individual researchers and small teams. You do not need Google-scale compute to train a meaningful world model; the key is choosing the right domain and architecture for your resources.
23.6.1 Starting Small: 2D Environments
The most practical starting point is 2D environments. Genie 1's architecture was published in full detail, and the community has created smaller-scale reproductions. The general recipe is:
- Collect video data: Record gameplay from a 2D game (e.g., using Gymnasium/Atari environments, or screen-recording a platformer). Even a few thousand short episodes can suffice for a simple environment.
- Train a video tokenizer: Use a VQ-VAE to compress frames into discrete tokens. Open-source implementations exist in PyTorch and JAX. The tokenizer learns to reconstruct frames from a compact codebook of visual tokens.
- Train a dynamics model: Given the current frame tokens and an action, predict the next frame's tokens. A small transformer (even a few million parameters) can learn the dynamics of simple environments.
- Generate and interact: At inference time, start from a real frame, tokenize it, and autoregressively generate future frames conditioned on user actions.
Start with a visually simple, deterministic environment (e.g., a single-room game with solid-color backgrounds and a few sprites). Stochastic environments and complex textures require much larger models. Use a small VQ-VAE codebook (256 to 1024 codes) to keep the dynamics model's vocabulary manageable. Train the tokenizer first and freeze it before training the dynamics model. Monitor reconstruction quality: if the tokenizer cannot faithfully reconstruct frames, the dynamics model has no chance.
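The quantization step at the heart of a VQ-VAE tokenizer, and the reconstruction check recommended above, can be sketched as follows. The codebook here is a tiny hand-written example rather than a learned one, and the vectors stand in for encoder outputs; the encoder and decoder networks themselves are omitted.

```python
from typing import List

Vector = List[float]


def quantize(vectors: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry (its token)."""
    tokens = []
    for v in vectors:
        distances = [sum((vi - ci) ** 2 for vi, ci in zip(v, code)) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look tokens back up in the codebook."""
    return [codebook[t] for t in tokens]


def reconstruction_error(vectors: List[Vector], codebook: List[Vector]) -> float:
    """Mean squared error introduced by quantization; worth monitoring during training."""
    restored = dequantize(quantize(vectors, codebook), codebook)
    total = sum((a - b) ** 2 for v, r in zip(vectors, restored) for a, b in zip(v, r))
    return total / (len(vectors) * len(vectors[0]))


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]
tokens = quantize(latents, codebook)
error = reconstruction_error(latents, codebook)
```

A larger codebook lowers this error but enlarges the vocabulary the dynamics model must predict over, which is the trade-off behind the codebook sizes suggested above.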
23.6.2 Scaling Up: The DreamerV3 Approach
For readers interested in model-based RL specifically, DreamerV3's codebase is open-source and well-documented. Training a DreamerV3 agent on standard benchmarks (DMControl, Atari) is feasible on a single GPU:
- The world model is a Recurrent State-Space Model (RSSM) that maintains a latent state and predicts forward, using a mix of deterministic and stochastic components.
- The entire pipeline (world model, actor, critic) trains jointly in a single loop, with each component optimized by its own loss.
- The codebase supports custom environments, so you can plug in your own Gymnasium-compatible environment and train a world model agent from scratch.
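The split between deterministic and stochastic components can be illustrated with a single toy update step. This is a sketch under heavy assumptions, not DreamerV3's implementation: the state is one number per component, the coefficients are arbitrary constants standing in for learned weights, and the stochastic part is Gaussian purely for simplicity.

```python
import math
import random
from typing import Tuple


def rssm_step(h: float, z: float, action: float, rng: random.Random) -> Tuple[float, float]:
    """One toy recurrent state-space update with a deterministic and a stochastic part."""
    # Deterministic path: carries memory forward as a function of past state and action.
    h_next = math.tanh(0.5 * h + 0.3 * z + 0.2 * action)
    # Stochastic path: sample a latent whose distribution depends on the new memory.
    mean, std = 0.8 * h_next, 0.1
    z_next = rng.gauss(mean, std)
    return h_next, z_next


h1, z1 = rssm_step(0.0, 0.0, 1.0, random.Random(0))
h2, z2 = rssm_step(0.0, 0.0, 1.0, random.Random(1))
```

Given the same inputs, `h1` and `h2` are identical while `z1` and `z2` depend on the random draw; the deterministic part gives the model stable memory, and the stochastic part lets it represent uncertainty about what happens next.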
23.6.3 Using Pre-Trained Models
For those who want to experiment without training from scratch, several pre-trained world models are available on HuggingFace and similar platforms. Cosmos tokenizers and various community reproductions of world model architectures (including open reimplementations of Genie 1, whose original weights were not released) provide starting points for fine-tuning on custom domains. Fine-tuning a pre-trained world model on a small dataset from your target domain is often far more effective than training from scratch.
Here is a concrete weekend project: install Gymnasium, record 10,000 episodes of CartPole or LunarLander, train a small VQ-VAE tokenizer on the frames, then train a tiny transformer to predict next-frame tokens given current-frame tokens and the action. You will have built a world model that can “imagine” CartPole trajectories. It will not be perfect, but watching your model hallucinate plausible physics is deeply satisfying and teaches more about world models than any paper can.
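The first step of this project, recording transitions, might look like the sketch below. To keep the example runnable without extra dependencies, it uses `ToyLineEnv`, a hypothetical stand-in that mimics the return shapes of Gymnasium's `reset` and `step` methods; the collection function itself only relies on that interface.

```python
import random
from typing import Callable, List, Tuple


class ToyLineEnv:
    """Tiny stand-in environment that mimics Gymnasium's reset/step return shapes."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.position = 0

    def reset(self, seed=None):
        self.position = 0
        return self.position, {}  # (observation, info)

    def step(self, action: int):
        self.position = max(0, self.position + (1 if action == 1 else -1))
        terminated = self.position >= self.length
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, False, {}  # truncated is always False


def collect_episodes(env, policy: Callable[[int], int], n_episodes: int,
                     max_steps: int = 50) -> List[Tuple[int, int, int]]:
    """Record (observation, action, next_observation) triples for model training."""
    dataset = []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        for _ in range(max_steps):
            action = policy(obs)
            next_obs, _reward, terminated, truncated, _info = env.step(action)
            dataset.append((obs, action, next_obs))
            obs = next_obs
            if terminated or truncated:
                break
    return dataset


rng = random.Random(0)
data = collect_episodes(ToyLineEnv(), lambda obs: rng.choice([0, 1]), n_episodes=10)
```

With Gymnasium installed, an environment created by `gymnasium.make("CartPole-v1", render_mode="rgb_array")` could be passed in place of `ToyLineEnv()`, with `env.render()` called after each step to capture the frames the tokenizer would train on.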
23.7 NVIDIA Cosmos
NVIDIA's Cosmos (NVIDIA 2025) is a platform for world foundation models, designed to support physical AI and robotics at industrial scale. While research projects like JEPA and Genie explore the science of world models, Cosmos focuses on making them practically deployable.
The Cosmos platform provides:
- Pre-trained world foundation models: Both diffusion-based and autoregressive video generation models at multiple scales, trained on large-scale video data.
- Video tokenizers: Modules for converting between continuous video and discrete token representations, enabling the use of language model architectures for video prediction.
- Post-training tools: Frameworks for adapting pre-trained world models to specific domains (autonomous driving, robotic manipulation, industrial automation).
- Physical accuracy focus: Unlike general-purpose video generators, Cosmos emphasizes physically plausible dynamics, which is critical for training embodied agents that must operate in the real world.
NVIDIA envisions Cosmos as the foundation for “physical AI”: robots, autonomous vehicles, and industrial systems that understand and interact with the physical world. By providing pre-trained world models that can be fine-tuned for specific applications, Cosmos aims to do for physical AI what GPT did for language: provide a general-purpose foundation that accelerates development across the entire field.
23.8 World Models and Reinforcement Learning
World models and reinforcement learning have a deep, symbiotic relationship. Model-free RL learns by trial and error in the real environment, requiring millions of interactions. Model-based RL uses a learned world model to simulate interactions, learning largely “in imagination.”
23.8.1 DreamerV3
DreamerV3 (Hafner et al. 2023) is one of the most successful demonstrations of this approach to date. It combines a learned world model with imagination-based policy optimization to master over 150 diverse tasks, from Atari games to robotic control to Minecraft, all with a single algorithm and a fixed set of hyperparameters.
The DreamerV3 loop operates as follows:
- Collect data: The agent interacts with the environment using its current policy and stores the resulting experiences (observations, actions, rewards) in a replay buffer.
- Learn the world model: A recurrent state-space model (RSSM) is trained on the replay data to predict observations, rewards, and episode termination from past states and actions.
- Imagine trajectories: Starting from states sampled from the replay buffer, the world model generates thousands of imagined rollouts: sequences of predicted states and rewards that would result from the actions chosen by the current policy.
- Learn the policy: An actor-critic algorithm is trained entirely on the imagined trajectories, using the world model's predicted rewards. The agent learns to take actions that lead to high-reward imagined futures.
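The shape of this loop can be shown with a much simpler algorithm. The sketch below swaps in tabular, Dyna-style Q-learning on a toy 1D walk: the "world model" is a lookup table rather than an RSSM, the policy is derived from Q-values rather than an actor-critic, and imagination uses one-step transitions rather than long rollouts. The environment, constants, and function names are all assumptions for illustration. What carries over is the structure: collect real data, fit the model, then learn only from transitions the model produces.

```python
import random

GOAL = 5
ACTIONS = (-1, +1)


def real_step(state, action):
    """The 'real' environment: a walk on positions 0..GOAL with a reward at the goal."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def run(n_iterations=20, real_steps=25, imagined_sweeps=20, seed=0):
    rng = random.Random(seed)
    replay, model = [], {}
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    gamma, lr, epsilon = 0.9, 0.5, 0.2

    def act(state):
        """Epsilon-greedy policy with random tie-breaking."""
        best = max(q[(state, a)] for a in ACTIONS)
        greedy = [a for a in ACTIONS if q[(state, a)] == best]
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)
        return rng.choice(greedy)

    state = 0
    for _ in range(n_iterations):
        # 1. Collect data with the current policy and store it in the replay buffer.
        for _ in range(real_steps):
            action = act(state)
            next_state, reward = real_step(state, action)
            replay.append((state, action, next_state, reward))
            state = 0 if next_state == GOAL else next_state  # new episode after the goal
        # 2. Learn the world model from replay (here: memorize observed transitions).
        for s, a, s2, r in replay:
            model[(s, a)] = (s2, r)
        # 3 and 4. Imagine transitions with the model and update values from them only.
        for _ in range(imagined_sweeps):
            for (s, a), (s2, r) in model.items():
                future = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
                q[(s, a)] += lr * (r + gamma * future - q[(s, a)])
    return q, model


q, model = run()
```

Note that the value updates never call `real_step`: each real transition is stored once and then replayed through the model many times, which is the source of the sample efficiency.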
DreamerV3 learned to collect diamonds in Minecraft, a task considered a benchmark challenge for RL because it requires a long sequence of sub-tasks: chop trees, craft planks, build tools, mine stone, mine iron, mine diamonds. The agent's policy was trained on trajectories imagined by its world model, without human demonstrations or curricula, demonstrating that even complex, long-horizon tasks with sparse rewards can be solved with model-based RL.
23.8.2 Why Imagination Works
Training on imagined experience has several advantages over training on real experience:
- Speed: Imagined rollouts are generated by the world model, which runs on a GPU. They are orders of magnitude faster than real-time interaction with an environment.
- Parallelism: Thousands of imagined trajectories can be generated simultaneously.
- Safety: The agent can explore dangerous strategies (driving off a cliff, touching a hot stove) in imagination without real consequences.
- Data reuse: Real experience is used to train the world model, and then reused indirectly through unlimited imagined experience. Every real data point is leveraged many times over.
23.9 Connections to Other Chapters
World models connect deeply to several topics covered elsewhere in this book:
- Multimodality (Chapter 7): World models must process multiple modalities (vision, proprioception, touch) to simulate the physical world accurately.
- Vision-Language-Action Models (Chapter 7): VLAs can be enhanced with world models for planning and simulation before acting.
- Reinforcement Learning (Chapter 20): World models are the foundation of model-based RL, enabling sample-efficient learning through imagination.
- Agentic Systems (Chapter 4): Agents with world models can plan multi-step actions by simulating outcomes before committing to a course of action.
23.10 The Future of World Models
World models are still in their early stages, but their trajectory is clear. Several developments are on the horizon:
Scaling: Just as language models improved dramatically with scale, world models are expected to improve as they are trained on more data, with larger architectures, and with more compute. Genie 2 and Cosmos are early examples of this scaling trend.
Multimodal world models: Current world models primarily operate on vision. Future models will incorporate audio, touch, proprioception, and other sensory modalities, building richer simulations of reality.
Unified reasoning and simulation: The eventual goal is a model that combines the linguistic reasoning of LLMs with the physical simulation of world models. Such a system could read a set of assembly instructions, visualize the steps, simulate the physics of each action, and guide a robot through the task.
Bridging sim-to-real: World models trained on real-world video may eventually close the sim-to-real gap that plagues current robotics, providing simulated environments that are indistinguishable from reality for training purposes.
If we could build a world model of sufficient fidelity, it would essentially be a complete simulation of reality. Every experiment, every training episode, every what-if scenario could be run in simulation at GPU speed. Building such a model may be the most important technical challenge of the coming decades, and its solution would transform not just AI, but science, engineering, and medicine.
23.11 Exercises
- Read LeCun's “A Path Towards Autonomous Machine Intelligence” (LeCun 2022) position paper. Summarize the key differences between JEPA and generative world models that predict directly in input (pixel) space. Why does LeCun believe prediction in representation space is superior to prediction in pixel space?
- Compare and contrast Genie 2 and Sora (Chapter 7). Both generate video, but they serve fundamentally different purposes. What are the concrete differences between a world simulator and a video generator?
- DreamerV3 learned to collect diamonds in Minecraft using imagination-based training. What failure modes might arise if the world model is inaccurate? How could these be detected and mitigated?
- Design (on paper) a world model for a self-driving car. What modalities would it need to process? What should it predict? How would you train it, and how would you evaluate whether its simulations are accurate enough for safe policy training?