Five questions for consideration as you think about how AI systems create believable, interactive experiences.
When an AI generates an image, does it see a picture or a world?
Consider whether AI understands spatial relationships or merely assembles pixels.
Is a simulated character alive beyond the camera’s frame?
Think about whether virtual beings persist in meaningful ways when they're not directly observed. (That sounds intuitively ridiculous, but consider the role of persistent characters in an ongoing game-like experience.)
What if creativity isn’t about generating content but about simulating possibilities?
Reflect on shifting our focus from AI as a content factory to an imagination engine.
How can we distinguish between a compelling illusion and genuine understanding in generative AI?
Probe the difference between superficial memorization and internal comprehension of a simulated reality.
Does the AI know it broke the window (and does it care)?
Explore the significance of cause-and-effect relationships, agency, and consequences within AI-generated worlds.
In my recent course on Data for Deep Learning, we grappled with a deceptively simple question without realizing it:
What makes a simulated world feel coherent, interactive, and alive?
It’s a question with implications far beyond gaming. The systems we design for storytelling, immersive education, urban planning, self-driving cars, and robotics all rely on a concept that many creators and researchers treat as invisible: the world model.
What Is a World Model?
In the most basic sense, a world model is an internal representation of an environment. It’s how a system (whether a game engine, an AI, or even a simulation of a city) understands what exists, what happens, and what could happen next. It’s the architecture beneath the surface, the logic that holds a world together.
But world models are not just about pixels, coordinates, or geometry. They are about relationships, causality, and possibility. They provide structure to both immersive storytelling and machine learning, serving as the hidden foundation on which all interactivity rests.
As of August 2025, we are seeing a wave of exciting releases that advance AI built on world models. This post isn’t about those releases but about the concept of world models itself.
Beyond the Pretty Picture: A Creator’s Guide to Evaluating the New Wave of Generative AI
For creators in gaming, film, and storytelling, this new wave of generative AI is undeniably powerful, but it's also a source of profound uncertainty. How do we distinguish a dazzling tech demo from a genuinely transformative creative tool? How do we look past the immediate spectacle to gauge an AI’s true potential for our craft?
The answer doesn't lie in the polish of the pixels or the slickness of the short clip. It lies in a deeper, more fundamental capability: the AI's ability to understand and simulate a world.
As we navigate the hype, the single most important question to ask of any new generative tool is not "What can it generate?" but "What kind of world does it understand?" The shift from a mere content generator to a world simulator is the leap that will redefine creative collaboration.
The Pitfall of the Static Image
We’ve seen how generative AI has excelled at creating beautiful but shallow artifacts. Like a Hollywood set, they are all façade. A diffusion model can render a stunning jazz scene in a Paris nightclub, but it has no concept of the space behind the stage or the weight of the people sitting around the table. A short, text-to-video clip can show a character walking down a street, but the AI often struggles to maintain that character's appearance or the street's layout in the very next shot.
This is the core limitation of models that lack a robust internal simulation engine. They are pattern matchers, not world builders. For a storyteller, this is the difference between being handed a single, static photograph and being given a location to film in. The photograph is a prompt; the location is a canvas. True creative work, whether it’s interactive game design, dynamic cinematography, or emergent narrative, requires a canvas.
The Engine of Imagination: What is a "World Model"?
This is where the concept of a world model becomes the critical lens for evaluation.
Pioneered in research and now powering the most advanced systems, a world model is an AI’s internal, compressed, and predictive understanding of an environment. It's a learned simulation engine that allows the AI to "dream" or "imagine" what will happen next based on a coherent set of rules.
A sophisticated world model doesn't just know what pixels go next to each other; it learns an intuitive grasp of the principles governing a world:
Spatial and Temporal Coherence: It understands object permanence: that a character who walks behind a pillar still exists. It maintains consistency over time, so that a shattered window remains broken in a later scene.
Cause and Effect: It learns that flipping a switch illuminates a room, pushing a box makes it move, and an action taken now will have logical consequences later. This is the foundation of agency.
Intuitive Physics and Affordances: It develops a feel for gravity, momentum, and how objects can be used. A chair affords sitting; a lever affords pulling. This is crucial for creating interactive environments that feel natural and responsive.
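These three properties can be made concrete with a toy state machine. The scenario and names below (`ToyWorld`, `throw_rock`, and so on) are hypothetical illustrations, not any real engine’s API; the point is only that permanence and cause-and-effect reduce to consequences persisting in explicit state.

```python
class ToyWorld:
    """A minimal explicit world state: actions have consequences, and
    consequences persist across later 'scenes'."""

    def __init__(self):
        self.state = {"window": "intact", "light": "off", "character_visible": True}

    def apply(self, action):
        if action == "throw_rock":
            self.state["window"] = "broken"      # cause and effect
        elif action == "flip_switch":
            self.state["light"] = "on" if self.state["light"] == "off" else "off"
        elif action == "walk_behind_pillar":
            self.state["character_visible"] = False  # hidden, but still exists
        return dict(self.state)


world = ToyWorld()
world.apply("throw_rock")
world.apply("flip_switch")
later = world.apply("walk_behind_pillar")
# Scenes later, the window is still broken and the room is still lit:
# temporal coherence falls out of tracked state, not regenerated pixels.
print(later)
```

A generative model without such state must re-infer the window’s condition from pixels every frame, which is exactly where consistency breaks down.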
These aspects are fundamental to game engines like Unreal Engine. But an engine is a tool in which the game developer directs everything.
Yet, when an AI tool generates a playable video game from a single image, its true breakthrough isn't the pixel generation but the latent action model: an embedded world model that understands what "jump" or "move right" means within the context of that 2D world's physics.
A Creator's Checklist: What to Look For
With this understanding, we can formulate a checklist to cut through the hype of the next big AI announcement. Instead of being passive spectators, we can become active, critical examiners.
1. Does it Demonstrate Consistency and Coherence? Look past the 3-second clip. Ask for the 30-second version. Does the character’s clothing remain the same? Does the environment’s layout stay consistent when viewed from a different angle? Systems built on strong world models track relationships between the parts of a scene over time, and so exhibit far greater temporal and spatial stability.
2. Does it Enable Agency and Interaction? This is the litmus test that separates a passive video generator from a virtual world. Can a user act within the generated space? Is the environment playable? The world model provides a framework for user actions to have meaningful consequences. When examining a new tool, ask: Is this a movie, or is it a place?
3. Does it Support Emergent Behavior? The ultimate goal is not just a simulated world, but one that can surprise us. A world model with a robust understanding of its own rules can facilitate emergent narrative and gameplay. An NPC, guided by the world model's logic, might react to a player's unforeseen action in a way that is not explicitly scripted, but is consistent with the world's rules. This moves the AI from a simple executor of commands to an improvisation partner.
4. How Controllable is the Story and Art Direction in the World? A powerful simulation is useless if the creator cannot steer it. The most promising tools will not be black boxes. Look for evidence of controllability. Can we provide high-level textual guidance that designs game mechanics from stories? Can we define the world's physics, its aesthetic, or its core rules? The goal is not automation, but augmented creation.
As we move forward, the most transformative AI tools will not be those that simply offer to do our work for us. The most transformative tools will provide us with new worlds to work within. These tools will function less like a content factory and more like a physics engine for reality itself: a malleable, explorable, and interactive canvas for our imagination. The next time a new model makes headlines, look for signs of a world behind the curtain. That is where the future of storytelling truly lies.
From Hand-Crafted Worlds to AI-Generated Simulations
When we talk about world models in virtual reality or mixed reality, we often focus on designed systems. Whether it’s a digital twin of a city, a historical VR simulation, or an interactive museum, these world models are authored to deliver particular experiences.
But there’s another side to the story: systems that learn their own models of the world.
In 2018, David Ha and Jürgen Schmidhuber published a paper titled “World Models.”
Their central idea: instead of training an AI by forcing it to explore the real world over millions of iterations, why not let it learn a simplified model of the world and train itself inside that model?
They called this process “learning inside a dream.”
“Can we train our agent to learn inside its own dream, and transfer this policy back to the actual environment?
If our world model is sufficiently accurate for its purpose, and complete enough for the problem at hand, we should be able to substitute the actual environment with this world model. After all, our agent does not directly observe the reality, but only sees what the world model lets it see.”
How Does It Work? (A Friendly Breakdown)
Ha and Schmidhuber’s architecture has three components:
Vision Model: Compresses raw input from the environment (like video frames) into a simplified, abstract representation known as a latent space.
Memory Model: A neural network that learns how the world changes over time, predicting future states based on past states and actions.
Controller: A decision-making model that learns how to act, using the abstracted, predicted states from the world model.
The key insight? The agent learns to act entirely inside its own dream world, a simplified simulation built from its own learned representations.
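The V–M–C loop can be sketched in a few dozen lines of numpy. This is a toy stand-in, not the paper’s implementation: in the original, the Vision model is a variational autoencoder, the Memory model is a mixture-density RNN, and the Controller is trained with evolution strategies. The random linear maps and dimensions below only illustrate how data flows through the three components and how a “dream” rollout never touches the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM, ACTION_DIM = 1024, 32, 3  # illustrative sizes


class Vision:
    """V: compresses a raw frame into a small latent vector z."""
    def __init__(self):
        self.W = rng.normal(scale=0.05, size=(LATENT_DIM, FRAME_DIM))

    def encode(self, frame):
        return np.tanh(self.W @ frame)


class Memory:
    """M: predicts the next latent state from the current z and the action."""
    def __init__(self):
        self.W = rng.normal(scale=0.05, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))

    def predict(self, z, action):
        return np.tanh(self.W @ np.concatenate([z, action]))


class Controller:
    """C: a tiny linear policy mapping the latent state to an action."""
    def __init__(self):
        self.W = rng.normal(scale=0.05, size=(ACTION_DIM, LATENT_DIM))

    def act(self, z):
        return np.tanh(self.W @ z)


# "Learning inside a dream": encode one real observation, then roll the
# memory model forward 10 steps without ever touching the environment.
V, M, C = Vision(), Memory(), Controller()
z = V.encode(rng.normal(size=FRAME_DIM))  # one real frame
dream = [z]
for _ in range(10):
    a = C.act(dream[-1])
    dream.append(M.predict(dream[-1], a))

print(len(dream), dream[-1].shape)  # 11 latent states, each 32-dimensional
```

Training the controller on such imagined rollouts, then transferring the policy back to the real environment, is the paper’s central experiment.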
Why This Matters
The implications of Ha and Schmidhuber’s work extend far beyond AI research. When I teach my students about world models, I’m encouraging them to think about how systems understand and simulate reality.
World models are foundational architecture:
Digital twins of factories or cities, where the system must track, simulate, and optimize complex processes.
Interactive storytelling, where NPCs adapt and respond based on their internal models of the world.
Virtual education, where the system monitors and adjusts to how students interact within a learning environment.
Generative agents, which create realistic social simulations by modeling memory, motivation, and interaction dynamics.
What Ha and Schmidhuber introduced was a method of training AI inside its own imagination. It’s an approach that echoes how human creativity works by reflecting, simulating, and refining ideas internally.
Connecting World Models to Mixed Reality and VR
Whether designing virtual museums, historical simulations, or AI-driven learning environments, there are fundamental questions:
What parts of a world model can be learned from data, and what parts must be designed manually?
How can an AI agent be embedded within a virtual world so that it dreams and plans just as humans do?
What kinds of data architectures support these simulations?
By bridging these questions to AI research, we see that world models are more than tools for simulation. World models are tools for imagination.
Next Steps: Modeling a Simulated Reality
World models are the foundation for simulating reality, not just for games, but for any interactive system that models a world to make decisions, support presence, or generate outcomes.
World models represent the convergence of perception, structure, dynamics, and learning.
Are We Talking About the Same Thing? Digging Deeper into the Definition of a World Model
When people talk about world models, the term lacks precision. An excellent paper attempts to provide a theoretical foundation for clarifying the terminology: Li, Kenneth, Fernanda Viégas, and Martin Wattenberg. “What Does It Mean for a Neural Network to Learn a ‘World Model’?” arXiv preprint arXiv:2507.21513 (2025).
That paper is a great example of clear academic writing: Li et al. offer a rigorous, mathematically precise operational definition of “world model” based on the idea of linear probing. According to their definition, a neural network contains a world model if its internal computations can be simplified into an interpretable representation that directly reflects key structures or patterns of the external world.
In interactive storytelling, a well-defined world model might let an AI reliably 'remember' that a character left an object in a specific location, influencing future narrative interactions. Li et al.’s criteria provide a way to empirically test whether an AI truly builds and uses such representations.
For game developers using AI-generated environments, Li et al.’s distinction between a “learned” world model and trivial memorization helps clarify whether a neural network has genuinely grasped game physics, level logic, or player behavior.
When is an AI’s World Model "Real"? Three Key Criteria
How can we be sure an AI is genuinely modeling a world internally, rather than simply memorizing superficial details?
According to Li et al., a genuine world model must satisfy three conditions:
Nontrivial: The internal representation genuinely captures meaningful, structured aspects of the external world. It doesn't merely replicate or trivially repackage the input data.
Learned: The representation is not directly contained in the input data itself. Instead, it's actively constructed by the AI’s internal computations.
Emergent: The internal representation appears naturally within the AI, without being explicitly demanded by the AI’s primary tasks or outputs. In other words, it emerges organically from the AI’s attempt to solve problems.
Practical Ways to Experimentally Identify World Models
One of the strongest aspects of Li et al.’s essay is their emphasis on practical experimentation. Specifically, they suggest using "probing" techniques, like linear probes, to test whether a neural network’s internal representations correspond meaningfully to interpretable aspects of the external world.
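The mechanics of a linear probe fit in a short script. Everything below is synthetic: a hypothetical 64-dimensional activation vector that linearly encodes one world property (say, an object’s x-position) along a fixed direction, plus noise. In a real experiment the activations would come from a trained network and the property from the simulator’s ground truth; this sketch shows only the probing procedure itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "activations": 200 samples of a 64-d internal state in which
# one world property is linearly encoded along a fixed direction.
n, d = 200, 64
direction = rng.normal(size=d)            # the (unknown) encoding direction
prop = rng.uniform(-1.0, 1.0, size=n)     # ground truth from the simulator
acts = np.outer(prop, direction) + 0.05 * rng.normal(size=(n, d))

# The linear probe: a least-squares map from activations to the property,
# fit on 150 samples and evaluated on 50 held-out samples.
train, test = slice(0, 150), slice(150, None)
w, *_ = np.linalg.lstsq(acts[train], prop[train], rcond=None)
pred = acts[test] @ w

# Held-out R^2 close to 1 means the property is linearly decodable from
# the activations — evidence for a nontrivial, learned representation.
ss_res = np.sum((prop[test] - pred) ** 2)
ss_tot = np.sum((prop[test] - prop[test].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"probe R^2 = {r2:.3f}")
```

The held-out split matters: a probe that only fits the training activations could be memorizing, which is exactly the failure mode Li et al.’s criteria are designed to rule out.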
Five Takeaways about World Models
1. World models are invisible yet essential. They form the hidden architecture enabling coherent, interactive, and dynamic simulated realities, whether in storytelling, gaming, or AI-driven simulations.
2. Authentic creativity in AI comes from internal simulation, not surface-level generation. Genuine AI creativity involves understanding relationships, possibilities, and consequences, not simply generating isolated images or scripted scenes.
3. Evaluate generative AI by the depth of its world model. Use practical criteria like consistency, agency, emergent behavior, and art direction to distinguish powerful world simulators from content generators.
4. Advanced AI systems learn by imagining. Techniques like Ha and Schmidhuber’s “learning inside a dream” highlight how AI can internally simulate possibilities, much like human creativity does.
5. Clear definitions and empirical methods, such as those provided by Li et al., are crucial. Operational criteria and experimental probes allow us to rigorously verify whether AI systems genuinely understand the worlds they simulate, paving the way toward richer, more meaningful interactions.