
Most of the AI industry is talking about world models.
According to Dr. Yanpei Cao, Chief Scientist at Tripo AI, most of the industry is also defining them incorrectly.
That claim arrives at an interesting moment. Tripo AI recently announced a funding round of up to $200 million alongside Project Eden, the company’s new world model initiative. While many companies are racing to improve AI video generation, Cao argues that the conversation around world models has drifted in the wrong direction.
The problem, he says, is that many systems being called world models today are still fundamentally video generators. That distinction may sound technical. Cao believes it changes everything.
The AI Colony Research and Development sat down with Cao to get the full picture. He is a veteran AI researcher with publications at CVPR, NeurIPS, and SIGGRAPH. He previously held roles at Tencent ARC Lab and Kuaishou Technology, and served as CTO of Owlii, a volumetric video company acquired by Kuaishou. He does not mince words about where this technology is going or what the industry is getting wrong about it.
“Most World Models Are Actually Video Models”
Ask ten AI researchers to define a world model and you will likely get ten different answers.
Cao believes that lack of clarity has created confusion among investors, builders, and researchers alike.
Many current systems generate future frames based on previous observations and user actions. They can produce impressive visual sequences, but their understanding of the environment is tied directly to what the camera can see. Once an object leaves the frame, the system no longer maintains a reliable representation of it.
“When the camera turns back, the model has to reconstruct what should be there based on available context,” Cao says.
The result is a set of familiar problems:
- Objects changing position unexpectedly
- Items disappearing and reappearing
- Inconsistent environments across sessions
- No reliable long-term memory
For short videos, those issues may not matter. For interactive worlds, they become a hard architectural ceiling.
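The failure mode described above can be illustrated with a deliberately simplified toy. This is not how any real video model is implemented: `FrameContextModel`, `see`, and `recall` are hypothetical names, and a dictionary of positions stands in for learned visual context. The sketch only shows what happens when a system's memory is limited to a window of recent observations.

```python
from collections import deque


class FrameContextModel:
    """Toy model whose only memory is a sliding window of recent frames."""

    def __init__(self, context_frames):
        # Older frames fall out of the window automatically.
        self.context = deque(maxlen=context_frames)

    def see(self, visible):
        """Record one frame: a dict of object name -> position in view."""
        self.context.append(dict(visible))

    def recall(self, name):
        """Return the most recently remembered position, or None."""
        for frame in reversed(self.context):
            if name in frame:
                return frame[name]
        return None  # nothing left in context: the model would have to guess


model = FrameContextModel(context_frames=3)
model.see({"chair": 2.0})            # the chair is in view
remembered_early = model.recall("chair")

for _ in range(3):                   # the camera looks elsewhere
    model.see({})
remembered_late = model.recall("chair")
```

Here `remembered_early` is `2.0`, while `remembered_late` is `None`: once the chair has been out of view for longer than the context window, no record of it remains, so anything shown when the camera turns back has to be reconstructed.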
A World Should Exist Even When Nobody Is Looking At It
Cao believes that a true world model must maintain an underlying state that exists independently of any camera view. That idea sits at the center of Project Eden.
Project Eden: A Three-Layer World Model Architecture

Instead of treating visual output as the world itself, Tripo separates the world’s state from the process used to render it. The architecture operates across three distinct layers:
- Structured State Layer: Maintains objects, geometry, identities, attributes, and event logic. The world exists independently of any camera position.
- State-to-Observation Layer: Translates world state information into view-specific rendering conditions. Every perspective derives from the same source of truth.
- Real-Time Rendering Layer: Generates visuals for users on demand, based entirely on what the state layer says is true.
The result is a system where the world persists regardless of where a user is looking. A chair remains in the same location. A destroyed object stays destroyed. A modification made by one user remains visible to the next.
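The three layers above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not Tripo's implementation: every class and function name (`WorldState`, `Camera`, `observe`, `render`) is hypothetical, positions are one-dimensional, and a text string stands in for generated imagery.

```python
from dataclasses import dataclass, field


@dataclass
class WorldObject:
    """One entry in the structured state layer (hypothetical schema)."""
    name: str
    x: float                 # position along a single axis, for simplicity
    destroyed: bool = False


@dataclass
class WorldState:
    """Structured state layer: exists independently of any camera."""
    objects: dict = field(default_factory=dict)

    def add(self, obj):
        self.objects[obj.name] = obj

    def destroy(self, name):
        self.objects[name].destroyed = True


@dataclass
class Camera:
    """A view onto the world: everything in [left, right] is visible."""
    left: float
    right: float


def observe(state, camera):
    """State-to-observation layer: select what this view should show."""
    return [
        obj for obj in state.objects.values()
        if camera.left <= obj.x <= camera.right and not obj.destroyed
    ]


def render(observation):
    """Rendering layer: a text stand-in for image generation."""
    names = sorted(obj.name for obj in observation)
    return "visible: " + (", ".join(names) if names else "nothing")


world = WorldState()
world.add(WorldObject("chair", x=2.0))
world.add(WorldObject("vase", x=8.0))

looking_left = Camera(left=0.0, right=5.0)
looking_right = Camera(left=5.0, right=10.0)

first_look = render(observe(world, looking_left))    # "visible: chair"
world.destroy("vase")                                # happens off-screen
away = render(observe(world, looking_right))         # "visible: nothing"
second_look = render(observe(world, looking_left))   # "visible: chair"
```

Because both views are derived from the same `WorldState`, the chair is in the same place on the second look, and the vase stays destroyed even though it was out of view when the change happened.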
For Cao, that persistence is not an added feature. It is the foundation.
“A true, decoupled world model is actually a computational engine. It is going to replace traditional game physics engines, redefine how we build social UGC platforms, and become the mandatory infrastructure for training robotics. The market is treating a platform-level paradigm shift as a content-generation feature.”
Dr. Yanpei Cao, Chief Scientist, Tripo AI
Why Tripo Split State and Rendering Into Separate Systems
One of the strongest positions Cao holds concerns architecture. In his view, state prediction and visual rendering are fundamentally different problems. One is focused on understanding how the world changes. The other is focused on producing realistic imagery. Forcing a single model to solve both creates unnecessary complexity and limits what the system can do at scale.
By separating those responsibilities, Tripo builds toward capabilities that become increasingly critical as interactive environments grow larger and more complex. The architecture was designed to support:
- Long-term environmental memory
- Reusable and editable worlds
- Persistent modifications across users and sessions
- Shared environments with multi-agent interaction
- Computing costs that scale predictably
Those requirements sound closer to online infrastructure than content generation. That comparison is intentional.
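One way to picture persistent modifications across users and sessions is a shared state that can be saved, reloaded, and rendered for whoever arrives next. The sketch below is an assumption-laden toy rather than a description of Project Eden: `SharedWorld`, `move`, `save`, `load`, and `render_for` are hypothetical names, and JSON serialization stands in for whatever storage a real system would use.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SharedWorld:
    """Hypothetical persistent state shared by every user."""
    positions: dict = field(default_factory=dict)   # object name -> position
    log: list = field(default_factory=list)         # append-only edit history

    def move(self, user, obj, new_x):
        """Apply an edit to the state and record who made it."""
        self.positions[obj] = new_x
        self.log.append({"user": user, "object": obj, "x": new_x})

    def save(self):
        """Serialize the state so it outlives the current session."""
        return json.dumps({"positions": self.positions, "log": self.log})

    @classmethod
    def load(cls, blob):
        data = json.loads(blob)
        return cls(positions=data["positions"], log=data["log"])


def render_for(user, world):
    """Rendering is per user, but it reads the one shared state."""
    items = ", ".join(
        f"{name}@{x}" for name, x in sorted(world.positions.items())
    )
    return f"{user} sees {items}"


# Session one: Alice moves the chair, then the world is saved.
session_one = SharedWorld(positions={"chair": 2.0})
session_one.move("alice", "chair", 4.0)
saved = session_one.save()

# Session two: Bob loads the same world later and sees Alice's change.
session_two = SharedWorld.load(saved)
bob_view = render_for("bob", session_two)   # "bob sees chair@4.0"
```

In this toy, an edit touches the state once no matter how many users later render it, which is the kind of separation the list above depends on.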
“We Don’t See Project Eden As a Game Development Tool”
Many AI companies position their products as creative tools. Cao frames Project Eden differently.
“We don’t see Project Eden as just a game development tool,” he says, “but as the underlying engine for the next generation of interactive content creation.”
That statement reveals how Tripo views its long-term role. The company originally became known for AI-powered 3D asset generation. Its models helped users produce production-ready 3D objects quickly and at scale. Project Eden extends that work from generating individual objects to modeling the worlds that contain them.
If Tripo’s generation models define what exists inside a world, Project Eden defines how that world evolves. Together, Cao describes them as components of what he calls a neural-native world engine. The long-term goal is direct: make building interactive worlds as accessible as posting a video online.
The Robotics Opportunity May Be Even Bigger Than Gaming
Gaming often dominates discussions around virtual worlds. Cao believes robotics could become the larger opportunity.
Current training environments force researchers into a difficult trade-off. Traditional simulators provide strong physical consistency but limited variety. Video generation models can create diverse environments but struggle with causality, persistence, and object consistency over time.
Project Eden is designed to address both problems at once. The system maintains a persistent world state while generating diverse environments where actions produce lasting consequences. For embodied AI, that combination is critical. An agent cannot learn effectively if the world constantly contradicts itself. It also cannot generalize well if every training environment looks identical.
Cao sees world models as the infrastructure that closes that gap. He is direct that robotics training is not a secondary use case for Project Eden. It is one of the primary design targets.
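The combination of variety and consistency can be shown with a small toy environment. It is a sketch under stated assumptions and not a real robotics simulator or anything Tripo has described: `ToyTrainingWorld`, `push`, and `observe` are hypothetical names, a random seed stands in for environment diversity, and a knocked-over block stands in for a lasting consequence.

```python
import random


class ToyTrainingWorld:
    """Hypothetical training environment with a persistent state."""

    def __init__(self, seed, n_blocks=3):
        rng = random.Random(seed)        # the seed controls the layout
        # Block name -> grid cell for every block still standing.
        self.upright = {
            f"block{i}": rng.randint(0, 9) for i in range(n_blocks)
        }
        self.knocked_over = set()

    def push(self, name):
        """An action with a lasting consequence: the block stays down."""
        if name in self.upright:
            del self.upright[name]
            self.knocked_over.add(name)

    def observe(self):
        """What the agent sees: only the blocks still standing."""
        return dict(self.upright)


# The same seed reproduces the same world; other seeds give other layouts.
world_a = ToyTrainingWorld(seed=7)
world_b = ToyTrainingWorld(seed=7)
same_start = world_a.observe() == world_b.observe()

world_a.push("block0")
still_gone = "block0" not in world_a.observe()   # true on every later look
untouched = "block0" in world_b.observe()        # the other world is unchanged
```

Changing the seed varies the layout, while the state inside each world never contradicts itself: a pushed block is absent from every later observation until something in the world changes it.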
Why Investors May Still Be Undervaluing World Models
Despite growing interest in spatial AI and embodied intelligence, Cao believes the market still misreads the category.
Many investors continue to view world models primarily as content-generation tools. He sees them as infrastructure. That distinction changes how the technology should be valued and what kind of returns it can eventually produce.
Physics engines became foundational technology for gaming. Networking layers became foundational technology for online applications. Cao believes world models will occupy the same foundational position for interactive digital environments. If that prediction holds, the long-term value will come less from media generation and more from becoming a core layer of future software systems.
The Connection to AGI
Toward the end of the conversation, the subject shifts to artificial general intelligence. Cao views spatial reasoning, simulation, and world models as connected parts of the same underlying challenge.
Spatial reasoning helps AI understand what exists and where. Simulation helps AI understand how things change over time. A world model combines both into a single operational environment.
“An intelligent system needs to understand the world it operates in, predict how that world may change, and learn from the consequences of its actions,” he says.
That capability sits at the heart of how humans learn. It may also become one of the defining requirements for advanced AI systems. For Tripo, Project Eden is not simply another product launch. It is an attempt to build the underlying infrastructure for persistent digital worlds, large-scale agent training, and the interactive experiences that follow.
The funding round of up to $200 million may be the headline. The bigger story is the bet behind it.
About Dr. Yanpei Cao
Dr. Yanpei Cao is the Chief Scientist of Tripo AI and a leading researcher in 3D generative AI, computer graphics, and spatial intelligence. He previously served as Principal Researcher at Tencent ARC Lab and Tencent AI Lab, and was CTO of Owlii (acquired by Kuaishou). His work has appeared at CVPR, NeurIPS, and SIGGRAPH.
About Tripo AI
Tripo AI is an AI infrastructure company specialising in 3D generative models and world model technologies. Its foundation models, including Tripo P1.0 and H3.1, generate production-ready 3D assets from prompts. Project Eden is Tripo’s world model initiative, providing persistent, interactive, and multi-user digital environments for creators, developers, and agent training. Learn more at https://www.tripo3d.ai/
The AI Colony Research and Development covers the intersection of AI infrastructure, creator ecosystems, and the future of digital experiences. This feature was produced in partnership with Tripo AI.