Having raised $200 million, VAST aims to become the foundation of an interactive content ecosystem

VAST, a general artificial intelligence company, recently completed two rounds of financing, Series A+ and A++, raising a total of nearly $200 million. The rounds were led by Yueze Capital and the China Life Yangtze River Delta Science and Technology Innovation Fund, with participation from industry and financial investors including the Shenzhen Artificial Intelligence Terminal Industry Fund, Shanghai Semiconductor Industry Investment, Shenzhen Venture Capital, and YuanSheng Capital.

Alongside the funding announcement, the company also unveiled its world-model project, Project Eden (codename: Eden), for the first time.

This round of funding and the release of the new project’s demo have added an intriguing new variable to the already bustling world-modeling space.

The pace of AI advancement over the past year has been so rapid it’s almost overwhelming. Not long ago, the CEO of a gaming company told me that the industry’s reaction to AI news is a lot like receiving an earthquake warning: as soon as the alarm goes off, you have to run downstairs to take a look. Most of the time it turns out to be a false alarm, but you can’t just stand there and do nothing, because you never know when the sky might actually fall.

This is particularly true of world models. Earlier this year, after Google DeepMind launched Genie 3, sentiment in the capital markets ignited instantly: Unity suffered its worst single-day drop since 2022, and gaming stocks such as Take-Two and Roblox also fell sharply. However, the industry quickly realized that this was more of an overreaction fueled by the AI narrative. Unity CEO Matthew Bromberg pointed out a more fundamental limitation: the outputs of such world models are probabilistic and non-deterministic.

From the perspective of the gaming industry, Genie 3’s technical approach has a fundamental flaw. While it can generate videos with highly realistic visuals, it cannot support a game world that is truly playable, re-enterable, or even capable of multiplayer gameplay.

In contrast, VAST’s launch today of its first world model project, Project Eden, really caught my attention. Before venturing into world models, VAST was best known for its Tripo series of AI 3D generative models, which were all about “creating everything.” In VAST’s narrative, the transition from creating everything to creating worlds is a natural next step.

“AI 3D assets and world models are inherently two driving forces,” VAST Chief Scientist Cao Yanpei told me, adding that the company’s goal from day one has been to build the next-generation interactive UGC content platform. AI 3D addresses the speed and low barrier to entry of “creating everything,” while world models are designed to handle the systematic simulation of “creating worlds.” “This is a natural extension of technological evolution and the logical next step,” he said.

While most people are choosing between “action-conditional video generation” like Google Genie and “static 3D scene generation” like World Labs Marble, Project Eden has taken a third approach: architecturally decoupling the “evolution of the world state” from the “visual rendering of the scene” from the ground up.

This unique architecture naturally overcomes three major hurdles that previous world models could not surmount: long-term environmental persistence, flexible scene reuse, and concurrent multiplayer interaction.

Two Overrated Approaches and VAST’s Unique Solution

To understand the path VAST is taking, let’s first examine why previous world models have struggled to gain widespread adoption in the gaming industry.

Huang Xun, a renowned scientist in the field of generative AI, has proposed five attributes that distinguish “true world models” from “video generation.” A true world model must satisfy causality and interactivity while striving for the highest possible levels of persistence, real-time performance, and physical accuracy. Measured against this standard, the two mainstream approaches currently receiving the most attention in the gaming industry each face limitations that are hard to overcome under the rigorous demands of production-grade games.

The first approach is the “video generation” school, which has seen the fastest progress and garnered the most attention, with Genie as its leading example. This approach is essentially autoregressive prediction of 2D pixels: it uses existing frames to predict what each pixel in the next frame should look like. Cao Yanpei bluntly states that equating this end-to-end video generation with a world model is a misconception: “All states in an end-to-end video model are tightly bound to the current viewpoint. Once the camera pans around a corner, the model can only reconstruct the scene based on context; it is fundamentally incapable of providing the ‘long-range persistence’ required for games.”

Video generation models force together two tasks of completely different scales: one is predicting what will happen next in the world, which is relatively “light” in terms of information; the other is precisely rendering every single pixel, which is extremely “heavy.” Failing to distinguish between the “light” and the “heavy” makes such models behave, in practice, much like a cartoonist without a script who is forced to improvise the plot while drawing. This not only causes the image to drift or distort when the camera angle changes but also makes multiplayer online interaction extremely difficult to implement.

The second approach is the “static reconstruction” school, represented by World Labs’ Marble. The “spatial intelligence” advocated by this school does capture the stability of spatial geometry, enabling the rapid reconstruction of navigable 3D spaces. Its limitation is that it currently offers only space, not time. It cannot simulate state changes over time, so the environment stays static. By Huang Xun’s standards, this approach has a fundamental flaw in “interactivity”; it resembles a beautifully crafted 3D specimen, stunning to look at but incapable of evolution.

Since the video-generation approach lacks a “script” and the static reconstruction approach lacks “evolution,” Project Eden’s solution is to decouple the two jobs: writing the script and generating the images. This makes it the first world model in the industry to maintain the world state autonomously while still allowing deterministic control.

Separating State and Rendering: Building an Engine with AI

In a nutshell, the technical paradigm proposed by VAST involves architecturally separating “state maintenance and prediction” from “image rendering and presentation.”

At first glance, this statement may be difficult to understand. But to those in the gaming industry, the concept is actually very familiar: it closely resembles how a game engine works. The engine maintains a “world database” in memory that records the position, attributes, and state of all objects; when the camera moves, the rendering pipeline uses the viewing angle to render the scene within the field of view into a real-time image. State is one thing and the image is another; the two are separate.
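For readers who think in code, the engine analogy can be sketched roughly as follows. This is a toy illustration of a conventional engine's split between state and image, not anything taken from VAST; the `Entity` class, the `render` function, and the two sample objects are all invented for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    x: float
    y: float
    state: str = "intact"


# The "world database": it lives in memory no matter where the camera points.
world = {
    "tree": Entity("tree", x=10.0, y=0.0),
    "door": Entity("door", x=-5.0, y=0.0),
}


def render(world, cam_x, cam_y, heading_deg, fov_deg=90.0):
    """Return a sorted list of the entities inside the camera's field of view."""
    visible = []
    for e in world.values():
        bearing = math.degrees(math.atan2(e.y - cam_y, e.x - cam_x))
        # Smallest signed angle between the camera heading and the entity.
        diff = (bearing - heading_deg + 180.0) % 360.0 - 180.0
        if abs(diff) <= fov_deg / 2:
            visible.append(f"{e.name} ({e.state})")
    return sorted(visible)


facing_east = render(world, 0.0, 0.0, heading_deg=0.0)
facing_west = render(world, 0.0, 0.0, heading_deg=180.0)
facing_east_again = render(world, 0.0, 0.0, heading_deg=0.0)
```

Turning the camera changes only what `render` returns. The `world` dictionary is never touched, so looking east a second time yields the same tree as before.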

What VAST aims to do is use AI to build such an engine within a neural network. This is a very tough nut to crack: how to represent state within a neural network, what network architecture to use, and how to acquire and augment massive amounts of data. Taken together, Cao Yanpei sees these as a massive gap.

To bridge this gap, Project Eden has innovatively adopted a decoupled three-tier technical architecture.

At its core lies a structured state layer that centrally manages the scene’s geometric structure, object properties, and event logic, and is solely responsible for simulating the objective state.

The intermediate condition interface layer serves as a hub for converting between state and rendering, transforming 3D states into constraints based on different viewpoints, thereby ensuring physical consistency across shots from the ground up.

At the top is the generative rendering layer, which uses the objective state from the underlying layer and the constraints from the intermediate layer to render detailed visual images in real time as needed, filling in the dynamic details.
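The division of labor among the three layers can be pictured with a small sketch. Everything here is an assumption made for illustration: the class names, the method names, and the dictionary-based state are invented, and the "renderer" is a stand-in that emits a text description where the real system would run a generative model.

```python
from dataclasses import dataclass, field


@dataclass
class StateLayer:
    """Bottom layer: owns geometry, object properties, and event logic."""
    objects: dict = field(default_factory=dict)

    def apply_event(self, obj, prop, value):
        self.objects[obj][prop] = value


class ConditionInterface:
    """Middle layer: turns the 3D state into constraints for one viewpoint."""

    def constraints_for(self, state, viewpoint):
        return {
            "viewpoint": viewpoint,
            "objects": {n: dict(p) for n, p in sorted(state.objects.items())},
        }


class GenerativeRenderer:
    """Top layer: placeholder for a generative model (text instead of pixels)."""

    def render(self, constraints):
        parts = [f"{n}:{p['status']}" for n, p in constraints["objects"].items()]
        return f"[{constraints['viewpoint']}] " + ", ".join(parts)


state = StateLayer({"lamp": {"status": "on", "position": (2.0, 0.0, 1.0)}})
interface = ConditionInterface()
renderer = GenerativeRenderer()

frame_front = renderer.render(interface.constraints_for(state, "front"))
state.apply_event("lamp", "status", "off")  # the change happens in the state layer only
frame_side = renderer.render(interface.constraints_for(state, "side"))
```

The renderer never stores anything between calls. Each frame is derived from the current state plus a viewpoint, which is the property the article attributes to the architecture.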

Cao Yanpei used the “firefighter extinguishing a fire” scenario in the demo to illustrate how Project Eden works: users describe the initial conditions of the kitchen, the firefighter, and the fire using a prompt, and this description is directly converted into an underlying implicit state. When the player controls the firefighter’s movement and presses the fire-extinguishing button, all determinations, such as how much powder is sprayed and whether the fire is extinguished, are simulated entirely within the underlying state layer. Even before a single frame of the scene has been rendered, this world is already “truly” operating according to a specific set of physical laws at the core.

The rendering layer on top of this acts like a realistic painter standing by at all times, ready to render a single frame based on the current state of the underlying layers and the player’s chosen viewpoint. Fluid dynamics such as gas diffusion and flames licking at walls, which are extremely difficult to render in traditional computer graphics, can instead be simulated very naturally by the rendering layer within this architecture.
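The idea that the world "runs" before any frame exists can be shown with a deliberately simple sketch. Note the large assumption: Project Eden's state is described as implicit and learned, whereas the hand-written rule and the numbers below are invented purely to show state evolving without rendering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KitchenState:
    fire_intensity: float  # 0.0 means the fire is out
    powder_left: float     # extinguisher charge, arbitrary units


def step(state, action):
    """Advance the world by one tick. Pure state logic, no pixels involved."""
    if action == "extinguish" and state.powder_left > 0:
        sprayed = min(1.0, state.powder_left)
        return KitchenState(
            fire_intensity=max(0.0, state.fire_intensity - 2.0 * sprayed),
            powder_left=state.powder_left - sprayed,
        )
    return state  # moving around does not change the fire in this toy rule


state = KitchenState(fire_intensity=5.0, powder_left=3.0)
history = []
for action in ["move", "extinguish", "extinguish", "extinguish", "extinguish"]:
    state = step(state, action)
    history.append(state.fire_intensity)
```

Whether the fire goes out is decided entirely inside `step`. A renderer would only be asked afterwards to draw whatever the state says.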

This decoupled design, which separates state from rendering, is not only technically more elegant, but also naturally resolves several of the most pressing challenges faced by game developers.

First is the return of “consistency.” Since the world state is maintained independently of the viewpoint, if you turn away and then come back, that tree will still be sitting right there in the underlying database, waiting to be rendered again. The old problem common in video-generation approaches, where objects drift and scenes distort the moment you turn away, is fundamentally resolved here.

Second, it gives the scene a vitality that allows it to be reused repeatedly. Since the world state exists independently and is both readable and writable, the actions players take within the scene are permanently preserved. If one player smashes a table, this change is written back to the underlying state in real time. When another player enters the same scene later, they will see the damaged result, rather than a room the model has re-imagined to look as good as new.

Following this logic, the challenges of multiplayer interaction are easily resolved. Because there is a unified world state at the core that is independent of individual viewpoints, multiple players can share this state and then render their own perspectives. Only then can we speak of true real-time interaction between players and between players and NPCs. This is fundamentally similar to the “server-client” architecture found in today’s multiplayer games.
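A minimal sketch of the shared-state idea, in the spirit of the server-client comparison, might look like this. The `SharedWorld` class and the player names are hypothetical, and a dictionary lookup stands in for per-player rendering.

```python
class SharedWorld:
    """One authoritative state that every player reads from and writes to."""

    def __init__(self):
        self.objects = {"table": "intact", "chair": "intact"}
        self.log = []

    def act(self, player, target, new_status):
        # Write the change back to the shared state immediately.
        self.objects[target] = new_status
        self.log.append((player, target, new_status))

    def render_for(self, player, visible):
        # Each player gets a view derived from the same underlying state.
        return player, {name: self.objects[name] for name in visible}


world = SharedWorld()
world.act("alice", "table", "smashed")

# Bob enters later and looks at the whole room; Alice looks only at the table.
bob_view = world.render_for("bob", visible=["table", "chair"])
alice_view = world.render_for("alice", visible=["table"])
```

Because both views are computed from `world.objects`, Bob sees the smashed table even though he was not present when Alice broke it.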

In addition, this model offers a significant commercial advantage: a dramatic reduction in computing costs. In a decoupled architecture, the rendering layer is computationally intensive, while the underlying state simulation involves lightweight high-dimensional computations. This directly addresses the underlying concern with video generation approaches, namely that “generating frames at the pixel level for each viewpoint and each online user causes computational costs to skyrocket exponentially with user numbers.”

Of course, this framework is still in its early stages. However, in the Project Eden demo, we can see that it’s tackling some of the toughest challenges one by one: long-term consistent environment traversal, real-time multiplayer interaction, and deterministic mechanics.

What can it do in game development before the tipping point arrives?

Reading this, it’s hard not to gasp in astonishment. Once this technology is truly mature, will game development no longer require game engines?

Cao Yanpei told me that replacing game engines with world models is a very long-term goal fraught with difficulties. Only when the computational efficiency, stability, and controllability of state transitions in world model simulations thoroughly surpass the threshold for real-time responsiveness will this paradigm shift occur.

For now, rather than debating when the tide will turn, it’s better to see how this technology can be integrated into the game development process before that tipping point arrives.

In the early stages, we can break down this decoupled architecture and integrate it into the existing game pipeline on a case-by-case basis. For example, the backend rendering model could be adapted for “generative rendering.” Imagine you’re a game designer without strong art or programming skills. You can start by using very simple blockouts to build the spatial and level structures and validate the gameplay. Then, let the rendering model transform these plain blockouts into any visual style you desire.

Lighting, detailed asset creation, and even the rendering of complex physical phenomena are all left to the model to handle in later stages.

For small and medium-sized teams, this is essentially equivalent to reducing the high costs of industrial-grade art production to a single model call. Art style will shift from being a rigid cost constraint that must be locked in before a project even begins to a “toggle switch” that can be flipped with a single click before launch.

At the same time, the front-end state prediction model can be isolated and used as an “intelligent state machine,” saving developers a great deal of repetitive work involved in writing scripts and building state machines. For example, the model can directly generate the animation or state machine output, such as how many degrees a door should rotate when kicked, how fast it should rotate, and whether it will bounce back and hit someone, without requiring programmers to specify these details line by line in code.
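To see what kind of code this would replace, consider a hedged sketch of the kicked-door case. The constants and the names `DoorResponse`, `hand_coded_door`, and `kick_door` are made up for illustration; the point is only that a learned predictor could be dropped in behind the same function signature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DoorResponse:
    angle_deg: float
    speed_deg_per_s: float
    bounces_back: bool


def hand_coded_door(kick_force: float) -> DoorResponse:
    """The line-by-line rule a programmer writes today (invented constants)."""
    angle = min(90.0, kick_force * 1.5)
    return DoorResponse(
        angle_deg=angle,
        speed_deg_per_s=kick_force * 3.0,
        bounces_back=angle >= 90.0,  # the door reached its stop and rebounds
    )


# Any callable with this signature could be swapped in, including a model.
DoorPredictor = Callable[[float], DoorResponse]


def kick_door(force: float, predictor: DoorPredictor = hand_coded_door) -> DoorResponse:
    return predictor(force)


light_kick = kick_door(20.0)
hard_kick = kick_door(80.0)
```

In the scenario the article describes, `predictor` would be backed by the state prediction model, and the constants in `hand_coded_door` would no longer need to be tuned by hand.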

As the technology matures, the next step will be to have the world model partially take over “dynamic and open-world scenarios.” At this stage, the core game will still be driven by traditional code, but in certain highly open, complex, and less strictly deterministic dynamic scenarios, the world model can be invoked to perform offline or small-scale simulations.

Examples include the damage a random storm causes to a scene, or the spontaneous interactions among a large number of NPCs. The most practical approach at this stage is to reserve the highly deterministic parts for traditional code and entrust the open-ended, random, and intractable dynamic elements to models.
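This split between code and model can be sketched as a simple dispatcher. The event names, the rule table, and `world_model_stub` are all hypothetical; the stub uses a seeded random generator where a real system would call a learned simulator.

```python
import random

# Core gameplay stays in traditional, fully deterministic code.
DETERMINISTIC_RULES = {
    "open_chest": lambda world: {**world, "chest": "open"},
    "pick_key": lambda world: {**world, "has_key": True},
}


def world_model_stub(event, world, rng):
    """Placeholder for a learned simulator: picks a random set of damaged buildings."""
    count = rng.randint(0, len(world["buildings"]))
    damaged = sorted(rng.sample(world["buildings"], k=count))
    return {**world, "damaged": damaged}


def handle(event, world, rng):
    """Route an event to hand-written code if a rule exists, else to the model."""
    rule = DETERMINISTIC_RULES.get(event)
    if rule is not None:
        return rule(world), "code"
    return world_model_stub(event, world, rng), "model"


rng = random.Random(0)
world = {"chest": "closed", "has_key": False, "buildings": ["barn", "inn", "mill"]}

world, chest_source = handle("open_chest", world, rng)
world, storm_source = handle("storm", world, rng)
```

Opening the chest always produces the same result, while the storm's outcome is left to the stand-in model. Which events belong on which side of the table is a design decision each team would have to make.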

Only once a tipping point is crossed will we see the complete replacement of hard-coded values.

“We hope to replace this hard-coded logic and these physical definitions with a more versatile neural network that has a lower barrier to entry.” Cao Yanpei said that, ideally, all the cumbersome logic currently specified line by line in code will eventually be replaced by data-driven reasoning based on large models, leading to a qualitative leap in development efficiency and lowering the barrier to entry.

However, the relationship between world models and traditional engines is not necessarily one of zero-sum replacement. As Epic founder Tim Sweeney once aptly observed, in the future we will see “engine-centric AI” and “world-model-centric AI” constantly catching up with and converging on one another, until one day the two merge into a single entity.

What VAST is doing is pushing the “world-model-centric” approach as far as possible, and every step taken before the tipping point arrives provides the gaming industry with a practical toolkit.

Conclusion

Achieving true technical maturity and industrial implementation for world models will still require a gradual process. In terms of computational efficiency, the absolute precision of physical laws, and the stability of large-scale scene simulations, neural network engines still have several technical hurdles to overcome.

However, the 3D assets that power this engine are becoming increasingly abundant. The in-house Tripo series of large 3D models, VAST’s core business, has undergone rapid iteration in recent months. The Tripo H3.1 and Tripo P1.0 models, launched in March of this year, have reached industrial-grade usability in terms of both geometric accuracy and generation speed.

Recently, they also launched an 8K AI texturing algorithm on Tripo Studio, reducing a process that previously required days of manual painting or scanning to less than two minutes. Additionally, they introduced “Smart Part Segmentation V2,” which supports three levels of granularity control, allowing generated 3D assets to be automatically segmented and fed directly into downstream workflows.

In addition, VAST is promoting industry collaboration through open-source initiatives. To date, it has open-sourced more than 30 projects, including TripoSplat, AniGen, SkinTokens, and LegoACE—developed in collaboration with universities such as Tsinghua University and the University of Hong Kong—covering cutting-edge fields such as dynamic resolution and automatic skeleton binding.

At the application level, its all-in-one platform, Tripo Studio, has already attracted 20 million creators and established deep partnerships with leading companies such as Tencent, NetEase, Alibaba, and ByteDance.

In VAST’s technical roadmap, the world model and the large-scale 3D asset model are not isolated entities, but rather form a closed-loop system driven by two interconnected components. Their long-term vision is clear: to serve as the “foundation for a UGC interactive platform and a 3D content ecosystem.”

Building on this foundation, both professional developers and everyday creators may one day be able to create and explore interactive digital worlds with lower barriers to entry and greater freedom.

It is true that it takes time for neural network engines to mature, but with an abundance of underlying 3D assets and the successful implementation of decoupled architectures, a data-driven, real-time interactive digital space is moving from concept to reality. VAST is laying the groundwork for this at both the technological and ecosystem levels.

Original article by gallonwang. Reproduction prohibited: https://youxichaguan.com/en/archives/198136
