VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

University of Cambridge
[Teaser videos: four input scenes with predictions from VDAWorld compared against Veo 3 and Wan2.2, plus controllable simulations: change camera pose, add white block, throw balls into scene, change force on block.]

Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and co-dependently chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. Furthermore, VDAWorld can infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing higher-quality simulations across a wider range of dynamic scenarios than prior approaches.


Figure 2: Example result

In VDAWorld, a VLM generates simulator code from an image and caption. The simulator code is then executed to generate future predictions. A code-critic step automatically revises the simulator code to correct errors and improve the quality of the simulation.
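The generate-execute-critique loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `query_vlm` stands in for the actual VLM call (here it must be supplied by the caller, e.g. a mock), and running generated code with `exec` is only for demonstration.

```python
def run_simulator(code: str) -> tuple[bool, str]:
    """Execute generated simulator code; return (success, result-or-error).

    Illustrative only: real systems should sandbox untrusted generated code.
    """
    env: dict = {}
    try:
        exec(code, env)
        return True, str(env.get("frames"))
    except Exception as e:
        return False, repr(e)

def critic_loop(initial_code: str, query_vlm, max_rounds: int = 3) -> str:
    """Repeatedly execute the code; on failure, ask the VLM to revise it.

    `query_vlm(code, error)` is a hypothetical callable returning fixed code.
    """
    code = initial_code
    for _ in range(max_rounds):
        ok, feedback = run_simulator(code)
        if ok:
            return code
        code = query_vlm(code, feedback)  # critic rewrites the simulator
    return code
```

In this sketch the critic sees the runtime error message as feedback, mirroring the idea that execution failures drive automatic correction of the generated simulator.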

Comparisons with Wan2.2


VDAWorld generates a scene abstraction and simulates it with a simulator chosen by the VLM, while Wan2.2 and Veo 3 directly generate videos. VDAWorld produces more physically accurate and temporally coherent results than these state-of-the-art methods.


[Video comparison: Input | Ours | Wan2.2]

We also evaluate on Conway's Game of Life, a cellular automaton with simple rules. VDAWorld correctly simulates the dynamics, while Wan2.2 fails to do so. The caption provided to both methods is "Conway's game of life on a 16 by 9 grid. Each frame constitutes one step of the game. The boundary condition is zero (pixels outside the grid are dead)."
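The dynamics VDAWorld must reproduce here are fully specified by the caption: a live cell survives with two or three live neighbours, a dead cell becomes alive with exactly three, and cells outside the grid count as dead (the zero boundary condition). A direct implementation of one step, as a reference for what a correct simulation looks like:

```python
def life_step(grid: list[list[int]]) -> list[list[int]]:
    """One step of Conway's Game of Life with a zero (all-dead) boundary."""
    h, w = len(grid), len(grid[0])

    def neighbours(r: int, c: int) -> int:
        # Cells outside the grid are treated as dead (zero boundary).
        return sum(
            grid[rr][cc]
            for rr in range(r - 1, r + 2)
            for cc in range(c - 1, c + 2)
            if (rr, cc) != (r, c) and 0 <= rr < h and 0 <= cc < w
        )

    nxt = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            n = neighbours(r, c)
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            nxt[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return nxt
```

Because the rules are deterministic, any deviation by a video model (as observed for Wan2.2 and Veo 3) is unambiguously an error, which makes this a useful probe of logical consistency.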


[Video comparison: Input | Ours | Wan2.2]

Simulation Controllability


A central advantage of VDAWorld is the ability to control the simulation via user-provided control signals or interventions. Here, we show several examples of controllable simulations.


In this scene, we show fine-grained control over a robot arm moving blocks. The abstraction is produced by VDAWorld, and the user can direct the robot arm to desired positions. We show two different scenes from the Language Table Dataset, with two control sequences per scene.
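One simple way to picture a control sequence of this kind is as a list of target positions that the simulated end-effector is stepped towards. The sketch below is purely illustrative (the names and the proportional-step controller are our assumptions, not the paper's control interface):

```python
def follow(targets: list[tuple[float, float]],
           pos: tuple[float, float] = (0.0, 0.0),
           gain: float = 0.5,
           steps_per_target: int = 20) -> list[tuple[float, float]]:
    """Step a simulated 2D end-effector towards each target in turn.

    Hypothetical controller: each step moves a fraction `gain` of the
    remaining distance to the current target, yielding a smooth approach.
    """
    x, y = pos
    traj = []
    for tx, ty in targets:
        for _ in range(steps_per_target):
            x += gain * (tx - x)
            y += gain * (ty - y)
            traj.append((x, y))
    return traj
```

Under this view, "two control sequences per scene" simply means two different target lists applied to the same abstraction.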


[Videos: Input | Abstraction | Control Signal | Simulation]

We can also intervene in the simulation by directly modifying the simulator code.
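Because the world state lives in code rather than in video-model weights, an intervention can be as simple as editing one line of the generated simulator and re-running it. A toy sketch (the simulator below is an assumed stand-in, not VDAWorld's generated code):

```python
# Hypothetical generated simulator: a ball falling under gravity,
# integrated with explicit Euler steps.
SIM_CODE = """
g = -9.8
y, v, dt = 1.0, 0.0, 0.1
ys = []
for _ in range(10):
    v += g * dt
    y += v * dt
    ys.append(y)
"""

def run(code: str) -> list[float]:
    """Execute the simulator code and return the height trajectory."""
    env: dict = {}
    exec(code, env)
    return env["ys"]

# Intervention: invert gravity by rewriting a single line of the code.
INVERTED_CODE = SIM_CODE.replace("g = -9.8", "g = 9.8")
```

The same mechanism covers the other interventions shown here (swapping an asset, changing a mass or flow rate): each is a targeted edit to the simulator code followed by re-execution.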


[Videos: code-level interventions on two scenes. Scene 1: replace duck with bunny, reduce flow rate. Scene 2: reduce duck mass, invert gravity.]

Finally, as with conventional video models, we can intervene in the simulation by changing the text caption passed to the model.


[Videos: Input; simulation with caption "A stack of objects. After the first frame, a new tennis ball falls from the top of the frame onto the stack."; simulation with caption "A stack of objects. Over the course of the five seconds of video, the camera moves to an overhead view of the stack."]

Comparisons with Veo 3


We also compare with Veo 3. Like Wan2.2, Veo 3 struggles to accurately simulate physical rules, and fails to correctly model Conway's Game of Life.


[Video comparison: Input | Ours | Veo 3]

More Results and Ground Truth Visualisation


We show several results, along with the ground truth videos. Our method produces physically accurate results. Note that there are several valid futures for each scene, so our results do not exactly match the ground truth. We want to emphasise that our results show the correct physical interactions and dynamics, which is the main goal of our work.


[Videos: three scenes, each showing Input | Ours | Ground Truth]

BibTeX

@misc{omahony_2025,
  title={{VDAWorld}: World Modelling via {VLM}-Directed Abstraction and Simulation}, 
  author={O'Mahony, Felix and Cipolla, Roberto and Tewari, Ayush},
  year={2025},
  eprint={2512.11061},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.11061}, 
}