VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

University of Cambridge
[Teaser videos: four input scenes with predictions from VDAWorld compared against Veo 3 and Wan2.2, plus controllable simulations: change camera pose, add white block, throw balls into scene, change force on block.]

Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and co-dependently chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. Furthermore, VDAWorld can infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing higher-quality simulations across a wider range of dynamic scenarios than prior approaches.


Figure 2: Example result

In VDAWorld, a VLM generates simulator code from an image and caption. The simulator code is then executed to generate future predictions. A code-critic step automatically revises the simulator code to correct errors and improve the quality of the simulation.
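The generate-execute-critique loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `query_vlm` stands in for the actual VLM call (here it must be supplied by the caller, e.g. a mock), and running generated code with `exec` is only for demonstration.

```python
def run_simulator(code: str) -> tuple[bool, str]:
    """Execute generated simulator code; return (success, result-or-error).

    Illustrative only: real systems should sandbox untrusted generated code.
    """
    env: dict = {}
    try:
        exec(code, env)
        return True, str(env.get("frames"))
    except Exception as e:
        return False, repr(e)

def critic_loop(initial_code: str, query_vlm, max_rounds: int = 3) -> str:
    """Repeatedly execute the code; on failure, ask the VLM to revise it.

    `query_vlm(code, error)` is a hypothetical callable returning fixed code.
    """
    code = initial_code
    for _ in range(max_rounds):
        ok, feedback = run_simulator(code)
        if ok:
            return code
        code = query_vlm(code, feedback)  # critic rewrites the simulator
    return code
```

In this sketch the critic sees the runtime error message as feedback, mirroring the idea that execution failures drive automatic correction of the generated simulator.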

Comparisons with Wan2.2


VDAWorld generates a scene abstraction and simulates it with a simulator chosen by the VLM, while Wan2.2 and Veo 3 directly generate videos. VDAWorld produces more physically accurate and temporally coherent results than these state-of-the-art methods.


[Video comparison: Input | Ours | Wan2.2]

We also evaluate on Conway's Game of Life, a cellular automaton with simple rules. VDAWorld correctly simulates the dynamics, while Wan2.2 fails to do so. The caption provided to both methods is "Conway's game of life on a 16 by 9 grid. Each frame constitutes one step of the game. The boundary condition is zero (pixels outside the grid are dead)."
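The dynamics VDAWorld must reproduce here are fully specified by the caption: a live cell survives with two or three live neighbours, a dead cell becomes alive with exactly three, and cells outside the grid count as dead (the zero boundary condition). A direct implementation of one step, as a reference for what a correct simulation looks like:

```python
def life_step(grid: list[list[int]]) -> list[list[int]]:
    """One step of Conway's Game of Life with a zero (all-dead) boundary."""
    h, w = len(grid), len(grid[0])

    def neighbours(r: int, c: int) -> int:
        # Cells outside the grid are treated as dead (zero boundary).
        return sum(
            grid[rr][cc]
            for rr in range(r - 1, r + 2)
            for cc in range(c - 1, c + 2)
            if (rr, cc) != (r, c) and 0 <= rr < h and 0 <= cc < w
        )

    nxt = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            n = neighbours(r, c)
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            nxt[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return nxt
```

Because the rules are deterministic, any deviation by a video model (as observed for Wan2.2 and Veo 3) is unambiguously an error, which makes this a useful probe of logical consistency.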


[Video comparison: Input | Ours | Wan2.2]

Simulation Controllability


A central advantage of VDAWorld is the ability to control the simulation via user-provided control signals or interventions. Here, we show several examples of controllable simulations.


In this scene, we show fine-grained control over a robot arm moving blocks. The abstraction is produced by VDAWorld, and the user can direct the robot arm to desired positions. We show two different scenes from the Language Table Dataset, with two control sequences per scene.
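One simple way to picture a control sequence of this kind is as a list of target positions that the simulated end-effector is stepped towards. The sketch below is purely illustrative (the names and the proportional-step controller are our assumptions, not the paper's control interface):

```python
def follow(targets: list[tuple[float, float]],
           pos: tuple[float, float] = (0.0, 0.0),
           gain: float = 0.5,
           steps_per_target: int = 20) -> list[tuple[float, float]]:
    """Step a simulated 2D end-effector towards each target in turn.

    Hypothetical controller: each step moves a fraction `gain` of the
    remaining distance to the current target, yielding a smooth approach.
    """
    x, y = pos
    traj = []
    for tx, ty in targets:
        for _ in range(steps_per_target):
            x += gain * (tx - x)
            y += gain * (ty - y)
            traj.append((x, y))
    return traj
```

Under this view, "two control sequences per scene" simply means two different target lists applied to the same abstraction.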


[Videos: Input | Abstraction | Control Signal | Simulation]

We can also intervene in the simulation by directly modifying the simulator code.
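Because the world state lives in code rather than in video-model weights, an intervention can be as simple as editing one line of the generated simulator and re-running it. A toy sketch (the simulator below is an assumed stand-in, not VDAWorld's generated code):

```python
# Hypothetical generated simulator: a ball falling under gravity,
# integrated with explicit Euler steps.
SIM_CODE = """
g = -9.8
y, v, dt = 1.0, 0.0, 0.1
ys = []
for _ in range(10):
    v += g * dt
    y += v * dt
    ys.append(y)
"""

def run(code: str) -> list[float]:
    """Execute the simulator code and return the height trajectory."""
    env: dict = {}
    exec(code, env)
    return env["ys"]

# Intervention: invert gravity by rewriting a single line of the code.
INVERTED_CODE = SIM_CODE.replace("g = -9.8", "g = 9.8")
```

The same mechanism covers the other interventions shown here (swapping an asset, changing a mass or flow rate): each is a targeted edit to the simulator code followed by re-execution.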


[Videos: code-level interventions on two scenes. Scene 1: replace duck with bunny, reduce flow rate. Scene 2: reduce duck mass, invert gravity.]

Finally, as with conventional video models, we can intervene in the simulation by changing the text caption passed to the model.


[Videos: Input; simulation with caption "A stack of objects. After the first frame, a new tennis ball falls from the top of the frame onto the stack."; simulation with caption "A stack of objects. Over the course of the five seconds of video, the camera moves to an overhead view of the stack."]

Comparisons with Veo 3


We also compare with Veo 3. Like Wan2.2, Veo 3 struggles to accurately simulate physical rules, and fails to correctly model Conway's Game of Life.


[Video comparison: Input | Ours | Veo 3]

More Results and Ground Truth Visualisation


We show several results, along with the ground truth videos. Our method produces physically accurate results. Note that there are several valid futures for each scene, so our results do not exactly match the ground truth. We want to emphasise that our results show the correct physical interactions and dynamics, which is the main goal of our work.


[Videos: three scenes, each showing Input | Ours | Ground Truth]

BibTeX

@misc{omahony_2025,
  title={{VDAWorld}: World Modelling via {VLM}-Directed Abstraction and Simulation}, 
  author={O'Mahony, Felix and Cipolla, Roberto and Tewari, Ayush},
  year={2025},
  eprint={2512.11061},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.11061}, 
}