🫧PhysiFormer: Learning to Simulate Mechanics in World Space

Yiming Chen, Yushi Lan, Andrea Vedaldi

Visual Geometry Group, University of Oxford

TL;DR

Conditioned on initial per-vertex positions and velocities, as well as object materials, PhysiFormer (pronounced 🫧fizzy🫧former) generates physically plausible 4D multi-object dynamics by predicting full-horizon vertex trajectories in world coordinates, with the input mesh topology imposed at inference.

Abstract

We present PhysiFormer, a diffusion transformer for physically plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, together with each object's material type (rigid or elastic), the model samples future vertex trajectories. While related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality, PhysiFormer shows that excellent results can be obtained without any such inductive biases, by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. The probabilistic formulation captures uncertainty in the learned dynamics, so the model can sample diverse plausible futures from the same initial conditions, which makes the framework potentially useful for applications with unobserved uncertainty. The model factorises attention over time, space, and objects for efficiency, enabling permutation-invariant multi-object reasoning without explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics, and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design.

Key Contributions

1. 3D coordinate diffusion for view-invariant modelling, without hand-specified simulation structure, learned shape latents, or explicit rigid-transform prediction.

2. Scalable diffusion-transformer framework with factorised attention across time, space, and objects, generalising to complex meshes and object counts unseen during training.

3. Generative modelling that captures uncertainty from implicit physical properties affecting observed motion but not given as inputs, enabling diverse plausible futures beyond deterministic prediction.

4. Unified model for rigid, elastic, and mixed-material dynamics, modelling collisions and objects both in motion and at rest, with material-independent inference speed unlike solver-based physics simulators.

Interactive Viewer

[Interactive viewer: precomputed, rendered PhysiFormer simulations of complex meshes unseen in training, selectable by number of objects, material, and action (initial velocity), with a first-frame action preview.]

Architecture

[Figure: PhysiFormer architecture diagram showing coordinate diffusion over mesh vertex trajectories.]
PhysiFormer takes noisy mesh vertex coordinates as input and learns to denoise them into full 4D motion trajectories. Here \(T\) is the number of timesteps, \(N\) is the number of vertices, \(O\) is the number of objects, and \(N_{obj}\) is the maximum number of vertices per object in a scene. The vertex coordinates are first embedded into tokens, then conditioned on the object’s initial position, initial velocity, and material type. These tokens are passed through a factorised DiT backbone, where spatial, temporal, and object-level attention model geometry, motion over time, and interactions between objects, respectively. At inference time, the model iteratively denoises the tokens and maps them back to 3D vertex coordinates, which are assembled into 4D meshes using the original mesh topology.
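The factorisation described in the caption can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the token layout `(O, T, N_obj, D)`, the identity query/key/value projections, the single head, and the choice to run object-level attention per (timestep, vertex-slot) pair are all assumptions made for brevity, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention over the second-to-last axis.

    tokens: (..., L, D). Learned Q/K/V projections are replaced by the
    identity to keep the sketch short (an assumption, not the paper's design).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                     # (..., L, D)

def factorised_block(x):
    """x: (O, T, N_obj, D) tokens -> tensor of the same shape."""
    # Spatial attention: vertices of one object at one timestep attend to each other.
    x = x + self_attention(x)
    # Temporal attention: each vertex attends to itself across timesteps.
    xt = np.swapaxes(x, 1, 2)                        # (O, N_obj, T, D)
    x = x + np.swapaxes(self_attention(xt), 1, 2)
    # Object-level attention: tokens in the same (timestep, vertex-slot) attend across objects.
    xo = np.moveaxis(x, 0, 2)                        # (T, N_obj, O, D)
    x = x + np.moveaxis(self_attention(xo), 2, 0)
    return x

# Usage: 3 objects, 4 timesteps, 5 vertices per object, 8 channels.
x = np.random.default_rng(0).standard_normal((3, 4, 5, 8))
y = factorised_block(x)
perm = [2, 0, 1]
y_perm = factorised_block(x[perm])  # same scene with the objects reordered
```

Because no step uses an object index or positional code along the object axis, reordering the objects simply reorders the output (`y_perm` matches `y[perm]`), and the same block accepts any number of objects. Each attention call also sees sequences of length \(N_{obj}\), \(T\), or \(O\) rather than their product, which is where the efficiency gain of factorisation comes from.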

In-distribution Test Set Inference

For a given initial position and velocity, we compare simulator ground truth with three PhysiFormer-generated samples, shown for elastic and rigid materials.

[Videos: Simulator Ground Truth, Sample 1, Sample 2, Sample 3.]

Comparison with Baselines

We implement a transformer-based autoregressive baseline \(\Phi_{AR}\) that predicts the mesh vertex positions at the next timestep, conditioned on its own past outputs. We evaluate two kinds of variants: longer context windows, which provide stronger motion anchors, and noise-injected training contexts, which reduce the train–test gap between ground-truth conditioning and self-conditioned rollout.
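To make the rollout and the noise-injection idea concrete, here is a minimal sketch. The learned transformer is replaced by a toy constant-velocity extrapolator, and the names `autoregressive_rollout`, `noisy_context`, `context_len`, and `sigma` are hypothetical; none of this is taken from the baseline's actual code.

```python
import numpy as np

def constant_velocity_step(context):
    """Toy stand-in for the learned next-step predictor (an assumption).

    context: (C, N, 3) past frames with C >= 2; returns the next (N, 3) frame
    by linear extrapolation from the last two frames.
    """
    return 2.0 * context[-1] - context[-2]

def autoregressive_rollout(initial_frames, predict_next, num_steps, context_len):
    """Self-conditioned rollout: each prediction is fed back as context."""
    frames = list(initial_frames)
    for _ in range(num_steps):
        context = np.stack(frames[-context_len:])  # most recent frames only
        frames.append(predict_next(context))
    return np.stack(frames)

def noisy_context(context, sigma, rng):
    """Training-time augmentation: perturb ground-truth context so the model
    sees inputs that resemble its own imperfect rollout outputs."""
    return context + sigma * rng.standard_normal(context.shape)

# Usage: 4 vertices translating with a constant velocity.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))
v = np.array([0.1, 0.0, -0.2])
init = np.stack([x0, x0 + v])
traj = autoregressive_rollout(init, constant_velocity_step, num_steps=5, context_len=2)
train_ctx = noisy_context(init, sigma=0.01, rng=rng)
```

In a learned model, any error in one predicted frame becomes part of the next context, which is why rollout error tends to accumulate and why training on perturbed contexts can help.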

TIE (Transformer Implicit Edges), another deterministic autoregressive transformer, uses attention-defined implicit edges to approximate message passing and outperforms graph neural network, continuous-convolution, and edge-aware transformer methods in particle-based dynamics modelling. We evaluate different values of the implicit-edge radius \(r\), which controls the range of interactions captured by the model.
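As a rough illustration of what a radius limiting the interaction range means, the sketch below restricts attention to pairs of particles within distance `r`. It is not TIE's implementation: the hard distance mask and the function names are assumptions chosen only to show the role of the radius.

```python
import numpy as np

def radius_attention_mask(positions, r):
    """positions: (N, 3). Returns an (N, N) boolean mask that is True where
    two particles lie within distance r of each other (self-pairs included)."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return dist <= r

def masked_attention_weights(scores, mask):
    """Softmax over the last axis, with masked-out pairs given zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # each row keeps its self-pair
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Usage: two nearby particles and one far away, with uniform raw scores.
positions = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
mask = radius_attention_mask(positions, r=1.0)
weights = masked_attention_weights(np.zeros((3, 3)), mask)
```

A small `r` keeps interactions local and cheap but can miss fast-approaching objects, while a large `r` captures longer-range interactions at higher cost.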

We train PhysiFormer and all baselines on a 10k-sample dataset of rigid-body interactions on the ground plane.

[Videos, selectable by number of objects and example: Simulator Ground Truth, PhysiFormer, \(\Phi_{AR}\), TIE.]

Generated Motion Reflects Physical Understanding

[Videos: ground truth (GT) and generated sample for Example 0 and Example 1.]

Denoising Step Numbers

[Videos: samples generated with 1, 5, 10, 25, and 50 denoising steps.]
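The number of denoising steps is the number of network evaluations in the sampling loop. The sketch below shows such a loop under stated assumptions: this page does not specify PhysiFormer's sampler or noise schedule, so a simple Euler integrator for a flow-matching-style parameterisation \(x_t = (1 - t)\,x_0 + t\,\epsilon\) is swapped in, and an oracle velocity built from a known target stands in for the trained network. All names are hypothetical.

```python
import numpy as np

def sample(velocity_fn, shape, num_steps, rng):
    """Integrate from pure noise (t = 1) to data (t = 0) with Euler steps.

    velocity_fn(x, t) plays the role of the trained network and is called
    exactly num_steps times.
    """
    x = rng.standard_normal(shape)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * velocity_fn(x, t)
    return x

def make_oracle(target):
    """Oracle velocity for x_t = (1 - t) * target + t * noise, i.e.
    dx/dt = noise - target = (x_t - target) / t. Also counts calls."""
    calls = {"n": 0}
    def velocity(x, t):
        calls["n"] += 1
        return (x - target) / t
    return velocity, calls

# Usage: a (T, N, 3) trajectory tensor with T = 49 frames and 10 vertices.
rng = np.random.default_rng(0)
target = rng.standard_normal((49, 10, 3))
velocity, calls = make_oracle(target)
out = sample(velocity, target.shape, num_steps=25, rng=rng)
```

With the oracle, the path from noise to data is a straight line, so any step count recovers the target. A learned network only approximates this velocity field, which is why sample quality in practice depends on the number of steps while the cost grows linearly with it.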

Generalisation to Unseen Object and Vertex Numbers

PhysiFormer is trained on scenes with at most 10 objects and 356 vertices, yet generalises beyond both limits. With 15 objects and 356 vertices, it produces physically plausible dynamics and captures material-dependent behaviour, suggesting that object-level attention maintains object identity without explicit object-ID embeddings. With 50 objects and 1,207 vertices, five times the training object count and more than three times the training vertex count, the model faces much denser multi-object collisions. While penetration and spurious contacts emerge, the generated dynamics still reflect an understanding of materials and physics, such as the large object remaining elevated by the smaller objects beneath it.

[Videos: Ground Truth All Rigid, Inference All Rigid, Inference All Elastic.]

Beyond-Training Long-Horizon Chunked Rollout

PhysiFormer is trained to generate 49-frame sequences. To evaluate whether it can extend beyond this horizon, we perform chunked rollout, generating up to 193 frames by chaining four 49-frame chunks. Each chunk is initialised with the final vertex positions of the previous chunk, while its initial velocity is estimated by averaging the velocities of the previous chunk's last three timesteps for stability.
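The chaining procedure can be sketched as follows. Several details are assumptions rather than facts from this page: the frame count 193 = 4 × 48 + 1 suggests that each chunk's first frame coincides with the previous chunk's last frame, velocities are taken here as finite differences of positions with a hypothetical timestep `dt`, and a constant-velocity toy generator stands in for sampling from the model.

```python
import numpy as np

CHUNK_LEN = 49  # frames generated per chunk, as stated in the text

def toy_generate_chunk(x0, v0, dt):
    """Stand-in for one model sample: constant-velocity motion.

    x0, v0: (N, 3). Returns (CHUNK_LEN, N, 3) whose first frame equals x0.
    """
    steps = np.arange(CHUNK_LEN)[:, None, None]
    return x0[None] + steps * dt * v0[None]

def chunked_rollout(x0, v0, num_chunks, dt, generate_chunk=toy_generate_chunk):
    frames = [x0[None]]
    for _ in range(num_chunks):
        chunk = generate_chunk(x0, v0, dt)
        frames.append(chunk[1:])                  # skip the shared boundary frame
        x0 = chunk[-1]                            # next chunk starts at the final positions
        vel = np.diff(chunk[-4:], axis=0) / dt    # last three finite-difference velocities
        v0 = vel.mean(axis=0)                     # averaged for stability
    return np.concatenate(frames, axis=0)

# Usage: 6 vertices, four chained chunks.
rng = np.random.default_rng(0)
x_init = rng.standard_normal((6, 3))
v_init = rng.standard_normal((6, 3))
dt = 0.02
rollout = chunked_rollout(x_init, v_init, num_chunks=4, dt=dt)
```

Averaging several recent velocities rather than using only the last one smooths out per-frame jitter in a generated chunk before it is passed on as the next initial condition.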

PhysiFormer vs. Physics Simulators

While physics simulators produce high-fidelity trajectories by numerically integrating physical laws, PhysiFormer offers an efficient learned alternative at inference time. 1) It generates physically plausible motion from only initial positions and velocities, without requiring fully specified physical parameters such as density, friction, or material properties. 2) Once trained, PhysiFormer uses a fixed number of network evaluations, making inference cost largely independent of material type and contact complexity. On an 80-thread Intel Xeon Gold 6338 CPU node, Genesis rigid-body simulation averaged 1–6.5 s per sample for 1–10 objects, while elastic-body simulation averaged 20–36 s per sample for 1–5 objects excluding rendering, more than 5x the PhysiFormer inference time on a single H100 GPU with 25 denoising steps. 3) PhysiFormer generalises to complex real-world mesh geometries and can produce plausible samples in challenging scenes where simulators may suffer from contact-resolution artifacts or objects leaving the simulation bounding box.

Concurrent Works

  • WorldParticle is an autoregressive transformer that simulates nonrigid systems over Lagrangian particles, combining explicit force integration with a learned transformer corrector at each timestep. Unlike our unified model, it uses a shared architecture but trains separate models for each material category.
  • RigidFormer learns rigid-body dynamics with an autoregressive, mesh-free point-cloud transformer. Unlike our method, it explicitly enforces rigidity by projecting predicted sparse anchor motion onto a rigid-body transform via Kabsch alignment, relies on explicit object labels per point, and supports only single-material scenes.
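For readers unfamiliar with the rigidity projection mentioned above, Kabsch alignment finds the rotation and translation that best map one point set onto another in the least-squares sense. The sketch below is the standard SVD-based algorithm in NumPy, not RigidFormer's code; `project_to_rigid` is a hypothetical helper showing how a predicted, possibly deformed, point set could be replaced by its closest rigid motion of the rest shape.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid alignment of P onto Q.

    P, Q: (N, 3) corresponding points. Returns (R, t) with R a proper
    rotation such that P @ R.T + t best matches Q.
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)             # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

def project_to_rigid(rest_points, predicted_points):
    """Replace a prediction by the closest rigid motion of the rest shape."""
    R, t = kabsch(rest_points, predicted_points)
    return rest_points @ R.T + t

# Usage: recover a known rotation about the z-axis plus a translation.
rng = np.random.default_rng(0)
P = rng.standard_normal((10, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -1.0, 2.0])
Q = P @ R_true.T + t_true
R_est, t_est = kabsch(P, Q)
Q_rigid = project_to_rigid(P, Q + 0.05 * rng.standard_normal(Q.shape))
```

PhysiFormer, by contrast, applies no such projection; rigidity has to emerge from the learned denoising process itself.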

Acknowledgements

We are grateful to Gabrijel Boduljak, Koichi Namekata, Zeren Jiang, Stanislaw Szymanowicz, and Minghao Chen for insightful discussions. We thank the Isambard-AI and Dawn AIRR supercomputers for supporting this project.

Citation

@article{chen2026physiformer,
  title     = {PhysiFormer: Learning to Simulate Mechanics in World Space},
  author    = {Chen, Yiming and Lan, Yushi and Vedaldi, Andrea},
  journal   = {arXiv preprint arXiv:TODO},
  year      = {2026}
}