TL;DR
Conditioned on initial per-vertex positions and velocities, as well as object materials, PhysiFormer (pronounced 🫧fizzy🫧former) generates physically plausible 4D multi-object dynamics by predicting full-horizon vertex trajectories in world coordinates, with the input mesh topology imposed at inference.
Abstract
We present PhysiFormer, a diffusion transformer for physically plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as the object material type (rigid or elastic), the model samples future vertex trajectories. While related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality, PhysiFormer shows that excellent results can be obtained without any such inductive biases, by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. The probabilistic formulation captures uncertainty in the learned dynamics, enabling diverse plausible futures from the same initial conditions and making the framework potentially useful for applications with unobserved uncertainty. The model features attention factorised over time, space, and objects for efficiency, enabling permutation-invariant multi-object reasoning without needing explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics, and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design.
Key Contributions
1 3D coordinate diffusion for view-invariant modelling, without hand-specified simulation structure, learned shape latents, or explicit rigid-transform prediction.
2 Scalable diffusion-transformer framework with factorised attention across time, space, and objects, generalising to complex meshes and object counts unseen during training.
3 Generative modelling that captures uncertainty from implicit physical properties affecting observed motion but not given as inputs, enabling diverse plausible futures beyond deterministic prediction.
4 Unified model for rigid, elastic, and mixed-material dynamics, modelling collisions and objects both in motion and at rest, with material-independent inference speed unlike solver-based physics simulators.
Interactive Viewer
[Interactive viewer: precomputed and rendered model rollouts for complex meshes unseen in training, selectable by number of objects, material, and action (initial velocity).]
Architecture
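As a rough illustration of why attention factorised over time, space, and objects is cheaper than full attention, the sketch below counts query-key pairs under both schemes. It assumes a simple axis-wise factorisation over a dense grid of frames × objects × vertices per object, which may differ from PhysiFormer's actual layout; the function name and the 36-vertices-per-object figure are hypothetical and chosen only for the arithmetic.

```python
def attention_pair_counts(frames: int, objects: int, vertices_per_object: int) -> dict:
    """Count query-key pairs for full vs. axis-wise factorised attention.

    Assumption (not from the source): tokens form a dense grid of
    frames x objects x vertices_per_object, and each factorised layer
    attends along exactly one of those axes.
    """
    tokens = frames * objects * vertices_per_object

    # Full attention: every token attends to every other token.
    full = tokens ** 2

    # Temporal attention: each (object, vertex) track attends over its frames.
    temporal = objects * vertices_per_object * frames ** 2

    # Spatial attention: vertices of one object attend to each other per frame.
    spatial = frames * objects * vertices_per_object ** 2

    # Object attention: tokens in the same frame and vertex slot attend across objects.
    across_objects = frames * vertices_per_object * objects ** 2

    return {"full": full, "factorised": temporal + spatial + across_objects}


# 49 frames and 10 objects match figures quoted on this page;
# 36 vertices per object is a made-up value for illustration.
counts = attention_pair_counts(frames=49, objects=10, vertices_per_object=36)
print(f"full / factorised = {counts['full'] / counts['factorised']:.1f}")
```

Under these assumptions the factorised cost grows with the square of each axis length separately rather than with the square of their product, which is what makes larger object and vertex counts affordable.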
In-distribution Test Set Inference
For a given initial position and velocity, we compare simulator ground truth with three PhysiFormer-generated samples.
[Videos: Simulator Ground Truth | Sample 1 | Sample 2 | Sample 3]
Comparison with Baselines
We implement a transformer-based autoregressive baseline \(\Phi_{AR}\) that predicts the mesh vertex positions at the next timestep, conditioned on its own past outputs. We evaluate variants with longer context windows, which provide stronger motion anchors, and with noise-injected training contexts, which reduce the train–test gap between ground-truth conditioning and self-conditioned rollout.
TIE (Transformer Implicit Edges), another deterministic autoregressive transformer, uses attention-defined implicit edges to approximate message passing and outperforms graph neural network, continuous-convolution, and edge-aware transformer methods in particle-based dynamics modelling. We evaluate different values of the implicit-edge radius \(r\), which controls the range of interactions captured by the model.
We train PhysiFormer and all baselines on a 10k-sample dataset of rigid-body interactions on the ground plane.
[Videos: Simulator Ground Truth | PhysiFormer | \(\Phi_{AR}\) | TIE]
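Rigidity preservation is one of the axes on which the methods are compared, but this page does not define the metric. One plausible way to measure it, sketched below, is the mean absolute drift of intra-object vertex-to-vertex distances relative to the first frame. The function names and the exact definition are assumptions for illustration, not the paper's evaluation code.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]   # one vertex position (x, y, z)
Frame = Sequence[Point]   # all vertices of ONE object at one timestep


def pairwise_distances(frame: Frame) -> List[float]:
    """Distances between every unordered pair of vertices in a frame."""
    return [math.dist(p, q) for p, q in combinations(frame, 2)]


def rigidity_error(trajectory: Sequence[Frame]) -> float:
    """Mean absolute change in pairwise vertex distances relative to frame 0.

    A perfectly rigid motion (rotation plus translation) gives 0;
    stretching or squashing gives a positive value.
    """
    reference = pairwise_distances(trajectory[0])
    total, count = 0.0, 0
    for frame in trajectory[1:]:
        for dist, ref in zip(pairwise_distances(frame), reference):
            total += abs(dist - ref)
            count += 1
    return total / count if count else 0.0


triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
translated = [(x + 1.0, y, z) for x, y, z in triangle]     # rigid motion
stretched = [(2.0 * x, 2.0 * y, z) for x, y, z in triangle]  # non-rigid motion

print(rigidity_error([triangle, translated]))
print(rigidity_error([triangle, stretched]))
```

For scenes with several objects, the error would be computed per rigid object and then averaged, since distances between different objects are expected to change.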
Generated Motion Reflects Physical Understanding
[Videos: GT and Sample for Example 0 | GT and Sample for Example 1]
Denoising Step Numbers
[Videos: 1 step | 5 steps | 10 steps | 25 steps | 50 steps]
Generalisation to Unseen Object and Vertex Numbers
PhysiFormer is trained on scenes with at most 10 objects and 356 vertices, yet generalises beyond both limits. With 15 objects and 356 vertices, it produces physically plausible dynamics and captures material-dependent behaviour, suggesting that object-level attention maintains object identity without explicit object-ID embeddings. With 50 objects and 1,207 vertices (well beyond the training limits), the model faces much denser multi-object collisions. While penetration and spurious contacts emerge, the generated dynamics still reflect material and physical understanding, such as the large object remaining elevated by the smaller objects beneath it.
[Videos: Ground Truth All Rigid | Inference All Rigid | Inference All Elastic]
Beyond-Training Long-Horizon Chunked Rollout
PhysiFormer is trained to generate 49-frame sequences. To evaluate whether it can extend beyond this horizon, we perform chunked rollout, generating up to 193 frames by chaining four 49-frame chunks. Each chunk is initialised with the previous chunk’s final position, so consecutive chunks share one frame and the total is 49 + 3 × 48 = 193 frames. The initial velocity of each new chunk is estimated by averaging the last three timestep velocities for stability.
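The chunked rollout described above can be sketched as follows. This is a minimal illustration, not the authors' code: `sample_chunk` is a hypothetical stand-in for one model sampling call, it is assumed to return 49 frames whose first frame equals the conditioning positions, and velocities are assumed to be finite differences with a timestep `dt`.

```python
from typing import Callable, List, Sequence

Vec = List[float]    # one 3D vector
Frame = List[Vec]    # positions (or velocities) of every vertex at one timestep

CHUNK_FRAMES = 49    # frames generated per chunk, as stated in the text


def estimate_velocity(frames: Sequence[Frame], dt: float, window: int = 3) -> Frame:
    """Average the last `window` finite-difference velocities, per vertex."""
    tail = frames[-(window + 1):]   # window + 1 frames yield `window` velocities
    steps = len(tail) - 1
    return [
        [
            sum(tail[i + 1][v][a] - tail[i][v][a] for i in range(steps)) / (steps * dt)
            for a in range(3)
        ]
        for v in range(len(tail[0]))
    ]


def chunked_rollout(
    sample_chunk: Callable[[Frame, Frame], List[Frame]],
    positions: Frame,
    velocities: Frame,
    num_chunks: int = 4,
    dt: float = 1.0,
) -> List[Frame]:
    """Chain several fixed-length chunks into one long trajectory."""
    trajectory = [positions]
    for _ in range(num_chunks):
        chunk = sample_chunk(positions, velocities)
        trajectory.extend(chunk[1:])                 # skip the shared first frame
        positions = chunk[-1]                        # next chunk starts where this one ends
        velocities = estimate_velocity(chunk, dt)    # smoothed initial velocity
    return trajectory


def toy_sampler(positions: Frame, velocities: Frame) -> List[Frame]:
    """Hypothetical stand-in for the model: constant-velocity motion with dt = 1."""
    return [
        [[p[a] + v[a] * t for a in range(3)] for p, v in zip(positions, velocities)]
        for t in range(CHUNK_FRAMES)
    ]


rollout = chunked_rollout(toy_sampler, [[0.0, 0.0, 0.0]], [[0.5, 0.0, 0.0]])
print(len(rollout))
```

Averaging over three steps rather than using only the last finite difference dampens frame-level jitter in the generated chunk, at the cost of lagging slightly behind abrupt velocity changes such as a collision near the chunk boundary.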
PhysiFormer vs. Physics Simulators
While physics simulators produce high-fidelity trajectories by numerically integrating physical laws, PhysiFormer offers an efficient learned alternative at inference time, with three advantages. 1) It generates physically plausible motion from only initial positions and velocities, without requiring fully specified physical parameters such as density, friction, or material properties. 2) Once trained, PhysiFormer uses a fixed number of network evaluations, making inference cost largely independent of material type and contact complexity. On an 80-thread Intel Xeon Gold 6338 CPU node, Genesis rigid-body simulation averaged 1–6.5 s per sample for 1–10 objects, while elastic-body simulation averaged 20–36 s per sample for 1–5 objects excluding rendering, more than 5× the PhysiFormer inference time on a single H100 GPU with 25 denoising steps. 3) PhysiFormer generalises to complex real-world mesh geometries and can produce plausible samples in challenging scenes where simulators may suffer from contact-resolution artifacts or objects leaving the simulation bounding box.
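The claim that inference cost is set by the step count rather than by the scene can be illustrated with a toy fixed-step sampler. This is a generic iterative-refinement loop written for illustration, not PhysiFormer's actual sampler or noise schedule; `CountingDenoiser` is a hypothetical stand-in for the network that simply returns a fixed clean signal and counts how often it is called.

```python
import random
from typing import Callable, List, Sequence


def sample(
    denoiser: Callable[[List[float], float], List[float]],
    num_values: int,
    num_steps: int = 25,
    seed: int = 0,
) -> List[float]:
    """Refine Gaussian noise towards the denoiser's prediction in `num_steps` calls."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]   # start from pure noise
    for step in range(num_steps):
        remaining = num_steps - step
        clean_pred = denoiser(x, remaining / num_steps)     # one network evaluation
        # Move a fraction 1/remaining of the way to the predicted clean signal,
        # so the final step lands on the prediction.
        x = [xi + (ci - xi) / remaining for xi, ci in zip(x, clean_pred)]
    return x


class CountingDenoiser:
    """Hypothetical 'network' that always predicts a fixed target and counts calls."""

    def __init__(self, target: Sequence[float]) -> None:
        self.target = list(target)
        self.calls = 0

    def __call__(self, x: List[float], noise_level: float) -> List[float]:
        self.calls += 1
        return list(self.target)


net = CountingDenoiser([1.0, 2.0, 3.0])
result = sample(net, num_values=3, num_steps=25)
print(net.calls)
```

The number of calls depends only on `num_steps`, whereas a numerical solver typically needs more work when contacts are dense or materials are stiff.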
Concurrent Works
- WorldParticle is an autoregressive transformer that simulates nonrigid systems over Lagrangian particles, combining explicit force integration with a learned transformer corrector at each timestep. Unlike our unified model, it uses a shared architecture but trains separate models for each material category.
- RigidFormer learns rigid-body dynamics with an autoregressive, mesh-free point-cloud transformer. Unlike our method, it explicitly enforces rigidity by projecting predicted sparse anchor motion onto a rigid-body transform via Kabsch alignment, relies on explicit object labels per point, and supports only single-material scenes.
Acknowledgements
We are grateful to Gabrijel Boduljak, Koichi Namekata, Zeren Jiang, Stanislaw Szymanowicz, and Minghao Chen for insightful discussions. We thank the Isambard-AI and Dawn AIRR supercomputers for supporting this project.
Citation
@article{chen2026physiformer,
title = {PhysiFormer: Learning to Simulate Mechanics in World Space},
author = {Chen, Yiming and Lan, Yushi and Vedaldi, Andrea},
journal = {arXiv preprint arXiv:TODO},
year = {2026}
}