World Modeling Challenge
Event: CVPR 2025
Abstract
This workshop session introduces the 1X World Modeling Challenge, a competition aimed at advancing world models for robotics. The challenge addresses critical issues in robotics scaling and evaluation, proposing world models as a solution for reproducible testing. It features two sub-challenges: compression and sampling. The winning team, Duke University, presents their ‘Pose-Modulated Diffusion Forcing Transformer and CNN’ solution, which leverages diffusion models with novel pose feature modulation for realistic future frame generation and a CNN for efficient latent compression, achieving first place in both categories.
Speakers
- Jack Monas — 1X
- Peter Liu — Duke University
- Annabelle Chu — Duke University
- Yiran Chen — Duke University
Talks (2)
- 00:01:25 — Jack Monas: World Modeling Challenge Introduction
  - Introduces the 1X World Modeling Challenge: its motivation, scaling issues in robotics, evaluation challenges, and the two sub-challenges (compression and sampling). Announces Team Duke as the overall winner.
- 00:07:31 — Peter Liu: Pose-Modulated Diffusion Forcing Transformer and CNN
  - Presentation of Team Duke’s winning solution for the 1X World Model Challenge, detailing their Diffusion Forcing Transformer with pose feature modulation for realistic future frame generation and a CNN architecture for efficient latent compression, along with results and future work.
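The pose feature modulation described above is commonly realized with FiLM or AdaLN layers. The sketch below shows the generic form of both in NumPy; it is not Team Duke's implementation. The function names, the linear projections, and all shapes (including a 25-dimensional pose vector) are assumptions chosen for illustration.

```python
import numpy as np

def film(features, pose, w_gamma, w_beta):
    """FiLM: per-channel scale and shift predicted from a conditioning vector."""
    gamma = pose @ w_gamma  # (batch, channels)
    beta = pose @ w_beta    # (batch, channels)
    # Broadcast over the spatial axes of a (batch, channels, H, W) feature map.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

def ada_layer_norm(tokens, pose, w_scale, w_shift, eps=1e-5):
    """AdaLN: layer-normalize each token, then modulate with pose-derived scale and shift."""
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mean) / np.sqrt(var + eps)
    scale = pose @ w_scale  # (batch, dim)
    shift = pose @ w_shift  # (batch, dim)
    # Broadcast over the sequence axis of (batch, seq, dim) tokens.
    return normed * (1.0 + scale[:, None, :]) + shift[:, None, :]

# Hypothetical sizes; none of these come from the talk.
rng = np.random.default_rng(0)
batch, pose_dim, channels, dim = 2, 25, 16, 32

pose = rng.normal(size=(batch, pose_dim))
feature_map = rng.normal(size=(batch, channels, 8, 8))  # CNN-style features
tokens = rng.normal(size=(batch, 10, dim))              # transformer-style tokens

film_out = film(
    feature_map, pose,
    rng.normal(size=(pose_dim, channels)),
    rng.normal(size=(pose_dim, channels)),
)
adaln_out = ada_layer_norm(
    tokens, pose,
    0.02 * rng.normal(size=(pose_dim, dim)),
    0.02 * rng.normal(size=(pose_dim, dim)),
)
print(film_out.shape, adaln_out.shape)
```

In both cases the pose never enters as an extra input token; it rescales and shifts existing activations. That is why FiLM pairs naturally with convolutional feature maps and AdaLN with transformer blocks.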
Key Takeaways
- Robotics models currently lag behind other ML domains in predictable scaling, partly due to challenges in consistent evaluation in physical environments.
- World models, acting as digital twins, offer a path to reproducible evaluation and scaling laws in robotics, but require high-fidelity modeling of robot actions and environmental responses.
- The Diffusion Forcing Transformer, enhanced with pose feature modulation, effectively addresses long-horizon video prediction and action conditioning in robotics.
- Combining local spatio-temporal context (convolutional layers) with global attention (transformers) and explicit pose conditioning (FiLM, AdaLN) is crucial for robust world model performance.
- Open challenges in world modeling include object permanence, complex object-object interactions, and developing more robust evaluation frameworks beyond simple metrics like PSNR.
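To illustrate why PSNR alone is a weak evaluation signal, here is a minimal sketch on a toy frame invented for this example (it is not challenge data). PSNR is defined as 10 · log10(MAX² / MSE), so it depends only on per-pixel squared error. The `psnr` helper and the striped test image are assumptions made for illustration.

```python
import numpy as np

def psnr(target, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means lower mean squared error."""
    target = np.asarray(target, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    mse = np.mean((target - prediction) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy frame: vertical stripes alternating 1, 0, 1, 0, ...
target = np.zeros((8, 8))
target[:, ::2] = 1.0

shifted = np.roll(target, 1, axis=1)  # sharp stripes, misplaced by one pixel
gray = np.full_like(target, 0.5)      # featureless "average" frame

print(psnr(target, shifted))  # 0.0 dB: every pixel is wrong
print(psnr(target, gray))     # about 6.02 dB
```

The featureless gray frame scores higher than the sharp but slightly misplaced one. A metric with this behavior can reward blurry predictions over physically plausible ones, which is one reason to look beyond it.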
Methods / Models / Datasets Mentioned
Octo 93M · Octo 55B · DROID dataset · GENIE · GR-1 · Nvidia COSMOS · OpenSora · WanVideo · DiffSynth-Studio · Step-Video-T2V · Mochi · HunyuanVideo-I2V · Stable Video Diffusion · Diffusion Forcing Transformer · UViT3D Backbone · U-Net · ResBlocks · TransformerBlocks · FiLM layers · AdaLN · Gaussian blur · Histogram matching
Topics
World Models · Robotics · Diffusion Models · Pose Estimation · Video Prediction · Machine Learning Scaling · Evaluation Metrics · Humanoid Robotics · Latent Compression · Transformer Architectures
Notes
Open for commentary — connections to other work, critiques, follow-up reading.