World Modeling Challenge

Event: CVPR 2025

Abstract

This workshop session introduces the 1X World Modeling Challenge, a competition aimed at advancing world models for robotics. The challenge targets two obstacles in robotics, unpredictable scaling and hard-to-reproduce evaluation, and proposes world models as a route to reproducible testing. It has two sub-challenges: compression and sampling. The winning team, from Duke University, presents its ‘Pose-Modulated Diffusion Forcing Transformer and CNN’ solution, which took first place in both categories. The solution combines a diffusion model with novel pose feature modulation, used to generate realistic future frames, and a CNN for efficient latent compression.

Speakers

Jack Monas — 1X
Peter Liu — Duke University
Annabelle Chu — Duke University
Yiran Chen — Duke University

Talks (2)

00:01:25 — Jack Monas: World Modeling Challenge Introduction
- Introduces the 1X World Modeling Challenge: its motivation, scaling issues in robotics, evaluation challenges, and the two sub-challenges (compression and sampling). Announces Team Duke as the overall winner.
00:07:31 — Peter Liu: Pose-Modulated Diffusion Forcing Transformer and CNN
- Presentation of Team Duke’s winning solution for the 1X World Model Challenge, detailing their Diffusion Forcing Transformer with pose feature modulation for realistic future frame generation and a CNN architecture for efficient latent compression, along with results and future work.
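
As a rough illustration of the idea behind diffusion forcing, the sketch below noises each frame of a latent clip with its own independently sampled timestep, instead of one shared timestep for the whole clip. This follows the general description of diffusion forcing in the literature, not Team Duke's code; the linear noise schedule, the clip shape, and the function name `diffusion_forcing_noise` are all assumptions made for the example.

```python
import numpy as np

def diffusion_forcing_noise(latents, alphas_cumprod, rng):
    """Noise each frame of a clip with its own independently sampled timestep.

    latents: array of shape (T, C, H, W), one latent per video frame.
    alphas_cumprod: array of shape (K,), cumulative signal retention per step.
    Returns the noised latents, the per-frame timesteps, and the noise used.
    """
    num_frames = latents.shape[0]
    # One timestep per frame, rather than one shared timestep per clip.
    timesteps = rng.integers(0, len(alphas_cumprod), size=num_frames)
    a_bar = alphas_cumprod[timesteps].reshape(num_frames, 1, 1, 1)
    noise = rng.standard_normal(latents.shape)
    noised = np.sqrt(a_bar) * latents + np.sqrt(1.0 - a_bar) * noise
    return noised, timesteps, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
clip = rng.standard_normal((8, 4, 16, 16))   # hypothetical 8-frame latent clip
noised, timesteps, noise = diffusion_forcing_noise(clip, alphas_cumprod, rng)
print(timesteps.shape, noised.shape)
```

Because frames can sit at different noise levels, a model trained this way can in principle keep past frames nearly clean while denoising future ones, which is what makes the approach attractive for long-horizon rollouts.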

Key Takeaways

Robotics models currently lag behind other ML domains in predictable scaling, partly due to challenges in consistent evaluation in physical environments.
World models, acting as digital twins, offer a path to reproducible evaluation and scaling laws in robotics, but require high-fidelity modeling of robot actions and environmental responses.
The Diffusion Forcing Transformer, enhanced with pose feature modulation, effectively addresses long-horizon video prediction and action conditioning in robotics.
Combining local spatio-temporal context (convolutional layers) with global attention (transformers) and explicit pose conditioning (FiLM, AdaLN) is crucial for robust world model performance.
Open challenges in world modeling include object permanence, complex object-object interactions, and developing more robust evaluation frameworks beyond simple metrics like PSNR.
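
The two conditioning mechanisms named in the takeaways, FiLM and AdaLN, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the team's architecture: the 6-dimensional pose embedding, the random projection weights, and the tensor shapes are hypothetical, and both functions use the `1 + scale` form so that a zero embedding leaves the features unmodulated (a design choice of this sketch).

```python
import numpy as np

def film(features, pose_emb, w_gamma, w_beta):
    """FiLM: per-channel scale and shift of a conv feature map from a pose embedding.

    features: (C, H, W); pose_emb: (D,); w_gamma, w_beta: (D, C).
    """
    gamma = pose_emb @ w_gamma   # (C,)
    beta = pose_emb @ w_beta     # (C,)
    return (1.0 + gamma)[:, None, None] * features + beta[:, None, None]

def ada_layer_norm(tokens, pose_emb, w_scale, w_shift, eps=1e-5):
    """AdaLN: layer-normalize tokens, then apply a pose-dependent scale and shift.

    tokens: (N, C); pose_emb: (D,); w_scale, w_shift: (D, C).
    """
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mean) / np.sqrt(var + eps)
    scale = pose_emb @ w_scale   # (C,)
    shift = pose_emb @ w_shift   # (C,)
    return normed * (1.0 + scale) + shift

rng = np.random.default_rng(0)
pose = rng.standard_normal(6)             # hypothetical 6-D pose embedding
feat = rng.standard_normal((8, 4, 4))     # conv features: 8 channels, 4x4 grid
toks = rng.standard_normal((16, 8))       # transformer tokens: 16 tokens, 8 dims
w = [0.1 * rng.standard_normal((6, 8)) for _ in range(4)]
feat_mod = film(feat, pose, w[0], w[1])
toks_mod = ada_layer_norm(toks, pose, w[2], w[3])
```

FiLM fits naturally after convolutional blocks, where features are laid out per channel, while AdaLN fits inside transformer blocks, where a layer norm already precedes attention and MLP sublayers.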

Methods / Models / Datasets Mentioned

Octo 93M
Octo 55B
DROID dataset
GENIE
GR-1
Nvidia COSMOS
OpenSora
WanVideo
DiffSynth-Studio
Step-Video-T2V
Mochi
HunyuanVideo-I2V
Stable Video Diffusion
Diffusion Forcing Transformer
UViT3D Backbone
U-Net
ResBlocks
TransformerBlocks
FiLM layers
AdaLN
Gaussian blur
Histogram matching
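
The summary lists histogram matching without saying how it was applied. One plausible reading, and it is only an assumption, is as post-processing that corrects intensity drift in generated frames. The sketch below implements standard CDF-based histogram matching on a toy single-channel frame and reports PSNR, the metric named in the takeaways, before and after. The synthetic drift, the frame size, and the function names are hypothetical.

```python
import numpy as np

def match_histogram(source, reference):
    """Remap source intensities so their empirical CDF follows the reference's."""
    src_vals, src_inverse, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source quantile, look up the reference intensity at that quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_vals)
    return mapped[src_inverse].reshape(source.shape)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
reference = rng.random((32, 32))   # toy single-channel frame in [0, 1)
# Hypothetical "generated" frame: same content with contrast and brightness drift.
generated = 0.8 * reference + 0.1 + 0.01 * rng.standard_normal((32, 32))
corrected = match_histogram(generated, reference)
print(f"PSNR before: {psnr(generated, reference):.1f} dB")
print(f"PSNR after:  {psnr(corrected, reference):.1f} dB")
```

In this toy the reference is the ground-truth frame, which is not available at prediction time. In practice the reference would have to be something like the last observed context frame, which is again an assumption rather than something the summary states.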

Topics

World Models · Robotics · Diffusion Models · Pose Estimation · Video Prediction · Machine Learning Scaling · Evaluation Metrics · Humanoid Robotics · Latent Compression · Transformer Architectures
