World Modeling Challenge

Event: CVPR 2025

Abstract

This workshop session introduces the 1X World Modeling Challenge, a competition aimed at advancing world models for robotics. The challenge addresses critical issues in robotics scaling and evaluation, proposing world models as a solution for reproducible testing. It features two sub-challenges: compression and sampling. The winning team, Duke University, presents their ‘Pose-Modulated Diffusion Forcing Transformer and CNN’ solution, which leverages diffusion models with novel pose feature modulation for realistic future frame generation and a CNN for efficient latent compression, achieving first place in both categories.

Speakers

  • Jack Monas — 1X
  • Peter Liu — Duke University
  • Annabelle Chu — Duke University
  • Yiran Chen — Duke University

Talks (2)

  • 00:01:25 Jack Monas: World Modeling Challenge Introduction
    • Introduces the 1X World Modeling Challenge: its motivation, scaling issues in robotics, evaluation challenges, and the two sub-challenges (compression and sampling). Announces Team Duke as the overall winner.
  • 00:07:31 Peter Liu: Pose-Modulated Diffusion Forcing Transformer and CNN
    • Presentation of Team Duke’s winning solution for the 1X World Model Challenge, detailing their Diffusion Forcing Transformer with pose feature modulation for realistic future frame generation and a CNN architecture for efficient latent compression, along with results and future work.

Key Takeaways

  • Robotics models currently lag behind other ML domains in predictable scaling, partly due to challenges in consistent evaluation in physical environments.
  • World models, acting as digital twins, offer a path to reproducible evaluation and scaling laws in robotics, but require high-fidelity modeling of robot actions and environmental responses.
  • The Diffusion Forcing Transformer, enhanced with pose feature modulation, effectively addresses long-horizon video prediction and action conditioning in robotics.
  • Combining local spatio-temporal context (convolutional layers) with global attention (transformers) and explicit pose conditioning (FiLM, AdaLN) is crucial for robust world model performance.
  • Open challenges in world modeling include object permanence, complex object-object interactions, and developing more robust evaluation frameworks beyond simple metrics like PSNR.
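
The pose conditioning named in the takeaways (FiLM, AdaLN) can be illustrated with a minimal NumPy sketch. This is not Team Duke's implementation: the tensor shapes, the linear projections from the pose vector, and the `(1 + scale)` form of adaptive layer norm are assumptions chosen for illustration, and all function and variable names are hypothetical.

```python
import numpy as np

def film(features, pose, w_gamma, w_beta):
    """FiLM: per-channel scale and shift predicted from the pose vector.

    features: (batch, tokens, channels); pose: (batch, pose_dim).
    w_gamma, w_beta: (pose_dim, channels) projections (hypothetical, untrained).
    """
    gamma = (pose @ w_gamma)[:, None, :]  # broadcast over the token axis
    beta = (pose @ w_beta)[:, None, :]
    return gamma * features + beta

def adaln(features, pose, w_scale, w_shift, eps=1e-5):
    """Adaptive LayerNorm: normalize over channels, then modulate from the pose."""
    mean = features.mean(axis=-1, keepdims=True)
    var = features.var(axis=-1, keepdims=True)
    normed = (features - mean) / np.sqrt(var + eps)
    scale = (pose @ w_scale)[:, None, :]
    shift = (pose @ w_shift)[:, None, :]
    # Assumed formulation: zero scale and shift reduce to a plain layer norm.
    return (1.0 + scale) * normed + shift

rng = np.random.default_rng(0)
batch, tokens, channels, pose_dim = 2, 4, 8, 3

def random_projection():
    # Stand-in for learned weights.
    return rng.normal(scale=0.1, size=(pose_dim, channels))

features = rng.normal(size=(batch, tokens, channels))  # e.g. latent frame tokens
pose = rng.normal(size=(batch, pose_dim))              # e.g. robot joint state

film_out = film(features, pose, random_projection(), random_projection())
adaln_out = adaln(features, pose, random_projection(), random_projection())
print(film_out.shape, adaln_out.shape)
```

In both cases the pose never enters as an extra token. It changes how every feature channel is scaled and shifted, which is one way to make action conditioning affect each layer it is applied to.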

Methods / Models / Datasets Mentioned

  • Octo 93M
  • Octo 55B
  • DROID dataset
  • GENIE
  • GR-1
  • Nvidia COSMOS
  • OpenSora
  • WanVideo
  • DiffSynth-Studio
  • Step-Video-T2V
  • Mochi
  • HunyuanVideo-I2V
  • Stable Video Diffusion
  • Diffusion Forcing Transformer
  • UViT3D Backbone
  • U-Net
  • ResBlocks
  • TransformerBlocks
  • FiLM layers
  • AdaLN
  • Gaussian blur
  • Histogram matching

Topics

World Models · Robotics · Diffusion Models · Pose Estimation · Video Prediction · Machine Learning Scaling · Evaluation Metrics · Humanoid Robotics · Latent Compression · Transformer Architectures

