Scalable Neural Simulation for Autonomy

Event: CVPR 2025 Workshop on Autonomous Driving · Duration: 28 min

Abstract

This presentation makes the case for scalable neural simulation in autonomous driving, a need that grows as the industry shifts towards end-to-end differentiable stacks. The speaker highlights four requirements: quality (fidelity and consistency), generalizability (robustness and controllability), label-less learning (self-supervised methods), and efficiency (construction and inference). Recent research from Berkeley DeepDrive and Applied Intuition is showcased, including approaches to 3D/4D scene reconstruction, realistic 3D asset insertion via diffusion models, and cross-modality consistent data synthesis.

Speakers

  • Wei Zhan — Co-Director, Berkeley DeepDrive; Chief Scientist, Applied Intuition

Talks (1)

  • 00:04 Wei Zhan: Scalable Neural Simulation for Autonomy
    • This talk discusses the challenges and recent advancements in scalable neural simulation for autonomous driving, focusing on quality, generalizability, label-less learning, and efficiency, with specific examples from Berkeley DeepDrive and Applied Intuition research.

Key Takeaways

  • The autonomous driving industry is moving towards end-to-end differentiable stacks, necessitating scalable neural simulation for training and evaluation.
  • Key challenges in neural simulation include achieving high fidelity, strong generalizability with controllability, reducing reliance on costly human labels, and improving computational efficiency.
  • Novel methods like DeSiRe-GS enable scalable 4D street Gaussian reconstruction and static-dynamic decomposition without 3D annotations, leveraging self-supervision.
  • R3D2 demonstrates realistic 3D asset insertion into reconstructed scenes using diffusion models, significantly enhancing realism and controllability for various scenarios (cross-scene, cross-dataset, text-to-3D).
  • X-Drive focuses on cross-modality consistent multi-sensor data synthesis using diffusion models, ensuring consistency between synthetic multi-view images and point clouds conditioned on 3D bounding boxes or text prompts.
  • The transition from per-scene optimization to feed-forward models (e.g., DrivingRecon, PixelGaussian) is crucial for improving efficiency and generalizability in 3D Gaussian reconstruction. S2GO addresses streaming sparse Gaussian occupancy prediction with high runtime efficiency.
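
Several of the takeaways above build on Gaussian splatting. As general background rather than anything specific to the methods in the talk, the sketch below shows the front-to-back alpha compositing rule that splatting renderers apply along each pixel ray once the Gaussians are sorted by depth. The function name `composite_ray` and its inputs are hypothetical and chosen for illustration; real renderers compute each per-splat alpha from a projected 2D Gaussian and run this loop on the GPU.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def composite_ray(
    colors: Sequence[Color],
    alphas: Sequence[float],
    background: Color = (0.0, 0.0, 0.0),
    min_transmittance: float = 1e-4,
) -> Tuple[Color, float]:
    """Blend depth-sorted splats (nearest first) along a single ray.

    Returns the blended RGB color and the remaining transmittance.
    Illustrative helper, not taken from any of the cited methods.
    """
    if len(colors) != len(alphas):
        raise ValueError("colors and alphas must have the same length")

    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha  # contribution of this splat
        for k in range(3):
            out[k] += weight * color[k]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # ray is effectively opaque, later splats are hidden

    # Whatever light is left passes through to the background.
    for k in range(3):
        out[k] += transmittance * background[k]
    return (out[0], out[1], out[2]), transmittance


if __name__ == "__main__":
    # A half-transparent red splat in front of a half-transparent green one.
    rgb, remaining = composite_ray(
        [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5]
    )
    print(rgb, remaining)
```

Because every step is a sum of products, the rendered color is differentiable with respect to each splat's color and opacity, which is what lets reconstruction methods fit Gaussians to camera images by gradient descent.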

Methods / Models / Datasets Mentioned

  • DeSiRe-GS
  • DrivingGaussian
  • R3D2
  • X-Drive
  • DrivingRecon
  • PixelGaussian
  • S2GO
  • Gaussian Splatting
  • Diffusion Models
  • FID
  • FID-A
  • LMSCNet
  • MonoScene
  • Atlas
  • BEVFormer
  • TPVFormer
  • OccFormer
  • GaussianFormer
  • GaussianWorld
  • SSCNet
  • SurroundOcc
  • GaussianFormer-2
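
FID in the list above is the Fréchet Inception Distance, a standard score for the realism of generated images. As general background (the talk summary does not spell it out), it compares the mean and covariance of Inception-network features for real images, $(\mu_r, \Sigma_r)$, with those for generated images, $(\mu_g, \Sigma_g)$:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer. FID-A is not defined in this summary.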

Topics

Neural Simulation · Autonomous Driving · Gaussian Splatting · Diffusion Models · 3D Reconstruction · 4D Reconstruction · Self-Supervised Learning · End-to-End Learning · Multi-Modal Data Synthesis · Occupancy Prediction

