Scalable Neural Simulation for Autonomy
Event: CVPR 2025 Workshop on Autonomous Driving · Duration: 28 min
Abstract
This presentation argues that autonomous driving needs scalable neural simulation, particularly as the industry shifts towards end-to-end differentiable stacks. The speaker highlights four requirements: quality (fidelity and consistency), generalizability (robustness and controllability), label-less learning (self-supervised methods), and efficiency (construction and inference). Recent research from Berkeley DeepDrive and Applied Intuition illustrates each one, covering 3D/4D scene reconstruction, realistic 3D asset insertion via diffusion models, and cross-modality consistent data synthesis, all aimed at more scalable and robust simulation environments for autonomous systems.
Speakers
- Wei Zhan — Co-Director, Berkeley DeepDrive; Chief Scientist, Applied Intuition
Talks (1)
- 00:04 — Wei Zhan: Scalable Neural Simulation for Autonomy
- This talk discusses the challenges and recent advancements in scalable neural simulation for autonomous driving, focusing on quality, generalizability, label-less learning, and efficiency, with specific examples from Berkeley DeepDrive and Applied Intuition research.
Key Takeaways
- The autonomous driving industry is moving towards end-to-end differentiable stacks, necessitating scalable neural simulation for training and evaluation.
- Key challenges in neural simulation include achieving high fidelity, strong generalizability with controllability, reducing reliance on costly human labels, and improving computational efficiency.
- Novel methods like DeSiRe-GS enable scalable 4D street Gaussian reconstruction and static-dynamic decomposition without 3D annotations, leveraging self-supervision.
- R3D2 demonstrates realistic 3D asset insertion into reconstructed scenes using diffusion models, improving realism and controllability in cross-scene, cross-dataset, and text-to-3D insertion scenarios.
- X-Drive focuses on cross-modality consistent multi-sensor data synthesis using diffusion models, ensuring consistency between synthetic multi-view images and point clouds conditioned on 3D bounding boxes or text prompts.
- The transition from per-scene optimization to feed-forward models (e.g., DrivingRecon, PixelGaussian) is crucial for improving efficiency and generalizability in 3D Gaussian reconstruction. Separately, S2GO addresses streaming sparse Gaussian occupancy prediction with high runtime efficiency.
Methods / Models / Datasets Mentioned
DeSiRe-GS · DrivingGaussian · R3D2 · X-Drive · DrivingRecon · PixelGaussian · S2GO · Gaussian Splatting · Diffusion Models · FID · FID-A · LMSCNET · MonoScene · Atlas · BEVFormer · TPVFormer · OccFormer · GaussianFormer · GaussianWorld · SSCNet · SurroundOcc · GaussianFormer-2
Topics
Neural Simulation · Autonomous Driving · Gaussian Splatting · Diffusion Models · 3D Reconstruction · 4D Reconstruction · Self-Supervised Learning · End-to-End Learning · Multi-Modal Data Synthesis · Occupancy Prediction
Notes
Open for commentary — connections to other work, critiques, follow-up reading.