Scalable Neural Simulation for Autonomy

Event: CVPR 2025 Workshop on Autonomous Driving · Duration: 28 min

Abstract

This presentation explores the critical need for scalable neural simulation in the context of autonomous driving, particularly as the industry shifts towards end-to-end differentiable stacks. The speaker highlights four key aspects: quality (fidelity and consistency), generalizability (robustness and controllability), label-less learning (self-supervised methods), and efficiency (construction and inference). Recent research from Berkeley DeepDrive and Applied Intuition is showcased, demonstrating novel approaches to 3D/4D scene reconstruction, realistic 3D asset insertion via diffusion models, and cross-modality consistent data synthesis, all aimed at achieving more scalable and robust simulation environments for autonomous systems.

Speakers

Wei Zhan — Co-Director, Berkeley DeepDrive; Chief Scientist, Applied Intuition

Talks (1)

00:04 — Wei Zhan: Scalable Neural Simulation for Autonomy
- This talk discusses the challenges and recent advancements in scalable neural simulation for autonomous driving, focusing on quality, generalizability, label-less learning, and efficiency, with specific examples from Berkeley DeepDrive and Applied Intuition research.

Key Takeaways

The autonomous driving industry is moving towards end-to-end differentiable stacks, necessitating scalable neural simulation for training and evaluation.
Key challenges in neural simulation include achieving high fidelity, strong generalizability with controllability, reducing reliance on costly human labels, and improving computational efficiency.
Novel methods like DeSiRe-GS enable scalable 4D street Gaussian reconstruction and static-dynamic decomposition without 3D annotations, leveraging self-supervision.
R3D2 demonstrates realistic 3D asset insertion into reconstructed scenes using diffusion models, significantly enhancing realism and controllability for various scenarios (cross-scene, cross-dataset, text-to-3D).
X-Drive focuses on cross-modality consistent multi-sensor data synthesis using diffusion models, ensuring consistency between synthetic multi-view images and point clouds conditioned on 3D bounding boxes or text prompts.
Moving from per-scene optimization to feed-forward models (e.g., DrivingRecon, PixelGaussian) improves both efficiency and generalizability in 3D Gaussian reconstruction, because a trained network predicts the Gaussians in a single forward pass instead of re-optimizing them for every scene. S2GO applies a related idea to streaming sparse Gaussian occupancy prediction, with an emphasis on runtime efficiency.
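The cross-modality consistency mentioned in the X-Drive takeaway above can be made concrete with a small geometric check. The sketch below is not X-Drive's method or evaluation protocol. It only illustrates one generic way to ask whether a point cloud and a camera view agree: project the points into the image with a pinhole model and compare their depths against an image-side depth map. The function names (`project_points`, `depth_agreement`), the tolerance `tol`, and the existence of a per-pixel depth map are all assumptions made for illustration.

```python
import numpy as np


def project_points(points_world, K, T_cam_from_world):
    """Project Nx3 world points into pixels with a pinhole camera model.

    K is the 3x3 intrinsic matrix; T_cam_from_world is a 4x4 rigid transform.
    Returns (uv, depth, valid): Nx2 pixel coordinates (NaN where invalid),
    N camera-frame depths, and a boolean mask of points in front of the camera.
    """
    points_world = np.asarray(points_world, dtype=float)
    ones = np.ones((len(points_world), 1))
    homogeneous = np.hstack([points_world, ones])
    cam = (T_cam_from_world @ homogeneous.T).T[:, :3]
    depth = cam[:, 2]
    valid = depth > 1e-6  # keep only points in front of the camera
    uv = np.full((len(points_world), 2), np.nan)
    pix = (K @ cam[valid].T).T
    uv[valid] = pix[:, :2] / pix[:, 2:3]  # perspective divide
    return uv, depth, valid


def depth_agreement(points_world, depth_map, K, T_cam_from_world, tol=0.5):
    """Fraction of visible points whose depth matches depth_map within tol.

    depth_map is a hypothetical HxW array of image-side depths, in the same
    units as the points. Points that project outside the image are ignored.
    """
    height, width = depth_map.shape
    uv, depth, valid = project_points(points_world, K, T_cam_from_world)
    cols = np.round(uv[valid, 0]).astype(int)
    rows = np.round(uv[valid, 1]).astype(int)
    point_depth = depth[valid]
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    if not inside.any():
        return 0.0
    diff = np.abs(depth_map[rows[inside], cols[inside]] - point_depth[inside])
    return float((diff < tol).mean())
```

Calling `depth_agreement(points, depth_map, K, np.eye(4))` returns a score between 0 and 1. A low score would suggest that the two modalities describe different geometry, which is the kind of mismatch that jointly generated images and point clouds are meant to avoid.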

Methods / Models / Datasets Mentioned

DeSiRe-GS
DrivingGaussian
R3D2
X-Drive
DrivingRecon
PixelGaussian
S2GO
Gaussian Splatting
Diffusion Models
FID
FID-A
LMSCNet
MonoScene
Atlas
BEVFormer
TPVFormer
OccFormer
GaussianFormer
GaussianWorld
SSCNet
SurroundOcc
GaussianFormer-2

Topics

Neural Simulation · Autonomous Driving · Gaussian Splatting · Diffusion Models · 3D Reconstruction · 4D Reconstruction · Self-Supervised Learning · End-to-End Learning · Multi-Modal Data Synthesis · Occupancy Prediction
