Scaling up Autonomous Driving via Large Foundation Models
Event: CVPR 2025 Workshop on Autonomous Driving · Duration: 23 min
Abstract
Xianming Liu from XPeng presents the company’s vision for scaling autonomous driving through large foundation models, termed ‘SW 3.0 - AI Factory’. The presentation showcases XPeng’s Turing AI Driving Assistance System, which relies solely on cameras, and demonstrates its capabilities in challenging real-world scenarios such as rain, fog, darkness, and complex urban environments. The core methodology is a continuous learning framework with inner and outer loops, encompassing pretraining, supervised finetuning (SFT) with Chain-of-Thought (CoT), reinforcement learning (RL), and distillation. This data-centric approach, which leverages massive real-world data and advanced model architectures, aims to build robust and generalizable autonomous driving systems.
Speakers
- Xianming Liu — Head of AI Team, Autonomous Driving Center, XPeng
Talks (1)
- 00:04 — Xianming Liu: Scaling up Autonomous Driving via Large Foundation Models
- This talk introduces XPeng’s approach to scaling autonomous driving using large foundation models, detailing their ‘SW 3.0 - AI Factory’ concept, inner and outer learning loops, and the integration of Chain-of-Thought (CoT) and Reinforcement Learning (RL) for enhanced capabilities in mass-production vehicles.
Key Takeaways
- XPeng’s autonomous driving system, deployed in mass-production vehicles, demonstrates robust performance in diverse and challenging conditions (rain, fog, darkness, complex urban scenarios) using only cameras, without LiDAR or HD maps.
- The ‘SW 3.0 - AI Factory’ paradigm shifts autonomous driving software development towards data-centric, continuously improving large foundation models, where AI models themselves become the software.
- A ‘Foundation Model: Inner and Outer Loops’ framework is crucial for continuous improvement, involving pretraining for scaling, post-training with SFT and RL for generalization and corner cases, and distillation for efficient deployment.
- Chain-of-Thought (CoT) and Meta Actions are introduced to enable the autonomous agent to ‘think’ and reason about complex driving situations, leading to more confident and human-like decision-making, particularly in ambiguous scenarios like navigating intersections.
- Scaling laws observed in model capacity and data volume for Vision-Language-Action (VLA) models provide confidence that increasing data and model size will continue to yield significant performance improvements in autonomous driving.
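The scaling-law takeaway can be made concrete with a small sketch. The summary does not state the functional form or any numbers from the talk, so the power-law form `loss ≈ a * size**(-b)`, the helper name `fit_power_law`, and the data points below are all assumptions chosen for illustration. The idea is that a straight line in log-log space lets one extrapolate how much a larger model or dataset might help.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-b) by least squares in log-log space.

    Hypothetical helper for illustration; not taken from the talk.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - b * log(size), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope

# Synthetic points generated from an assumed law, NOT measurements from XPeng.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.25 for s in sizes]

a, b = fit_power_law(sizes, losses)

# Extrapolate one order of magnitude beyond the largest observed size.
predicted = a * (1e10) ** (-b)
```

With real measurements the fit would be noisy, and the extrapolation only holds as long as the trend stays linear in log-log space, which is the empirical claim behind a scaling law.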
Methods / Models / Datasets Mentioned
Turing AI Driving Assistance System · MONA M03 Max · SW 3.0 - AI Factory · Foundation Model · Large Physical AI Model · VLM Pretraining · Action Pretraining · VLA (action) · SFT (Supervised Finetuning) · RL (Reinforcement Learning) · CoT (Chain-of-Thought) · Meta Action · World Model
Topics
Autonomous Driving · Foundation Models · AI Factory · Continuous Learning · Chain-of-Thought (CoT) · Reinforcement Learning (RL) · Supervised Finetuning (SFT) · Data-Centric AI · Mass Production · Scaling Laws
Notes
Open for commentary — connections to other work, critiques, follow-up reading.