Scaling up Autonomous Driving via Large Foundation Models

Event: CVPR 2025 Workshop on Autonomous Driving · Duration: 23 min

Abstract

Xianming Liu from XPeng presents the company’s vision for scaling autonomous driving through large foundation models, termed ‘SW 3.0 - AI Factory’. The presentation showcases XPeng’s Turing AI Driving Assistance System, which relies solely on cameras, and demonstrates it in challenging real-world conditions such as rain, fog, darkness, and complex urban environments. The core methodology is a continuous learning framework with inner and outer loops that covers pretraining, supervised finetuning (SFT) with Chain-of-Thought (CoT), reinforcement learning (RL), and distillation. This data-centric approach, which draws on massive real-world data and advanced model architectures, aims to build robust and generalizable autonomous driving systems.

Speakers

Xianming Liu — Head of AI Team, Autonomous Driving Center, XPeng

Talks (1)

00:04 — Xianming Liu: Scaling up Autonomous Driving via Large Foundation Models
- This talk introduces XPeng’s approach to scaling autonomous driving using large foundation models, detailing their ‘SW 3.0 - AI Factory’ concept, inner and outer learning loops, and the integration of Chain-of-Thought (CoT) and Reinforcement Learning (RL) for enhanced capabilities in mass-production vehicles.

Key Takeaways

XPeng’s autonomous driving system, deployed in mass-production vehicles, demonstrates robust performance in diverse and challenging conditions (rain, fog, darkness, complex urban scenarios) using only cameras, without LiDAR or HD maps.
The ‘SW 3.0 - AI Factory’ paradigm shifts autonomous driving software development towards data-centric, continuously improving large foundation models, where AI models themselves become the software.
A ‘Foundation Model: Inner and Outer Loops’ framework drives continuous improvement in three stages: pretraining for scaling, post-training with SFT and RL for generalization and corner cases, and distillation for efficient deployment.
Chain-of-Thought (CoT) and Meta Actions are introduced to enable the autonomous agent to ‘think’ and reason about complex driving situations, leading to more confident and human-like decision-making, particularly in ambiguous scenarios like navigating intersections.
Scaling laws observed in model capacity and data volume for Vision-Language-Action (VLA) models provide confidence that increasing data and model size will continue to yield significant performance improvements in autonomous driving.
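As an illustration of the scaling-law takeaway, the sketch below fits a power law to loss measurements and extrapolates to a larger dataset. It is a minimal example under stated assumptions: the summary does not give the functional form or any numbers from the talk, so the power-law form `loss = a * n**(-b)`, the helper names `fit_power_law` and `predict`, and the synthetic data points are all hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares in log-log space.

    Returns (a, b). Assumes all xs and ys are strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                      # equals -b for a decaying power law
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope


def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)


# Synthetic, made-up measurements (NOT from the talk): loss vs. training clips.
data_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [2.0 * n ** (-0.1) for n in data_sizes]

a, b = fit_power_law(data_sizes, losses)
extrapolated = predict(a, b, 1e10)  # projected loss at 10x more data
```

If measured losses fall on a straight line in log-log space, as this synthetic data does by construction, the fitted exponent `b` indicates how much improvement each tenfold increase in data might buy. Real measurements are noisy, and extrapolation far beyond the observed range should be treated with caution.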

Methods / Models / Datasets Mentioned

Turing AI Driving Assistance System
MONA M03 Max
SW 3.0 - AI Factory
Foundation Model
Large Physical AI Model
VLM Pretraining
Action Pretraining
VLA (Vision-Language-Action)
SFT (Supervised Finetuning)
RL (Reinforcement Learning)
CoT (Chain-of-Thought)
Meta Action
World Model

Topics

Autonomous Driving · Foundation Models · AI Factory · Continuous Learning · Chain-of-Thought (CoT) · Reinforcement Learning (RL) · Supervised Finetuning (SFT) · Data-Centric AI · Mass Production · Scaling Laws

Notes

Open for commentary — connections to other work, critiques, follow-up reading.