Scaling up Autonomous Driving via Large Foundation Models

Event: CVPR 2025 Workshop on Autonomous Driving · Duration: 23 min

Abstract

Xianming Liu of XPeng presents the company’s vision for scaling autonomous driving with large foundation models, which it calls ‘SW 3.0 - AI Factory’. The presentation showcases XPeng’s Turing AI Driving Assistance System, which relies solely on cameras, and demonstrates it in challenging real-world conditions such as rain, fog, darkness, and complex urban environments. The core methodology is a continuous learning framework with inner and outer loops that spans pretraining, supervised finetuning (SFT) with Chain-of-Thought (CoT), reinforcement learning (RL), and distillation. This data-centric approach, which combines massive real-world data with advanced model architectures, aims to build robust and generalizable autonomous driving systems.
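
The distillation step named above is not specified in this summary. As a rough illustration only, the sketch below shows the widely used soft-label form of knowledge distillation, in which a small student model is trained to match the temperature-softened output distribution of a large teacher. The function names, the three-way action logits, and the temperature value are all assumptions made for the example, not details from the talk.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the generic soft-label distillation loss, not XPeng's recipe.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three made-up actions: keep lane, yield, turn left.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]   # roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]     # prefers the opposite action

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

A student that agrees with the teacher gets a loss near zero, so minimizing this quantity pulls the smaller model toward the larger model's behavior before it is deployed on the vehicle.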

Speakers

  • Xianming Liu — Head of AI Team, Autonomous Driving Center, XPeng

Talks (1)

  • 00:04 Xianming Liu: Scaling up Autonomous Driving via Large Foundation Models
    • This talk introduces XPeng’s approach to scaling autonomous driving using large foundation models, detailing their ‘SW 3.0 - AI Factory’ concept, inner and outer learning loops, and the integration of Chain-of-Thought (CoT) and Reinforcement Learning (RL) for enhanced capabilities in mass-production vehicles.

Key Takeaways

  • XPeng’s autonomous driving system, deployed in mass-production vehicles, demonstrates robust performance in diverse and challenging conditions (rain, fog, darkness, complex urban scenarios) using only cameras, without LiDAR or HD maps.
  • The ‘SW 3.0 - AI Factory’ paradigm shifts autonomous driving software development towards data-centric, continuously improving large foundation models, where AI models themselves become the software.
  • A ‘Foundation Model: Inner and Outer Loops’ framework is crucial for continuous improvement, involving pretraining for scaling, post-training with SFT and RL for generalization and corner cases, and distillation for efficient deployment.
  • Chain-of-Thought (CoT) and Meta Actions are introduced to enable the autonomous agent to ‘think’ and reason about complex driving situations, leading to more confident and human-like decision-making, particularly in ambiguous scenarios like navigating intersections.
  • Scaling laws observed in model capacity and data volume for Vision-Language-Action (VLA) models provide confidence that increasing data and model size will continue to yield significant performance improvements in autonomous driving.
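
The scaling-law takeaway can be made concrete with a small fitting exercise. The sketch below assumes the common power-law form loss = a * size^(-b) and fits it by least squares in log-log space; the summary does not give the functional form or any measurements from the talk, so the form, the constants, and the data points here are synthetic and purely illustrative.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from an assumed power law (not measured results).
true_a, true_b = 5.0, 0.1
sizes = [1e6, 1e7, 1e8, 1e9]          # e.g. parameter or sample counts
losses = [true_a * s ** (-true_b) for s in sizes]

a, b = fit_power_law(sizes, losses)
predicted = a * 1e10 ** (-b)           # extrapolate one decade further
print(f"fitted a={a:.3f}, b={b:.3f}, predicted loss at 1e10: {predicted:.4f}")
```

If real measurements fall on a straight line in log-log space like this, the fitted exponent gives a rough estimate of how much further loss should drop for each additional order of magnitude of data or model size, which is the kind of confidence the takeaway refers to.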

Methods / Models / Datasets Mentioned

  • Turing AI Driving Assistance System
  • MONA M03 Max
  • SW 3.0 - AI Factory
  • Foundation Model
  • Large Physical AI Model
  • VLM Pretraining
  • Action Pretraining
  • VLA (Vision-Language-Action)
  • SFT (Supervised Finetuning)
  • RL (Reinforcement Learning)
  • CoT (Chain-of-Thought)
  • Meta Action
  • World Model

Topics

Autonomous Driving · Foundation Models · AI Factory · Continuous Learning · Chain-of-Thought (CoT) · Reinforcement Learning (RL) · Supervised Finetuning (SFT) · Data-Centric AI · Mass Production · Scaling Laws


Notes

Open for commentary — connections to other work, critiques, follow-up reading.