Xie Chen: Data Survey — AI & Robotics Data Recipes

Duration: 156 min

Guest: Xie Chen (谢晨, Steve) · Founder and CEO of Light Wheel AI (光轮智能)

Chapters (33)

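  • 00:00 · Teaser and Opening
    • The host introduces the episode's theme, the evolution of the data industry in the era of embodied AI, and introduces the guest, Xie Chen.
  • 01:07 · Guest Introduction and Career Path
    • Xie Chen describes his cross-disciplinary path from physics at Peking University to finance at Columbia, and then to Silicon Valley tech companies (Jet.com, Cruise, Nvidia, NIO).
  • 04:47 · Finding One's Talents and Strengths
    • Xie Chen admits he lacked top-tier talent in physics and, through repeated trial and error, looked for the field where he could create the most value, eventually settling on technology and product.
  • 07:49 · An Early Startup Attempt
    • He recounts building a social app for his dog during his PhD, an experience that taught him the importance of a business model.
  • 12:14 · Why He Committed to Autonomous Driving and Robotics Simulation
    • He describes his time at Cruise, Nvidia, and NIO, where he came to see simulation as not just an accelerator but a prerequisite for embodied AI.
  • 16:27 · The Essence of Simulation: Time Machine and Data Engine
    • Explains how simulation evolved from an early visual demo tool into a core engine that supplies genuinely useful training data to algorithms.
  • 20:10 · Four Stages of AI Data Development
    • Divides the development of AI data into the ImageNet era, the Scale AI pipeline era, the large-model RLHF era, and the embodied AI simulation era.
  • 27:21 · The Evolution of Data Labeling and Synthetic Data
    • Explores the limits of traditional data labeling in embodied AI and argues that synthetic data and simulation environments are the key to solving the data shortage.
  • 34:41 · Zero-shot Generalization
    • Stresses that zero-shot capability is the core metric for judging whether next-generation robot models have general intelligence.
  • 40:16 · The Relationship Between Large Models and VLA (Vision-Language-Action) Models
    • Analyzes how cloud-side world models (large models) and on-device robot brains (VLA) cooperate and divide the work.
  • 48:27 · Future Industry Landscape: Brain Companies and Body Companies
    • Predicts that the robotics industry will split into companies that provide a general-purpose brain and companies that provide the hardware body, and discusses Tesla's data closed-loop advantage.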
  • 50:00 · Data Architecture for Embodied AI
    • Discussing the challenges of collecting real-world data for robots and the need for a multi-layered data architecture.
  • 52:15 · The Evolution of Data Companies
    • How data companies are shifting from static datasets to symbiotic partnerships with AI model developers.
  • 56:20 · LLMs vs. Robotics Data Needs
    • Comparing the data bottlenecks of LLMs (post-training) with robotics (pre-training and physical grounding).
  • 58:50 · The Importance of Evaluation and Shadow Mode
    • Why free evaluation mechanisms like ‘shadow mode’ in autonomous driving are crucial but missing in robotics.
  • 1:02:55 · Digital Agents vs. Physical Robots
    • Comparing the data and environment needs of software agents with physical embodied robots.
  • 1:06:55 · Three Phases of the AI Data Industry
    • Tracing the evolution from ImageNet to autonomous driving data, and finally to RLHF for LLMs.
  • 1:14:45 · Simulation as a Necessity
    • Arguing that simulation is not optional but a strict necessity for scaling robotics data and evaluation.
  • 1:20:30 · Real Data vs. Simulation Camps
    • Analyzing the shift in the robotics industry from relying solely on real data to embracing simulation.
  • 1:31:00 · Waymo vs. Tesla Approaches in Robotics
    • Comparing scenario-specific approaches (Waymo) with generalized brain approaches (Tesla) for future robotics.
  • 1:40:00 · Autonomous Driving vs. Embodied AI
    • Comparing the data approaches of autonomous driving companies like Tesla and Waymo to the challenges of embodied AI.
  • 1:41:40 · The Data Pyramid for Embodied AI
    • Introduction of the ‘Data Pyramid’ concept, consisting of real robot data, simulation data, and human/internet data.
  • 1:44:30 · Scaling Laws in Embodied AI
    • Discussing how recent projects like GR00T and UMI prove that scaling laws apply to embodied AI data.
  • 1:50:30 · Valuing Different Data Types
    • Analyzing why real robot data is currently overvalued, while simulation and human first-person data are undervalued.
  • 1:55:30 · The Cost of Data and Data Factories
    • Exploring the pricing of different data types and the evolution from manual data labeling to automated ‘Data Engines’.
  • 2:02:50 · Building a Data Engine
    • Detailing the internal workings of a data engine, emphasizing the critical role of simulation and real-world evaluation.
  • 2:16:20 · The Competitive Landscape
    • Predicting the industry structure: Big model companies will dominate the ‘brain’, while robotics companies focus on the ‘body’.
  • 2:28:50 · Defining New AI Paradigms
    • Clarifying the differences between Physical AI, Spatial Intelligence, and World Models.
  • 2:30:00 · The Evaluation Bottleneck
    • The guest argues that evaluation is currently the most critical bottleneck for AGI development.
  • 2:30:45 · Challenges with LLM Evaluation
    • Discussion on how evaluating Large Language Models requires increasingly capable humans to provide feedback.
  • 2:31:30 · The Future of the Data Problem
    • The guest predicts that the data problem will eventually become irrelevant as AI shifts towards self-learning.
  • 2:32:45 · Simulation and the End of Data Factories
    • The conversation covers how simulation environments will replace traditional data factories for AI training.
  • 2:34:59 · Einstein’s Thought Experiments as AI Analogy
    • The guest compares future AI simulation training to Einstein’s mental thought experiments.

Specific Numbers (15)

Time | Fact | Value | Context
02:15 | Joined Cruise | 2018 | Xie Chen joined Cruise, a Silicon Valley autonomous driving company, where he first encountered and validated autonomous driving simulation.
02:53 | Joined Nvidia | 2021 | He joined Nvidia, where he was responsible for autonomous driving simulation, and found that Chinese automakers were its largest customers, which prompted his decision to return to China.
04:34 | Founded Light Wheel AI | 2023 | He and his co-founders started Light Wheel AI (光轮智能) to accelerate the robotics industry with synthetic data.
05:16 | Physics ranking at Peking University | 110th | Xie Chen mentions that his grades ranked near the bottom of the Peking University physics department, which made him realize he lacked top-tier talent in physics.
08:47 | Financial Crisis | 2008 | He lived through the financial crisis during an exchange at Columbia, an experience that led him to consider different life paths.
10:11 | Apps downloaded for research | 500+ | To build the pet social app, he downloaded and studied the UI/UX design of more than 500 apps.
11:11 | Duration of first startup | 3 years | His first startup, the pet app, ran for about three years until he shut it down before finishing his PhD.
50:04 | Number of robots | Millions | The world does not currently have millions of robots deployed to collect data.
1:01:01 | Data readiness score | 60 vs 0.6 | LLM data readiness might be at 60 points, while robotics data is below 0.6 points.
1:04:18 | RLHF emergence | 2021-2022 | The timeframe when RLHF became prominent for large models.
1:44:55 | Amount of UMI gripper data used | 270,000 hours | Cited as evidence that scaling laws are beginning to work in embodied AI.
1:57:22 | Cost range for embodied AI data | Tens to thousands of RMB per hour | Explaining that data pricing varies wildly based on quality, complexity, and whether it’s for pre-training or evaluation.
2:09:59 | Size of the guest’s engineering team | 100+ people | Highlighting the engineering effort required to build a robust data engine and simulation platform.
2:19:19 | Timeline for big tech shifting to embodied AI | 3 to 6 months ago | Noting when major AI companies started seriously allocating resources to embodied AI.
2:31:37 | Timeframe for data problem irrelevance | 15 to 20 years | The guest estimates it might take 15 or 20 years for the data problem to become completely unimportant for AI.

Research Claims & Predictions (14)

  • [04:20] Simulation is a prerequisite, not just an accelerator, for robotics.
    • evidence: Based on his experience transitioning from autonomous driving to robotics, he realized that without simulation, the robotics industry cannot scale due to the lack of real-world data.
  • [18:13] Synthetic data via simulation is the only viable path for scaling robotics.
    • evidence: Real-world data collection for robots is too slow and lacks the necessary corner cases and failure-to-success trajectories needed for training robust models.
  • [33:08] The most effective training data is ‘failure-then-success’ data.
    • evidence: Models learn best not just from perfect demonstrations, but from seeing a mistake being made and then corrected, which is easily generated in simulation but hard to capture in reality.
  • [34:41] Zero-shot capability is the defining metric for next-gen robotics models.
    • evidence: If a model cannot generalize to unseen tasks or environments (zero-shot), it is not truly intelligent and cannot scale across different robotic form factors.
  • [51:20] Tesla’s data closed loop will not apply to the broader robotics industry.
    • evidence: The brain (model) and body (hardware) will be separated, with different companies specializing in each.
  • [1:14:45] Simulation is a strict necessity for robotics.
    • evidence: Real-world data collection is too expensive and unscalable for the evaluation and training needs of generalized robots.
  • [1:42:30] Without simulation and human data, embodied AI cannot achieve general intelligence.
    • evidence: Real robot data is too hard to scale; simulation and human data are necessary to bridge the gap.
  • [1:51:50] Human first-person perspective data is currently undervalued but is crucial for teaching robots.
    • evidence: Robots need to learn from human actions, and first-person video (like from smart glasses) provides the best cross-embodiment learning signal.
  • [2:02:00] Data companies must evolve from ‘Data Factories’ to ‘Data Engines’.
    • evidence: Manual labeling is insufficient; the future requires automated, feedback-driven systems integrating simulation and real-world testing.
  • [2:22:30] Big model companies (OpenAI, DeepMind) will likely win the ‘brain’ race for embodied AI.
    • evidence: They have the resources, talent, and scaling infrastructure, while hardware companies will likely focus on building the best physical bodies.
  • [2:30:04] Evaluation is the most critical problem for AGI right now.
    • evidence: Current pre-training and scaling laws are established, making evaluation the true bottleneck for measuring intelligence improvements.
  • [2:31:30] The data problem will eventually become unimportant as AI shifts to self-learning.
    • evidence: Similar to highly capable humans, advanced AI will stop learning from external sources and start competing with itself to generate new knowledge.
  • [2:32:31] Data factories will become obsolete.
    • evidence: They will be replaced by system-driven, evaluation-centric environments that help models find problems and improve through feedback.
  • [2:34:00] Future AI will rely heavily on Reinforcement Learning (RL) in simulation environments.
    • evidence: Models will train their ‘internal skills’ (内功) by interacting with simulated environments rather than just consuming static data.
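
The ‘failure-then-success’ claim at [33:08] can be made concrete with a small sketch. The episode does not describe how such data is identified, so everything here is an assumption made for illustration: an episode is represented as a list of per-attempt success flags, and the hypothetical helpers `is_failure_then_success` and `select_recovery_episodes` keep only episodes that contain at least one failed attempt and end in success.

```python
from typing import Dict, List, Sequence


def is_failure_then_success(outcomes: Sequence[bool]) -> bool:
    """Return True if an episode has at least one failed attempt and ends in success.

    `outcomes` is a hypothetical representation: one success flag per attempt,
    in the order the attempts happened.
    """
    if len(outcomes) < 2:
        # A single attempt cannot contain both a failure and a later correction.
        return False
    ends_in_success = bool(outcomes[-1])
    has_failure = not all(outcomes)
    return ends_in_success and has_failure


def select_recovery_episodes(episodes: Dict[str, Sequence[bool]]) -> List[str]:
    """Return the ids of episodes that show a mistake followed by a recovery."""
    return [
        episode_id
        for episode_id, outcomes in episodes.items()
        if is_failure_then_success(outcomes)
    ]


if __name__ == "__main__":
    # Illustrative data only; these ids and outcomes are made up.
    demo = {
        "clean_success": [True],
        "recovered": [False, True],
        "never_recovered": [False, False],
    }
    print(select_recovery_episodes(demo))
```

In a simulator, episodes of this kind could be produced on demand by injecting a perturbation and letting the policy retry, which is the reason the claim ties this data type to simulation.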

Key Concepts (17)

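  • [16:27] Simulation (仿真)
    • Building a digital twin of the physical world in a virtual environment, used to generate synthetic data for training and testing AI algorithms, especially in autonomous driving and robotics.
  • [27:21] Synthetic Data (合成数据)
    • Data generated artificially by computer algorithms (such as simulation engines or generative AI) rather than collected from the real world, used to address real-data scarcity and long-tail scenarios.
  • [34:41] Zero-shot Generalization (零样本泛化)
    • The ability of a machine learning model to handle a task correctly without having seen training samples for that specific task or scenario.
  • [39:13] VLA (Vision-Language-Action) Models
    • A multimodal model that understands visual input and language instructions and directly outputs physical action control commands for a robot.
  • [40:16] World Models (世界模型)
    • AI models that understand and predict how the physical world behaves, typically deployed in the cloud to give on-device robots common sense and high-level planning.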
  • [50:00] Embodied AI (具身智能)
    • AI systems that interact with the physical world through a robotic body.
  • [58:50] Shadow Mode (影子模式)
    • Running an AI model in the background of a real system to evaluate its decisions against human actions without controlling the system.
  • [1:14:45] Simulation (仿真)
    • Creating virtual environments to train and evaluate robots safely and cheaply before real-world deployment.
  • [1:41:40] Data Pyramid (数据金字塔)
    • A framework for embodied AI data consisting of three layers: real robot data (top, high quality but hard to scale), simulation data (middle, scalable but has a reality gap), and human/internet data (bottom, massive scale but requires cross-embodiment transfer).
  • [1:43:45] Sim-to-Real (仿真到现实)
    • The process of training AI models in a simulated environment and then transferring those learned capabilities to operate successfully in the real physical world.
  • [2:02:00] Data Engine (数据引擎)
    • An automated, closed-loop system for generating, evaluating, and refining data, contrasting with a traditional ‘data factory’ that relies heavily on manual human labeling.
  • [2:28:50] Physical AI (物理世界AI)
    • Artificial intelligence systems designed to understand, interact with, and operate within the physical world, encompassing both autonomous vehicles and robotic systems.
  • [2:30:06] Evaluation (评测)
    • The process of assessing and measuring the capabilities of AI models, which becomes harder as models get smarter.
  • [2:32:50] Simulation (仿真)
    • Creating virtual environments where AI can test hypotheses, learn from trial and error, and generate its own data.
  • [2:33:06] Data Factory
    • Large-scale operations where humans manually annotate and generate data to train AI models.
  • [2:34:05] RL (Reinforcement Learning)
    • A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize rewards.
  • [2:35:22] Thought Experiment (思想实验)
    • Mental simulations used by scientists like Einstein to explore physical laws, analogous to how AI will use simulated environments to learn.
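
The Shadow Mode definition at [58:50] can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any production system: the `Frame` record, the scalar action, the `tolerance` threshold, and the `shadow_disagreement_rate` function are all hypothetical names introduced here. The candidate policy's output is only compared against what the human did and is never sent to the actuators.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Frame:
    """One logged time step (hypothetical schema)."""
    observation: float   # stand-in for sensor input
    human_action: float  # the action the human operator actually took


def shadow_disagreement_rate(
    policy: Callable[[float], float],
    frames: Iterable[Frame],
    tolerance: float = 0.1,
) -> float:
    """Fraction of frames where the shadow policy differs from the human.

    The policy runs in the background: its proposed action is computed and
    scored, but it never controls the system.
    """
    frames = list(frames)
    if not frames:
        return 0.0
    disagreements = 0
    for frame in frames:
        proposed = policy(frame.observation)  # computed, not executed
        if abs(proposed - frame.human_action) > tolerance:
            disagreements += 1
    return disagreements / len(frames)


if __name__ == "__main__":
    # Illustrative log and policy only.
    log = [Frame(1.0, 0.5), Frame(2.0, 1.0), Frame(2.0, 2.0), Frame(0.0, 0.05)]
    rate = shadow_disagreement_rate(lambda obs: 0.5 * obs, log)
    print(f"disagreement rate: {rate:.2f}")
```

Frames where the policy and the human disagree are the interesting ones to review, which is why the episode describes shadow mode as a nearly free evaluation signal in autonomous driving that robotics still lacks.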

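People Mentioned (7)

  • Elon Musk — Quoted in the opening teaser as saying that humans may be living in a simulation.
  • Warren Buffett — Used as an example of someone lucky enough to discover his talent and passion very early.
  • Lang Lang — Used as an example of someone who discovered his talent for the piano very early.
  • Jensen Huang — Nvidia CEO; Xie Chen mentions having spoken with him and learning how much importance Nvidia places on robotics simulation.
  • Fei-Fei Li (李飞飞) — Stanford professor, mentioned for creating ImageNet, for her recent work on the embodied AI evaluation benchmark (Behavior), and in relation to data scaling in AI.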
  • Jim Fan — Mentioned as a key figure at Nvidia driving their embodied AI and simulation efforts.
  • Albert Einstein — Used as an analogy; his thought experiments are compared to the simulation environments needed for advanced AI.

Companies Mentioned (13)

Jet.com · Cruise · Waymo · Nvidia · NIO (蔚来) · Scale AI · OpenAI · DeepMind · Tesla · xAI · Meta · AgiBot (Zhiyuan, 智元) · Unitree (Yushu, 宇树)

Notable Quotes (11)

Actually, the most effective data is data that fails first and then succeeds. — Xie Chen @ 00:52

Simulation is a toy, or rather, it is more of a demo to show investors. — Xie Chen @ 12:38

He firmly believes that synthetic data and simulation are the only path to truly getting robots deployed around the world. — Xie Chen @ 18:13

Data is to intelligence a bit like what acquiring knowledge is to people: a way to keep improving yourself. — Xie Chen @ 20:54

If LLMs are at 60 points, robotics data is not even at 0.6 points. — Xie Chen @ 1:01:01

Simulation is a necessity for robotics. Without it, it definitely won’t work. — Xie Chen @ 1:14:45

If there is no data pyramid, if there is no simulation and human data below it, I think the general intelligence of embodied AI will not emerge. — Xie Chen @ 1:42:30

The ideal state is that people just like wearing these glasses, not that people wear these glasses to collect data for robots. — Xie Chen @ 1:52:15

We are more like a data engine… a data factory is a bit like an assembly line, lacking technology and systems, and it’s not feedback-driven. — Xie Chen @ 2:02:00

The more outstanding a person is, the more they want to improve themselves; they simply shift from learning from others to benchmarking against themselves. — Xie Chen @ 2:31:52

It’s like Musk said: we humans may be living inside a simulation. — Xie Chen @ 2:32:47

Career Arc & Personal Stories (3)

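  • [01:07] Xie Chen studied physics as an undergraduate at Peking University. After finding that he lacked talent in physics, he went to Columbia University for a PhD in quantitative finance. Living through the financial crisis there made him aware of the limits of the finance industry, and he decided to move into tech. After graduating he joined Jet.com to work on algorithms, then entered autonomous driving, working on simulation at Cruise, Nvidia, and NIO in turn, before founding Light Wheel AI (光轮智能) in 2023 to focus on synthetic data for embodied AI.
  • [07:49] During his PhD at Columbia, Xie Chen had a dog named “Tudou” (土豆, “Potato”). To give dog owners a place to connect, he taught himself programming and design and built a pet social app. The app gained some users, but without a clear business model he gave it up before graduating. The experience taught him the importance of a business model and a technical moat.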
  • [1:38:45] The guest recalls joining Cruise and focusing entirely on making autonomous driving work in San Francisco before expanding to other cities, illustrating a scenario-first approach.

Tools & Models Discussed (8)

  • ImageNet: A large-scale visual database created by Fei-Fei Li's team that opened the way for deep learning breakthroughs in computer vision. As a foundational static dataset, it kicked off the first wave of the AI data industry and represents the early stage of static data labeling.
  • GPT (Generative Pre-trained Transformer): Large language models that represent the stage in which intelligence emerged from pre-training on massive internet text plus reinforcement learning from human feedback (RLHF).
  • VLA (Vision-Language-Action): A core model architecture in embodied AI that integrates visual perception and language understanding to output physical actions, serving as the robot's ‘cerebellum’.
  • Optimus: Tesla’s humanoid robot, used as an example of embodied AI hardware.
  • GR00T: A general-purpose foundation model for humanoid robot learning developed by Nvidia.
  • UMI: A data collection framework (Universal Manipulation Interface) that uses human demonstrations with grippers to train robots.
  • GPT-2: Mentioned as an analogy for the current stage of embodied AI development: finding the right ‘recipe’ before massive scaling.
  • Large Language Models (大语言模型): AI models trained on vast amounts of text data, currently facing bottlenecks in evaluation and high-quality human feedback.

Topics

Autonomous Driving Simulation · Embodied AI (具身智能) · Synthetic Data (合成数据) · Evolution of Data Labeling · Zero-shot Generalization · Combining Large Models with Robotics · Tech Entrepreneurship Path · Data Scaling · Simulation · RLHF · Autonomous Driving · AI Data Industry · Data Generation and Scaling · Simulation and Sim-to-Real · Human-in-the-loop Data · AI Industry Landscape · AGI (Artificial General Intelligence) · AI Evaluation Bottlenecks · Simulation Environments · Reinforcement Learning (RL) · Self-learning AI · The Future of Data Annotation

Takeaways

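  • In the era of embodied AI, real-world data collection lags far behind what models need; synthetic data and simulation environments are the only viable path to resolving the data bottleneck.
  • The most effective training data includes not only successful demonstrations but also the process of correcting a failure (‘failure-then-success’ data), which is easier to generate in simulation.
  • The core competitive edge of next-generation robot models is zero-shot generalization: performing unseen tasks in unseen scenarios.
  • The future robotics industry may split into software companies that provide a general-purpose ‘brain/cerebellum’ and hardware companies that focus on specific ‘body’ form factors.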
  • The data architecture for robotics will likely separate the ‘brain’ (trained by big model companies) from the ‘body’ (built by hardware companies).
  • Robotics currently lacks the massive pre-training data and free evaluation mechanisms (like shadow mode) that fueled LLM and autonomous driving breakthroughs.
  • Simulation is not just an option but a strict necessity for the future of robotics to solve the data and evaluation bottlenecks.
  • The ‘Data Pyramid’ (real, sim, human data) is essential for scaling embodied AI.
  • Simulation is the core loop for evaluating and training robot models.
  • Human first-person data is currently undervalued but critical for cross-embodiment learning.
  • The industry will likely bifurcate: Big tech will build the ‘brains’ (foundation models), while robotics companies will focus on the ‘bodies’ (hardware).
  • Evaluation is currently the biggest bottleneck in AGI development, as smarter models require even smarter humans to assess them.
  • The reliance on human-generated data (Data Factories) will decrease as AI models become capable of self-learning.
  • Future AI training will heavily rely on Reinforcement Learning within complex simulated environments, akin to Einstein’s thought experiments.