Yao Shunyu: Training Models at Anthropic & Gemini

Duration: 228 min

Guest: Shunyu Yao 姚顺宇 · Researcher at Google DeepMind (formerly Anthropic)

Chapters (42)

00:00 · Introduction
- Host Xiaojun Zhang introduces the guest, Shunyu Yao, and sets the context of the interview.
01:27 · Guest Background & The ‘Two Shunyu Yaos’
- Shunyu Yao discusses his academic background in physics and the coincidence of sharing a name with another prominent AI researcher.
05:36 · The ‘Second Half’ of AI & Model Homogenization
- Discussion on how AI has entered a phase where capabilities are homogenizing, and the challenge shifts from ‘can it be done’ to ‘what should be done’.
11:14 · OpenClaw and Manus
- Analysis of new AI agent products like OpenClaw and Manus, and why their approaches were already anticipated within big tech.
16:43 · Startups vs. Big Tech & Data Flywheels
- Exploration of why AI startups are selling to big tech and the lack of successful data flywheels outside of coding applications.
25:23 · Cursor and the Evolution of Coding Agents
- The competitive dynamics between AI coding startups like Cursor and big tech companies like Anthropic.
33:42 · Scaling Laws and Potential Bottlenecks
- Shunyu Yao shares his thoughts on whether scaling laws will hit a wall and the potential bugs or data limits in model training.
46:40 · The Future of Programmers
- Predictions on how AI will impact the software engineering profession and centralize technological power.
50:10 · ByteDance vs. Google and Gemini Evaluation
- Discussing ByteDance’s multi-modal capabilities and praising the technical strength of Google’s Gemini team.
53:31 · US-China AI Gap and Model Distillation
- Analyzing the narrowing gap between US and Chinese models, and the ethics and methods of model distillation.
57:50 · Doubao’s Voice Capabilities and Apple’s Strategy
- Evaluating Doubao’s superior voice generation and Apple’s pragmatic approach to integrating AI.
1:04:08 · The Bottleneck in AI Robotics
- Exploring why robotics hardware is cheap but software lacks generalizable, foundational models.
1:08:46 · Early Life and the Tsinghua Text Message
- Yao shares his journey from Ningxia to Shanghai, and how a bold text message got him into Tsinghua.
1:16:30 · Undergrad Physics and Non-Hermitian Systems
- Researching open quantum systems at Tsinghua and the realization that most physics students leave the field.
1:26:26 · Stanford PhD and Leaving High-Energy Physics
- Switching to high-energy physics at Stanford, only to leave because the theories couldn’t be experimentally verified.
1:40:00 · Transition from Physics to AI
- The guest discusses his background in theoretical physics and his decision to pivot to AI research due to the slow pace of physics experiments.
1:44:22 · AI as a Black Box and Scaling Laws
- Exploring whether AI is a black box, comparing it to physics, and discussing how empirical scaling laws might evolve into scientific laws.
1:52:33 · Joining Anthropic
- How the guest joined Anthropic through physics connections and his initial impressions of the company’s RL team.
1:56:41 · Anthropic’s Culture and Claude 3.7
- Insights into Anthropic’s top-down culture, execution speed, and the development of Claude 3.7 for agentic coding.
2:11:39 · Moving to Google
- The guest explains his reasons for leaving Anthropic for Google and compares their organizational structures and cultures.
2:18:40 · The End of the Individual Hero Era in AI
- Reflections on how AI research has become a massive systems engineering effort rather than a field driven by individual breakthroughs.
2:30:00 · The Collectivist Nature of AI Research
- The guest explains that modern AI research is a collective effort rather than driven by individual heroes.
2:32:30 · Anthropic’s Approach to AI Safety
- Discussion on Anthropic’s founding motivations and why their approach to enforcing AI safety might be naive.
2:34:30 · AI Automating AI Research
- The guest predicts that AI will soon be able to conduct its own machine learning experiments from start to finish.
2:37:30 · Leaving Anthropic and Product Strategy
- The guest reflects on his departure from Anthropic and praises their recent product innovations like Claude Code.
2:41:10 · Work at Google DeepMind: ML Coding and Long Horizon
- Insights into the guest’s work at DeepMind focusing on ML coding and long-horizon tasks.
2:44:30 · Pre-training vs. Post-training
- An analysis of how long-context capabilities are achieved, contrasting pre-training and post-training methods.
2:47:20 · The Impact of Gemini and Market Dynamics
- The guest discusses the expectations for Gemini and how OpenAI’s aggressive product strategy disrupted the market.
2:52:00 · Search vs. Chatbots: Google’s Dilemma
- An exploration of Google’s innovator’s dilemma regarding search revenue and the rise of AI chatbots.
2:56:00 · Reinforcement Learning and Data Quality
- The differences between pre-training and post-training data, and the role of RL in model development.
3:00:00 · The State of AI Benchmarks
- A critique of current AI benchmarks like SWE-bench and how they are becoming saturated.
3:04:30 · Organizational Structures: Google vs. Anthropic
- A comparison of the top-down vs. bottom-up research cultures at major AI labs.
3:10:00 · World Models and Future AI Systems
- Discussion on what constitutes a ‘world model’ and the differing approaches to building them.
3:15:00 · Leadership at Google
- The guest shares his perspective on Google’s leadership, including Sergey Brin, Demis Hassabis, and Koray Kavukcuoglu.
3:20:00 · Technical Leadership
- Discussion on the importance of technical leaders having hands-on problem-solving skills and empathy.
3:21:18 · TPU vs GPU Architecture
- Comparison of TPU’s 3D torus topology for large-scale clusters versus GPU’s pod-based NVLink architecture.
3:23:33 · The Fate of New AI Labs
- Prediction that most newly formed AI labs spinning out of big tech will fail.
3:24:15 · Enterprise vs Consumer AI Markets
- Analysis of why the US excels in direct enterprise (B2B) software while China dominates complex consumer (B2C) products.
3:28:30 · AI Researcher Salaries and Heroes
- Thoughts on the inflated salaries of AI researchers and the end of the ‘individual hero’ era in AI.
3:31:13 · Interviewing AI Talent
- The guest shares his 24-hour reinforcement learning interview test to filter out candidates who rely too heavily on AI coding tools.
3:36:07 · Anthropic vs Google DeepMind
- Contrasting Anthropic’s focused, vertical approach to language models with Google’s horizontal, multi-directional research.
3:37:35 · Physicists in AI and ‘Old Lamps’
- Reflections on physicists transitioning to AI and the phenomenon of ‘old lamps’ (out-of-touch senior figures).

Specific Numbers (20)

Time	Fact	Value	Context
02:08	Postdoc duration	2 weeks	The amount of time Shunyu Yao spent as a postdoc at UC Berkeley before leaving for Anthropic.
02:18	Tenure at Anthropic	1 year	The duration Shunyu Yao worked at Anthropic before joining Google DeepMind.
04:51	Job transition date	September/October last year	The timeframe when Shunyu Yao left Anthropic to join Google DeepMind.
07:55	Benchmark scores	Around 80%	The current performance level of top AI models on coding benchmarks like SWE-bench.
34:02	Scaling law prediction horizon	4 months	The timeframe in which Yao predicts we will not see the definitive ‘end’ of scaling laws.
39:43	AI code generation	90%	The estimated percentage of code that AI will eventually be able to write for developers.
53:33	US-China gap reference point	Q1 2026	A hypothetical timeframe discussed regarding the AI capability gap between the US and China.
53:46	Gap narrowing timeframe	1 to 1.5 years	The observed trend of the AI gap between China and the US shrinking over the past year and a half.
1:01:40	Physics class attrition	2/3	The estimated proportion of students in his Tsinghua physics class who did not end up pursuing physics long-term.
1:54:44	Timeframe of joining Anthropic’s RL team	August/September 2024	Before OpenAI’s o1 model was released.
1:58:13	Anthropic employee count when joining	700-800	The size of the company when the guest first joined.
2:06:44	Claude 3.7 development time	4-5 months	The time it took from starting training to release.
2:21:44	Anthropic employee count when leaving	Nearly 2000	The size of the company when the guest left.
2:36:36	Timeframe for AI to conduct its own experiments	6 to 12 months	The guest predicts that within this timeframe, AI will be able to write code, run experiments, analyze results, and iterate autonomously.
2:53:31	Guest’s high expectations for Gemini	End of September last year	The guest mentions having high expectations for Gemini around this time.
2:57:46	SWE-bench scores	80+	The guest notes that models are now hitting scores in the 80s on coding benchmarks, indicating saturation.
3:02:13	Gemini market share estimate	20%	The guest estimates Gemini’s current market share in the chatbot space.
3:22:24	Hopper GPU pod size	8 cards	Mentioned as the typical number of cards in a pod connected by NVLink before needing external networking.
3:25:35	Enterprise software economics	Cost 150, sell for 200	Used as an example of the straightforward, direct monetization model of US enterprise software.
3:31:49	Interview test duration	24 hours	The time given to candidates to complete a reinforcement learning project from scratch.

Research Claims & Predictions (15)

[05:36] AI has entered a phase where the primary challenge is defining the right problems, not whether the model can solve them.
- evidence: Current industry state where top models (OpenAI, Anthropic, Google) are highly capable and homogenized.
[15:28] Coding is currently the only AI-native application scenario that has successfully formed a data flywheel.
- evidence: Current market observation; other AI applications have not yet achieved this self-reinforcing loop.
[34:00] Scaling laws may face bottlenecks due to data exhaustion, algorithmic bugs, or fundamental limits of the current paradigm.
- evidence: Researchers are already debating whether scaling laws are reaching their limit; the question is expected to play out over the next few years.
[47:15] AI will act as a centralized technology that empowers a small group of people while diminishing the unique value of the majority.
- evidence: Long-term societal and industry impact.
[53:55] China’s main AI bottleneck is compute, not algorithmic innovation.
- evidence: The gap in model capabilities is narrowing, but the lack of compute forces Chinese companies into distillation rather than scaling.
[1:05:36] Robotics lacks a generalizable foundational model.
- evidence: Current robotics rely too heavily on specific reinforcement learning environments rather than a generalized ‘Vision-Language-Action’ model.
[1:29:00] High-energy theoretical physics is currently disconnected from experimental verification.
- evidence: Theories developed in this field cannot be tested with current or near-future colliders, making it difficult to prove their validity.
[1:46:01] Scaling laws are currently empirical but may become scientific laws.
- evidence: As technology stabilizes and microscopic mechanisms are better understood, empirical laws often transition to scientific ones, similar to the history of thermodynamics.
[2:21:01] The era of individual heroes in AI research is over.
- evidence: Modern AI training is a massive systems engineering problem requiring large teams, making individual contributions less dominant.
[2:36:36] AI will fully automate the machine learning research pipeline.
- evidence: Within 6 to 12 months, AI will be able to write code, execute experiments, analyze results, and propose new hypotheses.
[2:43:00] Long horizon tasks require selective memory retrieval, not just infinite context windows.
- evidence: The guest argues that humans forget irrelevant details and retrieve necessary context, which is a more efficient approach for AI than processing infinitely long contexts.
[2:46:37] Post-training is the key to unlocking long-context capabilities.
- evidence: While pre-training requires massive data, post-training allows models to learn how to manage and utilize long contexts effectively with less data.
[3:22:33] TPUs offer better large-scale communication efficiency than GPUs due to their 3D torus design.
- evidence: TPUs connect in a 3D torus, reducing communication bounds across massive clusters compared to GPU pods.
[3:23:46] The vast majority of new AI labs will die.
- evidence: Many lack a clear purpose or product delivery mechanism, merely spinning out of big tech without a solid business plan.
[3:33:47] Pure language modeling is no longer a blue ocean.
- evidence: The field is saturated, and the next big opportunities lie in robotics, multimodal generation, and AI for science.
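
The scaling-law claims above refer to an empirical power-law relationship, commonly written as loss ≈ a · C^(−b) for training compute C. The following is a minimal sketch of how such a law is fitted and extrapolated, assuming that simple functional form. The data points and the constants a = 10, b = 0.05 are synthetic and purely illustrative; none of them come from the episode or from any real training run, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares on log-log data.

    Taking logs turns the power law into a straight line:
    log(loss) = log(a) - b * log(compute).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic, illustrative data generated from a = 10, b = 0.05.
# These are NOT real measurements.
compute = [1e18, 1e19, 1e20, 1e21, 1e22]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate one order of magnitude beyond the fitted range.
predicted = a * 1e23 ** -b
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e23: {predicted:.4f}")
```

The fit is purely descriptive, which is the sense in which the guest calls scaling laws empirical: the line is trusted only as far as the measured points support it, and nothing in the procedure explains why the exponent takes the value it does.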

Key Concepts (19)

[05:36] Model Homogenization
- The phenomenon where top-tier AI models from different companies achieve very similar capabilities and benchmark scores.
[11:14] Agentic AI / Agents
- AI systems designed to execute long-horizon tasks, make decisions, and interact with environments autonomously.
[15:28] Data Flywheel
- A self-reinforcing cycle where a product’s usage generates data, which improves the AI model, which in turn attracts more usage.
[33:42] Scaling Laws
- The empirical observation that AI model performance improves predictably as compute, data, and model size increase.
[54:59] Hard Distillation
- Generating tokens from a superior model (like GPT-4) and directly training a smaller model on that data, which the guest views as unoriginal.
[56:00] Smart Distillation
- Using a superior model as an evaluator or integrating multiple models into a multi-agent training environment to generate higher-quality synthetic data.
[1:22:00] Non-Hermitian Systems
- A branch of quantum physics dealing with open systems that interact and exchange energy/information with their environment, rather than being isolated.
[1:31:16] Quantum Entanglement
- A phenomenon where quantum particles share a joint state, so that measurement outcomes on one are correlated with outcomes on the other regardless of distance, without allowing any signal to be sent between them.
[1:46:01] Scaling Laws
- Empirical observations showing that AI model performance predictably improves as compute, data, and model size increase.
[1:54:37] RLHF (Reinforcement Learning from Human Feedback)
- A technique to align AI models with human preferences, which the guest worked on at Anthropic.
[2:06:28] Agentic Coding
- The ability of an AI model to autonomously write, debug, and execute code to solve complex software engineering tasks.
[2:09:24] Policy Gradient
- A foundational family of reinforcement learning algorithms that update a model’s policy directly along the gradient of expected reward, mentioned as a basic but crucial component of training.
[2:30:00] Collectivism in AI Research
- The idea that modern AI breakthroughs require large teams working cohesively towards a single goal, rather than individual ‘hero’ researchers.
[2:34:30] ML Coding
- The process of using AI models to write, execute, and debug machine learning code autonomously.
[2:41:10] Long Horizon Tasks
- Tasks that require an AI agent to operate over extended periods, necessitating advanced memory management and selective retrieval.
[2:44:30] Pre-training vs. Post-training
- Pre-training involves feeding massive amounts of raw data to a model, while post-training refines the model’s behavior and capabilities using targeted data and reinforcement learning.
[3:10:00] World Models
- AI systems designed to understand, simulate, and predict the physical world and its dynamics.
[3:22:46] 3D Torus Topology
- A network architecture used in Google’s TPUs that connects chips in a three-dimensional grid with wraparound links on each axis, optimizing large-scale cluster communication.
[3:39:52] Old Lamps (老灯)
- A slang term for senior industry figures who are out of touch with modern technology but still try to micromanage and dictate direction.
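
To make the Policy Gradient entry above concrete, here is a minimal sketch of the basic REINFORCE update on a two-armed bandit. It illustrates the general technique only and is not the training setup used at Anthropic or Google, which the episode does not describe. The reward probabilities, learning rate, step count, and the names `softmax` and `train_bandit` are all hypothetical choices made for this example.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(reward_probs, steps=5000, lr=0.1, seed=0):
    """REINFORCE with a running-average baseline on a multi-armed bandit."""
    rng = random.Random(seed)
    logits = [0.0] * len(reward_probs)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_i
        # equals 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        # Slowly track the average reward to reduce gradient variance.
        baseline += 0.01 * (reward - baseline)
    return softmax(logits)


# Hypothetical bandit: arm 0 pays out 20% of the time, arm 1 pays out 80%.
final_policy = train_bandit([0.2, 0.8])
print(final_policy)
```

The policy is never told which arm is better. It only raises the probability of actions that earned more reward than the running baseline, which is the core idea that larger RL training pipelines build on.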

People Mentioned (18)

Shunyu Yao (The other one) — Another AI researcher with the exact same name who worked at Tencent and OpenAI, causing industry confusion.
Chen-Ning Yang — Nobel laureate who founded the advanced physics institute at Tsinghua University.
Zhong Wang — Yao’s undergraduate research advisor at Tsinghua University.
Shoucheng Zhang — Prominent physicist and Yao’s PhD advisor at Stanford University.
Dario Amodei — CEO of Anthropic, mentioned as a key decision-maker in their top-down structure.
Jared Kaplan — Co-founder of Anthropic, involved in technical leadership and scaling laws.
Sam McCandlish — Co-founder of Anthropic, involved in technical leadership.
Ilya Sutskever — Former Chief Scientist at OpenAI, mentioned in the context of technical leadership and decision-making.
Tom Brown — Researcher mentioned as being on key papers like GPT-3.
Ben Mann — Researcher mentioned as being on key papers like GPT-3.
Boris — A researcher at Anthropic who was instrumental in developing Claude Code.
Sergey Brin — Google co-founder, described as the ultimate decision-maker and ‘hero’ behind major AI pushes at Google.
Demis Hassabis — CEO of Google DeepMind, mentioned in the context of leadership.
Koray Kavukcuoglu — CTO of Google DeepMind, seen as the primary technical leader on the ground.
Fei-Fei Li — Mentioned in the context of differing approaches to building world models.
F.D.M. Haldane — Nobel laureate in physics, mentioned as an example of a visionary scientist who pushed topological concepts decades before they became mainstream.
Geoffrey Hinton — Mentioned as a potential ‘hero’ figure in AI who persisted in his research direction for decades.
Noam Shazeer — Co-author of the Transformer paper, cited as part of the ‘hero collective’ of modern AI.

Companies Mentioned (12)

Google DeepMind · Tencent · Anthropic · OpenAI · Meta · Cursor · GitHub · ByteDance · Apple · Amazon · DeepSeek · Google

Notable Quotes (14)

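AI doesn’t really take that much brainpower to begin with… the most important trait is being reliable. — Shunyu Yao @ 01:05

I think AI has entered its second half… people no longer worry about whether AI can do something; they worry about how to define the problem well. — Shunyu Yao @ 05:48

So far, no scenario has truly formed a data flywheel. Even among purely AI-native applications, nothing besides writing code has become very successful. — Shunyu Yao @ 15:28

AI is a very centralized technology. It will make a small number of people stronger, but it will cause most people to lose their unique value. — Shunyu Yao @ 47:15

Be bold. If you don’t go after something, you will never get it. You may go after it and still not get it, but if you don’t try, you definitely won’t. — Shunyu Yao @ 1:15:30

Reading is not about how much you read, but about how deeply you read. — Shunyu Yao @ 1:15:52

Ideas are cheap. Many ideas are actually obvious, and everyone knows them. The hard part is implementation: how to break an idea into small, achievable steps and actually build it. — Shunyu Yao @ 2:11:17

Modern AI training is one big system, and you really need to understand every aspect of that system to get a global picture… For language models, the era of individual heroism may be over. — Shunyu Yao @ 2:21:01

You can’t stop AI progress. If you stop, others will keep going. The world is pushing us forward. — Shunyu Yao @ 2:31:54

Anthropic’s idea that everyone has to listen to them for AI safety is very naive. — Shunyu Yao @ 2:33:16

OpenAI saved Google’s life by forcing them to act. — Shunyu Yao @ 2:55:01

My feeling is that the vast majority of new labs will die. — Shunyu Yao @ 3:23:46

Working purely on language models is no longer a blue ocean. I think it’s too late; the last bus has already left. — Shunyu Yao @ 3:33:47

Getting older doesn’t necessarily turn you into an ‘old lamp’… The other kind of person is the old lamp: they don’t understand things themselves, yet they love telling others what to do. — Shunyu Yao @ 3:39:52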

Career Arc & Personal Stories (11)

[01:27] Shunyu Yao started by studying physics at Tsinghua University, then pursued a PhD in high-energy physics at Stanford. He briefly did a postdoc at UC Berkeley for just two weeks before leaving academia to join Anthropic. After a year at Anthropic, he moved to Google DeepMind to work on Gemini.
[1:08:46] Born in a small coal-mining town in Ningxia, moved to Shanghai for better education, but ended up in underperforming middle and high schools.
[1:14:05] Despite not being in a top high school, he boldly texted a Tsinghua admissions officer to secure a spot in their independent recruitment exam, which changed his life trajectory.
[1:26:26] Switched from condensed matter physics to high-energy theory at Stanford because he wanted a harder challenge, but eventually left the field because the lack of experimental validation felt meaningless to him.
[1:41:30] The guest started in theoretical physics but felt it was too disconnected from practical impact, likening it to ‘charity’. He pivoted to AI because it offered faster iteration and more tangible results.
[1:52:33] He joined Anthropic largely due to connections with former physics colleagues who had already transitioned there. He started on the RL team right before OpenAI’s o1 was released.
[2:11:39] After contributing to Claude 3.7 and seeing Anthropic grow rapidly, he left for Google to seek a new environment and learn different approaches, feeling he had absorbed what he could at Anthropic.
[2:37:30] The guest explains his decision to leave Anthropic, initially driven by pessimism over their API-centric business model. However, he later realized he underestimated their strong product strategy, particularly tools like Claude Code.
[2:41:10] After leaving Anthropic, the guest joined Google DeepMind to focus on ML coding and long-horizon tasks, seeking a new environment to push his research forward.
[3:31:38] The guest explains his unique interview process: giving candidates 24 hours to build a reinforcement learning project from scratch, followed by a 1-hour deep dive to ensure they didn’t just blindly use AI to write the code.
[3:38:18] The guest mentions his background in topology before transitioning into AI, using it to draw parallels between visionary physicists and AI researchers.

Tools & Models Discussed (14)

Gemini: Google’s flagship multimodal AI model family, praised for its technical execution.
Claude: Anthropic’s flagship large language model, known for its reasoning and coding capabilities; the guest mentions using it frequently for coding and work-related tasks.
Cursor: An AI-powered code editor that has gained massive popularity among developers.
OpenClaw: An AI agent startup that was acquired by OpenAI.
Manus: An AI agent startup that was acquired by Meta.
SWE-bench: A benchmark used to evaluate the software engineering capabilities of AI models.
Doubao: ByteDance’s AI model, noted specifically for having the best voice generation and emotional interaction capabilities.
Claude 3: Anthropic’s previous generation model that gave the company confidence to push further.
Claude 3.5 / 3.6 / 3.7: Successive iterations of Anthropic’s models, with 3.7 specifically focused on strong agentic coding capabilities.
ChatGPT / GPT-4: OpenAI’s conversational AI models; GPT-4 is used as the benchmark that Claude 3 surpassed, boosting Anthropic’s internal confidence.
o1 (Strawberry): OpenAI’s model focused on advanced reasoning and reinforcement learning, which was highly anticipated while the guest was working on RL at Anthropic.
Claude Code: An AI-powered coding and collaboration tool developed by Anthropic.
DeepSeek: An AI model noted for its advancements in sparse attention mechanisms.
TPU: Google’s custom tensor processing unit, optimized for large-scale AI training using a 3D torus network.
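
The 3D torus network mentioned for TPUs can be illustrated with a short sketch. It computes each chip’s neighbors and compares the worst-case hop count of a torus, which has wraparound links, against a plain 3D grid of the same size. The 4×4×4 dimensions are a hypothetical example, not a real TPU pod configuration, and the model ignores link bandwidth and routing, so it shows only the topological idea.

```python
from itertools import product


def torus_neighbors(coord, dims):
    """Return the six neighbors of a chip in a 3D torus.

    Each axis wraps around, so edge chips still have two neighbors per axis
    (assumes every dimension is larger than 2, so the neighbors are distinct).
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % dims[axis]
            neighbors.append(tuple(shifted))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hops between two chips: take the shorter way around each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def mesh_hops(a, b):
    """Minimum hops in a plain 3D grid without wraparound links."""
    return sum(abs(x - y) for x, y in zip(a, b))


dims = (4, 4, 4)  # hypothetical 64-chip layout
origin = (0, 0, 0)
chips = list(product(*(range(d) for d in dims)))

# Worst-case distance from a corner chip to any other chip.
torus_diameter = max(torus_hops(origin, chip, dims) for chip in chips)
mesh_diameter = max(mesh_hops(origin, chip) for chip in chips)
print(torus_neighbors(origin, dims))
print(torus_diameter, mesh_diameter)
```

Under these assumptions the wraparound links shorten the longest path between any two chips, which is one reason a torus is attractive when collective operations have to span an entire cluster.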

Topics

AI Scaling Laws and Bottlenecks · The Evolution of AI Coding Agents · Startup vs Big Tech Dynamics in AI · The Future of Software Engineering · Career Transitions from Academia to AI Industry · AI Model Evaluation · US-China AI Competition · Model Distillation Techniques · Voice Generation and HCI · Robotics and VLA Models · Quantum Physics · Transitioning from Physics to AI · AI Scaling Laws and Black Box Nature · Anthropic’s Top-Down Execution Culture · Development of Claude 3.7 and Agentic Coding · Google’s Bottom-Up Research Culture · The Shift from Individual AI Research to Systems Engineering · AI Research Methodologies · AI Safety and Corporate Strategy · Automation of Machine Learning · Pre-training vs. Post-training · Corporate Cultures (Google, Anthropic, OpenAI) · Long Horizon AI Agents · World Models · AI Hardware Infrastructure (TPU vs GPU) · AI Industry Trends and Startups · US vs China Software Markets · AI Talent and Interviewing

Takeaways

AI models are becoming homogenized in capabilities; the true differentiation now lies in product execution and defining the right problems.
Coding is currently the only AI-native application that has successfully created a data flywheel.
Scaling laws may face unexpected bottlenecks, such as data exhaustion or fundamental algorithmic limits, rather than continuing indefinitely.
AI will centralize technological power, significantly altering the landscape for software engineers by making a few highly productive while displacing others.
The gap between US and Chinese AI models is narrowing, but compute limitations force Chinese companies to rely heavily on distillation.
Voice generation is fundamentally a model capability problem, not just a product UI feature, and ByteDance’s Doubao currently leads in this area.
The robotics industry has cheap, mature hardware, but is bottlenecked by the lack of generalizable foundational models.
Boldness and taking unconventional risks (like texting an admissions officer) can drastically alter one’s career trajectory.
Theoretical physics is losing talent to AI because AI offers immediate, verifiable feedback loops, whereas high-energy physics is currently stalled by a lack of experimental data.
A background in physics provides a strong systematic thinking framework useful for AI research.
Anthropic’s success is heavily driven by a top-down, highly aligned execution culture led by its founders.
The development of models like Claude 3.7 relies less on secret algorithms and more on rigorous engineering and execution of known techniques.
The era of individual breakthroughs in LLMs is fading; AI research is now a massive, coordinated systems engineering effort carried out by large teams.
AI models will soon be capable of autonomously conducting machine learning experiments.
Post-training is becoming the critical differentiator for advanced model capabilities, such as long-context reasoning.
OpenAI’s aggressive product strategy forced Google to overcome its innovator’s dilemma and accelerate AI deployment.
Different AI labs have distinct cultures: Anthropic is top-down and vertically focused on language models, while Google is more bottom-up, with research spread across many directions.
TPUs have a structural advantage over GPUs for massive-scale training due to their 3D torus network topology.
The era of pure language modeling as a startup opportunity is over; the focus is shifting to robotics, multimodal, and AI for science.
Many new AI labs spinning out of big tech lack a viable business model and are likely to fail.
The US excels in direct, high-margin enterprise software, while China dominates in complex, indirectly monetized consumer products.
Relying entirely on AI coding tools without understanding the underlying logic is a fatal flaw for new AI researchers.