Saining Xie: 7-hour marathon — World Models, AMI Labs, Yann LeCun, Fei-Fei Li

Duration: 405 min

Guest: Saining Xie 谢赛宁 · Chief Science Officer at AMI Labs

Chapters (49)

00:00 · Introduction and Teaser
- Host Xiaojun Zhang introduces Saining Xie and his new entrepreneurial venture with Yann LeCun.
01:19 · New York Vibe and NYU
- Saining discusses the artistic vibe of New York and his reasons for choosing NYU for his academic career.
04:33 · Childhood and Early Internet Days
- Saining shares his childhood memories of traveling, reading, and experiencing the early explosion of the internet in China.
08:43 · SJTU ACM Class and Highlight Moment
- He reflects on his time in the prestigious SJTU ACM class and humorously calls his pre-college gaming summer his ‘highlight’ moment.
12:30 · Comparing with Top Students
- Saining explains why he didn’t fit the traditional ‘top student’ mold and how he navigated his own path.
16:30 · Xiaodi Hou’s Influence
- How Xiaodi Hou’s ‘Survival Manual’ and his 7-line-code CVPR paper shaped Saining’s career choice.
20:00 · Choosing Computer Vision
- Saining explains his fascination with computer vision, comparing it to the Cambrian explosion of biological evolution.
24:52 · PhD Application Journey
- The story of cold-emailing Professor Zhuowen Tu and securing a PhD position after a 3 AM phone call.
30:00 · Early Research and Breakthroughs
- Discussion on his early influential papers, including Deeply Supervised Nets and Holistically-Nested Edge Detection.
2:30:00 · Reflections on FAIR and Self-Supervised Learning
- Saining discusses his early work at FAIR on self-supervised learning and its applications across different domains.
2:32:10 · Kaiming He’s Infrastructure and Baselines
- Saining shares how Kaiming He single-handedly built a robust TPU infrastructure at FAIR, emphasizing the importance of strong baselines.
2:36:30 · The Excel Spreadsheet Method
- Saining reveals the rigorous practice of using Excel to track experiments, control variables, and predict outcomes.
2:41:30 · Kaiming He’s Philosophy and Interests
- A look into Kaiming He’s broad interests outside of AI, including Hearthstone, physics, and philosophy.
2:44:20 · Defining ‘Research Taste’
- Saining defines research taste using a quote from the Diamond Sutra, highlighting the need to see through illusions to find the essence.
2:51:50 · Research as Storytelling
- Saining compares writing a research paper to directing a movie, emphasizing the importance of narrative and storytelling.
2:54:30 · Anti-fragile Research
- Discussion on building an anti-fragile research system that benefits from shocks and failures.
2:57:50 · The Development of DiT
- The story behind the Diffusion Transformer (DiT) paper, its initial rejection at CVPR, and its eventual success.
3:08:40 · Transitions to OpenAI and NYU
- Bill Peebles joins OpenAI to work on Sora, while Saining transitions to a professorship at NYU.
3:14:00 · The Cambrian Project
- Saining discusses his current research on vision-language models, using the Cambrian explosion as a metaphor for AI evolution.
3:20:00 · Academic Compute Challenges
- Discussing the struggles of academic AI research with limited funding and compute resources like TPUs.
3:26:30 · The Importance of Video Understanding
- Exploring why video is the ultimate medium for AI to understand the physical world and causality.
3:34:00 · Redefining Computer Vision
- Arguing that computer vision is a perspective and a generalization process, not just a set of tasks.
3:43:00 · LLMs vs. True Physical Intelligence
- Explaining why language models are a ‘crutch’ and true intelligence requires grounding in the physical world.
3:55:00 · Representation Learning and REPA
- Discussing the need for high-dimensional representation alignment without relying solely on language.
4:10:00 · Defining World Models
- The guest defines what a world model is and traces its historical roots back to 1943 and control theory.
4:16:00 · State Representation and LLM Limitations
- Discussion on how to represent states in a world model and why LLMs fall short of true physical understanding.
4:26:00 · Industry Approaches to World Models
- Analyzing how companies like OpenAI, Runway, and World Labs are approaching world models through video generation and 3D assets.
4:32:00 · The Flaws of Tokenizing Video
- The guest explains why treating video frames as 1D token sequences for LLMs is fundamentally flawed.
4:40:00 · Scaling Laws and Compression
- Exploring the concept that compression is intelligence and how scaling laws apply differently to world models.
4:46:00 · The Data Bottleneck
- The challenge of acquiring high-bandwidth physical world data compared to the relatively low-bandwidth text data on the internet.
4:50:00 · Applications: Wearables and Robotics
- How world models will enable always-on AI wearables and general-purpose robotics.
4:56:00 · Entrepreneurship and the Academic Trap
- The guest shares his motivation for leaving academia to start a company and avoid the ‘middle-income trap’ of paper publishing.
5:00:00 · Leaving Big Tech and Academia
- Saining discusses his decision to leave Meta and NYU to start a company focused on world models.
5:05:00 · The Problem with Current AI Research
- He explains how the LLM arms race and benchmark culture stifle fundamental research in big tech.
5:11:00 · Defining the World Model
- Saining defines world models and emphasizes the need for AI to understand the physical world.
5:20:00 · The ‘Reverse OpenAI’ Strategy
- He outlines his startup’s strategy to build an alliance for physical data collection, comparing it to Mastercard.
5:30:00 · Yann LeCun and AI Philosophy
- Saining shares anecdotes about Yann LeCun’s hobbies and his pure scientific approach to AI.
5:40:00 · Startup Progress and Vision
- Discussion on the startup’s funding, team building, and the challenges of maintaining a research-focused culture.
5:50:00 · Naming the Startup and the Underdog Mentality
- The guest discusses naming his startup after the movie ‘Solaris’ and embracing an underdog, grassroots mentality in the AI industry.
5:55:00 · Transitioning from Researcher to Entrepreneur
- Exploring the differences between academic research and entrepreneurship, emphasizing the need for courage and a balanced approach.
6:00:00 · Yann LeCun’s Vision and Building the Team
- The guest shares how Yann LeCun’s vision inspired the startup and details the process of assembling a diverse, six-person founding team.
6:05:00 · Deconstructing AGI and Animal Intelligence
- A critical discussion on the definition of AGI, arguing that human intelligence is highly specialized and that building ‘squirrel-level’ intelligence is a massive challenge.
6:15:00 · Robotics and the Future of World Models
- Discussing the application of world models in robotics, the limitations of current VLA models, and the need for true physical understanding.
6:25:00 · Personal Reflections and Dealing with Startup Stress
- The guest shares personal stories of managing stress by observing everyday life in New York parks and recommends influential books and media.
6:35:00 · Industry Observations and Future Outlook
- Thoughts on competitors like ByteDance, the importance of data in generative models, and a concluding philosophical remark on letting go of Wittgenstein.
6:40:00 · Misuse of Philosophical Quotes in AI
- The guest criticizes the trend of using quotes from Wittgenstein and Feynman to justify LLMs and unified models in AI papers.
6:41:18 · Wittgenstein’s Language Games
- Discussion on how language derives its meaning from real-world practice and action, rather than just being a system of symbols.
6:42:12 · Feynman’s Quote on Creation and Understanding
- The guest explains that Feynman’s quote refers to physical creation and action, not just training a diffusion model.
6:43:27 · Destiny and the Universe as a World Model
- A philosophical take on destiny, viewing the universe as a giant world model that requires immense compute to predict.

Specific Numbers (29)

Time	Fact	Value	Context
00:52	AMI Labs team size	25	The current number of employees at Saining’s startup, AMI Labs.
02:55	Years in the US	13	The amount of time Saining Xie has lived in the United States.
06:18	Age getting first computer	9	Saining got his first computer at age 9, sparking his interest in the digital world.
16:53	Lines of code in a CVPR paper	7	Xiaodi Hou published a highly influential CVPR paper using only 7 lines of code.
26:31	Cambrian Explosion timeframe	530 million years ago	Used as a metaphor for the sudden emergence of vision in biological history.
2:32:34	TPU cores rented by FAIR	5000	FAIR rented 5000 TPU cores from Google Cloud to experiment with new hardware.
2:47:58	Paper completion timeline	1 month	Kaiming He typically finishes writing his papers a full month before the submission deadline.
2:49:15	Minimum text width on a line	60%	Kaiming He’s aesthetic rule that no line of text in a paper should occupy less than 60% of the column width.
3:08:50	Bill Peebles joining OpenAI	End of 2022	Bill Peebles joined OpenAI at the end of 2022 to continue his work on generative models.
3:16:31	Cambrian explosion timeline	538 million years ago	Used as a metaphor to describe the rapid evolution of visual capabilities in biological history.
3:23:22	NSF Grant Total	$500,000	Total amount provided by an NSF grant over 5 years.
3:23:46	NSF Grant Annual	$100,000	Annual amount of an NSF grant, enough to fund one student.
3:24:08	Industry Grant	$100,000 - $150,000	Typical one-time grant amount from industry.
3:24:15	Grant Competition	100	Number of schools competing for a single industry grant.
4:12:35	First proposal of the world model concept	1943	Kenneth Craik proposed that the human brain has a world model to predict the consequences of actions.
4:28:15	Autodesk investment in World Labs	$200 million	Mentioned as an example of industry investment in 3D representation and world models.
4:46:15	Human sensory bandwidth	1 to 10 billion bits per second	The amount of data humans process through senses like vision and hearing.
4:46:25	Human language bandwidth	10 to 100 bits per second	The relatively low bandwidth of human speech compared to sensory input.
5:08:00	Time spent on a research paper	Almost 1 year	Saining and his student spent nearly a year on a paper, a luxury big tech researchers don’t have.
5:08:54	Time Google researchers spent on a similar project	2 weeks	Google researchers were forced to abandon a similar project after two weeks due to product cycle pressures.
5:27:00	Number of initial offices	4	The startup plans to have offices in Paris, New York, Montreal, and Singapore from day one.
5:41:46	Initial team size	Around 25 people	The target size for the initial founding team.
5:50:45	Movie release years	1970s / 2000s	Referring to the different versions of the movie ‘Solaris’ (Tarkovsky vs. Soderbergh).
6:01:10	Number of co-founders	6	The total number of co-founders in the guest’s startup.
6:04:48	Publication year of JEPA paper	2022	The year Yann LeCun published the foundational paper on the JEPA cognitive architecture.
6:08:08	Visual nerve fibers	2 million	The number of visual nerve fibers in humans, illustrating the massive bandwidth of visual input.
6:14:53	Evolutionary timeline	530 million years	The time it took for biological intelligence to evolve, compared to the difficulty of building AI.
6:38:46	Importance of data	90-95%	The estimated percentage of the problem that relies on data quality and processing in generative models.
6:44:27	The ultimate answer to life, the universe, and everything	42	Referencing ‘The Hitchhiker’s Guide to the Galaxy’ when discussing the compute needed to simulate the universe.
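
A couple of the figures in the table can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above (sensory vs. language bandwidth, and the NSF grant total vs. its annual amount); the function and variable names are illustrative assumptions, not anything from the interview.

```python
# Figures as quoted in the table above; names are illustrative only.
SENSORY_BITS_PER_SEC = (1e9, 1e10)   # "1 to 10 billion bits per second"
LANGUAGE_BITS_PER_SEC = (10, 100)    # "10 to 100 bits per second"


def bandwidth_ratio_range(sensory, language):
    """Return the (smallest, largest) sensory-to-language ratio implied by two ranges."""
    lowest = sensory[0] / language[1]    # low end of sensory vs. high end of language
    highest = sensory[1] / language[0]   # high end of sensory vs. low end of language
    return lowest, highest


low, high = bandwidth_ratio_range(SENSORY_BITS_PER_SEC, LANGUAGE_BITS_PER_SEC)

# NSF grant: $500,000 over 5 years should match the quoted annual amount.
NSF_TOTAL = 500_000
NSF_YEARS = 5
nsf_per_year = NSF_TOTAL / NSF_YEARS

print(f"sensory/language bandwidth ratio: {low:.0e} to {high:.0e}")
print(f"NSF grant per year: ${nsf_per_year:,.0f}")
```

Taken at face value, the quoted ranges put sensory input somewhere between ten million and a billion times the bandwidth of language, and the NSF total is consistent with the $100,000 annual figure.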

Research Claims & Predictions (20)

[26:00] Vision is the primary way humans perceive the world.
- evidence: Supported by the biological fact that a large portion of the brain’s cortex is dedicated to visual processing.
[28:31] Solving vision is solving intelligence itself.
- evidence: Vision is the only sense exposed directly to the real world, making it the most critical component of artificial general intelligence.
[32:00] Deeply Supervised Nets alleviate the vanishing gradient problem.
- evidence: By adding auxiliary supervision to intermediate layers, the network can be trained more effectively, a concept later echoed by ResNet.
[2:33:36] The upper bound of your research depends on how good your baseline is.
- evidence: A weak baseline leads to false positive signals, while a strong baseline forces true breakthroughs.
[2:40:13] Researchers must predict the outcome of an experiment before running it.
- evidence: Predicting outcomes verifies if the researcher’s mental model of the system is correct or needs adjustment.
[3:04:17] Diffusion models will shift from U-Net architectures to Transformers.
- evidence: Transformers offer better scalability, efficiency, and cleaner codebases compared to complex U-Net structures.
[3:34:28] Computer vision is a perspective, not a specific task.
- evidence: It’s a fundamental way of understanding the world through continuous, high-dimensional, noisy signals, essential for future AI.
[3:43:42] Language models are a ‘crutch’ for true intelligence.
- evidence: True intelligence requires grounding in the physical world, which language alone cannot provide, echoing Yann LeCun’s views.
[3:55:00] High-dimensional representation learning is crucial and shouldn’t be bypassed by language.
- evidence: Projects like REPA show that aligning internal representations directly is more effective than forcing everything through a language bottleneck.
[4:22:20] LLMs do not fully embody the ‘Bitter Lesson’.
- evidence: LLMs still rely heavily on human-designed language structures and logic, whereas the Bitter Lesson advocates for minimizing human heuristics in favor of computation and search.
[4:30:45] Applying LLM architectures directly to video by flattening frames into 1D sequences is a dead end.
- evidence: It destroys the spatial relationships and continuous nature of the physical world, making it an inefficient way to learn physical laws.
[4:42:00] Scaling laws for world models will focus on compressing physical phenomena rather than human knowledge.
- evidence: Future models will need to compress high-bandwidth sensory data to truly understand physics, rather than just memorizing text.
[5:05:00] LLM benchmarks dictate resource allocation, stifling fundamental research.
- evidence: Evidenced by big tech researchers having exploratory projects killed to focus on product timelines.
[5:18:00] Internet data is insufficient for training true world models.
- evidence: YouTube data is aligned with human entertainment, lacking the physical signals needed for true understanding.
[5:26:00] A ‘Reverse OpenAI’ approach is necessary for physical AI.
- evidence: Future models will require alliances with industries to collect real-world data, rather than just scraping the web.
[6:04:35] JEPA is not just an algorithm, but a comprehensive cognitive architecture.
- evidence: It is viewed as a pathway to universal intelligence, moving beyond simple self-supervised learning to true world understanding, prediction, and planning.
[6:08:40] Human intelligence is highly specialized, not purely general.
- evidence: Humans can only process a tiny fraction of all possible visual functions, meaning our intelligence is tailored to our specific evolutionary environment.
[6:13:15] Building a ‘squirrel-level’ intelligence is harder than writing code or going to the moon.
- evidence: Quoting Rich Sutton, the guest argues that creating an AI with intrinsic motivation, survival instincts, and physical understanding is the true hard problem of AI.
[6:40:27] Wittgenstein’s early philosophy does not justify equating LLMs with world models.
- evidence: Wittgenstein’s later work overturned his early ideas, emphasizing that language meaning comes from real-world action.
[6:43:57] The universe is a giant world model, but we cannot predict destiny due to lack of compute.
- evidence: Predicting the universe would require a computer the size of the Earth or the universe itself.
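
The 4:30:45 claim about flattening video frames into 1D sequences can be made concrete with a toy index calculation. This is only an illustration of how row-major flattening separates neighbouring pixels; the frame size and the helper name flat_index are assumptions for the example, and it does not model positional encodings or any specific tokenizer.

```python
def flat_index(frame, row, col, height, width):
    """Position of a pixel after flattening a video frame by frame, row by row."""
    return frame * height * width + row * width + col


HEIGHT, WIDTH = 4, 6  # toy frame size, chosen only for illustration

here = flat_index(0, 1, 2, HEIGHT, WIDTH)
gap_right = flat_index(0, 1, 3, HEIGHT, WIDTH) - here       # horizontal neighbour
gap_below = flat_index(0, 2, 2, HEIGHT, WIDTH) - here       # vertical neighbour
gap_next_frame = flat_index(1, 1, 2, HEIGHT, WIDTH) - here  # same pixel, next frame

print(gap_right, gap_below, gap_next_frame)  # prints: 1 6 24
```

Three positions that are all immediate neighbours in space or time end up 1, WIDTH, and HEIGHT * WIDTH steps apart in the sequence. At a 256 x 256 frame size the last two gaps would be 256 and 65,536.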

Key Concepts (27)

[16:53] CVPR
- Computer Vision and Pattern Recognition, one of the top academic conferences in the field of AI and computer vision.
[26:31] Cambrian Explosion
- An evolutionary event where most major animal phyla appeared; used as a metaphor for the rapid advancement of visual AI.
[32:00] Deeply Supervised Nets (DSN)
- A neural network architecture that adds supervision to hidden layers to alleviate the vanishing gradient problem during training.
[34:25] Holistically-Nested Edge Detection (HED)
- A deep learning model for edge detection that utilizes multi-scale and multi-level feature learning.
[2:44:20] Research Taste
- The ability to see past the superficial claims of a paper to its core essence, combined with high aesthetic standards in executing and presenting research.
[2:54:30] Anti-fragile Research
- A research methodology or organizational structure that actually benefits and grows stronger from unexpected shocks, failures, or random events.
[2:57:50] DiT (Diffusion Transformers)
- A generative model architecture that replaces the traditional U-Net with Transformers, allowing for better scaling and efficiency in image and video generation.
[3:15:50] Cambrian Explosion
- A biological metaphor used to describe the current rapid diversification and evolution of vision-language AI models.
[3:20:45] TPU Research Cloud (TRC)
- A Google program that provides free TPU compute resources to academic researchers.
[3:35:55] Continuous High-Dimensional Noisy Signals
- The nature of visual data from the real world, which computer vision systems must process, unlike clean text data.
[3:45:55] Moravec’s Paradox
- The observation that high-level reasoning requires little computation, but low-level sensorimotor skills require enormous computational resources.
[4:01:18] Representation Alignment (REPA)
- A method to align the internal representations of generative models with self-supervised models without relying on language.
[4:11:25] World Model
- A system that takes a current state and an action to predict the next state of an environment, allowing for planning and reasoning.
[4:13:40] Model Predictive Control (MPC)
- A control algorithm that uses a predictive model to simulate future states and optimize a sequence of actions to achieve a goal.
[4:22:20] The Bitter Lesson
- An essay by Rich Sutton arguing that general AI methods that leverage computation, such as search and learning, ultimately outperform human-designed, domain-specific heuristics.
[4:41:55] Compression is Intelligence
- The theoretical perspective that the ability to compress data efficiently requires a deep understanding of the underlying patterns, equating compression with intelligence.
[5:11:00] World Model
- An AI model designed to understand the physical world and its dynamics, going beyond text or 2D video generation.
[5:26:00] Reverse OpenAI
- A strategy of building AI by forming alliances to collect proprietary physical world data, rather than scraping public internet data.
[5:33:00] JEPA (Joint Embedding Predictive Architecture)
- Yann LeCun’s architecture for world models that predicts in an abstract representation space rather than generating pixels.
[6:03:30] JEPA (Joint Embedding Predictive Architecture)
- A cognitive architecture proposed by Yann LeCun that focuses on understanding and predicting the world in an abstract representation space, rather than just predicting raw pixels or tokens.
[6:04:40] World Model
- An AI system designed to understand the physical laws and dynamics of the real world, enabling it to predict future states and plan actions.
[6:07:45] AGI (Artificial General Intelligence)
- Discussed as a potentially flawed concept, with the guest arguing that intelligence is inherently specialized and bounded by physical and evolutionary constraints.
[6:17:50] VLA (Vision-Language-Action)
- Models used in robotics that map visual and linguistic inputs directly to physical actions, which the guest views as currently lacking true physical understanding.
[6:40:08] LLM (Large Language Model)
- AI models trained on vast amounts of text, which some researchers mistakenly equate to complete world models.
[6:41:19] Language Game
- Wittgenstein’s philosophical concept that language symbols have no inherent meaning without real-world practice and action.
[6:42:58] Diffusion Model
- A type of generative AI model that the guest argues does not truly embody Feynman’s concept of ‘creation’.
[6:43:57] World Model
- A system that simulates and predicts the environment, which the guest scales up to describe the entire universe.
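
The World Model entry at 4:11:25 (current state plus action gives next state) and the Model Predictive Control entry at 4:13:40 fit together in a few lines. The sketch below is a toy under stated assumptions: a 1D position as the state, three discrete actions, hand-written dynamics instead of a learned model, and exhaustive search standing in for a real optimizer. None of the names come from the interview.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical action set: step left, stay, step right


def world_model(state, action):
    """Toy dynamics: predict the next state from the current state and an action."""
    return state + action


def plan(state, goal, horizon=3):
    """MPC step: simulate every action sequence with the model, return the best first action."""
    best_cost, best_first = None, 0
    for seq in product(ACTIONS, repeat=horizon):
        s, cost = state, 0
        for a in seq:
            s = world_model(s, a)   # imagined rollout, no real interaction
            cost += abs(goal - s)   # accumulated distance to the goal
        if best_cost is None or cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def run(state, goal, steps=6):
    """Replan at every step, executing only the first action of each plan."""
    trajectory = [state]
    for _ in range(steps):
        # For simplicity the toy model also plays the role of the real environment.
        state = world_model(state, plan(state, goal))
        trajectory.append(state)
    return trajectory


print(run(0, 4))  # prints: [0, 1, 2, 3, 4, 4, 4]
```

The controller never commits to a whole plan: it imagines a few steps ahead, takes one action, then replans from the new state. In a real system the dynamics would be learned and the state would be a high-dimensional representation rather than a single number.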

People Mentioned (30)

Yann LeCun — Turing Award winner and co-founder of AMI Labs with Saining Xie.
Martin Scorsese — Famous film director mentioned as an NYU alumnus.
Chloe Zhao (Zhao Ting) — Oscar-winning director mentioned as an NYU alumna.
Richard Courant — Mathematician and namesake of the Courant Institute at NYU.
Xiaodi Hou — A senior at SJTU who wrote a ‘Survival Manual’ and deeply inspired Saining to pursue research.
Zhuowen Tu — Professor at UCSD who became Saining’s PhD advisor after a cold email.
Jiashi Feng — A mentor and collaborator during Saining’s early research days.
Kaiming He — Colleague at FAIR who deeply influenced Saining’s research methodology, infrastructure building, and paper aesthetics.
Ross Girshick — Colleague at FAIR who contributed to building strong research infrastructure and baselines.
Yuxin Wu — Colleague at FAIR who also contributed to the robust infrastructure.
Bill Peebles — Saining’s intern at FAIR, co-author of the DiT paper, and later a key researcher on OpenAI’s Sora.
Robert McKee — Author of the book ‘Story’, which Saining recommends for learning how to structure research papers.
Aravind Srinivas — Showed Saining an early AI demo in a Palo Alto coffee shop.
Jia Zhangke — Chinese film director mentioned for his use of long takes, relating to video understanding.
Bi Gan — Chinese film director mentioned for his use of long takes, relating to video understanding.
Fei-Fei Li — Provided advice on spatial intelligence for a research paper.
Alex Kirillov — Collaborator at OpenAI on the ‘Think with Image’ project.
Kenneth Craik — A psychologist who first proposed the concept of a mental world model in 1943.
Rich Sutton — Author of the Dyna paper, which integrated learning and planning in reinforcement learning.
Fei-Fei Li — Also mentioned as the founder of World Labs, working on 3D spatial intelligence.
Jitendra Malik — Berkeley professor who quipped about preferring ‘World Models’ over ‘Word Models’.
Xiaodi Hou — Also mentioned as a peer Saining consulted for advice on building products.
Zhang Tao — Founder of Minus, who advised Saining that building good products requires loving life.
Ilya Sutskever — Co-founder of OpenAI, described by Saining as a ‘fighter’ in contrast to LeCun’s scientific purity.
Andrei Tarkovsky / Steven Soderbergh — Directors of the movie ‘Solaris’, which inspired the startup’s name.
Pascal — The CRIO (Chief Research and Innovation Officer) of the startup.
Mike — The VP of World Model at the startup, formerly a director at Meta.
Jurgen Klopp — Former Liverpool football manager, quoted by the guest (‘I’m the normal one’) to describe his own leadership style.
Ludwig Wittgenstein — Philosopher whose quotes on language and the world are frequently cited in AI research.
Richard Feynman — Physicist whose quote ‘What I cannot create, I do not understand’ is often misused in AI papers.

Companies Mentioned (18)

AMI Labs · Tencent (QQ) · Fanfou · Microsoft Research Asia (MSRA) · FAIR · Google · OpenAI · DeepMind · Perplexity · ByteDance · Runway · Luma · World Labs · Autodesk · YouTube · Meta · Visa / Mastercard · Black Forest Labs

Notable Quotes (23)

I hope myself and the people around me can look at the world with a more open mind. — Saining Xie @ 09:14

My highlight moment in life was those two months playing games in the dorm. — Saining Xie @ 11:46

The world has changed, but we haven’t. — Saining Xie (quoting Xiaodi Hou) @ 16:32

If you don’t do this, it will never happen in this world. — Saining Xie @ 29:16

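The upper bound of your research really depends on how good your baseline is. — Saining Xie @ 2:33:36

You have to learn to make predictions. Every time you run an experiment, you should predict what its result will be. — Saining Xie @ 2:40:15

All appearances are illusory. If you see that all appearances are not appearances, you see the Tathagata. — Saining Xie (quoting the Diamond Sutra) @ 2:45:15

It is not that you believe because you see; you see because you believe. — Saining Xie @ 2:55:52

Vision is a perspective. It’s not a specific task, it’s not even a specific domain. — Saining Xie @ 3:34:28

Language is a drug. You add more language, you always feel happier. — Saining Xie @ 3:48:46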

LLM completely lacks the Bitter Lesson… you should minimize human knowledge. — Saining Xie @ 4:22:20

He [Jitendra Malik] said his favorite thing about ‘world model’ is that it tells everyone I’m building a ‘world model’, not a ‘word model’. — Saining Xie @ 4:57:45

People are actually somewhat resistant to this kind of purely exploratory, academic-style research. — Saining Xie @ 5:02:30

The world needs a world model. — Saining Xie @ 5:11:17

We want to build this kind of reverse OpenAI. — Saining Xie @ 5:26:32

Career Arc & Personal Stories (16)

[04:33] Saining grew up in a relaxed environment with a father who loved reading and a mother who loved traveling. He got his first computer at age 9 and became deeply immersed in early Chinese internet culture.
[08:43] He entered the highly competitive SJTU ACM class but realized he wasn’t the typical ‘top student’ who excelled at competitions, choosing instead to explore his own interests.
[16:30] Inspired by a ‘Survival Manual’ written by senior Xiaodi Hou and by Hou’s influential CVPR paper, which used only 7 lines of code, Saining decided to dedicate his career to computer vision research.
[24:52] During his PhD applications, he faced rejections but took the initiative to cold-email Professor Zhuowen Tu. A 3 AM phone call secured his position at UCSD.
[30:00] During his PhD, he co-authored highly influential early deep learning papers like Deeply Supervised Nets and HED, establishing his reputation in the field.
[2:30:00] Saining spent four years at FAIR, initially focusing on self-supervised learning and expanding it to 3D and medical domains.
[2:32:10] He worked closely with Kaiming He, learning rigorous methodologies like tracking experiments in Excel and predicting outcomes before running code.
[2:57:50] He mentored intern Bill Peebles, leading to the creation of DiT. Despite an initial rejection from CVPR, they persisted and got it accepted at ICCV.
[3:08:40] Saining left FAIR to become a professor at NYU, while his intern Bill joined OpenAI to build Sora based on their DiT research.
[3:25:08] Instead of meeting in an office, he went hiking with a Google collaborator to discuss their contributions to TPU infrastructure, highlighting a unique collaboration style.
[3:33:10] His students went out to the streets of New York with cameras to film footage to test their ideas for a predictive world model, showing a hands-on approach to research.
[4:58:30] The guest explains his decision to leave academia and start a company. He felt that staying in research would lead to a ‘middle-income trap’ of publishing decent papers without making a breakthrough, and he wanted to build a real, impactful system.
[5:00:00] Saining decided to leave his positions at Meta and NYU because the environments were no longer conducive to the fundamental research needed for world models. He shared this decision with Yann LeCun in a 1-on-1 meeting.
[5:08:00] Saining spent nearly a year working on a research paper with a student. After publishing, Google researchers reached out to say they had tried the same thing but were forced to stop after two weeks due to product pressures, validating his decision to leave big tech.
[5:55:00] The guest describes the mental shift required to transition from a pure researcher to a startup founder, emphasizing the need to ‘lean into the slope’ (embrace the fear) rather than pulling back.
[6:25:00] He shares how moving to New York and dealing with the immense stress of running a startup led him to find comfort in simply sitting in Washington Square Park, observing ordinary people living their lives.

Tools & Models Discussed (24)

Deeply Supervised Nets (DSN): Improves the training of deep neural networks by providing integrated direct supervision to hidden layers.
Holistically-Nested Edge Detection (HED): Performs image edge detection and object boundary detection using a deep learning model that leverages multi-scale features.
AlexNet: A pioneering convolutional neural network that sparked the deep learning revolution in computer vision in 2012.
ResNet: A residual neural network architecture whose skip connections make very deep networks easier to train, conceptually related to Saining’s earlier DSN work.
TPU: Hardware accelerators used by FAIR to train large-scale models, requiring custom infrastructure built by Kaiming He.
Excel: Used as a strict organizational tool to track experiment configurations, variables, and gradient signals.
DiT: A scalable diffusion model architecture that uses Transformers instead of U-Nets for generative tasks.
Sora: OpenAI’s text-to-video generation model, which heavily relies on the DiT architecture.
Cambrian-1: A family of multimodal large language models designed to improve vision-centric tasks and visual representations.
CLIP: A vision encoder model that Saining notes has certain flaws and ‘shortcuts’ in true visual understanding.
DiT (Diffusion Transformer): A generative model architecture discussed in the context of early TPU infrastructure work.
Cambrian: A multimodal project mentioned as a step towards handling more complex visual tasks.
V-STAR: A system designed to test scaling behaviors in multimodal models.
REPA (Representation Alignment): A method for aligning representations in models without using language as an intermediary.
Large Language Models (LLMs): Models that predict the next word based on text data; criticized in the video for lacking true physical understanding.
Sora: OpenAI’s video generation model, discussed as an early attempt at a world simulator.
Video Diffusion Models: Generative models used to create video content, currently serving as rudimentary physics simulators.
LLMs (Large Language Models): Current dominant AI models that excel at text but lack true understanding of the physical world.
JEPA: An architecture proposed by Yann LeCun for predictive world modeling in abstract spaces.
JEPA / V-JEPA: A predictive architecture for learning world models by predicting abstract representations of missing or future data.
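
The Deeply Supervised Nets entry above describes giving hidden layers their own direct supervision. A minimal sketch of that idea as a loss function follows. It assumes cross-entropy for every head and a single shared weight for the auxiliary terms; both are illustrative choices made here, not details confirmed by the interview, and the original paper's exact loss may differ.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def deeply_supervised_loss(final_logits, aux_logits_list, label, aux_weight=0.3):
    """Output-layer loss plus a weighted loss for each auxiliary head on a hidden layer.

    aux_weight is a hypothetical hyperparameter chosen for this sketch.
    """
    loss = cross_entropy(final_logits, label)
    for aux_logits in aux_logits_list:
        loss += aux_weight * cross_entropy(aux_logits, label)
    return loss


# Hypothetical logits from the output layer and from two hidden-layer classifiers.
final = [2.0, 0.5, -1.0]
hidden_heads = [[0.3, 0.1, 0.0], [1.0, 0.2, -0.5]]
print(deeply_supervised_loss(final, hidden_heads, label=0))
```

Because each auxiliary term depends directly on a hidden layer's output, early layers receive a gradient that does not have to pass through every later layer, which is the intuition behind the training benefit described above.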

Topics

Early Internet Culture in China · SJTU ACM Class Experience · Evolution of Computer Vision · PhD Application Journey · Deep Learning Architecture (DSN, HED) · Research Methodology · Infrastructure and Baselines · Research Taste and Aesthetics · Diffusion Transformers (DiT) · Vision-Language Models · AI Industry Culture Shifts · Academic Research Funding · Compute Constraints (TPU vs GPU) · Video Understanding · The Definition of Computer Vision · LLMs vs. Physical Grounding · Representation Learning · World Models · Large Language Models (LLMs) · Reinforcement Learning · Scaling Laws · AI Data Bottleneck · Robotics · AI Entrepreneurship · AI Research Environment · Physical AI · Startup Strategy · Yann LeCun's Philosophy · Startup Entrepreneurship · JEPA Architecture · AGI Definitions · Animal vs. Human Intelligence · Robotics and VLA Models · Data Quality in AI · Mental Health for Founders · AI Philosophy · Wittgenstein's Language Games · Feynman's Philosophy of Understanding · Compute Limits and Determinism

Takeaways

Saining Xie’s journey shows that you don’t have to fit the traditional ‘top student’ mold to achieve greatness in research.
The ‘Cambrian Explosion’ of AI vision mirrors biological evolution, highlighting vision as a core component of intelligence.
Taking bold initiative, such as cold-emailing a professor, can drastically change one’s career trajectory.
Early deep learning research focused heavily on overcoming training difficulties, leading to foundational innovations like Deeply Supervised Nets.
A strong baseline is critical; without it, performance gains are illusory and true breakthroughs are impossible.
Rigorous experiment tracking (like using Excel) and predicting outcomes before running code are essential for developing a correct mental model.
‘Research Taste’ involves seeing through the hype of papers to their core essence and maintaining high aesthetic standards in writing.
Research is a form of storytelling; papers should be crafted to guide the reader smoothly toward the core insight.
The shift from U-Net to Transformers in diffusion models (DiT) was a bottom-up innovation that eventually powered state-of-the-art models like Sora.
Academic AI research faces severe funding and compute constraints, forcing researchers to be resourceful and rely on programs like Google’s TRC.
Video is the crucial next frontier for AI, as it provides the continuous, high-dimensional data needed to understand causality and the physical world.
Relying solely on language models (LLMs) is a ‘crutch’; true artificial general intelligence requires systems grounded in physical reality, likely through advanced computer vision and robotics.
World models aim to understand the physical world by predicting future states, unlike LLMs which only predict text.
Applying LLM architectures directly to video by flattening frames is inefficient and loses spatial context.
The next major leap in AI requires moving beyond low-bandwidth text data to high-bandwidth sensory data to achieve true physical understanding.
True world models will enable advanced applications like always-on AI wearables and general-purpose robotics.
The current AI research environment in big tech is stifled by the LLM arms race and leaderboard chasing, leaving little room for fundamental exploration.
True world models require data from the physical world, which cannot be obtained simply by scraping the internet (like YouTube).
Saining’s startup aims to build a ‘Reverse OpenAI’ by forming alliances with industries to collect physical data and build a universal world model.
Yann LeCun’s approach to AI, focusing on scientific integrity and abstract representations (JEPA), heavily influences Saining’s vision.
Transitioning from research to entrepreneurship requires a fundamental shift in mindset, embracing risk and focusing on team building.
True intelligence (like that of a squirrel) involves intrinsic motivation and physical understanding, which is harder to achieve than current LLM capabilities.
JEPA is viewed not just as an algorithm, but as a comprehensive cognitive architecture necessary for building true world models.
Current robotics models (VLAs) often lack deep physical understanding, relying instead on mapping language to actions.
In generative AI, 90-95% of the success comes down to meticulous data curation and processing, rather than just model architecture.
AI researchers should avoid superficially quoting philosophers like Wittgenstein and Feynman to justify their models.
Language models alone are not world models because true meaning requires grounding in real-world physical action and practice.
While the universe can be viewed as a massive world model, predicting the future (destiny) is impossible due to the unimaginable computational resources required.