Episode 133 — Saining Xie

Host: Xiaojun · Duration: 405 min

A 7-hour marathon interview with Saining Xie: World Models, AMI Labs, Two Rejections of Ilya, Yann LeCun, Fei-Fei Li, and 42


Chapters (76)

  • 00:00:00 · Podcast Debut and Early Family Influence
    • Host Xiaojun introduces the episode and guest Saining Xie, who says this is his first podcast interview. He recalls traveling with his mother as a child and describes his father as a ‘homebody’ who loved reading and filled the house with books, an environment that shaped his upbringing.
  • 00:00:36 · The Enlightenment of the Internet and the Emergence of the Desire to Express
    • Saining Xie recalls getting his first computer at age 9 and experiencing the ‘information explosion’ brought by the internet. He began writing on platforms such as Sina Blog and Fanfou, discovering a new channel for self-expression and developing a wide range of interests.
  • 00:01:05 · Choosing a Non-Traditional Academic Path
    • He describes his academic career as a ‘Type B trajectory,’ contrasting it with the paths of many ‘Type A’ peers. He was admitted to Shanghai Jiao Tong University’s ACM class through competitions, choosing SJTU over Tsinghua or Peking University due to personal affection and identification with the city and the university’s computer science program.
  • 00:02:00 · Serendipity in Interviews and the Power of Role Models
    • Saining Xie recounts that during his ACM class interview, Professor Shen Shaoye asked him about his favorite book. He named Richard Courant’s ‘What Is Mathematics?’, a coincidental link to his later work at NYU’s Courant Institute of Mathematical Sciences. He also mentions his senior, Hou Xiaodi, a legendary figure who published a CVPR paper as an undergraduate and wrote the ‘SJTU Student Survival Guide,’ and who became his role model.
  • 00:03:00 · Persistence and Rebellion in Computer Vision
    • He explained his deep interest in computer vision, stemming from how he perceives the world. He repeatedly emphasized, ‘The world always tries to stop me from doing what I want to do,’ but he firmly pursued his passion, even proactively contacting Professor Tu Zhuowen at the National University of Singapore for a research internship opportunity, rather than choosing the then more mainstream Microsoft Research Asia.
  • 00:04:00 · The Evolution of Vision and the Cambrian Explosion
    • Saining Xie discusses the biological importance of vision, pondering which sense he would give up if he had to. He connects the evolution of vision with the ‘Cambrian Explosion,’ arguing that the emergence of vision triggered an ‘arms race’ in the biological world and drove rapid species diversification.
  • 00:05:00 · First Paper and the Beginning of Deep Learning
    • He shared his experience of publishing his first research paper (BMVC) during his undergraduate internship. This period, around 2012-2013, coincided with the ‘AlexNet moment’ and the rise of the deep learning revolution, becoming a critical turning point for his in-depth research and entry into the field of deep learning.
  • 00:06:00 · Twists and Turns in PhD Applications and Following a Mentor
    • Saining Xie recounts the twists and turns of his PhD applications, in which he again struggled to secure a place at his preferred computer vision labs. Ultimately, he emailed Professor Tu Zhuowen, who had moved to UCSD, and was admitted. He chose to follow his mentor to UCSD, prioritizing advisor and research direction over university rankings.
  • 00:40:33 · Professor Tu’s Mentorship and Leading by Example
    • Saining Xie shared how his PhD advisor, Professor Tu, helped him when his applications were not going well, and guided him in research by leading by example, even sitting beside him to review code line by line. Professor Tu himself is a respected scientist who independently completed a large amount of low-level code in an era without modern tools.
  • 00:42:45 · Paying Tribute to Pioneering Scientists
    • The guest paid tribute to pioneering Chinese scientists in the US like Professor Tu, Song-Chun Zhu, and Fei-Fei Li, believing that they paved the way for the opportunities available to today’s generation of researchers. They brought Chinese researchers in computer vision from the periphery to the mainstream.
  • 00:45:23 · Representative Works During PhD: DSN and HED
    • The guest introduced two important works from his PhD. DSN (Deeply-Supervised Nets) was rejected by NeurIPS because of an error in a mathematical formula, but received the Test of Time Award from AISTATS ten years later. HED (Holistically-Nested Edge Detection) received a Marr Prize nomination at ICCV, giving him a feeling of ‘early fame’.
  • 00:49:25 · Five Internships in Five Years: Finding Direction Through Exploration
    • During his PhD, with the support of his advisor, Saining Xie completed five internships at NEC Labs, Adobe, Meta, Google Research, and DeepMind. He hoped to explore different environments and directions, understand the world outside academia, and validate his original passion for research.
  • 00:54:46 · Working with Kaiming He: The Birth of ResNeXt
    • During his internship at Meta (FAIR), Saining Xie collaborated with the newly joined Kaiming He. In just one month, they developed a simple idea into ResNeXt, achieving second place in the ImageNet competition. Saining Xie believes Kaiming He possesses a ‘reality distortion field’ that can turn ordinary ideas into gold.
  • 01:02:57 · Impressions of DeepMind: Ambitious Goals and Unique Organization
    • His internship at DeepMind left a deep impression. He finds DeepMind’s organizational and management model distinctive, combining bottom-up exploration with efficient top-down execution. Founder Demis Hassabis’s goal is for the company to win multiple Nobel Prizes, an ambition he finds admirable.
  • 01:06:51 · The Unifying Theme of the PhD Thesis: Representation Learning
    • Although his research directions during his PhD seemed disparate, Saining Xie ultimately unified all his work under the PhD thesis title ‘Deep Representation Learning with Structured Priors’. He believes representation learning is an eternal and fundamental research topic, unlike some popular directions that quickly become outdated.
  • 01:21:06 · Rejecting OpenAI, Choosing FAIR
    • Saining Xie recounted his experience of rejecting OpenAI’s offer and choosing FAIR, explaining FAIR’s attractiveness at the time in terms of academic environment and salary, as well as Ilya’s surprised reaction to this.
  • 01:22:25 · The Meaning and Impact of Research
    • Saining Xie discussed the true purpose of research: publishing papers is not the ultimate goal; sharing knowledge and inspiring others is. Citing Hannah Arendt’s view on the term ‘impact,’ he said he resists overemphasizing ‘impact’ and places more value on understanding and on enhancing overall human intelligence.
  • 01:24:50 · The Importance of Networking and Collaboration
    • Saining Xie emphasized the importance of academic networking and collaboration, believing that research is a vast organism where trust and appreciation among people are built on scientific discoveries, not merely personal relationships.
  • 01:29:03 · Yann LeCun and NYU’s Center for Data Science
    • Saining Xie explained his reasons for choosing NYU, with Yann LeCun’s foresight and the open environment of NYU’s Center for Data Science being important factors. He described the data science center’s unique glass-door offices and interdisciplinary collaboration model.
  • 01:35:55 · Fei-Fei Li and Defining Problems
    • Saining Xie spoke about Professor Fei-Fei Li’s influence on him, particularly admiring her ability to define problems, which he believes is more important than simply building datasets. He mentioned collaborating with Fei-Fei Li on the Thinking in Space and Cambrian-S papers, which expanded his research boundaries in world models and video understanding.
  • 01:42:02 · The Rise and Challenges of Self-Supervised Learning
    • Saining Xie reviewed the development history of self-supervised learning, explaining its difference from traditional supervised learning and why self-supervised learning is considered the future of computer vision. He pointed out that early self-supervised learning faced challenges of poor performance, but its core idea is to enable AI to acquire common sense.
  • 01:47:08 · Contrastive Learning and MoCo’s Breakthrough
    • Saining Xie detailed the basic logic of contrastive learning, which is to pull similar samples closer and push dissimilar samples further apart in the representation space. He pointed out that MoCo (Momentum Contrast) was the first work to truly achieve a breakthrough for the contrastive learning framework in the self-supervised field, and emphasized Kaiming He’s foresight in promoting model scaling.
  • 02:01:42 · The Power of Focus: Taking Kaiming He as an Example
    • The guest uses Kaiming He as an example to illustrate how top researchers demonstrate ‘focus’. This focus manifests as dedicating almost all ‘mental cycles’ to a specific problem, not thinking about anything else besides this problem, which is very difficult to achieve.
  • 02:04:06 · Essential Qualities of Top Researchers
    • A top researcher needs to possess multiple qualities: sufficient focus, good research taste, and the steadfastness to not follow trends. Additionally, strong engineering capabilities, research intuition (research sense), and the ability to quickly grasp key points and establish high-dimensional abstract connections when reading literature are also required.
  • 02:05:46 · How to ‘Find’ a Good Idea: Exploration, Not Epiphany
    • The guest shares the research methodology learned from Kaiming He: good ideas are not conjured out of thin air, but are ‘sought out’ through extensive exploration, reading, and thinking. Ideas that come easily are either already being pursued by others or are bad ideas that have already been proven to fail.
  • 02:06:55 · Research is Like Stochastic Gradient Descent: Finding the ‘Gradient’
    • A research cycle is about six months, with one to two months of exploration being crucial. This process is like Stochastic Gradient Descent (SGD); the focus is not on getting from point A to point B, but on finding the ‘gradient’ (signal) that can guide the direction during the process. This gradient itself is the true source of the researcher’s own ideas.
  • 02:15:15 · The Non-Linearity of Research Impact: Masterpieces and ‘Max’ Optimization
    • The guest cites MIT Professor Bill Freeman’s view, explaining that research impact is highly non-linear. A large number of mediocre or decent works have an impact close to zero, while a truly top-tier masterpiece can bring exponential returns. Therefore, the goal of research is to optimize the ‘maximum’ (Max) of one’s career works, rather than the ‘average’ (Average).
  • 02:17:44 · Infinite Game: Researchers and Inventors
    • Research is an ‘infinite game’; researchers are more like inventors, needing to succeed only once in their lifetime. This differs from chess players or athletes, who play ‘finite games’ where one mistake can lead to complete loss. However, the current finite competition among large companies is dragging academia into a finite game model as well.
  • 02:22:02 · Inventory of Masterpieces in the AI Field
    • The guest lists about 20-25 masterpieces that he believes truly influenced the progress of deep learning, such as LeNet, AlexNet, ResNet, Transformer, GPT-3, GAN, NeRF, etc. He humbly states that his own work (e.g., DiT) has only advanced the frontier a small step and is far from reaching that level.
  • 02:33:53 · ‘To Do Good Work, One Must First Sharpen One’s Tools’: The Importance of Baselines and Scaffolding
    • The guest shares Kaiming He’s experience of single-handedly building TPU infrastructure at FAIR, and from it extracts a methodology: the upper limit of research depends on the quality of the baseline. Only by perfecting the baseline and engineering, building a solid ‘scaffolding,’ can a platform for true exploration be provided, preventing misleading signals.
  • 02:42:12 · Kaiming He’s Influence and ‘Research Taste’
    • The guest shared his mentor Kaiming He’s influence beyond research, including his interest in philosophy, physics, and evolutionary biology. Kaiming He once gifted him ‘The Diamond Sutra’ and emphasized that a PhD, as a Doctor of Philosophy, should understand philosophy, which led to a deep discussion on ‘research taste’.
  • 02:44:20 · The Essence of Research Taste: Breaking Illusions, Pursuing Truth
    • The core of research taste lies in breaking the illusions on the surface of papers and pursuing their underlying essence and truth. This is not just about methods, but a philosophical level of thinking, avoiding obsession with false ‘appearances’ like paper acceptance or fame. The guest also emphasized the importance of writing and refining details (such as typesetting) as communication interfaces.
  • 02:52:05 · Research as Creation: Commonalities Between Scientific Research and Filmmaking
    • The guest compared doing research to making a movie, both being a process of storytelling. The key is not the background, but the decisions made at specific moments, which bring conflicts and changes, driving the plot’s development.
  • 02:55:36 · The Birth of ConvNeXt: Questioning Consensus, Returning to Origins
    • The guest recounted the birth of ConvNeXt. The project originated from questioning the consensus that ‘ViT is powerful because of self-attention.’ Through extensive ablation experiments, the team found that macroscopic architectural design was more important than self-attention itself, ultimately leading to the design of a pure convolutional network.
  • 03:00:30 · DiT’s Unexpected Discovery and FAIR’s Cultural Shift
    • The guest shared the origin story of DiT. Initially, the team wanted to study the representations learned by Diffusion models, but unexpectedly found that using ViT to replace U-Net as the backbone network yielded better, more efficient, and more scalable results. This project, which persisted despite tight resources and a changing culture within FAIR at the time, ultimately achieved great success.
  • 03:11:13 · Missed Opportunities and an ‘Antifragile’ Research Mindset
    • The guest admitted to missing opportunities to join OpenAI and early Perplexity, but he has no regrets. He cited the concept of ‘antifragility,’ believing that research itself is a process that makes one increasingly antifragile; setbacks like paper rejections can instead bring benefits and make one immune.
  • 03:15:42 · ‘Cambrian’ Series: Systematically Deconstructing Multimodal Large Models
    • The guest introduced the latest ‘Cambrian’ series of work. This series aims to systematically examine and deconstruct various components of multimodal large models, such as visual encoders, data composition, and architecture, to find the truly important factors. This continues the spirit of tracing back to the source that was behind ConvNeXt and DiT.
  • 03:22:45 · Challenges in North American Academia: Funding and Resource Crisis
    • The speaker complains about the resource predicament facing North American academia, especially funding. He points out that despite severe inflation, funding from sources such as the NSF has not increased significantly in decades, and corporate sponsorship is limited and highly competitive. This forces scholars to ‘beg for resources’ like entrepreneurs.
  • 03:25:00 · Academic Circle’s ‘Fundraising’ Story: Pitching Ideas to Google During a Hike
    • The speaker shares his experience of how he pitched ideas to Google collaborators like an entrepreneur to obtain TPU computing resources. He emphasizes that it was this process of seeking collaboration and support under extremely limited resources that made subsequent research possible, and he thanks his students for their significant contributions under challenging conditions.
  • 03:27:37 · Inspiration from Cinema: From Long Takes to Video Understanding
    • The speaker explains his motivation for shifting from image research to video research, deeply influenced by directors Bi Gan and Jia Zhangke. He believes Bi Gan’s ‘long take’ perfectly interprets humanity’s continuous perception of the world, while Jia Zhangke’s discussion on ‘spatial expansion on a timeline’ highlights the importance of understanding space in temporal sequences, which became the philosophical foundation for his video understanding research.
  • 03:29:38 · Evolutionary Blueprint for Multimodal AI: From L0 to L4
    • The speaker proposes an evolutionary framework for multimodal AI, similar to autonomous driving levels. L0 is a pure language model, L1 is the current image-text Q&A system, L2 is streaming event cognition, L3 is spatial cognition, and the ultimate goal L4 is to build a Predictive World Model.
  • 03:39:57 · Is CV Being Marginalized? This is a Huge Opportunity
    • Responding to the host’s question about CV being marginalized by LLMs, the speaker expressed no discouragement, but rather saw it as a huge opportunity. He pointed out that current multimodal tasks overly rely on language as a ‘crutch,’ neglecting visual representations truly grounded in the physical world, which is precisely where visual research can thrive.
  • 03:44:12 · Real Intelligence vs. Virtual Intelligence: Limitations of Language Models
    • The speaker defines ‘real intelligence’ as intelligence capable of interacting with the physical world. He believes LLMs primarily operate in digital virtual spaces, whereas real-world tasks like robotics and industrial control involve continuous, high-dimensional, noisy signals, which are difficult for LLMs to handle, highlighting the fundamental importance of vision.
  • 03:49:13 · Debate on Scaling Law: Language is Strongly Supervised, Vision Doesn’t Need Scaling Law
    • The speaker put forward a bold view: the training of language models is essentially ‘strong supervised learning,’ not self-supervised. This is because language itself is a highly structured knowledge compression, accumulated by human civilization over thousands of years; it is a communication tool, not a thinking tool, and a large amount of continuous physical world information is lost during this compression. Therefore, vision may not need to follow the same Scaling Law as language.
  • 03:58:10 · From V* to Thinking with Images: How Academic Research Inspires Industry
    • The speaker recounted how his V* project inspired OpenAI’s ‘Thinking with Images’ work. Through discussions with OpenAI researchers, his ideas on visual reasoning and test-time scaling were adopted and eventually productized. This experience made him feel the value of academic research, but he also expressed regret about the increasingly closed nature of industrial research, with less attribution and citation.
  • 04:03:18 · Don’t Fear High Dimensions: The Cornerstone of Representation Learning
    • The guest started by discussing the representation layer of Autoencoders, citing Professor Ma Yi’s view, emphasizing that high dimensionality is the cornerstone of all machine learning. Whether it’s past kernel methods or current Transformers, high-dimensional spaces can solve problems that low-dimensional spaces cannot, so one should not fear high dimensionality.
  • 04:05:22 · Betting on the Future: Representation is Core, Language is Interface
    • The guest put forward a core judgment about the future: learning good representations is the only important thing. In the future, language models will degrade into simple communication interfaces, while true intelligence will be driven by underlying, sufficiently good representations (i.e., world models).
  • 04:11:08 · What is a World Model? From Cybernetics to Cognitive Science
    • The guest defined a world model: a system that can predict future states based on current states and actions. He traced the history of this concept, from Kenneth Craik’s cognitive theory in 1943 to Model Predictive Control (MPC) in cybernetics, explaining that its core is prediction and planning.
  • 04:15:28 · Dyna and the Two Systems of Intelligence: Is the Father of RL Against RL?
    • The guest cited the Dyna paper by Rich Sutton, the father of reinforcement learning, to discuss two systems of intelligence: reactive and model-based, analogous to System 1 and System 2 from ‘Thinking, Fast and Slow’. Sutton also believed that pure RL is primitive and requires a world model for planning.
  • 04:20:27 · The Core of World Models: What Exactly is ‘State’?
    • The guest delved into the essence of ‘State’ in world models, arguing that it should not be a pixel-level precise reconstruction, but rather an abstract, hierarchical representation useful for decision-making. How to construct such an effective state representation is precisely the core task of representation learning.
  • 04:25:47 · All Roads Lead to Rome: All AI Research is Heading Towards World Models
    • The guest believes that whether it’s LLMs, video generation, or 3D reconstruction, all current technological paths in the AI field are essentially approaching the ultimate goal of ‘world models’ from different angles. Therefore, current debates over different approaches might seem ridiculous in the future.
  • 04:31:31 · Four Key Characteristics of an Ideal World Model
    • The guest (citing Yann LeCun) summarized several key characteristics of an ideal world model: understanding the physical world, possessing long-term memory, being capable of planning and reasoning, and being controllable and safe. This fundamentally differs from how current LLMs rely on fine-tuning for safety.
  • 04:36:43 · Fundamental Flaws of LLMs in Processing Continuous Signals
    • The guest criticized the current LLM approach of discretizing and sequentializing continuous spatiotemporal signals like video into tokens, deeming it completely unreasonable. He pointed out that this method ignores the global state of world representation and violates ‘The Bitter Lesson’, as language itself is an imposed human knowledge structure.
  • 04:43:51 · Differences in Scaling Laws between Language Models and World Models
    • The guest points out that the Scaling Law for language models is based on knowledge representation, while world models, especially those based on visual intelligence, may have very different Scaling Laws, and the model size does not necessarily need to be very large.
  • 04:45:30 · Core Capabilities of World Models: Understanding and Filtering
    • World models do not need to memorize all details; instead, they answer questions by understanding and filtering information. The guest uses the example of the human brain processing high-bandwidth sensory information and outputting low-bandwidth behavioral patterns to emphasize the importance of filtering systems.
  • 04:47:05 · Data Challenges of World Models and the Concept of ‘Downloading Humans’
    • Training world models faces immense data challenges, far exceeding those of language models. The guest proposes the concept of ‘downloading humans,’ which involves collecting human sensory data, and points out that platforms like YouTube have massive amounts of video data, but data crawling and copyright issues are significant obstacles.
  • 04:50:48 · Potential Applications of World Models: AI Glasses and Robots
    • The guest believes that AI glasses (personal assistants) and robots are two important application outlets for world models. AI glasses require world models to understand the environment and provide decision support, while robots need a more powerful ‘brain’ to achieve general intelligence.
  • 04:56:23 · From Academia to Entrepreneurship: Defining Problems and Finding New Paradigms
    • The guest explains his reasons for choosing entrepreneurship, believing that within academia and large tech companies, constrained by resources and product cycles, it is difficult to conduct truly cutting-edge, problem-defining research, easily falling into the ‘medium paper trap’.
  • 05:02:18 · The ‘Invisible World’ and Real Needs Under the Silicon Valley Narrative
    • The guest points out that under the Silicon Valley LLM narrative, there exists an ‘invisible world,’ which refers to a large number of real-world problems and needs in the physical world not directly addressed by LLMs, such as in farms and hospitals. These problems require world models to solve, but their data and problem definitions are invisible under the current paradigm.
  • 05:07:00 · Company Vision: Building a General World Model and Research-Driven Approach
    • The guest’s company aims to build a general world model and use it as a foundation to support various downstream applications such as language, vision, action, and robotics. The company considers research breakthroughs as its most important product, attracting like-minded young researchers to jointly explore the frontier.
  • 05:11:51 · Researchers and Team Culture: Avoiding the ‘Superhero’ Model
    • The guest emphasizes that the company culture is research-driven, not pursuing ‘superhero’ star researchers, but rather hoping to attract young people with a sense of mission and a willingness to grow together. He believes that past successful individuals may find it difficult to create breakthroughs again, and places more importance on overall team collaboration and the spirit of exploring the frontier.
  • 05:24:24 · AMI Labs: A Bridge for Top Academic Talent
    • Saining Xie explained his original intention for founding the company: to create a channel for talented students who outperform many industry researchers but lack opportunities in academia. He hopes to connect them with the historical process of building general artificial intelligence.
  • 05:29:23 · ‘Reverse OpenAI’ and the Alliance Model
    • He introduced the concept of ‘Reverse OpenAI,’ which means not downloading data from the internet, but rather building a world model through an alliance of partners who possess specific data and problems. He used Mastercard as an analogy for this model, comparing it to an alliance of small banks competing with Visa.
  • 05:31:51 · A Global, Decentralized Startup
    • From day one, the company will be global, with offices in Paris, New York, Montreal, and Singapore. This decentralized structure, led by the neutral figure Yann LeCun, aims to attract global partners and resist monopolies.
  • 05:33:02 · Why Yann LeCun? The Charisma Behind the ‘Internet Troll’
    • Saining Xie explained why he chose to join LeCun. He described LeCun as a principled, warm, and inspiring person, in contrast with his public image as an ‘internet troll.’ He also shared LeCun’s diverse hobbies and artistic flair, including model airplanes, astrophotography, electronic music, and sailing.
  • 05:43:21 · The Three Realms of JEPA: From Doubt to Understanding to Becoming
    • He described his personal cognitive journey with LeCun’s JEPA architecture. From initial skepticism to a deep understanding that JEPA is not just a model but a complete cognitive architecture, he eventually became a believer himself.
  • 05:54:12 · The Courage of Entrepreneurship: A Skiing Metaphor
    • He compared entrepreneurship to skiing, emphasizing the necessity of balance and the counter-intuitive courage to lean one’s body downhill. He quoted and agreed with the creed, ‘The anthem of humanity is the anthem of courage.’
  • 05:57:14 · Recruitment Philosophy: Seeking Passion and Persistence
    • His recruitment focus is on finding individuals who possess an obsessive passion and persistence for a particular problem. He shared Kaiming He’s advice, which is to identify true researchers by observing whether they are thinking about the same problem while eating, showering, and even sleeping.
  • 06:04:57 · Entrepreneurship and the Future of AI
    • The guest discussed the path of entrepreneurship and the broad prospects of AI, emphasizing the role of Large Language Models (LLMs) in AI development. He believes companies should focus on solving big problems and exploring new breakthroughs.
  • 06:11:29 · Entrepreneurial Insights and the Essence of Intelligence
    • The guest shared his true feelings about entrepreneurship, including challenges and joys, and firmly believes his choice was correct. He delved into the definition of intelligence, arguing that human intelligence is specialized, and AI should pursue human-like intelligence while discarding human arrogance. He cited Rich Sutton’s view that building squirrel-level intelligence is more challenging.
  • 06:18:51 · Definition of Intelligence and the Role of Robotics
    • The guest continued to discuss the definition of intelligence, emphasizing that it is more than just language models. He pointed out that robotics is the ‘appropriate outlet’ for AI, and attention should be paid to robots’ ability to perform practical tasks like household chores, which are simple for human children but difficult for current robots.
  • 06:27:19 · Personal Philosophy and Overcoming Setbacks
    • The guest explained his personal motto ‘You are not the chosen one, you are just an ordinary person,’ linking it to his favorite football team’s coach. He described research as a journey of fumbling in the dark and emphasized the importance of finding inspiration and human connection.
  • 06:33:24 · World Models and Real-World Interaction
    • The guest emphasized that AI needs to transcend research boundaries and interact with the real world. He shared observations from living in New York, where different people lead their own lives, making him realize that AI is not a core concern for everyone.
  • 06:37:19 · AI-Related Media Recommendations
    • The guest recommended TV series and films exploring AI themes, such as ‘Person of Interest,’ ‘Battlestar Galactica,’ and ‘Full Pixel Space,’ noting their profound depiction of AI’s impact.
  • 06:40:51 · Profound Books and the Essence of Understanding
    • The guest discussed two books that profoundly influenced him: ‘Gödel, Escher, Bach’ and ‘Zen and the Art of Motorcycle Maintenance,’ highlighting their philosophical depth and personal impact. He reflected on how these books shaped his understanding of the world and himself.
  • 06:44:38 · Data, Architecture, and the Importance of Connection in AI
    • The guest discussed the importance of data in generative AI models like Stable Diffusion, pointing out that 90-95% of the challenges lie in data. He also emphasized the core value of human connection and communication, both in research and entrepreneurship.
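
The contrastive-learning logic summarized in the 01:47:08 chapter (pull similar samples closer and push dissimilar samples apart in representation space) can be illustrated with a small sketch. The code below is not taken from the episode: it assumes an InfoNCE-style loss, one common formulation of this idea, and the function names, toy vectors, and temperature value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax score of the positive
    sample among the positive and all negatives (illustrative only)."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D representations (assumed values, not real model outputs).
query = [1.0, 0.0]
negatives = [[-1.0, 0.2], [0.0, -1.0]]

loss_close = info_nce(query, [0.9, 0.1], negatives)  # positive near the query
loss_far = info_nce(query, [0.0, 1.0], negatives)    # positive far from the query
```

Under these assumptions the loss is lower when the positive sample already sits close to the query than when it sits far away, so minimizing it pulls matching pairs together relative to the negatives.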

Notable Quotes (69)

  • 00:00:39 — 谢赛宁:

    Original (中文): 我不知道,我觉得我更适合做一个听众。我很喜欢 podcast。我经常听很多的 podcast。 I don’t know, I think I’m more suited to be a listener. I really like podcasts. I often listen to many podcasts.

    • Reveals his preference for consuming content over creating it, and his passion for podcasts, setting the stage for his first interview.
  • 00:00:57 — 谢赛宁:

    Original (中文): 我爸是一个纯粹的死宅。从不外出。但是他最爱看的事情就是看书。所以我家里反正有一个书房吧,然后几面墙都是都是书。 My dad is a total homebody. He never goes out. But what he loves most is reading. So there is a study in our home, and several of its walls are covered in books.

    • Emphasizes the critical influence of his family environment (especially his father’s reading habits) on his knowledge accumulation and interest development during childhood.
  • 00:00:59 — 谢赛宁:

    Original (中文): 我这个后训练现在有点崩,所以中英夹杂的问题,对,观众朋友们不好意思,我尽量尽量解释。 My post-training is a bit broken right now, so about the mixing of Chinese and English: yes, apologies to the audience, I will do my best to explain.

    • He humorously self-deprecates about the change in his language habits after living abroad for a long time, showing humility and sincerity.
  • 00:01:07 — 谢赛宁:

    Original (中文): 我第一次知道什么叫做内容。然后那时候就会觉得,自己突然有了更多的表达欲。 For the first time, I knew what content was. And at that time, I felt that I suddenly had more desire to express myself.

    • Describes the profound impact of the internet on his desire for self-expression, which became one of the intrinsic drivers for his later engagement in scientific research and entrepreneurship.
  • 00:01:43 — 谢赛宁:

    Original (中文): 我其实就是说我看跟谁比对吧,跟那些最顶尖的竞赛选手,像我刚刚描述的这颗非常顺利的这个对吧,姚班大神,然后斯大PhD,斯大教授来比,那我真的是远远不如。 I mean, it depends on who I’m comparing myself to, right? Compared to the top competition participants, like the very smooth ‘Type A’ path I just described, right, the Yao Class prodigies, then Stanford PhDs, Stanford professors, I’m far from them.

    • He humbly evaluates his academic background, contrasting it with the ‘elite’ path, highlighting his non-traditional and more personally chosen growth trajectory.
  • 00:02:50 — 谢赛宁:

    Original (中文): 我觉得这个世界总是不想让我去做我想要做的事情。但是,但是我偏偏要做我想要做的事情。 I feel like the world always tries to stop me from doing what I want to do. But, but I insist on doing what I want to do.

    • This is a core philosophy running through his personal experience, demonstrating his strong willpower and determination to stick to his own choices.
  • 00:05:05 — 谢赛宁:

    Original (中文): 因为我觉得我感受这个世界的方式就是通过视觉。 Because I feel that the way I perceive the world is through vision.

    • Directly clarifies his deep personal motivation for choosing research in computer vision, which is his profound understanding of visual perception and interaction with the world.
  • 00:09:18 — 谢赛宁:

    Original (中文): 你要想,如果你不做这件事情,这件事情在这个世界上永远不会发生。 You have to think, if you don’t do this thing, this thing will never happen in this world.

    • A philosophical statement emphasizing the importance of individual action and everyone’s unique potential for contribution in the world.
  • 00:41:39 — 谢赛宁 (Saining Xie):

    Original (中文): 涂老师是那种,坐在你的显示器旁边,跟你一行一行代码往后去对的这样一个老师。 Professor Tu is the kind of teacher who would sit next to your monitor and go through your code line by line with you.

    • Vividly describes his advisor’s hands-on guidance style, reflecting the rigor and mentoring tradition of an older generation of scientists.
  • 00:42:58 — 谢赛宁 (Saining Xie):

    Original (中文): 他们其实是闯出了一条路,对,本来这条路是不存在的。 They actually blazed a trail, yes, a path that didn’t exist before.

    • Highly praises the pioneering contributions of senior Chinese scientists in expanding the academic landscape in the US.
  • 00:50:47 — 谢赛宁 (Saining Xie):

    Original (中文): 你要说什么一鸣惊人,我当初确实觉得,嗯,你看我也是年少成名了…很不幸,这是我最后一次拿 best paper。 If you talk about making a splash, I really thought at the time, ‘Hmm, look, I’ve achieved early fame…’ Unfortunately, that was the last time I won a best paper award.

    • Humorously and self-deprecatingly reflects on the highlights of his PhD career and the subsequent calm, illustrating the serendipitous and long-term nature of research.
  • 01:14:35 — 谢赛宁 (Saining Xie):

    Original (中文): 一个线性的 research 永远不是好的 research。 Linear research is never good research.

    • Succinctly summarizes the non-linear, uncertain nature of research.
  • 01:19:23 — 谢赛宁 (Saining Xie):

    Original (中文): 我只考虑的事情是,我应该去做哪里,做我最想做的事情,然后最好是跟我最想要共事的人一起共事。 The only thing I consider is where I should go to do what I most want to do, and preferably work with the people I most want to collaborate with.

    • Clearly articulates his core principles in career choice: pursuing interests and collaborating with excellent people, rather than chasing fame, fortune, or a predetermined path.
  • 01:21:14 — 谢赛宁 (Saining Xie):

    Original (中文): 我什么都没说我就把OpenAI拒了,他们发给我一个offer,然后说我不去,抱歉。 I rejected OpenAI without saying anything. They sent me an offer, and then I said I wouldn’t go, sorry.

    • Directly and decisively rejected OpenAI’s offer, showing his preference for FAIR at the time.
  • 01:21:22 — 谢赛宁 (Saining Xie), recounting Ilya Sutskever:

    Original (中文): 你为什么不讨论一下就把这个offer拒了?是我们给的钱不够吗? Why did you reject this offer without discussing it? Was the money we offered not enough?

    • Ilya Sutskever’s surprise at Saining Xie rejecting OpenAI indirectly reflects OpenAI’s status and attractiveness in the industry at the time.
  • 01:21:34 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得至少在那个时间点上,我身边的所有人如果有这样的选择的话,除非他们是确实要做一些OpenAI已经在做得很擅长的这些事情,我觉得大家还是会倾向于FAIR的。 I think at least at that point in time, if everyone around me had such a choice, unless they really wanted to do things that OpenAI was already very good at, I think most people would still lean towards FAIR.

    • Explained why top PhD graduates at the time preferred FAIR over OpenAI, emphasizing FAIR’s academic environment advantages.
  • 01:22:25 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得research的意义,我觉得research不是为了发论文,我不认为发论文是是是这件事情的一个目的。 I think the meaning of research, I think research is not for publishing papers. I don’t think publishing papers is the goal of this matter.

    • Expressed a deep understanding of the purpose of research, going beyond mere paper publication, emphasizing knowledge sharing and inspiration.
  • 01:23:30 — 谢赛宁 (Saining Xie):

    Original (中文): 我不在乎什么impact,我不在乎影响力这件事情。他觉得impact这个词是一个过于aggressive,过于男性化的一个词。 I don’t care about impact, I don’t care about this ‘influence’ thing. He thinks the word ‘impact’ is too aggressive, too masculine.

    • A unique perspective on the concept of ‘impact,’ considering it too aggressive and preferring to foster resonance through understanding.
  • 01:24:47 — 谢赛宁 (Saining Xie):

    Original (中文): 如果能让这个世界上所有的人因为我们做的研究,能够对问题多了一层新的认识,多了一层新的了解,那这个地球上的智能总量就会被提上去。但地球上智能总量提升这件事情永远不是一件错误的事情。 If our research can give everyone in the world a new layer of understanding and knowledge about problems, then the total amount of intelligence on Earth will be raised. And raising the total amount of intelligence on Earth is never a wrong thing.

    • Elaborated on his ultimate research goal: to increase the total intelligence on Earth, considering it an eternal pursuit beneficial to the world.
  • 01:25:49 — 谢赛宁 (Saining Xie):

    Original (中文): 我从来没有一次要求过任何一家这样的媒体去做这样的宣传。我跟我学生说,你们千万不要去什么去小红书啊,去什么知乎去宣传自己的工作。 I have never once asked any such media to do this kind of promotion. I tell my students, ‘You absolutely must not go to Xiaohongshu or Zhihu to promote your work.’

    • Expressed a cautious attitude towards promoting research results, opposing excessive personalization and hype, emphasizing the essence of the work and the visibility of young people.
  • 01:27:14 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得这件事情也很难刻意地做到。或者说这件事情也有点玄学。我会觉得你可以叫它某种吸引力法则,或者说你可以认为大家想法一致的人,最终都会聚合在一起。 I think this is also very difficult to achieve intentionally. Or rather, this matter is a bit mystical. I would say you can call it a law of attraction, or you can believe that people with similar ideas will eventually gather together.

    • Explained the ‘mystical’ process of connecting with top researchers, believing that like-minded individuals will naturally converge.
  • 01:28:23 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得他是一个有某种极致的专注力,然后这个专注力能够让他有某种心流,他能够沉浸在这个问题上,不需要考虑这个世界上发生的所有其他事情。 I think he has a kind of extreme focus, and this focus allows him to enter a state of flow, where he can immerse himself in the problem without having to consider anything else happening in the world.

    • Highly praised Kaiming He’s extreme focus and ability to enter a ‘flow state,’ considering it a characteristic of top researchers.
  • 02:09:50 — guest:

    Original (中文): 这个梯度(gradient)本身,这件事情,才是你真正的idea的来源。……一开始你想的这个idea不是你的idea,这个idea不属于你,探索中的idea才是属于你的idea。 This gradient itself, this matter, is the source of your true idea. … The idea you initially think of is not your idea; that idea does not belong to you. The idea found through exploration is your idea.

    • An incisive summary of the core idea that the ‘exploration process’ is more important than ‘initial ideas’ in research.
  • 02:11:38 — guest:

    Original (中文): 最差的research是什么样的research?就是一开始你定义好了一个问题,……最后你发了一篇论文,这个论文的idea跟你一开始想的idea完全一致,你没有遇到任何的障碍,你没有遇到任何的困难。……这件事情说明你的这个idea是一个boring idea。 What kind of research is the worst research? It’s when you define a problem at the beginning… and finally publish a paper where the idea is completely consistent with your initial idea, you didn’t encounter any obstacles, you didn’t encounter any difficulties. … This indicates that your idea is a boring idea.

    • Presents a counter-intuitive but profound point of view, suggesting that smooth-sailing research often implies mediocre ideas.
  • 02:17:15 — guest:

    Original (中文): 你这辈子只需要成功一次就好了。 You only need to succeed once in your life.

    • Vividly describes the non-linear characteristic of research impact, emphasizing the immense value of a single breakthrough.
  • 02:19:07 — guest:

    Original (中文): 现在制定这个去哪的人是OpenAI,是Google……他们是有限游戏,但导致他们把学术界也带成了一个有限游戏的这种决策的这样一个链条。……我们怎么样在这个范式下面,用这种叫做’peanuts of resources’,用花生米一样少的这种资源,然后尝试去追赶。 Now, the ones deciding where to go are OpenAI, Google… they are playing a finite game, but this leads to a decision chain that drags academia into a finite game model as well. … How do we, under this paradigm, use ‘peanuts of resources,’ resources as scarce as peanuts, and try to catch up?

    • Acutely points out the current dilemma faced by AI academia: being swept into the finite game rhythm of the industry.
  • 02:20:43 — guest:

    Original (中文): 我之所以去Google做这个工作,原因是我先看看Google大家在做什么,这样我就知道我在学术界不做什么。因为如果你在做这件事情的话,我为什么要跟你一起做呢? The reason I went to Google to do this work is that I first wanted to see what everyone at Google was doing, so I would know what not to do in academia. Because if you are doing this, why should I do it with you?

    • Reveals a clever research strategy in situations of unequal resources: actively avoiding the main battlegrounds of giants and seeking differentiated paths.
  • 02:41:32 — guest:

    Original (中文): 不好不差就没有信号。一个negative的信号的反方向就是一个正向的信号,一个positive的结果的正方向也是一个好的信号。 If it’s neither good nor bad, there’s no signal. The opposite direction of a negative signal is a positive signal, and the positive direction of a positive result is also a good signal.

    • Clearly explains how to extract information from experimental results, emphasizing that bad results can be even more valuable than no results.
  • 02:41:51 — guest:

    Original (中文): 你要学会做预测。在你跑每一个实验的时候,你要预测这个实验的结果应该是怎么样的。……如果你想对了,说明你前面的这个思维链条是可以往前继续延伸、往前继续推的。如果你想错了,again,这也是一个surprise,也是一样的,也给了你一个信号。 You need to learn to make predictions. Before running each experiment, you should predict what the result of that experiment should be. … If you predicted correctly, it means your previous chain of thought can be extended and pushed forward. If you predicted incorrectly, again, that’s also a surprise, and it gives you a signal.

    • Provides a concrete, actionable scientific methodology, namely accelerating cognition and iteration through a ‘predict-verify’ cycle.
  • 02:43:26 — 谢赛宁 (Saining Xie):

    Original (中文): 他一直劝我们的事情是说…欸,那个赛宁,你们在美国读博士,你们的title可都是PhD啊, it’s a Doctor of Philosophy, 是哲学博士。但为什么你们培养出来的人一点哲学都不懂呢? He always advised us… ‘Hey, Saining, you’re doing your PhD in the US, and your title is PhD, it’s a Doctor of Philosophy. But why do the people you train not understand any philosophy?’

    • Reveals mentor Kaiming He’s emphasis on researchers’ philosophical literacy, providing context for understanding ‘research taste’.
  • 02:44:36 — 谢赛宁 (Saining Xie):

    Original (中文): 研究审美…它真的是一个内化的东西…包含我其实上述所说的所有的这些事…具体怎么做事情,我觉得这些事情都包含在之内。但…也涉及到一些更high-level的这种这种这种哲学…部分的这种考量。 Research aesthetics… it’s really an internalized thing… it includes all the things I mentioned above… how to do things specifically, I think all these things are included. But… it also involves some higher-level philosophical… considerations.

    • Defines ‘research taste’ as a comprehensive concept that transcends specific methods and delves into philosophical considerations.
  • 02:45:51 — 谢赛宁 (Saining Xie):

    Original (中文): 凡所有相,皆是虚妄。若见诸相非相,即见如来…你看到的这个事情的本题…你看到的世界也不是实至。 All phenomena are illusory. If one sees that all phenomena are not phenomena, then one sees the Tathagata… The essence of what you see… the world you see is not real.

    • Cites the core idea of ‘The Diamond Sutra’ to analogize the pursuit of scientific research, emphasizing the need to see through appearances and explore the essence of things.
  • 02:52:05 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得做research的过程跟拍电影过程其实没什么不一样。 I think the process of doing research is actually no different from the process of making a movie.

    • Proposes a core analogy, comparing the essence of scientific research to the storytelling and decision-making process in filmmaking.
  • 02:55:52 — 谢赛宁 (Saining Xie):

    Original (中文): 不是因为看见,所以相信。是因为相信,所以看见。 It’s not because you see, therefore you believe. It’s because you believe, therefore you see.

    • Cites his undergraduate teacher’s view, emphasizing the leading role of belief in scientific discovery as an important driving force for exploring the unknown.
  • 03:14:18 — 谢赛宁 (Saining Xie):

    Original (中文): Research其实必须得要是一个反脆弱的系统…一个可能的一个random的event,某种黑天鹅事件发生,或者说某种shock…这件事情如果对于这个组织,对于这个人或者对于这个事情来说,你的收益要比你的损失要大,那你的这个组织就是一个反脆弱的组织。 Research actually must be an antifragile system… a possible random event, some black swan event, or some shock… If, for this organization, this person, or this matter, your gains are greater than your losses, then your organization is an antifragile organization.

    • Clearly explains the application of the ‘antifragile’ concept in scientific research, suggesting that research systems should benefit from uncertainty and setbacks.
  • 03:27:44 — guest:

    Original (中文): 人活在这个世界上就是长镜头。我们的眼睛就是我们的相机, 我们不停歇地在这个世界上面做各种各样的事情, 对吧, 然后我们看到的东西, 这个介质都是video, 都是视频。 Humans living in this world are like a long take. Our eyes are our cameras, we continuously do all sorts of things in this world, right? And what we see, this medium, is all video, all video.

    • Vividly analogizes human continuous visual perception with the concept of a film ‘long take,’ using it to argue for the fundamental nature of video understanding compared to static image understanding.
  • 03:28:42 — guest:

    Original (中文): 贾樟柯说了一句话, 我觉得我非常有认同。他说这个电影之所以很有意思, 是因为你如果只看这个timeline的话, 这是一根时间轴, 它是一个线性的时间轴。但是在这个时间轴的每一个点上, 你需要一个空间去扩展它的时间。 Jia Zhangke said something I strongly agree with. He said the reason film is so interesting is that if you only look at the timeline, it’s a single timeline, a linear timeline. But at every point on this timeline, you need a space to expand its time.

    • Citing Jia Zhangke’s perspective, it profoundly reveals the dialectical relationship between time and space in visual narrative and understanding, providing a philosophical basis for his research.
  • 03:44:27 — guest:

    Original (中文): 现在大家都是只是拄着拐杖, 这个拐杖就是语言模型本身。虽然你可以走走路, 然后你会觉得我走得挺好的, 但是你可能跑不起来, 你也没有办法去参加这个奥运会。对, 因为你有一根腿, 这部分是所谓的视觉的表征的这一根腿, 现在还是还是还是不够好。 Right now, everyone is just leaning on a crutch, and this crutch is the language model itself. You can walk around, and you might feel you’re walking quite well, but you probably can’t run, and you certainly can’t compete in the Olympics. Yes, because one of your legs, the so-called visual representation leg, is still not good enough.

    • Citing Yann LeCun’s analogy, it vividly illustrates the current multimodal systems’ over-reliance on language models and the fundamental flaw of lacking strong visual representations.
  • 03:51:20 — guest:

    Original (中文): 一个东西免费不代表它没有label。语言是什么? 语言是人在过去这么几千年的civilization, 经过不断的演化, 然后在不管是社会学的意义上, 还是每一个单独的个体的意义上, 然后process了所有的关于这个世界的一切, 然后以一个tokenized的方式把它存储下来。 Something being free doesn’t mean it has no label. What is language? Language is what results when humans, over thousands of years of civilization and continuous evolution, both in a sociological sense and at the level of each individual, have processed everything about this world and stored it in a tokenized form.

    • Proposes a groundbreaking view, arguing that language model training data is not unsupervised, but rather ‘strongly supervised’ data processed and annotated over long periods by human civilization, challenging traditional perceptions of self-supervised learning.
  • 03:53:58 — guest:

    Original (中文): Language is a communication tool, it’s not a thinking tool. 它是一个交流的工具。如果它是一个交流的工具的话, 你总要make一些trade-off, 你总要牺牲掉一些东西。 Language is a communication tool, it’s not a thinking tool. It is a tool for communication. If it’s a communication tool, you always have to make some trade-offs, you always have to sacrifice some things.

    • Clearly defines the essential function of language and points out the inevitable information compression and loss it entails as a communication tool, explaining why pure language models cannot fully understand the physical world.
  • 04:04:01 — 谢赛宁 (Saining Xie):

    Original (中文): 你们一定不能害怕高维度。高维度是所有机器学习里面非常非常重要的一个一个基石。不管是之前的所谓的这种核学习的方式,还是现在为什么一个Transformer里面,我们得要有这种up-projection layer。 You must not be afraid of high dimensions. High dimensionality is a very, very important cornerstone in all machine learning. Whether it’s the previous so-called kernel learning methods, or why in a Transformer now, we must have these up-projection layers.

    • Emphasized the core importance of high-dimensional representations in the history and modern architecture of machine learning.
  • 04:06:29 — 谢赛宁 (Saining Xie):

    Original (中文): 这个世界上只有一件事情是重要的,就是怎么学习到这个表征,这件事情是重要的。当你有了一个足够好的表征之后,在上面处理其他的问题都是简单的。你的language model会逐渐会退化掉到一个简单的communication interface。 There is only one thing that is important in this world, and that is how to learn this representation. When you have a good enough representation, processing other problems on top of it becomes simple. Your language model will gradually degrade into a simple communication interface.

    • Clearly articulated his core argument and prediction for the future: representation is the core, and language models are auxiliary interfaces.
  • 04:26:11 — 谢赛宁 (Saining Xie):

    Original (中文): 我们所有人,不管你在做LLM还是做什么video diffusion model,还是做这个gaussian splatting,我们所有人都在通往世界模型的道路上。所以,我说我有的时候这些竞争或者说这些arguments,听起来我觉得过不了多久,可能过一到两年时间,都会显得异常可笑。 All of us, whether you are doing LLMs or video diffusion models, or gaussian splatting, we are all on the path to world models. So, I say sometimes these competitions or these arguments, I think in a short while, maybe in one to two years, will seem exceptionally ridiculous.

    • Proposed a unified vision, suggesting that current debates over different AI technological paths are temporary and will eventually converge on building world models.
  • 04:39:28 — 谢赛宁 (Saining Xie):

    Original (中文): 像素本身也是一个接口,它不是一个…它是给人和看的。语言也是一个接口,它是给人和看的。但它不是world model的核心。world model的核心是它在自发地去学到更好的表征,去做更好的预测。 Pixels themselves are also an interface; they are not… they are for humans to see. Language is also an interface; it is for humans to see. But it is not the core of a world model. The core of a world model is that it spontaneously learns better representations to make better predictions.

    • Clearly distinguished human perception interfaces (pixels, language) from the core required for machine intelligence (underlying representations).
  • 04:44:18 — 谢赛宁 (Saining Xie):

    Original (中文): 语言模型的scaling law是基于一个对knowledge的这种representation所得来的这样一种scaling law。 The scaling law of language models is a scaling law derived from the representation of knowledge.

    • Explains the underlying logic of language model Scaling Law, paving the way for subsequent comparison with world models.
  • 04:44:51 — 谢赛宁 (Saining Xie):

    Original (中文): 世界模型,尤其是基于这种visual intelligence的世界模型,我觉得它会有一个非常非常不一样的scaling law。 World models, especially those based on visual intelligence, I think will have a very, very different scaling law.

    • Emphasizes the fundamental difference between world models and language models in terms of Scaling Law, hinting at new research directions.
  • 04:45:11 — 谢赛宁 (Saining Xie):

    Original (中文): 它不需要通过解一个什么确定的方程,在一个巨高维的空间里面,的方式去判断一颗苹果是不是落下来。 It doesn’t need to determine if an apple falls by solving some definite equation in a super high-dimensional space.

    • Vividly illustrates that the core capability of world models lies in understanding and filtering, rather than rote memorization or complex calculations.
  • 04:46:32 — 谢赛宁 (Saining Xie):

    Original (中文): 我们大腦是怎么样一个模型,能够在20瓦的功率下面,把10亿bits per second的信息,通过我们眼睛还有各种各样感官输入进来,转化成我们10个bits per second的一个行为模式。 What kind of model is our brain, that it can take in 1 billion bits per second of information through our eyes and various other senses and turn it into behavior at about 10 bits per second, all on 20 watts of power?

    • Using the example of the human brain, it explains the mechanism by which world models efficiently process and filter information, ultimately converting it into decisions and actions.
  • 04:47:52 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得过去时代是dump这个download Internet的时代。现在时代是download human的时代。 I think the past era was the era of ‘dumping’ or ‘downloading’ the Internet. Now is the era of ‘downloading humans’.

    • Proposes a bold and imaginative view, indicating the future direction of AI data acquisition, shifting from internet data to human experience data.
  • 04:52:00 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得这件事情可能才是一个真正的难点。可能是一个比数据还要更难的问题。 I think this might be the real difficulty. It might be an even harder problem than data itself.

    • Points out that the biggest challenge for world models might not be data itself, but how to define and build their ultimate product form.
  • 04:57:47 — 谢赛宁 (Saining Xie):

    Original (中文): 我唯一喜欢World Model这一点,是因为它能告诉大家我做的是World Model,而不是Word Model。 The only thing I like about World Model is that it tells everyone I’m doing a World Model, not a Word Model.

    • Citing Yann LeCun’s view, it plays on the difference between a ‘World Model’ and a ‘Word Model’ (that is, a language model), highlighting the former’s ability to understand the physical world.
  • 05:04:47 — 谢赛宁 (Saining Xie):

    Original (中文): 我从质疑JAX到理解JAX到成为JAX,经历了人生的三个stage。 I went through three stages in life: from questioning JAX to understanding JAX to becoming JAX.

    • Humorously describes his journey from skepticism to acceptance and embrace of JAX (or new paradigms in general), hinting at the challenges and opportunities of research paradigm shifts.
  • 05:05:08 — 谢赛宁 (Saining Xie):

    Original (中文): 世界需要一个世界模型。 The world needs a world model.

    • Concise and powerful expression of the necessity and importance of world models, serving as a core driving force for his entrepreneurship.
  • 05:05:51 — 谢赛宁 (Saining Xie):

    Original (中文): 这个隐形的世界是在这个硅谷的叙事逻辑下面不可见的。但我觉得这是一个很大的市场。 This invisible world is not visible under the Silicon Valley narrative logic. But I think this is a huge market.

    • Reveals the huge market potential beyond the current focus of the AI field, namely real-world problems in the physical world that are overlooked under the existing Silicon Valley narrative.
  • 05:29:23 — 谢赛宁 (Saining Xie):

    Original (中文): 我们想要build的这样一个反向的OpenAI…正向的OpenAI是说, 我现在有互联网作为我的数据的发源地, 然后我把数据download下来, train一个transformer…反向的OpenAI是说, 要做这个model本身, 这件事情没办法直接从互联网上download下来。 The ‘reverse OpenAI’ we want to build… A ‘forward OpenAI’ means I now have the internet as my data source, and I download the data to train a transformer… A ‘reverse OpenAI’ means that to build this model itself, this cannot be directly downloaded from the internet.

    • Clearly defines the core concept of his startup, contrasting it with mainstream approaches.
  • 05:31:51 — 谢赛宁 (Saining Xie):

    World model needs the world.

    • Summarizes in one sentence the philosophy that building a world model requires global collaboration.
  • 05:35:16 — 谢赛宁 (Saining Xie):

    Original (中文): 他(Yann LeCun)这件事情是不受到外界的任何事情的干扰的…但他这件事情不代表他完全是一个固执的, 听不进任何话的人…他说我完全可以被move, 但我需要基于事实来被move。 He (Yann LeCun) is not disturbed by anything external in this matter… but this doesn’t mean he is completely stubborn and won’t listen to anything… He said he can be moved, but he needs to be moved based on facts.

    • Profoundly depicts Yann LeCun’s personality as a scientist who adheres to principles yet respects facts.
  • 05:36:36 — 谢赛宁 (Saining Xie):

    Original (中文): 我作为一个科学家的正直 … cannot accept this. My integrity as a scientist … cannot accept this.

    • Quotes Yann LeCun, revealing the deep reason behind LeCun’s departure from Meta, based on scientific integrity.
  • 05:55:17 — 谢赛宁 (Saining Xie):

    Original (中文): 人类的赞歌就是勇气的赞歌。我觉得这也是我的一个对于创业的认知。 The anthem of humanity is the anthem of courage. I think this is also my understanding of entrepreneurship.

    • Expresses the core of his entrepreneurial spirit: the courage to embrace uncertainty and challenges.
  • 05:57:44 — 谢赛宁 (Saining Xie):

    Original (中文): 你一天起床要想这个问题, 吃饭的时候要想这个问题, 洗澡的时候要想这个问题, 睡觉的时候可能可以不用想, 但可能带着这个问题睡觉。 When you wake up, you think about this problem; when you eat, you think about this problem; when you shower, you think about this problem; maybe you don’t have to think about it when you sleep, but you might go to sleep with this problem on your mind.

    • Vividly describes the state of obsession and dedication he values in researchers.
  • 06:02:22 — 谢赛宁 (Saining Xie):

    Original (中文): 我们不是含着金汤匙, 我们完全没有这种感觉。我觉得我们是一个underdog。 We were not born with a silver spoon in our mouths; we don’t feel that way at all. I think we are an underdog.

    • Despite significant funding, he positions the company as an ‘underdog’ challenging mainstream paradigms, reflecting his entrepreneurial mindset.
  • 06:05:04 — 谢赛宁 (Saining Xie):

    Original (中文): 他需要有这种world understanding的能力,他需要理解世界的能力,然后他他需要能够有做prediction的能力,然后他他需要有能做planning的能力。 It needs to have the ability for world understanding, it needs the ability to understand the world, then it needs to have the ability to make predictions, and then it needs to have the ability to do planning.

    • Defines the core capabilities required for advanced AI beyond simple learning, including world understanding, prediction, and planning.
  • 06:05:55 — 谢赛宁 (Saining Xie):

    Original (中文): 这件事情让我觉得嗯,这个公司可以做,并且有很大的机会可以做成功。原因它不是把事情做小了。 This made me feel that, hmm, this company is worth building, and there’s a great chance it can succeed. The reason is that it doesn’t shrink the problem.

    • Expresses confidence in the company’s potential by focusing on solving big problems, contrasting with common trends.
  • 06:09:02 — 谢赛宁 (Saining Xie):

    Original (中文): 我看了这本书之后,我会放弃更多这种人类的自大。我觉得我觉得这种智能演进是一个连续的过程,它不是一个说,哎,人就真的是独一无二。 After reading this book, I would give up more of this human arrogance. I think this evolution of intelligence is a continuous process; it’s not that, oh, humans are truly unique.

    • Advocates for humility regarding human intelligence, viewing it as part of a continuous evolutionary process rather than a unique existence.
  • 06:13:13 — 谢赛宁 (Saining Xie):

    Original (中文): 我觉得能够打造出来一只松鼠的智能,这件事情才是难的问题。 I think being able to create the intelligence of a squirrel, that’s the difficult problem.

    • Cites Rich Sutton’s view, emphasizing the complexity of creating seemingly simple animal intelligence, challenging traditional perceptions of AI difficulty.
  • 06:20:14 — 谢赛宁 (Saining Xie):

    Original (中文): 我希望鼓励大家的事情是说,不要只关注那些我们每一个个体做不到的事情。关注一下我们现在做的很好的事情。 What I hope to encourage everyone is not to just focus on things that each of us individually cannot do. Let’s focus on what we are doing well now.

    • Encourages people to focus on existing human capabilities and strengths, rather than solely on AI’s limitations.
  • 06:32:21 — 谢赛宁 (Saining Xie):

    Original (中文): 我每天最解压的时光就是这大概五到十分钟的路。我发现这个世界比我们想象的大的多。不是所有人都关心什么叫做AI。 My most stress-relieving time every day is this roughly five to ten-minute walk. I found that this world is much bigger than we imagine. Not everyone cares about what AI is.

    • Reflects on the broader human experience beyond the AI bubble, emphasizing the diversity of concerns in the world.
  • 06:46:58 — 谢赛宁 (Saining Xie):

    Original (中文): 我只是不喜欢看到大家paper里面,开篇先拉一句话放在这,然后我觉得这件事情不符合我的审美。 I just don’t like seeing people’s papers start by pulling a quote and putting it there; I feel that doesn’t align with my aesthetic.

    • Expresses dissatisfaction with the academic practice of quoting philosophers without deep understanding, emphasizing personal aesthetics and the pursuit of depth.
  • 06:48:39 — 谢赛宁 (Saining Xie):

    Original (中文): 我还是相信人与人之间的交流这件事情很重要。 I still believe that communication between people is very important.

    • Summarizes his core personal belief that human connection and communication are crucial in both personal and professional development.

Predictions (6)

  • 01:08:16 (Long-term) — Demis Hassabis: DeepMind will eventually become a company capable of winning multiple Nobel Prizes.
  • 02:25:56 (Long-term) — guest: LLMs (Large Language Models) will wither away. They are not the cornerstone for building a universal intelligent system; they are not the foundation of the edifice of this world model.
  • 04:06:29 (Future) — 谢赛宁 (Saining Xie): In the future, Large Language Models (LLMs) will no longer be the core driving force of intelligence, but will instead degrade into a simple communication interface interacting with underlying world models.
  • 04:26:11 (1-2 years) — 谢赛宁 (Saining Xie): The competition and debates between different technological paths in the current AI field (e.g., LLMs vs. video generation) will seem ridiculous within one to two years, as all paths will ultimately converge on the goal of building world models.
  • 05:49:19 (Short-term (by March of the recording year)) — 谢赛宁 (Saining Xie): By the time this show airs, perhaps in the next three months, we will have another paper coming out, called Solaris.
  • 06:07:07 (Mid-term) — 谢赛宁 (Saining Xie): Those who are hypnotized will eventually wake up. And I think at that time, we would not rule out setting up a company in Silicon Valley.

Visual Signals (Beyond the Transcript)

Production setting: An indoor studio or loft-style office with an exposed brick wall in the background. · production: Professional

  • props: Professional microphones on stands on a light-colored wooden table, Guest’s smartwatch on his left wrist, Guest’s dark button-down shirt with a small Vivienne Westwood orb logo

Energy Shifts (15)

  • 📈 01:23:08 — Recounting his awkward job talk at FAIR where he finished too early.
    • A brief, genuine smile and a slight chuckle, showing self-awareness and humor about a past mistake. The energy becomes lighter and more personal.
  • 📈 01:45:48 — Explaining the philosophical purpose of research as a way to seek understanding and connect with others.
    • His expression becomes more earnest, and his hand gestures become more deliberate and expansive, reflecting the deeper, more abstract nature of the topic.
  • 📈 02:03:37 — Describing the qualities of his mentor, Kaiming He.
    • The speaker’s rate of speech increases, he leans forward slightly, and his hand gestures become more frequent and emphatic. He nods for emphasis and his eyes widen, conveying strong admiration and excitement for the topic.
  • 📈 02:12:00 — Explaining the non-linear, winding path of research.
    • The speaker breaks into a smile and uses a fluid, winding hand gesture to illustrate the concept of a ‘winding and twisting’ (弯弯绕绕) research journey, showing his amusement and passion for the metaphor.
  • 📈 02:51:00 — Apologizing for mixing languages
    • Xie Saining’s energy becomes more lighthearted and humorous. He smiles self-deprecatingly and uses a tech-specific metaphor (‘my post-training is a bit broken’) to describe his language skills, which creates a moment of connection and laughter.
  • 📈 03:04:00 — Explaining the concept of ‘research taste’ and its connection to philosophy.
    • His speech becomes more animated, and his hand gestures become more frequent and expansive. He leans forward slightly, showing increased engagement with the abstract topic.
  • 📉 03:05:29 — Recalling that the DiT paper was initially rejected by a conference.
    • He gives a brief, wry smile and a slight shrug, transitioning into a more matter-of-fact, resigned tone. The energy is not one of sadness, but of ironic acceptance.
  • 📈 03:24:45 — Recounting his personal experience ‘pitching’ to a collaborator at Google.
    • He leans forward slightly and his hand gestures become more frequent and animated, showing increased personal engagement with the story.
  • 📈 03:39:58 — Refuting the idea that computer vision researchers are frustrated by the rise of LLMs.
    • His posture becomes more upright, his tone (inferred from his expression) becomes more assertive, and he makes strong, direct eye contact with the host, non-verbally underlining his conviction.
  • 📈 04:03:55 — Recounting Professor Yi Ma’s passionate defense of high-dimensionality.
    • Xie Saining’s speech quickens, his gestures grow larger, he leans further forward, and his eyes brighten. He vividly imitates Professor Ma Yi’s excitement as he recounts the story, showing a strong sense of identification.
  • 📉 06:06:03 — The host asks about his personal feelings after starting his company (‘How do you feel after starting your own business?’).
    • His posture becomes more contained and his gaze shifts downward, indicating a move from intellectual explanation to personal reflection.
  • 📈 06:07:08 — Discussing the philosophical debate around AGI and Yann LeCun’s arguments.
    • He becomes more animated, using more frequent and emphatic hand gestures to explain complex concepts.
  • 📈 09:28:20 — Introducing the concept of ‘world models’ (世界模型).
    • Gestures become more expansive and definitive. His speech cadence becomes slightly more pronounced as he introduces this key concept.
  • 📈 25:40:00 — Explaining the Cambrian Explosion and the evolution of vision
    • His energy shifts to that of an enthusiastic professor. He becomes highly animated, leaning forward and using his hands to illustrate complex concepts like the evolutionary ‘arms race’ and the structure of the brain.
  • 📈 35:38:00 — Recounting the dramatic, last-minute nature of his PhD application process
    • His energy becomes high and engaging as he tells the story. He smiles, shakes his head in amusement, and uses dramatic hand gestures to re-enact key moments, like his professor’s sudden job change and his own decisive response.

Gestures Emphasizing Claims (24)

  • 01:21:06 — “Describing how he rejected OpenAI’s offer without much discussion.”
    • He holds both hands up, palms facing each other, creating a defined space between them, and then makes a quick, dismissive gesture. · The initial gesture contains the ‘offer’, and the subsequent motion visually enacts the quick rejection he is describing.
  • 01:38:00 — “Describing Dumbo, Brooklyn as ‘very artistic’”
    • He brings both hands up, palms facing each other with fingers spread. · The gesture visually ‘shapes’ or ‘frames’ the abstract concept of artistry, making his description more vivid.
  • 01:52:57 — “Explaining Yann LeCun’s ‘cake analogy’ for different types of machine learning.”
    • He uses his hands to physically layer the components: a wide, flat base for self-supervised learning (‘the cake’), a thinner layer on top for supervised learning (‘the icing’), and a final pinch with his fingers for reinforcement learning (‘the cherry’). · A direct and powerful visualization that makes a complex technical analogy immediately intuitive to a layperson.
  • 02:03:19 — “Kaiming He has the ability to meticulously analyze and extract the core points o”
    • The speaker brings his hands together in front of him and makes a delicate motion with his fingers, as if pulling fine threads apart. · The gesture provides a direct visual metaphor for the meticulous and careful process of ‘drawing silk from a cocoon,’ reinforcing the idea of extracting key insights from complex information.
  • 02:03:23 — “He can establish connections in a high-dimensional abstract space”
    • He holds his hands up, palms facing each other, and moves them apart to define a three-dimensional space in front of him. · This gesture physically carves out the ‘abstract space’ he is describing, making a highly conceptual idea more tangible for the viewer.
  • 02:39:30 — “A negative signal’s opposite direction is a positive signal”
    • He uses his right hand to point in one direction for the ‘negative signal,’ and then immediately uses his left hand to point in the opposite direction for the ‘positive signal.’ · The gesture creates a clear, physical opposition that perfectly illustrates the inverse relationship he is explaining, making the logic immediately intuitive.
  • 02:42:00 — “Explaining abstract concepts related to social interactions in gaming.”
    • Holds both hands up and open, palms facing each other, as if shaping or holding an invisible object between them. · This gesture is used to give form to an abstract idea, making it feel more tangible for the listener as he describes it.
  • 02:43:19 — “The meaning of PhD: ‘It’s a Doctor of Philosophy’.”
    • He uses his right hand to make small, distinct chopping motions in the air as he says each part of the phrase. · The gesture breaks down the term into its components, emphasizing each word to highlight the philosophical root of the degree, which is central to his point.
  • 03:04:35 — “Describing a ‘more elegant solution’ in research.”
    • Makes a smooth, sweeping gesture with his right hand, palm down. · The fluid, clean motion visually represents the concept of ‘elegance’ and ‘simplicity’ he is describing in a technical context.
  • 03:22:42 — “The funding system for academia has not increased despite inflation, making ever”
    • He holds his hands apart and parallel, then moves them upwards together. · The gesture visually represents two parallel tracks (costs and funding) where one (costs) is rising, illustrating the growing gap he is describing.
  • 03:24:06 — “A grant of $100,000 can only support one student for one year.”
    • Raises his right index finger to emphasize the number ‘one’. · A classic enumerating gesture used to add weight and specificity to the quantitative point being made.
  • 03:28:40 — “A movie’s timeline is a linear axis, but at every point on that axis, there is a”
    • He first traces a horizontal line in the air with his right hand to represent the timeline, then opens both hands to form a three-dimensional ‘box’ to represent the spatial dimension at each point. · This is a powerful visual metaphor, translating a complex spatio-temporal concept from physics and film theory into a simple, understandable hand movement.
  • 03:44:18 — “A true ‘real world’ intelligence must interact with the physical world, not just”
    • He gestures forward with both hands, pushing away from his body. · The gesture physically separates the ‘self’ (the AI model) from the ‘world out there,’ emphasizing the concept of embodied interaction with an external environment.
  • 04:03:01 — “Explaining the concept of high-dimensional representation space in AI models.”
    • Xie Saining opens his hands in front of his chest, palms facing each other, as if outlining an invisible three-dimensional space. He moves his hands to represent concepts of different dimensions and levels, such as ‘transforming’ low-dimensional vectors into high-dimensional representations. · This gesture makes the abstract idea of ‘space’ and ‘dimensionality’ tangible for the viewer, visually representing the conceptual framework he is building with his words.
  • 04:07:25 — “Describing the brain as a complex architecture with multiple components.”
    • He gently taps his temple with his finger, then opens his hands again, simulating different regions of the brain. · The gesture directly links the abstract concept of a ‘cognitive architecture’ to the physical brain, grounding the technical analogy.
  • 06:05:06 — “JEPA is not a specific method but a vast ocean (‘a very, very vast ocean’).”
    • He spreads his hands wide apart, palms up. · This gesture visually represents the concept of vastness and scale, directly mirroring his words.
  • 06:07:47 — “The number of possible visual functions is enormous.”
    • He raises his right hand with his index finger pointing up to emphasize the scale of the number he is describing. · The gesture draws attention to the specific, large number, highlighting its significance in his argument.
  • 06:09:47 — “The intelligence of a squirrel is the real hard problem.”
    • He uses his hands to form a small, contained shape. · This gesture contrasts the seemingly small and simple ‘squirrel intelligence’ with the grand, abstract problems often discussed, emphasizing that the former is the more profound challenge.
  • 09:27:21 — “A model that answers questions by truly understanding the world”
    • Holds both hands up, palms inward, as if holding or defining a spherical space between them. · Visually conceptualizes the abstract ‘model’ as a tangible object that can be examined, giving form to the idea.
  • 09:28:26 — “It will have a very, very different scaling law”
    • Makes a sharp, downward slicing motion with his right hand. · The gesture creates a strong visual metaphor for a ‘different’ or divergent path, emphasizing a break from the established norms of language models.

Authenticity Tells (16)

  • 01:21:00 — Genuine, slightly embarrassed laugh at the start of the interview.: Responding to the host’s initial comments before the formal questions begin.
    • This unscripted moment of laughter helps break the ice and establishes a relaxed, authentic rapport between the host and guest from the outset.
  • 01:23:08 — Self-deprecating smile and laugh.: When recounting how he finished his one-hour job talk at FAIR in only 30 minutes, making everyone feel awkward.
    • His ability to laugh at a past professional blunder makes him seem humble, relatable, and not overly concerned with maintaining a perfect image.
  • 01:46:27 — Looks down and to the side, pausing thoughtfully.: Before answering the question ‘Why are people so important to you?’.
    • This is a classic ‘accessing memory/thought’ cue. Instead of a canned answer, he is genuinely considering the question, which lends weight and sincerity to his subsequent response about the nature of research and human connection.
  • 02:03:30 — A brief, genuine smile and slight laugh.: After the interviewer says ‘This is very difficult,’ the speaker smiles and agrees, ‘I think it’s very, very difficult.’
    • The smile and laugh indicate a moment of genuine agreement and shared understanding with the interviewer, showing that his high praise for his mentor’s focus comes from a place of authentic experience and admiration, not just prepared talking points.
  • 02:33:18 — A self-deprecating chuckle and glance away.: When stating he hasn’t produced a truly valuable paper yet, he follows it with a slight laugh and looks down.
    • This moment of humility feels genuine. It’s a common trait among high-achievers to downplay their own successes when discussing foundational, field-defining work, and his body language here reflects that authentic self-assessment.
  • 02:43:25 — A brief, self-deprecating laugh and smile.: After quoting Kaiming He’s question about why PhDs don’t know philosophy, he laughs while saying ‘a soul-searching question’.
    • The laugh shows his genuine amusement and perhaps a hint of embarrassment, acknowledging the truth and irony in the critique. It makes the anecdote feel personal and relatable.
  • 03:05:29 — A wry smile and a slight shake of the head.: When revealing that the highly influential DiT paper was rejected by CVPR.
    • This reaction conveys a sense of ‘can you believe it?’ irony and shows he has processed the initial frustration. It’s a moment of candid reflection on the unpredictable nature of peer review.
  • 03:25:37 — A genuine, slight smile and nod.: The host humorously describes his efforts to secure academic funding as ‘the process of begging for alms’.
    • His smile shows he appreciates the host’s humorous and apt analogy, creating a moment of rapport and demonstrating a relaxed self-awareness about the difficulties of academic funding.
  • 03:39:58 — Immediate, firm headshake and direct eye contact.: In response to the direct question of whether he and his peers feel frustrated by the dominance of LLMs.
    • The speed and conviction of his non-verbal denial, which precedes his verbal explanation, strongly suggest his subsequent positive framing of the situation is his genuine belief, not a polite deflection.
  • 04:03:55 — Smiling slightly and leaning forward with increased animation while recounting an anecdote about another professor.: He is sharing a story about Professor Yi Ma’s passionate argument for high-dimensionality, a view he clearly shares and respects.
    • The shift in his demeanor from purely analytical to animated and slightly reverent shows his genuine respect for Professor Ma and his passion for the topic. It’s a moment of personal connection to the academic community, not just a dry explanation.
  • 04:57:00 — Laughing and saying ‘This is why I don’t want to do podcasts’.: When asked a deep, personal question about his earliest memories and childhood.
    • A humorous deflection that reveals a moment of genuine unpreparedness or slight discomfort with deep introspection on the spot. It makes him appear more relatable and less like a polished media figure.
  • 06:06:06 — A brief downward gaze and a slight pause before answering.: He is asked about his true feelings after becoming an entrepreneur.
    • The pause and gaze shift suggest genuine introspection and a move away from a rehearsed answer, lending authenticity to his subsequent reflections on the ups and downs of his journey.
  • 06:06:57 — A quick, firm nod while saying ‘对’ (Right).: He is agreeing with the host’s observation that his fear disappeared once he committed to his path.
    • The decisive nod reinforces his verbal agreement, suggesting this is a deeply felt and confirmed part of his experience.
  • 09:28:34 — Slight upward glance and brief pause.: Just before stating ‘我现在的直觉是这样’ (My intuition now is like this), when about to offer his own hypothesis on world model scaling.
    • The non-verbal cue suggests he is accessing his own thoughts and formulating a genuine, unscripted opinion, rather than reciting a prepared point. This enhances his credibility as an expert sharing a real-time insight.
  • 09:29:04 — A quick, slight smile and soft chuckle.: In response to the host’s question about ‘人类最高级的知识’ (the highest level of human knowledge), before gently redirecting the conversation.
    • This authentic reaction shows he acknowledges the philosophical depth of the question while skillfully avoiding a tangent. It’s a moment of spontaneous, personable interaction that builds rapport.
  • 34:48:00 — Matter-of-factly stating his undergraduate rank was not at the very top.: When asked if he was a top student in the competitive ACM class.
    • He states his rank (‘around 10th’) and that he ‘couldn’t become’ number one with a calm, direct demeanor, showing a lack of ego and a comfortable self-awareness that is free from false modesty.

Facts the Transcript Loses

  • The main visual throughout the video is a podcast cover graphic, which incorrectly identifies the guest as ‘ZHANG XIAOJUN’ instead of his actual name, Xie Saining, which is used in the audio.
  • The guest’s frequent and expressive hand gestures add a layer of dynamism and emphasis to his explanations that is entirely absent in a transcript.
  • The setting in a Brooklyn loft, with its brick wall and professional lighting, contrasts with the host’s description of the cold, snowy New York winter outside, creating a warm and intimate atmosphere for the conversation.
  • The non-verbal cues, such as Xie Saining’s moments of self-deprecating laughter and the host’s attentive nodding, establish a friendly and comfortable dynamic that encourages candid storytelling.
  • The guest’s constant, fluid hand gestures are not just for emphasis but seem integral to his thought process, as if he is physically shaping and organizing abstract ideas in the air in front of him.
  • The contrast between his calm, measured speaking style and the fast-paced, high-stakes world of AI he is describing.
  • The visual branding of the podcast (name, episode number, stylized background) is consistently present, reinforcing the identity of the show.
  • The speaker’s communication is highly kinesthetic; his hands are constantly shaping, connecting, separating, and illustrating the abstract concepts of research methodology he discusses. A transcript would miss how he physically embodies his ideas.
  • The contrast between the speaker’s dynamic, passionate delivery and the static, almost sterile graphic design of the podcast frame.
  • The consistent off-camera gaze, which implies a comfortable, in-person rapport with an unseen interviewer, making the monologue feel more like a natural conversation.
  • The contrast between the highly abstract and philosophical topics (the nature of research, the Diamond Sutra, antifragility) and the speaker’s calm, clear, and grounded delivery.
  • The consistent use of hand gestures not just for emphasis, but as a tool to visually construct and manipulate the complex ideas he is explaining.
  • The small, stylish detail of the Vivienne Westwood logo on his shirt, which contrasts with the typical academic attire and adds a layer to his personal presentation.
  • The professional branding of the podcast, with consistent graphic overlays, which frames the conversation as a formal, high-value piece of content.
  • The speaker’s constant and highly descriptive use of hand gestures to ‘sculpt’ abstract concepts like funding systems, model architectures, and spatio-temporal relationships in the air. This visual layer is a primary mode of communication for him and is completely lost in a transcript.
  • The subtle but consistent way he maintains eye contact with the off-screen host, which makes the interview feel like an intimate, focused conversation rather than a public address.
  • The calm and measured pace of his speech, which contrasts with the complexity of the topics, conveying a sense of deep expertise and confidence.
  • The physical embodiment of his ideas, such as when he acts out looking for an object in the room to explain visual reasoning, which makes abstract cognitive processes feel concrete and intuitive.
  • Throughout the segment, Xie Saining relies heavily on hand gestures to elaborate on his points, which is completely lost in plain text. His gestures are not just for emphasis but also for ‘shaping’ and ‘dividing’ the abstract conceptual space he is discussing.
  • He wears two watches: a traditional black-strapped watch on his left wrist and a smartwatch on his right, a unique personal visual detail.
  • When recounting Professor Ma Yi’s story, his face shows a mixed expression of excitement and reverence, which adds strong emotional color and persuasiveness to the technical points he is explaining.
  • The video’s background is a carefully designed brand visual, not a real scene, indicating that this is a professionally produced interview program.
  • The constant, descriptive hand gestures used by Xie Saining to illustrate complex, abstract concepts like ‘models’, ‘scaling laws’, and ‘parameters’. His hands are almost a second voice, shaping and defining his ideas visually.
  • The professional studio environment, including the graphic overlays, which establishes the context as a high-production value podcast, not an informal conversation.
  • The subtle, authentic facial expressions, such as his thoughtful pauses and slight smiles, which reveal his process of thinking and his engagement with the host’s questions.
  • The guest’s personal style detail of wearing two watches, one on each wrist.
  • The speaker’s consistent use of hand gestures to shape abstract concepts, such as forming a container with his hands when discussing the components of JEPA or spreading them wide to signify a ‘vast ocean’.
  • The visual contrast between the speaker’s intense, focused expression when discussing technical or philosophical topics and his softer, more reflective expression when discussing his personal journey and motivations.
  • The speaker wears two different watches/wristbands, one on each wrist, a distinctive personal quirk.

Named Entities

People (70): Alex Kirillov, Andrei Tarkovsky (塔可夫斯基), Aravind Srinivas, Bi Gan, Bill Freeman, Bill Peebles, Bowen, Charlie Parker, Demis Hassabis, Douglas Hofstadter, Eddy (艾迪), Fei-Fei Li, Hannah Arendt, Ilya, Ilya Sutskever, Jia Zhangke, Jose Mourinho, Jurgen Klopp, Kaiming, Kenneth Craik, Ludwig Wittgenstein, Ma Yi, Martin Scorsese, Michael Rabbat (Mike), Pascal Fung (冯), Piotr Dollár, Rich Sutton, Richard Feynman, Robert McKee, Robert Pirsig, Robin Rombach, Ross Girshick, Sam Altman, Serge Belongie, Stanisław Lem (莱姆), Steven Soderbergh (索德伯格), Tim Brooks, Yann LeCun, Zhang Xiaojun, 于老师, 何加迪, 何恺明, 何恺明 (Kaiming He), 余泳, 侯晓迪, 冯佳时, 刘壮, 刘宇昆, 叔本华, 向语 (Xiangyu), 吴宇欣 (Yuxin Wu), 孙剑, 屠卓文, 库布里克, 康德, 张涛, 晓君, 朱松纯 (Zhu Song-Chun), 李飞飞 (Fei-Fei Li), 杨立昆, 杨立昆 (Yann LeCun), 沈少爷, 涂老师 (Professor Tu), 王小龙 (Wang Xiaolong), 理查德·柯朗, 谢赛宁, 赵婷, 马丁·斯科塞斯, 马毅, 马毅 (Ma Yi)

Companies / Institutions (31): Adobe, AMI Labs, Autodesk, Bank of America (BOA), Berkeley, Build.ai, DeepMind, FAIR, FAIR (Facebook AI Research), Google, Google Chat, Google Research, Mastercard, Meta, Microsoft, Microsoft Research Asia, NEC Labs, NSF, NYU, Newlab AMI, OpenAI, Perplexity, Pika, Runway, SSI, Stability AI, Thinking Machines, UCSD, Visa, YouTube, xAI

Papers / Methods / Datasets (94): AISTATS, AlexNet, AlphaFold, Autoencoder, BERT, BMVC, C++, CLIP, COT (Chain of Thought), CPC, CVPR, Cambr, Cambr-S, Cambrian, Canbens, Computer Vision, Contrastive learning, ConvNeXt, DDPM, DSN, Deep Learning, Deeply Supervised Nets (DSN), DiT, DiT (Diffusion Transformers), Diffusion Model, Dyna, Eyes Wide Shut, Faster R-CNN, Flow Matching, Focal Loss, GAN (Generative Adversarial Network), GPT, GPT-3, Gaussian Splatting, Genie, HED, Holistic Edge Detection (HED), ICCV, Image Segmentation, ImageNet, ImageNet Challenge, JAX, JEPA (Joint Embedding Predictive Architecture), Kernel Method, LDM, LLM, LLM (Large Language Model), Language Model, Large Language Models, Large Language Models (LLM), LeNet, MAE, MAE (Masked Autoencoders), Marr Prize, Mask R-CNN, Memory Bank, Mixture of Experts, Mixture of Experts (MoE), MoCo, MoCo (Momentum Contrast), Moco, Model Predictive Control (MPC), NeRF (Neural Radiance Fields), NeurIPS, Neural Architecture Search (NAS), Neural Network, PointContrast, Pre-training, Pretext task, PyTorch, R-CNN, RE, RE (Representation Engineering), REPA, Reinforcement Learning, Reinforcement Learning (RL), Representation Learning, ResNeXt, ResNet, Scaling Law, Self-attention, Self-supervised learning, Sora, The Bitter Lesson, Thinking Space, Transformer, U-Net, VAE (Variational Autoencoder), VISTAR, ViT, ViT (Vision Transformer), Video Diffusion, World Model, World Models

Takeaways

  • Xie Saining’s growth experience and academic choices demonstrate the importance of following personal interests and intuition, even if it differs from traditional ‘successful’ paths.
  • A rich reading environment in childhood and early exposure to the internet cultivated his strong curiosity and desire for self-expression.
  • In academic and research decisions, the importance of mentors and personal connections often outweighs institutional rankings.
  • His passion for computer vision stems from a deep understanding of how humans perceive the world and a personal connection to it.
  • He firmly believes in taking initiative and pursuing what he truly wants to do, even if it means going against the current or facing initial rejections.
  • He believes that competition should foster innovation rather than excessive internal friction, and emphasizes the value of collaboration and an open mindset.
  • He put forward the idea that ‘if you don’t do this thing, this thing will never happen in this world,’ emphasizing the unique value and responsibility of individual action.
  • An excellent mentor not only provides opportunities but, more importantly, leads by example and personally guides students in research.
  • The path of scientific research is not always smooth; temporary setbacks (like paper rejections) do not negate the value of the work. Truly impactful work stands the test of time.
  • Undertaking multiple, diverse internships during a PhD helps broaden horizons and explore different directions. Even if not every internship yields results, the process itself is valuable.
  • Top talents like Kaiming He have a ‘reality distortion field’ effect: collaborating with them can elevate seemingly ordinary ideas into highly influential work.
  • Compared to chasing fleeting popular directions, focusing on eternal and fundamental problems like ‘representation learning’ is a more sustainable research strategy.
  • Research is not linear; it’s full of uncertainties and moments of inspiration. One should not focus solely on success or failure at a single point in time, but rather on long-term accumulation and integral effects.
  • Career choices should prioritize the academic environment and long-term development potential, rather than short-term material rewards.
  • The true meaning of research lies in promoting understanding, sharing knowledge, and enhancing overall human intelligence, rather than pursuing superficial ‘impact’.
  • The academic community is an organism composed of interpersonal relationships and common interests; effective collaboration and mutual inspiration are key to advancing scientific progress.
  • Top researchers often possess advanced foresight, capable of anticipating and planning future research directions, such as Yann LeCun’s strategic vision in the field of data science.
  • Defining the right problem is more important than solving the problem itself, as Fei-Fei Li defined the challenge of image classification through ImageNet, providing a platform for deep learning.
  • Self-supervised learning is key to addressing the limitations of traditional supervised learning; it enables models to learn common sense from unlabeled data through pretext (proxy) tasks, thereby acquiring more powerful representation capabilities.
  • Top researchers possess extreme focus, capable of dedicating all mental resources to a single problem.
  • True research breakthroughs are the result of ‘exploration’ rather than ‘epiphany,’ stemming from a long, non-linear process of discovery, not initial inspiration.
  • The research process is like ‘stochastic gradient descent’; the key is to find the ‘signal’ (gradient) that guides the direction, rather than clinging to the initial goal.
  • Research impact is exponential; a top-tier ‘masterpiece’ far outweighs countless mediocre papers, thus one should optimize for the ‘maximum’ value of their career works.
  • Strong baselines and engineering scaffolding are decisive factors for the upper limit of research; improvements made on weak baselines might be spurious.
  • All experimental results (including negative ones) should be treated as valuable signals, and cognition should be systematically advanced through a ‘predict-verify’ cycle.
  • Top-tier ‘research taste’ is a philosophical pursuit, requiring researchers to transcend superficial metrics and forms, and to discern the essence behind problems, just as ‘The Diamond Sutra’ states, ‘seeing that all phenomena are not phenomena’.
  • Good research is like filmmaking, a creative storytelling process, whose core lies in the ‘decisions’ made at critical moments, rather than simple linear progression.
  • Breakthrough innovation often stems from questioning mainstream consensus. For example, ConvNeXt challenged the necessity of self-attention, while DiT overturned the tradition that diffusion models must use U-Net.
  • Researchers should cultivate an ‘antifragile’ mindset, viewing setbacks like paper rejections as opportunities for learning and growth, allowing themselves to benefit from uncertainty rather than being harmed by it.
  • Elegance and simplicity are important criteria for measuring research value. A simple, scalable, and efficient solution (like DiT’s architecture) is inherently superior to a complex and bloated system.
  • True research freedom comes from bottom-up exploration, not top-down planning. Too many ‘alignment meetings’ can stifle innovation.
  • Research funding in North American academia has long stagnated, forcing researchers to possess an entrepreneurial spirit and actively seek and integrate various resources to advance frontier research.
  • True Artificial General Intelligence (AGI) must be able to interact with the physical world, which requires AI to have strong visual understanding capabilities, not just processing virtual text information. Pure language models have fundamental limitations in this regard.
  • Film theories, such as Bi Gan’s ‘long take’ and Jia Zhangke’s ‘spatiotemporal expansion,’ can provide profound philosophical inspiration for research in video understanding and world models.
  • Language model training data is not truly unsupervised data, but rather the highly structured knowledge accumulated over long periods by human civilization; its training process is more akin to ‘strong supervised learning’.
  • The essence of language is a communication tool, not a thinking tool. Its symbolic nature, while compressing information for communication, also loses a large amount of continuous, high-dimensional information about the physical world.
  • The future direction of AI development is to move from language models that process discrete symbols to visual and physical world models capable of understanding continuous, high-dimensional, noisy signals, ultimately building a ‘world model’ that can predict changes in the world.
  • Building powerful ‘world models’ is the key path to general artificial intelligence, with its core lying in learning the underlying representations of the world, rather than solely relying on language.
  • High-dimensional representation is the cornerstone of modern AI; its complexity should not be feared, as it enables the solution of more complex problems.
  • Current popular technologies like Large Language Models (LLMs) and video generation models are merely different stages or components in the process of reaching world models. In the future, the role of LLMs might be more like an ‘interface’ rather than intelligence itself.
  • Both language and pixels are ‘interfaces’ designed for human perception; true machine intelligence needs to transcend these interfaces to learn a more fundamental and abstract world representation.
  • A true world model should possess predictive capabilities, allowing for internal planning and reasoning, which enables an intelligent agent to foresee the consequences of actions and achieve higher levels of controllability and safety.
  • World models and language models differ fundamentally in their scaling laws; world models may depend more on understanding and filtering than on parameter scale.
  • The mechanism by which the human brain processes high-bandwidth sensory information and converts it into low-bandwidth behavioral patterns provides important inspiration for world models, indicating that an efficient filtering system is key.
  • The biggest challenge for world models is acquiring and processing massive amounts of real-world data, especially video and multimodal data, which also raises legal and ethical issues around data crawling and copyright.
  • AI glasses (personal assistants) and robots are two important application directions for world models; they require a deep understanding of the physical world, not just language interaction.
  • The guest chose entrepreneurship to find a platform outside of existing academia and large tech companies where he could freely define problems, conduct cutting-edge research, and advance the development of world models.
  • The current ‘arms race’ and product-driven model in the AI field lead to an overconcentration of resources on short-term commercial goals and benchmark rankings, squeezing investment in fundamental, long-term research like world models.
  • There is an ‘invisible world’ overlooked by the Silicon Valley LLM narrative, which refers to a large number of unsolved real-world problems in the physical world, representing huge market potential for world models.
  • The guest’s company is committed to building a general world model, with research breakthroughs as its core product, attracting mission-driven young researchers to jointly explore the next paradigm of AI.
  • Xie Saining and Yann LeCun’s new AI company aims to be a ‘Reverse OpenAI,’ building world models through a partner alliance rather than by scraping the public internet.
  • One of the company’s core missions is to build a platform for top young talent from academia, allowing them to break free from academic constraints and fully unleash their potential.
  • The company’s philosophy is ‘World model needs the world,’ reflecting its decentralized, globally collaborative model, with initial offices established in Paris, New York, Montreal, and Singapore.
  • Yann LeCun is not only a research leader but also a principled and versatile individual; his integrity and foresight were key to attracting Xie Saining.
  • Xie Saining believes that true AI innovation requires the courage to explore unconventional, counter-intuitive paths, and he positions his company as an ‘underdog’ challenging existing paradigms.
  • Entrepreneurship is like skiing: both require a delicate sense of balance and the courage to face the unknown and ‘point your shoulders downhill.’
  • The core criterion for team recruitment is to find individuals with extreme passion and persistence for problem-solving; this quality is more important than mere credentials.
  • AI entrepreneurship should focus on solving grand problems rather than being limited to small-scale improvements, to achieve greater breakthroughs.
  • True machine intelligence requires the ability to understand the world, predict, and plan, not just process language.
  • Human intelligence is not unique; AI development should pursue human-like intelligence with humility and recognize the complexity of animal intelligence.
  • Developing AI robots capable of performing real-world tasks (e.g., household chores) is a more fundamental challenge than writing code or exploring space.
  • The research process is full of setbacks, but difficulties can be overcome through human connection and inspiration, ultimately leading to breakthroughs.
  • AI development needs to go beyond theoretical research, delve into the real world, and understand and solve practical problems.
  • The success of generative AI models relies primarily on high-quality data curation and alignment rather than on novel model architecture alone.
  • Genuine interpersonal communication and connection are core values that permeate all aspects of personal growth, research, and entrepreneurship.