Qihong Ruan / CS336 Notes / Lecture 01

Lecture 01: Overview and Tokenization

Stanford CS336 · Spring 2025 · 1:18:59

TL;DR

Key Concepts


[0:00] Introduction to CS336: Language Models From Scratch

The course CS336: Language Models From Scratch is being taught for the second time. The lectures are made available on YouTube to allow global access to the content.

[0:05] Course Staff Introductions

[1:42] Why We Made This Course

2:23 A code snippet defining a response variable with a long string value.

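The motivation for creating this course stems from a perceived crisis in the field of AI research: researchers are becoming disconnected from the underlying technology, having moved from implementing and training their own models, to fine-tuning downloaded models, to simply prompting proprietary ones.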

This shift isn't inherently bad, as layers of abstraction boost productivity. However, these abstractions are "leaky," meaning that a deep understanding of the underlying mechanisms is crucial for fundamental research and pushing the boundaries of the field.

[3:59] The Industrialization of Language Models

3:58 Text explaining the course's focus on understanding language models via building them.

A significant challenge in building language models is their industrial scale:

  • GPT-4: Rumored to have 1.8 trillion parameters and to have cost $100 million to train.
  • xAI: Building clusters with 200,000 H100 GPUs to train Grok.
  • Stargate (OpenAI, Nvidia, Oracle): Supposedly a $500 billion investment over 4 years.

Furthermore, there is a lack of public detail on how these frontier models are built. The GPT-4 technical report explicitly states that, given the competitive landscape and the safety implications of large-scale models, it contains no further details about the architecture, hardware, training compute, dataset construction, or training method.

[4:54] "More is Different"

5:16 Text discussing the limitations of building small language models for research.

Frontier models are often out of reach for individual researchers. The small language models built in this class (e.g., <1B parameters) might not be representative of large models.

[7:00] What Can We Learn That Transfers to Frontier Models?

6:47 A graph showing model accuracy across various tasks as a function of model scale.

There are three types of knowledge:

  1. Mechanics: How things work (e.g., what a Transformer is, how model parallelism leverages GPUs). This can be taught effectively.
  2. Mindset: Squeezing the most out of hardware and taking scaling seriously (scaling laws). This is crucial, as the scaling mindset pioneered by OpenAI led to the current generation of AI models.
  3. Intuitions: Which data and modeling decisions yield good accuracy. This can only be partially taught, as intuitions developed at small scales may not transfer to large scales.

[8:44] Intuitions?

8:39 Text highlighting the teachable aspects (mechanics, mindset) versus intuitions in ML.

Some design decisions cannot (yet) be justified from first principles and simply come from experimentation. For example, the SwiGLU activation function (Shazeer 2020) was adopted because of its empirical success, with the paper candidly stating, "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." This highlights the limits of current understanding.
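
For readers who have not seen it, the SwiGLU feed-forward layer from Shazeer 2020 computes (Swish(xW) ⊙ xV) W2, with no bias terms. The following is a minimal pure-Python sketch of that formula, not the course's implementation; the names `swish`, `matvec`, `swiglu_ffn`, and the tiny weight matrices are hypothetical and chosen only for illustration.

```python
import math


def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward layer: (Swish(W_gate x) * (W_up x)) projected by W_down."""
    gate = [swish(g) for g in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise product
    return matvec(w_down, hidden)                  # project back down


# Illustrative (made-up) weights: 2-dim input, 2-dim hidden, 1-dim output.
x = [1.0, 2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0]]
w_down = [[1.0, 1.0]]
output = swiglu_ffn(x, w_gate, w_up, w_down)
```

The gating branch is what distinguishes this from a plain two-layer MLP: one projection of the input modulates the other elementwise.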

[9:33] The Bitter Lesson

10:27 Text explaining the "bitter lesson" of accuracy, efficiency, and resources in ML.

Accuracy is a product of efficiency and resources ($\text{accuracy} = \text{efficiency} \times \text{resources}$). Efficiency is even more important at larger scales because one cannot afford to be wasteful.
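
Read literally (as a rough heuristic rather than a measured law), the relation says that for a fixed accuracy target, any gain in efficiency reduces the resources required by the same factor. As an illustration with a hypothetical 2× efficiency gain:

$$
\text{resources}_{\text{new}} = \frac{\text{accuracy}}{2 \cdot \text{efficiency}} = \frac{1}{2}\,\text{resources}_{\text{old}}
$$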

[11:17] Current Landscape (History)

11:50 Text defining the framing of maximizing efficiency given compute and data budgets.
13:43 A list of neural ingredients from the 2010s, highlighting sequence-to-sequence modeling.
15:13 A list of early foundation models, including OpenAI's GPT-2 and Google's PaLM.
17:22 A list of today's frontier models, including OpenAI's o3 and Google's Gemini 2.5.

[18:04] What is this program?

18:22 Text defining an "executable lecture" and its benefits, including viewing and running code.

The lecture itself is an executable program, allowing for:

  • Viewing and running code (since everything is code).
  • Stepping through code and inspecting variables.
  • Seeing the hierarchical structure of the lecture.
  • Jumping to definitions and concepts.

def what_is_this_program():
    """
    This is an executable lecture, a program whose execution delivers the content of a lecture.
    Executable lectures make it possible to:
    - view and run code (since everything is code!),
    - step through code and inspect variables (see the loop below),
    - see the hierarchical structure of the lecture, and
    - jump to definitions and concepts (e.g., supervised_finetuning).
    """
    # The "@inspect" comments mark variables to display while stepping through.
    total = 0
    for x in [1, 2, 3]:  # @inspect x
        total += x  # @inspect total
    return total
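
The `# @inspect` comments appear to mark variables whose values are shown while stepping through the lecture. The course's actual tooling is not shown in these notes; as a rough sketch of the idea only, assuming Python's standard `sys.settrace` hook and the hypothetical names `trace_variables` and `running_sum`, watched local variables could be recorded line by line like this:

```python
import sys


def trace_variables(func, names):
    """Run func() and snapshot the watched local variables at each executed line."""
    history = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target_code:
            return None  # ignore frames belonging to other functions
        if event in ("line", "return"):
            # Record only the watched names that exist so far.
            snapshot = {n: frame.f_locals[n] for n in names if n in frame.f_locals}
            history.append(snapshot)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func()
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, history


def running_sum():
    total = 0
    for x in [1, 2, 3]:
        total += x
    return total


result, history = trace_variables(running_sum, ["total", "x"])
for snapshot in history:
    print(snapshot)
```

Each snapshot reflects the state just before a line runs (or at return), which is the same kind of view a debugger's step-through gives.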

# All information online: https://stanford-cs336.github.io/spring2025/

[19:12] Course Logistics

19:01 A code snippet demonstrating an executable lecture with inspect annotations.

[20:59] Why You Should Take This Course

[21:32] Why You Should Not Take This Course

[22:45] How You Can Follow Along at Home

[23:21] Assignments

[25:52] Cluster

[26:47] It's All About Efficiency

[27:14] Design Decisions

The course is organized into five units, each focusing on a set of design decisions:

  1. Basics: Tokenization, architecture, loss function, optimizer, learning rate.
  2. Systems: Kernels, parallelism, quantization, activation checkpointing, CPU offloading, inference.
  3. Scaling Laws: Scaling sequence, model complexity, loss metric, parametric form.
  4. Data: Evaluation, curation, transformation, filtering, deduplication, mixing.
  5. Alignment: Supervised fine-tuning, reinforcement learning, preference data, verifiers, synthetic data.

[27:44] Overview of the Course Units

[27:44] Basics

[27:55] Tokenization

[29:06] Architecture

[31:35] Training

[32:25] Assignment 1

[33:52] Systems

[34:12] Kernels

[36:09] Parallelism

[37:11] Inference

[39:50] Assignment 2

[40:51] Scaling Laws

[43:00] Assignment 3

[44:50] Data

[45:32] Composition of the Pile by Category

[45:56] Evaluation

[46:45] Data Curation

[47:26] Look at Web Data

[48:29] Data Processing

[50:00] Assignment 4

[50:21] Alignment

[51:44] Two Phases

  1. Supervised Fine-tuning (SFT):

    • Instruction data: (prompt, response) pairs (e.g., ChatExample).
    • Intuition: The base model already has the skills; only a few examples are needed to surface them (Zhou+ 2023).
    • Learning: Fine-tune the model to maximize $P(\text{response} \mid \text{prompt})$.
    • SFT is relatively simple and effective for initial alignment.
  2. Learning from Feedback:

    • Goal: Make it better without expensive annotation.
    • Preference data: Generate multiple responses to a given prompt with the model (e.g., A and B); a user then states a preference (e.g., A < B or A > B).
    • Verifiers:
      • Formal verifiers (e.g., for code, math).
      • Learned verifiers: Train an LM as-a-judge (RLHF).
    • Algorithms:
      • Proximal Policy Optimization (PPO): From reinforcement learning (Schulman+ 2017, Ouyang+ 2022).
      • Direct Preference Optimization (DPO): For preference data; simpler (Rafailov+ 2023).
      • Group Relative Policy Optimization (GRPO): Removes the value function (Shao+ 2024).
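
The objectives named in the two phases above can be sketched numerically. This is a simplified illustration, not the course's code: it assumes the log-probabilities have already been computed by some model, the function names are hypothetical, `beta=0.1` is an arbitrary illustrative value, and the use of the population standard deviation plus a small `eps` in the GRPO normalization is an assumption about implementation detail.

```python
import math
from statistics import mean, pstdev


def sft_loss(response_token_logprobs):
    """SFT: average negative log-likelihood of response tokens given the prompt."""
    return -mean(response_token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities.

    Computes -log(sigmoid(margin)), where margin compares how much the policy
    (relative to the reference model) favors the chosen over the rejected response.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    z = -margin
    # Numerically stable softplus(z) = log(1 + exp(z)) = -log(sigmoid(margin)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def grpo_advantages(rewards, eps=1e-8):
    """GRPO: normalize rewards within a group of responses to the same prompt.

    The group mean serves as the baseline, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative (made-up) numbers.
loss_sft = sft_loss([math.log(0.5), math.log(0.25)])
loss_dpo = dpo_loss(policy_chosen=-1.0, policy_rejected=-5.0,
                    ref_chosen=-2.0, ref_rejected=-2.0)
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # e.g., verifier pass/fail rewards
```

When the policy and reference agree exactly, the DPO margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred response.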

[55:04] Assignment 5

[55:59] Efficiency Drives Design Decisions (Recap)

[58:09] Tomorrow: Data-Constrained Regime

[59:01] Tokenization (Deep Dive)

This unit is inspired by Andrej Karpathy's video on tokenization.

[59:50] Intro to Tokenization

[1:01:25] Tokenization Examples

[1:05:17] Character-Based Tokenization

[1:07:15] Byte-Based Tokenization

[1:09:16] Word-Based Tokenization

[1:11:04] Byte Pair Encoding (BPE)

[1:12:44] Training the Tokenizer

[1:16:11] Using the New Tokenizer
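
These notes do not record the lecture's code for this part, so the following is only a minimal sketch of byte-level BPE as it is commonly described: start from UTF-8 bytes, repeatedly merge the most frequent adjacent pair into a new token, then reuse the learned merges to encode new text. The function names, the tie-breaking rule (first pair seen wins), and the example string are assumptions made for illustration.

```python
from collections import Counter


def merge(indices, pair, new_index):
    """Replace every occurrence of `pair` in `indices` with `new_index`."""
    merged = []
    i = 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            merged.append(new_index)
            i += 2
        else:
            merged.append(indices[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn `num_merges` merges from `text`, starting from the 256 byte values."""
    indices = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}  # token index -> bytes
    merges = {}                                  # (index, index) -> merged index
    for step in range(num_merges):
        counts = Counter(zip(indices, indices[1:]))
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent pair; ties go to the first seen
        new_index = 256 + step
        merges[pair] = new_index
        vocab[new_index] = vocab[pair[0]] + vocab[pair[1]]
        indices = merge(indices, pair, new_index)
    return merges, vocab


def encode(text, merges):
    """Convert text to token indices by applying merges in the order they were learned."""
    indices = list(text.encode("utf-8"))
    for pair, new_index in merges.items():
        indices = merge(indices, pair, new_index)
    return indices


def decode(indices, vocab):
    """Convert token indices back to text."""
    return b"".join(vocab[i] for i in indices).decode("utf-8")


text = "the cat in the hat"
merges, vocab = train_bpe(text, num_merges=3)
tokens = encode(text, merges)
assert decode(tokens, vocab) == text  # encoding round-trips
```

Applying every merge over the whole sequence at encode time is simple but slow; it is written this way for clarity rather than efficiency.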

[1:17:47] Summary of Tokenization


Practical Takeaways

Open Questions / Things to Remember