Stanford CS336 — Language Modeling from Scratch

Comprehensive lecture notes for the Spring 2025 edition of Stanford's CS336. Each set of notes is generated from the entire lecture video — the slides, whiteboard math, and live code — combined with the official transcript. Covers tokenization, architectures, MoE, GPUs & kernels, parallelism, scaling laws, inference, evaluation, data, and alignment (SFT/RLHF/RL).

17/17 lectures · Course playlist ↗ · Course site ↗

Algorithm Deep Dives

Beyond the slides — the trickiest algorithmic questions across the course (BPE, attention, MoE routing, FlashAttention, parallelism, scaling laws, KV-cache, PPO/DPO/GRPO), worked through with full derivations and intuition.

Assignments — Do the Work

The lectures are the easy part; the real learning is in the 5 implementation-heavy assignments. Each has a study guide covering what you build, what makes it hard, and how to approach it: guidance, not solutions.

LECTURE 01 · 1:18:59

Overview and Tokenization

CS336: Language Modeling from Scratch aims to teach students how to build language models from the ground up, emphasizing a deep understanding of the un…

LECTURE 02 · 1:19:22

PyTorch, Resource Accounting

This lecture focuses on building language models from scratch using PyTorch, emphasizing efficiency and resource accounting (memory and compute).

LECTURE 03 · 1:27:03

Architectures, Hyperparameters

Modern transformer architectures have converged on pre-norm, RMSNorm, and gated linear units (GLUs) for better stability and performance.

LECTURE 04 · 1:22:04

Mixture of Experts

LECTURE 05 · 1:14:21

GPUs

GPUs are massively parallel processors optimized for throughput, not latency, by having many simple compute units (SMs) orchestrated by minimal contro…

LECTURE 06 · 1:20:22

Kernels, Triton

LECTURE 07 · 1:24:42

Parallelism 1

LECTURE 08 · 1:15:10

Parallelism 2

Data Transfer Bottleneck: The primary challenge in distributed training is minimizing data transfer bottlenecks, as moving data between different memo…

LECTURE 09 · 1:05:18

Scaling Laws 1

LECTURE 10 · 1:22:52

Inference

Inference is a critical part of language model deployment, distinct from training due to its memory-limited and dynamic nature.

LECTURE 11 · 1:18:13

Scaling Laws 2

Scaling Laws in Practice: Modern large language model (LLM) builders use scaling laws as a core part of their design process, but the details of these…

LECTURE 12 · 1:20:48

Evaluation

Evaluation is a complex topic that goes beyond simple metrics, influencing how language models are built and used.

LECTURE 13 · 1:19:06

Data 1

LECTURE 14 · 1:19:12

Data 2

A deep dive into data processing.

LECTURE 15 · 1:14:51

Alignment — SFT/RLHF

LECTURE 16 · 1:20:32

Alignment — RL 1

RLHF Limitations: Overoptimization and mode collapse are significant problems in RLHF, often stemming from the noisiness and complexity of human prefe…

LECTURE 17 · 1:16:09

Alignment — RL 2

Reinforcement Learning (RL) is key to surpassing human abilities in Language Models (LMs).