Stanford CS336 — Language Modeling from Scratch
Comprehensive lecture notes for the Spring 2025 edition of Stanford's CS336. Each set of notes is generated from the entire lecture video — the slides, whiteboard math, and live code — combined with the official transcript. Covers tokenization, architectures, MoE, GPUs & kernels, parallelism, scaling laws, inference, evaluation, data, and alignment (SFT/RLHF/RL).
🧠
Algorithm Deep Dives
Beyond the slides — the trickiest algorithmic questions across the course (BPE, attention, MoE routing, FlashAttention, parallelism, scaling laws, KV-cache, PPO/DPO/GRPO), worked through with full derivations and intuition.
🛠️
Assignments — Do the Work
The lectures are the easy part; the real learning is in the five implementation-heavy assignments. Each has a study guide covering what you build, what makes it hard, and how to approach it: guidance, not solutions.
LECTURE 01 · 1:18:59
Overview and Tokenization
CS336: Language Modeling from Scratch aims to teach students how to build language models from the ground up, emphasizing a deep understanding of the un…
LECTURE 02 · 1:19:22
PyTorch, Resource Accounting
This lecture focuses on building language models from scratch using PyTorch, emphasizing efficiency and resource accounting (memory and compute).
LECTURE 03 · 1:27:03
Architectures, Hyperparameters
Modern transformer architectures have converged on pre-norm, RMSNorm, and gated linear units (GLUs) for better stability and performance.
LECTURE 05 · 1:14:21
GPUs
GPUs are massively parallel processors optimized for throughput, not latency, by having many simple compute units (SMs) orchestrated by minimal contro…
LECTURE 08 · 1:15:10
Parallelism 2
Data Transfer Bottleneck: The primary challenge in distributed training is minimizing data transfer bottlenecks, as moving data between different memo…
LECTURE 10 · 1:22:52
Inference
Inference is a critical part of language model deployment, distinct from training due to its memory-limited and dynamic nature.
LECTURE 11 · 1:18:13
Scaling Laws 2
Scaling Laws in Practice: Modern large language model (LLM) builders use scaling laws as a core part of their design process, but the details of these…
LECTURE 12 · 1:20:48
Evaluation
Evaluation is a complex topic that goes beyond simple metrics, influencing how language models are built and used.