Qihong Ruan / CS336 Notes / Lecture 03

Lecture 03: Architectures, Hyperparameters

Stanford CS336 · Spring 2025 · 1:27:03


2:23 Slide: "What you implemented - simple, modern variant" with a transformer block diagram and differences listed.
3:37 Slide: "How to pick architectures?" with many research paper titles on language models.
5:42 Slide: "Architecture variations.." showing a table comparing different model architectures and trends.
7:24 Slide: "Pre-vs-post-norm, the data" with graphs comparing pre-norm and post-norm performance.
9:54 Slide: "New things - 'double' norm." showing diagrams of two different normalization architectures.
11:53 Slide: "LayerNorm vs RMSNorm" with equations for both normalization types and notable models.
13:30 Slide: "Why RMSNorm?" explaining fewer operations/parameters, with a table of flop percentages.
15:14 Slide: "RMSNorm - validation" showing a table comparing RMSNorm performance with other models.
17:24 Slide: "LayerNorm: recap" summarizing pre-norm, RMSNorm, and bias term practices.
18:39 A speaker presenting in front of a screen displaying the "New things - 'double' norm" slide.
20:38 Slide: "A few of the common activations" showing ReLU and GeLU equations, graphs, and notable models.
23:20 Slide: "Gated variants of standard FF layers" showing GeGLU and SwiGLU equations and notable models.

TL;DR
- Modern transformer architectures have converged on pre-norm, RMSNorm, and gated linear units (GLUs) for better stability and performance.
- Hyperparameter choices, especially the feed-forward dimension to model dimension ratio, show surprising consensus (4x for ReLU, 2.66x for GLU).
- Weight decay in large language models (LLMs) is used more for optimization dynamics than for controlling overfitting.
- Stability tricks, particularly those related to softmax operations, are crucial for training large models.
- Innovations like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) improve inference efficiency, while sparse attention patterns (e.g., sliding window) enable longer context windows.

Key concepts
- Pre-norm vs. Post-norm
- LayerNorm vs. RMSNorm
- Gated Linear Units (GLUs)
- Feed-forward dimension ratio
- Head dimension to model dimension ratio
- Aspect ratio (model dimension vs. number of layers)
- Vocabulary sizes
- Regularization (Dropout, Weight Decay)
- Stability tricks (Z-loss, QK norm, Logit soft-capping)
- Attention heads (GQA/MQA, Sparse/Sliding Window Attention)
- KV cache


[00:00] Lecture 3: Architectures, Hyperparameters

This lecture will delve into the "nitty-gritty details" of Language Model (LM) architecture and training, covering aspects often omitted in other courses. The goal is to learn from the collective experience of those who have trained many LLMs.

[00:46] Outline and Goals

1. Quick recap of the "standard" transformer (what you implement).
2. What do most of the large LMs have in common?
3. What are common variations to the architecture/training process?

Today's theme: The best way to learn is hands-on experience; the second best way is to try to learn from others' experience.

[01:42] Starting Point: The 'Original' Transformer

The original Transformer architecture (from "Attention Is All You Need") consists of:
- Position Embeddings: Sine and cosine functions.
- Feed-Forward Network (FFN): ReLU activation, $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$.
- Normalization: Post-LayerNorm (LayerNorm applied after the residual connection).

We will examine variations in these components, leading to the most modern Transformer variants.

[02:17] What You Implemented - Simple, Modern Variant

The Transformer variant implemented in the assignment is a modern version, not the original "vanilla" Transformer. Key differences include:
- LayerNorm: Applied in front of the block (pre-norm).
- Position Embeddings: Rotary Position Embeddings (RoPE).
- FFN Activation: Uses SwiGLU, not ReLU.
- Linear Layers: Linear layers (and LayerNorm) have no bias (constant) terms.

The question arises: Why these specific choices? We will explore these decisions based on empirical evidence from various LLMs.

[03:07] How Should We Think About Architectures?

The field of LLM architectures is rapidly evolving. In the last year alone, there have been over 19 new dense model releases, many with minor architectural tweaks. This proliferation of models (e.g., Command A, OLMo, Gemma, Qwen, Mistral, Falcon) presents a challenge but also a wealth of information. By analyzing what these models have in common and what parts vary, we can understand which architectural choices are truly important.

[04:30] What Are We Going to Cover?

1. Common architecture variations: Activations, FFN, Attention variants, Position embeddings.
2. Hyperparameters that (do or don't) matter: What is ff_dim? Do multi_head_dims always sum to model_dim? How many vocab elements?
3. Stability tricks.

[05:15] Architecture Variations

A high-level overview of various LLMs (from 2017 to 2024) reveals:
- Low consensus on many architectural choices, with the exception of pre-norm.
- Trends towards 'LLaMA-like' architectures in recent years.

[05:54] Pre-vs-Post Norm

The original Transformer used post-norm, where LayerNorm is applied after the residual connection:

$$x_{l+1} = LayerNorm(x_l + MultiHeadAtt(x_l))$$
$$x_{l+1} = LayerNorm(x_l + FFN(x_l))$$

However, very early on, researchers found that moving LayerNorm to before the block (pre-norm) led to much better results:

$$x_{l+1} = x_l + MultiHeadAtt(LayerNorm(x_l))$$
$$x_{l+1} = x_l + FFN(LayerNorm(x_l))$$

Almost all modern LMs use pre-norm, with OPT-350M being a notable exception.
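To make the difference in wiring concrete, here is a minimal pure-Python sketch of one sublayer step in each style. It is an illustration under assumptions, not the assignment code: `layer_norm` omits the learned gain and bias, and `zero_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm over one vector; learned gain/bias omitted for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_norm_step(x, sublayer):
    # Original Transformer: normalize AFTER the residual addition,
    # so the norm sits on the residual stream itself.
    return layer_norm(add(x, sublayer(x)))

def pre_norm_step(x, sublayer):
    # Modern variant: normalize only the sublayer input;
    # the residual path x -> x + ... is left untouched.
    return add(x, sublayer(layer_norm(x)))

def zero_sublayer(x):
    # Hypothetical sublayer that contributes nothing.
    return [0.0] * len(x)

x = [1.0, 2.0, 3.0, 4.0]
pre_out = pre_norm_step(x, zero_sublayer)
post_out = post_norm_step(x, zero_sublayer)
print(pre_out)   # the input passes through unchanged
print(post_out)  # the input is re-normalized even though the sublayer did nothing
```

With a sublayer that outputs zeros, pre-norm is an exact identity map while post-norm still rescales the stream, which is one way to see why pre-norm keeps a clean residual path.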

[07:13] Pre-vs-Post Norm, the Data

Early papers (e.g., Nguyen and Salazar 2019, Xiong 2020) demonstrated the benefits of pre-norm.
- Pre-norm plus stability tricks (like ScaleNorm + FixNorm) allowed models to train without warm-up, achieving comparable or better performance than post-norm with careful warm-up.
- This was observed across various tasks, including machine translation (Dev BLEU) and language modeling (Validation Loss on IWSLT and BERT).

[08:10] Pre-vs-Post Norm, Explanations?

Today, pre-norm and other LayerNorm tricks are primarily used as stability-inducing aids for training large neural networks, especially with larger learning rates.

[09:14] New Things - 'Double Norm'

A recent innovation (not present in last year's lectures) is the "double norm" approach.
- If putting LayerNorms in residual streams is bad, why not put them outside the stream?
- Models like Grok and Gemma 2 apply LayerNorms both before and after the attention and FFN blocks (i.e., in front of the residual stream and after the main block output).
- OLMo 2 uses LayerNorms only after the attention and FFN blocks (non-residual post-norm).
- This approach is argued to be even more stable and easier to train for larger models.

[11:59] LayerNorm vs RMSNorm

[12:38] Why RMSNorm?

[15:06] RMSNorm - Validation

Narang et al. (2020) showed that RMSNorm provides runtime improvements and, surprisingly, performance gains.
- Vanilla Transformer: 3.50 steps/s, final loss 1.838.
- RMSNorm: 3.68 steps/s, final loss 1.821.
- This is a win-win: faster runtime and lower loss.
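The notes above do not record the equations from the "LayerNorm vs RMSNorm" slide, so the following sketch uses the standard definitions as an assumption: LayerNorm subtracts the mean, divides by the standard deviation, then applies a gain and a bias; RMSNorm only divides by the root mean square and applies a gain. The names `gamma` and `beta` are illustrative.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Center, scale by the standard deviation, then apply gain and bias.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction and no bias: one fewer reduction, one fewer parameter vector.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

x = [1.0, -2.0, 3.0, -4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y_ln = layer_norm(x, ones, zeros)
y_rms = rms_norm(x, ones)
print(y_ln)
print(y_rms)
```

RMSNorm output has unit root mean square but is not centered, and it carries only the gain vector, which is where the saving in parameters and memory movement comes from.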

[15:58] More Generally: Dropping Bias Terms

Most modern Transformers do not have bias terms.
- Original Transformer FFN: $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$.
- Most implementations' FFN (if not gated): $FFN(x) = \sigma(xW_1)W_2$.
- Reasons: Similar to RMSNorm, it saves memory and improves optimization stability. Dropping bias terms has been empirically observed to stabilize training.

[17:12] LayerNorm: Recap

- Basically everyone does pre-norm.
  - Intuition: Keep the good parts of residual connections.
  - Observations: Nicer gradient propagation, fewer spikes.
  - Some people add a second norm outside the residual stream (not post-norm).
- Most people do RMSNorm.
  - In practice, works as well as LayerNorm.
  - But has fewer parameters to move around, which saves on wallclock time.
  - People more generally drop bias terms since the compute/param tradeoffs are not great.

[18:05] Activations

There's a "whole zoo" of activations: ReLU, GeLU, Swish, ELU, GLU, GeGLU, ReGLU, SeLU, SwiGLU, LiGLU.
- It really does matter which activation function is chosen. SwiGLU and other GLU variants consistently work well.

[20:07] A Few of the Common Activations

- ReLU: $FFN(x) = \max(0, xW_1)W_2$.
  - Notable models: Original Transformer, T5, Gopher, Chinchilla, OPT.
- GeLU: $FFN(x) = GeLU(xW_1)W_2$, where $GeLU(x) = x\Phi(x)$ and $\Phi$ is the standard Gaussian CDF.
  - Notable models: GPT-1/2/3, GPT-NeoX, BLOOM.
- SwiGLU / GeGLU (next slide).
  - Notable models: LLaMA 1/2/3, PaLM, Mistral, OLMo, most models post 2023.

[21:24] Gated Activations (*GLU)

GLUs modify the "first part" of an FF layer.
- Original FFN (with ReLU): $FFN(x) = \max(0, xW_1)W_2$.
- Instead of a linear + ReLU, augment the above with an (entrywise) linear term:
  $$\max(0, xW_1) \rightarrow \max(0, xW_1) \otimes (xV)$$
- This gives the gated variant (ReGLU):
  $$FFN_{ReGLU}(x) = (\max(0, xW_1) \otimes xV)W_2$$
- Note that we have an extra parameter (V).

[22:47] Gated Variants of Standard FF Layers

- GeGLU: $FFN_{GeGLU}(x, W_1, V, W_2) = (GeLU(xW_1) \otimes xV)W_2$.
  - Notable models: T5 v1.1, mT5, LaMDA, Phi3, Gemma 2, Gemma 3.
- SwiGLU: $FFN_{SwiGLU}(x, W_1, V, W_2) = (Swish(xW_1) \otimes xV)W_2$.
  - Notable models: LLaMA 1/2/3, PaLM, Mistral, OLMo, most models post 2023.
- Note: Gated models scale $d_{ff}$ down by a factor of 2/3, because the extra matrix $V$ would otherwise raise the FFN parameter count by 50% relative to the non-gated counterpart.
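The SwiGLU formula and the 2/3 scaling rule above can be sketched in a few lines of pure Python. This is an illustrative sketch, not an efficient implementation: matrices are lists of rows, `swish` is taken to be $z \cdot sigmoid(z)$, and `d_model = 768` is a hypothetical value chosen so that $8/3 \cdot d_{model}$ is an integer.

```python
import math

def swish(z):
    # Swish (SiLU): z * sigmoid(z). Fine for moderate z; not overflow-safe.
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    # x: vector of length n, W: n x m matrix stored as a list of rows.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, V, W2):
    # (Swish(x W1) * (x V)) W2, with * the entrywise product; no bias terms.
    gate = [swish(z) for z in matvec(x, W1)]
    linear = matvec(x, V)
    hidden = [g * l for g, l in zip(gate, linear)]
    return matvec(hidden, W2)

def ffn_params(d_model, d_ff, gated):
    # Bias-free FFN: W1 and W2, plus V when gated.
    return (3 if gated else 2) * d_model * d_ff

d_model = 768  # hypothetical width
dense_params = ffn_params(d_model, 4 * d_model, gated=False)
gated_params = ffn_params(d_model, (8 * d_model) // 3, gated=True)
print(dense_params, gated_params)
```

Three matrices at $d_{ff} = \frac{8}{3} d_{model}$ cost the same as two matrices at $d_{ff} = 4 d_{model}$, which is where the 8/3 (about 2.66x) multiplier for GLU variants comes from.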

[25:49] Do Gated Linear Units Work?

Yes, fairly consistently so.
- Shazeer (2020) showed that GLU variants consistently outperform ReLU on various tasks (e.g., CoLA, SST-2).
- FFN$_{ReGLU}$ achieved the highest average score (84.67) and accuracy (94.38) among all tested FFN variants.
- Narang et al. (2020) corroborated these findings, showing that GLU variants consistently achieve lower losses.

[27:54] Gating, Activations

- Many variations (ReLU, GeLU, GLU) across models.
- GLU isn't necessary for a good model (see GPT-3), but it's probably helpful.
- Recent outlier models like Nemotron 340B (Squared ReLU) and Falcon 2 11B (ReLU) also achieve high performance.
- But evidence points towards somewhat consistent gains from SwiGLU/GeGLU.

[28:51] Serial vs Parallel Layers

Normal Transformer blocks are serial: they compute attention, then the MLP.
- Input comes in, attention is computed, result is passed to MLP, MLP is computed, result is passed forward.
- This serial nature can limit parallelism across GPUs.

[29:40] Parallel Layers

A few models (GPT-J, PaLM, GPT-NeoX) use parallel layers, which originated in GPT-J.
- Parallel layers: Instead of computing attention and then the MLP, both are computed from the same normalized input, and their outputs are added to the residual stream together.
- Standard (serial) formulation: $h = x + Attention(LayerNorm(x))$, then $y = h + MLP(LayerNorm(h))$. The MLP sees the output of attention.
- Parallel formulation: $y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))$. The MLP and attention both see only $x$.
- The parallel formulation can result in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused.
- If implemented right, LayerNorm can be shared, and matrix multiplies can be fused for systems efficiency.
- Recent models: Cohere Command A, Falcon 2 11B, Command R.
- However, most models since then have reverted to serial layers, with only a few exceptions.
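A small sketch may help show that the serial and parallel blocks are genuinely different functions. It assumes a pre-norm block with a gain-free `layer_norm`; `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real sublayers, chosen only so that the two wirings give visibly different outputs.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(*vectors):
    return [sum(vals) for vals in zip(*vectors)]

def serial_block(x, attn, mlp):
    # Attention first; the MLP then reads the updated residual stream.
    h = add(x, attn(layer_norm(x)))
    return add(h, mlp(layer_norm(h)))

def parallel_block(x, attn, mlp):
    # One shared LayerNorm; both sublayers read the same input.
    n = layer_norm(x)
    return add(x, attn(n), mlp(n))

def toy_attn(v):
    # Hypothetical nonlinear sublayer.
    return [u * u for u in v]

def toy_mlp(v):
    # Hypothetical linear sublayer.
    return [3.0 * u for u in v]

x = [1.0, 2.0, 3.0, 4.0]
y_serial = serial_block(x, toy_attn, toy_mlp)
y_parallel = parallel_block(x, toy_attn, toy_mlp)
print(y_serial)
print(y_parallel)
```

In the parallel block the two sublayers have no data dependency on each other, which is what lets their input projections be fused into one larger matrix multiply.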

[30:43] Summary: Architectures

- Pre-vs-Post norm: Everyone does pre-norm (except OPT-350M), likely with good reason.
- Layer vs RMSNorm: RMSNorm has clear compute wins, sometimes even performance wins.
- Gating: GLUs seem generally better, though differences are small.
- Serial vs parallel layers: No extremely serious ablations, but has a compute win.

[32:34] Many Variations in Position Embeddings

[33:29] RoPE: Rotary Position Embeddings

[34:41] RoPE: Rotary Position Embeddings - How can we solve this problem?

[36:16] RoPE: Rotary Position Embeddings - There are many rotations, which one do you pick?

[37:24] The Actual RoPE Math

[37:52] Implementation and Code for RoPE
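The notes do not record the code shown in this part of the lecture, so what follows is only a minimal sketch of the standard RoPE formulation, stated as an assumption: consecutive coordinate pairs $(x_{2i}, x_{2i+1})$ are rotated by the angle $pos \cdot base^{-2i/d}$ with `base = 10000`. Some implementations pair coordinate $i$ with $i + d/2$ instead; the function name `rope` and the example vectors are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 13), rope(k, 10))
print(score_near, score_far)
```

The two attention scores agree because rotating the query by position $m$ and the key by position $n$ leaves an inner product that depends only on $m - n$, which is the relative-position property RoPE is designed to have.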


[39:03] Hyperparameters

Transformer hyperparameter questions you might have had in 224n:
- How much bigger should the feedforward size be compared to hidden size?
- How many heads, and should num_heads always divide hidden size?
- What should my vocab size be?
- Do people even regularize these LMs?
- How do people scale these models - very deep or very wide?

[40:03] Surprising (?) Consensus Hyperparameter 1 - Feedforward

[41:00] Exception #1 - GLU Variants

[42:00] Exception #2 - T5

[43:00] Why This Range of Multipliers?

[44:00] What Can We Learn from the Model-Dim Hyperparam?

[45:00] Surprising (?) Consensus Hyperparameter 2 - Multi-Head Self-Attention

[46:00] How Many Heads, What's the Model Dim?

[47:00] Evidence for 1-1 Ratio?

[47:00] Aspect Ratios

[48:00] Considerations About Aspect Ratio

[49:00] What Are Typical Vocabulary Sizes?

[50:00] Dropout and Other Regularization

[51:00] Dropout and Weight Decay in Practice

[52:00] Why Weight Decay LLMs?

[53:00] Summary: Hyperparameters

- Feedforward: Factor-of-4 rule of thumb (8/3 for GLUs) is standard (with some evidence).
- Head dim: Head dim * Num head = D model is standard.
- Aspect ratio: Wide range of 'good' values (100-200). Systems concerns dictate the value.
- Regularization: You still 'regularize' LMs, but the effects are primarily on optimization dynamics.
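The rules of thumb in this summary reduce to three ratios that are easy to check for any configuration. The sketch below computes them; the helper name `shape_ratios` and the example configuration (a LLaMA-like shape with a gated FFN) are illustrative assumptions, not values taken from the lecture.

```python
def shape_ratios(d_model, n_layers, n_heads, d_head, d_ff, gated):
    """Return the ratios discussed in the hyperparameter summary."""
    return {
        # Expected near 4 for non-gated FFNs, near 8/3 for GLU variants.
        "ff_ratio": d_ff / d_model,
        "ff_target": 8 / 3 if gated else 4.0,
        # Expected to be 1 when n_heads * d_head == d_model.
        "head_ratio": n_heads * d_head / d_model,
        # Width over depth; the summary cites roughly 100-200 as typical.
        "aspect_ratio": d_model / n_layers,
    }

# Hypothetical LLaMA-like configuration, used only as an example.
ratios = shape_ratios(
    d_model=4096, n_layers=32, n_heads=32, d_head=128, d_ff=11008, gated=True
)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

In this example `d_ff` is slightly above $\frac{8}{3} d_{model}$, which is common when the hidden size is rounded up to a hardware-friendly multiple.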


[54:00] Stability Tricks

Recently, there has been a lot of attention on stable training.
- As models get bigger and are trained longer, stability issues become more prominent.
- A common problem is exploding gradients, which leads to unstable training and divergence.
- The goal is to turn an unstable training curve (like the blue one with high gradient spikes) into a stable one (like the orange one with low gradient norms).

[55:00] Where Do the Issues Arise? Beware of Softmaxes!

[56:00] Output Softmax Stability - The 'Z-loss'

[57:00] Attention Softmax Stability - The 'QK Norm'

[58:00] Logit Soft-Capping
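The notes do not capture the formulas for these softmax stability tricks, so this sketch relies on their commonly published forms, stated as assumptions: the z-loss adds a penalty of $coeff \cdot \log^2 Z$ to the cross-entropy, where $Z$ is the softmax normalizer (a coefficient of $10^{-4}$ is the value reported for PaLM), and logit soft-capping maps each logit to $cap \cdot \tanh(logit / cap)$. The cap of 30 in the example is a hypothetical choice. QK norm is not shown here.

```python
import math

def log_sum_exp(logits):
    # Numerically stable log of the softmax normalizer Z.
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))

def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    # Returns (total, cross_entropy, z_penalty) for one example.
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_penalty = z_coeff * log_z ** 2  # pushes log Z toward 0, i.e. Z toward 1
    return ce + z_penalty, ce, z_penalty

def soft_cap(logits, cap):
    # Smoothly squashes logits into [-cap, cap]; near-identity for small values.
    return [cap * math.tanh(l / cap) for l in logits]

base = [math.log(0.25)] * 4         # already normalized: Z = 1
shifted = [l + 10.0 for l in base]  # same softmax, but log Z = 10

total_a, ce_a, z_a = cross_entropy_with_z_loss(base, target=0)
total_b, ce_b, z_b = cross_entropy_with_z_loss(shifted, target=0)
print(ce_a, z_a)
print(ce_b, z_b)

capped = soft_cap([-1000.0, 0.0, 0.1, 1000.0], cap=30.0)
print(capped)
```

Shifting every logit by a constant leaves the cross-entropy unchanged but increases the z-loss penalty, which shows that the penalty targets drift in the normalizer rather than the predictions themselves.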


[59:00] Attention Heads

[1:00:00] GQA/MQA - Reducing Attention Head Cost

[1:02:00] GQA/MQA - Reducing Attention Head Cost (cont.)

[1:03:00] MQA - Just Have Fewer Key Dimensions.

[1:04:00] Recent Extension - GQA
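One way to see why MQA and GQA reduce inference cost is to count the bytes held in the KV cache. The arithmetic below is a sketch under stated assumptions: the cache stores one key and one value vector per layer, per KV head, per token; elements take 2 bytes (16-bit floats); and the model shape (32 layers, 32 query heads, head dimension 128, 4096-token context) is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Hypothetical model shape, for illustration only.
n_layers, n_query_heads, d_head, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(n_layers, n_query_heads, d_head, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(n_layers, 8, d_head, seq_len)              # 8 KV heads, each shared by 4 query heads
mqa = kv_cache_bytes(n_layers, 1, d_head, seq_len)              # a single KV head shared by all query heads

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

The cache size scales linearly with the number of KV heads, so GQA offers a tunable point between full multi-head attention and MQA.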

[1:04:00] Does MQA Hurt? Sometimes.

[1:04:00] Sparse / Sliding Window Attention

[1:04:00] Sliding Window Attention

[1:04:00] Current Standard Trick - Interleave 'Full' and 'LR' Attention


[1:04:00] Recap, Conclusion, etc.


Practical Takeaways
- When building LLMs, prioritize pre-norm for stability.
- RMSNorm is generally preferred over LayerNorm for efficiency without sacrificing performance.
- GLU variants (GeGLU, SwiGLU) are the current state-of-the-art for FFN activations.
- Follow established hyperparameter ratios (e.g., $d_{ff} = 4d_{model}$ or $\frac{8}{3} d_{model}$ for GLUs, $d_{model} / (h \cdot d_h) \approx 1$).
- Weight decay is crucial for optimizing LLMs, even if not for traditional overfitting control.
- Implement stability tricks, especially for softmax operations (Z-loss, QK norm, logit soft-capping).
- Consider MQA/GQA for inference efficiency and sparse attention patterns for longer context windows.

Open Questions / Things to Remember
- The exact theoretical reasons for the superior stability of pre-norm and RMSNorm are still being actively researched, but empirical evidence is strong.
- The interaction between weight decay and learning rate schedules is complex and crucial for optimal training.
- The field is still evolving, with new architectural tweaks and stability tricks emerging regularly (e.g., double norm, interleaved attention patterns).
- System-level considerations (memory movement, parallelism constraints) increasingly influence architectural choices.