Qihong Ruan / CS336 Notes / Lecture 09

Lecture 09: Scaling Laws 1

Stanford CS336 · Spring 2025 · 1:05:18

TL;DR
  • Scaling laws describe how model performance (loss) changes with resources (data, model size, compute) in a predictable fashion, often as a straight line on a log-log plot.
  • Historically, scaling laws were studied in statistical learning theory (generalization bounds) and early NLP (data scaling for task performance).
  • Modern scaling laws for LLMs show power-law relationships between test loss and compute, dataset size, or parameters.
  • These laws enable efficient engineering decisions: predicting optimal hyperparameters, architectures, and resource allocation (e.g., data vs. model size) without expensive large-scale training.
  • Key insights include the existence of distinct scaling regions (small data, power-law, irreducible error) and the importance of accounting for factors like data composition, repetition, and specific parameter types (e.g., embeddings).
  • While powerful for pre-training objectives (like perplexity), scaling law predictability can be less reliable for downstream tasks.

Key Concepts * Scaling Laws * Data Scaling * Model Scaling * Compute Scaling * Power-Law Relationships * Log-Log Plots * Generalization Error * Irreducible Error * Small Data Region * Power-Law Region * Data Composition * Data Repetition * IsoFLOPs * Optimal Training Budget * Critical Batch Size * Scale-Aware Initialization (muP) * Downstream Scaling


[0:00] Introduction to Scaling Laws

The lecture begins by setting a scenario: imagine you have access to 100,000 H100 GPUs for a month and need to build the best open-source Large Language Model (LLM). We've already covered infrastructure, distributed training, pre-training datasets, and architectures. The question then becomes: how do you make optimal decisions regarding architecture, hyperparameters, and resource allocation to push the frontiers of model performance, rather than just copying existing models?

This is where scaling laws come in. * Goal: Build simple, predictive "laws" for the behavior of language models. * Approach: Train small models, learn from their behavior, and extrapolate to larger ones. * Old (unpleasant) way: Tune hyperparameters directly on big models, which is computationally expensive. * New (optimistic) way: Tune on small models, extrapolate to large ones, saving significant compute.

The lecture will cover: 1. History and background of scaling laws: Contextualizing their origins beyond recent "AGI" hype. 2. Neural (LLM) scaling behaviors: Empirical results and practical applications.


[0:05] Part 1. Scaling Laws: History and Background

The speaker emphasizes that scaling laws are more grounded than often portrayed, with a rich history.

[0:05] Data Scaling as Empirical Sample Complexities

From a statistical machine learning perspective, scaling laws describe how model behavior changes with increased data or model size.

[0:05] Earliest (Data) Scaling Law Paper - 1993

The earliest paper that resembles modern scaling law analysis is "Learning Curves: Asymptotic Values and Rate of Convergence" from 1993 by Corinna Cortes, L.D. Jackel, Sara A. Solla, Vladimir Vapnik, and John S. Denker (Bell Labs).

[0:05] Early History of Scaling Laws - Data Scaling

Other early works also explored data scaling. * Log-linear scaling with data (Banko and Brill, 2001): Studied how NLP system performance scales with data size. * Observation: Log-linear relationship between data size (x-axis, log scale) and test accuracy (y-axis). * Conclusion: Dramatic performance improvements can be achieved by scaling data. This led to the argument that collecting more data might be more effective than complex algorithm development, a sentiment echoed in modern pre-training.

[0:05] Early Tests of Functional Forms

Even in the early 2010s, researchers were investigating the best functional forms for these scaling relationships. * Kolachina et al. (2012): Tested various functional forms (e.g., exponential, power-law) to predict model behavior based on training data size. * Key Relationship: Model capabilities (y-axis) as a function of data (x-axis).

[0:05] Earliest Large-Scale Neural Scaling Work: Hestness et al. 2017

This paper from Baidu (Hestness et al., 2017) is often cited as the earliest large-scale neural scaling work. * Observation: For tasks like machine translation, speech, and vision, error rates fall as a power law with increasing data. * Generalization Error Curve: The paper proposes a conceptual curve for generalization error vs. training data size, identifying three regions: 1. Small Data Region: Initial phase where performance is close to random guessing ("Best Guess Error"). 2. Power-Law Region: Predictable scaling where error decreases polynomially with data ("Power-Law"). 3. Irreducible Error Region: Asymptotic phase where error approaches the fundamental limit of the task ("Irreducible Error").

[0:05] Hestness II: Ahead of its Time

The Hestness et al. (2017) paper also foreshadowed many "new" phenomena discussed in recent years: * "Emergence": Noted that small data regimes can be hard to predict because models might suddenly "leave the random region" and show significant improvements. * Scaling by Compute: Highlighted that if scaling laws are predictable, then scaling by compute (training larger models for longer) becomes a crucial strategy. * Speed = Accuracy Trade-off: Suggested that hardware improvements (e.g., quantization) and distributed training could be used to improve compute throughput, allowing for larger models and better accuracy.


[0:05] Part 2. Neural (LLM) Scaling Behaviors

This section delves into empirical scaling behaviors of LLMs, focusing on data, model size, and hyperparameters.

[0:05] Data vs. Performance

What is a data scaling law? * A simple formula that maps dataset size ($n$) to error. * Expectation: Monotonic, logistic-like curves for generalization error (log-scale) vs. training data set size (log-scale), with distinct regions (small data, power-law, irreducible error).

Scaling laws hold on many different kinds of phenomena!
  • Kaplan et al. (2020): Showed power-law relationships (straight lines on log-log axes) between test loss and:
    • Compute (PF-days)
    • Dataset Size (tokens)
    • Parameters (non-embedding)
  • These relationships hold even in non-standard settings (e.g., when training and test data are different).

[0:05] Conceptual Foundations of Data Scaling Laws

Q: Why do scaling laws show up? * We expect error to be monotonic (more data = less error). * Q: But why is it a power law / linear in log-log? * A: Estimation error naturally decays polynomially.

Toy Example: Mean Estimation * Input: $x_1, \dots, x_n \sim N(\mu, \sigma^2)$ * Task: Estimate the average, $\hat{\mu} = \frac{1}{n} \sum x_i$. * Error: The expected squared error is $E[(\hat{\mu} - \mu)^2] = \frac{\sigma^2}{n}$. * This is a "scaling law": Taking the logarithm of both sides: $$ \log(\text{Error}) = -\log n + 2 \log \sigma $$ This shows a linear relationship in log-log space with a slope of -1. * Generalization: Any polynomial rate $1/n^\alpha$ is a scaling law.
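The slope of -1 in the toy example can be checked numerically. The following is a minimal simulation sketch, not something shown in the lecture; the sample sizes, trial count, and helper names are arbitrary choices made for illustration.

```python
import math
import random

def mse_of_sample_mean(n, trials=1000, mu=0.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of E[(mu_hat - mu)^2] for the mean of n Gaussian draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (mu_hat - mu) ** 2
    return total / trials

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

sample_sizes = [10, 30, 100, 300]
errors = [mse_of_sample_mean(n, seed=n) for n in sample_sizes]
slope = loglog_slope(sample_sizes, errors)
print(f"fitted log-log slope: {slope:.3f}")  # theory predicts a slope of -1
```

The fitted slope should land close to -1, up to Monte Carlo noise.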

Detour: Scaling Laws for (Nonparametric) Learning * Neural nets can approximate arbitrary functions. Let's consider estimating a function $f(x)$ where $x_i$ are uniformly distributed in a 2D unit box, and $y_i = f(x_i) + N(0,1)$. * Approach: Cut the 2D space into boxes of length $n^{-1/4}$. * Estimation Error: Informally, we have $\sqrt{n}$ boxes, each getting $\sqrt{n}$ samples. The error is approximately $1/\sqrt{n}$ (plus other smoothness terms). * In d-dimensions: This becomes Error $\approx n^{-1/d}$. * Meaning: Scaling means $\log(\text{Error}) = -\frac{1}{d} \log n + C$. * Takeaway: "Nonparametric" learning has dimension-dependent scaling. The slope of the scaling law is related to the intrinsic dimensionality of the data.
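The box-counting step for the 2D case can be written out explicitly. This covers only the informal variance part of the argument: it assumes the samples spread evenly across boxes and ignores the smoothness (bias) term.

$$ \#\text{boxes} = \left(\frac{1}{n^{-1/4}}\right)^{2} = n^{1/2}, \qquad \text{samples per box} \approx \frac{n}{n^{1/2}} = n^{1/2}, \qquad \text{Var}(\text{box average}) \approx \frac{1}{n^{1/2}} = n^{-1/2} $$

So $\log(\text{Error}) \approx -\frac{1}{2} \log n$, which agrees with the $d$-dimensional expression at $d = 2$.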

Scaling Law Exponents: An Intriguing Mystery
  • We expect "nice round numbers" for the slope (e.g., -1 or -0.5) based on simple models.
  • Empirical Findings:
    • Machine Translation (Hestness et al.): -0.13
    • Speech (Hestness et al.): -0.3
    • Language Modeling (Kaplan et al.): -0.095
  • These slopes are much shallower than expected, meaning error falls far more slowly with data than the simple models predict. Under the nonparametric picture, this suggests that the "effective" dimensionality of the learning problem for neural networks is very high.
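To get a feel for how shallow these slopes are, one can ask how much more data is needed to halve the error. The sketch below assumes a pure power law $n^{-\alpha}$ with no irreducible-error floor, which is a simplification; the function name is hypothetical.

```python
def data_multiplier_to_halve_error(alpha):
    """Factor by which n must grow to halve an error of the form n**(-alpha).

    Assumes a pure power law with no irreducible-error floor.
    """
    return 2.0 ** (1.0 / alpha)

exponents = [
    ("mean estimation (toy example)", 1.0),
    ("speech (Hestness et al.)", 0.3),
    ("machine translation (Hestness et al.)", 0.13),
    ("language modeling (Kaplan et al.)", 0.095),
]
for name, alpha in exponents:
    factor = data_multiplier_to_halve_error(alpha)
    print(f"{name:40s} alpha={alpha:<6} data x{factor:,.0f}")
```

With a slope of -1, doubling the data halves the error; with a slope near -0.095, halving the error takes more than a thousand times as much data.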

Intrinsic Dimensionality Theory of Data Scaling Laws
  • Argument (Bahri 2021):
    1. Scaling laws arise due to polynomial rates of learning $1/n^\alpha$.
    2. The scaling exponent $\alpha$ is closely connected to the intrinsic dimensionality of the data.
  • Caveat: Estimators of intrinsic dimension are sketchy, and this is not airtight.

[0:05] Other Data Scaling Laws

Data scaling laws are useful for making engineering decisions. * Related Question: How does dataset composition affect performance? * Kaplan et al. showed that data composition affects the offset of the scaling law, not the slope. * Implication: To pick an optimal data mixture, you can run experiments on small models and extrapolate. * Example: Optimal data mixture can be found by analyzing the shape of expected error as a function of data source proportion.

Recap: Data Scaling Laws * Remarkably linear relationship between log-data size and log-error. * Holds across domains and models. * Theory understanding: Similar to generalization bounds (mean estimation example). * Applications: Data collection / curation.
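In practice, a data scaling law of this kind is usually fit as a straight line in log-log space and then extrapolated. The sketch below uses noise-free synthetic points generated from a made-up law (prefactor 5.0, exponent 0.1); it assumes all points lie in the power-law region, and the function name is hypothetical.

```python
import math

def fit_power_law(ns, errors):
    """Fit error ~ A * n**(-alpha) by least squares in log-log space.

    Returns (A, alpha). Assumes the points lie in the power-law region,
    away from the small-data and irreducible-error regimes.
    """
    lx = [math.log(n) for n in ns]
    ly = [math.log(e) for e in errors]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Synthetic "small-scale" measurements from a made-up law; real data is noisy.
small_ns = [1e6, 3e6, 1e7, 3e7, 1e8]
small_errors = [5.0 * n ** -0.1 for n in small_ns]

A, alpha = fit_power_law(small_ns, small_errors)
predicted = A * 1e10 ** -alpha  # extrapolate two orders of magnitude beyond the fit
print(f"A={A:.3f}, alpha={alpha:.3f}, predicted error at n=1e10: {predicted:.3f}")
```

With real measurements, the points nearest the small-data and irreducible-error regions would need to be excluded or modeled separately before fitting.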


[0:05] Scaling Laws for Model Engineering

Now, let's shift to model scaling, which is often more mysterious.

Our Motivation: How can we efficiently design huge LLMs? * Choices: LSTMs vs. Transformers, Adam vs. SGD, etc. * Resource Allocation: How should we allocate limited resources? * Train models longer vs. train bigger models? * Collect more data vs. get more GPUs? * Scaling laws provide a simple procedure to answer these questions.

[0:05] Hyperparameter Questions

We'll consider some of these choices, mostly in the context of the classic Kaplan scaling paper. 1. Architecture 2. Optimizer 3. Aspect ratio / depth 4. Batch size 5. Learning rate

1. Architecture: Transformers vs. LSTMs * Q: Are transformers better than LSTMs? * Brute force way: Spend tens of millions to train an LSTM GPT-3. * Scaling law way: Train LSTMs and Transformers across various compute levels. * Observation: Transformers consistently outperform LSTMs with a constant factor gap in compute efficiency (e.g., 15x more efficient). This gap holds across different numbers of layers.
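A constant-factor gap in compute efficiency follows directly from two parallel lines on a log-log plot. The sketch below assumes both methods follow loss = offset * compute^(-exponent) with the same exponent; the numbers are hypothetical and are not taken from Kaplan et al.

```python
def compute_multiplier(offset_better, offset_worse, exponent):
    """Extra compute factor the worse method needs to match the better one.

    Assumes both follow loss = offset * compute**(-exponent) with the SAME
    exponent, i.e. parallel lines on a log-log plot.
    """
    return (offset_worse / offset_better) ** (1.0 / exponent)

# Hypothetical offsets and exponent, chosen only for illustration.
multiplier = compute_multiplier(offset_better=10.0, offset_worse=11.0, exponent=0.05)
print(f"worse method needs {multiplier:.1f}x the compute for the same loss")
```

Because the exponent is small, even a modest vertical gap between the two curves translates into a large multiplicative compute gap, and that gap is the same at every scale.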

Many Architectures (Tay et al.)
  • Cross-architecture scaling (Tay et al., Google): Compared many architectures (ALBERT, DConv, Funnel, Transformer-GLU, LConv, MLP Mixer, MoS Transformer, Switch Transformer, Universal Transformer) against a Transformer baseline.
  • Method: Scaled each architecture across different FLOPs budgets and plotted negative log-perplexity.
  • Observation: Only architectures like Gated Linear Units (GLU) and Mixture of Experts (MoE) (e.g., Switch Transformer) consistently beat the Transformer baseline. This provides evidence for which architectures are worth scaling up.

2. Optimizer Choice * Q: What about Adam vs. SGD? * Hestness et al. (2017): Compared SGD and Adam for Recurrent Highway Networks (RHNs, pre-transformers). * Observation: Similar to architecture choice, there's a constant factor gap in compute effectiveness between Adam and SGD. The slopes of the scaling curves are similar, but Adam provides a better offset (lower loss for the same compute).

3. Depth/Width: Number of Layers * Q: Does depth or width make a huge difference? * Kaplan et al. analysis shows that while 1 layer performs significantly worse, models with 2 or more layers have diminishing returns beyond $10^7$ parameters. * Insight: There's a wide basin of approximately optimal depth/width ratios, rather than a sharp optimal point.

Depth/Width: But not all parameters are made equal * Observation: Embedding layer parameters don't behave the same as non-embedding parameters. * If embedding parameters are included in the total parameter count, the scaling law becomes distorted (bends over). * If only non-embedding parameters are considered, the scaling law is much cleaner. * Related Work: Recent papers on scaling laws for mixtures of experts also explore how different types of parameters contribute to scaling.
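A rough parameter count shows why embeddings distort the picture at small scale. The sketch below uses the approximation from Kaplan et al. (2020) that non-embedding parameters are about 12 * n_layer * d_model^2 (assuming d_ff = 4 * d_model and ignoring biases and layer norms). The vocabulary size, context length, and model configurations are assumptions chosen for illustration.

```python
def non_embedding_params(n_layer, d_model):
    """Approximate non-embedding parameter count: 12 * n_layer * d_model**2.

    Assumes d_ff = 4 * d_model and attention dimension equal to d_model;
    biases and layer norms are ignored.
    """
    return 12 * n_layer * d_model ** 2

def embedding_params(vocab_size, d_model, context_length):
    """Token embeddings plus learned positional embeddings."""
    return (vocab_size + context_length) * d_model

# Hypothetical configurations; vocabulary and context sizes are assumptions.
for n_layer, d_model in [(2, 128), (12, 768), (48, 1600)]:
    body = non_embedding_params(n_layer, d_model)
    emb = embedding_params(vocab_size=50_000, d_model=d_model, context_length=1024)
    share = emb / (body + emb)
    print(f"layers={n_layer:2d} d_model={d_model:4d} "
          f"non-emb={body:>13,d} emb={emb:>11,d} emb share={share:.1%}")
```

For the smallest configuration the embeddings dominate the total count, while for the largest they are a small fraction, so the total parameter count is not a consistent x-axis across scales.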

Do hyperparameters and other Transformer layers scale equally? * Kaplan et al. also analyzed the impact of aspect ratios (Feed-Forward Ratio, Attention Head Dimension) on performance. * Observation: The shape of the loss curve (loss increase vs. aspect ratio) remains similar across different model sizes (50M, 274M, 1.5B parameters). * Implication: You can tune aspect ratios on small models, and the optimal range will likely transfer to larger models. This highlights the importance of being "scale-aware" in hyperparameter tuning.

4. Batch Size: Critical Batch Size
  • Batch size is known to have strong diminishing returns past a certain point.
  • Critical batch size: The ratio $E_{min} / S_{min}$, where $E_{min}$ is the minimum number of examples needed to reach the target loss and $S_{min}$ is the minimum number of steps needed to reach it.
  • Perfect Scaling: When batch size is smaller than the noise scale, increasing batch size is almost equivalent to taking more gradient steps. This is desirable for parallel processing.
  • Ineffective Scaling: Past the critical batch size, increasing batch size no longer effectively reduces noise, as it's dominated by the curvature of the optimization landscape.
  • Empirical Analysis: The critical batch size can be estimated empirically.
  • Observation: As the loss target gets smaller (better performance), the critical batch size tends to get larger.
  • Practical Implication: Training reports (e.g., Llama 3) often show increasing batch size during training as loss decreases.
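One way to make the two regimes concrete is the hyperbolic steps-versus-examples trade-off from McCandlish et al. (2018). That relation is not stated in these notes and is used here as an assumption, with made-up values for the minimum step count and the critical batch size.

```python
def steps_needed(batch_size, s_min, b_crit):
    """Optimizer steps to reach the target loss at a given batch size.

    Assumes the hyperbolic trade-off S = S_min * (1 + B_crit / B).
    """
    return s_min * (1.0 + b_crit / batch_size)

def examples_needed(batch_size, s_min, b_crit):
    """Training examples processed to reach the target loss (batch * steps)."""
    return batch_size * steps_needed(batch_size, s_min, b_crit)

# Hypothetical values: S_min = 10,000 steps and B_crit = 1,000 examples.
s_min, b_crit = 10_000, 1_000
for b in [10, 100, 1_000, 10_000, 100_000]:
    steps = steps_needed(b, s_min, b_crit)
    examples = examples_needed(b, s_min, b_crit)
    print(f"B={b:>7,d}  steps={steps:>12,.0f}  examples={examples:>14,.0f}")
```

Well below the critical batch size, a 10x larger batch cuts steps by nearly 10x at almost no extra data cost; well above it, steps barely drop while the number of examples processed grows roughly linearly with batch size.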

Batch Size: Selecting the Optimal Batch * Q: As we increase both compute and model size, how should we scale training? * Kaplan et al. analysis shows that for a given compute budget, the number of total steps can remain relatively constant while batch sizes increase. * Good news for data parallel processing: This allows for efficient scaling of training.

5. Learning Rates: muP and Scale-Aware LR Choices
  • Problem: If we naively scale up, the optimal learning rate depends on scale.
  • Standard Practice (left plot): As model width increases, the optimal learning rate shifts to the left (smaller values).
  • Solution: We need scale-aware initialization and learning rate scaling.
  • muP (Maximal Update Parametrization): A reparameterization of the model where learning rates are scaled based on model width and other factors (e.g., variance of initialization, output multipliers).
  • muP result (right plot, labeled "Our Work" in the source figure): With muP, the optimal learning rate remains stable across different model widths.
  • Implication: Tune learning rate once on a small model, and it directly transfers to the largest scale.
  • Industry Adoption: Labs (e.g., Meta's "MetaP" for Llama 4) are adopting similar ideas to simplify scaling.


[0:05] Caution - Scaling Behaviors Can Differ Downstream


[0:05] Some Surprising Takeaways

[0:05] One Important Use of Scaling Laws

Q: Do we need more data or bigger models? * Context: Historically, compute was the limiting resource, not data. The question was how to optimally spend a fixed FLOPs budget. * Joint Data-Model Scaling Laws: Describe how data and model size relate to error. * Rosenfeld et al. (2020) & Kaplan et al. (2020): Proposed functional forms where error is a sum of terms decaying polynomially with data ($n^{-\alpha}$) and model size ($m^{-\beta}$), plus an irreducible error term ($C$). * $$ \text{Error} = n^{-\alpha} + m^{-\beta} + C $$ (Rosenfeld et al.) * $$ \text{Error} = [m^{-\alpha} + n^{-1}]^\beta $$ (Kaplan et al.) * These functional forms, while somewhat ad-hoc, provide surprisingly good fits to the observed data-model joint error landscape.

Model-Data Joint Scaling is Accurate * Rosenfeld et al. demonstrated that by fitting scaling exponents on small data and small models, they could accurately predict the performance of much larger models and datasets. * Method: Train models on a small subset of the data-model space (e.g., small models, small data), fit the joint scaling law, and extrapolate to predict performance for larger models and datasets. * Accuracy: The predictions (y-axis) closely match the real values (x-axis) for both ImageNet and WikiText-103, showing high accuracy of joint extrapolation. * Implication: This allows trading off data and model size to optimize $n^{-\alpha} + m^{-\beta} + C$ with your costs.
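As a worked example of that trade-off, suppose (as an assumption, not something stated in the lecture notes) that cost is proportional to the product of model size and data, so a fixed budget pins down $mn = K$. Substituting $n = K/m$ into the Rosenfeld form and setting the derivative to zero gives:

$$ E(m) = K^{-\alpha} m^{\alpha} + m^{-\beta} + C, \qquad \frac{dE}{dm} = \alpha K^{-\alpha} m^{\alpha - 1} - \beta m^{-\beta - 1} = 0 $$

$$ m^{*} = \left(\frac{\beta}{\alpha}\right)^{\frac{1}{\alpha + \beta}} K^{\frac{\alpha}{\alpha + \beta}}, \qquad n^{*} = \frac{K}{m^{*}} \propto K^{\frac{\beta}{\alpha + \beta}} $$

Under this assumed cost model, the optimal model size and data size both grow as power laws in the budget, and when $\alpha = \beta$ each grows as $K^{1/2}$. The irreducible term $C$ does not affect the optimal split.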

[0:05] Compute Tradeoffs

[0:05] Caution - 'Optimal' Scaling Laws Are Hard to Get

Main Difference - Accounting for LR Schedules
  • One key reason for the discrepancy between Kaplan's and Chinchilla's estimates lies in how they accounted for learning rate schedules.
  • Cosine Learning Rate Schedules: Models are typically trained with cosine learning rate schedules: a warm-up phase followed by a cosine decay to a small final learning rate, with the decay length set by the planned number of training steps.
  • Problem: You cannot truncate a cosine learning rate schedule early and expect the same model quality as a full run. A model trained for a shorter duration with a truncated schedule is not equivalent to a model trained from scratch with a schedule designed for that shorter duration.
  • Kaplan's Approach: Treated intermediate points of a longer training run as if they were the end points of shorter runs. Because the learning rate had not finished decaying at those points, this assumption was flawed.
  • Chinchilla's Approach: Matched the length of the cosine schedule to each run's training budget, so every data/model size combination was evaluated at the end of a fully decayed schedule.
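The truncation problem is easy to see by evaluating a cosine schedule at the same step under two different planned lengths. This is a minimal sketch with hypothetical peak and minimum learning rates; it is not the schedule code from either paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=0):
    """Linear warm-up followed by cosine decay from peak_lr to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Same step count, two different planned run lengths (hypothetical numbers).
short_run_end = cosine_lr(step=500, total_steps=500)       # schedule designed for 500 steps
long_run_midpoint = cosine_lr(step=500, total_steps=1000)  # 1000-step schedule cut at 500

print(f"LR at step 500 of a 500-step schedule:  {short_run_end:.2e}")
print(f"LR at step 500 of a 1000-step schedule: {long_run_midpoint:.2e}")
```

At step 500 the truncated long run is still training at a much higher learning rate than the run whose schedule was designed to end there, so its loss at that point understates what a properly scheduled 500-step run would achieve.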

Chinchilla in Depth - 3 Methods

The Chinchilla authors suggested three ways of fitting scaling laws, which mostly (Method 3 being the exception) suggest similar constants.

  1. Minimum over training runs (Method 1):

    • Approach: Overlay training curves for various model sizes and compute budgets. Identify the "lower envelope" of these curves, representing the minimum loss achievable for a given FLOPs budget.
    • Observation: The minimum over the union of all training curves is a power law.
    • Result: This method yields exponents close to 0.5 for both the optimal parameter count and the optimal token count as functions of compute ($N_{opt} \propto C^a$, $D_{opt} \propto C^b$, with $a \approx b \approx 0.5$).
  2. IsoFLOPs (Method 2):

    • Approach: Pick a range of FLOPs budgets. For each budget, vary the total parameter count and take the minimum loss over these convex shapes (IsoFLOP curves).
    • Observation: The minima form a power law.
    • Result: This method also yields exponents $a$ and $b$ close to 0.5. It is considered conceptually straightforward.
  3. Joint Fits (Method 3):

    • Approach: Run a bunch of models on the size-data grid. Use least squares to fit a joint scaling law (like the Rosenfeld/Kaplan functional forms).
    • Observation: This method is messier and yields somewhat different exponents (about 0.46 for $a$ and 0.54 for $b$ in the Chinchilla paper). For comparison, Kaplan et al.'s earlier work reported about 0.73 for $a$ and 0.27 for $b$.
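The IsoFLOP procedure (Method 2) can be sketched on synthetic data. Everything below is an assumption made for illustration: the joint loss form and its constants are made up, FLOPs are approximated as 6 * N * D, and the minimum on each IsoFLOP curve is taken by a grid search rather than by the curve fit used in the paper.

```python
import math

def loss(n_params, n_tokens, A=400.0, B=400.0, alpha=0.3, beta=0.3, E=1.7):
    """Hypothetical joint law: A / N**alpha + B / D**beta + E (made-up constants)."""
    return A / n_params ** alpha + B / n_tokens ** beta + E

def isoflop_optimum(flops, model_sizes):
    """Model size with the lowest loss on one IsoFLOP curve.

    Assumes FLOPs ~= 6 * N * D, so tokens D = flops / (6 * N).
    """
    return min(model_sizes, key=lambda n: loss(n, flops / (6.0 * n)))

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Log-spaced sweep of model sizes from 1e6 to 1e11 parameters.
model_sizes = [10 ** (6 + 5 * i / 500) for i in range(501)]
budgets = [1e18, 1e19, 1e20, 1e21]
optimal_sizes = [isoflop_optimum(c, model_sizes) for c in budgets]

a = loglog_slope(budgets, optimal_sizes)  # exponent in N_opt ~ C**a
print(f"fitted exponent a = {a:.3f}")
```

Because the made-up law uses equal exponents for parameters and tokens, the fitted exponent comes out near 0.5; with real training runs, each point on an IsoFLOP curve is a separate model trained with a schedule matched to its token budget.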

Fun Addendum - Errors in Chinchilla Method 3 * Discovery: Some authors (Besiroglu et al., 2024) later found that Method 3 in the original Chinchilla paper was likely flawed. * Process: They performed data forensics, recovered the raw data, and re-did the fit. * Result: The re-fit yielded results more consistent with Methods 1 and 2. The original fit had non-zero mean residuals, indicating a biased fit. Correcting this bias brought Method 3's estimates in line with the other two methods. * Conclusion: The original authors had both the idea and the data right, but a minor error in curve fitting led to a misleading result.

[0:05] Important Note - Train-Optimal May Not Be What You Want

[0:05] Recent Example for Different (Diffusion) Models


[0:05] Scaling Laws for Models and Compute


[0:05] Recap: Scaling Laws - Surprising and Useful!

0:14 Lecture 9: Scaling Laws - Basics, introducing a scenario for large-scale language model training.
1:42 Slide title: Today: simple, predictive 'laws' for behaviors of LMs, with two graphs showing validation and test loss.
4:01 Slide title: Sample complexity and rates, showing two mathematical equations related to learning theory.
5:28 Slide title: Earliest (data) scaling law paper – 1993, displaying a paper title and a graph of error vs. training set size.
7:13 Slide title: Early history of scaling laws – data scaling, showing a graph of test accuracy vs. millions of words.
8:40 Slide title: Hestness et al 2017, presenting three graphs illustrating neural machine translation learning curves and generalization error.
10:50 Slide title: Part 2. Neural (LLM) scaling behaviors, outlining three key areas: data vs performance, data vs model size, and hyperparameters vs performance.
13:43 Slide title: Data vs performance, defining data scaling laws and showing a generalization error curve.
15:13 Slide title: Conceptual foundations of data scaling laws, posing a question about why scaling laws appear.
16:57 Slide title: Toy example: mean estimation, presenting equations for estimating the average and its error.

Practical Takeaways

Open Questions / Things to Remember