Qihong Ruan / CS336 Notes / Lecture 15

Lecture 15: Alignment - SFT/RLHF

Stanford CS336 · Spring 2025 · 1:14:51

TL;DR

  • Post-training (alignment) is crucial to make large language models (LLMs) useful and safe, turning raw pre-trained models into instruction-following agents like ChatGPT.
  • Supervised Fine-Tuning (SFT) trains on expert demonstrations, but the quality and style of this data significantly affect model behavior and can lead to issues like hallucination if not carefully managed.
  • Reinforcement Learning from Human Feedback (RLHF) optimizes models to maximize a measurable reward function, moving beyond simply imitating a reference distribution.
  • RLHF data collection (pairwise comparisons) is often cheaper than SFT data collection, but it still faces challenges with annotator quality, consistency, and ethical considerations (e.g., fair wages).
  • Algorithms like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) adapt LLMs using human feedback, with DPO offering a simpler, more accessible approach by reframing RL as a maximum likelihood problem.

Key Concepts

  • Post-training / Alignment
  • Supervised Fine-Tuning (SFT)
  • Instruction Tuning Data (FLAN, Alpaca, Oasst)
  • Hallucination
  • Safety Tuning / Content Moderation
  • Mid-training / Two-phase Training
  • Reinforcement Learning from Human Feedback (RLHF)
  • Reward Model
  • Pairwise Feedback
  • Policy Gradient Theorem
  • Proximal Policy Optimization (PPO)
  • Trust Region Policy Optimization (TRPO)
  • Direct Preference Optimization (DPO)
  • KL Divergence
  • Bradley-Terry Model
  • Self-preference / Self-bias
  • Length Effects
  • Generator-Validator Gap
  • Crowdsourcing complexities and ethics


[0:00] Introduction: From Pre-training to Post-training

0:14 Lecture 15 title slide: RLHF / Alignment for CS336.

The lecture shifts focus from pre-training (building large models and data components) to post-training or alignment, which aims to make these pre-trained models useful and safe.

Motivation:

  • GPT-3 vs. ChatGPT: GPT-3 was a remarkable system, but not particularly "useful" in a product sense (e.g., it didn't follow instructions well). ChatGPT, however, transformed the landscape by following instructions effectively.
  • The Goal: Understand the transition from a powerful pre-trained model (like GPT-3) to a highly aligned and useful system (like ChatGPT).

Key Aspects of Post-Training:

  1. Instruction Following: Enabling models to understand and execute complex, nested instructions.
    • Example: GPT-4 generating matplotlib code from a long, detailed prompt (from the "Sparks of AGI" paper by Sébastien Bubeck et al.). This ability to follow instructions is a key differentiator.
  2. Safety and Content Moderation: Ensuring models are safe, resist misuse (e.g., scams), and avoid generating toxic or harmful content.
    • ChatGPT's success is partly attributed to its robust guardrails.

Core Idea of Post-Training:

  • Pre-training "packs" the model with capabilities (reasoning, answering questions), but these aren't accessible "out of the box."
  • Post-training involves collecting specific data on desired behaviors and training the model to exhibit them.
  • Key Questions:
    • What does this "desired behavior" data look like?
    • How hard is it to collect?
    • How do we best make use of it (algorithmic questions)?
    • How do we scale this process?

[0:49] The RLHF Pipeline: Supervised Fine-Tuning (SFT)

0:49 Lecture 15 title slide: RLHF / Alignment for CS336, with lecturer visible.
3:56 Slide on goal: enable better, tighter controls over LM output, with questions on data collection.

The lecture structure will roughly follow the InstructGPT paper, which outlines a three-step process for building instruction-following models:

  1. Supervised Fine-Tuning (SFT): Collect demonstration data and train a supervised policy.
  2. Reward Model Training: Collect comparison data and train a reward model.
  3. Reinforcement Learning (RLHF): Optimize a policy against the reward model using reinforcement learning.

This lecture covers Part 1 (SFT) and Part 2 (RLHF).
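
Step 2 of the pipeline can be made concrete with a small sketch. Reward models trained on comparison data are commonly fit with a Bradley-Terry style loss, in which the probability that the preferred response wins is the sigmoid of the reward difference. The function name and the example reward values below are illustrative assumptions, not taken from the lecture.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Bradley-Terry assumption: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar rewards (chosen, rejected) for three comparison pairs.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.0, 1.0)]
losses = [bradley_terry_loss(chosen, rejected) for chosen, rejected in pairs]
mean_loss = sum(losses) / len(losses)
print(f"mean pairwise loss: {mean_loss:.4f}")
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and equals log 2 when the two rewards are tied.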

[5:11] Ingredients for SFT: Data and Method

5:15 Slide on ingredients in SFT: training data examples like Muffin, CoT, Natural Instructions, and Open Assistant.

Two main considerations for SFT:

  1. Training Data: What does expert demonstration data look like?
  2. Method: How do we adapt the model to this data? (Beyond simple gradient descent, there are non-obvious aspects.)
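
As a baseline for the "method" question, SFT is usually described as ordinary next-token maximum likelihood on the demonstration. The sketch below computes that loss from per-token log-probabilities. Averaging only over response tokens (masking out the prompt) is a common convention assumed here, not something these notes specify, and the function name and numbers are hypothetical.

```python
import math


def sft_loss(token_logprobs, is_response_token):
    """Average negative log-likelihood over response tokens only."""
    selected = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not selected:
        raise ValueError("no response tokens to train on")
    return sum(selected) / len(selected)


# Hypothetical log-probs the model assigns to each reference token.
# The first two tokens belong to the prompt, the last two to the response.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = sft_loss(logprobs, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

In a real training loop the log-probabilities would come from the model's forward pass, and gradient descent would minimize this quantity over batches of demonstrations.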

[5:51] Training Data: Instruction Tuning Datasets

The speaker introduces three types of instruction-tuning datasets, representing different paradigms:

  1. FLAN (Fine-tuned LAnguage Net):

    • Constructed by aggregating many existing NLP task datasets (e.g., T0-SF, Natural Instructions v2, CoT).
    • Pros: Easy to get lots of data for free by repurposing existing benchmarks.
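    • Cons: Can be unnatural or "benchmark-centric." The format often requires "surgery" (e.g., appending options to a text) that doesn't resemble natural chat interactions.
    • Example: Summarizing an article with travel info, classifying text as "business," or generating restaurant descriptions from database entries.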
  2. Alpaca (Stanford Alpaca):

    • An early attempt at using LLMs to generate instruction tuning data.
    • Procedure: A seed set of human-written instructions (the 175 Self-Instruct seed tasks) is used to prompt a strong LLM to generate more instructions. An InstructGPT-style model then generates responses for these instructions (Alpaca used OpenAI's text-davinci-003 for generation).
    • Pros: Generates data that feels more like natural chat interactions, with long-form natural language responses.
    • Cons: The generated instructions can be less diverse and shorter than human-written ones.
  3. Oasst (Open Assistant):

    • A crowd-sourced effort where online enthusiasts wrote instruction-tuning data.
    • Pros: High-quality, detailed human-written instructions and responses, often including citations.
    • Cons: Very difficult and expensive to collect at scale.
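
The format "surgery" described for FLAN-style data can be sketched as a small templating function that turns a labeled classification example into a prompt/response pair by appending the answer options to the text. The template wording, the function name, and the example sentence are illustrative assumptions, not FLAN's actual templates.

```python
def to_instruction_example(text, options, label):
    """Turn a labeled classification example into a prompt/response pair."""
    if label not in options:
        raise ValueError("label must be one of the options")
    option_lines = "\n".join(f"- {opt}" for opt in options)
    # Append the question and answer options directly to the source text.
    prompt = f"{text}\n\nWhat is this text about?\nOPTIONS:\n{option_lines}"
    return {"prompt": prompt, "response": label}


# Hypothetical benchmark item converted into instruction-tuning format.
example = to_instruction_example(
    text="Shares rallied after the company reported record quarterly profits.",
    options=["World", "Sports", "Business", "Science/Tech"],
    label="Business",
)
print(example["prompt"])
print("->", example["response"])
```

The one-word response illustrates why such data reads as benchmark-centric rather than as a natural chat exchange.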

[2:01] Interactive Annotation Task

The speaker conducts a live annotation task where students are asked to provide a response to the prompt: "Please provide what you think is the best response to the following user input: CS336 is all you need."

[2:45] GPT-4o Response vs. Human Annotation

[5:06] What We Notice Across Datasets

7:10 Slide on instruction-tuning data: FLAN, Oasst, and Alpaca datasets.
8:31 Slide showing FLAN random examples, including email, business, and restaurant prompts.
10:00 Slide showing FLAN random examples, including email, business, and restaurant prompts.
17:03 Graph showing preference for lists and longer outputs when evaluating by preferences.
18:38 Slide on references, complex knowledge, and factuality, with an example from Open Assistant.
26:40 Slide on safety-tuning, showing a graph of unsafe and exaggerated safety responses.
28:13 Slide on how to fine-tune, showing Python code for gradient descent and instruction tuning.
28:51 Slide on turning instruction tuning into pretraining, outlining a three-step process.

Instruction-tuning datasets vary significantly in:

  • Length and bullet points (style variations): Some datasets favor lists, others favor long paragraphs.
  • References and other complex knowledge: Some include citations, and some assume deep domain knowledge.
  • Scale: The amount of data collected.
  • Safety: How responses handle harmful or toxic content.

Style Variations in Data and Models:

  • A table from a survey by Yizhong Wang et al. (2023) shows significant variation in the average length of prompts and completions across different datasets.
  • Human Preference: Humans (and AI judges) have a strong preference for lists and longer outputs.
  • Concern: Optimizing for stylistic preferences (like length) might overshadow optimizing for core capabilities (e.g., reducing hallucinations, improving factual accuracy).
  • Benchmarking: These stylistic factors are not highly correlated with benchmark performance (e.g., MMLU). Benchmarks are still crucial for evaluating core capabilities, while chat-style evaluations (e.g., AlpacaEval) help in understanding user engagement. A diverse array of evaluation strategies is needed.
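
To see how such style statistics might be gathered, here is a minimal sketch that measures average response length and the share of responses containing a list. Counting whitespace-separated words instead of tokens, the bullet-detection regex, and the two sample responses are simplifying assumptions made for illustration.

```python
import re

# A line counts as a list item if it starts with -, *, • or a number like "1." / "1)".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def style_stats(responses):
    """Return mean length in words and the share of responses with a list line."""
    if not responses:
        raise ValueError("need at least one response")
    lengths = [len(r.split()) for r in responses]
    with_list = sum(
        any(BULLET.match(line) for line in r.splitlines()) for r in responses
    )
    return {
        "mean_words": sum(lengths) / len(lengths),
        "list_share": with_list / len(responses),
    }


# Hypothetical responses: one terse answer, one bulleted answer.
stats = style_stats(["Paris.", "Key points:\n- first item\n- second item"])
print(stats)
```

Running the same function over several datasets would give a rough, comparable picture of how list-heavy and how long their completions are.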

[1:30:00] References, Complex Knowledge, and Factuality

[2:39:50] Safety Tuning

[2:59:50] Putting it Together: SFT Data

[3:07:00] How to Fine-Tune (and Mid-training)

[3:42:00] Part 2: RLHF - From Imitation to Optimization

[3:49:00] Why Optimize? Costs and G-V Gap

Two main reasons to optimize with RLHF:

  1. Cost: SFT data (expert demonstrations) can be very expensive to collect.
    • Annotation costs for SFT are high ($25k for 100 examples in one study).
    • Pairwise feedback (used in RLHF) is cheaper ($4k for 100 examples).
    • RLHF (optimizing against a reward model) therefore has lower annotation costs than SFT, because it is easier to verify an output than to generate one.
  2. Generator-Validator (G-V) Gap: People don't always write what they prefer in LM outputs.
    • Human annotators might prefer LM-generated summaries over their own, even if they are expert writers.
    • This suggests that what humans generate is not always aligned with what they prefer.
    • RLHF helps bridge this gap by directly optimizing for human preference.
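
Taking the quoted figures at face value, the per-example costs work out as follows. The variable names are illustrative, and the dollar amounts are simply the ones cited above.

```python
# Figures as quoted in the notes: total annotation cost for 100 examples.
sft_total_usd = 25_000
pairwise_total_usd = 4_000
n_examples = 100

sft_cost_per_example = sft_total_usd / n_examples
pairwise_cost_per_example = pairwise_total_usd / n_examples
cost_ratio = sft_cost_per_example / pairwise_cost_per_example

print(f"SFT demonstration: ${sft_cost_per_example:.0f} per example")
print(f"Pairwise comparison: ${pairwise_cost_per_example:.0f} per example")
print(f"SFT is {cost_ratio:.2f}x more expensive per example")
```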

[4:03:00] RLHF Data: Types of Pairwise Feedback

[4:53:00] RLHF Data: LM-Generated Feedback (Self-Training)

[5:09:00] How do we do RLHF? PPO and DPO

[5:12:00] PPO (Proximal Policy Optimization)
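
As a rough illustration of where the "proximal" in PPO comes from, here is a sketch of the per-sample clipped surrogate term from the original PPO paper. The function name, the scalar inputs, and the default epsilon of 0.2 are illustrative assumptions, and the lecture's exact formulation may differ.

```python
def ppo_clip_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio     = pi_new(action | state) / pi_old(action | state)
    advantage = estimated advantage of the sampled action
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the minimum removes any incentive to push the ratio outside
    # the interval [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
print(ppo_clip_objective(ratio=1.5, advantage=1.0))
# Negative advantage: the penalty is not capped when the ratio grows.
print(ppo_clip_objective(ratio=1.5, advantage=-1.0))
```

In practice the objective is averaged over sampled tokens or actions, and RLHF setups typically add a KL penalty toward the reference model on top of it.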

[5:30:00] Can we get rid of PPO? (Introducing DPO)

[5:34:00] DPO - RLHF without Tears?
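
DPO is summarized in these notes as reframing RL as a maximum likelihood problem. A minimal sketch of that idea, assuming the standard DPO loss from the DPO paper rather than anything shown on the lecture slides: each preference pair contributes a logistic loss on the difference of policy-versus-reference log-ratios. The function name, the beta value, and the log-probabilities are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen response and lowers the rejected one: the loss drops.
after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"{at_reference:.4f} -> {after_update:.4f}")
```

No reward model or sampling loop appears here: the loss needs only log-probabilities of the two responses under the policy and a frozen reference model, which is what makes it an ordinary supervised objective.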


[5:43:00] Practical Takeaways

[5:46:00] Open Questions / Things to Remember
