Qihong Ruan / CS336 Notes / Lecture 05

Lecture 05: GPUs

Stanford CS336 · Spring 2025 · 1:14:21

TL;DR

  - GPUs are massively parallel processors optimized for throughput, not latency, by having many simple compute units (SMs) orchestrated by minimal control logic.
  - Compute (especially matrix multiplication) has scaled much faster than memory bandwidth, making memory access the primary bottleneck for modern ML workloads.
  - Optimizing GPU performance requires careful management of the memory hierarchy, leveraging fast on-chip memory (shared memory, registers) and minimizing slow global memory (DRAM) access.
  - Key techniques for optimizing GPU performance include: reducing memory accesses (coalescing, fusion, tiling), trading compute for memory (quantization, recomputation), and aligning data access patterns with hardware specifics (bursts, tile divisibility).
  - Flash Attention (and Flash Attention 2) dramatically accelerates attention by applying these low-level GPU optimization techniques, particularly online softmax computation and recomputation, to reduce HBM accesses.

Key Concepts

  - GPU Architecture (SMs, SPs, Tensor Cores)
  - CPU vs. GPU Design Goals (Latency vs. Throughput)
  - Memory Hierarchy (Registers, Shared Memory, L1/L2 Cache, Global Memory/HBM)
  - Execution Model (SIMT, Warps, Blocks, Threads)
  - Dennard Scaling vs. Parallel Scaling
  - Roofline Model (Compute-bound vs. Memory-bound)
  - GPU Optimization Techniques:
      - Control Divergence
      - Low Precision Computation (Quantization)
      - Operator Fusion
      - Recomputation
      - Memory Coalescing
      - Tiling
  - Flash Attention


[0:00] Introduction and Course Updates

[0:05] Outline and Goals

[0:27] Why GPUs Seem Mysterious

0:28 The slide shows the outline and goals of the lecture, focusing on making CUDA and GPUs less magical.
0:50 The slide details the lecture's goals: understanding what makes GPUs slow and how to write fast algorithms.

[2:11] Acknowledgements

[2:44] Organization of Today's Lecture

  1. GPUs in depth: How they work and important parts.
  2. Understanding GPU performance: What makes GPUs fast or slow.
  3. Putting it together: Unpacking Flash Attention.

[3:35] Setting the Stage: Compute Leads to Predictable Performance

3:32 The slide outlines the lecture's organization into three parts: GPUs, GPU performance, and FlashAttention.

[4:20] How We Get Compute Scaling: Early On - Dennard Scaling

4:04 The slide presents a graph showing that compute leads to predictable performance gains for language models.

[5:30] Parallel Scaling Continues

5:35 The slide illustrates how parallel scaling with GPUs has improved performance over 1000x in 10 years.

[6:22] How is a GPU Different from a CPU?

6:59 The slide compares CPU and GPU architectures, highlighting their different optimization strategies for threads.

[8:28] Anatomy of a GPU (Execution Units)

8:50 The slide shows the anatomy of a GPU, detailing streaming multiprocessors (SMs) and their execution units.

[9:57] Anatomy of a GPU (Memory)

[10:05] Is this GPU the same as that GPU?

10:22 This slide is a duplicate of a previous slide, comparing CPU and GPU architectures and their execution units.

[13:15] Execution Model of a GPU

13:48 The slide explains the execution model of a GPU, defining threads, blocks, and warps.
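
To make the thread/block/warp vocabulary concrete, here is a minimal Python sketch of how the threads of one block could be grouped into warps. It assumes a warp size of 32, the width on current NVIDIA GPUs. The function name `warps_in_block` is hypothetical and exists only for this illustration; it is not part of any CUDA API.

```python
WARP_SIZE = 32  # assumption: warp width on current NVIDIA GPUs


def warps_in_block(block_size, warp_size=WARP_SIZE):
    """Group thread indices 0..block_size-1 into consecutive warps."""
    return [
        list(range(start, min(start + warp_size, block_size)))
        for start in range(0, block_size, warp_size)
    ]


warps = warps_in_block(100)
print(len(warps))      # number of warps needed for a 100-thread block
print(len(warps[-1]))  # threads actually doing work in the last warp
```

In this toy model, a block size that is not a multiple of the warp size leaves the last warp only partly filled, which is one reason block sizes are usually chosen as multiples of 32.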

[15:04] Memory Model of a GPU

15:57 The slide illustrates the memory model of a GPU, showing different memory types and their accessibility.

[16:34] Side Thread - What about TPUs?

17:19 The slide discusses TPUs as an alternative to GPUs, highlighting their core structure and differences.
18:30 The slide discusses TPUs as an alternative to GPUs, highlighting their core structure and differences.

[19:15] Strengths of the GPU Model

  1. Easily scales up hard workloads: By adding more SMs, performance can be increased without necessarily increasing clock speed or dealing with heat dissipation issues.
  2. Easy (?) to program due to the SIMT model: The Single Instruction, Multiple Threads (SIMT) model, where all threads in a warp execute the same instruction on different data, is conceptually straightforward for parallelizing operations on matrices.
  3. Threads are "lightweight": GPU threads have minimal state and can be stopped and started quickly. This allows the GPU to hide latency by switching between active threads, ensuring high utilization even when some threads are waiting for data.
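
The SIMT point in item 2 can be illustrated with a toy Python model of one warp executing a branch. This is a sketch under simplifying assumptions, not a description of how real hardware schedules instructions: the warp has only four lanes for brevity, and the function `simt_branch` is hypothetical. It shows why control divergence (listed under Key Concepts) costs time: lanes share one instruction stream, so each branch path taken by at least one lane needs its own pass while the other lanes sit idle.

```python
def simt_branch(values):
    """Run `y = x * 2 if x is even else x + 1` for one warp, SIMT style.

    Every lane shares one instruction stream, so each branch path that has
    at least one active lane costs one pass over the whole warp.
    Returns (results, passes).
    """
    mask = [x % 2 == 0 for x in values]  # per-lane branch predicate
    results = [None] * len(values)
    passes = 0
    if any(mask):  # "then" path: lanes holding odd values are masked off
        passes += 1
        for lane, x in enumerate(values):
            if mask[lane]:
                results[lane] = x * 2
    if not all(mask):  # "else" path: lanes holding even values are masked off
        passes += 1
        for lane, x in enumerate(values):
            if not mask[lane]:
                results[lane] = x + 1
    return results, passes


print(simt_branch([2, 4, 6, 8]))  # uniform warp: a single pass
print(simt_branch([1, 2, 3, 4]))  # divergent warp: two passes
```

When all lanes agree on the branch, the warp pays for one path; when they disagree, it pays for both.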

[20:18] GPUs as Fast Matrix Multipliers

[22:03] Compute Scaling is Faster than Memory Scaling

[23:55] Recap: GPUs - What are they and how do they work?

[25:02] Part 2: Making ML Workloads Fast on a GPU

[26:21] What Makes ML Workloads Fast? The Roofline Model
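
The roofline model bounds a kernel's attainable throughput by the lower of two roofs: peak compute, and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). A minimal sketch follows. The device numbers are round hypothetical values chosen for easy arithmetic, not the specification of any real GPU, and the function names are illustrative.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """True when intensity is below the ridge point peak_flops / mem_bandwidth."""
    return intensity < peak_flops / mem_bandwidth


# Hypothetical round-number device, not a real GPU spec.
PEAK = 100e12  # 100 TFLOP/s of compute
BW = 1e12      # 1 TB/s of memory bandwidth

for intensity in (2.0, 500.0):  # FLOPs per byte: low (elementwise-like), high (matmul-like)
    print(intensity,
          attainable_flops(intensity, PEAK, BW),
          is_memory_bound(intensity, PEAK, BW))
```

With these assumed numbers the ridge point sits at 100 FLOPs per byte: below it the kernel is limited by memory bandwidth, above it by compute.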

[27:36] How Do We Make GPUs Go Fast? (Tricks)
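
Tiling is one of the tricks named in the TL;DR for reducing global memory accesses. The sketch below counts simulated global-memory reads for a naive and a tiled square matrix multiply in plain Python. It assumes a simplified cost model with no caching, where every operand read in the naive version goes to global memory, and where a tile staged in "shared memory" is read from global memory once. The function names and the counting scheme are illustrative assumptions, not measurements of a real GPU.

```python
def naive_matmul(A, B):
    """n x n matmul that reads both operands from 'global memory' every time."""
    n = len(A)
    loads = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                loads += 2  # one read of A and one of B per multiply-add
    return C, loads


def tiled_matmul(A, B, t):
    """Same product, but stages t x t tiles in 'shared memory' before use."""
    n = len(A)
    assert n % t == 0, "sketch assumes the tile size divides n"
    loads = 0
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # Load one tile of A and one tile of B from global memory.
                a = [[A[i0 + i][k0 + k] for k in range(t)] for i in range(t)]
                b = [[B[k0 + k][j0 + j] for j in range(t)] for k in range(t)]
                loads += 2 * t * t
                # All multiply-adds below reuse the staged tiles.
                for i in range(t):
                    for j in range(t):
                        for k in range(t):
                            C[i0 + i][j0 + j] += a[i][k] * b[k][j]
    return C, loads


n, t = 8, 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
C_naive, naive_loads = naive_matmul(A, B)
C_tiled, tiled_loads = tiled_matmul(A, B, t)
print(C_naive == C_tiled, naive_loads, tiled_loads)
```

Under this cost model the naive version performs 2n^3 global reads and the tiled version 2n^3 / t, so a larger tile means proportionally fewer global reads, limited in practice by how much fits in shared memory.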

[1:10:50] Putting it all together: The Forward Pass of Flash Attention

[1:11:10] Flash Attention: How it Dramatically Accelerates Attention
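
The TL;DR singles out online softmax as a key ingredient of Flash Attention. The sketch below shows only that ingredient, in plain Python: computing the softmax normalizer in one streaming pass over chunks of the input by keeping a running maximum and a rescaled running sum. It is not an implementation of Flash Attention itself, which also tiles the Q, K, and V matrices and rescales a running output; the function names and the chunk size are illustrative assumptions.

```python
import math


def softmax(xs):
    """Reference softmax that needs the whole vector to find the max first."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def online_softmax_stats(xs, chunk):
    """One streaming pass: running max m and running sum of exp(x - m)."""
    m, s = float("-inf"), 0.0
    for start in range(0, len(xs), chunk):
        block = xs[start:start + chunk]
        m_new = max(m, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new
    return m, s


xs = [1.0, 3.0, -2.0, 0.5, 7.0, 2.0]
m, s = online_softmax_stats(xs, chunk=2)
streamed = [math.exp(x - m) / s for x in xs]
max_diff = max(abs(a - b) for a, b in zip(streamed, softmax(xs)))
print(max_diff)  # difference from the reference softmax
```

Because the statistics can be updated chunk by chunk, the full row of attention scores never has to be materialized in slow memory at once, which is what makes the tiled computation possible.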

[1:12:54] Recap for the Whole Lecture


Practical Takeaways

  - Memory is King: For modern ML on GPUs, memory bandwidth is often the bottleneck, not raw FLOPs. Prioritize minimizing global memory access.
  - Embrace Low-Level Optimizations: Techniques like tiling, fusion, coalescing, and recomputation are not just academic exercises; they are essential for achieving high performance.
  - Understand the Memory Hierarchy: Design algorithms to keep data in the fastest available memory (registers, shared memory) as much as possible.
  - Be Mindful of Data Access Patterns: Optimize memory access patterns to leverage hardware features like burst mode and avoid control divergence.
  - Leverage Low Precision: Use FP16, BF16, or INT8 where possible to reduce memory traffic and increase effective bandwidth.

Open Questions / Things to Remember

  - How do different GPU architectures (e.g., NVIDIA vs. AMD vs. Intel) impact the optimal values for tiling sizes, burst sections, and other low-level parameters?
  - What are the practical implications of these low-level optimizations for researchers who primarily use high-level frameworks like PyTorch or TensorFlow? (e.g., how much can compilers like torch.compile automate?)
  - How do these principles extend to multi-GPU and distributed training scenarios?
  - What are the trade-offs between numerical stability and performance when using mixed precision and quantization?