CVPR 2024 Tutorial on Full-stack Acceleration of Deep Learning

Event: CVPR 2024 Tutorial · Duration: 174 min

Abstract

This tutorial covers full-stack acceleration of deep learning in three parts: foundational hardware concepts, neural network acceleration techniques, and efficient vision-language models. It examines the trade-off between flexibility and efficiency in hardware, model compression strategies such as quantization and pruning, and multi-modal foundation models that can understand and generate across several data types. Throughout, it emphasizes practical applications, performance optimization, and hardware-software co-design for deploying large AI models.

Speakers

Jason Clemons — Senior Research Scientist, NVIDIA Research – Architecture Research Group (ARG)
Maying Shen — Senior Research Engineer, NVIDIA Research
Hongxu (Danny) Yin — Staff Research Scientist, NVIDIA Research

Talks (3)

00:01:13 — Jason Clemons: Foundations of DL Hardware And How to Apply Them
- Discusses the spectrum of hardware systems for deep learning, from flexible CPUs to efficient ASICs, detailing their architectures, performance metrics, and optimization strategies.
01:04:25 — Maying Shen: Neural Network Acceleration
- Explores techniques for neural network acceleration, focusing on model compression methods like quantization and pruning, and their application in large model deployment for autonomous vehicles.
01:08:30 — Hongxu (Danny) Yin: Efficient Vision Language Models
- Introduces VILA, a visual language model pre-trained for multi-modality understanding, reasoning, and generation, highlighting its efficiency, performance, and deployment across various hardware platforms.
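
The second talk names quantization and pruning as its two main compression methods. As a rough illustration only, the sketch below shows the simplest textbook form of each: symmetric per-tensor int8 quantization and magnitude-based pruning. It is not the speakers' method (the tutorial discusses more advanced approaches such as AWQ and HALP); the function names and the toy weight list are assumptions made for this example.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clip to the int8 range [-127, 127].
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    w = [0.02, -1.27, 0.5, -0.003, 0.9, 0.0]  # toy weights, not real model data
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("int8 codes:", q)
    print("max abs error:", max(abs(a - b) for a, b in zip(w, w_hat)))
    print("50% pruned:", magnitude_prune(w, 0.5))
```

With round-to-nearest and no clipping, the per-weight quantization error stays within half a quantization step (`scale / 2`), which is why the choice of scale matters so much for tensors with outliers.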

Key Takeaways

Deep learning hardware design involves a fundamental trade-off between flexibility (e.g., CPUs) and efficiency (e.g., ASICs), with GPUs offering a balance through parallel processing and specialized units like Tensor Cores.
Neural network acceleration is crucial for deploying large models on diverse platforms, from edge devices to cloud servers, and relies heavily on techniques like quantization (reducing numerical precision) and pruning (removing redundant parameters).
Multi-modal foundation models, such as VILA and X-VILA, demonstrate advanced capabilities in understanding, reasoning, and generating across various modalities (image, text, video, audio), leveraging proper training recipes and data blending strategies for optimal performance.
Effective model compression and acceleration require careful consideration of hardware characteristics, performance metrics like arithmetic intensity and memory bandwidth, and the use of specialized tools and frameworks like TensorRT and NVIDIA profiling tools.
Robustness and generalization are critical for compressed models, especially in real-world applications like autonomous driving. Both can be improved by encouraging sparsity and robustness at the same time, using techniques like iterative prune & grow and flatness-aware optimization.
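
The takeaways above mention arithmetic intensity and memory bandwidth as key performance metrics. A minimal sketch of how the two combine in a standard roofline estimate follows. The function names are made up for this example, the peak-compute and bandwidth figures are placeholder values rather than specifications of any real GPU, and the byte count assumes each matrix is moved between memory and the compute units exactly once.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix moved once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: the lower of peak compute and intensity * bandwidth."""
    return min(peak_flops, intensity * mem_bandwidth)


if __name__ == "__main__":
    PEAK = 100e12  # hypothetical peak throughput, FLOP/s
    BW = 1e12      # hypothetical memory bandwidth, bytes/s

    # Large square matrix multiply in 16-bit precision: high intensity.
    big = gemm_arithmetic_intensity(4096, 4096, 4096, bytes_per_elem=2)
    # Matrix-vector-like shape (batch size 1): low intensity.
    thin = gemm_arithmetic_intensity(1, 4096, 4096, bytes_per_elem=2)

    for name, ai in [("4096^3 GEMM", big), ("batch-1 GEMV", thin)]:
        bound = attainable_flops(ai, PEAK, BW)
        kind = "compute-bound" if bound >= PEAK else "memory-bound"
        print(f"{name}: {ai:.2f} FLOP/byte, {bound:.3e} FLOP/s attainable ({kind})")
```

Under these assumptions, halving `bytes_per_elem` (for example, moving from 16-bit to 8-bit values) doubles the arithmetic intensity of the same operation, which is one way to see why quantization helps memory-bound workloads.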

Methods / Models / Datasets Mentioned

VILA
X-VILA
AWQ
GPT-4o (Vision)
SenseChat-Vision-0423-Preview
Gemini 1.5 Pro
Gemini 1.5 Flash
GPT-4V
Qwen-VL-MAX
Qwen-VL-Chat
LLaVA-1.6-34B
Yi-VL-34B
Qwen-VL-Plus
Marco-VL
Weitu-VL-1.0-15B
InternLM-XComposer-VL
Yi-VL-6B
InfiMM-Zephyr-7B
InternVL-Chat-VL
SVIT
InstructBLIP
LLaVA-1.5
POPE
VQA (Visual Question Answering)
TextVQA
COCO (Common Objects in Context)
Flickr
TinyChat
TensorRT-LLM
TensorRT
NVTX
Nsight Systems (nsys)
Nsight Compute (ncu)
Nsight Graphics (nsight-gfx)
CUDA
cuBLAS
cuDNN
PyTorch
TensorFlow
MXNet
Group Lasso penalty
Taylor Importance
Optimal Brain Damage (OBD)
Optimal Brain Surgeon (OBS)
Skeletonization
SNIP
HALP (Hardware-aware Latency Pruning)
EagleEye
AutoSlim
MetaPruning
SMCP (Soft-Masking Channel Pruning)
AdaSAP (Adaptive Sharpness-Aware Pruning)
GReg-1
GReg-2
ResNet50
ResNet18
SegFormer
Blackwell GPU
NVIDIA DRIVE Thor
Llama-7B
Llama-2
AdamW (optimizer)
SGD (optimizer)
Adam (optimizer)

Topics

Deep Learning Hardware Architectures (CPU, GPU, Systolic Array, DLA, ASIC) · Hardware Flexibility vs. Efficiency · GPU Architecture and Tensor Cores · System-on-Chips (SoCs) · Neural Network Acceleration Techniques (Quantization, Pruning) · Vision-Language Models (VLM) · Multi-modality AI (Video, Image, Language, Audio) · Performance Optimization (Arithmetic Intensity, Bandwidth) · Model Compression for Autonomous Vehicles

Notes

Open for commentary — connections to other work, critiques, follow-up reading.