Computer Vision Foundation Talk/Workshop

Event: CVPR 2024 Workshop on Efficient Large Vision Models · Duration: 64 min

Abstract

This video presents two talks from the CVPR 2024 Workshop on Efficient Large Vision Models. The first talk introduces VILA, an efficient visual language model, and describes a full-stack approach to deploying general AI models on edge devices: multi-modal pre-training, activation-aware weight quantization (AWQ), and a lightweight inference engine. The second talk covers accelerating image synthesis with Adversarial Diffusion Distillation (ADD) and Latent Adversarial Diffusion Distillation (LADD), which produce high-quality, high-resolution images in far fewer sampling steps and so reduce the inference cost of large diffusion models. Both talks treat efficiency as a requirement for scaling and deploying large vision models.

Speakers

Song Han — Associate Professor, MIT; Distinguished Scientist, NVIDIA
Robin Rombach — Stability AI

Talks (2)

00:00:30 — Song Han: VILA and Efficient Visual Language Models
- This talk introduces VILA, an efficient visual language model, and discusses the full-stack optimization pipeline for deploying general AI models with world knowledge on edge devices, covering multi-modal pre-training, model compression via quantization, and efficient inference engines.
00:30:50 — Robin Rombach: Fast Image Synthesis with Adversarial Diffusion Distillation
- This talk explores methods for fast image synthesis, focusing on Adversarial Diffusion Distillation (ADD) and Latent Adversarial Diffusion Distillation (LADD) to achieve high-quality, high-resolution image generation with fewer sampling steps, addressing the computational cost and inference speed limitations of traditional diffusion models.

Key Takeaways

Full-stack optimization, encompassing pre-training, compression, and deployment, is crucial for bringing powerful AI models like VLMs to resource-constrained edge devices.
Quantization techniques like AWQ and SmoothQuant are vital for reducing model size and memory footprint, enabling efficient inference and even on-device training.
Adversarial Diffusion Distillation (ADD) offers a promising path to significantly accelerate image synthesis by reducing sampling steps while maintaining high image quality.
Operating in the latent space (LADD) is key for scaling adversarial diffusion distillation to high-resolution image generation and supporting multi-aspect ratios.
The choice of diffusion formalism, architecture, and sampling strategy (e.g., focused sampling with a logit-normal distribution) strongly affects the efficiency and quality of generated content.
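
As a rough illustration of the logit-normal sampling mentioned in the last takeaway, the sketch below draws values in (0, 1) by passing a Gaussian sample through a sigmoid. It assumes that "focused sampling" refers to choosing training timesteps this way; the function names, the mean and std defaults, and the seed are hypothetical and not taken from the talk.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_timesteps(n, mean=0.0, std=1.0, seed=0):
    """Draw n values in (0, 1) from a logit-normal distribution.

    Each value is sigmoid(u) with u ~ Normal(mean, std).
    mean and std are illustrative defaults (an assumption).
    """
    rng = random.Random(seed)
    return [sigmoid(rng.gauss(mean, std)) for _ in range(n)]


if __name__ == "__main__":
    ts = sample_timesteps(10_000)
    # Share of samples in the middle half of the interval; a uniform
    # sampler would put about half of its samples there.
    middle = sum(0.25 <= t <= 0.75 for t in ts) / len(ts)
    print(f"share in [0.25, 0.75]: {middle:.2f}")
```

With mean 0 the samples concentrate around 0.5, so intermediate timesteps are visited more often than under uniform sampling. Shifting the mean moves that focus toward either end of the interval.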

Methods / Models / Datasets Mentioned

VILA
Once-for-all Network
MCUNet
TinyNAS
TinyEngine
AWQ
SmoothQuant
TinyChat
TensorRT-LLM
StyleGAN-T
LCM-XL
SDXL
SD3-Turbo
MM-DiT
UViT
DiT
CrossDiT
Rectified Flow
Flow Matching
EDM
DDPM
SDE
DINOv2
LADD
DPO
MLPerf

Topics

Visual Language Models (VLM) · Efficient AI · Model Compression · Quantization (AWQ, SmoothQuant) · Edge AI Deployment · Image Synthesis · Diffusion Models · Adversarial Diffusion Distillation (ADD) · Latent Diffusion Models (LDM) · Hardware-aware Neural Architecture Search
