Computer Vision Foundation Talk/Workshop
Event: CVPR 2024 Workshop on Efficient Large Vision Models · Duration: 64 min
Abstract
This video presents two talks from the CVPR 2024 Workshop on Efficient Large Vision Models. The first talk introduces VILA, an efficient visual language model, detailing a full-stack optimization approach for deploying general AI models on edge devices through multi-modal pre-training, activation-aware weight quantization, and a lightweight inference engine. The second talk focuses on accelerating image synthesis with Adversarial Diffusion Distillation (ADD) and Latent Adversarial Diffusion Distillation (LADD), showcasing techniques to achieve high-quality, high-resolution image generation with significantly reduced inference steps while addressing the computational challenges of large diffusion models. Both talks emphasize the importance of efficiency and scalability in the evolving landscape of AI.
Speakers
- Song Han — Associate Professor, MIT; Distinguished Scientist, NVIDIA
- Robin Rombach — Stability AI
Talks (2)
- 00:00:30 — Song Han: VILA and Efficient Visual Language Models
- This talk introduces VILA, an efficient visual language model, and discusses the full-stack optimization pipeline for deploying general AI models with world knowledge on edge devices, covering multi-modal pre-training, model compression via quantization, and efficient inference engines.
- 00:30:50 — Robin Rombach: Fast Image Synthesis with Adversarial Diffusion Distillation
- This talk explores methods for fast image synthesis, focusing on Adversarial Diffusion Distillation (ADD) and Latent Adversarial Diffusion Distillation (LADD) to achieve high-quality, high-resolution image generation with fewer sampling steps, addressing the computational cost and inference speed limitations of traditional diffusion models.
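To make the quantization step from the first talk more concrete, here is a minimal Python sketch of the general idea behind activation-aware weight quantization: input channels that see large activations are scaled up before low-bit rounding, and the activations are divided by the same scale so the full-precision product is unchanged. This is an illustration under stated assumptions, not the AWQ reference implementation. The function names, the fixed exponent `alpha` (the published method searches for the scaling rather than fixing it), the per-column round-to-nearest quantizer, and the synthetic data are all hypothetical choices made here.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Round-to-nearest quantization per output column; returns dequantized floats."""
    qmax = 2 ** n_bits - 1
    w_min = w.min(axis=0, keepdims=True)
    w_max = w.max(axis=0, keepdims=True)
    step = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((w - w_min) / step), 0, qmax)
    return q * step + w_min

def awq_style_quantize(w, x_calib, alpha=0.5, n_bits=4):
    """Scale input channels by activation magnitude, then quantize.

    w: (in_features, out_features), x_calib: (n_samples, in_features).
    alpha is fixed here for simplicity (an assumption of this sketch).
    """
    act_mag = np.abs(x_calib).mean(axis=0)        # per-input-channel magnitude
    s = np.maximum(act_mag, 1e-8) ** alpha        # larger scale for salient channels
    s = s / np.sqrt(s.max() * s.min())            # keep scales centered around 1
    return quantize_rtn(w * s[:, None], n_bits), s

# Synthetic example: four input channels carry 10x larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 10.0
w = rng.normal(size=(64, 32))

w_q, s = awq_style_quantize(w, x)
y_ref = x @ w                 # full-precision output
y_q = (x / s) @ w_q           # activations are divided by the same scales
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
```

Dividing the activations by `s` is what keeps the layer mathematically equivalent before rounding; only the rounding error is redistributed away from the channels that matter most.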
Key Takeaways
- Full-stack optimization, encompassing pre-training, compression, and deployment, is crucial for bringing powerful AI models like VLMs to resource-constrained edge devices.
- Quantization techniques like AWQ and SmoothQuant are vital for reducing model size and memory footprint, enabling efficient inference and even on-device training.
- Adversarial Diffusion Distillation (ADD) offers a promising path to significantly accelerate image synthesis by reducing sampling steps while maintaining high image quality.
- Operating in the latent space (LADD) is key for scaling adversarial diffusion distillation to high-resolution image generation and supporting multiple aspect ratios.
- The choice of diffusion formalism, architecture, and sampling strategy (e.g., focused sampling with Logit-Normal distribution) profoundly impacts the efficiency and quality of generated content.
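The logit-normal sampling mentioned in the last takeaway can be sketched in a few lines of Python. The sketch assumes the usual construction, where a timestep in (0, 1) is obtained by passing a Gaussian sample through a sigmoid. The function names and the `loc` and `scale` defaults are illustrative assumptions, not values taken from the talk.

```python
import numpy as np

def sample_logit_normal(n, loc=0.0, scale=1.0, seed=0):
    """Draw n timesteps in (0, 1) as sigmoid(z) with z ~ Normal(loc, scale)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(loc, scale, size=n)
    return 1.0 / (1.0 + np.exp(-z))

def mid_fraction(t):
    """Fraction of timesteps that fall in the middle band (0.25, 0.75)."""
    return float(np.mean((t > 0.25) & (t < 0.75)))

t_focused = sample_logit_normal(100_000)
t_uniform = np.random.default_rng(1).uniform(size=100_000)

print("logit-normal mid fraction:", mid_fraction(t_focused))
print("uniform mid fraction:     ", mid_fraction(t_uniform))
```

With `loc=0` and `scale=1`, roughly 73% of the sampled timesteps are expected to land in the middle band, compared with 50% under uniform sampling. Shifting `loc` moves the focus toward either end of the noise schedule, and `scale` controls how tightly the samples concentrate.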
Methods / Models / Datasets Mentioned
VILA · Once-for-all Network · MCUNet · TinyNAS · TinyEngine · AWQ · SmoothQuant · TinyChat · TensorRT-LLM · StyleGAN-T · LCM-XL · SDXL · SD3-Turbo · MMDIT · UVIT · DIT · CrossDIT · Rectified Flow · Flow Matching · EDM · DDPM · SDE · DINOv2 · LADD · DPO · MLPerf
Topics
Visual Language Models (VLM) · Efficient AI · Model Compression · Quantization (AWQ, SmoothQuant) · Edge AI Deployment · Image Synthesis · Diffusion Models · Adversarial Diffusion Distillation (ADD) · Latent Diffusion Models (LDM) · Hardware-aware Neural Architecture Search
Notes
Open for commentary — connections to other work, critiques, follow-up reading.