First Workshop on Efficient and On-Device Generation (EDGE)

Event: CVPR 2024: EDGE Workshop · Duration: 506 min

Abstract

This segment provides an in-depth look into 3D Gaussian Splatting, starting with an introduction to virtual scenes and various scene representation techniques. It then details the core concepts, rendering process, and adaptive densification strategies of 3D Gaussian Splatting, showcasing its high fidelity and real-time performance. The discussion extends to recent advancements, addressing limitations like aliasing, optimization challenges, and memory footprint, along with its application in large-scale scene rendering and high-dimensional neural rendering. The segment also includes introductions to subsequent talks covering machine learning generalization in open-world settings, efficient multi-modal LLMs on the edge, and advanced visual synthesis models.
This segment features three talks on efficient AI models. The first talk focuses on optimizing multi-modal LLMs for edge devices through techniques like pruning, quantization, and hardware-aware neural architecture search. The second presentation delves into efficient generative models for visual understanding, covering flow matching, image diffusion translation, and zero-shot conditional low-rank adaptation. The final talk introduces Dream Machine, a video generation model, and addresses the scalability challenges of 3D data and its generation using AI.
This segment delves into the challenges of 3D data scalability and proposes leveraging 2D foundation models for 3D generation. The speaker demonstrates how models trained on videos can implicitly learn complex 3D properties like depth, light transport, and dynamics, enabling the creation of interactive and stylized 3D content from single images. While highlighting the promising capabilities, the presentation also addresses current limitations, including issues with intuitive physics, camera control, and object consistency, and concludes by discussing the future potential of 4D generation for embodied AI.
This segment focuses on the challenges and solutions for enabling content generation models, particularly diffusion models, to run efficiently on mobile devices. The speaker introduces three key techniques: SnapFusion for reducing inference latency, BitsFusion for minimizing model storage through quantization, and TextCrafter for enhancing generation quality. The presentation highlights the potential for these optimizations to unlock real-time, on-device AI applications, including advanced image and video generation, offering a better user experience with lower costs and improved data privacy.
This segment features three speakers discussing advancements in diffusion models. Yanwu Xu concludes a discussion on BitsFusion, a quantization method for Stable Diffusion, highlighting its compression capabilities and performance. Tim Salimans then introduces two multi-step distillation methods for diffusion models, emphasizing configurable trade-offs between speed and quality, and detailing improvements to DDIM sampling. Finally, Cheng Lu presents DPM-Solver, a training-free framework for fast diffusion model sampling, explaining how it reduces discretization errors and achieves stable sampling, even for guided image generation.
This segment begins with a discussion on SDE variants of DPM-Solver++ and a comparison of ODE and SDE solvers with DeepFloyd-IF, highlighting improved sample quality and faster convergence for SDE-based DPM-Solver++. The presentation then transitions to a new talk introducing “Rectified Flow,” a novel approach for fast generative modeling. This method frames unsupervised learning as a distribution transport problem, focusing on finding “nice” (straight and non-intersecting) pairings between distributions. The speaker details the theoretical underpinnings of Rectified Flow, including its iterative “Reflow” procedure, and demonstrates its application in InstaFlow for one-step, high-quality text-to-image generation, outperforming existing methods in speed and FID scores. The talk also covers theoretical guarantees, practical results, and ongoing research directions related to diffusion models and optimal transport.

Speakers

Thomas Leimkuehler
George Kopanas
Yanwu Xu
Prof. Shanghang Zhang — Peking University
Dr. Naiming Song
Song Han — Associate Professor, MIT; Distinguished Scientist, NVIDIA
Björn Ommer — Computer Vision & Learning Group, University of Munich
Jiaming Song — Chief Scientist, Luma AI
Jian Ren — Snap Inc.
Tim Salimans — Google DeepMind
Cheng Lu — OpenAI
Qiang Liu — UT Austin

Talks (19)

00:00:00 — Thomas Leimkuehler: What We Want: Virtual Scenes
- This talk introduces the concept of virtual scenes, various scene representations like meshes, voxel grids, and neural fields, and then delves into 3D Gaussian Splatting as a primitive-based representation, covering its rendering algorithm, adaptive densification strategy, and performance evaluation.
00:15:09 — George Kopanas: Gaussian Splatting Recent Advancements
- This talk builds upon the introduction to 3D Gaussian Splatting, discussing its advantages as an ideal 3D representation. It addresses limitations such as aliasing, optimization heuristics, and memory footprint, while also exploring its extension to very large datasets and high-dimensional neural rendering.
00:28:30 — Prof. Shanghang Zhang: Towards Machine Learning Generalization in the Open World
- This talk discusses challenges in machine learning generalization in open-world scenarios, focusing on domain shift and category shift. It proposes methods for domain adaptation using domain-invariant representations and sample-efficient learning for new tasks, including open-vocabulary 3D object detection.
00:30:00 — Prof. Song Han: Efficient Multi-modal LLM on the Edge
- This talk presents a cloud-device collaborative adaptation framework for efficient multi-modal large language models (LLMs) on edge devices. It introduces uncertainty-guided sampling and visual prompt learning to enhance performance in continually changing environments, and explores applications in object goal navigation and 3D LiDAR understanding.
00:31:10 — Dr. Naiming Song: Beyond Diffusion: Efficient Models for Visual Synthesis
- This talk focuses on developing efficient models for visual synthesis beyond traditional diffusion models. It explores methods for faster and more controllable image generation, aiming to improve the practical applicability of generative AI in various visual tasks.
00:32:20 — Dr. Jiaming Song: Dream Machine
- This talk introduces ‘Dream Machine’, a novel approach to visual synthesis that leverages advanced generative models to create highly realistic and controllable visual content. It delves into the underlying architecture and techniques that enable the generation of complex and diverse imagery.
01:52:13 — Song Han: Efficient Multi-modal LLM on the Edge
- This talk discusses techniques for efficient multi-modal Large Language Models (LLMs) on edge devices, focusing on model compression, hardware-aware neural architecture search, and on-device training.
02:48:47 — Yanwu Xu: 3D data has a scalability issue
- The speaker highlights the scarcity of 3D data compared to abundant 2D image and video data, posing a scalability challenge for 3D applications.
02:53:47 — Yanwu Xu: Learning “light transport” from watching videos
- The speaker demonstrates the model’s capacity to learn complex light transport phenomena, including reflections, refractions, and how light interacts with different materials and colors in a scene.
02:59:23 — Yanwu Xu: Efficiency
- The speaker emphasizes the importance of efficient solutions for making these models widely accessible, noting that current diffusion models and video contexts pose unique challenges compared to large language models.
03:16:21 — Björn Ommer: Beyond Diffusion: Efficient Models for Visual Synthesis
- This talk explores efficient generative models for visual understanding, focusing on flow matching for boosting diffusion models, translating image diffusion to different modalities, and zero-shot conditional low-rank adaptation.
04:09:46 — Jiaming Song: Dream Machine
- This talk introduces Dream Machine, a new video generation model, and discusses the challenges and solutions for 3D data scalability, 3D capture, and 3D generation using AI.
05:37:35 — Yanwu Xu: BitsFusion - Results
- The speaker discusses the applicability of BitsFusion to different model sizes and its interaction with existing Stable Diffusion tools like ControlNet, emphasizing the trade-offs between speed and storage.
05:38:15 — Tim Salimans: Two new distillation methods for few-step generation with diffusion models
- Tim Salimans introduces two new distillation methods for diffusion models, focusing on multi-step distillation to achieve a configurable trade-off between quality and speed, and discusses improving DDIM sampling.
05:41:55 — Cheng Lu: DPM-Solver: Training-Free Fast Samplers for Diffusion Models
- Cheng Lu presents DPM-Solver, a principled framework for fast sampling of diffusion models, explaining how it reduces discretization errors and achieves stable sampling even under large guidance scales.
06:10:00 — Jian Ren: Content Generation on Mobile Devices
- This talk presents techniques to optimize large-scale diffusion models for efficient content generation on mobile devices, addressing challenges like high latency and storage requirements.
07:01:59 — Yanwu Xu: SDE variant of DPM-Solver++
- Explains that sampling by DDPM is equivalent to first-order discretization of diffusion SDEs and presents the Diffusion SDE and its solution.
07:21:58 — Yanwu Xu: ODE and SDE solvers with DeepFloyd-IF
- Compares various ODE and SDE solvers, demonstrating that SDE variants of DPM-Solver++ achieve better sample quality and faster convergence.
08:14:59 — Qiang Liu: Rectified Flow: A Straight Approach to Fast Generation
- Introduces Rectified Flow as a method for fast generative modeling by finding straight and non-intersecting transport mappings between distributions, detailing its theoretical guarantees and practical application in InstaFlow for one-step text-to-image generation.

Key Takeaways

3D Gaussian Splatting is a versatile and efficient primitive-based scene representation capable of high-fidelity, real-time rendering, outperforming NeRF baselines in speed while maintaining quality.
Ongoing research in 3D Gaussian Splatting addresses key limitations such as aliasing (Mip-Splatting), optimization complexity (novel densification heuristics), and memory footprint (compression techniques), making it more robust and scalable.
The technology is being extended to handle very large-scale environments through hierarchical level-of-detail structures and to represent high-dimensional signals for advanced neural rendering applications.
Future directions for 3D Gaussian Splatting involve leveraging its efficiency to bootstrap new ideas and applications, potentially solving fundamental problems in Radiance Fields and contributing to broader machine learning generalization in open-world scenarios.
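The rendering step behind the 3D Gaussian Splatting takeaways above is front-to-back alpha blending of depth-sorted primitives. The following is a minimal sketch of that blending for a single pixel, not the presenters' implementation: the function name is hypothetical, and the per-sample alpha values are taken as inputs (in an actual splatting renderer they would be derived from each Gaussian's opacity and its projected footprint at the pixel).

```python
def composite_front_to_back(colors, alphas):
    """Blend depth-sorted samples (nearest first) into one RGB pixel.

    colors: list of (r, g, b) tuples; alphas: list of floats in [0, 1].
    Returns the blended pixel and the remaining transmittance.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer samples
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha  # contribution of this sample
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # stop once the pixel is effectively opaque
            break
    return pixel, transmittance


# Usage: a half-transparent red sample in front of a half-transparent green one.
pixel, remaining = composite_front_to_back(
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5]
)
```

The early exit is one reason sorted blending can be fast: samples hidden behind an opaque surface are never touched.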
Efficient AI for edge devices is crucial for modern LLMs, utilizing techniques like pruning, quantization, and hardware-aware neural architecture search to reduce model size and improve performance.
Generative models for visual understanding can be made more efficient through flow matching, enabling faster inference and better generalization across different modalities and resolutions.
3D data generation and capture present significant scalability challenges, but innovative approaches like Dream Machine and 3D capture technologies are making progress in creating high-quality, interactive 3D content.
Novel methods like AWQ and SmoothQuant are being integrated into inference frameworks like TensorRT-LLM to enable efficient deployment of large multi-modal models on various hardware platforms, from mobile GPUs to microcontrollers.
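As a rough illustration of what quantization means in the takeaways above, here is a sketch of plain symmetric uniform quantization of a weight list. This is a generic textbook scheme chosen for illustration; it is not AWQ, SmoothQuant, or any other method named in the talks, and the function names are hypothetical.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from integer levels."""
    return [q * scale for q in quantized]


# Usage: 8-bit round trip; each weight is recovered to within half a step.
weights = [-1.0, 0.0, 0.5, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
```

Storing the integers plus a single scale is what shrinks the model; the methods discussed in the talks differ mainly in how they choose scales and precision to limit the resulting error.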
The scarcity of 3D data can be addressed by fine-tuning large 2D foundation models to generate 3D assets from images and videos.
Video generation models can implicitly learn sophisticated 3D properties such as depth, light transport, and object dynamics without explicit 3D supervision.
Despite impressive capabilities in generating realistic and stylized 3D content, current models face challenges with intuitive physics, precise camera control, and maintaining object consistency in dynamic scenes.
The future of 3D generation lies in 4D (3D + time) generation, which could significantly impact fields like embodied AI.
SnapFusion significantly reduces inference latency of diffusion models on mobile devices by optimizing the UNet architecture and employing step distillation techniques.
BitsFusion tackles the large storage footprint of diffusion models by implementing mixed-precision quantization, allowing for substantial model size reduction with minimal quality degradation.
TextCrafter improves the quality of generated images by fine-tuning the text encoder, enabling better adherence to text prompts and supporting various downstream tasks like ControlNet and InPainting.
Optimizing diffusion models for mobile devices offers advantages such as lower operational costs, enhanced data privacy by keeping processing on-device, and a superior real-time user experience for content creation.
BitsFusion offers significant compression for Stable Diffusion models with minimal quality loss, demonstrating its potential for efficient deployment.
Multi-step distillation methods provide a configurable trade-off between sampling speed and generation quality in diffusion models, allowing users to balance performance based on application needs.
DPM-Solver is a training-free framework that significantly accelerates diffusion model sampling by reducing discretization errors and leveraging exact computation of linear parts.
Moment matching and improved DDIM variants can lead to faster and more stable diffusion model sampling, with some distilled models even outperforming their teacher models in FID scores.
Higher-order solvers in diffusion models can exhibit instability under large guidance scales, necessitating careful design and potentially lower-order methods for robust guided sampling.
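To make the DPM-Solver takeaways above more concrete, the sketch below writes out a first-order DPM-Solver step for a scalar sample and checks it against a deterministic DDIM step, which it matches. It assumes a noise-prediction model and a schedule x = alpha * x0 + sigma * noise; the function names and the example numbers are illustrative assumptions, not code from the talk. The term (alpha_t / alpha_s) * x_s is the linear part computed exactly, and only the noise prediction is approximated as constant over the step.

```python
import math


def dpm_solver_first_order(x_s, eps, alpha_s, sigma_s, alpha_t, sigma_t):
    """One first-order step from time s to time t using noise prediction eps."""
    lambda_s = math.log(alpha_s / sigma_s)  # half log-SNR at s
    lambda_t = math.log(alpha_t / sigma_t)  # half log-SNR at t
    h = lambda_t - lambda_s
    return (alpha_t / alpha_s) * x_s - sigma_t * math.expm1(h) * eps


def ddim_step(x_s, eps, alpha_s, sigma_s, alpha_t, sigma_t):
    """Deterministic DDIM step: predict x0, then re-noise to level t."""
    x0_pred = (x_s - sigma_s * eps) / alpha_s
    return alpha_t * x0_pred + sigma_t * eps


# Usage: both updates agree up to floating-point rounding.
args = (1.3, -0.4, 0.6, 0.8, 0.8, 0.6)  # x_s, eps, alpha_s, sigma_s, alpha_t, sigma_t
difference = abs(dpm_solver_first_order(*args) - ddim_step(*args))
```

Higher-order variants reuse the same exact linear term and add correction terms built from several noise predictions, which is where the instability under large guidance scales noted above can enter.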
Rectified Flow is a novel method for fast generation that aims to learn a continuous-time process (ODE) whose trajectories are straight and non-intersecting, effectively finding ‘straight transport’ mappings between distributions.
The iterative ‘Reflow’ procedure in Rectified Flow helps untangle crossing trajectories by alternating between connecting pairs with straight lines and fitting a neural ODE, leading to convergence to straight flows.
InstaFlow, an application of one-step rectified flow, demonstrates state-of-the-art performance in text-to-image generation, achieving high image quality with significantly faster inference times compared to other diffusion models.
Rectified Flow provides strong theoretical guarantees, including marginal invariance and lower transport costs for all convex cost functions, ensuring the generated samples maintain the desired distribution properties.
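The Rectified Flow takeaways above can be sketched in a few lines: pair a source point with a target point, interpolate along the straight line between them, and regress a velocity model onto the constant direction of that line. The code below is a minimal, dependency-free illustration under stated assumptions: `velocity_fn` stands in for a neural network, and all function names are hypothetical. Reflow, as described above, would repeat the same fitting step on pairs formed from a source point and the output of the current ODE; that loop is not shown here.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 to x1, and the line's velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def rectified_flow_loss(velocity_fn, pairs, rng):
    """Mean squared error between predicted and straight-line velocities."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()  # one uniformly sampled time per pair
        x_t, target = interpolate(x0, x1, t)
        pred = velocity_fn(x_t, t)
        total += sum((p - v) ** 2 for p, v in zip(pred, target))
    return total / len(pairs)


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Usage: when the target is a pure shift of the source, a constant velocity
# is a perfect fit and a single Euler step already lands on the target.
shift = [2.0, -1.0]
pairs = [([0.0, 1.0], [2.0, 0.0]), ([1.0, 1.0], [3.0, 0.0])]
constant_velocity = lambda x, t: shift
loss = rectified_flow_loss(constant_velocity, pairs, random.Random(0))
sample = euler_sample(constant_velocity, [0.0, 1.0], num_steps=1)
```

The toy case shows why straight trajectories matter for speed: when the learned flow is straight, one Euler step incurs no discretization error, which is the property one-step generation relies on.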

Methods / Models / Datasets Mentioned

2DETR
3D Gaussian Splatting
3D Gaussian Splatting as Markov Chain Monte Carlo
A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets
AKD (Adapter-based Knowledge Distillation)
AWQ
Action matching
Alpha-blending
BitsFusion
Block-NeRF
CAN (Condition-Aware Neural Network)
CAT3D
CFG-Aware Step Distillation
CFM
CLIP (Contrastive Language-Image Pre-training)
CTM
CTRLLoRA
Cloud-Device Collaborative Adaptation (CCA)
Consistency distillation
ControlNet
DDIB-ODE
DDIB-SDE
DDIM
DDPM
DPM-Solver
DPM-Solver++
Deep Compression
DeepFloyd-IF
DepthFM
Diff-Instruct
Diffusion bridges
DistriFusion
Distribution Matching Distillation (DMD)
Dream Machine
EDM
EWA Volume Splatting
Efficient Inference Engine (EIE)
EfficientViT
EfficientViT-SAM
Ego-Exo4D
Euler Discretization
Flow Matching
GANs
Gaussian Splatting
Genie
IP-Adapter
InPainting
InstaFlow
InstaFlow-1.7B
Instant-NGP
Instant3D
LAION
LCM-LoRA
LiDAR-LLM
Llama2-Accessory
LoRA
LoRAdapter
Locality-Sensitive Hashing
MCUNet
Marigold
Mesh
MiDaS
Mip-Splatting
MipNeRF360
Mixed-Precision Quantization
Multistep Consistency Models (MCM)
Multistep Distillation of Diffusion Models via Moment Matching
MVImgNet
N-Dimensional Gaussians for Fitting of High Dimensional Functions
NeRF
Neural Field
Neural Radiance Fields (NeRFs)
ODEs
OV-3DET (Open-Vocabulary 3D Detection)
Objaverse
Objaverse-XL
OmniML
Once-for-all Network (OFA)
Optimal Transport (OT)
PerFlow
Pixel DPM
Point-based Graphics
Probability Flow ODEs
Progressive Distillation
RIN
Rectified Flow
Reducing the Memory Footprint of 3D Gaussian Splatting
Revising Densification in Gaussian Splatting
Robust Training
SD1.5
SDEs
SDXL
SFM (Structure from Motion)
SR3
Schrödinger bridge matching
SmoothQuant
SnapFusion
Sora
Stable Diffusion (SD) 1.4
Stable Diffusion v1.5
Stochastic interpolant
StyleGAN-T
T2I-Adapter
TensorRT-LLM
TextCrafter
Time-reversal diffusion
TinyChat
TinyEngine
TinyML
TinyNAS
UFO-Gen
UNet
Uncertainty-Guided Sampling
Uni-ControlNet
UniPC
VDM++
VILA
Visual Prompt Learning (VPL)
Voxel Grid
ZoeDepth
aDDIM

Topics

3D Gaussian Splatting · 3D Generation · 3D data scalability · 4D generation · Adaptive Densification · Causal reasoning · Content generation · DPM-Solver++ · DeepFloyd-IF · Depth estimation · Diffusion Models · Distribution Transport · Dynamic scenes · Edge AI · Efficiency · Efficient AI · Embodied AI · Fast Sampling · Flow Matching · Foundation models · Generative AI · Generative Models · Hardware-aware NAS · Image Generation · Implicit 3D learning · InstaFlow · Latency reduction · Light transport · Machine Learning Generalization · Memory Optimization · Mobile devices · Model Compression · Model Distillation · Model limitations · Model optimization · Moment Matching · Multi-modal LLMs · Neural ODEs · Neural Radiance Fields · Neural Rendering · On-device Training · Open-World AI · Optimal Transport · Quantization · Rectified Flow · SDE solvers · Scene Representation · Storage reduction · Stylized 3D · Text-to-Image Generation · Theoretical Guarantees · Unsupervised Learning · Video Generation · Virtual Scenes · Visual Synthesis
