Multimodal AI for Edge AI
Event: CVPR 2024 Tutorial · Duration: 214 min
Abstract
This CVPR 2024 tutorial, “Multimodal AI for Edge AI,” provides a comprehensive overview of deploying efficient and reliable AI models on edge devices. It delves into the fundamentals of Edge AI, model development strategies tailored for resource-constrained hardware, and advanced optimization techniques such as neural architecture search, pruning, and quantization. The tutorial specifically addresses the complexities of multimodal perception, showcasing practical applications in gaze correction, hand gesture recognition, and sound localization. Through detailed case studies and live demonstrations on Jabra’s edge devices, attendees gain insights into real-world challenges and solutions for building intelligent, privacy-preserving, and low-latency AI experiences.
Speakers
- Fabricio Batista Narcizo — Jabra / ITU
- Elizabete Munzlinger — Jabra / ITU
- Anuj Dutt — Adobe
- Shan Ahmed Shaffi — GN Hearing A/S
- Sai Narsi Reddy Donthi Reddy — Jabra
- Keshav Vashishth — GN
- Kris — GN
Talks (9)
- 00:08:40 — Anuj Dutt: INTRODUCTION TO EDGE AI
- Overview of Edge AI, its benefits, challenges, and examples in various industries, contrasting it with Cloud AI.
- 02:01:52 — Sai Narsi Reddy Donthi Reddy: MODEL DEVELOPMENT FOR EDGE AI
- Details the process of designing a segmentation model for edge devices, focusing on hardware constraints, model design challenges, the dataset pipeline, and training.
- 03:03:50 — Fabricio Batista Narcizo: GAZE CORRECTION
- Introduces gaze correction, its importance for virtual meetings, and its technical challenges, then discusses deep learning approaches such as GANs and warping.
- 03:41:15 — Elizabete Munzlinger: HAND GESTURES RECOGNITION
- Highlights the growing market for hand gesture recognition, its diverse applications, benefits, and technical challenges including cross-cutting problems.
- 03:59:40 — Keshav Vashishth: SOUND LOCALIZATION
- Explains the motivation behind sound localization, audio processing concepts, and introduces deep learning models for sound localization and active speaker detection.
- 04:05:28 — Kris: MODEL DEPLOYMENT FOR EDGE AI
- Covers model compression techniques (neural architecture search, early exits, pruning, quantization) and hardware-aware design for efficient deployment on edge devices.
- 04:51:50 — Fabricio Batista Narcizo: JABRA EYE CORRECTION (Video Example (Gaze Correction + Beautification))
- Live demonstration of the Jabra Eye Correction model, showcasing gaze correction and optional beautification features on a laptop.
- 04:59:30 — Elizabete Munzlinger: HAND GESTURE EDGE AI DEMO (Volume Control)
- Live demonstration of hand gesture recognition for volume control using a Luxonis OAK-1 MAX camera, highlighting the pipeline and challenges in intelligent meeting spaces.
- 05:09:40 — Keshav Vashishth: SOUND LOCALIZATION (Owens et al. (2021), How to Listen: Rethinking Visualizing and Localizing Sound.)
- Live demonstration of sound localization and active speaker detection using Jabra Panacast 20 and 50 devices, showing heatmaps on faces to indicate sound origin.
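The model deployment talk listed above names quantization among its compression techniques. As a rough illustration of the idea only (not the presenters' implementation), the sketch below applies affine per-tensor quantization to signed 8-bit integers in plain Python. The function names `quantize_int8` and `dequantize` are hypothetical, and the choice of an asymmetric int8 scheme is an assumption; the tutorial summary does not say which scheme was used.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine (asymmetric) per-tensor quantization to signed 8-bit integers.

    Returns the quantized values, the scale, and the zero point.
    """
    qmin, qmax = -128, 127
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    zero_point = int(round(qmin - lo / scale))
    quantized = []
    for v in values:
        q = int(round(v / scale)) + zero_point
        quantized.append(max(qmin, min(qmax, q)))  # clamp to the int8 range
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 values back to approximate floating-point values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by roughly half a quantization step at most, which is the accuracy cost paid for storing one byte per weight instead of four.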
Key Takeaways
- Edge AI deployment requires careful optimization of models due to limited hardware resources, emphasizing techniques like pruning, quantization, and neural architecture search.
- Multimodal perception, integrating visual and audio cues, is crucial for creating intuitive and engaging user experiences on edge devices.
- Privacy and real-time processing are both key benefits and key challenges of Edge AI: achieving them requires on-device inference and efficient model architectures.
- Hardware-aware design is essential, tailoring model architectures and optimization strategies to specific chipsets and their capabilities (e.g., VPU, DSP, NPU).
- Jabra’s Expy Experience Platform demonstrates practical applications of multimodal AI for gaze correction, hand gesture recognition, and sound localization in real-world meeting scenarios.
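The first takeaway mentions pruning as one way to fit models into limited hardware. A minimal sketch of one common variant, unstructured magnitude pruning, is shown below; it is an assumption that this variant resembles what the tutorial covered, and the function name `magnitude_prune` is hypothetical.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered by absolute value; the first k are the ones to prune.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


layer = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
sparse_layer = magnitude_prune(layer, sparsity=0.5)
```

Unstructured sparsity like this only saves time or memory when the target runtime can exploit zeros, which is one reason hardware-aware design matters on VPU, DSP, and NPU targets.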
Methods / Models / Datasets Mentioned
Intel Movidius Myriad X VPU · TensorFlow Lite · TFLite Micro · PyTorch Mobile · PyTorch ExecuTorch · ONNX Runtime · Qualcomm SNPE · Intel OpenVINO · Edge Impulse · DINO Pre-Training · Adam Optimizer · JECModel · GAN · Warping Neural Networks · Netron app · Luxonis OAK-1 MAX camera · Google MediaPipe · GhostConv2D · Partial Convolution · Pixel Shuffle · Space2Depth · DenseFeatBlock · FastViT · RSB-ResNet · PoolFormer · DeiT · RegNetY · ConvNeXt · MobileOne · NFNet · MNASNet · ShuffleNet · MobileNetV2 · MobileNetV1 · MobileNetV0 · Knowledge Distillation · DistilBERT · Kullback-Leibler Divergence Loss · OpenCV · Scikit-image · Keras · TensorFlow · Kornia · Dlib · MATLAB (Image Processing Toolbox) · Deep Learning Ensembles · Artificial Neural Networks (ANN) · 3D Convolutional Neural Networks (3D-CNN) · Support Vector Machines (SVM) · Convolutional Neural Networks (CNN) · Recurrent Neural Networks (RNN) · Long Short-Term Memory (LSTM) · Librosa · STFT · MFCC · Mel Scale · Owens et al. (2021) · CLIP · Wav2CLIP · Contrastive Learning · Ruijie et al. (2021) · Jabra Panacast 20 · Jabra Panacast 50 · Jabra Panacast 50 VBS · Expy Experience Platform
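The list above names both Knowledge Distillation and Kullback-Leibler Divergence Loss, but this summary does not say how the tutorial combined them. The sketch below shows the widely used temperature-scaled formulation, in which a small student model is trained to match a larger teacher's softened output distribution. The function names and the default temperature are assumptions for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits: List[float],
                    student_logits: List[float],
                    temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions.

    The result is multiplied by temperature**2, a common convention that
    keeps gradient magnitudes comparable across temperatures.
    Extreme logits could underflow a student probability to zero; this
    sketch does not guard against that case.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = distillation_kl(teacher, student)
```

In practice this term is usually added to the ordinary task loss on ground-truth labels, with a weighting factor chosen per project.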
Topics
Edge AI Fundamentals · Multimodal AI · Model Deployment Strategies · Model Optimization · Neural Architecture Search (NAS) · Model Pruning · Model Quantization · Hardware-Aware Design · Gaze Correction · Hand Gesture Recognition · Sound Localization · Real-time Processing · Privacy in AI · Computer Vision · Audio Processing
Notes
Open for commentary — connections to other work, critiques, follow-up reading.