Multimodal AI for Edge AI

Event: CVPR 2024 Tutorial · Duration: 214 min

Abstract

This CVPR 2024 tutorial, “Multimodal AI for Edge AI,” surveys how to deploy efficient and reliable AI models on edge devices. It covers the fundamentals of Edge AI, model development strategies tailored to resource-constrained hardware, and optimization techniques such as neural architecture search, pruning, and quantization. The tutorial also addresses the complexities of multimodal perception, with practical applications in gaze correction, hand gesture recognition, and sound localization. Through detailed case studies and live demonstrations on Jabra’s edge devices, attendees see the real-world challenges of building intelligent, privacy-preserving, low-latency AI experiences and how they are solved.

Speakers

  • Fabricio Batista Narcizo — Jabra / ITU
  • Elizabete Munzlinger — Jabra / ITU
  • Anuj Dutt — Adobe
  • Shan Ahmed Shaffi — GN Hearing A/S
  • Sai Narsi Reddy Donthi Reddy — Jabra
  • Keshav Vashishth — GN
  • Kris — GN

Talks (9)

  • 00:08:40 Anuj Dutt: INTRODUCTION TO EDGE AI
    • Overview of Edge AI, its benefits, challenges, and examples in various industries, contrasting it with Cloud AI.
  • 02:01:52 Sai Narsi Reddy Donthi Reddy: MODEL DEVELOPMENT FOR EDGE AI
    • Details the process of designing a segmentation model for edge devices, focusing on hardware constraints, model design challenges, the dataset pipeline, and training.
  • 03:03:50 Fabricio Batista Narcizo: GAZE CORRECTION
    • Introduces gaze correction, its importance for virtual meetings, and its technical challenges, then discusses deep learning approaches such as GANs and warping.
  • 03:41:15 Elizabete Munzlinger: HAND GESTURES RECOGNITION
    • Highlights the growing market for hand gesture recognition, its diverse applications, benefits, and technical challenges, including cross-cutting problems.
  • 03:59:40 Keshav Vashishth: SOUND LOCALIZATION
    • Explains the motivation behind sound localization, audio processing concepts, and introduces deep learning models for sound localization and active speaker detection.
  • 04:05:28 Kris: MODEL DEPLOYMENT FOR EDGE AI
    • Covers model compression techniques (neural architecture search, early exits, pruning, quantization) and hardware-aware design for efficient deployment on edge devices.
  • 04:51:50 Fabricio Batista Narcizo: JABRA EYE CORRECTION (Video Example: Gaze Correction + Beautification)
    • Live demonstration of the Jabra Eye Correction model, showcasing gaze correction and optional beautification features on a laptop.
  • 04:59:30 Elizabete Munzlinger: HAND GESTURE EDGE AI DEMO (Volume Control)
    • Live demonstration of hand gesture recognition for volume control using a Luxonis OAK-1 MAX camera, highlighting the pipeline and challenges in intelligent meeting spaces.
  • 05:09:40 Keshav Vashishth: SOUND LOCALIZATION (Owens et al. (2021), How to Listen: Rethinking Visualizing and Localizing Sound.)
    • Live demonstration of sound localization and active speaker detection using Jabra PanaCast 20 and 50 devices, showing heatmaps on faces to indicate sound origin.
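
As a rough illustration of one compression technique named in the model deployment talk, the sketch below applies unstructured magnitude pruning to a flat list of weights. It is not taken from the tutorial: the function name `magnitude_prune`, the `sparsity` parameter, and the sample weights are hypothetical, and real toolchains usually prune whole tensors layer by layer and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of a flat weight list.

    Hypothetical helper for illustration only; not an API from any framework.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Rank weight indices by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    # Keep surviving weights unchanged; replace pruned ones with zero.
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # made-up values
    print(magnitude_prune(weights, sparsity=0.5))
```

Zeroed weights only save memory or compute when the runtime or hardware can exploit sparsity, which is one reason pruning is usually discussed together with hardware-aware design.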

Key Takeaways

  • Because edge hardware is resource-constrained, deploying models on it requires careful optimization with techniques such as pruning, quantization, and neural architecture search.
  • Multimodal perception, integrating visual and audio cues, is crucial for creating intuitive and engaging user experiences on edge devices.
  • Privacy and real-time processing are key benefits and challenges of Edge AI, necessitating on-device inference and efficient model architectures.
  • Hardware-aware design is essential, tailoring model architectures and optimization strategies to specific chipsets and their capabilities (e.g., VPU, DSP, NPU).
  • Jabra’s Expy Experience Platform demonstrates practical applications of multimodal AI for gaze correction, hand gesture recognition, and sound localization in real-world meeting scenarios.
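
To make the quantization takeaway concrete, here is a minimal sketch of affine (asymmetric) 8-bit quantization in plain Python. It assumes a signed int8 target and a known value range. The helper names (`quantize_params`, `quantize`, `dequantize`) and the sample values are illustrative assumptions, not code from the tutorial or from any specific toolkit.

```python
def quantize_params(x_min, x_max, q_min=-128, q_max=127):
    """Derive a scale and zero point mapping [x_min, x_max] onto [q_min, q_max]."""
    # Widen the range to include zero so that 0.0 is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to integers, clamping to the representable range."""
    return [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


if __name__ == "__main__":
    activations = [-1.0, -0.25, 0.0, 1.5, 3.0]  # made-up values
    scale, zero_point = quantize_params(min(activations), max(activations))
    q = quantize(activations, scale, zero_point)
    print(q)
    print(dequantize(q, scale, zero_point))
```

The round trip is lossy: each value is recovered only to within about one quantization step (`scale`), which is the accuracy cost traded for a smaller, faster model.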

Methods / Models / Datasets Mentioned

  • Intel Movidius Myriad X VPU
  • TensorFlow Lite
  • TFLite Micro
  • PyTorch Mobile
  • PyTorch ExecuTorch
  • ONNX Runtime
  • Qualcomm SNPE
  • Intel OpenVINO
  • Edge Impulse
  • DINO Pre-Training
  • Adam Optimizer
  • JECModel
  • GAN
  • Warping Neural Networks
  • Netron app
  • Luxonis OAK-1 MAX camera
  • Google MediaPipe
  • GhostConv2D
  • Partial Convolution
  • Pixel Shuffle
  • Space2Depth
  • DenseFeatBlock
  • FastViT
  • RSB-ResNet
  • PoolFormer
  • DeiT
  • RegNetY
  • ConvNeXt
  • MobileOne
  • NFNet
  • MNASNet
  • ShuffleNet
  • MobileNetV2
  • MobileNetV1
  • MobileNetV0
  • Knowledge Distillation
  • DistilBERT
  • Kullback-Leibler Divergence Loss
  • OpenCV
  • Scikit-image
  • Keras
  • TensorFlow
  • MediaPipe
  • Kornia
  • Dlib
  • MATLAB (Image Processing Toolbox)
  • Deep Learning Ensembles
  • Artificial Neural Networks (ANN)
  • 3D Convolutional Neural Networks (3D-CNN)
  • Support Vector Machines (SVM)
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)
  • Long Short-Term Memory (LSTM)
  • Librosa
  • STFT
  • MFCC
  • Mel Scale
  • Owens et al. (2021)
  • CLIP
  • Wav2CLIP
  • Contrastive Learning
  • Ruijie et al. (2021)
  • Jabra PanaCast 20
  • Jabra PanaCast 50
  • Jabra PanaCast 50 VBS
  • Expy Experience Platform
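
The list above names knowledge distillation together with a Kullback-Leibler divergence loss. A common way to combine the two is sketched below: teacher and student logits are softened with a temperature, and the student is penalized by the KL divergence from the teacher's distribution. This formulation is an assumption, since the source does not say which variant the tutorial used, and the function names, temperature value, and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student outputs.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]  # made-up logits
    student = [2.5, 1.5, 0.2]
    print(distillation_loss(teacher, student))
```

In practice this term is typically added to the ordinary task loss on ground-truth labels, with a weighting factor chosen per project.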

Topics

Edge AI Fundamentals · Multimodal AI · Model Deployment Strategies · Model Optimization · Neural Architecture Search (NAS) · Model Pruning · Model Quantization · Hardware-Aware Design · Gaze Correction · Hand Gesture Recognition · Sound Localization · Real-time Processing · Privacy in AI · Computer Vision · Audio Processing


Notes

Open for commentary — connections to other work, critiques, follow-up reading.