Multimodal AI for Edge AI

Event: CVPR 2024 Tutorial · Duration: 214 min

Abstract

This CVPR 2024 tutorial, “Multimodal AI for Edge AI,” gives an overview of deploying efficient and reliable AI models on edge devices. It covers the fundamentals of Edge AI, model development strategies for resource-constrained hardware, and optimization techniques such as neural architecture search, pruning, and quantization. The tutorial then turns to multimodal perception, with practical applications in gaze correction, hand gesture recognition, and sound localization. Through case studies and live demonstrations on Jabra’s edge devices, attendees see the real-world challenges and solutions involved in building privacy-preserving, low-latency AI experiences.

Speakers

Fabricio Batista Narcizo — Jabra / ITU
Elizabete Munzlinger — Jabra / ITU
Anuj Dutt — Adobe
Shan Ahmed Shaffi — GN Hearing A/S
Sai Narsi Reddy Donthi Reddy — Jabra
Keshav Vashishth — GN
Kris — GN

Talks (9)

00:08:40 — Anuj Dutt: INTRODUCTION TO EDGE AI
- Overview of Edge AI, its benefits, challenges, and examples in various industries, contrasting it with Cloud AI.
02:01:52 — Sai Narsi Reddy Donthi Reddy: MODEL DEVELOPMENT FOR EDGE AI
- Details the process of designing a segmentation model for edge devices, focusing on hardware constraints, model design challenges, the dataset pipeline, and training.
03:03:50 — Fabricio Batista Narcizo: GAZE CORRECTION
- Introduces gaze correction and its importance for virtual meetings, outlines the technical challenges, and discusses deep learning approaches such as GANs and warping.
03:41:15 — Elizabete Munzlinger: HAND GESTURES RECOGNITION
- Highlights the growing market for hand gesture recognition, its diverse applications, benefits, and technical challenges including cross-cutting problems.
03:59:40 — Keshav Vashishth: SOUND LOCALIZATION
- Explains the motivation behind sound localization, reviews audio processing concepts, and introduces deep learning models for sound localization and active speaker detection.
04:05:28 — Kris: MODEL DEPLOYMENT FOR EDGE AI
- Covers model compression techniques (neural architecture search, early exits, pruning, quantization) and hardware-aware design for efficient deployment on edge devices.
04:51:50 — Fabricio Batista Narcizo: JABRA EYE CORRECTION (Video Example (Gaze Correction + Beautification))
- Live demonstration of the Jabra Eye Correction model, showcasing gaze correction and optional beautification features on a laptop.
04:59:30 — Elizabete Munzlinger: HAND GESTURE EDGE AI DEMO (Volume Control)
- Live demonstration of hand gesture recognition for volume control using a Luxonis OAK-1 MAX camera, highlighting the pipeline and challenges in intelligent meeting spaces.
05:09:40 — Keshav Vashishth: SOUND LOCALIZATION (Owens et al. (2021), How to Listen: Rethinking Visualizing and Localizing Sound.)
- Live demonstration of sound localization and active speaker detection using Jabra PanaCast 20 and 50 devices, showing heatmaps on faces to indicate sound origin.
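
The sound localization talk reviews audio processing concepts, and the methods list includes the Mel scale. As a rough illustration only, the sketch below converts between Hz and mels and spaces band centers evenly on the mel axis. It assumes the HTK-style formula, which is one of several conventions in use; the talk may have used a different one. The function names are hypothetical and not taken from the tutorial.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(f_min: float, f_max: float, n_bands: int) -> list:
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # Bands are dense at low frequencies and sparse at high frequencies.
    for center in mel_band_centers(0.0, 8000.0, 10):
        print(f"{center:8.1f} Hz")
```

Even spacing in mels gives narrow bands at low frequencies and wide bands at high frequencies, which is the property that MFCC-style features rely on.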

Key Takeaways

Because edge hardware has limited compute, memory, and power, Edge AI deployment requires careful model optimization through techniques such as pruning, quantization, and neural architecture search.
Multimodal perception, integrating visual and audio cues, is crucial for creating intuitive and engaging user experiences in edge devices.
Privacy and real-time processing are key benefits of Edge AI, but achieving them is also a challenge: both require on-device inference and efficient model architectures.
Hardware-aware design is essential, tailoring model architectures and optimization strategies to specific chipsets and their capabilities (e.g., VPU, DSP, NPU).
Jabra’s Expy Experience Platform demonstrates practical applications of multimodal AI for gaze correction, hand gesture recognition, and sound localization in real-world meeting scenarios.
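
To make two of the optimization techniques named above concrete, here is a minimal sketch of magnitude pruning followed by symmetric per-tensor int8 quantization on a flat list of weights. It is an illustration under simplifying assumptions (unstructured pruning, a single scale per tensor, no calibration data), not the pipeline used in the tutorial. All function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices ordered by ascending absolute value; the first n_prune get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization; returns (integer codes, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    w = [0.5, -0.02, 0.9, 0.01, -1.27, 0.3]
    sparse_w = magnitude_prune(w, sparsity=0.5)
    codes, scale = quantize_int8(sparse_w)
    print(sparse_w)
    print(codes, scale)
    print(dequantize(codes, scale))
```

Real toolchains typically add per-channel scales, calibration, and fine-tuning after pruning; the sketch only shows the core arithmetic.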

Methods / Models / Datasets Mentioned

Intel Movidius Myriad X VPU
TensorFlow Lite
TFLite Micro
PyTorch Mobile
PyTorch ExecuTorch
ONNX Runtime
Qualcomm SNPE
Intel OpenVINO
Edge Impulse
DINO Pre-Training
Adam Optimizer
JECModel
GAN
Warping Neural Networks
Netron app
Luxonis OAK-1 MAX camera
Google MediaPipe
GhostConv2D
Partial Convolution
Pixel Shuffle
Space2Depth
DenseFeatBlock
FastViT
RSB-ResNet
PoolFormer
DeiT
RegNetY
ConvNeXt
MobileOne
NFNet
MNASNet
ShuffleNet
MobileNetV2
MobileNetV1
MobileNetV0
Knowledge Distillation
DistilBERT
Kullback-Leibler Divergence Loss
OpenCV
Scikit-image
Keras
TensorFlow
Kornia
Dlib
MATLAB (Image Processing Toolbox)
Deep Learning Ensembles
Artificial Neural Networks (ANN)
3D Convolutional Neural Networks (3D-CNN)
Support Vector Machines (SVM)
Convolutional Neural Networks (CNN)
Recurrent Neural Networks (RNN)
Long Short-Term Memory (LSTM)
Librosa
STFT
MFCC
Mel Scale
Owens et al. (2021)
CLIP
Wav2CLIP
Contrastive Learning
Ruijie et al. (2021)
Jabra PanaCast 20
Jabra PanaCast 50
Jabra PanaCast 50 VBS
Expy Experience Platform
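
The list above names both Knowledge Distillation and Kullback-Leibler Divergence Loss. One common way the two fit together is a temperature-softened KL term between teacher and student predictions. The sketch below follows that general formulation, including the usual temperature-squared scaling. It is an assumption about the general technique, not a record of the loss used in the tutorial, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: T^2 * KL(teacher || student) at temperature T."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.1, 1.0, 2.0]
    print(distillation_loss(teacher, teacher))  # identical predictions
    print(distillation_loss(teacher, student))  # mismatched predictions
```

In practice this term is usually combined with a standard cross-entropy loss on the ground-truth labels, weighted by a tunable coefficient.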

Topics

Edge AI Fundamentals · Multimodal AI · Model Deployment Strategies · Model Optimization · Neural Architecture Search (NAS) · Model Pruning · Model Quantization · Hardware-Aware Design · Gaze Correction · Hand Gesture Recognition · Sound Localization · Real-time Processing · Privacy in AI · Computer Vision · Audio Processing

Notes

Open for commentary — connections to other work, critiques, follow-up reading.