Edge-Optimized Deep Learning: Harnessing Generative AI and Computer Vision with Open-Source Libraries

Event: CVPR 2024 Tutorial · Duration: 371 min

Abstract

This tutorial introduces Intel’s open-source solutions for edge-optimized deep learning, centered on the OpenVINO framework. The opening segment covers the challenges of deploying AI, particularly generative AI models, and demonstrates how OpenVINO addresses them with hardware-agnostic optimization and deployment. It then walks through software optimizations within OpenVINO, including auto-batching, inference precision, thread scheduling, and model caching, showing significant performance improvements for computer vision and generative AI tasks. Finally, it introduces quantization as a key technique for model compression and acceleration.

The next segment surveys quantization techniques, hardware optimizations, and the OpenVINO ecosystem. It covers fake quantization, weight compression, post-training quantization, and quantization-aware training, emphasizing their role in reducing model size and accelerating inference. The discussion extends to Intel’s hardware capabilities, including BFloat16, VNNI, AMX, and the heterogeneous architecture of Core Ultra processors with dedicated NPU, CPU, and GPU engines. A live demo shows the performance and power-efficiency benefits of running object detection on the NPU compared to the CPU and GPU. The segment concludes by introducing OpenVINO Training eXtensions (OTX), OpenVINO Notebooks, and the OpenVINO Model Server, the framework’s tools for training, optimizing, and serving AI models across diverse hardware platforms.

Two presentations follow. First, Samet Akcay introduces OpenVINO Training Extensions (OTX), an end-to-end framework for computer vision tasks, demonstrating its use for data management, model training, explainable AI, and deployment. He highlights OTX’s user-friendly CLI/API and its support for tasks such as classification, object detection, and segmentation, including features like auto-configuration and image tiling. Hakan Kang then presents Module 2, on customizing and running generative AI pipelines with LoRA and OpenVINO. She gives a quick demo of image generation and then explains LoRA and its importance in reducing the training cost and memory footprint of fine-tuning large language and vision models.

A hands-on session covers enabling LoRA weights in text-to-image models using OpenVINO, demonstrating both offline and runtime approaches. The tutorial then turns to model optimization, focusing on quantization techniques for Computer Vision and Generative AI models. The speaker explains the benefits and challenges of post-training quantization, quantization-aware training, accuracy-aware quantization, and weight-only quantization, using the NNCF framework.

The final segment is a deep dive into OpenVINO’s advanced quantization features for Generative AI models, built on the Neural Network Compression Framework (NNCF) and Optimum Intel. The speaker, Aleksandr Korobov from Intel, covers quantization strategies including 8-bit/4-bit weight-only, data-aware/data-free, AWQ, GPTQ, SmoothQuant, and hybrid quantization for diffusion models. Practical demonstrations using Colab notebooks apply these methods to anomaly detection (using QAT), Large Language Models (LLMs) such as Phi-3, and Stable Diffusion models (specifically the UNet of LCM), showing significant improvements in inference latency and model size while maintaining high accuracy. The presentation concludes with an invitation to upcoming Intel events.

Speakers

Paula Ramos — Intel
Adrian Boguszewski — Intel
Hakan Kang — Intel
Samet Akcay — Intel
Zhaozhong Wu — Intel
Alexander Koplyay — Intel
Aleksandr Korobov — Intel

Talks (26)

00:00:00 — Paula Ramos: Edge-Optimized Deep Learning: Harnessing Generative AI and Computer Vision with Open-Source Libraries
- Paula introduces the tutorial, outlines its modules, discusses AI deployment challenges, and presents OpenVINO as a solution, covering its fundamentals, hardware agnosticism, and ease of use with a live demo.
01:14:09 — Dequantization, Fake Quantization, Weight Compression, Post-Training Quantization (PTQ), Accuracy-Control Quantization, Quantization-Aware Training
- The speaker introduces various quantization techniques, starting with fake quantization, then weight compression, post-training quantization (PTQ) which uses a calibration dataset, and accuracy-control quantization to manage accuracy drops. Finally, quantization-aware training (QAT) is presented as a method to fix quantization errors during training by introducing fake quantization nodes.
01:14:53 — Neural Network Compression Framework (NNCF) and Quantization Results
- This segment introduces the Neural Network Compression Framework (NNCF) as part of OpenVINO Toolkit for model compression. It showcases NNCF’s effectiveness in weight compression for Large Language Models (LLMs) like Llama 2 and Dolly, demonstrating significant model size reduction with minimal perplexity increase. Post-training quantization (PTQ) results for YOLOv8 are also presented, showing a small accuracy drop but substantial performance gains. Hybrid PTQ for sensitive models like LCM is discussed, along with OpenPose performance comparisons, highlighting 2x speedup and reduced model size.
01:15:32 — Quantization with NNCF (Transition)
- A brief transition slide showing QR codes for NNCF quantization examples and announcing the next module on hardware optimizations.
01:15:50 — Hardware Optimizations
- The speaker emphasizes the importance of hardware optimization for running AI models on portable devices, citing the vast number of laptops shipped annually. He introduces Llama 3 and discusses the challenge of running large language models locally due to memory constraints, highlighting the need for optimization and compression.
01:17:41 — BFloat16
- The speaker explains BFloat16 as a crucial floating-point format for deep learning training, developed in collaboration with Google. He highlights its importance in accelerating training times (e.g., reducing 2 weeks to 1 week) by maintaining sufficient precision while reducing memory footprint, contrasting it with traditional FP32 and FP16.
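
To make the contrast with FP32 and FP16 concrete, here is a minimal sketch (not taken from the talk) that converts a value to BFloat16 by keeping the top 16 bits of its FP32 bit pattern. The function names are illustrative, and the conversion uses plain truncation as a simplifying assumption; real hardware typically rounds to nearest.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep sign, 8 exponent bits and the top 7 mantissa bits (truncation, an assumption)."""
    return fp32_bits(x) >> 16


def from_bfloat16_bits(b: int) -> float:
    """Expand 16 BFloat16 bits back to a float by zero-filling the dropped mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def bf16_round_trip(x: float) -> float:
    """Value of x after being stored as BFloat16 and read back."""
    return from_bfloat16_bits(to_bfloat16_bits(x))
```

Because BFloat16 keeps the same 8 exponent bits as FP32, a value such as 1e38 survives the round trip with a small relative error, whereas FP16 tops out at 65504. The price is precision: only 7 mantissa bits are stored.
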
01:18:52 — VNNI - VPDPBUSD and Intel® AMX (Advanced Matrix Extensions)
- The speaker introduces VNNI (Vector Neural Network Instructions) and AMX (Advanced Matrix Extensions) as Intel hardware capabilities designed to accelerate deep learning inference. VNNI, specifically the VPDPBUSD instruction, fuses the multiply and accumulate steps of an INT8 dot product, which previously took several instructions, into a single instruction. AMX, with its Tile Matrix Multiplication (TMUL) instructions, further enhances performance for large matrix operations.
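
The arithmetic that VPDPBUSD performs in each 32-bit lane can be emulated in a few lines. This is an illustrative sketch of the instruction's semantics only (four unsigned bytes times four signed bytes, summed into a signed 32-bit accumulator with wraparound); the helper names are hypothetical and nothing here reflects how OpenVINO invokes the instruction.

```python
from typing import Sequence


def vpdpbusd_lane(acc: int, u8: Sequence[int], s8: Sequence[int]) -> int:
    """Emulate one 32-bit lane: acc += sum(u8[i] * s8[i]) for four byte pairs."""
    if len(u8) != 4 or len(s8) != 4:
        raise ValueError("each lane consumes exactly four byte pairs")
    if not all(0 <= a <= 255 for a in u8):
        raise ValueError("first operand must hold unsigned 8-bit values")
    if not all(-128 <= b <= 127 for b in s8):
        raise ValueError("second operand must hold signed 8-bit values")
    total = (acc + sum(a * b for a, b in zip(u8, s8))) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed integer (non-saturating wraparound).
    return total - (1 << 32) if total & 0x80000000 else total


def int8_dot(u8: Sequence[int], s8: Sequence[int]) -> int:
    """INT8 dot product built from repeated lane operations (length must be a multiple of 4)."""
    acc = 0
    for i in range(0, len(u8), 4):
        acc = vpdpbusd_lane(acc, u8[i:i + 4], s8[i:i + 4])
    return acc
```

The unsigned-by-signed pairing matches a common quantized layout in which activations are stored as unsigned 8-bit values and weights as signed 8-bit values.
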
01:20:07 — AI Deployment
- The speaker discusses the landscape of AI deployment across Cloud, Edge, and Core devices, emphasizing the trade-offs in terms of compute power, latency, and data privacy. He highlights the growing importance of Edge processing for real-time applications and the need for efficient hardware to run large models locally.
01:21:03 — Three AI Engines in Intel® Core™ Ultra
- The speaker introduces Intel’s heterogeneous architecture in Core Ultra processors, featuring three dedicated AI engines: NPU (for power efficiency), CPU (for fast response), and GPU (for high throughput). He explains that this design optimizes performance for different AI tasks and workloads.
01:21:41 — Intel® Core™ Ultra Architecture and Hybrid CPU Architecture
- The speaker delves into the Intel Core Ultra architecture, describing its tiled design (Compute, SoC, IO, GPU tiles) and the ‘pizza’ analogy for integrating different manufacturing processes. He also explains the hybrid CPU architecture with P-cores and E-cores, emphasizing the role of Intel Thread Director in scheduling workloads efficiently.
01:22:38 — Acceleration Capabilities Intel® DL Boost: VNNI and Neural Processing Unit
- The speaker reiterates the importance of VNNI (Vector Neural Network Instructions) for accelerating INT8 operations on CPUs and introduces the Neural Processing Unit (NPU) as a dedicated AI accelerator. He highlights the NPU’s power efficiency and its role in offloading AI tasks from the CPU.
01:23:28 — Live Object Detection Demo
- A live demonstration of object detection using OpenVINO, comparing performance and power consumption across CPU, GPU, and NPU. The demo shows significant FPS improvements and reduced power usage when switching from CPU to GPU and NPU.
01:25:18 — OpenVINO™ Training eXtensions (OTX)
- The speaker introduces OpenVINO™ Training eXtensions (OTX) as a one-stop shop for verified algorithms for various vision tasks. OTX provides simple CLI and Python API for quick starts and integrates OpenVINO for optimization, inference, acceleration, and deployment of trained models.
01:26:09 — OpenVINO™ Notebooks
- The speaker highlights OpenVINO™ Notebooks as a free, open-source resource on GitHub, providing examples to convert, optimize, and deploy models on different hardware (CPU, GPU, NPU). She showcases various notebook examples for object detection, pose estimation, Segment Anything Model, depth estimation, YOLOv8, Stable Diffusion, Latent Consistency Models (LCM), ControlNet, QR Code Monster, Text to Video (ZeroScope), Film-Slowmo, LLaVA, Llama3 Chatbot, Music Generation, Text to Speech, and Diarization.
01:26:59 — OpenVINO™ GenAI Pipeline Repo and OpenVINO™ Ecosystem Adoption
- The speaker introduces the OpenVINO™ GenAI Pipeline Repo, offering C++ examples for GenAI and LLM pipelines, including a benchmark tool for performance evaluation. She also highlights the significant adoption of OpenVINO, with over 5 million downloads and its participation in Google Summer of Code.
01:27:38 — OpenVINO™ Model Server
- The speaker introduces OpenVINO™ Model Server, powered by OpenVINO™ Runtime, as a solution for serving models efficiently. She explains its architecture, allowing multiple clients to make requests to a single model server, supporting various frameworks and model management.
01:28:13 — OpenVINO™ Model Server Client Example
- The speaker demonstrates a code snippet for running a client application to interact with the OpenVINO™ Model Server, showcasing how to make requests and receive responses for inference.
02:28:19 — Hakan Kang: Announcements and Introduction to Module 1
- Hakan Kang shares information about various Intel and CVPR events, including a networking meetup and the AI PC Developer Program, encouraging attendees to connect via QR codes before introducing the next speaker for Module 1.
02:32:04 — Samet Akcay: Edge-Optimized Deep Learning: Harnessing Generative AI and Computer Vision with Open-Source Libraries - Module 1: Data Management, Training, and Fine-tuning Computer Vision Tasks
- Samet Akcay presents OpenVINO Training Extensions (OTX), an end-to-end framework for computer vision tasks, demonstrating its capabilities for data management, model training (classification, object detection, segmentation), explainable AI, and model deployment using CLI/API and Jupyter notebooks.
03:06:50 — Hakan Kang: Customize and Run Gen AI Pipelines with LoRA and OpenVINO™ - Module 2
- Hakan Kang demonstrates generating an image using an LCM model and then explains the necessity of LoRA (Low-Rank Adaptation) to efficiently fine-tune large generative AI models by reducing computational costs and memory footprint.
03:13:50 — Adrian Boguszewski: Software Optimizations
- Adrian details various software optimizations within OpenVINO, including auto-batching, inference precision, thread scheduling, shared memory, model caching, pre/post-processing optimization, and asynchronous mode, demonstrating performance gains with a generative AI model.
03:42:29 — Hands-on practice
- Demonstrates two methods for enabling LoRA weights with OpenVINO for text-to-image generation, including offline and runtime approaches, and provides a practical hands-on session.
04:08:34 — Alexander Koplyay: Module 3: Optimization with NNCF for Computer Vision and Gen AI
- Discusses the importance of model optimization, particularly quantization, for Computer Vision and Generative AI models, highlighting different quantization methods and their application using the NNCF framework.
04:56:39 — Aleksandr Korobov: OpenVINO Features for Gen AI
- The speaker continues discussing OpenVINO features for Generative AI, covering various quantization methods like 8-bit/4-bit, data-aware/data-free, AWQ, GPTQ, SmoothQuant, and hybrid quantization for diffusion models, followed by API examples for model optimization using NNCF and Optimum Intel.
05:06:21 — Aleksandr Korobov: Practical Part
- This section provides live demonstrations of applying OpenVINO and NNCF quantization techniques to several models, including anomaly detection (STFPM) with QAT, LLM (Phi-3) quantization with dynamic and KV-cache options, and hybrid quantization for Stable Diffusion (LCM UNet), showcasing code examples and performance results.
09:21:50 — Adrian Boguszewski: Quantization
- Adrian introduces the concept of quantization as a lossy compression technique for AI models, illustrating how continuous values are mapped to discrete sets and the visual impact of reducing color depth.
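
The mapping from continuous values to a discrete set described in the quantization talk can be sketched as an affine 8-bit scheme. This is a minimal illustration under stated assumptions (asymmetric unsigned integers, one scale and zero point for the whole tensor, hypothetical function names), not the scheme NNCF applies. Quantizing and immediately dequantizing, as in `fake_quantize`, is the idea behind the fake quantization nodes mentioned in the talks.

```python
def quant_params(lo: float, hi: float, bits: int = 8):
    """Pick a scale and zero point so that [lo, hi] maps onto 0 .. 2**bits - 1."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # degenerate range: every value is zero
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, bits: int = 8) -> int:
    """Map a real value to the nearest integer level, clamped to the valid range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer level back to the real value it represents."""
    return (q - zero_point) * scale


def fake_quantize(x: float, scale: float, zero_point: int, bits: int = 8) -> float:
    """Quantize then dequantize: the result stays a float but carries the rounding error."""
    return dequantize(quantize(x, scale, zero_point, bits), scale, zero_point)
```

The compression is lossy because every value inside one step of width `scale` collapses onto the same level, so the reconstruction error is bounded by about half a step for values inside the calibrated range.
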

Key Takeaways

OpenVINO provides a comprehensive open-source framework for optimizing and deploying AI models across diverse hardware, including Intel CPUs, GPUs, and NPUs, as well as Arm CPUs.
The framework is hardware-agnostic, allowing developers to write code once and deploy it on various devices, including integrated GPUs and NPUs for edge AI applications.
OpenVINO offers numerous software optimizations, such as auto-batching, default inference precision (BFloat16, FP16), thread scheduling, shared memory, and model caching, to significantly improve inference performance.
OpenVINO seamlessly integrates as a backend with popular AI frameworks like PyTorch, ONNX Runtime, and Hugging Face, enabling developers to leverage existing models with enhanced performance and minimal code changes.
Quantization is a key technique for reducing model size and improving inference speed, especially for generative AI models, by converting floating-point weights to lower-bit integer representations, often with negligible accuracy loss.
Quantization techniques like weight compression and PTQ significantly reduce model size and improve inference speed with minimal accuracy loss, while QAT can recover further accuracy by simulating quantization during training.
Intel’s hardware optimizations, including BFloat16, VNNI, and AMX, are crucial for accelerating deep learning workloads, especially for large language models, by leveraging specialized instructions and heterogeneous architectures.
The OpenVINO ecosystem provides comprehensive tools (NNCF, OTX, Notebooks, Model Server) for training, optimizing, and deploying AI models across various hardware platforms (CPU, GPU, NPU), enabling efficient AI deployment from cloud to edge.
The NPU in Intel Core Ultra processors offers superior power efficiency for AI inference compared to CPU and GPU, making it ideal for running AI tasks on portable devices and extending battery life.
OpenVINO Training Extensions (OTX) provides a comprehensive, end-to-end framework for developing and deploying computer vision models, simplifying tasks from data management to explainable AI.
OTX supports various computer vision tasks and learning methods through a unified CLI and API, reducing the need to switch between different frameworks.
LoRA is a crucial technique for efficiently fine-tuning large generative AI models, significantly reducing training costs and memory requirements.
OpenVINO facilitates the optimization and deployment of these trained models, including LoRA-adapted generative AI, onto edge devices for real-world applications.
LoRA weights can be effectively integrated with OpenVINO for efficient text-to-image generation, offering both offline and runtime enablement methods.
The runtime LoRA enablement method is more efficient for handling multiple LoRA weights as it requires only a single model compilation.
Model optimization, especially quantization, is crucial for deploying Computer Vision and Generative AI models due to hardware-specific instruction sets, efficiency gains, and cost savings.
The NNCF framework provides various quantization techniques like PTQ, QAT, and accuracy-aware quantization, with weight-only quantization being particularly suitable for large Generative AI models due to their memory-bound nature.
OpenVINO, in conjunction with NNCF and Optimum Intel, provides a comprehensive set of tools and APIs for quantizing Generative AI models, supporting various precision levels and advanced techniques like data-aware and hybrid quantization.
Practical demonstrations show that applying these quantization methods can lead to substantial improvements in inference latency (e.g., 4x for LLMs) and significant model size reduction, making Gen AI models more efficient for deployment.
Specific quantization strategies, such as hybrid quantization for Stable Diffusion and dynamic/KV-cache quantization for LLMs, are tailored to the unique characteristics of different model architectures to maximize performance benefits while preserving accuracy.
The presented tools facilitate the entire model optimization workflow, from converting Hugging Face models to OpenVINO IR to applying quantization and evaluating performance, with options for both Python and C++ deployment.
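
The LoRA takeaways above rest on one piece of arithmetic: a weight update expressed as the product of two small matrices. The sketch below shows that arithmetic only, using plain Python lists; the function names, the `alpha / r` scaling convention, and the shapes are illustrative assumptions, and it does not represent how OpenVINO applies LoRA internally. Merging the update into the weights corresponds to the offline idea, while keeping the adapter separate corresponds to the runtime idea.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, x):
    """Multiply a matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def merge_lora(W, A, B, alpha):
    """Offline idea: fold the update into the weights, W' = W + (alpha / r) * B @ A."""
    scale = alpha / len(A)  # r is the number of rows of A
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def apply_lora_at_runtime(W, A, B, alpha, x):
    """Runtime idea: keep W untouched and add the low-rank path per input."""
    scale = alpha / len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in the adapter, versus d_out * d_in for full fine-tuning."""
    return r * (d_in + d_out)
```

For a hypothetical 4096 x 4096 layer with rank 8, the adapter holds 65,536 parameters against 16,777,216 for the full matrix, which is where the training cost and memory savings come from.
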

Methods / Models / Datasets Mentioned

AMX (Advanced Matrix Extensions)
AUTO plugin
AWQ
Accuracy-Control Quantization
Accuracy-aware quantization
Asynchronous Mode
BFloat16
COCO
CPU
ControlNet
ConvNeXt
Diffusers
EfficientNet
FP16
Fake Quantization
Film-Slowmo
GPTQ
GPU
Hugging Face (Optimum Intel)
Hybrid CPU Architecture
Intel® Core™ Ultra
LCM (Latent Consistency Model)
LLaVA
Llama3 Chatbot
LoRA
NPU (Neural Processing Unit)
Model Caching
Model Converter
NNCF (Neural Network Compression Framework)
ONNX Runtime
OpenPose
OpenVINO
OpenVINO Runtime
OpenVINO Training Extensions (OTX)
OpenVINO™ GenAI Pipeline Repo
OpenVINO™ Model Server
OpenVINO™ Notebooks
OpenVINO™ Training eXtensions (OTX)
Optimum Intel
PTQ
Pascal VOC
Phi-3
Pose Estimation
Post-Training Quantization (PTQ)
PrePostProcessor (PPP)
PyTorch
PyTorch Lightning
QAT
QR Code Monster
Quantization-Aware Training (QAT)
SAM
STFPM
Segment Anything Model
Shared Memory
SmoothQuant
Stable Diffusion
TMUL (Tile Matrix Multiplication)
Thread Scheduling
TorchVision
VNNI (Vector Neural Network Instructions)
VPDPBUSD (Vector Dot Product Byte Unsigned Signed Dword)
Weight Compression
Weight-only quantization
YOLOv8
ZeroScope

Topics

AI Deployment · AMX · BFloat16 · Computer Vision · Data Management · Edge AI · Explainable AI (XAI) · Framework Integration · Generative AI (Gen AI) · Hardware Optimization · Heterogeneous Architecture · Hybrid Quantization · Large Language Models (LLMs) · Latency Reduction · LoRA (Low-Rank Adaptation) · LoRA enablement · Model Compression · Model Deployment · Model Optimization · Model Training · Neural Network Compression Framework (NNCF) · NPU · OpenVINO · OpenVINO IR Format · OpenVINO Model Server · OpenVINO Training Extensions (OTX) · OpenVINO optimization · Quantization · Quantization-Aware Training (QAT) · Software Optimizations · Stable Diffusion · Text-to-image generation · Training Cost Optimization · VNNI · YOLOv8

Notes

Open for commentary — connections to other work, critiques, follow-up reading.