Mobile AI Workshop 2025: Introductory Talk
Event: CVPR 2025 Mobile AI Workshop · Duration: 411 min
Abstract
This segment introduces the Mobile AI Workshop 2025, focusing on the practical aspects of deploying and optimizing deep learning models on mobile and edge devices. The speaker, Andrii Ihnatov, outlines the workshop’s goals, discusses the challenges of real-world mobile AI deployment, and provides a comprehensive overview of current on-device inference frameworks. He critically evaluates ONNX, ExecuTorch, CoreML, and LiteRT (formerly TensorFlow Lite), highlighting their strengths, weaknesses, and suitability for different platforms and hardware accelerators. The talk emphasizes the importance of selecting the right software stack and leveraging dedicated hardware like NPUs and GPUs for efficient mobile AI.
This segment provides a comprehensive overview of software frameworks and hardware accelerators for on-device mobile AI inference. It delves into various LiteRT Delegates for Qualcomm, MediaTek, and ARM, highlighting their capabilities and limitations. A significant portion is dedicated to the deprecation of Android’s NNAPI, explaining the reasons behind this decision and its implications. The speaker then presents a detailed analysis of AI hardware across different vendors, including Qualcomm, MediaTek, Google, Unisoc, Kirin, Xiaomi, and Exynos, outlining their AI accelerators, supported inference formats, and available libraries. The segment concludes with an introduction to AI Benchmark (ETHZ), a tool developed by the speaker’s team for measuring smartphone AI performance, showcasing its diverse model suite (including LLMs and Stable Diffusion) and visualization features, and demonstrating the rapid performance improvements in mobile AI hardware.
This presentation outlines a series of Mobile AI Challenges for CVPR & ICCV 2025, covering various computer vision and generative AI tasks. The challenges aim to push the boundaries of on-device AI performance and efficiency, focusing on practical applications and real-world constraints. Tasks include 4K quantized and floating-point image super-resolution, real-time video super-resolution, image denoising, realistic bokeh rendering, sRGB image enhancement, learned smartphone ISP, efficient Stable Diffusion, and efficient LLMs. Each challenge specifies evaluation metrics, datasets, runtime platforms (Qualcomm Snapdragon, MediaTek Dimensity, Apple M4, Raspberry Pi 4), and model formats (TFLite INT8/FP32, CoreML, ONNX, PyTorch, TensorFlow). A key emphasis is placed on balancing high accuracy with efficient runtime and practical visual quality, with some challenges incorporating user studies for quality assessment. The presentation also touches upon the challenges of power consumption measurement and the current state of mobile AI frameworks.
This segment covers two main topics from the MAI 2023 workshop: the final results and analysis of the Quantized Image Super-Resolution Challenge, and a presentation on Compressed Domain Video SlowMo (CDVS) for 8K video on mobile devices. The super-resolution challenge highlights the importance of hardware-aware optimization, especially on platforms like Google Tensor, and showcases solutions leveraging reparameterization and quantization-aware training. The CDVS presentation addresses the computational and memory challenges of high-resolution video processing on mobile NPUs by performing key AI blocks in the compressed domain, demonstrating significant efficiency gains for 8K video slow-motion.
This video segment concludes a presentation on a Multi-Frame Processing (MFP) pipeline designed for mobile GPUs. The pipeline operates in a compressed domain, utilizing an encoder-decoder architecture with a Vector Quantization (VQ) module to achieve 3.3x compression and address high memory consumption challenges. Key MFP functions demonstrated include demosaicing, denoising, image registration, deghosting (detecting unmatching pixels between frames), and HDR blending (replacing saturated pixels from normal exposure frames with information from short frames). The presenter showcases results for warping, deghosting, and HDR blending, highlighting the method’s ability to handle large motion and small dynamics. The full pipeline processes two raw frames, registers them, estimates saturation and deghosting maps, blends features, and decodes the final output. The presentation concludes with a summary of contributions and a vision for high-resolution multi-frame processing on mobile devices. A brief Q&A session follows, where the presenter discusses future plans for on-device implementation and acknowledges remaining challenges like tone mapping.
This video segment features two main presentations. The first presentation details the MAI RGB Photo Enhancement Challenge, outlining its objectives, dataset, performance evaluation metrics, and the winning solution’s architecture. The challenge aims to enhance low-quality smartphone photos, achieve good performance on mobile GPUs, and process HD resolution photos on mobile devices. The winning solution, DaHua-IIG, utilized a Parallel Linear U-net (PLU-Net) architecture with global feature extraction and a combination of L2, MS-SSIM, and perceptual losses. The second presentation introduces PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers. This work addresses the challenge of deploying large language models and vision transformers on resource-constrained edge devices by proposing a parameter-efficient framework using LoRA adapters for both convolutional and attention layers within hybrid transformer backbones. The presentation highlights PETAH’s ability to achieve high accuracy with minimal additional parameters and demonstrates its versatility across various computer vision tasks, including fine-grained classification, object detection, and instance segmentation. It also explores the benefits of sparsity in further improving model efficiency.
This presentation addresses the critical challenge of robust 6DoF (Six Degrees of Freedom) pose estimation on mobile devices, where depth sensors like the iPhone LiDAR produce low-resolution, distorted, and noisy depth maps. The speaker introduces the DTTD-Mobile Dataset, a new RGBD dataset collected using an iPhone 14 Pro and an OptiTrack motion capture system, featuring 18 objects across 100 diverse scenes with varying occlusion, lighting, and pose. To overcome the limitations of mobile depth data, they propose DTTDNet, a transformer-based 6DoF pose estimator specifically designed for depth-noise robustness. Extensive evaluations demonstrate that DTTDNet achieves state-of-the-art performance and strong noise robustness, making it a viable solution for augmented reality and robotics applications on mobile platforms.
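The abstract refers to fully quantized INT8 models without saying what quantization does to the numbers. The sketch below illustrates the affine (scale and zero-point) scheme commonly used for INT8 inference, under simplifying assumptions: one scale for the whole tensor, plain Python lists instead of real tensors, and helper names (`quantization_params`, `quantize`, `dequantize`) that are made up for this illustration and not taken from any talk or library.

```python
def quantization_params(x_min, x_max, q_min=-128, q_max=127):
    """Return (scale, zero_point) mapping the range [x_min, x_max] onto INT8."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:
        # Degenerate all-zero tensor: any positive scale works.
        return 1.0, 0
    zero_point = int(round(q_min - x_min / scale))
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to integers in [q_min, q_max]."""
    return [
        max(q_min, min(q_max, int(round(v / scale)) + zero_point))
        for v in values
    ]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats: x ~ scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


if __name__ == "__main__":
    activations = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical values
    scale, zero_point = quantization_params(min(activations), max(activations))
    q = quantize(activations, scale, zero_point)
    restored = dequantize(q, scale, zero_point)
    for original, approx in zip(activations, restored):
        print(f"{original:+.4f} -> {approx:+.4f}")
```

The round-trip error per value is on the order of one quantization step (`scale`), which is why the achievable accuracy of an INT8 model depends on how well the value ranges are calibrated.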
Speakers
- Andrii Ihnatov — Unknown
- Hao-Yun Chen — MediaTek
- Yu-Syuan Xu — MediaTek
- Andrey Ignatov
- Chia-Ming Chen — MediaTek
- Biao Wu — ZTE Corporation
- Jing Li — Samsung Research America, Mobile Processor Innovation (MPI) Lab
- Andrei Arhire — Alexandru Ioan Cuza University of Iasi, University of Wurzburg
- Chengyu Wang — Samsung Research America
- Presenter — Samsung MPI Lab
- Syed Shakib Sarwar — Meta
- Keling Yao — University of California, Berkeley
- Zixun Huang — University of California, Berkeley
- Seth Z. Zhao — Carnegie Mellon University
- Chuanyu Pan — University of California, Los Angeles
- Allen Y. Yang — University of California, Berkeley
- Shambhavi Balamuthu Sampath — Technical University of Munich, BMW Group
- Judeson Anthony Fernando — Technical University of Munich, BMW Group
- Moritz Thoma — Technical University of Munich, BMW Group
- Pierpaolo Mori — Technical University of Munich, BMW Group
- Nael Fasfous — Technical University of Munich, BMW Group
- Manoj-Rohit Vemparala — Technical University of Munich, BMW Group
- Alexander Frickenstein — Technical University of Munich, BMW Group
- Ulf Schlichtmann — Technical University of Munich, BMW Group
- Walter Stechele — Technical University of Munich, BMW Group
Talks (62)
- 00:02:49 — Andrii Ihnatov: Mobile AI Workshop: Organizational Information
- Introduction to the workshop logistics, including physical and virtual attendance, YouTube streaming, and availability of recordings and slides.
- 00:03:30 — Andrii Ihnatov: Mobile AI Workshop: Goals
- Overview of the workshop’s practical goals: deploying deep learning models on mobile hardware, optimizing models for edge devices, and running them on dedicated AI accelerators, highlighting the power of modern NPUs.
- 00:05:56 — Andrii Ihnatov: Today: Introductory Talk Topics
- Outline of the key topics to be covered in the introductory talk: software, hardware, acceleration, performance, power consumption, going beyond mobile, and real-world deployment use-cases and competitions.
- 00:08:49 — Andrii Ihnatov: Today: Talks and Paper Presentations
- Preview of the invited talks and paper presentations from industry vendors and research institutions, emphasizing their practical insights into mobile AI deployment.
- 00:09:42 — Andrii Ihnatov: Deep Learning on Mobile Devices, CVPR Papers
- Discussion on selected CVPR papers related to mobile AI, noting that few papers actually evaluate runtime on mobile devices despite claims of efficiency.
- 00:11:22 — Andrii Ihnatov: Running AI Models on Mobile Devices: Overview
- Explanation of the general workflow for running AI models on mobile devices: training the model, exporting it for mobile inference, and then running the exported model on a smartphone.
- 00:12:49 — Andrii Ihnatov: On-Device Mobile Inference: Frameworks Overview
- Introduction to popular mobile ML inference frameworks: TFLite/LiteRT, ONNX, ExecuTorch, and CoreML, emphasizing the critical choice of library for efficient deployment.
- 00:15:23 — Andrii Ihnatov: On-Device Mobile Inference: ONNX
- Analysis of ONNX for mobile inference, highlighting its ease of model conversion but noting its significant limitation of only supporting CPU-based inference on mobile devices.
- 00:17:05 — Andrii Ihnatov: On-Device Mobile Inference: ExecuTorch
- Discussion of ExecuTorch, its claimed support for CPUs, NPUs, and DSPs, and the complex, vendor-specific compilation process required to utilize hardware accelerators, making it less practical for general-purpose mobile AI.
- 00:24:00 — Andrii Ihnatov: On-Device Mobile Inference: CoreML
- Recommendation of CoreML for iOS/macOS devices due to its straightforward model conversion, lack of vendor-specific SDK requirements, and efficient inference support on Apple Neural Engine and Apple GPUs.
- 00:26:56 — Andrii Ihnatov: On-Device Mobile Inference: LiteRT
- Introduction of LiteRT (formerly TensorFlow Lite) as the preferred solution for Android devices, noting its rebranding for marketing purposes and its ability to convert models from Keras, TensorFlow, and PyTorch.
- 00:29:57 — Andrii Ihnatov: On-Device Mobile Inference: LiteRT Capabilities
- Overview of LiteRT’s key advantages: ability to run one model across all hardware, support for vendor’s NPUs through delegates without custom SDKs, GPU-based inference, and good built-in model quantization tools.
- 00:31:32 — Andrii Ihnatov: LiteRT GPU Delegate
- Detailed explanation of the LiteRT GPU delegate, which offers inference acceleration on any GPU supporting OpenCL/OpenGL, can run huge models, but is generally slower and less power-efficient than NPU inference, and not ideal for LLMs.
- 00:35:00 — Andrii Ihnatov: Qualcomm QNN LiteRT Delegate
- Recommendation of the Qualcomm QNN LiteRT delegate for Qualcomm SoCs, highlighting its broad support for Snapdragon processors, excellent performance on NPUs, and consistent updates across all supported hardware.
- 00:37:19 — Andrii Ihnatov: Qualcomm Hexagon LiteRT Delegate
- Discusses the Qualcomm Hexagon LiteRT Delegate, its support for legacy Hexagon DSPs in older Snapdragon SoCs, its limitations in performance and model support, and its suitability for small, low-resolution models.
- 00:38:39 — Andrii Ihnatov: MediaTek Neuron LiteRT Delegate
- Introduces the MediaTek Neuron LiteRT Delegate, highlighting its support for Dimensity SoCs, its ability to provide NPU/APU inference for various models including Stable Diffusion and LLMs, and its recommendation for MediaTek chipsets.
- 00:39:31 — Andrii Ihnatov: ARM NN LiteRT Delegate
- Explains the ARM NN LiteRT Delegate, which offers GPU-based model acceleration on Arm Mali GPUs and an additional CPU backend, but notes its performance is generally inferior to the generic TFLite GPU delegate.
- 00:41:33 — Andrii Ihnatov: LiteRT CoreML Delegate
- Covers the LiteRT CoreML Delegate for Apple hardware, pointing out its support for A12+ SoCs and Apple Neural Engine, but criticizing its poor model coverage, lack of INT8 support, and recommending direct CoreML Tools conversion for better performance.
- 00:43:08 — Andrii Ihnatov: Android Neural Networks API / TFLite NNAPI Delegate
- Describes the Android Neural Networks API (NNAPI) as an early attempt at on-device AI inference on Android, allowing direct NPU access without delegates, introduced in Android 8.1.
- 00:44:21 — Andrii Ihnatov: NNAPI is Officially Dead :(
- Announces the deprecation of NNAPI in Android 15 after seven years, attributing it to design complexities, vendor difficulties in driver development, and various technical and personal issues, leading to its replacement by other tools.
- 00:47:19 — Andrii Ihnatov: On-Device Mobile Inference: Frameworks Overview (Summary)
- Summarizes the recommended frameworks for mobile inference: LiteRT for Android (deploying one model across devices with vendor delegates) and CoreML for iOS.
- 00:48:05 — Andrii Ihnatov: MOBILE ML: AI HARDWARE
- Transitions to the second part of the presentation, focusing on the AI hardware used for on-device inference.
- 00:48:19 — Andrii Ihnatov: Qualcomm SoCs: HTP-Based
- Provides an overview of Qualcomm chipsets with Hexagon Tensor Processors (HTP), detailing their supported inference formats (INT8, FP16, FP32 for high-end, INT8 only for others) and performance scores via QNN and LiteRT QNN Delegate.
- 00:50:19 — Andrii Ihnatov: Qualcomm SoCs: Hexagon DSP-Based
- Discusses older Qualcomm chipsets featuring Hexagon DSPs, noting their INT8-only support, reliance on Hexagon Delegate or NNAPI due to QNN limitations, and suitability for simpler, lower-resolution image classification tasks.
- 00:53:29 — Andrii Ihnatov: MediaTek SoCs: Neuron Supported
- Presents MediaTek Dimensity chipsets supported by the Neuron Delegate, outlining their NPU capabilities (INT8, FP16, FP32) and performance, with Neuron Delegate offering superior results compared to NNAPI.
- 00:54:32 — Andrii Ihnatov: MediaTek SoCs: NNAPI Only
- Details MediaTek chipsets that are exclusively accessible via NNAPI, categorizing them by APU generation and highlighting their limited performance for complex models due to outdated NNAPI drivers.
- 00:56:59 — Andrii Ihnatov: Other Vendors: NNAPI Only
- Reviews other vendors like Google (Tensor TPU), Unisoc (NPU), and Kirin (NPU), noting their reliance on NNAPI but often suffering from outdated drivers or lack of active software development, rendering many NPUs unusable for modern AI tasks.
- 00:59:53 — Andrii Ihnatov: Other Notable Mentions
- Mentions Xiaomi’s XRING O1 and recent Kirin chipsets with NPUs that lack public library support, and Exynos chipsets whose public LiteRT delegate support was withdrawn by Samsung, making their NPUs inaccessible for external developers.
- 01:04:55 — Andrii Ihnatov: MOBILE ML: BENCHMARKING
- Introduces the third section of the presentation, focusing on the importance and methodology of benchmarking mobile AI performance.
- 01:05:19 — Andrii Ihnatov: AI Benchmark (ETHZ)
- Describes AI Benchmark (ETHZ), an Android application developed by the speaker’s team since 2018, used for measuring smartphone AI performance across various metrics like INT8/FP16/INT16 performance, inference times, memory, power efficiency, and accuracy.
- 01:06:21 — Andrii Ihnatov: AI Benchmark V6 Models
- Showcases the diverse set of models included in AI Benchmark V6, covering traditional computer vision (classification, segmentation, tracking), image processing, depth estimation, video super-resolution, large language models (Llama2, GPT-2), and neural generation models (Stable Diffusion).
- 01:08:25 — Andrii Ihnatov: AI Benchmark (ETHZ): Visualization
- Illustrates the visualization features of AI Benchmark, including real-time model output during tests and detailed results for each test, allowing users to compare performance across different backends and acceleration options.
- 01:09:52 — Andrii Ihnatov: Performance Ranking
- Highlights the AI Benchmark website (ai-benchmark.com/ranking_processors) where detailed performance rankings of mobile chipsets are published, demonstrating rapid year-over-year performance improvements, reaching levels comparable to recent desktop GPUs for complex models.
- 01:11:59 — Andrii Ihnatov: Detailed Results
- Explains how to access detailed results on the AI Benchmark website, showing per-chipset and per-model performance metrics like inference times and accuracy, and provides a rule of thumb for LLM/generation models requiring an AI Benchmark score of at least 2000.
- 01:51:59 — Andrii Ihnatov: 4K Quantized Image Super-Resolution Challenge
- This challenge focuses on 3x image upsampling to 4K resolution using fully-quantized TFLite INT8 models, evaluated on Qualcomm Snapdragon 8 Elite Hexagon NPU, with a PSNR constraint of greater than 30dB.
- 01:52:58 — Andrii Ihnatov: 4K Floating-Point Image Super-Resolution Challenge
- Similar to the quantized challenge, this task involves 3x image upsampling to 4K resolution but uses floating-point (FP32) TFLite models, evaluated on MediaTek Dimensity 9400 NPU, with a PSNR constraint of greater than 31dB.
- 01:53:54 — Andrii Ihnatov: Real-Time Video Super-Resolution Challenge
- This challenge focuses on 4x video upsampling at 10 frames per second to 720p resolution using TFLite FP32 models, evaluated on Snapdragon 8 Elite Adreno GPU or Dimensity 9400 Mali GPU, with a PSNR constraint of greater than 28dB.
- 01:54:18 — Andrii Ihnatov: Image Denoising Challenge
- The task is image denoising for FullHD images using TFLite FP32 models, evaluated on Snapdragon 8 Elite Adreno GPU or Dimensity 9400 Mali GPU, with a high PSNR constraint of greater than 37dB to ensure practical visual quality.
- 01:55:10 — Andrii Ihnatov: Realistic Bokeh Rendering Challenge
- This challenge involves rendering realistic bokeh effects on ~FullHD images using TFLite FP32 models, evaluated on Snapdragon 8 Elite Adreno GPU or Dimensity 9400 Mali GPU, with emphasis on MOS, PSNR/SSIM, and the visual realism of the output.
- 01:56:21 — Andrii Ihnatov: sRGB Image Enhancement Challenge
- The challenge is sRGB image enhancement for FullHD images using TFLite FP32 models, evaluated on Snapdragon 8 Elite Adreno GPU or Dimensity 9400 Mali GPU, with a PSNR constraint of greater than 22dB.
- 01:56:38 — Andrii Ihnatov: Learned Smartphone ISP Challenge
- This challenge focuses on RAW to RGB image restoration (Learned ISP) for 5MP images using TFLite FP32 models, evaluated on Snapdragon 8 Elite Adreno GPU or Dimensity 9400 Mali GPU, with a PSNR constraint of greater than 24dB.
- 01:56:59 — Andrii Ihnatov: Efficient Stable Diffusion Challenge
- An open challenge for neural image generation (Stable Diffusion) of 512x512 px images, evaluated on Apple M4 Neural Engine/GPU/CPU, prioritizing visual quality and runtime, supporting various frameworks like CoreML, ONNX, PyTorch, and TensorFlow.
- 01:58:43 — Andrii Ihnatov: Efficient LLMs Challenge
- An open challenge for neural text generation (LLMs), evaluated on Raspberry Pi 4 8GB, focusing on text quality and runtime (tokens/s), supporting ONNX, TFLite, PyTorch, and TensorFlow, with an emphasis on model compression techniques.
- 02:02:24 — Andrii Ihnatov: Q&A and Discussion on Mobile AI Benchmarking
- This segment addresses questions regarding power consumption measurement methodologies in mobile AI benchmarks and provides a comparative analysis of different AI frameworks like PyTorch (ExecuTorch) and TFLite (LiteRT), discussing their maturity, vendor support, and integration challenges.
- 02:29:19 — Hao-Yun Chen: Reasoning LLM Demo on MTK D9400
- Demonstrates the capability of MTK D9400+ SoC to run reasoning LLM models efficiently on edge devices, using the Monty Hall problem as an example.
- 02:30:49 — Yu-Syuan Xu: All-Around Scan Vision Mamba
- Introduces Vision Mamba as a new vision backbone that uses various spatial scanning strategies to capture diverse spatial information for image restoration tasks.
- 02:32:59 — Andrey Ignatov: Quantized Image Super-Resolution Challenge, Mobile AI 2025: Methods and Results
- Presents the objectives, dataset, evaluation metrics, and key takeaways from the Mobile AI 2025 Quantized Image Super-Resolution Challenge, emphasizing the importance of 8-bit quantization and efficient architectures for mobile NPUs.
- 03:06:39 — Andrey Ignatov: Performance Evaluation and Final Challenge Results of the MAI 2023 Quantized Image Super-Resolution Challenge
- Overview of evaluation metrics, challenges faced with Google Tensor’s runtime variability, and presentation of the final results table for the image super-resolution challenge, highlighting winning solutions and key techniques.
- 03:17:15 — Biao Wu: RepNet-VSR: Reparameterizable Architecture for High-Fidelity Video Super-Resolution
- Presentation of RepNet-VSR, an architecture utilizing reparameterization and neural architecture search to achieve high-fidelity and efficient video super-resolution on mobile NPUs, with a focus on balancing accuracy and runtime.
- 03:26:39 — Jing Li: CDVS: Compressed Domain On Device Memory Efficient 8K Video SlowMo
- Introduction of CDVS, a novel approach for 8K video slow-motion processing on mobile devices by operating in the compressed domain, significantly reducing computational and memory requirements while maintaining visual quality.
- 03:44:43 — Andrey Ignatov: Learned Smartphone ISP Challenge, Mobile AI 2025: Methods and Results
- Presentation of the results and winning solutions for the Learned Smartphone ISP Challenge, focusing on efficient and high-quality image processing on mobile devices.
- 03:50:29 — Andrei Arhire: Learned Lightweight Smartphone ISP with Unpaired Data
- Introduction of a novel method to train a learnable ISP with unpaired data, demonstrating strong generalization and robustness with promising results on real-world RAW-to-RGB datasets.
- 03:59:54 — Chengyu Wang: Compressed Domain Multiframe Processing
- Presentation of a novel approach for multi-frame image processing in the compressed domain, addressing challenges of high-resolution imaging and memory limitations on mobile devices.
- 04:21:19 — Presenter: Multi-Frame Processing Pipeline in Compressed Domain
- This talk details a Multi-Frame Processing (MFP) pipeline operating in a compressed domain, integrating functions like deghosting and HDR blending to reduce memory consumption and enable high-resolution image processing on mobile devices.
- 04:23:43 — Andrey Ignatov: Q&A and Workshop Break
- The moderator praises the presented work, discusses the potential for mobile device deployment, and announces a break for the workshop.
- 04:58:39 — Andrey Ignatov: Break
- A short break is announced, with the speaker indicating the workshop will resume shortly.
- 04:59:21 — Andrey Ignatov: MAI RGB Photo Enhancement Challenge
- The speaker presents the MAI RGB Photo Enhancement Challenge, focusing on enhancing low-quality smartphone photos, achieving good performance on mobile GPUs, and processing HD resolution photos on mobile devices, highlighting the winning solution’s architecture and efficiency.
- 05:03:17 — Syed Shakib Sarwar: PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers
- The talk introduces PETAH, a framework for parameter-efficient task adaptation for hybrid transformers, demonstrating improved performance and efficiency across various computer vision tasks, including object detection and instance segmentation, even with sparsity.
- 05:11:00 — Andrey Ignatov: Next Presentation Introduction
- Andrey Ignatov introduces the next presentation, ‘ActNAS: Generating Efficient YOLO Models using Activation NAS’, to be given by Sudhakar Sah.
- 05:11:30 — Sudhakar Sah: ActNAS: Generating Efficient YOLO Models using Activation NAS
- The speaker introduces ActNAS, a novel approach for generating efficient YOLO models by adapting activation functions per layer using a hardware-aware neural architecture search, demonstrating improved performance and efficiency compared to traditional methods.
- 05:35:59 — Keling Yao: Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset
- This presentation introduces the DTTD-Mobile Dataset and DTTDNet, a transformer-based 6DoF pose estimator, to address challenges of depth noise and distortion from mobile LiDAR sensors for robust pose estimation in AR and robotics.
- 06:13:19 — Shambhavi Balamuthu Sampath: REPFC: UNIVERSAL STRUCTURAL REPARAMETRIZATION BLOCK FOR HIGH PERFORMANCE, LIGHTWEIGHT DEEP NEURAL NETWORKS
- This talk introduces RepFC, a universal structural reparameterization block for Fully Connected (FC) layers in Deep Neural Networks (DNNs) to boost performance with no inference overhead.
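The "no inference overhead" claim behind structural reparameterization (RepFC above, RepConv in earlier talks) follows from linearity: parallel linear branches whose outputs are summed can be folded into a single layer after training. The sketch below is a generic illustration of that idea for fully connected layers, not the RepFC block itself. It assumes there is no normalization or non-linearity inside the branches, and the helper names (`linear`, `identity_branch`, `merge_branches`) and all weights are hypothetical.

```python
def linear(weight, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def identity_branch(size):
    """A skip connection written as a linear layer (W = I, b = 0)."""
    weight = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    return weight, [0.0] * size


def merge_branches(branches):
    """Fold parallel linear branches (outputs summed) into one (W, b) pair."""
    weights, biases = zip(*branches)
    merged_w = [[sum(ws) for ws in zip(*rows)] for rows in zip(*weights)]
    merged_b = [sum(bs) for bs in zip(*biases)]
    return merged_w, merged_b


if __name__ == "__main__":
    # Training-time block: two FC branches plus a skip connection.
    branch_a = ([[1.0, 2.0], [0.0, -1.0]], [0.5, 0.0])
    branch_b = ([[0.5, 0.0], [1.0, 1.0]], [0.0, -0.5])
    branches = [branch_a, branch_b, identity_branch(2)]

    x = [2.0, 3.0]
    multi_branch = [sum(outs) for outs in zip(*(linear(w, b, x) for w, b in branches))]

    # Inference-time block: a single FC layer with the merged parameters.
    merged_w, merged_b = merge_branches(branches)
    single_layer = linear(merged_w, merged_b, x)

    print(multi_branch, single_layer)
```

Both paths produce the same output, so the extra branches cost nothing at inference time once merged.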
Key Takeaways
- Deploying AI models on mobile devices requires careful selection and optimization of software and hardware.
- Modern mobile NPUs and GPUs offer significant computational power, comparable to desktop GPUs, which should be leveraged for efficient inference.
- Many research papers that claim to offer efficient models for mobile devices lack runtime evaluation on actual hardware.
- ONNX is easy for model conversion but is limited to CPU-based inference on mobile, making it unsuitable for leveraging accelerators.
- ExecuTorch, while promising, requires complex, vendor-specific compilation for each target hardware, making it impractical for general-purpose cross-platform deployment.
- CoreML is the recommended and most straightforward solution for deploying AI models on iOS and macOS devices, offering excellent integration with Apple’s Neural Engine and GPUs.
- LiteRT (formerly TensorFlow Lite) is the leading framework for Android, allowing conversion from various ML frameworks (Keras, TensorFlow, PyTorch) and supporting one model across diverse hardware via delegates.
- LiteRT delegates (like GPU and Qualcomm QNN) enable efficient inference on mobile GPUs and NPUs without needing custom vendor SDKs, simplifying cross-platform development on Android.
- GPU inference is generally slower and consumes more power than NPU inference, and is not yet optimized for large language models (LLMs), which the speaker attributes to their heavy use of logical operations.
- LiteRT Delegates are essential for leveraging vendor-specific hardware accelerators on Android, offering tailored solutions for Qualcomm, MediaTek, and ARM platforms.
- The Android NNAPI, once a universal interface for hardware acceleration, has been deprecated in Android 15 due to complexities in driver design and inconsistent vendor support, necessitating a shift to other frameworks.
- Mobile AI hardware is rapidly advancing, with significant year-over-year performance gains in NPUs, enabling the execution of increasingly complex models, including large language models and neural generation models like Stable Diffusion, on high-end smartphones.
- The AI Benchmark (ETHZ) tool provides a comprehensive and publicly available platform for evaluating and comparing the AI performance of various mobile chipsets across a diverse range of models and metrics.
- For optimal performance on iOS, direct conversion of TensorFlow models to CoreML using CoreML Tools is recommended over the outdated LiteRT CoreML Delegate.
- Many older or lower-end NPUs from various vendors (e.g., older MediaTek APUs, Kirin, Unisoc) have limited capabilities and outdated software support, making them unsuitable for modern or complex AI tasks, often performing worse than generic GPU delegates.
- The Mobile AI Challenges for CVPR & ICCV 2025 emphasize practical on-device AI solutions, requiring a balance between high accuracy (e.g., PSNR, MOS) and efficient runtime on specific mobile hardware platforms.
- The choice of runtime platform significantly impacts model development and optimization strategies, with newer NPUs (like Snapdragon 8 Elite and MediaTek Dimensity 9400) offering substantial performance gains.
- Quantization (INT8) is a key technique for efficiency in some challenges, while others allow floating-point (FP32) models, offering flexibility in model design.
- Visual quality and user experience are critical, especially in tasks like Bokeh Rendering and Image Denoising, where high numerical metrics alone are insufficient if the output is not aesthetically pleasing or introduces artifacts.
- For open challenges like Stable Diffusion and LLMs, participants have flexibility in dataset choice and framework, but optimization for specific hardware (e.g., Apple M4 Neural Engine, Raspberry Pi 4) is crucial for competitive runtime.
- Measuring power consumption accurately is vital for mobile AI, requiring careful methodology to isolate model-specific power usage from background processes.
- The landscape of mobile AI frameworks is evolving; while TFLite (LiteRT) currently offers broad vendor support and maturity, newer integrations like PyTorch’s ExecuTorch are still in early stages and face stability and integration challenges.
- Optimizing AI models for specific mobile hardware, especially NPUs, requires careful consideration of driver support and execution graphs, as minor model changes can drastically alter runtime.
- Reparameterization techniques (e.g., RepConv) and Quantization-Aware Training (QAT) are effective strategies for developing compact and efficient super-resolution models with minimal accuracy loss.
- Processing video in the compressed domain offers a promising solution for enabling computationally and memory-intensive tasks like 8K video slow-motion on resource-constrained mobile devices.
- The CDVS system demonstrates significant reductions in computation, time, and memory consumption for 8K video slow-motion by moving AI blocks to the compressed domain.
- Future work in mobile AI for video processing will likely focus on further system optimization and integrating generative models with compressed domain technologies.
- A compressed domain MFP pipeline can achieve significant memory compression (3.3x) for multi-frame image processing.
- The proposed pipeline effectively integrates demosaicing, denoising, registration, deghosting, and HDR blending within a compressed feature space.
- The method successfully reduces artifacts from warping and accurately handles dynamic scenes and saturated regions.
- This approach paves the way for high-resolution multi-frame processing on mobile devices, with future work focusing on on-device implementation and unaddressed aspects like tone mapping.
- The MAI RGB Photo Enhancement Challenge successfully identified solutions capable of significantly improving low-quality smartphone photos on mobile GPUs, with the winning model demonstrating high performance despite its size.
- Mean Opinion Score (MOS) and user experience scores are crucial for evaluating image enhancement, as traditional metrics like PSNR and SSIM do not always reflect real image quality.
- The size and number of parameters in a model are not always proportional to its runtime; efficient architecture design tailored to target hardware is key for mobile deployment.
- PETAH provides a parameter-efficient framework for adapting hybrid transformers to various computer vision tasks, achieving high accuracy with minimal additional parameters.
- Applying LoRA adapters to both convolutional and attention layers in hybrid transformers is an effective strategy for task adaptation on resource-constrained edge devices.
- Sparsity and pruning techniques can significantly reduce the number of parameters in backbones while maintaining high accuracy, further enhancing model efficiency for mobile deployment.
- ActNAS proposes a novel approach to neural architecture search by optimizing activation functions per layer, leading to more efficient YOLO models.
- Mixed activation models, where each layer can have a different activation function, can potentially achieve a better latency/RAM trade-off than models using a single activation function.
- Hardware-in-the-loop search (hardware-aware NAS) is critical for designing efficient models, because the latency of a given activation function can vary significantly across hardware platforms (e.g., NPUs vs. GPUs).
- Mobile depth sensors (e.g., iPhone LiDAR) pose significant challenges for 6DoF pose estimation due to low-resolution, distorted, and noisy depth maps.
- The DTTD-Mobile Dataset provides a new, realistic benchmark for 6DoF pose estimation in mobile environments, featuring diverse scenes and objects captured with high accuracy.
- DTTDNet, a transformer-based pose estimator, is robust to depth noise and achieves state-of-the-art 6DoF pose estimation performance on mobile devices.
- This work enables more reliable and accurate augmented reality and robotics applications on mobile platforms by addressing the inherent limitations of their depth sensing capabilities.
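The reparameterization takeaway above can be made concrete with a small numerical check. The sketch below assumes a RepVGG-style block with three parallel branches (3x3 convolution, 1x1 convolution, identity) and shows that they collapse into a single 3x3 convolution at inference time. The helper names `conv2d_same` and `fuse_branches` are hypothetical, BatchNorm folding is left out for brevity, and the talk does not specify the exact block layout, so treat this as an illustration rather than the presented method.

```python
import numpy as np


def conv2d_same(x, w, b):
    """Naive stride-1 'same' convolution (cross-correlation, as in deep learning).

    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k, b: (C_out,)
    """
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j), mixed across input channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]


def fuse_branches(w3, b3, w1, b1):
    """Merge 3x3 + 1x1 + identity branches into one 3x3 kernel and bias."""
    c_out, c_in = w3.shape[:2]
    assert c_out == c_in, "identity branch needs matching channel counts"
    w = w3.copy()
    w[:, :, 1, 1] += w1[:, :, 0, 0]  # a 1x1 kernel sits at the 3x3 center
    idx = np.arange(c_out)
    w[idx, idx, 1, 1] += 1.0  # identity is a centered delta per channel
    return w, b3 + b1


rng = np.random.default_rng(0)
c = 4
x = rng.standard_normal((c, 8, 8))
w3, b3 = rng.standard_normal((c, c, 3, 3)), rng.standard_normal(c)
w1, b1 = rng.standard_normal((c, c, 1, 1)), rng.standard_normal(c)

# Training-time form: three parallel branches summed.
multi_branch = conv2d_same(x, w3, b3) + conv2d_same(x, w1, b1) + x
# Inference-time form: a single 3x3 convolution.
w_fused, b_fused = fuse_branches(w3, b3, w1, b1)
single_branch = conv2d_same(x, w_fused, b_fused)

print(np.allclose(multi_branch, single_branch))
```

The two outputs should agree up to floating-point rounding, which is why the extra training-time branches add no inference cost once fused.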
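The LoRA takeaway above can likewise be sketched in a few lines. The example assumes the common formulation in which a frozen weight W receives a trainable low-rank update (alpha / r) * B @ A, and, for convolutions, that the kernel is viewed as a (c_out, c_in * k * k) matrix. The function names, shapes, rank, and scaling are illustrative assumptions, not details taken from PETAH.

```python
import numpy as np


def lora_update(a, b, alpha):
    """Low-rank weight update (alpha / r) * B @ A, with rank r = A.shape[0]."""
    return (alpha / a.shape[0]) * (b @ a)


def adapt_conv_weight(w, a, b, alpha):
    """Add a LoRA update to a conv kernel viewed as a (c_out, c_in*k*k) matrix."""
    return w + lora_update(a, b, alpha).reshape(w.shape)


def trainable_fraction(w, a, b):
    """Share of adapter parameters relative to the frozen kernel."""
    return (a.size + b.size) / w.size


rng = np.random.default_rng(0)
c_out, c_in, k, rank = 16, 8, 3, 2

w = rng.standard_normal((c_out, c_in, k, k))            # frozen backbone kernel
a = rng.standard_normal((rank, c_in * k * k)) * 0.01    # trainable factor A
b = np.zeros((c_out, rank))                             # trainable factor B, zero-initialized

# With B = 0 the adapted kernel equals the original, so training starts
# from the pretrained behavior.
w_start = adapt_conv_weight(w, a, b, alpha=4.0)

# After (simulated) training, B is non-zero; the merged kernel keeps the
# original shape, so the adapter can be folded in before deployment.
b_trained = rng.standard_normal((c_out, rank))
w_merged = adapt_conv_weight(w, a, b_trained, alpha=4.0)

print(trainable_fraction(w, a, b))
```

Because the update can be merged into the kernel, the adapted layer has the same inference cost as the original one; only the small factors A and B need to be stored per task.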
Methods / Models / Datasets Mentioned
ARM NN LiteRT Delegate · ActNAS (Activation NAS) · Adreno GPU · Alpha blend (for HDR blending) · Android Neural Networks API (NNAPI) · AntSR · Apple Neural Engine · Aruco markers · Attention (ATTN) LoRA · Azure Kinect · Bicubic upsampling · CDVS · Conv 1x1 · Convolution LoRA Adapters · CoreML · CoreML Tools · DIV2K · DPED · DTTD-Mobile Dataset · DTTDNet · DaHua-IIG (winning solution) · DeepLabV3+ · Depth2Space · EBB! · ENN (Exynos Neural Network) library · ESRGAN · EVSRNet · EfficientFormer (backbone) · EfficientNet-B4 · Encoder-decoder · ExecuTorch · FGNAS · Frame Interpolation · Fujifilm UltraISP · GPT-2 · Generative Adversarial Network (GAN) Loss · Generative SlowMo algorithms · Global Feature Extraction Module · Google Tensor SoC · Google Tensor TPU · Hardswish · Hexagon LiteRT Delegate · Hexagon NPU · Hierarchy encoder · IMDN · INT8 quantization · Inception-V3 · Keras · Keras/TensorFlow · L1 Loss · L2 Loss · Leaky ReLU · Learning-based method (for deghosting) · Linear LoRA Adapters · Linear Probing · LiteRT · LiteRT CoreML Delegate · LiteRT Delegate · Llama2 · MAI 2021 Denoising Dataset · MLP LoRA · MS-SSIM Loss · MV3 Depth · Mali GPU · MediaTek Neuron LiteRT Delegate · MetaBlock 3D · MetaBlock 4D · MiDaS V3 · Micro-NAS · MicroISP · MobileBERT · MobileNet-V3 · MobileViT-V2 · NAFNet · ONNX · Occlusion Mask Estimation · OpenCL · OpenGL · OptiTrack motion capture system · Optical Flow Estimation · PETAH (Parameter Efficient Task Adaptation for Hybrid Transformers) · PSNR · Parallel Linear U-net (PLU-Net) · Perceptual Loss (VGG-based) · Pruning · PyNet-V2 · PyTorch · QNN (Qualcomm Neural Network SDK) · QNN delegate · Qualcomm QNN · Quantized-aware training (QAT) · RAFT · RCBR · REDS · ReLU (Rectified Linear Unit) · Real-sense Camera · RepConv · RepNet-VSR · ResNet · SRGAN · SSIM · Segment Anything · SiLU (Sigmoid Linear Unit) · Skip connections · Snapdragon tool · Sparsity · Stable Diffusion · Swin Transformer · TFLite · Task-Specific Head · TensorFlow · TensorFlow Lite · U-Net · Unet · VQ module · VQ-VAE · Vector quantization · ViT (Vision Transformer) · ViT Transformer · VideoSR · XLSR · YOLOv5n · YOLOv8m · Yolo-V8 · Z6 · ZX VIP (solution) · Zero Cost Network Architecture Search (ZC-NAS)
Topics
6DoF Pose Estimation · AI Benchmark (ETHZ) · AI Benchmarking · AI Frameworks · AI Hardware Accelerators · ARM NN Delegate · Activation Functions · Android NNAPI · Artifact Reduction · Augmented Reality (AR) · Bokeh Effect Rendering · Compressed Domain Processing · Computer Vision · CoreML · CoreML Delegate · Cross-Platform Deployment · Deep Learning Deployment · Deghosting · Demosaicing · Denoising · Depth Map Distortion · Depth Noise · Edge AI · Edge Devices · Encoder-Decoder · Evaluation Metrics (PSNR, SSIM, MOS, Runtime) · ExecuTorch · Floating-Point Models · GPUs · HDR Blending · Hardware Acceleration (NPUs, GPUs) · Hardware Accelerators · Hardware-aware model optimization · Hybrid Transformers · Image Denoising · Image Enhancement · Image Quality · Image Registration · Image Super-Resolution · Instance Segmentation · LLMs on Mobile · Large Language Models (LLMs) · Learned Smartphone ISP · LiteRT · LiteRT Delegates · Low-Rank Adaptation (LoRA) · MediaTek Neuron Delegate · Memory Consumption · Mobile AI · Mobile AI Challenges · Mobile AI Inference · Mobile Depth Sensors · Mobile Devices · Mobile Platforms · Model Optimization · Model Quantization · Model efficiency (size, computation, memory) · Multi-Frame Processing (MFP) · NNAPI Deprecation · NPU Performance · NPUs · Neural Architecture Search (NAS) · Neural Image Generation · Non-Gaussian Noise · ONNX · Object Detection · On-Device Inference · On-device AI · On-device AI inference · Parameter Efficient Fine-Tuning (PEFT) · Performance Benchmarking · Performance Ranking · Photo Enhancement · Power Consumption · Power Consumption Measurement · Qualcomm Hexagon DSP/HTP · Quantization-Aware Training (QAT) · Quantized Image Super-Resolution · RGBD Dataset · Robotics · Runtime Performance Optimization · Runtime performance evaluation · Semantic Segmentation · Sparsity · Stable Diffusion on Mobile · TensorFlow Lite · Transformer-based Pose Estimation · Vector Quantization (VQ) · Video Slow Motion (SlowMo) · Video Super-Resolution · Video Super-Resolution (VSR) · Visual Quality Assessment · YOLO Models · iPhone LiDAR