CVPR 2024 Workshop

Event: CVPR 2024 Workshop · Duration: 189 min

Abstract

This workshop focuses on the latest advancements in retail vision, covering topics such as automating visual generative AI optimization, large-scale product recognition for ad performance, and geospatial intelligence. It delves into the GroceryVision Dataset V2 and its associated challenges, including video temporal action localization and multi-modal product retrieval. The workshop also features presentations on cutting-edge models like Grounding DINO for open-set object detection and a low-cost visual-inertial SLAM system for large-scale store mapping. Key themes include leveraging human-as-a-model approaches, addressing data collection and annotation challenges, and exploring the application of advanced AI techniques in real-world retail environments.

Talks (9)

00:00:00 — Ehud Barnea: Automating Visual GenAI Optimization with Human-as-a-Model
- This talk discusses challenges in training and optimizing generative AI models, focusing on retail applications and a human-as-a-model approach for evaluation and optimization.
00:22:40 — Koby Bibas: Large Scale Product Recognition
- This talk focuses on large-scale product recognition for improving ad performance, detailing how product recognition can enhance static ads to perform like dynamic ads by leveraging product tags and personalized product ordering.
00:46:02 — Roberto Pierdicca: Space Sensing, Phygital Spaces and (Geo)Spatial Intelligence: are we running behind AI?
- This talk explores the intersection of space sensing, phygital spaces, and geospatial intelligence, questioning whether current AI capabilities are sufficient to keep pace with the evolving demands of these fields.
01:18:50 — Austen Groener: The GroceryVision Dataset V2 and Challenges
- This talk introduces the GroceryVision Dataset V2, a public dataset for physical retail AI research, and outlines the challenges associated with it, including video temporal action localization and multi-modal product retrieval.
01:31:28 — Zhenhua Liu: A Unified Model for Video Temporal and Spatial Temporal Action Recognition
- This talk presents a unified model for video temporal and spatial temporal action recognition, detailing the architecture and performance of their solution in the GroceryVision Dataset V2 Challenge.
01:39:29 — Tianyi Wang: Multi-modal Product Retrieval Challenge First Place Solution
- This talk presents the first-place solution for the Multi-modal Product Retrieval Challenge, detailing the BLIP architecture, Fourier augmentation, and sigmoid loss function used to achieve high accuracy in retrieving product identities from images and text.
01:48:38 — Sarthak Srivastava: Multi-modal Product Retrieval Based on Bootstrapped Language Image Pre-training (BLIP)
- This talk presents a solution for the Multi-modal Product Retrieval Challenge based on the BLIP architecture, detailing the use of Fourier augmentation and a sigmoid loss function for improved performance.
01:58:50 — Yosi Keller: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- This talk introduces Grounding DINO, a novel approach that combines DINO with grounded pre-training for open-set object detection, leveraging cross-attention mechanisms for image-to-image interaction and template-based prompting.
02:34:19 — Sharmin Rahman: Large-scale Store Mapping
- This talk introduces a low-cost visual-inertial based large-scale mapping service for physical retail environments, detailing the challenges of mapping in grocery stores and the proposed SLAM workflow.
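Both product retrieval talks above mention a sigmoid loss for matching images to text. The summaries do not say which formulation was used, so the sketch below assumes the pairwise sigmoid loss common in contrastive image-text training: every image-text pair in a batch is scored independently as a match or non-match. The function names and the `temperature` and `bias` values are illustrative assumptions, not details taken from the talks.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Assumed formulation: each (image i, text j) pair is a binary problem.

    Pairs with i == j are positives (label +1); all others are negatives
    (label -1). The loss is summed over all pairs and divided by batch size.
    """
    images = [l2_normalize(e) for e in image_embs]
    texts = [l2_normalize(e) for e in text_embs]
    n = len(images)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(images[i], texts[j]))
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * (temperature * similarity + bias))
    return total / n


if __name__ == "__main__":
    imgs = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # text i matches the other image
    print("aligned:", pairwise_sigmoid_loss(imgs, aligned))
    print("swapped:", pairwise_sigmoid_loss(imgs, swapped))
```

In this toy batch the aligned pairing should score a lower loss than the swapped one, which is the behavior a retrieval model is trained toward.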

Key Takeaways

Generative AI optimization in retail can be enhanced by human-as-a-model approaches, addressing challenges like subjective evaluation and the need for rapid optimization.
Large-scale product recognition can improve ad performance in retail by helping static ads perform like dynamic ads, using product tags and personalized product ordering.
The integration of space sensing, phygital spaces, and geospatial intelligence presents significant opportunities for AI, requiring robust data acquisition and processing techniques.
The GroceryVision Dataset V2 offers a valuable resource for retail AI research, providing diverse data for challenges like video temporal action localization and multi-modal product retrieval.
Advanced models like Grounding DINO and visual-inertial SLAM are pushing the boundaries of object detection and mapping in complex environments, with applications extending beyond traditional ads to social media and augmented reality experiences.

Methods / Models / Datasets Mentioned

Human-as-a-Model
BLIP architecture
Fourier augmentation
Sigmoid loss
DINO
Grounded Pre-Training
Cross-attention
4DCV
SLAM
Visual Odometry (VO)
Visual Simultaneous Localization and Mapping (VSLAM)
Structure from Motion (SfM)
ORB-SLAM3
Open-VINS
ECO SLAM
Pose-graph optimization (PGO)
Local Optimization (LO)
Levenberg-Marquardt Trust Region optimization
Sparse Schur Linear solver
DBoW2
RANSAC
PnP RANSAC
LSTM
S-GAN
RAG prompt
CLIP
CLIP-ViT-large
CLIP-ConvNeXt-large
EMA
FGM
VideoMAE
Bi-LSTM
Object Proposal Layer
Interest-Oriented Object Decoding Layer
Relation Classification Layer
Non-Maximum Suppression (NMS)
Transformer Encoder
Transformer Decoder
Self-attention
FFN
Multi-head Attention
Segment Anything Model (SAM)

Topics

Retail Vision · Generative AI Optimization · Product Recognition · Geospatial Intelligence · GroceryVision Dataset · Video Temporal Action Localization · Multi-modal Product Retrieval · Grounding DINO · SLAM · Human-as-a-Model
