CVPR 2024 Workshop
Event: CVPR 2024 Workshop · Duration: 189 min
Abstract
This workshop focuses on the latest advancements in retail vision, covering topics such as automating visual generative AI optimization, large-scale product recognition for ad performance, and geospatial intelligence. It delves into the GroceryVision Dataset V2 and its associated challenges, including video temporal action localization and multi-modal product retrieval. The workshop also features presentations on cutting-edge models like Grounding DINO for open-set object detection and a low-cost visual-inertial SLAM system for large-scale store mapping. Key themes include leveraging human-as-a-model approaches, addressing data collection and annotation challenges, and exploring the application of advanced AI techniques in real-world retail environments.
Talks (9)
- 00:00:00 — Ehud Barnea: Automating Visual GenAI Optimization with Human-as-a-Model
- This talk discusses challenges in training and optimizing generative AI models, focusing on retail applications and a human-as-a-model approach for evaluation and optimization.
- 00:22:40 — Koby Bibas: Large Scale Product Recognition
- This talk focuses on large-scale product recognition for improving ad performance, detailing how product recognition can enhance static ads to perform like dynamic ads by leveraging product tags and personalized product ordering.
- 00:46:02 — Roberto Pierdicca: Space Sensing, Phygital Spaces and (Geo)Spatial Intelligence: are we running behind AI?
- This talk explores the intersection of space sensing, phygital spaces, and geospatial intelligence, questioning whether current AI capabilities are sufficient to keep pace with the evolving demands of these fields.
- 01:18:50 — Austen Groener: The GroceryVision Dataset V2 and Challenges
- This talk introduces the GroceryVision Dataset V2, a public dataset for physical retail AI research, and outlines the challenges associated with it, including video temporal action localization and multi-modal product retrieval.
- 01:31:28 — Zhenhua Liu: A Unified Model for Video Temporal and Spatial Temporal Action Recognition
- This talk presents a unified model for video temporal and spatial temporal action recognition, detailing the architecture and performance of their solution in the GroceryVision Dataset V2 Challenge.
- 01:39:29 — Tianyi Wang: Multi-modal Product Retrieval Challenge First Place Solution
- This talk presents the first-place solution for the Multi-modal Product Retrieval Challenge, detailing the BLIP architecture, Fourier augmentation, and sigmoid loss function used to achieve high accuracy in retrieving product identities from images and text.
- 01:48:38 — Sarthak Srivastava: Multi-modal Product Retrieval Based on Bootstrapped Language Image Pre-training (BLIP)
- This talk presents a solution for the Multi-modal Product Retrieval Challenge based on the BLIP architecture, detailing the use of Fourier augmentation and a sigmoid loss function for improved performance.
- 01:58:50 — Yosi Keller: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- This talk introduces Grounding DINO, a novel approach that combines DINO with grounded pre-training for open-set object detection, leveraging cross-attention mechanisms for image-to-image interaction and template-based prompting.
- 02:34:19 — Sharmin Rahman: Large-scale Store Mapping
- This talk introduces a low-cost, visual-inertial, large-scale mapping service for physical retail environments, detailing the challenges of mapping grocery stores and the proposed SLAM workflow.
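The two retrieval talks above both mention a sigmoid loss for matching product images to text. As a rough illustration only, the sketch below assumes a SigLIP-style pairwise formulation, in which every image-text pair in a batch is scored independently as a match or a non-match. The summary does not confirm that either team used this exact form; the function name `pairwise_sigmoid_loss`, the `temperature` and `bias` values, and the toy one-hot embeddings are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def pairwise_sigmoid_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Assumed SigLIP-style loss: row i of image_emb matches row i of text_emb.

    Each (image, text) pair is treated as an independent binary decision:
    label +1 on the diagonal (true pairs), -1 everywhere else.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = temperature * (img @ txt.T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp keeps it numerically stable.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / len(img)


# Toy batch: four products with one-hot "embeddings" (illustrative data only).
aligned_images = np.eye(4)
aligned_texts = np.eye(4)
shuffled_texts = np.roll(aligned_texts, 1, axis=0)  # every caption now mismatched

loss_aligned = pairwise_sigmoid_loss(aligned_images, aligned_texts)
loss_shuffled = pairwise_sigmoid_loss(aligned_images, shuffled_texts)
print(loss_aligned < loss_shuffled)  # correctly paired batch gives the lower loss
```

Unlike a softmax contrastive loss, this form needs no normalization across the batch, which is one commonly cited reason for choosing it; whether that motivated the challenge entries is not stated in the summary.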
Key Takeaways
- Generative AI optimization in retail can be enhanced by human-as-a-model approaches, addressing challenges like subjective evaluation and the need for rapid optimization.
- Large-scale product recognition can improve ad performance in retail by letting static ads behave more like dynamic ads, using product tags and personalized product ordering.
- The integration of space sensing, phygital spaces, and geospatial intelligence presents significant opportunities for AI, requiring robust data acquisition and processing techniques.
- The GroceryVision Dataset V2 offers a valuable resource for retail AI research, providing diverse data for challenges like video temporal action localization and multi-modal product retrieval.
- Advanced models like Grounding DINO and visual-inertial SLAM are pushing the boundaries of object detection and mapping in complex environments, with applications extending beyond traditional ads to social media and augmented reality experiences.
Methods / Models / Datasets Mentioned
Human-as-a-Model · BLIP architecture · Fourier augmentation · Sigmoid loss · DINO · Grounded Pre-Training · Cross-attention · 4DCV · SLAM · Visual Odometry (VO) · Visual Simultaneous Localization and Mapping (VSLAM) · Structure from Motion (SfM) · ORB-SLAM3 · Open-VINS · ECO SLAM · Pose-graph optimization (PGO) · Local Optimization (LO) · Levenberg-Marquardt Trust Region optimization · Sparse Schur Linear solver · DBoW2 · RANSAC · PnP RANSAC · LSTM · S-GAN · RAG prompt · CLIP · CLIP-ViT-large · CLIP-ConvNeXt-large · EMA · FGM · VideoMAE · Bi-LSTM · Object Proposal Layer · Interest-Oriented Object Decoding Layer · Relation Classification Layer · Non-Maximum Suppression (NMS) · Transformer Decoders · Transformer Encoder · Transformer Decoder · Self-attention · FFN · Multi-head Attention · Segment Anything Model (SAM)
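Non-Maximum Suppression appears in the list above, and the workshop's temporal action localization challenge is a setting where it is commonly applied to time segments rather than bounding boxes. The sketch below shows a generic greedy NMS over `(start, end)` segments. It is not taken from any of the talks; the function names, the 0.5 IoU threshold, and the sample segments are assumptions chosen for illustration.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring segment, drop heavy overlaps, repeat.

    Returns the indices of the kept segments, ordered by descending score.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        overlaps_kept = any(
            temporal_iou(segments[i], segments[k]) > iou_threshold for k in kept
        )
        if not overlaps_kept:
            kept.append(i)
    return kept


# Hypothetical detections: two near-duplicate proposals and one separate action.
segments = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept_indices = temporal_nms(segments, scores)
print(kept_indices)  # [0, 2]: the lower-scoring near-duplicate is suppressed
```

The first two segments overlap with an IoU of 9/11, above the threshold, so only the higher-scoring one survives; the third does not overlap either and is kept.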
Topics
Retail Vision · Generative AI Optimization · Product Recognition · Geospatial Intelligence · GroceryVision Dataset · Video Temporal Action Localization · Multi-modal Product Retrieval · Grounding DINO · SLAM · Human-as-a-Model
Notes
Open for commentary — connections to other work, critiques, follow-up reading.