CVPR 2024 Workshop

Event: CVPR 2024 Workshop · Duration: 189 min

Abstract

This workshop focuses on the latest advancements in retail vision, covering topics such as automating visual generative AI optimization, large-scale product recognition for ad performance, and geospatial intelligence. It delves into the GroceryVision Dataset V2 and its associated challenges, including video temporal action localization and multi-modal product retrieval. The workshop also features presentations on cutting-edge models like Grounding DINO for open-set object detection and a low-cost visual-inertial SLAM system for large-scale store mapping. Key themes include leveraging human-as-a-model approaches, addressing data collection and annotation challenges, and exploring the application of advanced AI techniques in real-world retail environments.

Talks (9)

  • 00:00:00 - Ehud Barnea: Automating Visual GenAI Optimization with Human-as-a-Model
    • This talk discusses challenges in training and optimizing generative AI models, focusing on retail applications and a human-as-a-model approach for evaluation and optimization.
  • 00:22:40 - Koby Bibas: Large Scale Product Recognition
    • This talk focuses on large-scale product recognition for improving ad performance, detailing how product recognition can enhance static ads to perform like dynamic ads by leveraging product tags and personalized product ordering.
  • 00:46:02 - Roberto Pierdicca: Space Sensing, Phygital Spaces and (Geo)Spatial Intelligence: are we running behind AI?
    • This talk explores the intersection of space sensing, phygital spaces, and geospatial intelligence, questioning whether current AI capabilities are sufficient to keep pace with the evolving demands of these fields.
  • 01:18:50 - Austen Groener: The GroceryVision Dataset V2 and Challenges
    • This talk introduces the GroceryVision Dataset V2, a public dataset for physical retail AI research, and outlines the challenges associated with it, including video temporal action localization and multi-modal product retrieval.
  • 01:31:28 - Zhenhua Liu: A Unified Model for Video Temporal and Spatial Temporal Action Recognition
    • This talk presents a unified model for video temporal and spatial temporal action recognition, detailing the architecture and performance of their solution in the GroceryVision Dataset V2 Challenge.
  • 01:39:29 - Tianyi Wang: Multi-modal Product Retrieval Challenge First Place Solution
    • This talk presents the first-place solution for the Multi-modal Product Retrieval Challenge, detailing the BLIP architecture, Fourier augmentation, and sigmoid loss function used to achieve high accuracy in retrieving product identities from images and text.
  • 01:48:38 - Sarthak Srivastava: Multi-modal Product Retrieval Based on Bootstrapped Language Image Pre-training (BLIP)
    • This talk presents a solution for the Multi-modal Product Retrieval Challenge based on the BLIP architecture, detailing the use of Fourier augmentation and a sigmoid loss function for improved performance.
  • 01:58:50 - Yosi Keller: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
    • This talk introduces Grounding DINO, a novel approach that combines DINO with grounded pre-training for open-set object detection, leveraging cross-attention mechanisms for image-to-image interaction and template-based prompting.
  • 02:34:19 - Sharmin Rahman: Large-scale Store Mapping
    • This talk introduces a low-cost visual-inertial based large-scale mapping service for physical retail environments, detailing the challenges of mapping in grocery stores and the proposed SLAM workflow.
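
The two retrieval talks above mention a sigmoid loss for image-text matching without giving its form. As a minimal sketch, assuming a pairwise sigmoid loss in the style of SigLIP (the talks' exact formulation may differ), each image-text pair in a batch is treated as an independent binary classification. The function names and the default `temperature` and `bias` values are illustrative assumptions, not taken from the talks.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of paired image/text embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same product.
    Matching pairs get label +1, every other pair in the batch gets label -1.
    temperature and bias are hypothetical defaults; in practice they are learned.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    batch_size = len(images)
    total = 0.0
    for i in range(batch_size):
        for j in range(batch_size):
            similarity = sum(a * b for a, b in zip(images[i], texts[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / batch_size


# Toy batch: two products with perfectly aligned vs. swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = pairwise_sigmoid_loss(images, aligned_texts)
swapped_loss = pairwise_sigmoid_loss(images, swapped_texts)
```

Unlike a softmax contrastive loss, this form needs no normalization across the batch, so each pair contributes independently; the aligned batch yields a lower loss than the swapped one.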

Key Takeaways

  • Generative AI optimization in retail can be enhanced by human-as-a-model approaches, addressing challenges like subjective evaluation and the need for rapid optimization.
  • Large-scale product recognition can improve ad performance in retail by letting static ads perform like dynamic ads, using product tags and personalized product ordering.
  • The integration of space sensing, phygital spaces, and geospatial intelligence presents significant opportunities for AI, requiring robust data acquisition and processing techniques.
  • The GroceryVision Dataset V2 offers a valuable resource for retail AI research, providing diverse data for challenges like video temporal action localization and multi-modal product retrieval.
  • Advanced models like Grounding DINO and visual-inertial SLAM are pushing the boundaries of object detection and mapping in complex environments, with applications extending beyond traditional ads to social media and augmented reality experiences.
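
Multi-modal product retrieval, named in the takeaways above, is commonly framed as nearest-neighbor search in a shared embedding space. The sketch below illustrates that framing only: the summary does not describe the challenge pipeline, the embeddings are assumed to come from some image/text encoder (for example a BLIP- or CLIP-style model), and the product IDs and vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb, gallery, k=3):
    """Rank gallery products by similarity to the query embedding.

    gallery maps a product ID to its embedding (hypothetical data layout).
    Returns the k best (product_id, score) pairs, highest score first.
    """
    scored = [
        (product_id, cosine_similarity(query_emb, emb))
        for product_id, emb in gallery.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical gallery of product embeddings and one query embedding.
gallery = {
    "cereal_box": [0.9, 0.1, 0.0],
    "milk_carton": [0.1, 0.9, 0.1],
    "soda_can": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
top_matches = retrieve_top_k(query, gallery, k=2)
```

With real encoders the gallery would hold thousands of products, and an approximate nearest-neighbor index would usually replace the linear scan.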

Methods / Models / Datasets Mentioned

  • Human-as-a-Model
  • BLIP architecture
  • Fourier augmentation
  • Sigmoid loss
  • DINO
  • Grounded Pre-Training
  • Cross-attention
  • 4DCV
  • SLAM
  • Visual Odometry (VO)
  • Visual Simultaneous Localization and Mapping (VSLAM)
  • Structure from Motion (SfM)
  • ORB-SLAM3
  • OpenVINS
  • ECO SLAM
  • Pose-graph optimization (PGO)
  • Local Optimization (LO)
  • Levenberg-Marquardt Trust Region optimization
  • Sparse Schur Linear solver
  • DBoW2
  • RANSAC
  • PnP RANSAC
  • LSTM
  • S-GAN
  • RAG prompt
  • CLIP
  • CLIP-ViT-large
  • CLIP-ConvNeXt-large
  • EMA
  • FGM
  • VideoMAE
  • Bi-LSTM
  • Object Proposal Layer
  • Interest-Oriented Object Decoding Layer
  • Relation Classification Layer
  • Non-Maximum Suppression (NMS)
  • Transformer Decoders
  • Transformer Encoder
  • Transformer Decoder
  • Self-attention
  • FFN
  • Multi-head Attention
  • Segment Anything Model (SAM)
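
Of the methods listed above, Non-Maximum Suppression (NMS) is simple enough to sketch directly. The example below applies greedy NMS to 1D temporal segments, as one might when post-processing temporal action localization proposals; whether the challenge entries used it this way is an assumption. The segment format and the `iou_threshold` default are illustrative.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two (start, end, score) segments along time."""
    intersection = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(segments, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring segment, drop heavy overlaps, repeat.

    segments is a list of (start, end, score) tuples (hypothetical format).
    """
    remaining = sorted(segments, key=lambda seg: seg[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard every remaining segment that overlaps the kept one too much.
        remaining = [
            seg for seg in remaining
            if temporal_iou(best, seg) <= iou_threshold
        ]
    return kept


# Two overlapping proposals for the same action plus one separate proposal.
proposals = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
kept_segments = temporal_nms(proposals)
```

The same greedy procedure carries over to 2D bounding boxes by swapping `temporal_iou` for a box IoU.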

Topics

Retail Vision · Generative AI Optimization · Product Recognition · Geospatial Intelligence · GroceryVision Dataset · Video Temporal Action Localization · Multi-modal Product Retrieval · Grounding DINO · SLAM · Human-as-a-Model

