The 6th International Workshop on Gaze Estimation and Prediction in the Wild

Event: CVPR 2024 Workshop, Gaze 2024 · Duration: 112 min

Abstract

The 6th International Workshop on Gaze Estimation and Prediction in the Wild (Gaze 2024) at CVPR 2024 brought together researchers to discuss advancements in gaze-related technologies. The workshop featured opening remarks, two invited talks, and five workshop paper presentations. Topics covered included 3D eye region reconstruction, gaze estimation in diverse environments, personalized video gaze estimation, gaze scanpath prediction, zero-shot gaze following with Vision-Language Models, and gaze estimation for classroom attention measurement. The event concluded with an award ceremony recognizing outstanding contributions to the field.

Speakers

  • Hyung Jin Chang — University of Birmingham
  • Feng Xu — Tsinghua University
  • Alexander Fix — Meta Reality Labs Research
  • Swati Jindal — University of California Santa Cruz
  • Takumi Nishiyasu — Institute of Industrial Science, The University of Tokyo, Japan
  • Anshul Gupta — Idiap Research Institute
  • Yuchen Zhou — Sun Yat-sen University
  • Arshad Khan — ELM Company, Saudi Arabia & ELM Europe, London, UK
  • Xucong Zhang — Delft University of Technology

Talks (9)

  • 00:00:00 — Hyung Jin Chang: Welcome and Opening Remarks
    • Introduction to the Gaze workshop, its history, organizers, sponsors, and schedule.
  • 00:06:28 — Feng Xu: Eye Region Reconstruction with a Monocular Camera
    • Discusses 3D face reconstruction with eyes, portrait eyeglasses removal, and gaze estimation with eyeglasses, focusing on improving eye region reconstruction quality with a monocular camera.
  • 00:39:00 — Alexander Fix: Gaze Estimation from the Wild to the Lab and Back Again
    • Discusses challenges and solutions in gaze estimation, covering data collection, model development, and applications in both real-world and lab settings.
  • 01:01:45 — Swati Jindal: Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation
    • Proposes a Spatio-Temporal Attention for Gaze Estimation (STAGE) framework using spatial and temporal attention modules, combined with Gaussian Processes for personalized video gaze estimation, addressing challenges of irrelevant spatial changes.
  • 01:23:50 — Takumi Nishiyasu: Gaze Scanpath Transformer: Predicting Visual Search Target by Spatiotemporal Semantic Modeling of Gaze Scanpath
    • Introduces a Gaze Scanpath Transformer (GST) to predict visual search targets by integrating spatiotemporal and semantic information from gaze scanpaths, improving accuracy compared to previous methods.
  • 01:33:50 — Anshul Gupta: Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
    • Investigates the use of Vision-Language Models (VLMs) for gaze following, focusing on extracting person-related contextual cues and incorporating them into a temporal architecture for improved performance and generalization.
  • 01:41:50 — Yuchen Zhou: Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition
    • Introduces the Interactive Gaze (IG) dataset and the Interactive Attention (IA) model to explore the bidirectional connection between saliency prediction and action understanding, improving human-object interaction (HOI) detection.
  • 01:54:50 — Arshad Khan: Gaze Estimation for Classroom Attention Measurement (GESCAM Dataset)
    • Presents the GESCAM dataset and network architecture for gaze estimation in classroom settings, focusing on naturalistic attention levels and addressing challenges in data collection and annotation for teacher-student engagement.
  • 02:00:00 — Xucong Zhang: Workshop Award Ceremony
    • Announcement of the Best Paper and Best Poster awards for the Gaze 2024 workshop.

Key Takeaways

  • The Gaze 2024 workshop highlighted the significant progress and diverse applications of gaze estimation and prediction, ranging from medical diagnostics to human-computer interaction.
  • Novel approaches are being developed to improve gaze estimation accuracy in challenging real-world scenarios, including dealing with eyeglasses, varying lighting, and complex scene dynamics.
  • The integration of advanced deep learning architectures, such as Transformers and Vision-Language Models, is proving crucial for extracting rich contextual cues and enhancing model generalization.
  • The creation of specialized datasets, like Interactive Gaze (IG) and GESCAM, is essential for training and evaluating models in specific domains like human-object interaction and classroom attention monitoring.
  • Future directions emphasize cross-dataset generalization, personalized attention estimation, and leveraging synthetic data generation to overcome limitations in real-world data collection.

Methods / Models / Datasets Mentioned

  • Digital Mask
  • Parametrical Bilinear Model
  • Linear Model
  • Eyeball Calibration
  • Cross-Domain Segmentation Module
  • DA Network
  • Shadow Mask Network
  • Glass Mask Network
  • De-Shadow Network
  • De-Glass Network
  • Gaze360
  • ETH-XGaze
  • MPIIFaceGaze
  • GazeFollow
  • ChildPlay
  • SWIG
  • MTGS
  • HICO
  • AVA+CP
  • CLIP
  • BLIP-2
  • VQA
  • ICL (In-Context Learning)
  • Spatio-Temporal Attention for Gaze Estimation (STAGE)
  • Spatial Attention Module (SAM)
  • Dual-SAM
  • Cross-SAM
  • Hybrid-SAM
  • Temporal Sequence Model (TSM)
  • Unidirectional LSTM
  • Causal Transformer Decoder model
  • GPT-2
  • Gaze Prediction Layer (GPL)
  • ResNet
  • Gaussian Processes (GPs)
  • EYEDIAP
  • Gaze Scanpath Transformer (GST)
  • Panoptic Segmentation
  • Embedding Module
  • Feature Mixer
  • MLPs
  • COCO-Search18
  • BoVW (Bag-of-Visual-Words)
  • GazeGNN
  • Interactive Gaze (IG) Dataset
  • Interactive Attention Model (IA)
  • Itti
  • GBVS
  • DeepGaze I
  • DeepGaze IIE
  • UMB
  • MLNet
  • ConvNext
  • SSWin Transformer
  • Common HOI Model Pipeline
  • UnionDet
  • IP-Net
  • GG-Net
  • HOTR
  • QPIC
  • MUREN
  • STIP
  • UPT
  • SCG
  • GESCAM Dataset
  • Gaze Target Detection (GTD)
  • Autodesk Maya
  • Blender
  • Unreal Engine
  • Marvelous Designer
  • Adobe Premiere
  • Head Conv
  • Scene Conv
  • Encode
  • Deconv
  • Attention Layer
  • Object-Attended Head Embedding
  • MSE (Mean Squared Error) Loss
  • Angular Loss
  • Random (baseline)
  • Center (baseline)
  • Recasens et al. (baseline)
  • Lian et al. (baseline)
  • Chong et al. (baseline)
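
Two of the loss terms listed above, MSE loss and angular loss, are simple enough to sketch. The summary does not say how any of the talks formulated them, so the code below uses the common textbook definitions as an assumption: angular error as the angle between predicted and ground-truth 3D gaze direction vectors, and MSE as the mean of squared coordinate differences. The function names `angular_error_deg` and `mse` are illustrative and do not come from the workshop material.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumed definition: arccos of the cosine similarity. The vectors
    do not need to be unit length, but they must be non-zero.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp so floating-point rounding cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_sim))


def mse(pred, target):
    """Mean squared error over corresponding coordinates."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    # Orthogonal gaze directions are a quarter turn apart.
    print(angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
    # Two 2D gaze-target points in normalized image coordinates.
    print(mse((0.5, 0.5), (0.25, 0.75)))
```

Angular error is typically reported for 3D gaze direction estimation, while a coordinate-wise error such as MSE suits 2D gaze-target or heatmap regression. Which of the listed methods used which loss is not stated in this summary.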

Topics

Gaze Estimation · Gaze Prediction · Eye Region Reconstruction · Monocular Camera · Eyeglasses Removal · Spatio-Temporal Attention · Gaussian Processes · Personalized Gaze · Video Gaze Tracking · Gaze Scanpath · Transformer Models · Visual Search · Semantic Modeling · Vision-Language Models (VLMs) · Zero-Shot Learning · Gaze Following · Contextual Cues · Human-Object Interaction (HOI) · Saliency Prediction · Action Understanding · Classroom Attention Measurement · Synthetic Data Generation

