The 6th International Workshop on Gaze Estimation and Prediction in the Wild

Event: CVPR 2024 Workshop, Gaze 2024 · Duration: 112 min

Abstract

The 6th International Workshop on Gaze Estimation and Prediction in the Wild (Gaze 2024) at CVPR 2024 brought together researchers working on gaze estimation, gaze following, and attention prediction. The workshop featured opening remarks, two invited talks, and five workshop paper presentations. Topics covered included 3D eye region reconstruction, gaze estimation in both in-the-wild and lab settings, personalized video gaze estimation, visual search target prediction from gaze scanpaths, zero-shot gaze following with Vision-Language Models, attention prediction guided by human-object interaction recognition, and gaze estimation for classroom attention measurement. The event concluded with an award ceremony announcing the Best Paper and Best Poster awards.

Speakers

Hyung Jin Chang — University of Birmingham
Feng Xu — Tsinghua University
Alexander Fix — Meta Reality Labs Research
Swati Jindal — University of California Santa Cruz
Takumi Nishiyasu — Institute of Industrial Science, The University of Tokyo, Japan
Anshul Gupta — Idiap Research Institute
Yuchen Zhou — Sun Yat-sen University
Arshad Khan — ELM Company, Saudi Arabia & ELM Europe, London, UK
Xucong Zhang — Delft University of Technology

Talks (9)

00:00:00 — Hyung Jin Chang: Welcome and Opening Remarks
- Introduction to the Gaze workshop, its history, organizers, sponsors, and schedule.
00:06:28 — Feng Xu: Eye Region Reconstruction with a Monocular Camera
- Discusses 3D face reconstruction with eyes, portrait eyeglasses removal, and gaze estimation with eyeglasses, focusing on improving eye region reconstruction quality with a monocular camera.
00:39:00 — Alexander Fix: Gaze Estimation from the Wild to the Lab and Back Again
- Discusses challenges and solutions in gaze estimation, covering data collection, model development, and applications in both real-world and lab settings.
01:01:45 — Swati Jindal: Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation
- Proposes the Spatio-Temporal Attention for Gaze Estimation (STAGE) framework, which combines spatial and temporal attention modules with Gaussian Processes for personalized video gaze estimation and aims to suppress spatial changes that are irrelevant to gaze.
01:23:50 — Takumi Nishiyasu: Gaze Scanpath Transformer: Predicting Visual Search Target by Spatiotemporal Semantic Modeling of Gaze Scanpath
- Introduces a Gaze Scanpath Transformer (GST) to predict visual search targets by integrating spatiotemporal and semantic information from gaze scanpaths, improving accuracy compared to previous methods.
01:33:50 — Anshul Gupta: Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
- Investigates the use of Vision-Language Models (VLMs) for gaze following, focusing on extracting person-related contextual cues and incorporating them into a temporal architecture for improved performance and generalization.
01:41:50 — Yuchen Zhou: Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition
- Introduces a novel dataset (IG) and an interactive attention model (IA) to explore the bidirectional connection between saliency prediction and action understanding, improving HOI detection.
01:54:50 — Arshad Khan: Gaze Estimation for Classroom Attention Measurement (GESCAM Dataset)
- Presents the GESCAM dataset and network architecture for gaze estimation in classroom settings, focusing on naturalistic attention levels and addressing challenges in data collection and annotation for teacher-student engagement.
02:00:00 — Xucong Zhang: Workshop Award Ceremony
- Announcement of the Best Paper and Best Poster awards for the Gaze 2024 workshop.

Key Takeaways

The Gaze 2024 workshop highlighted the significant progress and diverse applications of gaze estimation and prediction, ranging from medical diagnostics to human-computer interaction.
Novel approaches are being developed to improve gaze estimation accuracy in challenging real-world scenarios, including dealing with eyeglasses, varying lighting, and complex scene dynamics.
The integration of advanced deep learning architectures, such as Transformers and Vision-Language Models, is proving crucial for extracting rich contextual cues and enhancing model generalization.
The creation of specialized datasets, like Interactive Gaze (IG) and GESCAM, is essential for training and evaluating models in specific domains like human-object interaction and classroom attention monitoring.
Future directions emphasize cross-dataset generalization, personalized attention estimation, and leveraging synthetic data generation to overcome limitations in real-world data collection.

Methods / Models / Datasets Mentioned

Digital Mask
Parametrical Bilinear Model
Linear Model
Eyeball Calibration
Cross-Domain Segmentation Module
DA Network
Shadow Mask Network
Glass Mask Network
De-Shadow Network
De-Glass Network
Gaze360
ETH-XGaze
MPIIFaceGaze
GazeFollow
ChildPlay
SWIG
MTGS
HICO
AVA+CP
CLIP
BLIP-2
VQA
ICL (In-Context Learning)
Spatio-Temporal Attention for Gaze Estimation (STAGE)
Spatial Attention Module (SAM)
Dual-SAM
Cross-SAM
Hybrid-SAM
Temporal Sequence Model (TSM)
Unidirectional LSTM
Causal Transformer Decoder model
GPT-2
Gaze Prediction Layer (GPL)
ResNet
Gaussian Processes (GPs)
EYEDIAP
Gaze Scanpath Transformer (GST)
Panoptic Segmentation
Embedding Module
Feature Mixer
MLPs
COCO-Search18
BoVW (Bag-of-Visual-Words)
GazeGNN
Interactive Gaze (IG) Dataset
Interactive Attention Model (IA)
Itti
GBVS
DeepGaze I
DeepGaze IIE
UMB
MLNet
ConvNext
SSWin Transformer
Common HOI Model Pipeline
UnionDet
IP-Net
GG-Net
HOTR
QPIC
MUREN
STIP
UPT
SCG
GESCAM Dataset
Gaze Target Detection (GTD)
Autodesk Maya
Blender
Unreal Engine
Marvelous Designer
Adobe Premiere
Head Conv
Scene Conv
Encode
Deconv
Attention Layer
Object-Attended Head Embedding
MSE (Mean Squared Error) Loss
Angular Loss
Random (baseline)
Center (baseline)
Recasens et al. (baseline)
Lian et al. (baseline)
Chong et al. (baseline)
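
The list above names both an MSE loss and an angular loss without saying how they differ. The following is a minimal sketch of the two quantities for a single 3D gaze direction vector, using only the Python standard library. The function names `angular_error_deg` and `mse` are hypothetical and are not taken from any of the presented papers, whose exact loss formulations may differ.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between a predicted and a target 3D gaze vector."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos))

def mse(pred, target):
    """Mean squared error over the vector components."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Same direction, different length: zero angular error, non-zero MSE.
same_direction = ((0.0, 0.0, -2.0), (0.0, 0.0, -1.0))
print(angular_error_deg(*same_direction), mse(*same_direction))
```

The angular error depends only on direction, so it ignores the length of the predicted vector, while MSE also penalizes length differences. That is one common reason for reporting gaze accuracy in degrees.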

Topics

Gaze Estimation · Gaze Prediction · Eye Region Reconstruction · Monocular Camera · Eyeglasses Removal · Spatio-Temporal Attention · Gaussian Processes · Personalized Gaze · Video Gaze Tracking · Gaze Scanpath · Transformer Models · Visual Search · Semantic Modeling · Vision-Language Models (VLMs) · Zero-Shot Learning · Gaze Following · Contextual Cues · Human-Object Interaction (HOI) · Saliency Prediction · Action Understanding · Classroom Attention Measurement · Synthetic Data Generation

Notes

Open for commentary — connections to other work, critiques, follow-up reading.