The 6th International Workshop on Gaze Estimation and Prediction in the Wild
Event: CVPR 2024 Workshop, Gaze 2024 · Duration: 112 min
Abstract
The 6th International Workshop on Gaze Estimation and Prediction in the Wild (Gaze 2024) at CVPR 2024 brought together researchers working on gaze estimation and prediction. The program consisted of opening remarks, two invited talks, and five workshop paper presentations. Topics included 3D eye region reconstruction, gaze estimation in both real-world and lab settings, personalized video gaze estimation, visual search target prediction from gaze scanpaths, zero-shot use of Vision-Language Models for gaze following, attention prediction guided by human-object interaction recognition, and gaze estimation for classroom attention measurement. The event concluded with an award ceremony for the Best Paper and Best Poster.
Speakers
- Hyung Jin Chang — University of Birmingham
- Feng Xu — Tsinghua University
- Alexander Fix — Meta Reality Labs Research
- Swati Jindal — University of California Santa Cruz
- Takumi Nishiyasu — Institute of Industrial Science, The University of Tokyo, Japan
- Anshul Gupta — Idiap Research Institute
- Yuchen Zhou — Sun Yat-sen University
- Arshad Khan — ELM Company, Saudi Arabia & ELM Europe, London, UK
- Xucong Zhang — Delft University of Technology
Talks (9)
- 00:00:00 — Hyung Jin Chang: Welcome and Opening Remarks
- Introduction to the Gaze workshop, its history, organizers, sponsors, and schedule.
- 00:06:28 — Feng Xu: Eye Region Reconstruction with a Monocular Camera
- Discusses 3D face reconstruction with eyes, portrait eyeglasses removal, and gaze estimation with eyeglasses, focusing on improving eye region reconstruction quality with a monocular camera.
- 00:39:00 — Alexander Fix: Gaze Estimation from the Wild to the Lab and Back Again
- Discusses challenges and solutions in gaze estimation, covering data collection, model development, and applications in both real-world and lab settings.
- 01:01:45 — Swati Jindal: Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation
- Proposes the Spatio-Temporal Attention for Gaze Estimation (STAGE) framework, which uses spatial and temporal attention modules and combines them with Gaussian Processes to personalize video gaze estimation, with the aim of reducing sensitivity to spatial changes that are irrelevant to gaze.
- 01:23:50 — Takumi Nishiyasu: Gaze Scanpath Transformer: Predicting Visual Search Target by Spatiotemporal Semantic Modeling of Gaze Scanpath
- Introduces a Gaze Scanpath Transformer (GST) to predict visual search targets by integrating spatiotemporal and semantic information from gaze scanpaths, improving accuracy compared to previous methods.
- 01:33:50 — Anshul Gupta: Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
- Investigates the use of Vision-Language Models (VLMs) for gaze following, focusing on extracting person-related contextual cues and incorporating them into a temporal architecture for improved performance and generalization.
- 01:41:50 — Yuchen Zhou: Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition
- Introduces the Interactive Gaze (IG) dataset and the Interactive Attention (IA) model to explore the bidirectional connection between saliency prediction and action understanding, with the goal of improving human-object interaction (HOI) detection.
- 01:54:50 — Arshad Khan: Gaze Estimation for Classroom Attention Measurement (GESCAM Dataset)
- Presents the GESCAM dataset and network architecture for gaze estimation in classroom settings, focusing on naturalistic attention levels and addressing challenges in data collection and annotation for teacher-student engagement.
- 02:00:00 — Xucong Zhang: Workshop Award Ceremony
- Announcement of the Best Paper and Best Poster awards for the Gaze 2024 workshop.
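The STAGE talk listed above pairs a video gaze network with Gaussian Processes for personalization. The summary does not describe how that is done, so the following is only a generic sketch of the idea, not the authors' formulation: a zero-mean Gaussian Process with an RBF kernel is fitted to the residual between a generic model's predicted yaw and the true yaw on a few calibration samples from one person, and its posterior mean is then added to new predictions. The function names, the 1D yaw-in-radians setup, and the kernel hyperparameters are all assumptions made for illustration.

```python
import math

def rbf(a, b, length_scale=0.3, variance=1.0):
    """Squared-exponential kernel between two scalar inputs (assumed hyperparameters)."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_gp_residual(pred_calib, true_calib, noise=1e-4):
    """Fit a GP to per-person residuals; return a function that corrects predictions."""
    residuals = [t - p for p, t in zip(pred_calib, true_calib)]
    n = len(pred_calib)
    # Kernel matrix over calibration inputs, with observation noise on the diagonal.
    K = [[rbf(pred_calib[i], pred_calib[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, residuals)  # alpha = (K + noise * I)^-1 * residuals

    def personalize(pred):
        # GP posterior mean of the residual at `pred`, added to the generic output.
        k_star = [rbf(pred, x) for x in pred_calib]
        return pred + sum(k * a for k, a in zip(k_star, alpha))

    return personalize

# Hypothetical calibration data: the generic model is biased by 0.05 rad for this person.
calib_pred = [-0.4, -0.2, 0.0, 0.2, 0.4]
calib_true = [p + 0.05 for p in calib_pred]
personalize = fit_gp_residual(calib_pred, calib_true)
print(personalize(0.1))
```

Because the prior mean is zero, the correction fades for predictions far from the calibration inputs, so the generic model's output is left unchanged wherever the person has provided no data.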
Key Takeaways
- The Gaze 2024 workshop highlighted the significant progress and diverse applications of gaze estimation and prediction, ranging from medical diagnostics to human-computer interaction.
- Novel approaches are being developed to improve gaze estimation accuracy in challenging real-world scenarios, including dealing with eyeglasses, varying lighting, and complex scene dynamics.
- The integration of advanced deep learning architectures, such as Transformers and Vision-Language Models, is proving crucial for extracting rich contextual cues and enhancing model generalization.
- The creation of specialized datasets, like Interactive Gaze (IG) and GESCAM, is essential for training and evaluating models in specific domains like human-object interaction and classroom attention monitoring.
- Future directions emphasize cross-dataset generalization, personalized attention estimation, and leveraging synthetic data generation to overcome limitations in real-world data collection.
Methods / Models / Datasets Mentioned
Digital Mask · Parametrical Bilinear Model · Linear Model · Eyeball Calibration · Cross-Domain Segmentation Module · DA Network · Shadow Mask Network · Glass Mask Network · De-Shadow Network · De-Glass Network · Gaze360 · ETH-XGaze · MPIIFaceGaze · GazeFollow · ChildPlay · SWIG · MTGS · HICO · AVA+CP · CLIP · BLIP-2 · VQA · ICL (In-Context Learning) · Spatio-Temporal Attention for Gaze Estimation (STAGE) · Spatial Attention Module (SAM) · Dual-SAM · Cross-SAM · Hybrid-SAM · Temporal Sequence Model (TSM) · Unidirectional LSTM · Causal Transformer Decoder model · GPT-2 · Gaze Prediction Layer (GPL) · ResNet · Gaussian Processes (GPs) · EYEDIAP · Gaze Scanpath Transformer (GST) · Panoptic Segmentation · Embedding Module · Feature Mixer · MLPs · COCO-Search18 · BoVW (Bag-of-Visual-Words) · GazeGNN · Interactive Gaze (IG) Dataset · Interactive Attention Model (IA) · ITTI · GBVS · DeepGaze I · DeepGaze IIE · UMB · MLNet · ConvNext · SSWin Transformer · Common HOI Model Pipeline · UnionDet · IP-Net · GG-Net · HOTR · QPIC · MUREN · STIP · UPT · SCG · GESCAM Dataset · Gaze Target Detection (GTD) · Autodesk Maya · Blender · Unreal Engine · Marvelous Designer · Adobe Premiere · Head Conv · Scene Conv · Encode · Deconv · Attention Layer · Object-Attended Head Embedding · MSE (Mean Squared Error) Loss · Angular Loss · Random (baseline) · Center (baseline) · Recasens et al. (baseline) · Lian et al. (baseline) · Chong et al. (baseline)
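The list above names both an angular loss and an MSE loss. The summary does not say how either was defined in the talks, so the sketch below shows only the common textbook forms: the angle between a predicted and a ground-truth 3D gaze direction, and the mean squared error over vector components. The function names are illustrative.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos))

def mse(pred, target):
    """Mean squared error over the components of two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Same direction, different length: the angle is zero but the MSE is not.
print(angular_error_deg((0.0, 0.0, -2.0), (0.0, 0.0, -1.0)))
print(mse((0.0, 0.0, -2.0), (0.0, 0.0, -1.0)))
```

The usage lines illustrate why the two are often reported separately: the angular error depends only on direction, while MSE also penalizes differences in vector length.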
Topics
Gaze Estimation · Gaze Prediction · Eye Region Reconstruction · Monocular Camera · Eyeglasses Removal · Spatio-Temporal Attention · Gaussian Processes · Personalized Gaze · Video Gaze Tracking · Gaze Scanpath · Transformer Models · Visual Search · Semantic Modeling · Vision-Language Models (VLMs) · Zero-Shot Learning · Gaze Following · Contextual Cues · Human-Object Interaction (HOI) · Saliency Prediction · Action Understanding · Classroom Attention Measurement · Synthetic Data Generation