Second Egocentric Vision (EgoVis) Workshop

Event: CVPR 2025 · Duration: 45 min

Abstract

The Second Egocentric Vision (EgoVis) Workshop at CVPR 2025 presents the outcomes of the Ego4D and EgoExo4D Challenges. This year marks the fifth iteration of the Ego4D challenge and the first full-scale EgoExo4D challenge. The workshop features a synthesis of the challenges, spotlight talks from winning teams, and an awards ceremony. The Ego4D tracks cover episodic memory, social understanding, and forecasting/anticipation; the EgoExo4D tracks cover ego-pose, relations, keysteps, and proficiency estimation. The challenges drew 71 unique participants from 20 research institutions worldwide, with a notable increase in industry involvement. Winning methods improve substantially over the baselines, particularly in tasks that leverage large language models and graph-based approaches. The workshop concludes by recognizing the winners across all tracks, highlighting solutions such as OSGNet, BIMBA, GLEVR, O-MaMa, and the UNICT entry.

Speakers

  • Andrew Westbury — Meta
  • Julia Romero — University of Colorado Boulder, Intel Labs
  • Lorenzo Mur-Labadia — Universidad de Zaragoza
  • Luigi Seminara — University of Catania
  • Md Mohaiminul Islam — UNC Chapel Hill
  • Yisen Feng — Harbin Institute of Technology (Shenzhen)

Talks (6)

  • 00:04:40 Andrew Westbury: 2025 Ego4D & EgoExo4D Challenges
    • Introduction to the Ego4D and EgoExo4D challenges, their history, and the session schedule.
  • 01:35:00 Julia Romero: Keystep Recognition using Graph Neural Networks
    • Presents GLEVR, a graph-based framework for fine-grained keystep recognition in egocentric videos, leveraging temporal relationships and multimodal data.
  • 02:02:00 Lorenzo Mur-Labadia: O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
    • Presents O-MaMa, a solution for object correspondence that reformulates the problem as an object mask matching task across egocentric and exocentric views, achieving state-of-the-art performance with high efficiency.
  • 02:32:00 Luigi Seminara: Ego-Exo4D Procedure Understanding Challenge 2025 (UNICT solution)
    • Introduces a novel framework for learning task graphs from action sequences using Maximum Likelihood estimation, enabling procedural reasoning tasks like action anticipation and mistake detection.
  • 03:16:00 Md Mohaiminul Islam: BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
    • Presents BIMBA, a Mamba-based selective-scan compression method for long-range video question answering that efficiently handles hours-long videos by identifying and retaining critical spatiotemporal tokens.
  • 04:18:00 Yisen Feng: OSGNet for Video Localization on the Ego4D+Ego-Exo4D Challenge 2025
    • Presents OSGNet, a unified pipeline for video localization that combines object-shot enhanced grounding with a multi-modal fusion framework, achieving significant improvements in natural language queries, moments queries, and goal-step tasks.
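
To make the idea of "object mask matching" from the O-MaMa talk more concrete, here is a minimal sketch of the general pattern: given a descriptor for a query object mask in one view and descriptors for candidate masks in the other view, pick the candidate with the highest cosine similarity. This is an illustration only and not the authors' method. The function names (cosine_similarity, match_mask), the toy descriptors, and the choice of cosine similarity as the score are all assumptions; in practice the descriptors would come from some pretrained feature extractor and the matching would be learned.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_mask(
    query_descriptor: Sequence[float],
    candidate_descriptors: List[Sequence[float]],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate mask most similar to the query mask.

    Hypothetical helper: descriptors are assumed to be fixed-length feature
    vectors, one per object mask.
    """
    if not candidate_descriptors:
        raise ValueError("need at least one candidate mask")
    scores = [cosine_similarity(query_descriptor, c) for c in candidate_descriptors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Toy descriptors (made up for illustration): one egocentric query mask
    # and three candidate masks from the exocentric view.
    ego_query = [1.0, 0.0, 0.5]
    exo_candidates = [
        [0.0, 1.0, 0.0],   # unrelated object
        [0.9, 0.1, 0.4],   # similar to the query
        [-1.0, 0.0, 0.0],  # opposite direction
    ]
    index, score = match_mask(ego_query, exo_candidates)
    print(f"best candidate: {index} (cosine similarity {score:.3f})")
```

Framing correspondence as selection among candidate masks turns a dense prediction problem into a ranking problem over a small set, which is one plausible reason such an approach can be efficient.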

Key Takeaways

  • The Ego4D and EgoExo4D challenges foster significant research in egocentric and exocentric vision, attracting diverse participants from academia and industry.
  • Winning methods consistently demonstrate substantial improvements over established baselines, indicating rapid progress in the field.
  • Large language models and graph-based approaches are proving particularly effective in handling complex video understanding tasks, including long-term temporal reasoning and multimodal data integration.
  • The challenges highlight the importance of efficient and lightweight models for processing extensive video data, especially in resource-constrained environments.
  • The community’s active participation and high submission counts underscore the sustained interest and dynamic nature of egocentric vision research.

Methods / Models / Datasets Mentioned

  • OSGNet
  • BIMBA
  • GLEVR
  • O-MaMa
  • UNICT
  • FastSAM
  • DINOv2
  • Task Graph Maximum Likelihood (TGML)
  • Direct Optimization (DO)
  • Task Graph Transformer (TGT)
  • Mamba
  • Video-ChatGPT
  • LongVU
  • LLAMA-VID
  • LLAVA-LLAMA
  • LLAVA-Next-Video
  • LLAVA-OneVision
  • VideoChat2
  • LongVA
  • LLAMA2-8B
  • Kangaroo
  • Video-XL
  • LLAVA-Video
  • Co-DETR
  • CausalTAD
  • BayesianVSLNet

Topics

Egocentric Vision · Ego4D · EgoExo4D · Challenges · Computer Vision · Machine Learning · Video Understanding · Episodic Memory · Natural Language Queries · Moments Queries · Goal-Step · EgoSchema · Social Understanding · Looking at Me · Talking to Me · Forecasting · Long-term Anticipation · Keystep Recognition · Procedure Understanding · Body Pose Estimation · Hand Pose Estimation · Object Correspondence · Proficiency Estimation · Graph Neural Networks · Large Language Models · Multimodal Fusion · Selective-Scan Compression
