Second Egocentric Vision (EgoVis) Workshop

Event: HoloAssist Challenges · Duration: 33 min

Abstract

This session presents the HoloAssist Challenges, which focus on interactive remote assistance and mistake detection in egocentric video. Top-performing teams from the mistake detection challenge present their approaches to identifying procedural and execution mistakes and to explaining them. The session also covers gaze prediction for unsupervised mistake detection and behavioral signals for detecting user confusion during physical tasks. It closes by introducing a system-driven approach to benchmarking situated human-AI collaboration.

Speakers

  • Taein Kwon — VGG, University of Oxford
  • Constantin Patsch — TUM
  • Boyu Han — CAS
  • Wei-Jin Huang — Sun Yat-sen University
  • Michele Mazzamuto — University of Catania
  • Maia Stiber — JHU
  • Sean Andrist — Microsoft Research

Talks (7)

  • 00:04 Taein Kwon: HoloAssist Challenges Session Introduction
    • Introduction to the HoloAssist dataset and the mistake detection challenge, outlining the session’s agenda.
  • 02:03 Constantin Patsch: Mistake Detection (2nd Place)
    • Presents an approach for online mistake detection and explanation in egocentric videos, achieving second place in the HoloAssist challenge.
  • 07:10 Boyu Han: Technical Report of Team MR-CAS for the HoloAssist Mistake Detection Challenge 2025 (1st Place)
    • Details the MR-CAS team’s two-stage Mixture-of-Experts (MoE) framework for mistake detection, addressing class imbalance and high intra-class variability in the HoloAssist dataset, and achieving first place.
  • 11:59 Wei-Jin Huang: Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
    • Introduces AMNAR, an adaptive framework that dynamically models all possible normal next actions to achieve robust error detection in procedural tasks, outperforming existing methods on various datasets including HoloAssist.
  • 16:19 Michele Mazzamuto: Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
    • Proposes an unsupervised mistake detection method that leverages gaze prediction and gaze trajectory completion to identify abnormal attention patterns, demonstrating its effectiveness on the HoloAssist dataset.
  • 21:08 Maia Stiber: “Uh, This One?”: Leveraging Behavioral Signals for Detecting Confusion during Physical Tasks
    • Explores the use of behavioral signals (hand and gaze movements) from the HoloAssist dataset to detect user confusion in physical tasks, showing that contextualizing these signals improves detection performance.
  • 27:39 Sean Andrist: SIGMA: Towards System-Driven Benchmarks for Situated Collaboration
    • Introduces SIGMA, an open-source testbed system for mixed-reality physically assistive agents, aiming to build living benchmarks for situated human-AI collaboration in physical tasks.
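
For readers unfamiliar with the Mixture-of-Experts idea named in the MR-CAS talk, the following is a minimal sketch of generic soft gating: a gate scores each expert from the input, and the expert outputs are blended with the softmax of those scores. This summary does not describe the internals of the team's two-stage framework, so the function names, toy experts, and gate weights below are all hypothetical and only illustrate the general technique.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mixture_of_experts(x, experts, gate_weights):
    """Blend expert outputs using softmax gate weights computed from x.

    gate_weights holds one row per expert; each gate score is the dot
    product of that row with x (a linear gate, assumed for illustration).
    """
    gate_scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(gate_scores)
    outputs = [expert(x) for expert in experts]
    mixed = sum(g * o for g, o in zip(gates, outputs))
    return mixed, gates

# Two toy "experts" (hypothetical): one sums the features, one takes the max.
experts = [lambda x: sum(x), lambda x: max(x)]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

mixed, gates = mixture_of_experts([2.0, 1.0], experts, gate_weights)
```

Here the first expert receives the larger gate weight because the first feature is larger, so the blended output lies between the two expert outputs but closer to the first.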

Key Takeaways

  • The HoloAssist dataset presents significant challenges for mistake detection due to class imbalance and high intra-class variability.
  • Effective mistake detection requires robust models that can handle diverse task types and user behaviors.
  • Leveraging multimodal data (e.g., RGB, hands, eyes) and advanced architectures (e.g., Q-Formers, MoE) can significantly improve mistake detection performance.
  • Gaze patterns can serve as implicit indicators of user mistakes or confusion, even in unsupervised settings.
  • Detecting user confusion in physical tasks is a complex problem, but behavioral signals (hand, head, gaze movements) can be effectively used for this purpose.
  • Contextualizing behavioral signals (e.g., with instruction embeddings and time spent) improves confusion detection performance.
  • Lightweight models trained on behavioral signals can achieve performance comparable to that of more computationally expensive deep learning methods for confusion detection.
  • Open challenges in situated human-AI collaboration require new benchmarks that focus on timing, user state modeling, and grounding, moving beyond static datasets to interactive, system-driven approaches.

Methods / Models / Datasets Mentioned

  • TimeSformer
  • GazeCompl
  • MR-CAS
  • ViViT
  • LoRA
  • Feature Mixture-of-Experts (F-MoE)
  • Classification Mixture-of-Experts (C-MoE)
  • Weighted Cross-Entropy (WCE) Loss
  • AUC Loss
  • Logit Adjustment (LA) Loss for long-tail learning
  • Sharpness-Aware Minimization (SAM)
  • Adaptive Multiple Normal Action Representation (AMNAR)
  • Action Segmentation Module
  • Potential Action Prediction Block (PAPB)
  • Representation Reconstruction Block (RRB)
  • Representation Matching Block (RMB)
  • Gaze Trajectory Completion
  • Gaze Frame Correlation Module
  • Explainable Boosting Machine (EBM)
  • SIGMA
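
As a concrete illustration of how a weighted cross-entropy loss can counter class imbalance, here is a minimal sketch in plain Python. The inverse-frequency weighting scheme, the function names, and the toy labels are assumptions made for illustration; the summary does not state which weights the challenge teams used.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# Toy labels (hypothetical): 0 = correct action, 1 = mistake.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(labels)

# A classifier that always assigns 90% probability to "correct".
probs = [[0.9, 0.1] for _ in labels]

loss_weighted = weighted_cross_entropy(probs, labels, weights)
loss_plain = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
```

With these toy labels the rare "mistake" class gets four times the weight of the common class, so a classifier that ignores mistakes is penalised more heavily under the weighted loss than under the unweighted one.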

Topics

Egocentric Vision · HoloAssist Dataset · Mistake Detection · Remote Assistance · Interactive AI · Gaze Prediction · Behavioral Signals · User Confusion Detection · Situated Collaboration · System-Driven Benchmarking


Notes

Open for commentary — connections to other work, critiques, follow-up reading.