First Joint Egocentric Vision (EgoVis) Workshop Held in Conjunction with CVPR 2024

Event: CVPR 2024 Workshop · Duration: 512 min

Abstract

This segment introduces the First Joint Egocentric Vision (EgoVis) Workshop, highlighting its organizers, historical context, and the day’s program. It features a keynote speech by James M. Rehg on ‘An Egocentric Approach to Social AI,’ which delves into understanding social behavior, defining Social AI, and presenting a conceptual model for social interaction, emphasizing egocentric perception. The segment also includes a presentation by Taemin Kwon on ‘HoloAssist Challenges,’ introducing the HoloAssist dataset and its associated tasks for interactive AI assistants in AR/VR. Finally, Michele Mazzamuto discusses ‘Mistake Detection and Gaze Patterns,’ proposing a method to identify user mistakes in AR based on gaze behavior.

This segment features a keynote presentation by Diane Larlus from NAVER LABS Europe. The talk focuses on the challenges and advancements in achieving 3D geometric understanding within egocentric videos. Larlus introduces NeuralDiff for segmenting moving 3D objects and Neural Feature Fusion Fields (N3F) for distilling 2D features into 3D representations. The presentation also highlights the creation of the EPIC Fields dataset, which provides 3D reconstructed camera poses for a large collection of egocentric kitchen videos, enabling new research in this complex domain.

This segment covers the EPIC-KITCHENS 2024 challenges, starting with an introduction to the EPIC Fields dataset and its 3D annotations for egocentric video understanding. It then transitions into an overview of the EPIC-KITCHENS dataset, the EPIC-Trilogy, and the various challenge tracks, including a new Audio-Based Interaction Detection challenge. The segment highlights the winning solutions for several tracks, detailing methods that leverage video foundation models, temporal causality, and novel loss functions for tasks like action recognition, action detection, and multi-instance retrieval. Finally, it introduces TIM, a Time Interval Machine for audio-visual action recognition, emphasizing its ability to utilize temporal context for improved performance.

This segment delves into the substantial data gap observed between human children’s learning and AI models, where AI requires orders of magnitude more data. It examines various cognitive science theories, including nativism, constructivism, multimodal grounding, and comparison issues, to explain this discrepancy. The speakers introduce open data resources like SAYCam and BabyView, which provide rich egocentric video data from children, and discuss efforts to characterize social information and train self-supervised vision models on this unique dataset. The segment concludes by presenting DevBench, a multimodal developmental benchmark designed to evaluate AI models based on children’s learning outcomes, aiming to foster AI development that more closely mirrors human cognitive development.

This video segment features multiple speakers presenting on various challenges and datasets related to egocentric vision and human sensing. Topics include 3D scene reconstruction, object detection in the Aria Digital Twin dataset, and the introduction of the HOT3D dataset for hand-object interaction. The segment concludes with a discussion on using egocentric vision for modeling subtle human behavior, assessing depression, and reconstructing avatars using innovative sensing methods like Dense Pose from Wifi.

This segment introduces the Ego4D and EgoExo4D challenges, highlighting the increasing participation and the introduction of new tasks like GoalStep and EgoSchema. It emphasizes the importance of large, diverse datasets for training models in egocentric vision, particularly for hand and body pose estimation in unconstrained environments. The segment also covers the methodology and results of the EgoVideo team, showcasing their success in various challenges through a two-stage video-language pretraining approach.

Speakers

Antonino Furnari — University of Catania
James M. Rehg — University of Illinois Urbana-Champaign, US
Taemin Kwon — ETH Zurich
Michele Mazzamuto — University of Catania
Siddhant Bansal
Diane Larlus — Principal Scientist at NAVER LABS Europe
Dima Damen
Jacob Chalk
Shuming Liu — KAUST
Watanabe — The Hong Kong Polytechnic University
Jaesung Huh — University of Bristol
Michael C. Frank — Stanford University
Bria Long — Stanford University, UC San Diego
Shivesh Khaitan
Xiaqing Pan
Prithviraj Banerjee
Fernando De La Torre
Suyog Jain — Carnegie Mellon University
Chris — UC Berkeley
Gabriel Perez Santamaria — Universidad de Zaragoza
Jilan Xu — Fudan University

Talks (20)

00:00:00 — Antonino Furnari: First Joint Egocentric Vision (EgoVis) Workshop Held in Conjunction with CVPR 2024
- Introduction to the EgoVis workshop, its organizers, history, past events, the EgoVis Board, and an overview of the day’s program and challenges.
00:05:34 — James M. Rehg: An Egocentric Approach to Social AI
- Discusses understanding social behavior, defines Social AI, presents a conceptual model for social interaction, emphasizes the importance of egocentric perception, and introduces methods for eye contact detection, attention estimation, and the Gaze-LLE architecture using foundation models. Also covers joint attention, auditory attention, and the Ego-Exocentric Conversational Graph.
00:41:40 — Taemin Kwon: HoloAssist Challenges
- Introduces the HoloAssist dataset for interactive AI assistants in AR/VR, detailing its modalities, annotation scheme (coarse-grained, fine-grained actions, conversations, mistakes), and data statistics. Presents the challenges: Fine-grained Action Recognition, Mistake Detection, Intervention Type Prediction, and 3D Hand Pose Forecasting, along with initial results and resources.
00:52:40 — Michele Mazzamuto: Mistake Detection and Gaze Patterns
- Explores the idea that mistakes in augmented reality tasks can be identified by analyzing unusual gaze patterns, presenting a Gaze Completion Module and scoring method, and showing results from the HoloAssist challenge.
02:50:35 — Diane Larlus: EPIC Fields: Marrying 3D Geometry and Video Understanding
- This segment introduces EPIC Fields, a dataset that enhances EPIC-KITCHENS with 3D annotations, and presents a benchmark with three tasks: dynamic new view synthesis, unsupervised object segmentation, and semi-supervised video object segmentation. It also shows potential applications of 3D information in egocentric videos.
02:54:25 — Siddhant Bansal: EPIC-KITCHENS Challenges 2023-2024 Overview
- This segment provides an overview of the EPIC-KITCHENS dataset and its associated challenges, including the EPIC-Trilogy (VISOR, EPIC-Sounds, EPIC Fields). It reviews past challenge winners and introduces the new Audio-Based Interaction Detection challenge, along with the results and winners for the 2024 challenges across various tracks.
02:56:29 — Jacob Chalk: Audio-Based Interaction Detection
- This talk introduces the new Audio-Based Interaction Detection challenge, explaining its objective to detect and classify sound events in untrimmed audio from egocentric videos. It highlights the challenge’s difficulty due to variable-length, potentially overlapping sound annotations across 44 classes and presents the debut year’s results, showing a significant improvement over the baseline.
03:01:27 — Shuming Liu: Harnessing Temporal Causality for Advanced Egocentric Video Understanding
- This talk presents a method for egocentric video understanding that achieved top ranks in multiple EPIC-KITCHENS 2024 challenges. The approach leverages InternVideo2 as a video foundation model, fine-tunes it on the EPIC-KITCHENS dataset, and introduces a hybrid causal block combining Mamba and self-attention for action detection, demonstrating the importance of temporal causality.
03:06:54 — Watanabe: Symmetric Multi-Instance Loss
- This talk introduces a novel Symmetric Multi-Similarity (SMS) Loss function for multi-instance retrieval in egocentric videos, which addresses limitations of previous max-margin losses by considering non-zero relevancy for negative pairs and handling cases where negative pairs are more positive than positive pairs. The method, combined with a “Flip and Add” augmentation and ensemble strategy, achieved improved performance in the EPIC-KITCHENS 2024 challenge.
03:10:30 — Jaesung Huh: TIM: A Time Interval Machine for Audio-Visual Action Recognition
- This talk introduces TIM (Time Interval Machine), a novel transformer-based model for audio-visual action recognition that effectively utilizes temporal context by taking start and end times of an interval as input and predicting the action within that interval. TIM addresses the limitations of current approaches that fail to leverage true context in untrimmed videos and demonstrates strong performance in both visual and auditory action recognition/detection tasks on the EPIC-KITCHENS dataset.
03:53:42 — Diane Larlus: Towards a Geometric Understanding in Egocentric Videos
- The talk discusses challenges and approaches for achieving 3D geometric understanding in egocentric videos, focusing on methods like NeuralDiff and N3F, and introducing the EPIC Fields dataset.
04:16:16 — Michael C. Frank & Bria Long: Bridging the data gap between human children and AI models
- This talk explores the significant data gap between human children and AI models, discussing cognitive science explanations and introducing new open data resources and benchmarks for evaluating AI models based on children’s learning outcomes.
05:41:31 — Shivesh Khaitan: Project Aria Challenge CVPR EgoVis 2024
- This talk introduces the Aria Scene Reconstruction challenge, detailing methods for 3D scene reconstruction from egocentric data, including wall, door, and window detection, and discusses challenges and future steps.
06:51:11 — Xiaqing Pan: Aria Digital Twin Object Detection Challenges
- This presentation introduces the Aria Digital Twin (ADT) dataset, a comprehensive resource for 3D object detection, highlighting its features, web-based visualization, 3D model releases, and the ADT 3D object detection challenge.
07:06:29 — Suyog Jain: Ego4D Challenge Synthesis Talks
- Suyog Jain introduces the Ego4D and EgoExo4D challenges, highlighting the growth in participation, new challenges like GoalStep and EgoSchema, and the strong interest from academic teams.
07:10:36 — Chris: EgoExo4D Hand Pose Challenge
- Chris introduces the EgoExo4D Hand Pose Challenge, emphasizing the use of a large, manually annotated ground truth dataset collected in diverse, unconstrained environments to overcome limitations of existing datasets.
07:15:18 — Gabriel Perez Santamaria: EgoExo4D Body Pose Challenge
- Gabriel Perez Santamaria presents the EgoExo4D Body Pose Challenge, detailing the task of estimating 3D full-body pose from a first-person perspective, the challenges of rare body visibility in egocentric frames, and the metrics used for evaluation.
07:18:39 — Jilan Xu: Team EgoVideo on the First EgoVis Workshop Ego4D & EgoExo4D Challenge
- Jilan Xu introduces EgoVideo, a powerful vision foundation model for egocentric videos, detailing its two-stage training process involving video-language pretraining in a general domain followed by post-training in the egocentric domain.
07:27:11 — Prithviraj Banerjee: New Research Challenges & Datasets for 2024-25
- This talk introduces the HOT3D dataset, a large-scale egocentric hand-object dataset with 3D pose annotations and gaze data, and discusses its application in BOP and HANDS challenges for joint hand-object pose estimation.
08:25:11 — Fernando De La Torre: Egocentric Vision for Human Sensing
- This presentation explores egocentric vision for human sensing, covering applications in modeling subtle human behavior, depression assessment, elderly well-being monitoring, and reconstructing face and body avatars using novel sensing techniques like Dense Pose from Wifi.
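
The "temporal causality" idea from the 03:01:27 talk can be illustrated with a small sketch. This is not the presenters' hybrid causal block: the Mamba component is omitted entirely, and the function name, the single-head layout, and the toy features are assumptions made for illustration. It only shows the general technique of causal self-attention, where the output at time step t depends on steps up to and including t and never on later frames.

```python
import math


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length float vectors, one per time step.
    Step t attends only to steps 0..t, so future frames cannot leak backwards.
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Scores against the current and earlier steps only (the causal mask).
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible steps.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy per-frame features (hypothetical values, three time steps).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(frames, frames, frames)
```

Because of the mask, replacing the last frame leaves the outputs for the first two steps unchanged, which is the property a causal block is meant to guarantee.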

Key Takeaways

Egocentric vision provides a unique and crucial perspective for understanding complex human social interactions and nonverbal communication (NVC).
AI methods, especially those leveraging large-scale visual foundation models, can be developed to understand and predict social behavior, offering new tools for applications like AR/VR and treating conditions such as autism.
The HoloAssist dataset and its associated challenges (Fine-grained Action Recognition, Mistake Detection, Intervention Type Prediction, 3D Hand Pose Forecasting) are vital for advancing research in interactive AI assistants for AR/VR.
Analyzing gaze patterns is a promising approach for detecting user mistakes and understanding cognitive states in egocentric contexts, with models achieving human-level accuracy in specific tasks.
Egocentric videos present unique challenges for 3D understanding due to dynamic viewpoints, occlusions, and long durations.
Specialized neural rendering architectures like NeuralDiff can effectively segment moving objects by modeling background, foreground, and actor components separately.
Fusing 2D semantic features (from self-supervised models like DINO) with 3D scene representations (from NeRF-like models) can enrich geometric understanding.
The EPIC Fields dataset provides a valuable resource for advancing 3D geometric understanding in egocentric videos by offering reconstructed camera poses for a large collection of kitchen videos.
Intelligent sampling strategies are crucial for handling the complexity and scale of egocentric video datasets for 3D reconstruction.
EPIC Fields enhances egocentric video datasets with 3D annotations, enabling new benchmarks for dynamic new view synthesis, object segmentation, and object tracking.
The EPIC-KITCHENS Challenges drive innovation in egocentric video understanding, with new tasks like Audio-Based Interaction Detection pushing the boundaries of multimodal analysis.
Advanced models leveraging video foundation models and temporal causality, such as those incorporating Mamba and self-attention, demonstrate superior performance in action recognition and detection tasks.
Novel loss functions like Symmetric Multi-Similarity (SMS) Loss and context-aware models like TIM are crucial for effectively handling complex scenarios like multi-instance retrieval and temporal action localization in untrimmed videos.
AI models require significantly more data (1,000x to 1,000,000x) than human children to achieve comparable learning outcomes, highlighting a substantial data gap.
Cognitive science offers several explanations for this gap, including innate endowments (nativism), active and social learning (constructivism), and the richness of multimodal, grounded real-world data.
New open data resources like SAYCam and BabyView provide high-resolution egocentric video data from children, enabling researchers to study human learning environments and train AI models on ecologically valid data.
The DevBench benchmark is proposed as a framework to evaluate AI models against developmental learning outcomes in children, fostering AI development that aligns more closely with human cognitive processes.
The Aria Scene Reconstruction challenge focuses on generating 3D layouts from egocentric video, utilizing techniques like heatmap conversion, YOLO for wall detection, and segmentation for doors/windows.
The Aria Digital Twin (ADT) dataset provides comprehensive, high-quality ground-truth data for 3D object detection and tracking, with new web-based visualization tools and upcoming 3D model releases.
The HOT3D dataset is a novel large-scale egocentric dataset designed for understanding hand-object interactions, featuring 3D pose annotations, gaze data, and multi-view calibrated recordings, and is integrated into BOP and HANDS challenges.
Egocentric vision can be applied to subtle human sensing tasks like depression assessment and elderly well-being monitoring, with new methods like ‘Dense Pose from Wifi’ offering non-intrusive human detection and pose estimation.
The Ego4D and EgoExo4D challenges are experiencing significant growth in participation, with a strong interest from academic teams and notable improvements in forecasting and new tasks like GoalStep.
The EgoExo4D Hand Pose Challenge utilizes a large, manually annotated ground truth dataset from diverse, unconstrained environments, leading to a 23% reduction in error for hand pose estimation compared to previous methods.
The EgoExo4D Body Pose Challenge focuses on estimating 3D full-body pose from egocentric video, with top-performing methods leveraging multi-scale model fusion and transformer-based architectures, achieving a 17% reduction in MPJPE.
The EgoVideo team’s success across multiple Ego4D challenges is attributed to their two-stage video-language pretraining approach, which combines general domain pretraining with egocentric domain post-training, and their use of multi-scale fusion and ensemble techniques.
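
The MPJPE figure quoted for the EgoExo4D Body Pose Challenge refers to mean per-joint position error. The sketch below uses the standard definition, the average Euclidean distance between predicted and ground-truth 3D joints. The challenge's exact protocol (units, alignment, visibility handling) is not given in this summary, so those details are left out, and the joint coordinates and error values are made up for illustration.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need the same non-zero number of joints")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (0.17 means 17%)."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical two-joint pose: the joints are off by 3 and 4 units respectively.
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
guess = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(guess, truth)  # (3 + 4) / 2 = 3.5
```

A reported "17% reduction in MPJPE" corresponds to `relative_reduction(baseline, new)` being about 0.17, for example a drop from 100 to 83 in whatever unit the benchmark uses.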

Methods / Models / Datasets Mentioned

AMASS
AV-CONV
Action Sensitivity Learning
ActionFormer
AdaTAD
Adaptive Max Margin Loss
Aria
Audio-Based Interaction Detection
Audio-visual Transformer
BabyView
CLIP
Chinchilla
CocoFormer
Conditional Human Motion Diffusion Model
DINO
DINOv2
Databrary.org
Dense Pose from Wifi
DevBench
EgoExo4D
ELECTRA
EPIC Fields
EPIC-Aff
EPIC-KITCHENS
EPIC-Sounds
Ego4D
EgoExoLearn
EgoLoc
EgoSchema
EgoTracks
EgoVLP
EgoVideo
FPN
Flip and Add
FrankMocap
GPT-3
GSF
Gaze-LLE
GoalStep
Grammar (TROG, WG)
GroundNLQ
GroundingModel
HBHA
HaMeR
HandOccNet
HandOccNet (no param)
HoloAssist
Hybrid Causal Block
ImageNet
InceptionV3
Intelligent sampling
InternVideo
InternVideo2
LLaMA-2
Level-wise Cross Attention ViT
Lexicon (LWL)
Local Aggregation
METRO
MSSD
Mamba
Mask R-CNN
Max Margin Loss
MeshGraphormer
Modality Translation Network
Moment Queries
Multi-camera system
Multi-scale Model Fusion
NFM
NeRF
NeRF On-the-go
NeRF-W
Neural Feature Fusion Fields (N3F)
NeuralDiff
OSNOM
Omnivore
OpenCLIP
OpenPose
OptiTrack Motion Capture System
POTTER
PredNet
ResNet
ResNet50
RoBERTa
RobustNeRF
SAYCam
SIFT features
Semantics (WAT, VOC, THINGS)
SiamRCNN
SimCLR
SlowFast
StillFast
Structure from Motion (SfM)
Symmetric Multi-Similarity (SMS) Loss
TIM (Time Interval Machine)
TSN
TimeSformer
Transformer Encoder
UNet
Uniform sampling
VISOR
ViT-B/14
VQAVLP
VQLoc+
Visual Vocab (VV)
Volume Rendering
Werewolf Among Us
YOLO

Topics

3D Geometric Understanding · 3D Scene Reconstruction · 3D reconstruction · AI models · AR/VR Applications · Action Recognition · Action detection · Audio-visual understanding · Auditory Attention · Body Avatars · Body Pose Estimation · Cognitive science · Data gap · Dataset Annotation · Dataset Creation · Depression Assessment · Developmental psychology · Dynamic Scenes · EPIC Fields · EPIC-KITCHENS Challenges · Ego4D Challenge · EgoExo4D Challenge · Egocentric Vision · Egocentric video · Elderly Well-being · Eye Contact Detection · Face Avatars · Feature Fusion · Forecasting · Foundation Models · Gaze Analysis · Hand Pose Estimation · Hand Pose Forecasting · Hand-Object Interaction · Human Behavior Analysis · Human Sensing · Human learning · Joint Attention · Loss functions · Mistake Detection · Model evaluation · Multi-instance retrieval · Multimodal Interaction · Multimodal Learning · Neural Rendering · Object Detection · Object Segmentation · Social AI · Temporal causality · Transformer Models · Video foundation models · Video-Language Pretraining · WiFi Sensing

Notes

Open for commentary — connections to other work, critiques, follow-up reading.