Second Egocentric Vision (EgoVis) Workshop
Event: EgoVis 2025 · Duration: 24 min
Abstract
This session presents four oral talks from the Second Egocentric Vision (EgoVis) Workshop. The talks cover a range of topics including the creation of a large-scale egocentric life dataset and its associated question-answering benchmark, novel methods for world-space hand motion reconstruction from egocentric videos, learning generalizable 3D actions from human videos for zero-shot robotic manipulation, and computational modeling of infant learning to discover hidden visual concepts beyond linguistic input.
Speakers
- Jingkang Yang — Nanyang Technological University, Singapore
- Jiankang Deng — Shanghai Jiao Tong University, Imperial College London
- Hanzhi Chen — ETH Zurich, Microsoft
- Satoshi Tsutsui — Nanyang Technological University (NTU), The Max Planck Institute for Psycholinguistics
Talks (4)
- 00:00:14 — Jingkang Yang: EgoLife: Towards Egocentric Life Assistant
- Introduces EgoLife, a dataset covering 7 days of egocentric and third-person video from 6 strangers, and EgoLifeQA, a benchmark of 5 life-oriented question-answering tasks that require long-term memory tracing. The EgoButler baseline tackles the benchmark with EgoGPT for omni-modal captioning and EgoRAG for question answering.
- 00:06:36 — Jiankang Deng: HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos
- Presents HaWoR, a framework for reconstructing hand motion in world space from egocentric videos. It addresses the limitations of camera-space methods, and of SLAM in dynamic environments, with adaptive egocentric SLAM and foundational metric priors.
- 00:12:05 — Hanzhi Chen: VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
- Introduces VidBot, a framework that learns generalizable, language-conditioned 3D affordances from in-the-wild human videos, enabling zero-shot robotic manipulation across diverse robots and environments with high success rates.
- 00:19:03 — Satoshi Tsutsui: Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
- Investigates how computational models trained on infant egocentric vision-language data learn visual concepts beyond explicit linguistic input, mirroring human infant learning processes where objects are understood before their names are known.
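The fourth talk's summary mentions a CLIP-like model trained with contrastive learning over a vision encoder and a text encoder. As a rough illustration of that general idea, and not the speaker's implementation, the sketch below computes a symmetric contrastive loss over a small batch of paired embeddings in plain Python. The function names, the temperature value of 0.07, and the toy embeddings are assumptions made for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are treated as a matching pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (hypothetical values).
    frames = [[1.0, 0.0], [0.0, 1.0]]
    aligned_utterances = [[1.0, 0.0], [0.0, 1.0]]
    swapped_utterances = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", clip_style_loss(frames, aligned_utterances))
    print("swapped pairs:", clip_style_loss(frames, swapped_utterances))
```

With aligned pairs the loss is close to zero, while swapping the utterances makes it large. That gap is the training signal that pulls matching image and text embeddings together and pushes mismatched ones apart.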
Key Takeaways
- The EgoLife dataset offers a unique resource for long-context understanding and AI research in daily life scenarios.
- HaWoR improves the accuracy of world-space hand motion reconstruction from egocentric videos by addressing the limitations of camera-space methods and of SLAM in dynamic environments.
- VidBot shows that robots can learn manipulation from in-the-wild human videos and transfer zero-shot to new robots and environments.
- Computational models can reveal insights into human cognitive development, showing that visual concepts can be learned implicitly beyond explicit linguistic cues, similar to infant learning.
Methods / Models / Datasets Mentioned
EgoGPT · EgoRAG · Multi-level Retrieval · Keyword Extraction · HaWoR · Adaptive Egocentric SLAM · Metric3D · WiLoR · VidBot · COLMAP · Monocular Metric Depth Predictor · Hand-Object Detector · Segmentation Models · Global Scale Optimization · Per-frame Pose & Scale Refinement · Coarse Affordance Predictor · Trajectory Denoiser · Multi-goal Guidance · Contact Normals Guidance · Collision Avoidance Guidance · CLIP-like model · Contrastive Learning · Text Encoder · Vision Encoder
Topics
Egocentric Vision · Life Assistant · Long-term Video Dataset · Multi-modal Data · Question Answering · Human Behavior Analysis · Hand Motion Reconstruction · World-Space Reconstruction · SLAM · 3D Hand Pose Estimation · Zero-Shot Robotic Manipulation · 3D Affordance Learning · In-the-Wild Human Videos · Language-Conditioned Actions · Infant Learning · Hidden Visual Concepts · Linguistic Input · Computational Modeling · CLIP-like Models
Notes
Open for commentary — connections to other work, critiques, follow-up reading.