Second Egocentric Vision (EgoVis) Workshop

Event: EgoVis 2025 · Duration: 24 min

Abstract

This session presents four oral talks from the Second Egocentric Vision (EgoVis) Workshop. The talks cover a large-scale egocentric life dataset and its associated question-answering benchmark, world-space hand motion reconstruction from egocentric videos, learning generalizable 3D actions from human videos for zero-shot robotic manipulation, and computational modeling of infant learning to discover hidden visual concepts beyond linguistic input.

Speakers

Jingkang Yang — Nanyang Technological University, Singapore
Jiankang Deng — Shanghai Jiao Tong University, Imperial College London
Hanzhi Chen — ETH Zurich, Microsoft
Satoshi Tsutsui — Nanyang Technological University, Max Planck Institute for Psycholinguistics

Talks (4)

00:00:14 — Jingkang Yang: EgoLife: Towards Egocentric Life Assistant
- Introduces EgoLife, a dataset of 7 days of egocentric and third-person video from 6 strangers, and EgoLifeQA, a benchmark of 5 life-oriented Q&A tasks that require tracing long-term memory. The EgoButler baseline addresses these tasks using EgoGPT for omni-modal captioning and EgoRAG for question answering.
00:06:36 — Jiankang Deng: HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos
- Presents HaWoR, a framework for accurate world-space hand motion reconstruction from egocentric videos, addressing limitations of camera-space methods and SLAM in dynamic environments through adaptive egocentric SLAM and foundational metric priors.
00:12:05 — Hanzhi Chen: VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
- Introduces VidBot, a framework that learns generalizable language-conditioned 3D affordance from in-the-wild human videos, enabling zero-shot robotic manipulation across diverse robots and environments with high success rates.
00:19:03 — Satoshi Tsutsui: Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
- Investigates how computational models trained on infant egocentric vision-language data learn visual concepts beyond explicit linguistic input, mirroring human infant learning processes where objects are understood before their names are known.
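
As an illustration of the retrieval step mentioned in the EgoLife talk above, the following is a minimal sketch of coarse-to-fine retrieval over a week-long memory of captions. It is not the EgoRAG implementation: the two-level day/clip layout, the keyword-overlap scoring, the function names, and the toy captions are all assumptions made up for this example.

```python
import re

# Tiny stopword list; a stand-in for a real keyword extraction step (assumption).
STOP = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "for",
        "did", "i", "my", "who", "what", "where", "when", "was", "is", "then"}

def keywords(text):
    """Return the set of lowercased words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP}

def overlap(query_kw, text):
    """Score a text by how many query keywords it contains."""
    return len(query_kw & keywords(text))

def retrieve(question, memory, top_days=1, top_clips=2):
    """Pick the best-matching days by summary, then the best clips within them."""
    q = keywords(question)
    # Level 1: rank whole days using their coarse summaries.
    days = sorted(memory, key=lambda d: overlap(q, d["summary"]), reverse=True)
    days = days[:top_days]
    # Level 2: rank fine-grained clip captions inside the selected days only.
    clips = [(d["day"], c) for d in days for c in d["clips"]]
    clips.sort(key=lambda dc: overlap(q, dc[1]["caption"]), reverse=True)
    return clips[:top_clips]

# Hypothetical memory bank; every caption here is invented for illustration.
MEMORY = [
    {"day": 1, "summary": "Supermarket trip for pasta ingredients, then cooking dinner",
     "clips": [{"time": "17:05", "caption": "Buying tomatoes and pasta at the supermarket"},
               {"time": "19:30", "caption": "Cooking pasta in the kitchen"}]},
    {"day": 2, "summary": "Board games in the living room and a walk in the park",
     "clips": [{"time": "14:10", "caption": "Playing a card game on the sofa"},
               {"time": "16:45", "caption": "Walking in the park near the lake"}]},
]

hits = retrieve("When did I buy pasta at the supermarket?", MEMORY)
for day, clip in hits:
    print(day, clip["time"], clip["caption"])
```

Searching summaries first keeps the fine-grained search small, which matters when the memory spans days of video. The retrieved captions would then be passed to a question-answering model as context.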

Key Takeaways

The EgoLife dataset, with 7 days of egocentric and third-person video, is a resource for long-context understanding and AI research in daily-life scenarios.
HaWoR improves the accuracy of world-space hand motion reconstruction from egocentric videos, addressing the limitations of camera-space methods and of SLAM in dynamic environments.
VidBot shows that robots can learn manipulation from in-the-wild human videos and generalize zero-shot to new robots and environments.
Computational models can offer insight into human cognitive development: models trained on infant egocentric vision-language data learn visual concepts beyond their explicit linguistic input, much as infants understand objects before they know their names.
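
To make the camera-space versus world-space distinction in the HaWoR takeaway concrete, here is a minimal sketch of the standard rigid transform that maps a point from the camera frame into a fixed world frame. It is not the HaWoR pipeline; the camera poses, which a system would obtain from a SLAM component, are hypothetical values chosen for the example, as are all names.

```python
import math

def rot_y(theta):
    """Rotation matrix for a turn of theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def cam_to_world(p_cam, R_wc, t_wc):
    """Apply p_world = R_wc @ p_cam + t_wc for a 3D point."""
    return [sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i]
            for i in range(3)]

# A wrist joint that stays 0.5 m in front of the camera in every frame,
# so its camera-space coordinates never change.
wrist_cam = [0.0, 0.0, 0.5]

# Hypothetical camera-to-world poses (rotation, translation in metres):
# the wearer turns 90 degrees about y while moving 1 m along x.
poses = [
    (rot_y(0.0), [0.0, 0.0, 0.0]),
    (rot_y(math.pi / 2), [1.0, 0.0, 0.0]),
]

wrist_world = [cam_to_world(wrist_cam, R, t) for R, t in poses]
for p in wrist_world:
    print([round(x, 3) for x in p])
```

The wrist is static in camera space yet moves in world space, which is why a camera-space estimate alone cannot describe hand motion while the wearer moves. The translations must also be in metric units for the world trajectory to have a meaningful scale.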

Methods / Models / Datasets Mentioned

EgoGPT
EgoRAG
Multi-level Retrieval
Keyword Extraction
HaWoR
Adaptive Egocentric SLAM
Metric3D
WiLoR
VidBot
COLMAP
Monocular Metric Depth Predictor
Hand-Object Detector
Segmentation Models
Global Scale Optimization
Per-frame Pose & Scale Refinement
Coarse Affordance Predictor
Trajectory Denoiser
Multi-goal Guidance
Contact Normals Guidance
Collision Avoidance Guidance
CLIP-like model
Contrastive Learning
Text Encoder
Vision Encoder
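
The last four entries above (a CLIP-like model, contrastive learning, and paired text and vision encoders) fit together through a symmetric contrastive objective. The sketch below shows that objective on hand-written embeddings. It is a generic illustration, not the model from the talk: the embeddings, the temperature value, and the function names are assumptions, and real encoders would produce the vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two image embeddings (illustrative values only).
images = [[1.0, 0.1], [0.1, 1.0]]
matched = clip_style_loss(images, [[0.9, 0.0], [0.0, 0.8]])     # pairs aligned
mismatched = clip_style_loss(images, [[0.0, 0.8], [0.9, 0.0]])  # pairs swapped
print(matched < mismatched)
```

The loss is low when each image embedding is closest to its own paired text and high when the pairs are swapped, so minimizing it pulls matching image and text embeddings together.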

Topics

Egocentric Vision · Life Assistant · Long-term Video Dataset · Multi-modal Data · Question Answering · Human Behavior Analysis · Hand Motion Reconstruction · World-Space Reconstruction · SLAM · 3D Hand Pose Estimation · Zero-Shot Robotic Manipulation · 3D Affordance Learning · In-the-Wild Human Videos · Language-Conditioned Actions · Infant Learning · Hidden Visual Concepts · Linguistic Input · Computational Modeling · CLIP-like Models
