Second Egocentric Vision (EgoVis) Workshop

Event: EgoVis 2025 · Duration: 24 min

Abstract

This session presents four oral talks from the Second Egocentric Vision (EgoVis) Workshop. The talks cover a large-scale egocentric daily-life dataset and its question-answering benchmark, world-space hand motion reconstruction from egocentric videos, learning generalizable 3D actions from human videos for zero-shot robotic manipulation, and computational modeling of how infants learn visual concepts beyond linguistic input.

Speakers

  • Jingkang Yang — Nanyang Technological University, Singapore
  • Jiankang Deng — Shanghai Jiao Tong University, Imperial College London
  • Hanzhi Chen — ETH Zurich, Microsoft
  • Satoshi Tsutsui — Nanyang Technological University (NTU), The Max Planck Institute for Psycholinguistics

Talks (4)

  • 00:00:14 Jingkang Yang: EgoLife: Towards Egocentric Life Assistant
    • Introduces EgoLife, a dataset of 7-day egocentric and third-person videos from 6 strangers, and EgoLifeQA, a benchmark with 5 life-oriented Q&A tasks that require long-term memory tracing. The EgoButler baseline addresses these tasks using EgoGPT for omni-modal captioning and EgoRAG for question answering.
  • 00:06:36 Jiankang Deng: HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos
    • Presents HaWoR, a framework for accurate world-space hand motion reconstruction from egocentric videos, addressing limitations of camera-space methods and SLAM in dynamic environments through adaptive egocentric SLAM and foundational metric priors.
  • 00:12:05 Hanzhi Chen: VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
    • Introduces VidBot, a framework that learns generalizable language-conditioned 3D affordance from in-the-wild human videos, enabling zero-shot robotic manipulation across diverse robots and environments with high success rates.
  • 00:19:03 Satoshi Tsutsui: Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
    • Investigates how computational models trained on infant egocentric vision-language data learn visual concepts beyond explicit linguistic input, mirroring human infant learning processes where objects are understood before their names are known.
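
The HaWoR summary above contrasts camera-space and world-space reconstruction. As a minimal sketch of that distinction (not HaWoR's implementation), the snippet below maps a hand joint from camera coordinates to world coordinates using a camera-to-world pose. The function names, the pose values, and the `scale` parameter (standing in for a metric scale applied to an up-to-scale SLAM translation) are all illustrative assumptions.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def camera_to_world(joints_cam, R_wc, t_wc, scale=1.0):
    """Map 3D points from camera coordinates to world coordinates.

    joints_cam: list of [x, y, z] points in the camera frame (metres).
    R_wc, t_wc: camera-to-world rotation (3x3) and translation (3-vector).
    scale: hypothetical factor converting an up-to-scale translation to metres.
    """
    world = []
    for p in joints_cam:
        rotated = mat_vec(R_wc, p)
        world.append([rotated[i] + scale * t_wc[i] for i in range(3)])
    return world


# The same camera-space wrist position, observed in two frames.
wrist_cam = [[0.0, 0.0, 0.5]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Frame 0: the camera sits at the world origin.
frame0 = camera_to_world(wrist_cam, identity, [0.0, 0.0, 0.0])
# Frame 1: the camera moved 1 SLAM unit along x; scale=0.2 makes that 0.2 m.
frame1 = camera_to_world(wrist_cam, identity, [1.0, 0.0, 0.0], scale=0.2)

print(frame0, frame1)
```

In camera space the wrist appears stationary across both frames, while in world space it has moved with the head-mounted camera. An error in the camera pose or in the scale factor therefore carries straight into the world-space hand trajectory, which is why camera tracking and metric scale matter for this task.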

Key Takeaways

  • The EgoLife dataset, with 7 days of egocentric and third-person video from 6 participants, supports research on long-context understanding in daily-life scenarios.
  • HaWoR reconstructs hand motion in world space from egocentric videos more accurately than prior approaches, addressing the limits of camera-space methods and of SLAM in dynamic environments.
  • VidBot shows that robots can learn manipulation from in-the-wild human videos and transfer zero-shot to new robots and environments.
  • Computational models trained on infant egocentric vision-language data can learn visual concepts that are never named in their linguistic input, mirroring how infants understand objects before knowing their names.

Methods / Models / Datasets Mentioned

  • EgoGPT
  • EgoRAG
  • Multi-level Retrieval
  • Keyword Extraction
  • HaWoR
  • Adaptive Egocentric SLAM
  • Metric3D
  • WiLoR
  • VidBot
  • COLMAP
  • Monocular Metric Depth Predictor
  • Hand-Object Detector
  • Segmentation Models
  • Global Scale Optimization
  • Per-frame Pose & Scale Refinement
  • Coarse Affordance Predictor
  • Trajectory Denoiser
  • Multi-goal Guidance
  • Contact Normals Guidance
  • Collision Avoidance Guidance
  • CLIP-like model
  • Contrastive Learning
  • Text Encoder
  • Vision Encoder
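
The last four entries above (CLIP-like model, Contrastive Learning, Text Encoder, Vision Encoder) refer to a standard training setup in which paired image and text embeddings are pulled together and mismatched pairs are pushed apart. The sketch below shows the widely used symmetric contrastive objective on toy embeddings. It is not the speaker's training code: the function names, the toy vectors, and the default temperature of 0.07 are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
# Each caption embedding matches its own image embedding.
aligned = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
# Each caption embedding matches the other image instead.
swapped = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])

print(aligned < swapped)
```

The loss only rewards matching each frame to its paired utterance, so any visual concept the vision encoder picks up without a corresponding word in the text is learned as a side effect of this objective.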

Topics

Egocentric Vision · Life Assistant · Long-term Video Dataset · Multi-modal Data · Question Answering · Human Behavior Analysis · Hand Motion Reconstruction · World-Space Reconstruction · SLAM · 3D Hand Pose Estimation · Zero-Shot Robotic Manipulation · 3D Affordance Learning · In-the-Wild Human Videos · Language-Conditioned Actions · Infant Learning · Hidden Visual Concepts · Linguistic Input · Computational Modeling · CLIP-like Models

