Lifting Monocular Events to 3D Human Poses

Event: CVPR 2025 · Duration: 3 min

Abstract

This paper presents the first events-only monocular approach for 3D human pose estimation (HPE). The method converts the event stream into frames, passes them through a deep learning backbone to predict marginal heatmaps, and triangulates 3D joint positions from those heatmaps. A novel synthetic dataset, Event-Human3.6m, derived from the standard Human3.6m dataset, is introduced to facilitate research in event-based HPE. Extensive ablation studies on the DHP19 and Event-Human3.6m datasets show that the constant-count event representation outperforms the spatio-temporal voxel-grid and that ImageNet pretraining significantly improves performance. The work also identifies static movements and occluded body parts as the primary challenges for event-based pose estimation: an event camera responds only to brightness changes, so a body part that stays still generates few events.
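
The step from marginal heatmaps to a 3D joint can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one heatmap per coordinate plane (xy, xz, zy), reads each with a soft-argmax, and averages the two estimates available for every coordinate. The function names `soft_argmax_2d`, `lift_joint`, and `peaked` are hypothetical, and the result is in heatmap bins rather than metric units.

```python
import math


def soft_argmax_2d(heatmap):
    """Return the expected (column, row) position of a 2D heatmap under a softmax."""
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(value - peak) for value in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    col = sum(w * c for row in weights for c, w in enumerate(row)) / total
    row_pos = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return col, row_pos


def lift_joint(h_xy, h_xz, h_zy):
    """Merge three marginal heatmaps into one (x, y, z) estimate.

    Assumed layout (hypothetical): h_xy has x on columns and y on rows,
    h_xz has x on columns and z on rows, h_zy has z on columns and y on rows.
    Each coordinate is observed in two planes, so the two values are averaged.
    """
    x1, y1 = soft_argmax_2d(h_xy)
    x2, z1 = soft_argmax_2d(h_xz)
    z2, y2 = soft_argmax_2d(h_zy)
    return ((x1 + x2) / 2, (y1 + y2) / 2, (z1 + z2) / 2)


def peaked(size, col, row, height=50.0):
    """Build a toy size x size heatmap with a single sharp peak."""
    grid = [[0.0] * size for _ in range(size)]
    grid[row][col] = height
    return grid


# Toy joint located at bin (x=2, y=5, z=7) in a 10-bin volume.
joint = lift_joint(peaked(10, 2, 5), peaked(10, 2, 7), peaked(10, 7, 5))
```

Soft-argmax keeps the coordinate readout differentiable, which is why it is a common choice when heatmaps are trained end to end.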

Speakers

Gianluca Scarpellini — Istituto Italiano di Tecnologia (IIT) - PAVIS
Pietro Morerio — Istituto Italiano di Tecnologia (IIT) - PAVIS
Alessio Del Bue — Istituto Italiano di Tecnologia (IIT) - VGM

Talks (1)

00:00:00 — Gianluca Scarpellini: Lifting Monocular Events to 3D Human Poses
- A novel approach for 3D human pose estimation using only monocular event camera data, including a new synthetic dataset and ablation studies on event representations and backbones.

Key Takeaways

Introduced the first events-only monocular approach for 3D Human Pose Estimation.
Developed a novel synthetic dataset, Event-Human3.6m, for event-based HPE.
Demonstrated that the constant-count event representation yields better results than spatio-temporal voxel-grid.
Showed that ImageNet pretraining significantly improves the performance of event-based HPE models.
Identified static movements and occluded body parts as key limitations for event-based 3D human pose estimation.
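
To make the constant-count representation concrete, here is a minimal sketch under stated assumptions: events arrive as `(x, y, timestamp, polarity)` tuples sorted by time, each frame is a per-pixel event count, and polarity is ignored. The function name `constant_count_frames` and the tuple layout are hypothetical, not taken from the paper.

```python
def constant_count_frames(events, width, height, events_per_frame):
    """Group an event stream into frames that each hold the same number of events.

    events: iterable of (x, y, timestamp, polarity) tuples, sorted by timestamp.
    Returns a list of height x width count images. Trailing events that do not
    fill a complete frame are dropped.
    """
    if events_per_frame <= 0:
        raise ValueError("events_per_frame must be positive")
    frames = []
    frame = [[0] * width for _ in range(height)]
    filled = 0
    for x, y, _timestamp, _polarity in events:
        frame[y][x] += 1  # accumulate one event at its pixel
        filled += 1
        if filled == events_per_frame:
            frames.append(frame)
            frame = [[0] * width for _ in range(height)]
            filled = 0
    return frames


# Five toy events on a 2 x 2 sensor, two events per frame.
toy_events = [
    (0, 0, 0.0, 1),
    (1, 0, 0.1, -1),
    (1, 0, 0.5, 1),
    (0, 1, 0.9, 1),
    (1, 1, 1.0, -1),
]
toy_frames = constant_count_frames(toy_events, width=2, height=2, events_per_frame=2)
```

Because a frame closes after a fixed number of events rather than a fixed time window, fast motion yields frames more often and slow motion less often. A voxel-grid instead splits a time window into temporal bins.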

Methods / Models / Datasets Mentioned

DHP19
Human3.6m
Event-Human3.6m
ResNet-34
ResNet-50
ImageNet
Constant-count
Voxel-grid
Stacked Hourglass
MPJPE
Calabrese et al. [5]
Mehta et al. [38]
Kanazawa et al. [22]
Nibali et al. [43]
Pavlakos et al. [44]
Luvizon et al. [33]
Cheng et al. [9]
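
MPJPE in the list above stands for mean per-joint position error: the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints. A minimal sketch follows, assuming each pose is a list of `(x, y, z)` tuples in the same units and joint order; the function name `mpjpe` is illustrative.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error between two poses.

    predicted, target: equal-length, non-empty lists of (x, y, z) joint positions.
    Returns the average Euclidean distance over joints, in the input units.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Two-joint toy pose: the first joint is off by 5 units, the second is exact.
error = mpjpe([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)])
```

Lower values are better, and the metric is usually reported in millimetres.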

Topics

3D Human Pose Estimation · Event Cameras · Monocular Vision · Deep Learning · Synthetic Datasets · Marginal Heatmaps · Event-based Vision · Human Motion Analysis · ResNet · ImageNet Pretraining

Notes

Open for commentary — connections to other work, critiques, follow-up reading.