Object Motion Segmentation: Advantages from Event Data

Event: CVPR 2025 Workshop on Event-Based Vision · Duration: 26 min

Abstract

This presentation examines the advantages of event-based vision for robust object motion segmentation. The speaker shows how event cameras, inspired by biological vision, capture fast-moving objects and cope with challenging lighting conditions where traditional cameras fail. The talk covers several approaches, from classic optimization on event clouds to self-supervised and supervised deep learning architectures for SLAM, motion segmentation, and 3D motion estimation. A new dataset, EV-IMO, is introduced to address the lack of ground truth for independently moving objects in event data. The talk closes with lightweight neural networks that support real-time actions on drones, such as obstacle dodging and pursuit tasks.

Speakers

Cornelia Fermüller — Computer Vision Laboratory, UMIACS, University of Maryland

Talks (34)

00:00:00 — Cornelia Fermüller: Object Motion Segmentation: Advantages from Event Data
- Introduction to the importance of motion segmentation for event-based processing and a comparison of event cameras vs. traditional cameras for fast motion.
00:12:00 — Cornelia Fermüller: Collaborators
- Acknowledges collaborators Anton Mitrokhin, Chethan Parameshwara, ChengXi Ye, and Yannis Aloimonos for their contributions to the presented work.
00:33:00 — Cornelia Fermüller: Fast events aid in segmentation
- Demonstrates the advantage of event cameras over traditional cameras in capturing fast motion, using an arrow example, and highlights the biological inspiration from the mammalian visual system.
02:03:00 — Cornelia Fermüller: Stepping Feet Illusion
- Explains the Stepping Feet Illusion to illustrate how event cameras generate events based on contrast changes, which can be leveraged for motion segmentation, especially when traditional methods struggle.
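A minimal sketch of the contrast-threshold model commonly used to describe event generation, assuming an idealized, noise-free pixel: an event fires each time the log intensity moves a fixed threshold away from its level at the last event. The function name `generate_events` and the threshold value are illustrative assumptions, not taken from the talk.

```python
import math

def generate_events(intensities, timestamps, threshold=0.25):
    """Simulate one idealized event-camera pixel (hypothetical helper).

    intensities: positive brightness samples for the pixel.
    timestamps:  sample times, same length as intensities.
    Returns a list of (time, polarity) with polarity +1 (brighter) or -1 (darker).
    """
    events = []
    ref = math.log(intensities[0])  # log intensity at the last event
    for value, t in zip(intensities[1:], timestamps[1:]):
        log_i = math.log(value)
        # A large change can cross the threshold several times in one step.
        while abs(log_i - ref) >= threshold:
            polarity = 1 if log_i > ref else -1
            ref += polarity * threshold
            events.append((t, polarity))
    return events

# A strong brightening followed by a darkening produces ON then OFF events.
strong = generate_events([1.0, 1.0, 1.7, 1.7, 0.9], [0.0, 1.0, 2.0, 3.0, 4.0])

# A weak change stays below the threshold and produces no events at all.
weak = generate_events([1.0, 1.1], [0.0, 1.0])
```

Under this model the same moving object can trigger many events against a high-contrast background and none against a low-contrast one, which is the dependence on contrast that the illusion illustrates.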
03:57:00 — Cornelia Fermüller: Overview
- Outlines the main topics of the talk: event alignment optimization, self-supervised deep learning for SLAM, supervised/unsupervised deep learning for motion segmentation, EVDodge for drone navigation, and coupling 3D motion with scene images.
05:15:00 — Cornelia Fermüller: Properties of this sensor
- Discusses the key properties of event-based sensors, including high temporal resolution, high dynamic range, low bandwidth, low latency, and the challenge of high noise.
05:34:00 — Cornelia Fermüller: I. Egomotion+ Independent Motion
- Explains the problem of estimating egomotion and independent motion, emphasizing the coupled nature of image flow, 3D motion, and scene structure discontinuities.
07:26:00 — Cornelia Fermüller: Treat events as point clouds
- Introduces the concept of treating events as point clouds in a 3D space (x, y, time) and using a warp field to model rigid movement of a fronto-parallel plane.
08:06:00 — Cornelia Fermüller: Approximation of 3D Motion Estimation
- Presents the mathematical approximation for 3D motion estimation, breaking it down into translation, rotation, and expansion components derived from curl and divergence.
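The warp described in the two entries above can be sketched as a four-parameter image motion model: a 2D shift, an expansion term, and an in-plane rotation term. The parameterization, the names `flow_4param` and `warp_event`, and the first-order (constant velocity) warp are assumptions made for illustration; the talk's exact formulation may differ.

```python
def flow_4param(x, y, tx, ty, div, rot):
    """Image velocity at (x, y), with coordinates measured from the image centre.

    tx, ty: translation components (pixels per second).
    div:    expansion rate; the flow divergence of this model is 2 * div.
    rot:    in-plane rotation rate; the flow curl of this model is 2 * rot.
    """
    u = tx + div * x - rot * y
    v = ty + div * y + rot * x
    return u, v

def warp_event(event, params, t_ref=0.0):
    """Move one event (x, y, t) back along the modelled flow to time t_ref.

    First-order approximation: the velocity is evaluated at the event position
    and assumed constant over the interval t - t_ref.
    """
    x, y, t = event
    u, v = flow_4param(x, y, *params)
    dt = t - t_ref
    return (x - u * dt, y - v * dt, t_ref)

# A point that starts at (1, 2) and moves at (3, -1) px/s fires an event
# at (2.5, 1.5) when t = 0.5. Warping with the matching translation
# brings the event back to the starting position.
warped = warp_event((2.5, 1.5, 0.5), (3.0, -1.0, 0.0, 0.0))
```

When the parameters match the true motion, events generated by the same scene edge land on the same pixel after warping, which is what an alignment objective can exploit.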
08:44:00 — Cornelia Fermüller: How to compute it?
- Describes how to efficiently compute motion parameters using density (from event count images) and average time (from timestamp images), leveraging gradients for minimization.
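A small sketch of the two representations named above, assuming events are given as `(x, y, t)` tuples: the event count image accumulates how many events fall on each pixel, and the timestamp image stores their average time. The function name `event_images` and the toy data are hypothetical.

```python
def event_images(events, width, height):
    """Build an event count image and an average-timestamp image.

    events: iterable of (x, y, t); coordinates are rounded to the nearest pixel.
    Returns (count, t_avg) as nested lists indexed [row][column].
    Pixels without events get an average time of 0.0.
    """
    count = [[0] * width for _ in range(height)]
    t_sum = [[0.0] * width for _ in range(height)]
    for x, y, t in events:
        col, row = int(round(x)), int(round(y))
        if 0 <= col < width and 0 <= row < height:
            count[row][col] += 1
            t_sum[row][col] += t
    t_avg = [
        [t_sum[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]
    return count, t_avg

# Toy data: one point moving at 2 px/s along x, firing an event every 0.5 s.
raw = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (2.0, 0.0, 1.0), (3.0, 0.0, 1.5)]
count_raw, t_raw = event_images(raw, width=4, height=1)

# Compensate the motion by shifting each event back by velocity * time.
aligned = [(x - 2.0 * t, y, t) for x, y, t in raw]
count_aligned, t_aligned = event_images(aligned, width=4, height=1)
```

In the uncompensated case the events are smeared over four pixels and the timestamp image ramps along the direction of motion. After compensation they pile up on a single pixel, so the count image becomes sharper and the timestamp ramp disappears.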
09:57:00 — Cornelia Fermüller: Results
- Shows visual results of event count, timestamp, and gradient of timestamp images, demonstrating the effectiveness of the proposed representations.
10:20:00 — Cornelia Fermüller: Algorithm
- Presents the overall algorithm pipeline, which involves processing event slices, minimizing timestamp/event counts, extracting misaligned events, motion refinement, and object tracking in an iterative loop.
10:54:00 — Cornelia Fermüller: Dataset
- Introduces a dataset collected using a drone equipped with a DAVIS240B camera, featuring various objects, lighting conditions (including strobe light), and occlusions.
11:57:00 — Cornelia Fermüller: II. Flow Depth and 3D Motion Estimation
- Discusses self-supervised deep learning for estimating optical flow, depth, and 3D motion from event data, highlighting the relationship between flow, 3D motion, and depth.
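The relationship between flow, 3D motion, and depth mentioned above can be illustrated with the classical rigid-motion flow equations. This sketch assumes normalized image coordinates (focal length 1), a static scene, and a camera moving with translational velocity `t` and angular velocity `w`; the function name `motion_field` and the sign convention are choices made here, not details from the talk.

```python
def motion_field(x, y, depth, t, w):
    """Image velocity (u, v) at normalized pixel (x, y) for a moving camera.

    depth: scene depth Z at that pixel (same length unit as t).
    t:     camera translational velocity (tx, ty, tz).
    w:     camera angular velocity (wx, wy, wz) in rad/s.
    The translational part scales with 1 / depth; the rotational part
    does not depend on depth at all.
    """
    tx, ty, tz = t
    wx, wy, wz = w
    u_trans = (-tx + x * tz) / depth
    v_trans = (-ty + y * tz) / depth
    u_rot = wx * x * y - wy * (1.0 + x * x) + wz * y
    v_rot = wx * (1.0 + y * y) - wy * x * y - wz * x
    return u_trans + u_rot, v_trans + v_rot

# Forward motion: flow points away from the image centre and shrinks with depth.
u, v = motion_field(0.5, -0.2, depth=2.0, t=(0.0, 0.0, 1.0), w=(0.0, 0.0, 0.0))
```

Because depth enters only through the translational term, a network that predicts depth and camera pose determines the flow of the static scene, and pixels whose observed flow disagrees with it are candidates for independently moving objects.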
12:47:00 — Cornelia Fermüller: II. Flow Depth and 3D Motion Estimation (cont.)
- Presents the neural network architecture used for self-supervised learning, which processes event slices to predict depth and pose, and uses optical flow for alignment.
13:42:00 — Cornelia Fermüller: II. Highlights
- Highlights the key achievements: unsupervised learning of dense optical flow, depth, and egomotion from sparse event data, with day-to-night transfer and robustness to data sparsity.
14:19:00 — Cornelia Fermüller: II. Highlights (cont.)
- Details the new lightweight ECN architecture, which uses a multi-resolution approach with feedback and generation mechanisms to learn features efficiently.
16:41:00 — Cornelia Fermüller: Outdoor Day 1
- Presents the ground truth and inferred trajectories for an outdoor day scene, showing good alignment and minimal drift.
17:01:00 — Cornelia Fermüller: III. EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras
- Introduces EV-IMO, the first dataset for event cameras that provides pixel-wise object masks, depth ground truth, and object/camera trajectories for independently moving objects.
18:18:00 — Cornelia Fermüller: Using motion masks to learn a pose mixture model
- Explains how motion masks are used in conjunction with depth and optical flow to learn a pose mixture model for multiple independently moving objects.
18:32:00 — Cornelia Fermüller: Our Dataset: EV-IMO
- Details the EV-IMO dataset creation process, including static room scans, high-resolution object scans, and tracking objects with a Vicon system to generate ground truth.
19:20:00 — Cornelia Fermüller: First dataset featuring
- Summarizes the key features of the EV-IMO dataset: pixel-wise object masks, depth ground truth, and object/camera trajectories, which are crucial for motion segmentation research.
19:44:00 — Cornelia Fermüller: Scene Motion With Event-Based Vision: Learning (II)
- Highlights that this is the first work to estimate and evaluate 3D object motion using supervised learning (mask and depth) and warping on tiny subslices for DVS data.
20:12:00 — Cornelia Fermüller: Comparison of full and small network
- Compares the performance of a full network (2000K parameters) versus a small network (40K parameters) for depth and mask estimation, showing that smaller networks can still yield reasonable qualitative results.
21:33:00 — Cornelia Fermüller: EVDodge
- Introduces EVDodge, an embedded AI system for high-speed dodging on a quadrotor using event cameras, with all computations done online on an NVIDIA TX2 CPU+GPU.
21:55:00 — Cornelia Fermüller: Training in Simulation Environment
- Explains that the EVDodge system was trained entirely in a simulated environment, including deblurring techniques to make simulated data compatible with real-world data.
22:49:00 — Cornelia Fermüller: AI Navigation Stack for Dodging Objects
- Presents the AI navigation stack architecture for dodging objects, which processes event data through deblurring, homography estimation, and segmentation flow networks to predict obstacle avoidance actions.
23:41:00 — Cornelia Fermüller: Obstacle Detected!
- Demonstrates real-time obstacle detection and avoidance by the drone, showcasing its ability to react quickly to moving objects.
24:11:00 — Cornelia Fermüller: Pursuit Task
- Shows a pursuit task where the drone successfully tracks and ‘hits’ another drone, demonstrating the system’s capability for dynamic interaction.
24:25:00 — Cornelia Fermüller: Summary:
- Summarizes the talk, reiterating the importance of events for robust motion segmentation, the development of new datasets and neural network approaches, and demonstrated real-time actions on drones.

Key Takeaways

Event-based vision offers clear advantages for robust motion segmentation in scenarios with fast motion, challenging lighting, and occlusions, where traditional frame-based cameras struggle.
Novel deep learning architectures, such as the lightweight ECN, can effectively process sparse event data for tasks like optical flow, depth, and egomotion estimation, demonstrating transferability across different lighting conditions (day to night).
The introduction of new datasets like EV-IMO, which provides pixel-wise object masks, depth ground truth, and object/camera trajectories for independently moving objects, is vital for advancing supervised and unsupervised learning in event-based vision.
Event-based systems can be integrated into real-time applications on drones, enabling high-speed actions like obstacle dodging and pursuit tasks through efficient online computations and specialized AI navigation stacks.

Methods / Models / Datasets Mentioned

DVS (Dynamic Vision Sensor)
DSLR Camera
Stepping Feet Illusion
Event Camera
Event Cloud Alignment
SLAM (Simultaneous Localization and Mapping)
EVDodge
DAVIS240B camera
Qualcomm Flight platform
Snapdragon APQ8074 ARM CPU
ECN (Evenly-Cascaded Convolutional Network)
EV-IMO dataset
Vicon system
NVIDIA TX2 CPU+GPU
EVDeblurNet
EVHomographyNet
EVSegFlowNet

Topics

Event-based vision · Motion segmentation · Deep learning · SLAM · Drone navigation · Optical flow · Depth estimation · Egomotion · Neural network architecture · Real-time control
