Learning Spatiotemporal Filters to Track Visual Saliency

Event: CVPR 2025 · Duration: 19 min

Abstract

The presentation explores visual saliency, its applications, and the challenges of tracking it with event-based cameras. It proposes an unsupervised learning model that uses spatiotemporal filters, learned through clustering and decision trees, to identify and follow salient objects. The model applies lifelong learning principles to manage information over time, aiming for robustness and efficiency in real-time applications, particularly on hardware with limited on-chip resources. Experimental results on the Streetcar and Motorway datasets show that the model adapts to different environments and distinguishes obvious from nuanced features, consistent with human observer behavior.

Speakers

Khaled Aboumerhi — ECE Ph.D. Candidate, Johns Hopkins
Ralph Etienne-Cummings — Professor of ECE, Johns Hopkins

Talks (1)

00:00:00 — Khaled Aboumerhi: Learning Spatiotemporal Filters to Track Visual Saliency
- This presentation introduces an unsupervised visual saliency model that leverages event-based camera data and lifelong learning principles to dynamically learn spatiotemporal filters for tracking salient features in complex environments.

Key Takeaways

The proposed model effectively learns spatiotemporal filters from event-based camera data using unsupervised clustering and decision trees, enabling dynamic tracking of visual saliency.
Lifelong learning principles are crucial for managing large event-based datasets, ensuring consistency, preventing catastrophic forgetting, and maintaining space efficiency for real-time and online applications.
The model’s ability to differentiate between obvious and nuanced salient features aligns with human visual attention, suggesting its potential for more sophisticated robotic and computer vision systems.
Future work involves acquiring ground-truth spike-based saliency datasets using closed-environment eye-tracking devices (such as HTC Vive or Microsoft HoloLens) to validate and compare visual saliency algorithms more accurately.
Optimizing data processing by breaking down event streams into time blocks and applying learned filters during intermittent latent phases can improve accuracy and processing speed.
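
The time-block idea in the last takeaway can be illustrated with a minimal sketch. It assumes events arrive as a sorted array of microsecond timestamps; the function name `split_into_time_blocks` and the `block_us` parameter are hypothetical and are not taken from the talk.

```python
import numpy as np

def split_into_time_blocks(timestamps_us, block_us):
    """Group event indices into fixed-width time blocks.

    timestamps_us: 1-D sequence of event timestamps in microseconds,
                   sorted in ascending order (assumption).
    block_us:      block width in microseconds (hypothetical parameter).
    Returns a list of index arrays, one per non-empty block.
    """
    timestamps_us = np.asarray(timestamps_us)
    if timestamps_us.size == 0:
        return []
    # Block id of each event, measured from the first timestamp.
    block_ids = (timestamps_us - timestamps_us[0]) // block_us
    # A change in block id marks the start of a new block.
    boundaries = np.flatnonzero(np.diff(block_ids)) + 1
    # Blocks that contain no events are simply absent from the result.
    return np.split(np.arange(timestamps_us.size), boundaries)
```

Each returned index array selects the events of one block, so previously learned filters could be applied block by block rather than to the whole stream at once.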

Methods / Models / Datasets Mentioned

Prophesee.ai Dataset
ATIS Camera
APS events
HOTS (Hierarchy of Event-Based Time Surfaces)
Decision Trees
Random Forest
HTC Vive
Microsoft HoloLens
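
HOTS, listed above, describes the neighbourhood of each event with a time surface: an exponentially decayed map of how recently nearby pixels fired. The sketch below shows that computation for a single event and a single polarity. Whether the talk's model uses exactly this form is an assumption, and the names `time_surface`, `last_ts`, `radius`, and `tau` are illustrative.

```python
import numpy as np

def time_surface(last_ts, x, y, t, radius, tau):
    """HOTS-style time surface around an event at pixel (x, y), time t.

    last_ts: 2-D array (height, width) with the most recent event
             timestamp at each pixel for one polarity, and -inf at
             pixels that have not fired yet.
    radius:  half-width of the square neighbourhood (hypothetical value).
    tau:     exponential decay constant, in the same time unit as t.
    Assumes the event lies at least `radius` pixels from the border.
    """
    patch = last_ts[y - radius:y + radius + 1, x - radius:x + radius + 1]
    # Recent events map to values near 1 and older ones decay toward 0;
    # pixels that never fired (-inf) map to exactly 0.
    return np.exp(-(t - patch) / tau)
```

Surfaces like this, collected over many events, are the kind of feature vectors an unsupervised clustering step can group into prototype filters.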

Topics

Visual Saliency · Spatiotemporal Filters · Event-Based Cameras · Unsupervised Learning · Lifelong Learning · Data Management · Eye-Tracking · Computer Vision · Robotics · Real-time Processing

Notes

Open for commentary — connections to other work, critiques, follow-up reading.