Visual Perception and Learning in an Open World (VPLOW) Workshop Session

Event: CVPR 2024 Workshop on Visual Perception and Learning in an Open World · Duration: 179 min

Abstract

This CVPR 2024 workshop session, “Visual Perception and Learning in an Open World (VPLOW)”, explores the challenges and opportunities in developing multimodal foundation models for understanding and interacting with the physical world. Speakers discuss the evolution of open-world learning, the limitations of current vision systems, and the potential of integrating additional sensory inputs such as touch and sound. Key topics include predicting performance under distribution shift, handling novelty, leveraging large language models and vision-language models, and developing self-supervised learning techniques for 3D reconstruction, sound localization, and instance detection. The session highlights the importance of robust feature representations, cross-modal consistency, and high-quality, diverse, well-annotated datasets, whether collected by people or generated synthetically, for advancing multimodal perception in complex real-world environments.

Speakers

Shu Kong — Texas A&M University, University of Macau
Andrew Owens — University of Michigan
Yunhan Zhao — Carnegie Mellon University
Xiaolong Wang — UC San Diego

Talks (4)

00:00:00 — Shu Kong: VPLOW@CVPR’24: The 4th Workshop of Visual Perception and Learning in an Open World
- Introduces the workshop, traces the evolution of open-world concepts, presents insights on predicting performance under distribution shift using LCA, and examines failure cases of LLMs on scientific names, with proposed remedies including retrieval-augmented learning.
01:12:00 — Andrew Owens: Learning Multimodal Models of the Physical World
- Discusses learning multimodal models by connecting vision with touch and sound, introducing vision-based touch sensors, generating images from touch, tactile-driven stylization, and learning 3D through analogies.
02:24:45 — Yunhan Zhao: InsDet: Object Instance Detection Challenge
- Presents the InsDet challenge for object instance detection, highlighting the limitations of closed-world approaches and advocating for leveraging open-world foundation models like SAM and DINOv2 for improved performance.
02:58:00 — Xiaolong Wang: Spatial Perception and Control in the Wild
- Explores spatial perception and control in the wild, emphasizing multimodal learning beyond vision, including physical properties, visual-tactile data collection, and self-supervised learning for sound localization from motion.
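
The InsDet talk's suggestion to build on open-world foundation models can be illustrated with a small matching sketch. It assumes that region proposals (for example from SAM) and reference views of each object instance have already been turned into feature vectors (for example by DINOv2), and it only shows the final step: assigning each proposal to its most similar instance by cosine similarity. The function names, the `threshold` value, and the toy 3-dimensional features are hypothetical and chosen for illustration; this is not the challenge's reference implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_proposals(proposal_feats, template_feats, threshold=0.5):
    """Assign each proposal to the best-matching instance, or drop it.

    proposal_feats: list of feature vectors, one per region proposal.
    template_feats: dict mapping instance id -> list of reference-view features.
    threshold: hypothetical minimum similarity for accepting a match.
    Returns a list of (proposal_index, instance_id, score) tuples.
    """
    matches = []
    for idx, feat in enumerate(proposal_feats):
        best_id, best_score = None, threshold
        for inst_id, views in template_feats.items():
            # Score an instance by its most similar reference view.
            score = max(cosine(feat, view) for view in views)
            if score > best_score:
                best_id, best_score = inst_id, score
        if best_id is not None:
            matches.append((idx, best_id, best_score))
    return matches


# Toy features standing in for real embeddings.
templates = {"mug": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], "can": [[0.0, 1.0, 0.0]]}
proposals = [[0.95, 0.05, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
results = match_proposals(proposals, templates)
```

In this toy run the first proposal is assigned to "mug", the second to "can", and the third is discarded as background because it resembles no reference view.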

Key Takeaways

Open-world learning is a critical frontier in AI, moving beyond closed-set assumptions to enable models to adapt and perform in dynamic, unpredictable real-world environments.
Multimodal foundation models, integrating vision, touch, and sound, are essential for a comprehensive understanding of the physical world and for robust spatial perception and control.
Leveraging diverse data sources, including human-powered collection, synthetic data generation, and large-scale internet data, is crucial for training and evaluating open-world models.
Self-supervised learning, cross-modal consistency, and learning through analogies offer powerful paradigms for developing generalizable and robust multimodal representations without extensive manual annotation.
Addressing challenges like novelty detection, distribution shift, and the inherent ambiguity in human labeling instructions requires innovative approaches in model architecture, training protocols, and evaluation metrics.

Methods / Models / Datasets Mentioned

GPT-4
CLIP
DINOv2
SAM
Detic
GelSight
NeRF
SDEdit
Cut-Paste-Learn (CPL)
GCC-PHAT
IID
MonoCLR
StereoCRW
SIFT
SuperGlue
Ours-L2R
ImageNet
ImageNetV2
ObjectNet
iNat
Aves
LAION-2B
LAION-400M
REACT
DALL-E 3
SD-XL
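
Of the entries above, GCC-PHAT (generalized cross-correlation with phase transform) is a classical signal-processing method for estimating the time delay between two microphone signals, which is the quantity that sound localization builds on. The following is a minimal, dependency-free sketch of the technique, not code from the talks: it uses a naive O(n^2) DFT as a stand-in for an FFT library, and the function names and the test signal are assumptions made for illustration.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, fs):
    """Estimate how many seconds `sig` lags behind `ref` (negative if it leads)."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation equals linear
    sig_f = dft(list(sig) + [0.0] * (n - len(sig)))
    ref_f = dft(list(ref) + [0.0] * (n - len(ref)))
    # Cross-power spectrum, then PHAT weighting: discard magnitude, keep phase.
    cross = [s * r.conjugate() for s, r in zip(sig_f, ref_f)]
    cross = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in dft(cross, inverse=True)]
    # Negative lags are stored at the end of the circular correlation.
    lags = range(-(n // 2), (n + 1) // 2)
    best_lag = max(lags, key=lambda lag: cc[lag % n])
    return best_lag / fs


# Synthetic example: the second channel is the first one delayed by 3 samples.
fs = 16000
ref = [math.sin(1.3 * t * t) for t in range(32)]
sig = [0.0] * 3 + ref
delay_seconds = gcc_phat(sig, ref, fs)
```

Here `delay_seconds` recovers the 3-sample delay (3 / 16000 s). With real recordings, an FFT routine and a bound on the maximum physically plausible delay would normally replace the naive transform and the full lag search.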

Topics

Open World Learning · Multimodal Foundation Models · Visual Perception · Spatial Perception · Physical World Understanding · Novelty Detection · Distribution Shift · Visual-Tactile Sensing · Audio-Visual Learning · 3D Reconstruction · Self-Supervised Learning · Instance Detection · Data Augmentation · Prompt Engineering
