VPLOW@CVPR’24: The 4th Workshop of Visual Perception and Learning in an Open World

Event: CVPR 2024 Workshop · Duration: 385 min

Abstract

The workshop opens with remarks from the organizers, an invited talk on open issues in open-world learning, and a keynote on multimodal foundation models. The opening remarks introduce the workshop’s goals, schedule, and challenges, emphasizing the shift towards training foundation models in the open world and the need for new evaluation metrics such as LCA. The invited talk examines the ‘novelty problem’ in AI, particularly for self-driving cars, and proposes an agent template for open-world learning. The keynote explores continual learning and the importance of realistic datasets, and argues for treating labeling instructions as multimodal artifacts to facilitate model adaptation in the era of foundation models.

Two talks on multimodal learning and object detection follow. “Learning Multimodal Models of the Physical World” by Andrew Owens discusses leveraging sensory supervision from touch and sound to build robust models. It covers vision-based touch sensors, human-powered data collection, and audio-visual analogies for generating sound for silent videos. “InsDet: Object Instance Detection Challenge” by Yunhan Zhao introduces a challenge for detecting specific object instances in an open-world context, emphasizing the use of foundation models and synthetic data generation to overcome the limitations of traditional closed-world approaches.

The next talk introduces the idea of unifying generative and discriminative AI to tackle the complexities of the 4D open world. It highlights the limitations of current generative models in understanding and acting within dynamic, real-world environments, and proposes a framework that integrates 3D/4D generation, discriminative perception, and agent-based problem-solving. The speaker presents several research efforts, including ConsistDreamer for 3D-consistent editing, Panoptic Scene Graph variants (PSG, PVSG, PSG4D) for 4D perception, and multi-modal AI assistants (Otter, LLaVA-NeXT) for reasoning and instruction following. The ultimate goal is to build embodied multi-modal AI assistants capable of generating executable code and learning from environmental feedback.

Three presentations then cover advancements and challenges in computer vision. Yu-Xiong Wang gives an overview of generative models, detailing their application in 3D/4D scene editing, physics-informed generation, and their repurposing for discriminative tasks, including the use of LLMs as visual encoders and decision-making agents. Neehar Peri introduces the Foundational Few-Shot Object Detection Challenge, highlighting the limitations of current benchmarking protocols for foundational vision-language models and proposing new approaches for concept alignment. Meng Wei presents the Open Vocabulary Part Segmentation Challenge, emphasizing the role of vision-language foundational models in open-world visual perception.

The final part of the workshop features presentations from several speakers on the OV-PARTS and V3Det challenges. It highlights the solutions developed by participating teams, including methods for part-level open-world learning, zero-shot generalization, and vast vocabulary object detection. It also covers the challenges encountered, such as data scarcity and open granularity, and discusses future directions for research in these areas. The closing remarks emphasize the workshop’s goals of connecting people, identifying new opportunities, exchanging ideas, and understanding existing challenges, and acknowledge the organizers and sponsors.

Speakers

Shu Kong — Texas A&M University, University of Macau
Walter Scheirer — University of Notre Dame
Deva Ramanan — Carnegie Mellon University
Andrew Owens
Yunhan Zhao
Yuxiong Wang — Department of Computer Science, UIUC
Yu-Xiong Wang — Carnegie Mellon University
Meng Wei — The University of Hong Kong
Meng Kelly
Xuejian Gou — School of Artificial Intelligence, Xidian University
Jiho Choi — Graduate School of AI, KAIST
Yunhan Yang — The University of Hong Kong
Jiaqi Wang — Research Scientist @ Shanghai AI Laboratory
Lingchen Meng — Fudan University
Zeming Chen — Tsinghua University
Bosong Chai — Zhejiang University
Neehar Peri — CMU

Talks (18)

00:00:00 — Shu Kong: Opening Remarks
- Introduces the 4th VPLOW@CVPR’24 workshop, outlines its goals and schedule, and acknowledges the organizers. Discusses the evolution of open-world recognition and the challenges posed by foundation models, and proposes solutions such as LCA and multimodal instruction-based adaptation.
00:15:55 — Walter Scheirer: Open Issues in Open World Learning
- Discusses the ‘novelty problem’ in real-world AI applications like self-driving cars, highlighting the limitations of big data. Proposes an ‘Open World Agent Template’ and outlines key open issues in theory, agent design, and evaluation of novelty.
01:00:10 — Deva Ramanan: Open World Learning in the Era of MultiModal Foundation Models
- Motivates the importance of addressing domain shift in deployed AI. Discusses continual learning, the CLEAR benchmark, and the role of unlabeled data. Proposes treating labeling instructions as multimodal artifacts and replacing open-world tasks with model adaptation via instructions, especially in the context of multimodal foundation models.
01:17:04 — Andrew Owens: Learning Multimodal Models of the Physical World
- This talk explores methods for learning multimodal models of the physical world, focusing on sensory supervision from touch and sound, and leveraging analogies for audio-visual learning.
01:41:41 — Yunhan Zhao: InsDet: Object Instance Detection Challenge
- This presentation introduces the InsDet challenge, focusing on object instance detection in an open-world setting by leveraging foundation models and synthesizing training data.
02:34:08 — Yuxiong Wang: All-in-One: Bridging Generative and Discriminative Learning in the 4D Open World
- This talk introduces a comprehensive approach to unify generative and discriminative AI, focusing on 4D open-world perception, generation, and problem-solving through multi-modal AI assistants and embodied agents.
03:51:12 — Yu-Xiong Wang: Generative Models: From Perception to Reasoning and Action
- This talk provides a comprehensive overview of advancements in generative models, showcasing their application in 3D/4D scene editing, physics-informed generation, and their repurposing for discriminative tasks like data augmentation and multimodal perception. It delves into the surprising strength of LLMs as visual encoders and their role in empowering decision-making agents through frameworks like LATS. The presentation concludes by advocating for the synergy of heterogeneous generative models to achieve enhanced world understanding and text-driven 3D human-object interaction generation.
03:56:19 — Neehar Peri: Foundational Few-Shot Object Detection Challenge
- This talk introduces the Foundational Few-Shot Object Detection (FSOD) Challenge, highlighting the limitations of current FSOD benchmarking protocols for foundational vision-language models (VLMs). It discusses issues like poor concept alignment between pre-training and target domains, concept leakage, and the need for multimodal annotator instructions. The challenge aims to align foundational models with human annotators using few-shot multimodal examples and proposes new baselines and evaluation metrics for this task.
03:57:04 — Meng Wei: Open Vocabulary Part Segmentation Challenge @ CVPR 2024 (OV-PARTS)
- This presentation introduces the Open Vocabulary Part Segmentation (OV-PARTS) Challenge at CVPR 2024, designed to push the boundaries of visual perception in open-world scenarios. It emphasizes the critical role of vision-language foundational models in unlocking the vast and ever-evolving information available on the web.
05:08:16 — Meng Kelly: Background & Motivation of OV-PARTS
- This segment introduces the OV-PARTS challenge, highlighting the importance of part-level open-world learning and the challenges of using foundation models for this task, such as data scarcity and open granularity.
05:11:47 — Xuejian Gou: Open Vocabulary Part Segmentation Challenge
- This segment introduces the team’s approach to the OV-PARTS challenge, detailing the datasets used, the task description for Track 1 and Track 2, and an overview of their methods involving model fusion and post-processing to achieve superior segmentation results.
05:13:32 — Jiho Choi: PartCLIPSeg
- This segment presents the PartCLIPSeg solution for the OV-PARTS challenge, outlining the method’s approach to leveraging generalized parts and object contexts, the object and part embedding generation process, object-specific part construction, and the multi-level mask supervision strategy.
05:14:47 — Yunhan Yang: OV-PARTS Challenge 2024
- This segment presents a two-stage framework for the OV-PARTS challenge, detailing the use of Segment Anything Model for part mask proposals and post-processing, followed by a Multi-Modal LLM for assigning part classes to these masks, and discusses the results on Track 2 and future extensions to 3D part segmentation.
05:16:17 — Jiaqi Wang: V3Det Challenge: Challenge of Vast Vocabulary Visual Detection
- This segment introduces the V3Det dataset and challenge, highlighting its vast vocabulary, hierarchical category organization, and rich annotations, and outlines the two tracks for supervised and open vocabulary object detection, along with the evaluation policies and training data.
05:19:02 — Lingchen Meng: RichSem-DINO-FocalNet for V3Det Challenge 2024
- This segment presents the RichSem-DINO-FocalNet solution for the V3Det Challenge, detailing the method’s strong detector, FocalNet-Huge backbone, Objects365 pre-training, and an open-vocabulary detection (OVD) classifier that ensembles vision and language prototypes, achieving first place in the OVD track and second in the supervised track.
05:20:17 — Zeming Chen: Mixed Pseudo Labels based on Co-DETR
- This segment presents a semi-supervised object detection framework called MixPL, which utilizes Co-DETR as the detector and leverages both labeled and unlabeled data with strong and weak augmentations to generate pseudo-labels, achieving first place in the V3Det Challenge.
05:21:32 — Bosong Chai: The 4th Open World Vision Workshop: V3Det Challenge 2024
- This segment presents the team’s solution for the V3Det Challenge, detailing their approach for Track 1 and Track 2, including architecture adjustments, data augmentation strategies, loss function choices, and implementation details, achieving third place in Track 1 and second place in Track 2.
05:23:24 — Neehar Peri: Closing Remarks
- This segment provides a workshop review, highlighting the six speakers and four challenges, and emphasizes the goals of connecting people, identifying new opportunities, exchanging ideas, and understanding existing challenges, while acknowledging the organizers and sponsors.
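The open-vocabulary classifier in the RichSem-DINO-FocalNet entry above is described only as an ensemble of vision and language prototypes. A minimal sketch of one way such an ensemble could work is shown below, assuming each class has one text-derived prototype and one image-derived prototype and that the final score is a weighted sum of cosine similarities. The function names, the mixing weight `alpha`, and the toy vectors are illustrative assumptions, not the team's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ensemble_scores(region_feat, text_protos, vision_protos, alpha=0.5):
    """Score one region embedding against every class.

    text_protos / vision_protos: dict mapping class name -> prototype vector.
    alpha: hypothetical weight placed on the language prototype.
    """
    scores = {}
    for name in text_protos:
        s_text = cosine(region_feat, text_protos[name])
        s_vis = cosine(region_feat, vision_protos[name])
        scores[name] = alpha * s_text + (1.0 - alpha) * s_vis
    return scores


def classify(region_feat, text_protos, vision_protos, alpha=0.5):
    """Return the class whose combined prototype score is highest."""
    scores = ensemble_scores(region_feat, text_protos, vision_protos, alpha)
    return max(scores, key=scores.get)


# Toy 3-D embeddings, made up purely for illustration.
text_protos = {"mug": [1.0, 0.0, 0.0], "bowl": [0.0, 1.0, 0.0]}
vision_protos = {"mug": [0.9, 0.1, 0.0], "bowl": [0.1, 0.9, 0.2]}
region = [0.8, 0.2, 0.1]
predicted = classify(region, text_protos, vision_protos)
```

Setting `alpha=1.0` falls back to a language-only classifier, which is the usual zero-shot setup; lower values let a few visual exemplars per class influence the decision.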

Key Takeaways

Foundation models are increasingly trained in open-world settings, necessitating new evaluation paradigms, such as LCA-on-the-line, that go beyond traditional accuracy metrics.
The ‘novelty problem’ is a critical challenge in real-world AI, requiring agents to detect, characterize, and adapt to unforeseen situations and concepts, which cannot be solved by simply scaling data.
Labeling instructions and other multimodal artifacts are crucial for defining concepts and enabling model adaptation in open-world scenarios, especially with the rise of large foundation models.
Future research in open-world learning should focus on developing robust methods for continual learning from evolving data streams, leveraging unlabeled data, and rethinking evaluation protocols to account for domain shift and the inherent limitations of current datasets.
Multimodal models can learn about the physical world by integrating sensory data from vision, touch, and sound, providing richer information than vision alone.
Vision-based touch sensors (like GelSight) convert touch into visual data, enabling the use of off-the-shelf video understanding techniques and allowing for the estimation of material properties.
Audio-visual analogies, inspired by Foley artists, can be used to generate realistic sound for silent videos by transferring sound characteristics from a conditional example.
Object instance detection in an open-world setting can be significantly improved by leveraging pre-trained foundation models and synthesizing diverse training data, moving beyond closed-world assumptions.
Generative AI is evolving beyond mere creation, with a growing emphasis on understanding and taking action in complex 4D open-world environments.
Achieving 3D-consistent instruction-guided editing is a major challenge, addressed by propagating editing information across multiple views using strategies like structured noise and surrounding views.
Comprehensive 4D open-world perception requires advanced discriminative models capable of generating panoptic scene graphs, dynamic video scene graphs, and 4D scene graphs, moving beyond simple object recognition.
The development of multi-modal AI assistants (e.g., Otter, LLaVA-NeXT) and specialized benchmarks (e.g., MIMIC-IT, FunQA, WorldQA) is crucial for evaluating and improving reasoning, planning, and instruction-following capabilities in diverse and challenging scenarios.
The ultimate goal is to build embodied multi-modal AI assistants (e.g., Octopus) that can generate executable code and learn from environmental feedback, enabling autonomous problem-solving in open-world settings.
Generative models, particularly LLMs, are increasingly being repurposed for discriminative tasks, serving as data engines for augmentation or as powerful pre-learned feature extractors, demonstrating strong performance across various visual and multimodal tasks.
Integrating geometry and physics awareness into generative models is crucial for achieving realistic and consistent 3D and 4D scene editing, high-quality shape deformation, and plausible human-object interaction prediction.
Foundational vision-language models (VLMs) offer significant potential for few-shot object detection, but current benchmarking protocols need re-evaluation to account for concept leakage and the unique strengths of large-scale pre-training.
The future of visual perception in open-world scenarios lies in synergizing heterogeneous generative models, including LLMs, motion models, and physics priors, to build comprehensive AI agents capable of advanced reasoning, acting, and planning.
The OV-PARTS challenge emphasizes the importance of part-level open-world learning, addressing the limitations of current foundation models in fine-grained object understanding due to data scarcity and open granularity.
The V3Det challenge introduces a vast vocabulary dataset with over 13,000 categories, hierarchical organization, and rich annotations, pushing the boundaries of object detection in complex, real-world scenarios.
Winning solutions leverage advanced techniques like model fusion, multi-level mask supervision, and semi-supervised learning with strong detectors and backbones, demonstrating significant performance improvements in both supervised and open vocabulary object detection tracks.
Future research directions include bridging the performance gap between supervised and open vocabulary methods, enhancing controlled granularity in part segmentation, and exploring the integration of vision foundation models with multi-modal large language models for more robust and context-aware perception.
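The LCA-based evaluation mentioned in the first takeaway can be illustrated with a toy example. The sketch below assumes LCA stands for lowest common ancestor in a class taxonomy, and scores a mistake by how far up the hierarchy one must climb from the true class to reach an ancestor shared with the prediction. The taxonomy, the function names, and this particular distance definition are illustrative assumptions; the exact formulation used in LCA-on-the-line may differ.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "mammal": "animal",
    "bird": "animal",
    "cat": "mammal",
    "dog": "mammal",
    "sparrow": "bird",
    "car": "vehicle",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest one.
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes are not in the same tree")


def lca_distance(true_label, predicted_label):
    """Edges climbed from the true class to reach the lowest common ancestor."""
    lca = lowest_common_ancestor(true_label, predicted_label)
    return path_to_root(true_label).index(lca)


def mean_lca_distance(true_labels, predicted_labels):
    """Average LCA distance over paired true/predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(lca_distance(t, p) for t, p in pairs) / len(pairs)
```

Under this definition a model that confuses a cat with a dog is penalized less than one that confuses a cat with a car, even though plain top-1 accuracy treats both mistakes as equally wrong.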

Methods / Models / Datasets Mentioned

AAD (Absent Answer Detection)
ATraDiff
ActivityNet-QA
BERT
CIFAR-10
CLEAR Benchmark
CLIP
COCO 2014
COCO 2017
Cascade R-CNN
CenterNet2-RN50-OVD
ChatGPT
Co-DETR
COCO
ConsistDreamer
ControlNet
Copy-Paste Learn
CreativeQA
DALL-E 3
DINOv2
DPO (Direct Preference Optimization)
Detic-RN50-ImageNet
Dolly 3
EVA-CLIP
Ego4D
EPIC-KITCHENS
FedLoss
Flamingo
FocalNet-Huge
Fuyu Transformer Decoder
GCC-PHAT
GPT-3
GPT-3.5
GPT-4
GPT-4V(ision)
GPT-4o
GRIT
GTA (Grand Theft Auto V)
GelSight
GoldG
GroundingDINO
HOI4D
HumorQA
IASD (Incompatible Answer Set Detection)
IID
IVQD (Incompatible Visual Question Detection)
ImageNet
ImageNetV2
InsDet
Instruct 4D-to-4D
Instruct-NeRF2NeRF
InstructPix2Pix
InterDreamer
LAION-2B
LAION-400M
LATS
LCA-on-the-line
LLaMA
LLaVA-NeXT
LVIS
MCTS
MIMIC-IT Dataset
MMC4
MMBench
MNIST
MQ-GLIP
MSVD-QA
MagicQA
MagnifierBench
MetaCLIP
Midjourney
MixPL
MonoCLR
MotionGPT
Multi-Modal LLM (MLLM)
NeRF
NeuralEditor
nuImages
OPT
Otter
OtterHD
OV-PARTS
Objects365
Objects365 V2
Octopus
Open Assistant
OpenCLIP
OpenFlamingo
OpenImagesV6
Ours-L2R
PSG
PSG4D
PSGFormer
PSGTR
PVSG
PartCLIPSeg
Permuted-MNIST
PhraseCut
REACT
RFS
RLEF Training (Reinforcement Learning from Environmental Feedback)
RefCOCO
RefCOCO+
RefCOCOg
RichSem-DINO-FocalNet
SD-XL
SDEdit
SIFT
ScanNetv2
Segment Anything Model (SAM)
Social IQ-QA
Sora
Split-CIFAR
Stable Diffusion
StereoCRW
SuperGlue
TGIF-QA
TIDE
TVQA
ToT
Transformer
V3Det
V3Det-CascadeRCNN-EVA-Huge
V3Det-DINO-SwinB
V3Det-DeformDETR-SwinB
VQGAN
Vicuna/Flan-T5
VidOR
Video-MME
Waymo
WorldQA
YFCC100M (Yahoo Flickr Creative Commons 100 Million)
cGAN
gRefCOCO

Topics

3D Consistent Editing · 3D Part Segmentation · 3D/4D Scene Editing · 4D Open World · Agent-based AI · Audio-Visual Analogies · Concept Alignment · Continual Learning · Discriminative AI · Domain Shift · Embodied AI · Few-Shot Object Detection (FSOD) · Few-Shot Recognition · Foundation Models · Generative AI · Generative Models · Labeling Instructions · Large Language Models (LLMs) · Model Adaptation · Multi-modal AI Assistants · Multi-modal LLM · Multimodal Learning · Multimodal Perception · Novelty Detection · OV-PARTS Challenge · Object Instance Detection · Object Detection Errors · Open Vocabulary Part Segmentation · Open-World Learning · Open-World Vision · Part-Level Open-World Learning · Physics-Informed Generation · Problem Solving · Reinforcement Learning (RL) · Scene Graph · Semi-Supervised Learning · Sensory Supervision · Sound Localization · Touch Sensors · V3Det Challenge · Vast Vocabulary Object Detection · Vision-Language Models (VLMs) · Zero-Shot Recognition · Zero-Shot Generalization

Notes

Open for commentary — connections to other work, critiques, follow-up reading.