CVPR 2024 Object-Centric Representation For Computer Vision Tutorial
Event: CVPR 2024 Tutorial · Duration: 80 min
Abstract
This tutorial provides a comprehensive overview of object-centric representation learning in computer vision, focusing on bridging the gap between current research and real-world applications. The first part introduces the core motivations for object-centric learning, analyzes the limitations of early models on complex data, and details various strategies for improvement, including reconstructing richer signals beyond RGB pixels, upgrading encoder architectures, and enhancing decoders through 3D inductive biases or decoupling techniques. The second part delves into the theoretical framework of causal representation learning, discussing its role in achieving systematic generalization, addressing challenges in unsupervised disentanglement, and exploring its applications in diverse scientific domains. The tutorial concludes with an introduction to an open-source framework and future outlooks.
Speakers
- Tianjun Xiao — Senior Applied Scientist, AWS AI
- Francesco Locatello — IST Austria
Talks (2)
- 00:00:00 — Tianjun Xiao: Bridging the gap to real-world object-centric learning
- This talk introduces the motivation behind object-centric learning, highlights the gap between current models and real-world data, and explores methods to bridge this gap by upgrading learning objectives (reconstructing beyond RGB pixels), encoders, and decoders. It also provides an overview of the Object-Centric Learning Framework (OCLF).
- 00:28:28 — Francesco Locatello: Causal Representation Learning
- This talk delves into the theoretical underpinnings of causal representation learning, discussing how structured representations can aid in generalization and robustness. It covers challenges in unsupervised disentanglement, the concept of identifiability, and various applications of causal representation learning in diverse scientific domains.
Key Takeaways
- Object-centric learning aims for structured visual representations that enable systematic generalization and causal reasoning, moving beyond pixel-level reconstruction.
- Bridging the gap to real-world data requires upgrading learning objectives (e.g., reconstructing optical flow or depth), leveraging powerful encoders (e.g., self-supervised Vision Transformers), and employing sophisticated decoders (e.g., NeRF-based or diffusion models).
- Causal representation learning provides a framework for understanding and manipulating the underlying generative processes of data, offering benefits in robustness, generalization, and interpretability.
- Identifiability of disentangled representations is a critical theoretical challenge, with recent work exploring conditions under which causal factors can be uniquely recovered, often requiring weak supervision or specific structural assumptions.
- The field is moving towards developing modular architectures that facilitate mechanism reuse at the object level, enabling controllable generation, scene editing, and applications in scientific discovery.
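As a rough illustration of the slot-based grouping behind these takeaways, the sketch below runs a simplified slot-attention-style update in NumPy. It is not the tutorial's or OCLF's implementation. It assumes identity query/key/value projections and replaces the learned GRU and MLP update of the published Slot Attention module with a plain weighted mean. The names `slot_attention_step`, `inputs`, and `slots` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified slot-attention update (assumption: no learned weights).

    slots:  (K, D) current slot vectors
    inputs: (N, D) encoder features, e.g. one vector per image patch
    Returns the updated slots (K, D) and the attention map (N, K).
    """
    d = slots.shape[1]
    logits = inputs @ slots.T / np.sqrt(d)  # (N, K) dot-product similarity
    # Softmax over the slot axis: slots compete to explain each input feature.
    attn = softmax(logits, axis=1)
    # Normalize over inputs so each slot update is a weighted mean of features.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    new_slots = weights.T @ inputs  # (K, D)
    return new_slots, attn


rng = np.random.default_rng(0)
# Toy features forming two clusters, standing in for two "objects".
inputs = np.concatenate([
    rng.normal(+3.0, 0.1, size=(20, 4)),
    rng.normal(-3.0, 0.1, size=(20, 4)),
])
slots = rng.normal(size=(2, 4))  # randomly initialized slots
for _ in range(3):  # a few refinement iterations
    slots, attn = slot_attention_step(slots, inputs)
```

The softmax is taken over slots rather than over inputs, which is what makes the slots compete for features. In a full model, the encoder features would come from a learned backbone and the slots would be decoded back into a reconstruction target.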
Methods / Models / Datasets Mentioned
On the Binding Problem in Artificial Neural Networks · Object-Centric Learning with Slot Attention · CLEVR · Multi-dSprites · Tetrominoes · Bridging the Gap to Real-World Object-Centric Learning · Object Scene Representation Transformer · Object-Centric Slot Diffusion · Conditional Object-Centric Learning from Video · SAVI · MOVI · MOVI++ · Kubric · Self-Supervised Video Object Segmentation by Motion Grouping · DAVIS2016 · SegTrackV2 · FBMS59 · MoCA · SAVI++ · Waymo Open dataset · DINOSAUR · DINO · MAE · PASCAL VOC 2012 · COCO · FG-ARI · mBO · CorLoc · mIoU · DeepSpectral · TokenCut · MaskContrast · COMUS · SlotCon · STEGO · Adaptive Slot Attention · Adaptive Slot Attention: Object Discovery with Dynamic Slot Number · To Understand Language is to Understand Learning · SLATE/STEVE · VQ-VAE · DALL-E · Stable Diffusion · LSD · FFHQ · DORSAL · MSN · Street View · Learning Open-Vocabulary Semantic Segmentation From Image-Text Supervision · OVSegmenter · ADE20K · Neural Systematic Binder · OCLF · ISTAnt dataset
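Several entries in this list are segmentation metrics rather than models or datasets (FG-ARI, mBO, CorLoc, mIoU). As a hedged illustration of one of them, the sketch below computes mean best overlap (mBO) for a single image: each ground-truth mask is matched to the predicted mask with the highest IoU, and the matched IoUs are averaged. It assumes binary masks of equal shape. The helper names `iou` and `mean_best_overlap` are hypothetical and not taken from any particular benchmark codebase.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0


def mean_best_overlap(gt_masks, pred_masks):
    """Average, over ground-truth masks, of the best IoU with any prediction."""
    best = [max(iou(g, p) for p in pred_masks) for g in gt_masks]
    return float(np.mean(best))


# Toy 4x4 image with two ground-truth objects: left half and right half.
gt = np.zeros((2, 4, 4), dtype=bool)
gt[0, :, :2] = True
gt[1, :, 2:] = True

# Predictions: one exact match, one covering only the rightmost column.
pred = np.zeros((2, 4, 4), dtype=bool)
pred[0, :, :2] = True
pred[1, :, 3:] = True

score = mean_best_overlap(gt, pred)  # best IoUs are 1.0 and 0.5
```

Because each ground-truth mask picks its best prediction independently, one predicted mask can be matched to several objects. This differs from metrics that require a one-to-one assignment.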
Topics
Object-Centric Learning · Structured Visual Representation · Self-Supervised Learning · End-to-End Differentiable Architecture · Real-World Vision Data · RGB Pixel Reconstruction · Optical Flow Reconstruction · Depth Reconstruction · Encoder Upgrading (Vision Transformers) · Decoder Upgrading (3D Inductive Bias, Decoupling) · Slot-Decoding Dilemma · Causal Representation Learning · Disentangled Representation · Identifiability · Distribution Shifts · Scene Controllability · Partial Observability · Differential Equations in Climate · Experimental Ecology