CVPR 2024 Object-Centric Representation For Computer Vision Tutorial

Event: CVPR 2024 Tutorial · Duration: 80 min

Abstract

This tutorial provides a comprehensive overview of object-centric representation learning in computer vision, focusing on bridging the gap between current research and real-world applications. The first part introduces the core motivations for object-centric learning, analyzes the limitations of early models on complex data, and details various strategies for improvement, including reconstructing richer signals beyond RGB pixels, upgrading encoder architectures, and enhancing decoders through 3D inductive biases or decoupling techniques. The second part delves into the theoretical framework of causal representation learning, discussing its role in achieving systematic generalization, addressing challenges in unsupervised disentanglement, and exploring its applications in diverse scientific domains. The tutorial concludes with an introduction to an open-source framework and future outlooks.

Speakers

Tianjun Xiao — Senior Applied Scientist, AWS AI
Francesco Locatello — IST Austria

Talks (2)

00:00:00 — Tianjun Xiao: Bridging the gap to real-world object-centric learning
- This talk introduces the motivation behind object-centric learning, highlights the gap between current models and real-world data, and explores methods to bridge this gap by upgrading learning objectives (reconstructing beyond RGB pixels), encoders, and decoders. It also provides an overview of the Object-Centric Learning Framework (OCLF).
00:28:28 — Francesco Locatello: Causal Representation Learning
- This talk covers the theoretical underpinnings of causal representation learning and how structured representations can aid generalization and robustness. It discusses challenges in unsupervised disentanglement, the concept of identifiability, and applications of causal representation learning in scientific domains such as climate science and experimental ecology.

Key Takeaways

Object-centric learning aims for structured visual representations that enable systematic generalization and causal reasoning, moving beyond pixel-level reconstruction.
Bridging the gap to real-world data requires upgrading learning objectives (e.g., reconstructing optical flow or depth), leveraging powerful encoders (e.g., self-supervised Vision Transformers), and employing sophisticated decoders (e.g., NeRF-based or diffusion models).
Causal representation learning provides a framework for understanding and manipulating the underlying generative processes of data, offering benefits in robustness, generalization, and interpretability.
Identifiability of disentangled representations is a critical theoretical challenge, with recent work exploring conditions under which causal factors can be uniquely recovered, often requiring weak supervision or specific structural assumptions.
The field is moving towards developing modular architectures that facilitate mechanism reuse at the object level, enabling controllable generation, scene editing, and applications in scientific discovery.
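
The takeaways above refer to slot-based models that group encoder features into per-object representations. As a rough illustration of that grouping step, here is a minimal NumPy sketch in the spirit of Slot Attention. It is a simplification under stated assumptions, not the published architecture: the learned query/key/value projections, the GRU update, and the residual MLP are all omitted, and the function name `slot_grouping` and its parameters are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_grouping(features, num_slots=4, num_iters=3, seed=0, eps=1e-8):
    """Simplified slot-style grouping (hypothetical helper, not a library API).

    features: array of shape (N, D), e.g. N patch features from an encoder.
    Returns (slots, attn) with shapes (K, D) and (K, N).
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Assumption: slots start as random Gaussian vectors.
    slots = rng.normal(size=(num_slots, d))
    for _ in range(num_iters):
        # Dot-product similarity between every slot and every input feature.
        logits = slots @ features.T / np.sqrt(d)          # (K, N)
        # Softmax over the slot axis: slots compete for each input feature.
        attn = softmax(logits, axis=0)                    # (K, N)
        # Normalize over inputs so each slot update is a weighted mean.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        # Simplification: replace slots directly instead of using a GRU + MLP.
        slots = weights @ features                        # (K, D)
    return slots, attn


if __name__ == "__main__":
    feats = np.random.default_rng(1).normal(size=(16, 8))
    slots, attn = slot_grouping(feats)
    print(slots.shape, attn.shape)
```

The key design choice is the axis of the softmax: normalizing over slots rather than over inputs means every input feature distributes a total weight of one across the slots, which is what pushes different slots toward different parts of the scene.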

Methods / Models / Datasets Mentioned

On the Binding Problem in Artificial Neural Networks
Object-Centric Learning with Slot Attention
CLEVR
Multi-dSprites
Tetrominoes
Bridging the Gap to Real-World Object-Centric Learning
Object Scene Representation Transformer
Object-Centric Slot Diffusion
Conditional Object-Centric Learning from Video
SAVi
MOVi
MOVi++
Kubric
Self-Supervised Video Object Segmentation by Motion Grouping
DAVIS2016
SegTrackV2
FBMS59
MoCA
SAVi++
Waymo Open dataset
DINOSAUR
DINO
MAE
PASCAL VOC 2012
COCO
FG-ARI
mBO
CorLoc
mIoU
DeepSpectral
TokenCut
MaskContrast
COMUS
SlotCon
STEGO
Adaptive Slot Attention
Adaptive Slot Attention: Object Discovery with Dynamic Slot Number
To Understand Language is to Understand Learning
SLATE/STEVE
VQ-VAE
DALL-E
Stable Diffusion
LSD
FFHQ
DORSAL
MSN
Street View
Learning Open-Vocabulary Semantic Segmentation From Image-Text Supervision
OVSegmenter
ADE20K
Neural Systematic Binder
OCLF
ISTAnt dataset
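
The list above names FG-ARI among the evaluation metrics. For readers unfamiliar with it, the sketch below shows one common way to compute it: the adjusted Rand index between ground-truth and predicted segment ids, restricted to foreground pixels. This is an illustrative sketch, not the evaluation code of any listed method. The helper names and the convention that background pixels carry id 0 in the ground truth are assumptions.

```python
import numpy as np


def _pairs(counts):
    """Number of unordered pairs, n * (n - 1) / 2, elementwise."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts * (counts - 1.0) / 2.0


def adjusted_rand_index(true_ids, pred_ids):
    """Adjusted Rand index between two flat labelings of the same points."""
    true_ids = np.asarray(true_ids).ravel()
    pred_ids = np.asarray(pred_ids).ravel()
    if true_ids.size < 2:
        # Degenerate case: no pairs to compare; treated as perfect agreement.
        return 1.0
    # Map arbitrary ids to 0..K-1 so they can index a contingency table.
    _, t = np.unique(true_ids, return_inverse=True)
    _, p = np.unique(pred_ids, return_inverse=True)
    contingency = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(contingency, (t, p), 1)

    index = _pairs(contingency).sum()
    sum_rows = _pairs(contingency.sum(axis=1)).sum()
    sum_cols = _pairs(contingency.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / _pairs(true_ids.size)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Both labelings are trivial (e.g. one cluster each).
        return 1.0
    return float((index - expected) / (max_index - expected))


def fg_ari(true_mask, pred_mask, background_id=0):
    """ARI over foreground pixels only (assumes background_id marks background)."""
    true_mask = np.asarray(true_mask).ravel()
    pred_mask = np.asarray(pred_mask).ravel()
    fg = true_mask != background_id
    return adjusted_rand_index(true_mask[fg], pred_mask[fg])


if __name__ == "__main__":
    true = np.array([0, 0, 1, 1, 2, 2])
    pred = np.array([5, 5, 3, 3, 7, 7])  # same grouping, different ids
    print(fg_ari(true, pred))
```

Because the score depends only on which pixels are grouped together, predicted slot ids do not need to match ground-truth ids, which suits unsupervised object discovery where slot order is arbitrary.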

Topics

Object-Centric Learning · Structured Visual Representation · Self-Supervised Learning · End-to-End Differentiable Architecture · Real-World Vision Data · RGB Pixel Reconstruction · Optical Flow Reconstruction · Depth Reconstruction · Encoder Upgrading (Vision Transformers) · Decoder Upgrading (3D Inductive Bias, Decoupling) · Slot-Decoding Dilemma · Causal Representation Learning · Disentangled Representation · Identifiability · Distribution Shifts · Scene Controllability · Partial Observability · Differential Equations in Climate · Experimental Ecology

Notes

Open for commentary — connections to other work, critiques, follow-up reading.