CVPR 2024 Object-Centric Representation For Computer Vision Tutorial

Event: CVPR 2024 Tutorial · Duration: 80 min

Abstract

This tutorial gives an overview of object-centric representation learning in computer vision, with a focus on bridging the gap between current research and real-world applications. The first part introduces the core motivations for object-centric learning, analyzes the limitations of early models on complex data, and details strategies for improvement: reconstructing richer signals than RGB pixels, upgrading encoder architectures, and enhancing decoders through 3D inductive biases or decoupling techniques. The second part covers the theoretical framework of causal representation learning, discussing its role in achieving systematic generalization, the challenges of unsupervised disentanglement, and its applications in diverse scientific domains. The tutorial concludes with an introduction to an open-source framework and an outlook on future directions.

Speakers

  • Tianjun Xiao — Senior Applied Scientist, AWS AI
  • Francesco Locatello — IST Austria

Talks (2)

  • 00:00:00 — Tianjun Xiao: Bridging the gap to real-world object-centric learning
    • This talk introduces the motivation behind object-centric learning, highlights the gap between current models and real-world data, and explores methods to bridge this gap by upgrading learning objectives (reconstructing beyond RGB pixels), encoders, and decoders. It also provides an overview of the Object-Centric Learning Framework (OCLF).
  • 00:28:28 — Francesco Locatello: Causal Representation Learning
    • This talk delves into the theoretical underpinnings of causal representation learning, discussing how structured representations can aid in generalization and robustness. It covers challenges in unsupervised disentanglement, the concept of identifiability, and various applications of causal representation learning in diverse scientific domains.

Key Takeaways

  • Object-centric learning aims for structured visual representations that enable systematic generalization and causal reasoning, moving beyond pixel-level reconstruction.
  • Bridging the gap to real-world data requires upgrading learning objectives (e.g., reconstructing optical flow or depth), leveraging powerful encoders (e.g., self-supervised Vision Transformers), and employing sophisticated decoders (e.g., NeRF-based or diffusion models).
  • Causal representation learning provides a framework for understanding and manipulating the underlying generative processes of data, offering benefits in robustness, generalization, and interpretability.
  • Identifiability of disentangled representations is a critical theoretical challenge, with recent work exploring conditions under which causal factors can be uniquely recovered, often requiring weak supervision or specific structural assumptions.
  • The field is moving towards developing modular architectures that facilitate mechanism reuse at the object level, enabling controllable generation, scene editing, and applications in scientific discovery.
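
As a rough illustration of the slot-based grouping behind methods such as Slot Attention (named among the methods in this tutorial), the following is a minimal NumPy sketch of the competitive attention step only. It is a simplification and not the published architecture: it assumes no learned query/key/value projections, and it replaces the recurrent slot update with a plain weighted mean. The function name `slot_attention_sketch` and all parameters are hypothetical, chosen for this example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_sketch(inputs, num_slots=3, num_iters=3, seed=0, eps=1e-8):
    """Simplified slot-style grouping of input features.

    inputs: array of shape (N, D), e.g. N patch features of dimension D.
    Returns (slots, attn) with shapes (num_slots, D) and (num_slots, N).
    Assumption: no learned projections, normalization layers, or GRU update.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))  # randomly initialized slots

    for _ in range(num_iters):
        # Dot-product similarity between every slot and every input feature.
        logits = slots @ inputs.T / np.sqrt(d)  # (K, N)
        # Softmax over the slot axis: slots compete to explain each input.
        attn = softmax(logits, axis=0)  # each column sums to 1
        # Normalize per slot so the update is a weighted mean of the inputs.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ inputs  # (K, D)

    # attn is the assignment computed in the final iteration.
    return slots, attn


if __name__ == "__main__":
    features = np.random.default_rng(1).normal(size=(16, 8))
    slots, attn = slot_attention_sketch(features, num_slots=4)
    print(slots.shape, attn.shape)
```

The key design choice shown here is normalizing the attention over slots rather than over inputs, so that each input feature is softly assigned to the slots that best explain it.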

Methods / Models / Datasets Mentioned

  • On the Binding Problem in Artificial Neural Networks
  • Object-Centric Learning with Slot Attention
  • CLEVR
  • Multi-dSprites
  • Tetrominoes
  • Bridging the Gap to Real-World Object-Centric Learning
  • Object Scene Representation Transformer
  • Object-Centric Slot Diffusion
  • Conditional Object-Centric Learning from Video
  • SAVi
  • MOVi
  • MOVi++
  • Kubric
  • Self-Supervised Video Object Segmentation by Motion Grouping
  • DAVIS2016
  • SegTrackV2
  • FBMS59
  • MoCA
  • SAVi++
  • Waymo Open dataset
  • DINOSAUR
  • DINO
  • MAE
  • PASCAL VOC 2012
  • COCO
  • FG-ARI
  • mBO
  • CorLoc
  • mIoU
  • DeepSpectral
  • TokenCut
  • MaskContrast
  • COMUS
  • SlotCon
  • STEGO
  • Adaptive Slot Attention
  • Adaptive Slot Attention: Object Discovery with Dynamic Slot Number
  • To Understand Language is to Understand Learning
  • SLATE/STEVE
  • VQ-VAE
  • DALL-E
  • Stable Diffusion
  • LSD
  • FFHQ
  • DORSAL
  • MSN
  • Street View
  • Learning Open-Vocabulary Semantic Segmentation From Image-Text Supervision
  • OVSegmenter
  • ADE20K
  • Neural Systematic Binder
  • OCLF
  • ISTAnt dataset

Topics

Object-Centric Learning · Structured Visual Representation · Self-Supervised Learning · End-to-End Differentiable Architecture · Real-World Vision Data · RGB Pixel Reconstruction · Optical Flow Reconstruction · Depth Reconstruction · Encoder Upgrading (Vision Transformers) · Decoder Upgrading (3D Inductive Bias, Decoupling) · Slot-Decoding Dilemma · Causal Representation Learning · Disentangled Representation · Identifiability · Distribution Shifts · Scene Controllability · Partial Observability · Differential Equations in Climate · Experimental Ecology

