Workshop on Scene Graphs and Graph Representation Learning (SG2RL 2024)

Event: SG2RL 2024 · Duration: 210 min

Abstract

The workshop on Scene Graphs and Graph Representation Learning (SG2RL 2024) examines graph structures across domains ranging from molecules and social media to 3D meshes, with a particular focus on computer vision. Presentations cover reconstruction-free 3D scene graph prediction, efficient spatio-temporal trajectory graph modeling, unbiased dynamic scene graph generation, and graph-based video summarization. Invited talks address open-vocabulary 3D perception for robotics, the cognitive-neuroscience-inspired “Tensor Brain” model of perception and memory, and methods for representing structured data for large language models. The workshop also surveys current challenges and future directions for graph-based approaches to understanding and generating complex visual and linguistic data.

Speakers

Azade Farshad — Munich Center for Machine Learning
Ehsan Adeli — Stanford University
Congrui Hetang — Not specified
Ziyi Zhou — Dartmouth College
Anant Khandelwal — Microsoft
Jose M. Rojas — Intel Flex
Dr. Krishna Murthy — MIT
Prof. Volker Tresp — LMU Munich and Munich Center for Machine Learning
Dr. Bryan Perozzi — Google Research

Talks (7)

00:06:39 — Congrui Hetang: Segment Anything Model for Road Network Graph Extraction
- This talk presents a method for extracting vectorized road network graphs from aerial images by leveraging the Segment Anything Model (SAM) for dense segmentation and a transformer-based GNN for topology prediction, achieving state-of-the-art accuracy with significantly faster inference.
01:00:00 — Ziyi Zhou: Efflex: Efficient and Flexible Pipeline for Spatio-Temporal Trajectory Graph Modeling and Representation Learning
- This presentation introduces Efflex, an efficient and flexible pipeline that uses multi-scale graph construction and graph neural networks (GNNs) to model and learn representations from high-volume spatio-temporal trajectory data, offering dual models for speed and accuracy.
01:10:00 — Anant Khandelwal: FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing
- This talk introduces FloCoDe, a method for unbiased dynamic scene graph generation in videos, addressing challenges like class imbalance and dynamic relations through flow-aware temporal consistency, correlation debiasing, and uncertainty-aware weighted loss.
01:34:00 — Jose M. Rojas: VideoSAGE: Video Summarization with Graph Representation Learning
- This presentation introduces VideoSAGE, a graph-based framework for long-form video summarization that constructs multi-modal graphs from video frames and uses a binary node classification approach with a Knapsack algorithm to select important segments.
01:42:00 — Dr. Krishna Murthy: Open-vocabulary 3D perception for robots
- This talk presents a framework for open-vocabulary 3D perception for robots, leveraging off-the-shelf vision and language models to build 3D scene graphs from RGB-D sequences, enabling zero-shot object detection, attribute querying, and affordance prompting for robot planning.
02:00:00 — Prof. Volker Tresp: Perception, Memory and Semantic Decoding
- This talk introduces the “Tensor Brain” as a computational model for perception, memory, and semantic decoding, proposing that episodic and semantic memory rely on the same brainware and interact through a “triple language” of concepts and embeddings.
02:15:00 — Dr. Bryan Perozzi: Giving a Voice to Your Graph: Representing Structured Data for LLMs
- This talk explores methods for representing structured graph data for large language models (LLMs) to overcome limitations of current GenAI models, proposing “Graphs as Text” and “GraphToken” approaches to encode graph structures for improved efficiency and performance in graph-centric tasks.
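
The “Graphs as Text” idea from the last talk can be illustrated with a small sketch. The code below is only an assumption about what such an encoding might look like: it renders an edge list as plain sentences and prepends them to a question. The function names `encode_graph_as_text` and `build_prompt`, and the sentence template, are hypothetical and are not taken from the talk. GraphToken, which is described as a separate approach, is not shown here.

```python
def encode_graph_as_text(edges):
    """Render an undirected edge list as plain-English sentences.

    The wording of the sentences is an illustrative choice, not the
    template used in the talk.
    """
    # Collect every node that appears in at least one edge.
    nodes = sorted({node for edge in edges for node in edge})
    lines = [
        "In an undirected graph, the nodes are: "
        + ", ".join(str(n) for n in nodes)
        + "."
    ]
    # One sentence per edge, in the order given.
    for u, v in edges:
        lines.append(f"{u} is connected to {v}.")
    return "\n".join(lines)


def build_prompt(edges, question):
    """Combine the textual graph description with a question for an LLM."""
    return encode_graph_as_text(edges) + "\nQuestion: " + question + "\nAnswer:"


# Hypothetical toy graph: a triangle (0, 1, 2) with one extra node attached.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
prompt = build_prompt(edges, "How many neighbors does node 2 have?")
print(prompt)
```

The resulting string could be sent to any text-only model. Different phrasings of the same graph are possible, and which one works best is an empirical question this sketch does not address.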

Key Takeaways

Graph structures are fundamental for representing complex relationships in various domains, including computer vision, social media, and molecular biology.
Scene graphs are evolving from static 2D representations to dynamic, 3D, and temporal models, enabling advanced applications in video understanding, robotics, and medical imaging.
Integrating graph structures with large language models (LLMs) offers promising avenues to address current GenAI limitations such as hallucinations, staleness, and efficiency.
New methods are being developed to overcome challenges in scene graph generation, including handling class imbalance, ensuring temporal consistency, and enabling open-vocabulary perception in 3D environments.
The “Tensor Brain” model provides a theoretical framework for understanding how perception, episodic memory, and semantic memory interact and contribute to human intelligence, suggesting a unified brainware for these cognitive functions.

Methods / Models / Datasets Mentioned

Segment Anything Model (SAM)
Graph Neural Networks (GNNs)
Visual Genome
DeepWalk
Graph Convolutional Networks (GCN)
GraVi-T framework
Knapsack algorithm
CLIP-Fields
LERF (Language Embedded Radiance Fields)
OpenScene
Tensor Brain
GraphToken
PaLM (Pathways Language Model)
Gemini
GPT-3.5-turbo
GPT-4
Claude Haiku
Claude Opus
ResNet
ViT (Vision Transformer)
SwinV2-T
PointNet
PointNet++
MiDaS (Monocular Depth Estimation)
Dirichlet Fusion
Transformer
BERT
MPNN (Message Passing Neural Network)
GIN (Graph Isomorphism Network)
DCNN (Deep Convolutional Neural Network)
BFS (Breadth-First Search)
NMS (Non-Maximum Suppression)
DETR (Detection Transformer)
RNGDet++
DeepRoadMapper
SGbench (Scene Graph Benchmark)
Open3DSG
EgoSG

Topics

Scene Graphs · Graph Representation Learning · Dynamic Scene Graphs · Temporal Consistency · Open-Vocabulary Perception · 3D Scene Graphs · Video Summarization · Multimodal Models · Transformer Reasoning · Cognitive Neuroscience of Memory

Notes

Open for commentary — connections to other work, critiques, follow-up reading.