Concept Learning Across Domains and Modalities

Event: CVPR 2025 Workshop Session · Duration: 45 min

Abstract

The presentation covers concept learning and argues that it has become more relevant with the advent of large language models. It introduces the Neuro-Symbolic Concept Learner (NS-CL), a framework that jointly learns concepts and semantic parsing, and demonstrates its data efficiency and combinatorial generalization. The talk then extends this idea to other domains, including dynamic scenes, 3D environments, human motion understanding, and robotic manipulation. Finally, it proposes Logic-Enhanced Foundation models (LEFT), a unified framework that combines foundation models with differentiable first-order logic to achieve domain-independent reasoning and strong generalization across modalities and tasks.

Speakers

Jiajun Wu — Stanford University
Anoosha Cherian
Suraj Lohit
Kevin Smith

Talks (1)

00:00:00 — Jiajun Wu: Concept Learning Across Domains and Modalities
- This talk explores the development of neuro-symbolic concept learners and Logic-Enhanced Foundation models (LEFT) to enable robust concept learning and reasoning across diverse visual and linguistic domains, emphasizing data efficiency, generalization, and interpretability.

Key Takeaways

Neuro-symbolic approaches, which combine neural networks with symbolic reasoning, offer superior data efficiency, interpretability, and combinatorial generalization compared to purely end-to-end models.
The LEFT framework provides a unified and flexible solution for concept learning and reasoning across diverse domains (2D, 3D, temporal, robotics) by integrating foundation models with differentiable first-order logic.
LEFT demonstrates strong zero-shot and compositional generalization to novel tasks and is highly data-efficient, outperforming prior methods on several complex reasoning benchmarks.
The framework’s modular design allows for the integration of both differentiable neural network modules and non-differentiable off-the-shelf tools, offering adaptability to different domain-specific grounding requirements.
The ability to learn domain-independent logical forms from natural language queries, combined with domain-specific grounding, is crucial for building robust and generalizable AI systems.
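
The split between a domain-independent logical form and a domain-specific grounding can be illustrated with a toy sketch. This is not LEFT's implementation: the tuple-based logical form, the `execute` function, the product/max/min soft operators, and the hard-coded concept scores are all assumptions made for illustration. In a real system the scores would come from learned neural modules, and the operators would run inside an autograd framework so that gradients can flow through the logic.

```python
from typing import Dict, List, Union

# A grounding maps each concept name to one soft truth value in [0, 1] per entity.
# Hypothetical scores; in practice a domain-specific neural module would produce them.
Grounding = Dict[str, List[float]]
Value = Union[float, List[float]]

GROUNDING_2D: Grounding = {  # e.g. three detected objects in an image
    "red": [0.9, 0.1, 0.8],
    "cube": [0.2, 0.95, 0.9],
}
GROUNDING_3D: Grounding = {  # e.g. two segments of a point cloud
    "red": [0.05, 0.7],
    "cube": [0.9, 0.6],
}

# "Is there a red cube?" as a first-order logic form: exists x. red(x) and cube(x).
# The form mentions only concept names, never a modality.
QUERY = ("exists", ("and", ("concept", "red"), ("concept", "cube")))


def execute(expr: tuple, grounding: Grounding) -> Value:
    """Evaluate a logical form with soft (fuzzy) operators.

    Assumed operator choices: product for AND, 1 - x for NOT,
    max for EXISTS, min for FORALL.
    """
    op = expr[0]
    if op == "concept":
        return grounding[expr[1]]
    if op == "and":
        left = execute(expr[1], grounding)
        right = execute(expr[2], grounding)
        return [a * b for a, b in zip(left, right)]
    if op == "not":
        return [1.0 - a for a in execute(expr[1], grounding)]
    if op == "exists":
        return max(execute(expr[1], grounding))
    if op == "forall":
        return min(execute(expr[1], grounding))
    raise ValueError(f"unknown operator: {op}")


if __name__ == "__main__":
    # The same logical form is executed against two different groundings.
    print("2D scene:", round(execute(QUERY, GROUNDING_2D), 2))
    print("3D scene:", round(execute(QUERY, GROUNDING_3D), 2))
```

The same `QUERY` runs unchanged against both groundings; only the source of the concept scores differs. A non-differentiable off-the-shelf tool could fill the same role by returning hard 0/1 scores for a concept.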

Methods / Models / Datasets Mentioned

CLEVR
NS-VQA
MAC
IEP
FiLM
SAN
ViperGPT
Visual Programming
OpenAI Codex
NS-CL
CLEVRER
NS-3D
BABEL-QA Dataset
NS-Pose
ProgramPort
BUTD-DETR
MVT
SAT
TransRefer
LEFT (Logic-Enhanced Foundation models)
Faster R-CNN
PointNet++
2s-AGCN
Dense CLIP
Flamingo
MotionCLIP

Topics

Concept Learning · Neuro-Symbolic AI · Multimodal Learning · Visual Question Answering (VQA) · Scene Understanding · Data Efficiency · Combinatorial Generalization · 3D Vision · Dynamic Scenes · Human Motion Understanding · Robotic Manipulation · First-Order Logic · Foundation Models · Differentiable Reasoning

Notes

Open for commentary — connections to other work, critiques, follow-up reading.