From Multimodal LLM to Human-level AI

Event: CVPR 2024 · Duration: 278 min

Abstract

This tutorial covers Multimodal Large Language Models (MLLMs) and their path towards human-level AI. The opening segment gives an overview of the tutorial’s structure, the contributing team, and the foundational concepts of LLMs and their rapid evolution. It then turns to the architectural designs of MLLMs, exploring different approaches to multimodal encoding, input-side and decoding-side projection, and generation. It closes with a survey of MLLMs organized by supported modalities and functionalities, including unified and fine-grained capabilities, and outlines future research directions.

The next segment addresses the challenges and strategies for building multimodal models that follow human intent, with a focus on efficiency. The speaker proposes framing multimodal learning as a translation problem, where visual information is ‘translated’ into a language format that large language models (LLMs) can understand. The discussion covers pretraining and instruction-tuning data generation techniques, highlights the importance of diverse, high-quality datasets, and addresses the difficulty of handling complex visual instructions and ensuring proper alignment between modalities.

A further segment covers multimodal hallucinations, defined as generated text responses that do not align with the visual content, and categorizes them into object, attribute, and relation hallucinations. It explores the causes, including noisy data, lack of data diversity, and limitations of vision and language models, and discusses mitigation techniques. The segment then turns to multimodal reasoning: its basics, its evolution from task-specific to centralized paradigms, and the development of multimodal Chain-of-Thought reasoning. Finally, it introduces multimodal LLM agents, their architecture, their applications, and the challenges in building general, autonomous, and safe agents.

The closing segment is a panel discussion following the tutorial. In a Q&A session, the speakers discuss critical aspects of MLLM development, including the definition of AGI, future architectural trends, the role of data quality and composition, and effective training strategies. Key topics include the path to human-level AI, the importance of multimodal generalist models, and the challenges of evaluating and scaling MLLMs.
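
The translation framing described in the abstract can be illustrated with a minimal sketch: features from a vision encoder are projected into the LLM's embedding space and placed in the same sequence as the text token embeddings. Everything here is an assumption made for illustration, not the tutorial's implementation: the dimensions, the random weights standing in for a trained projector, the choice of a single linear map, and the helper names `make_linear`, `project`, and `build_llm_input`.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Random weight matrix standing in for a trained projection layer (hypothetical)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(features, weight):
    """Map each visual feature vector into the LLM embedding space (matrix-vector product)."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight] for feat in features]

def build_llm_input(visual_tokens, text_embeddings):
    """Prepend the projected visual tokens to the text token embeddings."""
    return visual_tokens + text_embeddings

VISION_DIM, LLM_DIM = 8, 16  # hypothetical sizes, far smaller than real models

# Four image patches from a (pretend) vision encoder, three (pretend) text tokens.
patch_features = [[float(i + j) for j in range(VISION_DIM)] for i in range(4)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

weight = make_linear(VISION_DIM, LLM_DIM)
visual_tokens = project(patch_features, weight)
sequence = build_llm_input(visual_tokens, text_embeddings)

print(len(sequence), len(sequence[0]))  # 7 tokens, each of width 16
```

After projection the language model sees one sequence of same-width vectors, which is what lets it treat the image as if it were additional "words".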

Speakers

Hao Fei — National University of Singapore
Yuan Yao — National University of Singapore
Haotian Liu — University of Wisconsin-Madison
Yuan-Hong Liao — MIT
Fuxiao Liu — University of Maryland, College Park
Zhuosheng Zhang — Shanghai Jiao Tong University
Ao Zhang — National University of Singapore
Hanwang Zhang — Nanyang Technological University
Shuicheng Yan — Kunlun 2050 Research, Skywork AI

Talks (9)

00:02:38 — Hao Fei: Background and Introduction: From MLLM to Human-level AI
- Introduces the tutorial, its scope, the team, and provides an overview of LLMs and MLLMs, their evolution, and the tutorial’s goals.
00:10:30 — Yuan Yao: MLLM Design: Architecture
- Discusses the basic architecture of MLLMs, including multimodal encoding, input/decoding-side projection, and multimodal generation, highlighting discrete vs. joint system approaches.
00:38:10 — Hao Fei: Modality and Functionality
- Surveys existing MLLMs based on their supported modalities (image, video, audio, 3D) and functionalities (perceiving, generating, unified, fine-grained), and discusses future directions.
01:01:00 — Haotian Liu: Why Multimodal Instruction Tuning?
- Explains the motivation for multimodal instruction tuning, contrasting it with traditional task-specific models and multitask learning, and introduces the MLLM instruction tuning framework and training paradigms.
01:09:29 — Yuan-Hong Liao: How to Create Multimodal Models that Follow Human Intent Efficiently
- Discusses strategies for building efficient multimodal models that follow human intent, focusing on data generation techniques and on framing multimodal learning as a translation problem.
02:57:59 — Fuxiao Liu: Multimodal Hallucinations
- Defines multimodal hallucinations, categorizes their types, discusses causes related to training data and model architecture, and explores mitigation strategies including negative data, counterfactual data, noise reduction, resolution scaling, vision encoders, RLHF-V, and post-hoc correction.
03:28:28 — Ao Zhang: Efficient MLLM
- Covers efficient MLLMs: architectural innovations such as lightweight compression and image slicing, the importance of high-quality data and effective training strategies, and acceleration techniques for reducing computational cost.
03:28:59 — Zhuosheng Zhang: Multimodal Reasoning
- Introduces multimodal reasoning, its evolution from single-step to multi-step reasoning, and the paradigm shift towards multimodal LLM agents; covers model architectures, in-context learning, chain-of-thought reasoning, and challenges in building robust and safe agents.
03:33:08 — Multiple: Panel Discussion: Path to Multimodal Generalist
- A panel discussion covering the definition of AGI, future MLLM architectures, data quality, training strategies, evaluation, and the challenges and opportunities in data collection and model development.
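
The image slicing mentioned in the Efficient MLLM talk can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any specific model in the tutorial: the 336-pixel encoder resolution, the cap of 9 slices, and the rule "pick the grid whose aspect ratio is closest to the image's" are all hypothetical choices, as are the function names `choose_grid` and `slice_boxes`.

```python
import math

def choose_grid(width, height, encoder_res=336, max_slices=9):
    """Pick a (cols, rows) grid so each slice is roughly encoder-sized."""
    # Ideal slice count: image area divided by encoder input area, rounded up.
    ideal = -(-(width * height) // (encoder_res ** 2))
    n = max(1, min(ideal, max_slices))
    target = math.log(width / height)
    best = None
    for cols in range(1, n + 1):
        if n % cols:
            continue  # only consider exact factorizations of n
        rows = n // cols
        # Compare grid and image aspect ratios on a log scale.
        err = abs(math.log(cols / rows) - target)
        if best is None or err < best[0]:
            best = (err, cols, rows)
    return best[1], best[2]

def slice_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) pixel boxes in row-major order."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return boxes

cols, rows = choose_grid(1344, 672)
boxes = slice_boxes(1344, 672, cols, rows)
print(cols, rows, len(boxes))  # 4 2 8
print(boxes[0], boxes[-1])     # (0, 0, 336, 336) (1008, 336, 1344, 672)
```

Each slice is then encoded separately, and a lightweight compression layer can reduce the number of visual tokens per slice before they reach the language model, which keeps the cost of high-resolution input manageable.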

Key Takeaways

The tutorial aims to provide a comprehensive understanding of MLLMs, covering their architecture, modalities, instruction tuning, reasoning, and efficiency, with a focus on achieving human-level AI.
MLLMs are evolving rapidly, moving beyond language-only models to integrate diverse modalities like vision, audio, and 3D data, enabling more complex real-world applications.
Current MLLM architectures primarily leverage strong language models as a central processing unit, extending their capabilities to multimodal information through various encoding and decoding mechanisms.
Future directions for MLLMs include broadening the range of supported modalities, deepening task-level understanding, enhancing generation abilities through better tokenization, and fostering greater multimodality and multi-task synergy to achieve true AGI.
Framing multimodal learning as a translation problem allows leveraging LLMs’ existing language understanding capabilities by ‘translating’ visual information into a language format.
Efficiently building multimodal models requires a multi-stage approach: pretraining for modality alignment and instruction tuning for intent following, using both coarse-grained and fine-grained paired data.
Diverse and high-quality instruction-following data, often generated through methods like self-instruction with advanced LLMs, is crucial for improving model performance and generalization across various tasks.
Careful prompt formatting is essential to guide multimodal models to provide desired output lengths and avoid ambiguous responses, especially when dealing with short-answer VQA datasets.
Multimodal hallucinations are a significant challenge in MLLMs, categorized into object, attribute, and relation errors, often stemming from noisy training data or model limitations.
Mitigation strategies for hallucinations include improving data quality (negative/counterfactual data, noise reduction), enhancing model capabilities (resolution, vision encoders), training-time methods such as RLHF-V, and post-hoc correction of generated outputs.
Multimodal reasoning has evolved from task-specific models to unified MLLMs capable of in-context learning and multi-step Chain-of-Thought reasoning, offering improved interpretability and controllability.
The future of MLLMs involves developing autonomous and communicative agents that can interact with physical and virtual environments, utilize tools, and collaborate, requiring robust multimodal perception, long-context modeling, and effective workflow management.
The path to AGI likely involves multimodal generalist models that can integrate various modalities beyond just language, leveraging diverse data types and robust training strategies.
Efficient MLLM architectures prioritize high-resolution visual encoding with lightweight compression layers and image slicing techniques to manage computational costs.
High-quality, human-annotated, and GPT-generated data, especially for long-text, VQA, and OCR tasks, significantly boosts MLLM capabilities and efficiency.
Future MLLM development will focus on smaller, more efficient models that achieve strong performance with fewer parameters, driven by advancements in training techniques like transfer learning and tool integration.
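
The prompt-formatting takeaway above can be sketched in a few lines. The idea is to append an explicit length instruction only to questions drawn from short-answer VQA datasets, so the model does not learn to answer every question tersely. The suffix wording follows the prompt reported for LLaVA-1.5; treat the exact string, the newline separator, and the helper name `format_vqa_prompt` as assumptions of this sketch.

```python
# Assumed suffix wording (as reported for LLaVA-1.5); adjust to the model at hand.
SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."

def format_vqa_prompt(question, short_answer):
    """Append a length instruction only when a short answer is expected."""
    question = question.strip()
    if short_answer:
        return f"{question}\n{SHORT_ANSWER_SUFFIX}"
    return question

# Short-answer VQA sample: the desired output length is made explicit.
print(format_vqa_prompt("What color is the bus?", short_answer=True))

# Open-ended conversation sample: the question is left unchanged.
print(format_vqa_prompt("Describe the image in detail.", short_answer=False))
```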

Methods / Models / Datasets Mentioned

3D-GPT
3D-LLM
ADE20K
ADEPT Action Transformer
AI2
ALLaVA
AMBER
Adept
AgentGPT
AlexaTM
All-Seeing V2
AlpacaCare
AlphaCode
AnyGPT
Aquila2
AudioCLIP
AudioGPT
Auto-GUI
Auto-UI
AutoGPT
BEATs
BERT
BLIP-2
BLOOM
BLOOMZ
BabyAGI
Baichuan2
BenTsao
BianQue
BioGPT
BioMedGPT
BioMedLM
Bunny
CAMEL
CLIP
CLIP ViT-L/14
CLIP-L-112x
CLIP-L-224x
CLIP-L-336x
COCO
CPLLM
Chain-of-Action
ChatDev
ChatDoctor
ChemCrow
ChemGPT
ChemLLM
Chinchilla
ClinicalGPT
Clotho-Detail
CoDi-2
CodeGeeX
CodeGen
Codex
CogAgent
CogVLM-Chat
Cohere
DDPO (Dense Direct Preference Optimization)
DeepSeek-VL
DeepSeek-VL 1.3B
DeepSpeed
DocOdia
DoctorGLM
DreamLLM
DrugGPT
EMU
EOS
Ernie 3.0 Titan
FLAN
FLAN-Alpaca
FLAN-Alpaca base
FLAN-Alpaca large
FLAN-Alpaca small
FLAN-T5large
FLM
FSDP
Falcon
Flamingo
Flan-T5
Flickr-30K
Fuyu
GILL
GIMLET
GLaM
GLM-4V
GPT-1
GPT-2
GPT-3
GPT-4
GPT-4V
GPT-NeoX-20B
GPT4Graph
GQA
GShard
Galactica
GatorTron
GatorTronGPT
Gemini Pro
Generative Agents
GeoGPT
Google AITW
Gopher
GranD
GraphGPT
Grok-1
HACL
HallE-Switch
HalluciDoctor
HallusionBench
HiGPT
HuBERT
HuatuoGPT
HuggingGPT
HyperCLOVA
IVE
Idefics2
ImageBind
ImageNet
Imp V1/V2
InstructBLIP
InstructGPT
InternLM
InternLM-XComposer
InternLM-XComposer2-4KHD
Kosmos-2
LAMM
LL3DA
LLaGA
LLaMA
LLaMA-VID
LLaMA2
LLaVA
LLaVA-1.5
LLaVA-HR
LLaVA-Instruct-158k
LLaVA-NeXT
LLaVA-Phi
LLaVA-Plus
LLaVA-UHD
LRV
LRV-Instruction
LVIS-Instruct4V
LaMDA
LaVIT
LanguageBind
Luminous
MACAW-LLM
MEDITRON
MIMIC-IT
MM-REACT
MM1
MMC-Instruction
MME Benchmark
MPT
MSR-VTT
Med-PaLM
MedAlpaca
Med-PaLM 2
MetaGPT
Mini-Gemini
Mini-Gemini 2B
MiniCPM-Llama3-V 2.5
MiniCPM-V
MiniCPM-V 1.0/2.0/2.5
MiniCPM-V 2.5
MiniGPT-5
MiniGPT-v2
MiniGPT-4
MobileVLM V1/V2
ModaVerse
MolCA
MolSTM
MolXPT
Momentor
Monkey
MovieChat
MultiModal-GPT
NExT-Chat
NLLB
OCR
OPT
OPT-IML
OphGLM
OtterHD
PLUG
PMC-LLaMA
POPE
PSG
PaLM
PaliGemma
PanGu-α
PandaGPT
Phi-3-vision
Pix2emb
Pix2seq
Plan-PalM
Point-BERT
Point-Bind
PointLLM
Pythia
QFormer
Qilin-Med
Qwen
Qwen-VL
Qwen-VL-Chat
RLHF-V
RWKV
ReCaption
RefCOCO series
Robotics@Google
SALMONN
SAM
SEED-LLaMA
SPHINX
SPT
Self-Instruct
ShareGPT4V
SigLIP
Sparrow
SpatialVLM
StableLLaVA
StructGPT
T0
T5
T5 Chem
TextVQA
Tk-Instruct
UL2
Ultra-Chat
Unified-IO 2
VCR
VIGC-LLaVA
VL-Vicuna
VPGTrans
VQA
VQA-v2
Vanilla-T5large
VCoder
ViT-Lens
Vicuna
Video-ChatGPT
Video-LLaMA
Video-LaVIT
VideoChat
VideoPoet
ViperGPT
VisCPM
VizWiz
Visual ChatGPT
VisualGLM
Vitron
Volcano
Voyager
WavCaps
WebArena
WebGPT
Whisper
Windows Copilot
Woodpecker
X-LLM
X-Tuner-Llama
YaLM
Yi-VL
Zhongjing
mPLUG-DocOwl 1.5
mPLUG-Owl
mT5
mT0

Topics

Agent Systems · Artificial General Intelligence (AGI) · Chain-of-Thought Reasoning · Consciousness and AI · Cross-modal alignment · Data Quality and Composition · Data generation · Efficiency in model training · Evaluation Benchmarks · Fine-grained Capabilities · Future of AI · Hallucination Mitigation · Human-level AI · In-Context Learning · Instruction Tuning · Language models (LLMs) · MLLM Architecture · Multilingual MLLMs · Multimodal Encoding · Multimodal Hallucinations · Multimodal LLM Agents · Multimodal LLMs · Multimodal Large Language Models (MLLMs) · Multimodal Reasoning · Multimodal models · Multimodality · Multitask Learning · Reasoning in multimodal models · Training Strategies · Vision-Language Models · Visual question answering (VQA)

Notes

Open for commentary — connections to other work, critiques, follow-up reading.