From Multimodal LLM to Human-level AI
Event: CVPR 2024 · Duration: 278 min
Abstract
The tutorial opens with an overview of Multimodal Large Language Models (MLLMs) and their path towards human-level AI: the tutorial’s structure, the contributing team, and the foundational concepts of LLMs and their rapid evolution. It then covers the architectural designs of MLLMs, exploring different approaches to multimodal encoding, input-side and decoding-side projection, and generation. A detailed survey follows, organizing MLLMs by supported modalities and functionalities, including unified and fine-grained capabilities, and outlining future research directions.
The next segment addresses the challenges and strategies for creating multimodal models that follow human intent, with a focus on efficiency. The speaker proposes framing multimodal learning as a translation problem, in which visual information is ‘translated’ into a language format that large language models (LLMs) can understand. The discussion covers pretraining and instruction-tuning data generation techniques, highlights the importance of diverse, high-quality datasets, and addresses the difficulty of handling complex visual instructions and ensuring proper alignment between modalities.
The following segment turns to multimodal hallucinations, defined as generated text responses that do not align with the visual content and categorized into object, attribution, and relation hallucinations. It explores the causes, including noisy data, lack of data diversity, and limitations of vision and language models, and discusses mitigation techniques. It then moves to multimodal reasoning, outlining its basics, its evolution from task-specific to centralized paradigms, and the development of multimodal Chain-of-Thought reasoning. Finally, it introduces multimodal LLM agents, their architecture, their applications, and the challenges in building general, autonomous, and safe agents.
The tutorial closes with a panel discussion. Speakers take questions on critical aspects of MLLM development, including the definition of AGI, future architectural trends, the role of data quality and composition, and effective training strategies. Key topics are the path to human-level AI, the importance of multimodal generalist models, and the challenges of evaluating and scaling MLLMs.
Speakers
- Hao Fei — National University of Singapore
- Yuan Yao — National University of Singapore
- Haotian Liu — University of Wisconsin-Madison
- Yuan-Hong Liao — MIT
- Fuxiao Liu — University of Maryland, College Park
- Zhuosheng Zhang — Shanghai Jiao Tong University
- Ao Zhang — National University of Singapore
- Hanwang Zhang — Nanyang Technological University
- Shuicheng Yan — Kunlun 2050 Research, Skywork AI
Talks (9)
- 00:02:38 — Hao Fei: Background and Introduction: From MLLM to Human-level AI
- Introduces the tutorial, its scope, the team, and provides an overview of LLMs and MLLMs, their evolution, and the tutorial’s goals.
- 00:10:30 — Yuan Yao: MLLM Design: Architecture
- Discusses the basic architecture of MLLMs, including multimodal encoding, input/decoding-side projection, and multimodal generation, highlighting discrete vs. joint system approaches.
- 00:38:10 — Hao Fei: Modality and Functionality
- Surveys existing MLLMs based on their supported modalities (image, video, audio, 3D) and functionalities (perceiving, generating, unified, fine-grained), and discusses future directions.
- 01:01:00 — Haotian Liu: Why Multimodal Instruction Tuning?
- Explains the motivation for multimodal instruction tuning, contrasting it with traditional task-specific models and multitask learning, and introduces the MLLM instruction tuning framework and training paradigms.
- 01:09:29 — Yuan-Hong Liao: How to Create Multimodal Models that Follow Human Intent Efficiently
- Discusses strategies for building efficient multimodal models that follow human intent, focusing on data generation techniques and on framing multimodal learning as a translation problem.
- 02:57:59 — Fuxiao Liu: Multimodal Hallucinations
- Defines multimodal hallucinations, categorizes their types, discusses causes related to training data and model architecture, and explores mitigation strategies including negative data, counterfactual data, noise reduction, resolution scaling, vision encoders, RLHF-V, and post-hoc correction.
- 03:28:28 — Ao Zhang: Efficient MLLM
- Covers efficient multimodal large language models (MLLMs): architectural innovations such as lightweight compression and image slicing, the importance of high-quality data and effective training strategies, and acceleration techniques for reducing computational costs.
- 03:28:59 — Zhuosheng Zhang: Multimodal Reasoning
- Introduces multimodal reasoning, its evolution from single-step to multi-step reasoning, and the paradigm shift towards multimodal LLM agents. Covers model architectures, in-context learning, chain-of-thought reasoning, and challenges in building robust and safe agents.
- 03:33:08 — Multiple: Panel Discussion: Path to Multimodal Generalist
- A panel discussion covering the definition of AGI, future MLLM architectures, data quality, training strategies, evaluation, and the challenges and opportunities in data collection and model development.
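The object-level hallucinations covered in the Multimodal Hallucinations talk can be made concrete with a small check. The Python sketch below is a simplified illustration and not a method taken from the tutorial: it assumes the objects mentioned in a response and the objects actually visible in the image have already been extracted as lists of strings, and it reports the fraction of mentioned objects that are absent from the image. The function name, its inputs, and the example data are all hypothetical.

```python
def object_hallucination_rate(mentioned, present):
    """Fraction of mentioned objects that do not appear in the image.

    mentioned: objects named in the model's response (assumed pre-extracted).
    present:   objects annotated as visible in the image (assumed ground truth).
    """
    # Normalize case and whitespace so "Dog " and "dog" count as the same object.
    mentioned = {m.strip().lower() for m in mentioned}
    present = {p.strip().lower() for p in present}
    if not mentioned:
        return 0.0  # nothing was claimed, so nothing was hallucinated
    hallucinated = mentioned - present
    return len(hallucinated) / len(mentioned)


# Hypothetical example: the response mentions a "frisbee" that is not in the image.
rate = object_hallucination_rate(
    mentioned=["dog", "frisbee", "grass", "tree"],
    present=["dog", "grass", "tree", "bench"],
)
print(rate)
```

A set-difference check like this only covers object hallucinations; attribution and relation errors would need richer annotations than a flat object list.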
Key Takeaways
- The tutorial aims to provide a comprehensive understanding of MLLMs, covering their architecture, modalities, instruction tuning, reasoning, and efficiency, with a focus on achieving human-level AI.
- MLLMs are evolving rapidly, moving beyond language-only models to integrate diverse modalities like vision, audio, and 3D data, enabling more complex real-world applications.
- Current MLLM architectures primarily leverage strong language models as a central processing unit, extending their capabilities to multimodal information through various encoding and decoding mechanisms.
- Future directions for MLLMs include broadening the range of supported modalities, deepening task-level understanding, enhancing generation abilities through better tokenization, and fostering greater multimodality and multi-task synergy to achieve true AGI.
- Framing multimodal learning as a translation problem allows leveraging LLMs’ existing language understanding capabilities by ‘translating’ visual information into a language format.
- Efficiently building multimodal models requires a multi-stage approach: pretraining for modality alignment and instruction tuning for intent following, using both coarse-grained and fine-grained paired data.
- Diverse and high-quality instruction-following data, often generated through methods like self-instruction with advanced LLMs, is crucial for improving model performance and generalization across various tasks.
- Careful prompt formatting is essential to guide multimodal models to provide desired output lengths and avoid ambiguous responses, especially when dealing with short-answer VQA datasets.
- Multimodal hallucinations are a significant challenge in MLLMs, categorized by object, attribution, and relation errors, often stemming from noisy training data or model limitations.
- Mitigation strategies for hallucinations include improving data quality (negative and counterfactual data, noise reduction), enhancing model capabilities (higher input resolution, stronger vision encoders), training-time alignment such as RLHF-V, and post-hoc correction of generated outputs.
- Multimodal reasoning has evolved from task-specific models to unified MLLMs capable of in-context learning and multi-step Chain-of-Thought reasoning, offering improved interpretability and controllability.
- The future of MLLMs involves developing autonomous and communicative agents that can interact with physical and virtual environments, utilize tools, and collaborate, requiring robust multimodal perception, long-context modeling, and effective workflow management.
- The path to AGI likely involves multimodal generalist models that can integrate various modalities beyond just language, leveraging diverse data types and robust training strategies.
- Efficient MLLM architectures prioritize high-resolution visual encoding with lightweight compression layers and image slicing techniques to manage computational costs.
- High-quality, human-annotated, and GPT-generated data, especially for long-text, VQA, and OCR tasks, significantly boosts MLLM capabilities and efficiency.
- Future MLLM development will focus on smaller, more efficient models that achieve strong performance with fewer parameters, driven by advancements in training techniques like transfer learning and tool integration.
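The image slicing mentioned in the takeaways above can be sketched as a simple grid computation. This is a minimal illustration under stated assumptions, not the scheme used by any specific model in the tutorial: the function name `slice_boxes` is hypothetical, the default tile size of 336 pixels is an assumed encoder input resolution, and the grid is chosen purely by ceiling division. Slicing schemes used in practice may pick the grid differently.

```python
import math


def slice_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) crop boxes that cover the image.

    The image is split into ceil(width / tile) columns and
    ceil(height / tile) rows, and the cells are sized evenly so that
    they tile the image without overlap. For an integer `tile`, no
    slice is larger than `tile` pixels on either side.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    boxes = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical 1000x600 image: 3 columns x 2 rows of slices.
boxes = slice_boxes(1000, 600, tile=336)
print(len(boxes))
print(boxes[0], boxes[-1])
```

Each box can then be cropped and encoded separately, with a lightweight compression layer reducing the number of visual tokens per slice before they reach the language model.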
Methods / Models / Datasets Mentioned
3D-GPT · 3D-LLM · ADE20K · ADEPT Action Transformer · AI2 · ALLaVA · AMBER · Adept · AgentGPT · AlexaTM · All-Seeing V2 · AlpacaCare · AlphaCode · AnyGPT · Aquila2 · AudioCLIP · AudioGPT · Auto-GUI · Auto-UI · AutoGPT · BEATs · BERT · BLIP-2 · BLOOM · BLOOMZ · BabyAGI · Baichuan2 · BenTsao · BianQue · BioGPT · BioMedGPT · BioMedLM · Bunny · CAMEL · CLIP · CLIP ViT-L/14 · CLIP-L-112x · CLIP-L-224x · CLIP-L-336x · COCO · CPLLM · Chain-of-Action · ChatDev · ChatDoctor · ChemCrow · ChemGPT · ChemLLM · Chinchilla · ClinicalGPT · Clotho-Detail · CoDi-2 · CodeGeeX · CodeGen · Codex · CogAgent · CogVLM-Chat · Cohere · DPO (Dense Direct Preference Optimization) · DeepSeek-VL · DeepSeek-VL 1.3B · DeepSpeed · DocOdia · DoctorGLM · DreamLLM · DrugGPT · EMU · EOS · Ernie 3.0 Titan · FLAN · FLAN-Alpaca · FLAN-Alpaca base · FLAN-Alpaca large · FLAN-Alpaca small · FLAN-T5large · FLM · FSDP · Falcon · Flamingo · Flan-T5 · Flickr-30K · Fuyu · GILL · GIMLET · GLAM · GLM-4V · GPT-1 · GPT-2 · GPT-3 · GPT-4 · GPT-4V · GPT-NeoX-20B · GPT4Graph · GQA · GShard · Galactica · GatorTron · GatorTronGPT · Gemini Pro · Generative Agents · GeoGPT · Google AITW · Gopher · GranD · GraphGPT · Grok-1 · HACL · Halle-Switch · HalluciDoctor · HallusionBench · HiGPT · HuBERT · HuatuoGPT · HuggingGPT · HyperCLOVA · IVE · IdeFics2 · ImageBind · ImageNet · Imp V1/V2 · InstructBLIP · InstructGPT · InternLM · InternLM-XComposer · InternLM-XComposer2-4KHD · Kosmos-2 · LAMM · LL3DA · LLaGA · LLaMA · LLaMA-VID · LLaMA2 · LLaVA · LLaVA 1.5 · LLaVA-1.5 · LLaVA-HR · LLaVA-Instruct-158k · LLaVA-NeXT · LLaVA-Phi · LLaVA-Plus · LLaVA-UHD · LRV · LRV-Instruction · LVIS-Instruct4V · LaMDA · LaVIT · LanguageBind · Luminous · MACAW-LLM · MEDITRON · MIMIC-IT · MM-REACT · MM1 · MMC-Instruction · MME Benchmark · MPT · MSR-VTT · Med-PalM · MedAlpaca · MedPalM 2 · Meta-GPT · Mini-Gemini · Mini-Gemini 2B · MiniCPM-Llama3-V 2.5 · MiniCPM-V · MiniCPM-V 1.0/2.0/2.5 · MiniCPM-V 2.5 · MiniGPT-5 · MiniGPT-v2 · MiniGPT4 · MobileVLM V1/V2 · Modaverse · MolCA · MolSTM · MolXPT · Momentor · Monkey · MovieChat · MultiModel-GPT · NExT-Chat · NLLB · Next-Chat · OCR · OPT · OPT-IML · OphGLM · OtterHD · PLUG · PMC-LLaMA · POPE · PSG · PaLM · PaliGemma · PanGu-α · PandaGPT · Phi-3-vision · Pix2emb · Pix2seq · Plan-PalM · Point-BERT · Point-Bind · PointLLM · Pythia · QFormer · Qilin-Med · Qwen · Qwen-VL · Qwen-VL-Chat · RLHF-V · RWKV · ReCaption · RefCOCO series · Robotics@Google · SALMONN · SAM · SEED-LLaMA · SPHINX · SPT · Self-Instruct · ShareGPT4V · SigLIP · Sparrow · SpatialVLM · StableLLaVA · StructGPT · T0 · T5 · T5 Chem · TextVQA · Tk-Instruct · UL2 · Ultra-Chat · Unified-IO 2 · VCR · VIGC-LLaVA · VL-Vicuna · VPGTrans · VQA · VQA-v2 · Vanilla-T5large · Vcoder · ViT-Lens · Vicuna · Video-ChatGPT · Video-LLaMA · Video-LaVIT · VideoChat · VideoPoet · ViperGPT · VisCPM · VisWiz · Visual-ChatGTP · VisualGLM · Vitron · Volcano · Voyager · WavCaps · WebArena · WebGPT · Whisper · Windows Copilot · Woodpecker · X-LLM · X-Tuner-Llama · YaLM · Yi-VL · Zhongjing · mPLUG-DocOwl 1.5 · mPLUG-Owl · mT5 · mTO
Topics
Agent Systems · Artificial General Intelligence (AGI) · Chain-of-Thought Reasoning · Consciousness and AI · Cross-modal alignment · Data Quality and Composition · Data generation · Efficiency in model training · Evaluation Benchmarks · Fine-grained Capabilities · Future of AI · Hallucination Mitigation · Human-level AI · In-Context Learning · Instruction Tuning · Language models (LLMs) · MLLM Architecture · Multilingual MLLMs · Multimodal Encoding · Multimodal Hallucinations · Multimodal LLM Agents · Multimodal LLMs · Multimodal Large Language Models (MLLMs) · Multimodal Reasoning · Multimodal models · Multimodality · Multitask Learning · Reasoning in multimodal models · Training Strategies · Vision-Language Models · Visual question answering (VQA)