BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models
Event: CVPR 2025 Precognition Workshop · Duration: 6 min
Abstract
Large Vision-Language Models (LVLMs) have shown remarkable success in various real-world applications, but hallucination undermines their trustworthiness. Hallucination occurs when an LVLM’s response is coherent but misaligned or inconsistent with the provided context, leading to spurious decision-making or misleading information. This paper proposes BIMA, a bijective maximum likelihood learning approach, to address this issue. BIMA models the distribution of desirable, non-hallucinated responses and learns a bijective mapping between this reference distribution and the model’s responses, narrowing the gap between generated and desired outputs.
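For readers unfamiliar with bijective maximum likelihood, the standard change-of-variables identity behind normalizing flows (the model class listed for this talk) is sketched below. The symbols are assumptions introduced here for illustration, not the talk's notation: x is some representation of a response, f_θ is a learned invertible map, and p_Z is a simple base density such as a standard Gaussian. The talk's exact formulation may differ.

```latex
% Exact log-likelihood of x under an invertible (bijective) map f_\theta
\log p_\theta(x) = \log p_Z\bigl(f_\theta(x)\bigr)
                 + \log\left|\det \frac{\partial f_\theta(x)}{\partial x}\right|

% Maximum likelihood training minimizes the negative log-likelihood
% over reference (non-hallucinated) samples x_1, \dots, x_N
\mathcal{L}_{\text{NLL}}(\theta) = -\frac{1}{N}\sum_{n=1}^{N} \log p_\theta(x_n)
```

Because f_θ is invertible, the likelihood is exact rather than a bound, which is what makes a reference distribution fitted this way usable as a scoring function for new responses.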
Speakers
- Huu-Thien Tran — CVIU Lab, University of Arkansas
- Thanh-Dat Truong — CVIU Lab, University of Arkansas
- Khoa Luu — CVIU Lab, University of Arkansas
Talks (1)
- 00:00:00 — Huu-Thien Tran: BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models
- This presentation introduces BIMA, a novel approach to predict and mitigate hallucinations in Large Vision-Language Models by modeling a reference distribution of non-hallucinated responses.
Key Takeaways
- Hallucination in LVLMs is a significant problem, leading to inconsistent and misleading information, and its mitigation is crucial for reliable AI applications.
- BIMA proposes a novel approach by modeling a ‘Reference Distribution’ of non-hallucinated responses using normalizing flows.
- The framework integrates a bijective loss during fine-tuning to map generated responses closer to the desired reference distribution, alongside the autoregressive loss.
- Experimental results on the CHAIR benchmark show that BIMA significantly reduces both hallucination metrics (CHAIRs and CHAIRi) compared to prior methods.
- Qualitative analysis demonstrates that BIMA produces more meaningful and contextually aligned responses with fewer hallucinated objects than baseline LLaVA models.
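The two CHAIR metrics named above can be illustrated with a small sketch. CHAIRs is the fraction of captions containing at least one hallucinated object, and CHAIRi is the fraction of mentioned objects that are hallucinated. This version is a simplification under stated assumptions: objects are already extracted from each caption as sets of strings (the real benchmark maps caption words to MSCOCO categories via synonym lists), and the function name `chair_scores` is hypothetical.

```python
from typing import List, Set, Tuple


def chair_scores(
    mentioned: List[Set[str]], ground_truth: List[Set[str]]
) -> Tuple[float, float]:
    """Return (CHAIRs, CHAIRi) for pre-extracted object sets.

    mentioned[k]    : objects named in the k-th generated caption (assumed input)
    ground_truth[k] : objects actually present in the k-th image (assumed input)
    """
    if len(mentioned) != len(ground_truth):
        raise ValueError("need one ground-truth set per caption")

    total_mentions = 0          # all objects mentioned across captions
    hallucinated_mentions = 0   # mentioned objects absent from the image
    hallucinated_captions = 0   # captions with at least one such object

    for objs, gt in zip(mentioned, ground_truth):
        bad = objs - gt
        total_mentions += len(objs)
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1

    chair_s = hallucinated_captions / len(mentioned) if mentioned else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


# Toy example: the first caption invents a "bench", the second is clean.
captions = [{"dog", "frisbee", "bench"}, {"cat", "sofa"}]
images = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
chair_s, chair_i = chair_scores(captions, images)
print(f"CHAIRs={chair_s:.2f} CHAIRi={chair_i:.2f}")
```

Lower is better for both metrics. Objects present in the image but not mentioned (the "lamp" here) do not count against the caption, since CHAIR measures hallucination rather than coverage.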
Methods / Models / Datasets Mentioned
BIMA · Normalizing Flow · LLaVA v1.5 · POPE Benchmark · CHAIR Benchmark · Greedy (decoding) · Nucleus (decoding) · Beam Search (decoding) · OPERA · ICD · VCD · SID · ProjectAway
Topics
Large Vision-Language Models (LVLMs) · Hallucination Mitigation · Bijective Maximum Likelihood Learning (BIMA) · Reference Distribution · Normalizing Flow · Visual Question Answering (VQA) · Image Captioning · Trustworthiness in AI