BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Event: CVPR 2025 Precognition Workshop · Duration: 6 min

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable success in various real-world applications, but hallucination remains a critical obstacle to their trustworthiness. Hallucination occurs when an LVLM’s response is coherent but misaligned or inconsistent with the provided context, which can lead to spurious decision-making or misleading information. This paper proposes BIMA, a bijective maximum likelihood learning approach, to address this issue. BIMA models the distribution of desirable, non-hallucinated responses and learns a bijective mapping between this reference distribution and the machine’s responses, bridging the gap between generated and desired outputs.
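
The summary lists normalizing flows as the tool used to model the reference distribution. As a reminder of what a bijective maximum likelihood objective looks like in general, the standard change-of-variables identity is sketched below. The symbols are notation chosen here for illustration, not taken from the talk: x is some representation of a response, f_θ is an invertible map, and p_Z is a simple base density such as a standard Gaussian. The exact form BIMA uses may differ.

```latex
% Standard normalizing-flow log-likelihood (illustrative notation, not from the talk)
z = f_\theta(x), \qquad
\log p_X(x) = \log p_Z\bigl(f_\theta(x)\bigr)
            + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

Maximizing this quantity over non-hallucinated responses is what makes the learned density a "reference" against which generated responses can be scored.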

Speakers

Huu-Thien Tran — CVIU Lab, University of Arkansas
Thanh-Dat Truong — CVIU Lab, University of Arkansas
Khoa Luu — CVIU Lab, University of Arkansas

Talks (1)

00:00:00 — Huu-Thien Tran: BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models
- This presentation introduces BIMA, a novel approach to predict and mitigate hallucinations in Large Vision-Language Models by modeling a reference distribution of non-hallucinated responses.

Key Takeaways

Hallucination in LVLMs produces inconsistent and misleading information, so mitigating it is crucial for reliable AI applications.
BIMA models a ‘Reference Distribution’ of non-hallucinated responses using normalizing flows.
The framework integrates a bijective loss during fine-tuning to map generated responses closer to the desired reference distribution, alongside the autoregressive loss.
Experimental results on the CHAIR benchmark show that BIMA significantly reduces the hallucination metrics CHAIRs and CHAIRi compared to prior methods.
Qualitative analysis demonstrates that BIMA produces more meaningful and contextually aligned responses with fewer hallucinated objects than baseline LLaVA models.
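
The takeaways above describe a fine-tuning objective that adds a bijective loss to the usual autoregressive loss. A minimal sketch of that idea follows, under loud assumptions: the flow here is a toy element-wise affine map rather than whatever architecture the authors use, the response is assumed to be summarized as a small feature vector, and the names `autoregressive_nll`, `affine_flow_nll`, `combined_loss`, and the `weight` parameter are all hypothetical.

```python
import math


def autoregressive_nll(token_log_probs):
    """Mean negative log-likelihood of the target tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def affine_flow_nll(x, shift, log_scale):
    """NLL of x under a toy element-wise affine flow with a standard normal base.

    Forward map: z_i = (x_i - shift_i) * exp(-log_scale_i),
    so log|det J| = -sum(log_scale).
    """
    z = [(xi - s) * math.exp(-ls) for xi, s, ls in zip(x, shift, log_scale)]
    log_base = -0.5 * sum(zi * zi for zi in z) - 0.5 * len(z) * math.log(2 * math.pi)
    log_det = -sum(log_scale)
    return -(log_base + log_det)


def combined_loss(token_log_probs, response_feature, shift, log_scale, weight=0.1):
    """Autoregressive loss plus a weighted bijective (flow) loss.

    The weighting scheme is an assumption made for this sketch.
    """
    ar = autoregressive_nll(token_log_probs)
    bij = affine_flow_nll(response_feature, shift, log_scale)
    return ar + weight * bij


if __name__ == "__main__":
    log_probs = [math.log(0.5), math.log(0.25)]  # hypothetical per-token probabilities
    feature = [0.3, -0.2]                        # hypothetical response feature
    print(combined_loss(log_probs, feature, shift=[0.0, 0.0], log_scale=[0.0, 0.0]))
```

In this sketch a response feature far from the region where the flow places its mass receives a higher bijective loss, which is the mechanism by which such a term could pull generated responses toward the reference distribution.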

Methods / Models / Datasets Mentioned

BIMA
Normalizing Flow
LLaVA v1.5
POPE Benchmark
CHAIR Benchmark
Greedy (decoding)
Nucleus (decoding)
Beam Search (decoding)
OPERA
ICD
VCD
SID
ProjectAway
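
For readers unfamiliar with the CHAIR benchmark listed above, its two scores are conventionally defined as the fraction of captions containing at least one hallucinated object (CHAIRs) and the fraction of mentioned objects that are hallucinated (CHAIRi). The sketch below assumes the object mentions have already been extracted from each caption and matched to a ground-truth object set; the actual benchmark also handles synonym mapping, which is omitted here. The function name `chair_scores` and the example data are hypothetical.

```python
def chair_scores(captions):
    """Compute (CHAIRs, CHAIRi) from pre-extracted object sets.

    captions: list of (mentioned_objects, ground_truth_objects) pairs,
    where each element of a pair is a set of object names.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, ground_truth in captions:
        hallucinated = mentioned - ground_truth  # objects not present in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1
    chair_s = hallucinated_captions / len(captions) if captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


if __name__ == "__main__":
    example = [
        ({"dog", "frisbee"}, {"dog", "frisbee", "person"}),  # no hallucination
        ({"cat", "sofa", "tv"}, {"cat", "sofa"}),            # "tv" is hallucinated
    ]
    print(chair_scores(example))  # (0.5, 0.2)
```

Lower values are better for both scores.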

Topics

Large Vision-Language Models (LVLMs) · Hallucination Mitigation · Bijective Maximum Likelihood Learning (BIMA) · Reference Distribution · Normalizing Flow · Visual Question Answering (VQA) · Image Captioning · Trustworthiness in AI
