Toward More Reliable Multimodal Systems: Mitigating Hallucinations in Large Vision-Language Models

Master's Thesis, Tech. Report CMU-RI-TR-25-47, June 2025

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have led to impressive performance across a wide range of multimodal tasks. However, their tendency to produce hallucinated responses, i.e., text that is inconsistent with the visual input, poses a significant challenge to their reliability and real-world applicability. In this thesis, we investigate two training-free approaches for mitigating hallucinations during the decoding process. First, we propose Self-Correcting Decoding with Generative Feedback (DeGF), which leverages text-to-image generation as an inverse process to detect and correct hallucinations. By synthesizing an auxiliary image from the model's initial textual response, DeGF provides visual self-feedback that is used to verify and revise hallucinated outputs via contrastive or complementary decoding. Second, we introduce ONLY, a highly efficient decoding method that requires only a single query and a lightweight one-layer intervention. By selectively amplifying the textual signal based on a text-to-visual entropy ratio, ONLY improves response reliability while maintaining real-time efficiency and minimal computational overhead. Extensive experiments across multiple hallucination benchmarks demonstrate that both DeGF and ONLY significantly outperform existing methods, offering practical and effective solutions for enhancing the trustworthiness of LVLMs in real-world applications.
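To make the two decoding rules concrete, the sketch below gives one plausible logit-level reading of them in Python with NumPy. All function names, the blending coefficient alpha, the entropy-ratio threshold, and the agreement test are illustrative assumptions chosen for exposition; the exact formulations, and the one-layer intervention used by ONLY, are specified in the thesis.

# Illustrative sketch of the two decoding rules described in the abstract.
# NOTE: all names, coefficients, and thresholds here are hypothetical;
# they are not the exact formulations from the thesis.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(logits):
    p = softmax(logits)
    return -np.sum(p * np.log(p + 1e-12))

def degf_step(logits_orig, logits_gen, alpha=0.5, agree_thresh=0.1):
    """Self-correcting decoding with generative feedback (sketch).

    logits_orig: next-token logits conditioned on the real input image.
    logits_gen:  next-token logits conditioned on an auxiliary image
                 synthesized from the model's initial response.
    If the two predictions diverge (hallucination suspected), contrast them;
    if they agree, blend them as complementary evidence.
    """
    p_orig, p_gen = softmax(logits_orig), softmax(logits_gen)
    divergence = 0.5 * np.abs(p_orig - p_gen).sum()  # total variation distance
    if divergence > agree_thresh:
        # Contrastive decoding: push away from the generative-feedback branch.
        return (1 + alpha) * logits_orig - alpha * logits_gen
    # Complementary decoding: average the two branches.
    return 0.5 * (logits_orig + logits_gen)

def only_step(logits_full, logits_text_only, beta=0.5, ratio_thresh=1.0):
    """ONLY-style decoding (sketch): amplify the textual signal only when the
    text-to-visual entropy ratio indicates it is beneficial to do so.

    logits_full:      logits conditioned on both the image and the text.
    logits_text_only: logits conditioned on the text alone.
    """
    ratio = entropy(logits_text_only) / (entropy(logits_full) + 1e-12)
    if ratio < ratio_thresh:
        # Text branch is confident relative to the visual branch: amplify it.
        return logits_full + beta * logits_text_only
    return logits_full

# Toy usage with random logits over a 10-token vocabulary.
rng = np.random.default_rng(0)
v = 10
print(degf_step(rng.normal(size=v), rng.normal(size=v)).shape)
print(only_step(rng.normal(size=v), rng.normal(size=v)).shape)

In an actual LVLM decoder these functions would be applied per generation step to the model's next-token logits, so both rules stay training-free and add only the cost of the extra forward pass (DeGF) or a single additional query (ONLY).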

BibTeX

@mastersthesis{Wan-2025-147099,
author = {Zifu Wan},
title = {Toward More Reliable Multimodal Systems: Mitigating Hallucinations in Large Vision-Language Models},
year = {2025},
month = {June},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-47},
keywords = {Large Vision Language Models, Hallucination Mitigation},
}