Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

6 April 2026

Tuan Dung Nguyen

Minh Khoi Ho

Qi Chen

Yutong Xie

Nguyen Cam-Tu

Minh Khoi Nguyen

Dang Huy Pham Nguyen

Anton van den Hengel

Johan W. Verjans

Phi Le Nguyen

Vu Minh Hieu Phan

MLLM


Main: 8 Pages

11 Figures

Bibliography: 2 Pages

8 Tables

Appendix: 3 Pages

Abstract

Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods rely predominantly on coarse, whole-image measures of how an object token relates to the input image. This global strategy has a blind spot: a hallucinated token may correlate weakly with many scattered local regions, and these weak correlations aggregate into a deceptively high overall relevance score that lets the token evade current global detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention of faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that combines patch-level statistical features with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
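The two signatures described in the abstract can be illustrated with simple patch-level statistics. The sketch below is not the authors' implementation: the function names, the choice of normalized entropy and peak attention mass as measures of diffuseness, and the use of cosine similarity as a proxy for semantic alignment are all assumptions made for illustration. It also assumes that one token's attention over the image patches and embeddings in a shared token/patch space have already been extracted from the model.

```python
import numpy as np

def attention_entropy(attn):
    """Normalized Shannon entropy of one token's attention over image patches.

    Returns a value in [0, 1]: near 0 for attention focused on one patch,
    1 for perfectly uniform (diffuse) attention. Illustrative measure only.
    """
    p = np.asarray(attn, dtype=float).ravel()
    p = p / p.sum()                      # renormalize over patches
    nz = p[p > 0]                        # skip zeros to avoid log(0)
    h = (nz * np.log(1.0 / nz)).sum()
    return float(h / np.log(p.size))

def max_patch_similarity(token_vec, patch_vecs):
    """Highest cosine similarity between a token vector and any patch vector.

    Assumes (hypothetically) that both live in a comparable embedding space.
    """
    t = token_vec / np.linalg.norm(token_vec)
    patches = patch_vecs / np.linalg.norm(patch_vecs, axis=1, keepdims=True)
    return float((patches @ t).max())

def patch_features(attn, token_vec, patch_vecs):
    """Stack the illustrative statistics into one feature vector."""
    attn = np.asarray(attn, dtype=float).ravel()
    peak_mass = float(attn.max() / attn.sum())   # share of attention on the top patch
    return np.array([
        attention_entropy(attn),
        peak_mass,
        max_patch_similarity(token_vec, patch_vecs),
    ])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_patches, dim = 576, 64             # hypothetical 24x24 patch grid
    patch_vecs = rng.normal(size=(n_patches, dim))

    # Synthetic "grounded" token: sharp attention, aligned with patch 10.
    focused = np.full(n_patches, 1e-4)
    focused[10] = 1.0
    grounded = patch_features(focused, patch_vecs[10], patch_vecs)

    # Synthetic "ungrounded" token: flat attention, unrelated vector.
    diffuse = np.ones(n_patches)
    ungrounded = patch_features(diffuse, rng.normal(size=dim), patch_vecs)

    print("grounded   [entropy, peak, max_sim]:", grounded)
    print("ungrounded [entropy, peak, max_sim]:", ungrounded)
```

Under these assumptions, a grounded token should show low entropy, high peak mass, and high maximum similarity, while an ungrounded one should show the opposite. Such features could then be fed, alongside hidden-layer representations, to a small classifier; which classifier and which layers the paper uses is not stated in the abstract.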
