HueManity: Probing Fine-Grained Visual Perception in MLLMs

31 May 2025
Rynaa Grover
Jayant Sravan Tamarapalli
Sahiti Yerramilli
Nilay Pande
Main: 7 pages · 3 figures · 3 tables · Bibliography: 3 pages · Appendix: 3 pages
Abstract

Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved 33.6% accuracy on the numeric `easy' task and a striking 3% on the alphanumeric `hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap. We open-source the HueManity dataset and code to foster further research into improving the perceptual robustness of MLLMs.
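
How such a stimulus might be produced can be illustrated with a short, hedged sketch. The snippet below is an illustrative assumption based only on the abstract's description (two-character alphanumeric strings hidden in Ishihara-test-style dot patterns), not the authors' released generation pipeline: it uses Pillow to rasterize the target string into a binary mask and then scatters colored dots whose palette depends on whether each dot lands on the characters. The function name, palettes, dot count, and font are hypothetical choices.

# Hedged sketch of an Ishihara-style dot-pattern stimulus hiding a two-character
# string, following the abstract's description. Colors, dot sizes, and the
# packing strategy are illustrative assumptions, not the authors' exact pipeline.
import random
from PIL import Image, ImageDraw, ImageFont

def make_ishihara_stimulus(text="A7", size=512, n_dots=1500, seed=0):
    rng = random.Random(seed)

    # Rasterize the target string into a binary mask.
    mask = Image.new("L", (size, size), 0)
    ImageDraw.Draw(mask).text(
        (size // 4, size // 3),
        text,
        fill=255,
        font=ImageFont.load_default(),  # assumption: a large TTF gives more realistic stimuli
    )

    # Scatter randomly sized dots; color each by whether its center falls on the text.
    img = Image.new("RGB", (size, size), (235, 235, 220))
    canvas = ImageDraw.Draw(img)
    figure_colors = [(200, 80, 60), (220, 120, 90)]      # assumed "pattern" palette
    ground_colors = [(120, 160, 110), (150, 180, 130)]   # assumed "background" palette
    for _ in range(n_dots):
        x, y = rng.randrange(size), rng.randrange(size)
        r = rng.randint(3, 9)
        on_text = mask.getpixel((x, y)) > 0
        color = rng.choice(figure_colors if on_text else ground_colors)
        canvas.ellipse((x - r, y - r, x + r, y + r), fill=color)
    return img

if __name__ == "__main__":
    make_ishihara_stimulus("A7").save("huemanity_style_example.png")

Reading the string back out of an image like this requires grouping dots by hue rather than by edges or luminance, which is the kind of fine-grained perceptual step the benchmark probes; a conventional classifier such as the fine-tuned ResNet50 mentioned above can be trained directly on images of this form.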

@article{grover2025_2506.03194,
  title={HueManity: Probing Fine-Grained Visual Perception in MLLMs},
  author={Rynaa Grover and Jayant Sravan Tamarapalli and Sahiti Yerramilli and Nilay Pande},
  journal={arXiv preprint arXiv:2506.03194},
  year={2025}
}