
Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Mohammadjavad Ahmadpour
Amirmahdi Meighani
Payam Taebi
Omid Ghahroodi
Amirmohammad Izadi
Mahdieh Soleymani Baghshah
Main: 9 pages · Bibliography: 2 pages · 5 tables · Appendix: 11 pages
Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods across both open-source and closed-source VLMs on a range of benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
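To make the two families of inference-time strategies contrasted above concrete, here is a minimal Python sketch of iterative Self-Refinement and external verification via best-of-N selection. This is an illustrative sketch, not the paper's implementation: `generate`, `critique`, and `verifier_score` are hypothetical callables standing in for a VLM inference call, a self-critique prompt, and an external reward/verifier model.

```python
# Sketch of two test-time scaling strategies discussed in the abstract.
# `generate`, `critique`, and `verifier_score` are hypothetical stand-ins
# for VLM and verifier calls; they are NOT the paper's actual API.

from typing import Callable, List


def self_refine(question: str, image: bytes,
                generate: Callable[[str, bytes], str],
                critique: Callable[[str, bytes, str], str],
                rounds: int = 3) -> str:
    """Iteratively ask the model to critique and revise its own answer."""
    answer = generate(question, image)
    for _ in range(rounds):
        feedback = critique(question, image, answer)
        if "no issues" in feedback.lower():  # simple stopping heuristic
            break
        answer = generate(
            f"{question}\n\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nRevise the answer.",
            image,
        )
    return answer


def best_of_n(question: str, image: bytes,
              generate: Callable[[str, bytes], str],
              verifier_score: Callable[[str, bytes, str], float],
              n: int = 8) -> str:
    """Sample N candidates and keep the one the external verifier prefers."""
    candidates: List[str] = [generate(question, image) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(question, image, a))
```

The key design difference is who judges the output: Self-Refinement relies on the model to evaluate itself, while best-of-N delegates judgment to a separate verifier, which matches the abstract's finding that external verification yields the most reliable gains for open-source VLMs.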
