Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

11 December 2025

Mohammadjavad Ahmadpour

Amirmahdi Meighani

Payam Taebi

Omid Ghahroodi

Amirmohammad Izadi

Mahdieh Soleymani Baghshah

LRM

VLM

Main: 9 Pages

Bibliography: 2 Pages

5 Tables

Appendix: 11 Pages

Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied to both open-source and closed-source VLMs across multiple benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs behave inconsistently: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
