
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Abstract

Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations of text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: first, we conduct format finetuning on a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements; second, we apply reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning that relies solely on text-based CoT. This result demonstrates that grounded CoT is more effective for multimodal reasoning than text-only CoT. Moreover, Point-RFT exhibits superior generalization across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, and TabMWP, highlighting its potential in complex real-world scenarios.
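
To make the two-stage recipe concrete, below is a minimal Python sketch of a reward function one could use in the reinforcement-finetuning stage. It is an illustration, not the authors' implementation: the grounded-CoT format (each step citing a visual element as "<point>x, y</point>" and ending with "Answer: ..."), the relaxed numeric matching, and the weighting between format and answer rewards are all assumptions made here for clarity.

import re


def grounded_format_reward(rollout: str) -> float:
    """Fraction of reasoning steps that contain at least one point-grounding tag."""
    steps = [s for s in rollout.split("\n") if s.strip() and not s.startswith("Answer:")]
    if not steps:
        return 0.0
    grounded = sum(1 for s in steps if re.search(r"<point>[^<]+</point>", s))
    return grounded / len(steps)


def answer_reward(rollout: str, gold: str, tol: float = 0.05) -> float:
    """1.0 if the final answer matches the gold label, with relaxed numeric matching."""
    match = re.search(r"Answer:\s*(.+)", rollout)
    if not match:
        return 0.0
    pred = match.group(1).strip()
    try:
        p, g = float(pred.rstrip("%")), float(gold.rstrip("%"))
        return 1.0 if abs(p - g) <= tol * max(abs(g), 1e-8) else 0.0
    except ValueError:
        return 1.0 if pred.lower() == gold.lower() else 0.0


def total_reward(rollout: str, gold: str, w_format: float = 0.2) -> float:
    """Weighted sum of grounding-format and answer-correctness rewards (weights assumed)."""
    return w_format * grounded_format_reward(rollout) + (1 - w_format) * answer_reward(rollout, gold)


if __name__ == "__main__":
    sample = (
        "Step 1: The 2020 bar is at <point>120, 85</point> and reads 42.\n"
        "Step 2: The 2021 bar at <point>180, 60</point> reads 63.\n"
        "Answer: 21"
    )
    print(total_reward(sample, "21"))  # 1.0 when both grounding and answer check out

A policy-gradient method would then optimize the model against this scalar reward over sampled rollouts; the key idea the sketch captures is that the reward jointly encourages grounded intermediate steps and a correct final answer.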

@article{ni2025_2505.19702,
  title={Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning},
  author={Minheng Ni and Zhengyuan Yang and Linjie Li and Chung-Ching Lin and Kevin Lin and Wangmeng Zuo and Lijuan Wang},
  journal={arXiv preprint arXiv:2505.19702},
  year={2025}
}