Visually Interpretable Subtask Reasoning for Visual Question Answering

12 May 2025
Yu Cheng
Arushi Goel
Hakan Bilen
Abstract

Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at this https URL.
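To make the notion of a structured Subtask-of-Thought rationale concrete, below is a minimal, purely illustrative Python sketch of how the example question from the abstract could be decomposed into subtask steps. The Subtask class, the operation names, and the hard-coded object lists are assumptions for illustration only; the abstract does not specify VISTAR's actual rationale format, which is generated by the fine-tuned MLLM rather than hand-written.

# Illustrative sketch only: a hand-written Subtask-of-Thought-style rationale
# for the example question in the abstract. In VISTAR this sequence would be
# produced by the fine-tuned MLLM; the structure below is a hypothetical stand-in.
from dataclasses import dataclass

@dataclass
class Subtask:
    step: int
    operation: str        # e.g. "select", "filter_attribute", "verify_relation" (assumed names)
    argument: str         # what the operation acts on
    result: list[str]     # hypothetical objects surviving this step

def subtask_of_thought(question: str) -> list[Subtask]:
    """Return a hard-coded rationale for the running example question."""
    assert question == "Which red furniture can be used for sitting?"
    return [
        Subtask(1, "select", "furniture", ["chair", "sofa", "table"]),
        Subtask(2, "filter_attribute", "red", ["chair", "sofa"]),
        Subtask(3, "verify_relation", "can be used for sitting", ["chair", "sofa"]),
    ]

if __name__ == "__main__":
    for s in subtask_of_thought("Which red furniture can be used for sitting?"):
        print(f"Step {s.step}: {s.operation}({s.argument}) -> {s.result}")

The point of the structure is that each reasoning step (object recognition, attribute filtering, relational understanding) leaves an inspectable intermediate result, which is what makes the final answer interpretable.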

@article{cheng2025_2505.08084,
  title={Visually Interpretable Subtask Reasoning for Visual Question Answering},
  author={Yu Cheng and Arushi Goel and Hakan Bilen},
  journal={arXiv preprint arXiv:2505.08084},
  year={2025}
}