ResearchTrend.AI
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

31 March 2025
Yingwei Ma
Binhua Li
Yihong Dong
Xue Jiang
Rongyu Cao
Jue Chen
Fei Huang
Yongbin Li
Abstract

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: how can personally deployable open-source LLMs achieve comparable code reasoning performance? To this end, we propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a development-contextualized trajectory synthesis method that leverages real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel development-process-based search strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming the limitations of existing "end-point only" verification methods. Evaluations on SWE-bench Verified demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide empirical validation of the test-time scaling phenomenon within SWE agents, revealing that models dynamically allocate more tokens to increasingly challenging problems, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research (this https URL).
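The abstract combines rejection sampling (filtering candidate trajectories by a correctness check) with reward-model-guided selection. The following is a minimal illustrative sketch of that general pattern, not the paper's actual implementation; all names, the check, and the reward function here are hypothetical stand-ins.

```python
# Hypothetical sketch of reward-guided selection over candidate trajectories:
# first reject candidates that fail a verification check (a stand-in for
# execution-based verification), then rank survivors by a reward score
# (a stand-in for a learned reward model). Not the paper's API.

def select_trajectories(trajectories, passes_check, reward, top_k=1):
    """Filter trajectories by a correctness check, then return the
    top_k survivors ranked by reward score (highest first)."""
    accepted = [t for t in trajectories if passes_check(t)]
    return sorted(accepted, key=reward, reverse=True)[:top_k]

# Toy usage: each candidate is a (patch_id, score) pair.
candidates = [("patch_a", 0.2), ("patch_b", 0.9), ("patch_c", 0.7)]
best = select_trajectories(
    candidates,
    passes_check=lambda t: t[1] > 0.3,   # illustrative verification gate
    reward=lambda t: t[1],               # illustrative reward model score
)
print(best)  # → [('patch_b', 0.9)]
```

In the paper's setting, the check and reward would instead be applied at intermediate development decision points (fault localization, patch generation) rather than only at the end point.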

@article{ma2025_2503.23803,
  title={Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute},
  author={Yingwei Ma and Yongbin Li and Yihong Dong and Xue Jiang and Rongyu Cao and Jue Chen and Fei Huang and Binhua Li},
  journal={arXiv preprint arXiv:2503.23803},
  year={2025}
}