Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

30 May 2025
Kaivalya Hariharan
Uzay Girit
Atticus Wang
Jacob Andreas
Main: 13 pages · 15 figures · Bibliography: 2 pages · Appendix: 6 pages
Abstract

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g., SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort, and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks, we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
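
The system-level difficulty knob described above — ranking functions by call-graph centrality and corrupting the most central ones — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the graph representation, the use of simple degree centrality (the paper does not specify which centrality measure), and the function names are all assumptions.

```python
# Hypothetical sketch of Breakpoint-style target selection: score each function
# by its centrality in the repository call graph, then pick the most central
# ones as corruption targets. All names here are illustrative assumptions.
from collections import defaultdict


def degree_centrality(call_graph):
    """call_graph: dict mapping caller name -> list of callee names.
    Returns degree centrality (in-degree + out-degree, normalized by n-1)."""
    nodes = set(call_graph)
    for callees in call_graph.values():
        nodes.update(callees)
    degree = defaultdict(int)
    for caller, callees in call_graph.items():
        for callee in callees:
            degree[caller] += 1   # outgoing edge
            degree[callee] += 1   # incoming edge
    n = max(len(nodes) - 1, 1)
    return {node: degree[node] / n for node in nodes}


def pick_corruption_targets(call_graph, k=2):
    """Select the k most central functions as candidates to corrupt."""
    centrality = degree_centrality(call_graph)
    return sorted(centrality, key=centrality.get, reverse=True)[:k]


# Toy repository: main calls parse and run; run calls parse and log.
graph = {
    "main": ["parse", "run"],
    "run": ["parse", "log"],
}
print(pick_corruption_targets(graph, k=2))
```

In this toy graph, `run` has the highest degree (called by `main`, calling two others), so it ranks first; corrupting such a function forces an agent to reason about every caller and callee it touches, which is exactly the system-level axis the abstract describes.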

@article{hariharan2025_2506.00172,
  title={Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents},
  author={Kaivalya Hariharan and Uzay Girit and Atticus Wang and Jacob Andreas},
  journal={arXiv preprint arXiv:2506.00172},
  year={2025}
}