Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

10 April 2026

Zhi Chen

Zhensu Sun

Yuling Shi

Chao Peng

Xiaodong Gu

David Lo

Lingxiao Jiang

LLMAG

AIFin

ELM


Main: 18 Pages

3 Figures

Bibliography: 3 Pages

7 Tables

Abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of this behavior remains unclear. For example, GPT-5.2 writes almost no new tests yet achieves performance comparable to top-ranking agents. This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget? To better understand the role of agent-written tests, we analyze trajectories produced by six strong LLMs on SWE-bench Verified. Our results show that test writing is common, but resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. When tests are written, they mainly serve as observational feedback channels, with value-revealing print statements appearing much more often than assertion-based checks. Based on these insights, we perform a prompt-intervention study by revising the prompts used with four models to either increase or reduce test writing. The results suggest that prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes in this setting. Taken together, these results suggest that current agent-written testing practices reshape process and cost more than final task outcomes.
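To make the distinction between observational feedback and assertion-based checks concrete, the following is a minimal illustrative sketch, not the paper's analysis pipeline. It assumes an agent-written test is available as Python source text and counts `print` calls against `assert` statements and unittest-style `assert*` method calls. The function name `count_feedback_signals`, the two sample scripts, and the `parse_date` function they mention are all hypothetical.

```python
import ast


def count_feedback_signals(source: str) -> dict:
    """Count print calls vs. assertion-style checks in Python source.

    Hypothetical helper for illustration only; the paper's own
    classification procedure may differ.
    """
    tree = ast.parse(source)  # parses only, never executes the script
    counts = {"print": 0, "assert": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assert):
            # Plain `assert expr` statements.
            counts["assert"] += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == "print":
                # Value-revealing output, e.g. print("parsed:", result).
                counts["print"] += 1
            elif isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                # unittest-style checks, e.g. self.assertEqual(a, b).
                counts["assert"] += 1
    return counts


# Hypothetical agent-written scripts: one only observes values,
# the other encodes an expected outcome.
observational_script = """
result = parse_date("2024-02-30")
print("parsed:", result)
print("type:", type(result))
"""

assertion_script = """
result = parse_date("2024-02-30")
assert result is None
"""

if __name__ == "__main__":
    print(count_feedback_signals(observational_script))
    print(count_feedback_signals(assertion_script))
```

Under this sketch, a script dominated by `print` calls leaves the judgment of correctness to whoever reads the output, whereas a script with assertions fails on its own when the expected behavior is violated.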

