
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Main: 11 pages, 11 figures, 3 tables. Bibliography: 2 pages. Appendix: 3 pages.
Abstract

As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities, so it is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts such as memorized training data. In this work, we introduce two diagnostic tasks that probe models' underlying knowledge: identifying the buggy file path from the issue description alone, and reproducing the ground-truth function given only the issue description and the current file context. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. On tasks from repositories not included in SWE-Bench, this accuracy drops to at most 53%, pointing to possible data contamination or memorization. A similar pattern is observed for the function reproduction task, where verbatim similarity to the ground truth is substantially higher on SWE-Bench Verified than on comparable coding benchmarks. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
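A minimal sketch of how the two diagnostic probes described above might be implemented, assuming an OpenAI-compatible chat API. The prompt wording, model name, and the character-level similarity metric are illustrative assumptions, not the authors' exact protocol.

```python
# Illustrative sketch of the two diagnostic probes from the abstract:
# (1) name the buggy file path from the issue description alone (no repo tree),
# (2) score verbatim similarity of a generated function against the ground truth.
# Prompt wording, model name, and the similarity metric are assumptions.
import difflib
from openai import OpenAI  # any OpenAI-compatible client

client = OpenAI()

def predict_buggy_file(issue_text: str, model: str = "gpt-4o") -> str:
    """Ask the model to name the buggy file path given only the issue text."""
    prompt = (
        "You are given a GitHub issue description. Without seeing the "
        "repository, output only the path of the file most likely to "
        "contain the bug.\n\nIssue:\n" + issue_text
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def path_accuracy(instances: list[dict]) -> float:
    """Fraction of instances where the predicted path equals the gold path."""
    hits = sum(
        predict_buggy_file(inst["issue"]) == inst["gold_path"]
        for inst in instances
    )
    return hits / len(instances)

def verbatim_similarity(generated: str, ground_truth: str) -> float:
    """Character-level similarity as a rough proxy for verbatim recall."""
    return difflib.SequenceMatcher(None, generated, ground_truth).ratio()
```

Under this sketch, a large gap between path_accuracy on SWE-Bench Verified instances and on held-out repositories, or unusually high verbatim_similarity scores, would be consistent with the memorization effect the paper reports.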

@article{liang2025_2506.12286,
  title={The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason},
  author={Shanchao Liang and Spandan Garg and Roshanak Zilouchian Moghaddam},
  journal={arXiv preprint arXiv:2506.12286},
  year={2025}
}