Are Large Language Models Memorizing Bug Benchmarks?

Abstract
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage.
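
One illustrative way to probe for such leakage (a sketch only, not the paper's stated methodology) is to prompt a model with the prefix of a benchmark bug fix and measure how closely its completion reproduces the ground-truth code, for example via n-gram overlap. The helper names and the 5-gram choice below are assumptions for illustration.

# Illustrative sketch only: a crude n-gram-overlap probe for benchmark
# memorization. The helper names and the 5-gram choice are assumptions,
# not the paper's actual metrics or tooling.
from collections import Counter


def ngrams(tokens, n=5):
    """Return a multiset of token n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(completion: str, reference: str, n: int = 5) -> float:
    """Fraction of the reference's n-grams that also appear in the completion.

    Values close to 1.0 suggest the model may be reproducing the reference
    (e.g., a benchmark's ground-truth patch) near-verbatim.
    """
    comp, ref = ngrams(completion.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    matched = sum(min(count, comp[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())


if __name__ == "__main__":
    # Hypothetical example: compare a model's completion of a buggy function
    # against a benchmark's ground-truth fix.
    reference_fix = "if x is None : return default return x + offset"
    model_output = "if x is None : return default return x + offset"
    print(f"5-gram overlap: {ngram_overlap(model_output, reference_fix):.2f}")

A high overlap on many benchmark instances would be consistent with memorization rather than genuine repair ability, which is the concern the paper sets out to quantify.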
@article{ramos2025_2411.13323,
  title   = {Are Large Language Models Memorizing Bug Benchmarks?},
  author  = {Daniel Ramos and Claudia Mamede and Kush Jain and Paulo Canelas and Catarina Gamboa and Claire Le Goues},
  journal = {arXiv preprint arXiv:2411.13323},
  year    = {2025}
}