CODEMENV: Benchmarking Large Language Models on Code Migration

1 June 2025
Keyuan Cheng
Xudong Shen
Yihao Yang
Tengyue Wang
Yang Cao
Muhammad Asif Ali
Hanbin Wang
Lijie Hu
Di Wang
Main: 9 pages · 5 figures · 8 tables · Bibliography: 2 pages · Appendix: 14 pages
Abstract

Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration (adapting code to run in different environments) remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation of seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies, identifying function changes that are irrelevant to the intended migration environment. The datasets are available at this https URL.
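The third task (adapting code to a target environment) can be illustrated with a small version-conditional sketch. The example below is not drawn from the benchmark; it uses the fact that `math.comb` only exists in Python ≥ 3.8 as a stand-in for the kind of package-version incompatibility the benchmark tests.

```python
import sys

def comb_compat(n, k):
    """Binomial coefficient, migrated so it also runs on target
    environments with Python < 3.8, where math.comb does not exist."""
    if sys.version_info >= (3, 8):
        from math import comb  # added to the stdlib in Python 3.8
        return comb(n, k)
    # Fallback for older environments: compute from factorials.
    from math import factorial
    return factorial(n) // (factorial(k) * factorial(n - k))

print(comb_compat(10, 3))  # 120
```

A correct migration must both detect the version-dependent API (tasks 1 and 2) and produce an equivalent implementation for the target environment (task 3).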

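For reference, pass@1 is commonly reported via the unbiased pass@k estimator of Chen et al. (2021). Whether CODEMENV draws one sample per problem or uses this estimator is not stated here, so the sketch below is only a reminder of the standard definition, not a description of the paper's exact protocol.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass,
    is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to c/n, the fraction of passing samples.
print(pass_at_k(10, 3, 1))  # 0.3
```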
@article{cheng2025_2506.00894,
  title={CODEMENV: Benchmarking Large Language Models on Code Migration},
  author={Keyuan Cheng and Xudong Shen and Yihao Yang and Tengyue Wang and Yang Cao and Muhammad Asif Ali and Hanbin Wang and Lijie Hu and Di Wang},
  journal={arXiv preprint arXiv:2506.00894},
  year={2025}
}