Evaluating Large Language Models for Code Review

Main: 6 pages, 10 figures; Bibliography: 2 pages
Abstract

Context: Code reviews are crucial for software quality. Recent advances in AI have enabled large language models (LLMs) to review and fix code, and tools that perform such reviews are now available. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares the performance of different LLMs in detecting code correctness and suggesting improvements. Method: We tested GPT-4o and Gemini 2.0 Flash on 492 AI-generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we asked the LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and report the results. Results: With problem descriptions, GPT-4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a human-in-the-loop process, "Human-in-the-Loop LLM Code Review," to promote knowledge sharing while mitigating the risk of faulty outputs.
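
The evaluation described in the Method section can be pictured as a single prompt per code block: the model is given a problem description and a candidate solution, then asked to classify correctness and return a fix when needed. The sketch below illustrates that loop, assuming the OpenAI Python SDK; the model name, prompt wording, and JSON output schema are illustrative assumptions, not the authors' exact experimental protocol.

# Minimal sketch of the review task described above: ask an LLM to (1) judge
# whether a candidate code block solves a given problem and (2) propose a
# corrected version when it judges the code incorrect. The model name, prompt
# wording, and JSON schema are illustrative assumptions, not the paper's setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REVIEW_PROMPT = (
    "You are reviewing a candidate solution.\n\n"
    "Problem description:\n{problem}\n\n"
    "Candidate code:\n{code}\n\n"
    'Reply with JSON only: {{"correct": true or false, '
    '"revised_code": "<corrected code, or an empty string if no fix is needed>"}}'
)


def review_code(problem: str, code: str, model: str = "gpt-4o") -> dict:
    """Classify correctness and optionally repair the code with one LLM call."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REVIEW_PROMPT.format(problem=problem, code=code)}],
        response_format={"type": "json_object"},  # request parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    # Toy, HumanEval-style example (hypothetical, for illustration only):
    # the candidate sums odd numbers instead of even ones.
    problem = "Return the sum of all even numbers in the list nums."
    buggy = "def sum_even(nums):\n    return sum(n for n in nums if n % 2 == 1)"
    verdict = review_code(problem, buggy)
    print(verdict["correct"], verdict.get("revised_code", ""))

A variant of the same prompt without the problem description corresponds to the "without problem descriptions" configuration reported in the Results section.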

@article{cihan2025_2505.20206,
  title={Evaluating Large Language Models for Code Review},
  author={Umut Cihan and Arda İçöz and Vahid Haratian and Eray Tüzün},
  journal={arXiv preprint arXiv:2505.20206},
  year={2025}
}