
EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

18 February 2025
Anjiang Wei
Jiannan Cao
Ran Li
Hongyu Chen
Yuhui Zhang
Ziheng Wang
Yuan Liu
Thiago S. F. X. Teixeira
Diyi Yang
Ke Wang
Alex Aiken
Main: 8 pages · 6 figures · 5 tables · Bibliography: 3 pages · Appendix: 3 pages
Abstract

Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs, underpins a broad range of applications, including software refactoring, testing, and optimization. We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models (LLMs). We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. These pairs are systematically generated through program analysis, compiler scheduling, and superoptimization, covering nontrivial structural transformations that demand deep semantic reasoning beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In the most challenging categories, the best accuracies are 62.3% and 68.8%, only modestly above the 50% random baseline for binary classification, indicating significant room for improvement in current models' code reasoning capabilities.
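To make the task concrete, below is a minimal sketch (Python, not taken from the paper) of what a single equivalence-checking instance looks like: a program pair related by dead-code elimination, posed as the yes/no question that underlies the 50% random baseline mentioned above. The prompt wording and the build_prompt helper are hypothetical illustrations, not EquiBench's actual data format or evaluation harness.

# A minimal, hypothetical illustration of the equivalence-checking
# task described in the abstract; not the actual EquiBench harness.

# Two semantically equivalent programs: the second is the first after
# dead-code elimination (for the intended integer inputs, the unused
# assignment to `y` never affects the return value).
PROGRAM_A = '''
def f(x):
    y = x * x  # dead code: y is never used
    return x + 1
'''

PROGRAM_B = '''
def f(x):
    return x + 1
'''

def build_prompt(prog_a: str, prog_b: str) -> str:
    """Pose the pair as the binary classification question that
    underlies the 50% random baseline."""
    return (
        "Do the following two programs produce identical outputs "
        "for all possible inputs? Answer 'yes' or 'no'.\n"
        f"Program 1:{prog_a}\nProgram 2:{prog_b}"
    )

if __name__ == "__main__":
    print(build_prompt(PROGRAM_A, PROGRAM_B))
    # Ground-truth label for this pair: "yes" (equivalent).
    # Scoring a model's answers against such labels over many
    # pairs yields the accuracies reported in the abstract.

Note that a model cannot answer this correctly from surface syntax alone; it has to reason about whether the removed statement could ever influence the output, which is exactly the kind of semantic reasoning the benchmark targets.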

@article{wei2025_2502.12466,
  title={EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking},
  author={Anjiang Wei and Jiannan Cao and Ran Li and Hongyu Chen and Yuhui Zhang and Ziheng Wang and Yuan Liu and Thiago S. F. X. Teixeira and Diyi Yang and Ke Wang and Alex Aiken},
  journal={arXiv preprint arXiv:2502.12466},
  year={2025}
}