RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

18 June 2025

Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez

Author contacts: xinnuoxu@microsoft.com, rachel.lawrence@microsoft.com


30 figures, 9 tables, bibliography: 1 page, appendix: 31 pages

Abstract

Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it remains unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions, and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate that past performance relied to some degree on statistical recall, and open the door to further research targeting skills across the reasoning hierarchy.
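The idea of altering a problem in an intermediate symbolic representation can be illustrated with a toy sketch. This is not the RE-IMAGINE pipeline itself: the `SymbolicProblem` class, the `intervene` and `sample_variations` helpers, and the example word problem are all hypothetical names invented here for illustration. The sketch only assumes that a problem can be stored as a text template plus an executable computation, so that changing one value yields a new question whose ground-truth answer is recomputed rather than recalled.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SymbolicProblem:
    """Hypothetical container: question text plus an executable solution."""
    template: str                            # question text with {placeholders}
    values: Dict[str, int]                   # symbolic variables and their values
    solve: Callable[[Dict[str, int]], int]   # computes the ground-truth answer

    def render(self) -> str:
        # Fill the placeholders to obtain the natural-language question.
        return self.template.format(**self.values)

    def answer(self) -> int:
        # Execute the symbolic form to obtain the ground-truth answer.
        return self.solve(self.values)


def intervene(problem: SymbolicProblem, name: str, new_value: int) -> SymbolicProblem:
    """Return a copy of the problem with one variable set to a new value."""
    if name not in problem.values:
        raise KeyError(f"unknown variable: {name}")
    new_values = {**problem.values, name: new_value}
    return SymbolicProblem(problem.template, new_values, problem.solve)


def sample_variations(problem: SymbolicProblem, name: str, low: int, high: int,
                      n: int, seed: int = 0) -> List[SymbolicProblem]:
    """Draw n variations by resampling one variable uniformly from [low, high]."""
    rng = random.Random(seed)
    return [intervene(problem, name, rng.randint(low, high)) for _ in range(n)]


# Illustrative problem, assumed for this sketch only.
original = SymbolicProblem(
    template=("Ana has {apples} apples and buys {bags} bags with "
              "{per_bag} apples each. How many apples does she have now?"),
    values={"apples": 3, "bags": 2, "per_bag": 5},
    solve=lambda v: v["apples"] + v["bags"] * v["per_bag"],
)

variant = intervene(original, "bags", 4)
print(original.render(), "->", original.answer())
print(variant.render(), "->", variant.answer())
```

Because the answer is computed from the symbolic form, every sampled variation comes with a correct label, which is what makes it possible to generate many problems that a model cannot answer from memory of the original benchmark item.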


@article{xu2025_2506.15455,
  title={RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation},
  author={Xinnuo Xu and Rachel Lawrence and Kshitij Dubey and Atharva Pandey and Risa Ueno and Fabian Falck and Aditya V. Nori and Rahul Sharma and Amit Sharma and Javier Gonzalez},
  journal={arXiv preprint arXiv:2506.15455},
  year={2025}
}
