Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.12896
Cited By
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
18 February 2025
Eva Sánchez Salido
Julio Gonzalo
Guillermo Marco
ELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks"
33 / 33 papers shown
Title
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
Kaixuan Huang
Jiacheng Guo
Zihao Li
X. Ji
Jiawei Ge
...
Yangsibo Huang
Chi Jin
Xinyun Chen
Chiyuan Zhang
Mengdi Wang
AAML
LRM
153
14
0
10 Feb 2025
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Yaniv Nikankin
Anja Reusch
Aaron Mueller
Yonatan Belinkov
AIFin
LRM
96
32
0
28 Oct 2024
Not All LLM Reasoners Are Created Equal
Arian Hosseini
Alessandro Sordoni
Daniel Toyama
Rameswar Panda
Rishabh Agarwal
LRM
94
15
0
02 Oct 2024
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
Karthik Valmeekam
Kaya Stechly
Subbarao Kambhampati
LLMAG
LRM
ELM
110
53
0
20 Sep 2024
Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges
Vinay Samuel
Yue Zhou
Henry Peng Zou
AAML
51
8
0
16 Sep 2024
Understanding Foundation Models: Are We Back in 1924?
Alan F. Smeaton
AI4CE
58
3
0
11 Sep 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Gemma Team Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
115
858
0
31 Jul 2024
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
Akshara Prabhakar
Thomas Griffiths
R. Thomas McCoy
LRM
81
19
0
01 Jul 2024
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
Bowen Jiang
Yangxinyu Xie
Zhuoqun Hao
Xiaomeng Wang
Tanwi Mallick
Weijie J. Su
Camillo J Taylor
Dan Roth
LRM
106
51
0
16 Jun 2024
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team
Gemma Team Thomas Mesnard
Cassidy Hardin
Robert Dadashi
Surya Bhupatiraju
...
Armand Joulin
Noah Fiedel
Evan Senter
Alek Andreev
Kathleen Kenealy
VLM
LLMAG
191
487
0
13 Mar 2024
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
...
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
OSLM
142
570
0
07 Mar 2024
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
Saurabh Srivastava
B. AnnaroseM
V. AntoP
Shashank Menon
Ajay Sukumar
T. AdwaithSamod
Alan Philipose
Stevin Prince
Sooraj Thomas
ELM
ReLM
LRM
46
54
0
29 Feb 2024
Do Large Language Models Understand Logic or Just Mimick Context?
Junbing Yan
Chengyu Wang
Junyuan Huang
Wei Zhang
ReLM
ELM
LRM
43
5
0
19 Feb 2024
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Simone Balloccu
Patrícia Schmidtová
Mateusz Lango
Ondrej Dusek
SILM
ELM
PILM
69
174
0
06 Feb 2024
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Pengfei Hong
Navonil Majumder
Deepanway Ghosal
Somak Aditya
Rada Mihalcea
Soujanya Poria
LRM
61
4
0
17 Jan 2024
Investigating Data Contamination for Pre-training Language Models
Minhao Jiang
Ken Ziyu Liu
Ming Zhong
Rylan Schaeffer
Siru Ouyang
Jiawei Han
Sanmi Koyejo
64
71
0
11 Jan 2024
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH
ELM
83
667
0
20 Nov 2023
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Shuo Yang
Wei-Lin Chiang
Lianmin Zheng
Joseph E. Gonzalez
Ion Stoica
ALM
53
126
0
08 Nov 2023
NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
Oscar Sainz
Jon Ander Campos
Iker García-Ferrero
Julen Etxaniz
Oier López de Lacalle
Eneko Agirre
65
180
0
27 Oct 2023
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Zhaofeng Wu
Linlu Qiu
Alexis Ross
Ekin Akyürek
Boyuan Chen
Bailin Wang
Najoung Kim
Jacob Andreas
Yoon Kim
LRM
ReLM
159
216
0
05 Jul 2023
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Wenxuan Zhang
Sharifah Mahani Aljunied
Chang Gao
Yew Ken Chia
Lidong Bing
ELM
87
85
0
08 Jun 2023
Faith and Fate: Limits of Transformers on Compositionality
Nouha Dziri
Ximing Lu
Melanie Sclar
Xiang Lorraine Li
Liwei Jian
...
Sean Welleck
Xiang Ren
Allyson Ettinger
Zaïd Harchaoui
Yejin Choi
ReLM
LRM
120
376
0
29 May 2023
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong
Ruixiang Cui
Yiduo Guo
Yaobo Liang
Shuai Lu
Yanlin Wang
Amin Saied
Weizhu Chen
Nan Duan
ALM
ELM
75
532
0
13 Apr 2023
MEGA: Multilingual Evaluation of Generative AI
Kabir Ahuja
Harshita Diddee
Rishav Hada
Millicent Ochieng
Krithika Ramesh
...
T. Ganu
Sameer Segal
Maxamed Axmed
Kalika Bali
Sunayana Sitaram
LM&MA
LRM
ELM
87
282
0
22 Mar 2023
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM
PILM
1.5K
13,182
0
27 Feb 2023
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
Jindong Wang
Xixu Hu
Wenxin Hou
Hao Chen
Runkai Zheng
...
Weirong Ye
Xiubo Geng
Binxing Jiao
Yue Zhang
Xingxu Xie
AI4MH
106
233
0
22 Feb 2023
GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective
Linyi Yang
Shuibai Zhang
Libo Qin
Yafu Li
Yidong Wang
Hanmeng Liu
Jindong Wang
Xingxu Xie
Yue Zhang
ELM
91
81
0
15 Nov 2022
Measure and Improve Robustness in NLP Models: A Survey
Xuezhi Wang
Haohan Wang
Diyi Yang
217
138
0
15 Dec 2021
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Wei Ping
Chejian Xu
Shuohang Wang
Zhe Gan
Yu Cheng
Jianfeng Gao
Ahmed Hassan Awadallah
Yangqiu Song
VLM
ELM
AAML
58
222
0
04 Nov 2021
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
274
4,397
0
27 Oct 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
D. Song
Jacob Steinhardt
ReLM
FaML
163
2,255
0
05 Mar 2021
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
743
41,932
0
28 May 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
584
4,801
0
23 Jan 2020
1