2405.00332
Cited By
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
1 May 2024
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
Will Song
Tiffany Zhao
P. Raja
Dylan Slack
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
ALM
LRM
ELM
Papers citing
"A Careful Examination of Large Language Model Performance on Grade School Arithmetic"
50 / 74 papers shown
Towards Contamination Resistant Benchmarks
Rahmatullah Musawi
Sheng Lu
47
0
0
13 May 2025
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol
Batu El
Mirac Suzgun
Mert Yuksekgonul
J. Zou
ELM
40
0
0
17 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
31
0
0
14 Apr 2025
Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance
Zuoli Tang
Junjie Ou
Kaiqin Hu
Chunwei Wu
Zhaoxin Huan
Chilin Fu
Xiaolu Zhang
Jun Zhou
Chenliang Li
ReLM
LRM
43
0
0
13 Apr 2025
Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition
Rishi Hazra
Gabriele Venturato
Pedro Zuidberg Dos Martires
Luc de Raedt
ReLM
LRM
63
0
0
04 Apr 2025
Generative Evaluation of Complex Reasoning in Large Language Models
Haowei Lin
Xiang Wang
Ruilin Yan
Baizhou Huang
Haotian Ye
Jianhua Zhu
Zihao Wang
James Zou
Jianzhu Ma
Yitao Liang
ReLM
ELM
LRM
234
0
0
03 Apr 2025
Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs
Sifan Li
Yujun Cai
Bryan Hooi
Nanyun Peng
Yansen Wang
31
0
0
03 Apr 2025
Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models
Irtaza Khalid
Amir Masoud Nourollah
Steven Schockaert
LRM
52
0
0
30 Mar 2025
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
Jian Zhang
Zihan Wang
Haiping Zhu
Jun Liu
Qika Lin
Min Zhang
LLMAG
83
1
0
21 Mar 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Zhaofeng Wu
Michihiro Yasunaga
Andrew Cohen
Yoon Kim
Asli Celikyilmaz
Marjan Ghazvininejad
48
2
0
14 Mar 2025
Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
Afrar Jahin
Arif Hassan Zidan
Wei Zhang
Yu Bao
Tianming Liu
LRM
76
1
0
13 Mar 2025
NeurIPS 2023 LLM Efficiency Fine-tuning Competition
Mark Saroufim
Yotam Perlitz
Leshem Choshen
Luca Antiga
Greg Bowyer
...
Ashvini Kumar Jindal
Pawan Kumar Rajpoot
Ankur Parikh
Joe Isaacson
Weiwei Yang
ELM
54
0
0
13 Mar 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM
ELM
67
4
0
07 Mar 2025
Are Large Vision Language Models Good Game Players?
Xinyu Wang
Bohan Zhuang
Qi Wu
MLLM
ELM
LRM
102
4
0
04 Mar 2025
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants
Franck Cappello
Sandeep Madireddy
Robert Underwood
N. Getty
Nicholas Chia
...
M. Rafique
Eliu A. Huerta
Yangqiu Song
Ian Foster
Rick L. Stevens
79
1
0
27 Feb 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
71
0
0
24 Feb 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
65
0
0
23 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
57
2
0
17 Feb 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks
Hieu Minh "Jord" Nguyen
LM&MA
LRM
56
0
0
10 Feb 2025
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Xin Xu
Qiyun Xu
Tong Xiao
Tianhao Chen
Yuchen Yan
Jiaxin Zhang
Shizhe Diao
Can Yang
Yang Wang
ELM
LRM
AI4CE
113
4
0
01 Feb 2025
Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping
Pu Yang
Yunzhen Feng
Ziyuan Chen
Yuhang Wu
Zhuoyuan Li
DiffM
106
0
0
31 Jan 2025
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Samira Abnar
Harshay Shah
Dan Busbridge
Alaaeldin Mohamed Elnouby Ali
J. Susskind
Vimal Thilak
MoE
LRM
44
5
0
28 Jan 2025
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son
Hyunwoo Ko
Dasol Choi
LRM
ReLM
70
0
0
10 Jan 2025
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang
Yuchang Su
Yiming Liu
Xiaohan Wang
James Burgess
...
Josiah Aklilu
Alejandro Lozano
Anjiang Wei
Ludwig Schmidt
Serena Yeung-Levy
64
3
0
06 Jan 2025
Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
Hyunwoo Ko
Guijin Son
Dasol Choi
RALM
LRM
83
9
0
05 Jan 2025
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Xiaobao Wu
Liangming Pan
Yuxi Xie
Ruiwen Zhou
Shuai Zhao
Yubo Ma
Mingzhe Du
Rui Mao
Anh Tuan Luu
William Yang Wang
144
10
0
18 Dec 2024
Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments
Andrii Nikolaiev
Yiannos Stathopoulos
Simone Teufel
LRM
83
0
0
16 Dec 2024
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Angelika Romanou
Negar Foroutan
Anna Sotnikova
Zeming Chen
Sree Harsha Nelaturu
...
Mike Zhang
Imanol Schlag
Marzieh Fadaee
Sara Hooker
Antoine Bosselut
ELM
113
6
0
29 Nov 2024
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Sohee Yang
Nora Kassner
E. Gribovskaya
Sebastian Riedel
Mor Geva
KELM
LRM
ReLM
78
5
0
25 Nov 2024
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie
Yangsibo Huang
Chiyuan Zhang
Da Yu
Xinyun Chen
Bill Yuchen Lin
Bo Li
Badih Ghazi
Ravi Kumar
LRM
55
24
0
30 Oct 2024
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
João Matos
Shan Chen
Siena Placino
Yingya Li
Juan Carlos Climent Pardo
...
Hugo J. W. L. Aerts
Leo Anthony Celi
A. I. Wong
Danielle S. Bitterman
Jack Gallifant
34
0
0
16 Oct 2024
3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications
Eduardo R. Corral-Soto
Yang Liu
Tongtong Cao
Y. Ren
Liu Bingbing
55
0
0
14 Oct 2024
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Jacob Haimes
Cenny Wenner
Kunvar Thaman
Vassil Tashev
Clement Neo
Esben Kran
Jason Schreiber
36
5
0
11 Oct 2024
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Iman Mirzadeh
Keivan Alizadeh
Hooman Shahrokhi
Oncel Tuzel
Samy Bengio
Mehrdad Farajtabar
AIMat
LRM
66
139
0
07 Oct 2024
Not All LLM Reasoners Are Created Equal
Arian Hosseini
Alessandro Sordoni
Daniel Toyama
Rameswar Panda
Rishabh Agarwal
LRM
49
11
0
02 Oct 2024
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Shubham Toshniwal
Wei Du
Ivan Moshkov
Branislav Kisacanin
Alexan Ayrapetyan
Igor Gitman
LRM
26
51
0
02 Oct 2024
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
Stephen Miner
Yoshiki Takashima
Simeng Han
Ferhat Erata
Timos Antonopoulos
R. Piskac
Scott J. Shapiro
LRM
36
3
0
30 Sep 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
58
7
0
30 Sep 2024
Revisiting the Superficial Alignment Hypothesis
Mohit Raghavendra
Vaskar Nath
Sean Hendryx
LRM
25
0
0
27 Sep 2024
Small Language Models: Survey, Measurements, and Insights
Zhenyan Lu
Xiang Li
Dongqi Cai
Rongjie Yi
Fangming Liu
Xiwen Zhang
Nicholas D. Lane
Mengwei Xu
ObjD
LRM
61
37
0
24 Sep 2024
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Zayne Sprague
Fangcong Yin
Juan Diego Rodriguez
Dongwei Jiang
Manya Wadhwa
Prasann Singhal
Xinyu Zhao
Xi Ye
Kyle Mahowald
Greg Durrett
ReLM
LRM
125
89
0
18 Sep 2024
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck
Maximilian Baader
Martin Vechev
ALM
92
0
0
01 Sep 2024
Can Large Language Models Reason? A Characterization via 3-SAT
Rishi Hazra
Gabriele Venturato
Pedro Zuidberg Dos Martires
Luc de Raedt
ELM
ReLM
LRM
38
4
0
13 Aug 2024
A Perspective on Large Language Models, Intelligent Machines, and Knowledge Acquisition
V. Cherkassky
Eng Hock Lee
ELM
41
1
0
13 Aug 2024
Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
L. Lucy
Tal August
Rose E. Wang
Luca Soldaini
Courtney Allison
Kyle Lo
ReLM
LRM
31
3
0
08 Aug 2024
Active Testing of Large Language Model via Multi-Stage Sampling
Yuheng Huang
Jiayang Song
Qiang Hu
Felix Juefei-Xu
Lei Ma
35
2
0
07 Aug 2024
AI-Assisted Generation of Difficult Math Questions
Vedant Shah
Dingli Yu
Kaifeng Lyu
Simon Park
Nan Rosemary Ke
...
Yoshua Bengio
Sanjeev Arora
Anirudh Goyal
53
16
0
30 Jul 2024
Questionable practices in machine learning
Gavin Leech
Juan J. Vazquez
Misha Yagudin
Niclas Kupper
Laurence Aitchison
61
4
0
17 Jul 2024
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang
Bo Li
Peiyuan Zhang
Fanyi Pu
Joshua Adrian Cahyono
...
Shuai Liu
Yuanhan Zhang
Jingkang Yang
Chunyuan Li
Ziwei Liu
99
76
0
17 Jul 2024
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou
Shudong Liu
Maizhen Ning
Wei Liu
Jindong Wang
Derek F. Wong
Xiaowei Huang
Qiufeng Wang
Kaizhu Huang
ELM
LRM
71
25
0
11 Jul 2024