2502.00334
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
1 February 2025
Xin Xu
Qiyun Xu
Tong Xiao
Tianhao Chen
Yuchen Yan
Jiaxin Zhang
Shizhe Diao
Can Yang
Yang Wang
LRM
AI4CE
ELM
Papers citing
"UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models"
50 / 52 papers shown
Title
Superstudent intelligence in thermodynamics
Rebecca Loubet
Pascal Zittlau
Marco Hoffmann
Luisa Vollmer
Sophie Fellenz
Heike Leitte
Fabian Jirasek
Johannes Lenhard
Hans Hasse
ELM
52
0
0
11 Jun 2025
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models
Yinggan Xu
Yue Liu
Zhiqiang Gao
Changnan Peng
Di Luo
LRM
37
0
0
30 May 2025
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
Kun Xiang
Heng Li
Terry Jingchen Zhang
Yinya Huang
Zirong Liu
...
J. N. Han
Hang Xu
Hanhui Li
Mrinmaya Sachan
Xiaodan Liang
LRM
199
0
0
25 May 2025
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Shivam Agarwal
Zimin Zhang
Lifan Yuan
Jiawei Han
Hao Peng
180
8
0
21 May 2025
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Chujie Zheng
Zizhuo Zhang
Beichen Zhang
Runji Lin
Keming Lu
Bowen Yu
Dayiheng Liu
Jingren Zhou
Junyang Lin
LRM
229
77
0
09 Dec 2024
Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents
Raj Jaiswal
Dhruv Jain
Harsh Parimal Popat
Avinash Anand
Abhishek Dharmadhikari
Atharva Marathe
R. Shah
LRM
AI4CE
151
5
0
01 Dec 2024
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Bofei Gao
Feifan Song
Zhiyong Yang
Zefan Cai
Yibo Miao
...
Lei Sha
Yichang Zhang
Xuancheng Ren
Tianyu Liu
Baobao Chang
ELM
LRM
128
66
0
10 Oct 2024
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Shubham Toshniwal
Wei Du
Ivan Moshkov
Branislav Kisacanin
Alexan Ayrapetyan
Igor Gitman
LRM
107
71
0
02 Oct 2024
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
An Yang
Beichen Zhang
Binyuan Hui
Bofei Gao
Bowen Yu
...
Mingfeng Xue
Runji Lin
Tianyu Liu
Xingzhang Ren
Zhenru Zhang
OSLM
LRM
162
321
0
18 Sep 2024
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White
Samuel Dooley
Manley Roberts
Arka Pal
Ben Feuer
...
Tom Goldstein
Willie Neiswanger
Micah Goldblum
ELM
125
20
0
27 Jun 2024
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian
Shunji Wan
Claudia Tang
Youzhi Wang
Xuanming Zhang
Maximillian Chen
Zhou Yu
AAML
93
12
0
25 Jun 2024
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
Yuxuan Tong
Xiwen Zhang
Rui Wang
R. Wu
Junxian He
AIMat
LRM
88
43
0
18 Jun 2024
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Zhen Huang
Zengzhi Wang
Shijie Xia
Xuefeng Li
Haoyang Zou
...
Yuxiang Zheng
Shaoting Zhang
Dahua Lin
Yu Qiao
Pengfei Liu
ELM
LRM
140
43
0
18 Jun 2024
Can LLMs Solve Longer Math Word Problems Better?
Xin Xu
Tong Xiao
Zitong Chao
Zhenya Huang
Can Yang
Yang Wang
197
14
0
23 May 2024
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Hongwei Liu
Zilong Zheng
Yuxuan Qiao
Haodong Duan
Zhiwei Fei
Fengzhe Zhou
Wenwei Zhang
Songyang Zhang
Dahua Lin
Kai-xiang Chen
121
68
0
20 May 2024
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
Aniket Didolkar
Anirudh Goyal
Nan Rosemary Ke
Siyuan Guo
Michal Valko
Timothy Lillicrap
Danilo Jimenez Rezende
Yoshua Bengio
Michael C. Mozer
Sanjeev Arora
LRM
72
30
0
20 May 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI
Aixin Liu
Bei Feng
Bin Wang
Bingxuan Wang
...
Zhuoshu Li
Zihan Wang
Zihui Gu
Zilin Li
Ziwei Xie
MoE
173
500
0
07 May 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
...
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
ALM
LRM
ELM
108
110
0
01 May 2024
Benchmarking Benchmark Leakage in Large Language Models
Ruijie Xu
Zengzhi Wang
Run-Ze Fan
Pengfei Liu
129
54
0
29 Apr 2024
RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert
Valentina Pyatkin
Jacob Morrison
Lester James V. Miranda
Bill Yuchen Lin
...
Sachin Kumar
Tom Zick
Yejin Choi
Noah A. Smith
Hanna Hajishirzi
ALM
200
260
0
20 Mar 2024
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Zhengyang Tang
Xingxing Zhang
Benyou Wang
Furu Wei
ALM
LRM
97
83
0
05 Mar 2024
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
Saurabh Srivastava
Annarose M B
Anto P V
Shashank Menon
Ajay Sukumar
Adwaith Samod T
Alan Philipose
Stevin Prince
Sooraj Thomas
ELM
ReLM
LRM
79
56
0
29 Feb 2024
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Chaoqun He
Renjie Luo
Yuzhuo Bai
Shengding Hu
Zhen Leng Thai
...
Yuxiang Zhang
Jie Liu
Lei Qi
Zhiyuan Liu
Maosong Sun
ELM
AIMat
175
282
0
21 Feb 2024
SciAgent: Tool-augmented Language Models for Scientific Reasoning
Yubo Ma
Zhibin Gou
Junheng Hao
Ruochen Xu
Shuohang Wang
...
Yujiu Yang
Yixin Cao
Aixin Sun
Hany Awadalla
Weizhu Chen
RALM
LRM
LLMAG
135
24
0
18 Feb 2024
Can We Verify Step by Step for Incorrect Answer Detection?
Xin Xu
Shizhe Diao
Can Yang
Yang Wang
LRM
348
15
0
16 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
219
1,289
0
05 Feb 2024
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models
Jinchang Hou
Chang Ao
Haihong Wu
Xiangtao Kong
Zhigang Zheng
...
Chengming Li
Xiping Hu
Ruifeng Xu
Shiwen Ni
Min Yang
AI4Ed
ELM
71
6
0
29 Jan 2024
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Damai Dai
Chengqi Deng
Chenggang Zhao
R. X. Xu
Huazuo Gao
...
Panpan Huang
Fuli Luo
Chong Ruan
Zhifang Sui
W. Liang
MoE
127
321
0
11 Jan 2024
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH
ELM
183
738
0
20 Nov 2023
Llemma: An Open Language Model For Mathematics
Zhangir Azerbayev
Hailey Schoelkopf
Keiran Paster
Marco Dos Santos
Stephen Marcus McAleer
Albert Q. Jiang
Jia Deng
Stella Biderman
Sean Welleck
CLL
130
303
0
16 Oct 2023
NEWTON: Are Large Language Models Capable of Physical Reasoning?
Yi Ru Wang
Jiafei Duan
Dieter Fox
S. Srinivasa
ELM
LRM
AIMat
ReLM
135
35
0
10 Oct 2023
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu
Hritik Bansal
Tony Xia
Jiacheng Liu
Chun-yue Li
Hannaneh Hajishirzi
Hao Cheng
Kai-Wei Chang
Michel Galley
Jianfeng Gao
LRM
MLLM
175
669
0
03 Oct 2023
Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level
Jingzhe Ding
Yan Cen
Xinyuan Wei
AI4CE
98
11
0
15 Sep 2023
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Xiaoxuan Wang
Ziniu Hu
Pan Lu
Yanqiao Zhu
Jieyu Zhang
Satyen Subramaniam
Arjun R. Loomba
Shichang Zhang
Yizhou Sun
Wei Wang
ELM
LRM
76
114
0
20 Jul 2023
CMMLU: Measuring massive multitask language understanding in Chinese
Haonan Li
Yixuan Zhang
Fajri Koto
Yifei Yang
Hai Zhao
Yeyun Gong
Nan Duan
Tim Baldwin
ALM
ELM
142
274
0
15 Jun 2023
Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models
Daman Arora
H. Singh
Mausam
ELM
LRM
138
55
0
24 May 2023
TheoremQA: A Theorem-driven Question Answering dataset
Wenhu Chen
Ming Yin
Max Ku
Pan Lu
Yixin Wan
Xueguang Ma
Jianyu Xu
Xinyi Wang
Tony Xia
AIMat
133
140
0
21 May 2023
Evaluating the Performance of Large Language Models on GAOKAO Benchmark
Xiaotian Zhang
Chun-yan Li
Yi Zong
Zhengyu Ying
Liang He
Xipeng Qiu
ALM
ELM
125
115
0
21 May 2023
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
Yuzhen Huang
Yuzhuo Bai
Zhihao Zhu
Junlei Zhang
Jinghan Zhang
...
Yikai Zhang
Jiayi Lei
Yao Fu
Maosong Sun
Junxian He
ELM
LRM
156
552
0
15 May 2023
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong
Ruixiang Cui
Yiduo Guo
Yaobo Liang
Shuai Lu
Yanlin Wang
Amin Saied
Weizhu Chen
Nan Duan
ALM
ELM
138
550
0
13 Apr 2023
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Wenhu Chen
Xueguang Ma
Xinyi Wang
William W. Cohen
ReLM
ReCod
LRM
394
829
0
22 Nov 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
304
1,303
0
20 Sep 2022
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz
Anders Andreassen
David Dohan
Ethan Dyer
Henryk Michalewski
...
Theo Gutman-Solo
Yuhuai Wu
Behnam Neyshabur
Guy Gur-Ari
Vedant Misra
ReLM
ELM
LRM
306
866
0
29 Jun 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
751
3,762
0
21 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
1.1K
9,827
0
28 Jan 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
445
4,610
0
27 Oct 2021
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
Jiaqi Chen
Jianheng Tang
Jinghui Qin
Xiaodan Liang
Lingbo Liu
Eric Xing
Liang Lin
AIMat
119
188
0
30 May 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
Basel Alomair
Jacob Steinhardt
ReLM
FaML
259
2,415
0
05 Mar 2021
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
524
4,587
0
07 Sep 2020
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk
Rowan Zellers
Ronan Le Bras
Jianfeng Gao
Yejin Choi
OOD
LRM
394
1,854
0
26 Nov 2019