Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2405.12209
Cited By
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
20 May 2024
Hongwei Liu
Zilong Zheng
Yuxuan Qiao
Haodong Duan
Zhiwei Fei
Fengzhe Zhou
Wenwei Zhang
Songyang Zhang
Dahua Lin
Kai-xiang Chen
Re-assign community
ArXiv (abs)
PDF
HTML
Github (100★)
Papers citing
"MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark"
17 / 17 papers shown
Title
Evaluation of LLMs for mathematical problem solving
Ruonan Wang
Runxi Wang
Yunwen Shen
Chengfeng Wu
Qinglin Zhou
Rohitash Chandra
ELM
LRM
46
0
0
30 May 2025
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Xiang Li
Haiyang Yu
Xinghua Zhang
Ziyang Huang
Shizhu He
Kang Liu
Jun Zhao
Fei Huang
Yongbin Li
LRM
32
0
0
29 May 2025
DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures
Yu He
Yingxi Li
Colin White
Ellen Vitercik
ELM
LRM
24
0
0
29 May 2025
ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
M. Shalyt
Rotem Elimelech
I. Kaminer
20
0
0
28 May 2025
The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants
Yiqun Zhang
Hao Li
Chenxu Wang
L. Chen
Qiaosheng Zhang
...
Xinrun Wang
Jia Xu
Lei Bai
Wanli Ouyang
Shuyue Hu
77
0
0
26 May 2025
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Zhikai Wang
Jiashuo Sun
Weinan Zhang
Zhiqiang Hu
Xin Li
F. Wang
Deli Zhao
VLM
LRM
186
1
0
24 Apr 2025
Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute
Jianhao Chen
Zishuo Xun
Bocheng Zhou
Han Qi
Qiaosheng Zhang
...
Wei Hu
Yuzhong Qu
W. Ouyang
Wanli Ouyang
Shuyue Hu
191
2
0
01 Apr 2025
Efficient Inference for Large Reasoning Models: A Survey
Yi Liu
Jiaying Wu
Yufei He
Hongcheng Gao
Hongyu Chen
Baolong Bi
Jiaheng Zhang
Zhiqi Huang
Bryan Hooi
Bryan Hooi
LLMAG
LRM
170
17
0
29 Mar 2025
Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps
Yu Cui
Bryan Hooi
Yujun Cai
Yiwei Wang
LRM
82
3
0
25 Mar 2025
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Chuan Qin
Xiusi Chen
Chengrui Wang
Pengmin Wu
Xi Chen
...
Han Wu
Chong Li
Yuanchun Zhou
H. Xiong
Hengshu Zhu
ELM
91
2
0
12 Mar 2025
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
Peijie Wang
Zhong-Zhi Li
Fei Yin
Xin Yang
Dekang Ran
Cheng-Lin Liu
LRM
136
11
0
28 Feb 2025
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Haoyang Li
Xuejia Chen
Zhanchao Xu
Darian Li
Nicole Hu
...
Yongbin Li
Luyu Qiu
C. Zhang
Qing Li
Lei Chen
ELM
LRM
114
1
0
16 Feb 2025
One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
Hai-Tao Zheng
Jiayi Kuang
Haojing Huang
Zhikun Xu
Xinnian Liang
...
Jue Chen
Chao Qu
Ying Shen
Hai-Tao Zheng
Philip S. Yu
LRM
142
2
0
12 Feb 2025
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Xin Xu
Qiyun Xu
Tong Xiao
Tianhao Chen
Yuchen Yan
Jiaxin Zhang
Shizhe Diao
Can Yang
Yang Wang
LRM
AI4CE
ELM
276
8
0
01 Feb 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Beichen Zhang
Yuhong Liu
Xiaoyi Dong
Yuhang Zang
Pan Zhang
Haodong Duan
Yuhang Cao
Dahua Lin
Jinqiao Wang
LRM
ReLM
142
6
0
06 Jan 2025
RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards
Xinze Li
Sen Mei
Zhenghao Liu
Yukun Yan
Shuo Wang
...
Haotian Chen
Ge Yu
Zhiyuan Liu
Maosong Sun
Chenyan Xiong
106
12
0
17 Oct 2024
Baichuan 2: Open Large-scale Language Models
Ai Ming Yang
Bin Xiao
Bingning Wang
Borong Zhang
Ce Bian
...
Youxin Jiang
Yuchen Gao
Yupeng Zhang
Guosheng Dong
Zhiying Wu
ELM
LRM
328
755
0
19 Sep 2023
1