ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2404.05692
  4. Cited By
Evaluating Mathematical Reasoning Beyond Accuracy
v1v2 (latest)

Evaluating Mathematical Reasoning Beyond Accuracy

8 April 2024
Shijie Xia
Xuefeng Li
Yixin Liu
Tongshuang Wu
Pengfei Liu
    LRMReLM
ArXiv (abs)PDFHTML

Papers citing "Evaluating Mathematical Reasoning Beyond Accuracy"

42 / 42 papers shown
Title
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Yi Chen
Yuying Ge
Rui Wang
Yixiao Ge
Junhao Cheng
Ying Shan
Xihui Liu
OffRLVLMLRM
26
0
0
19 Jun 2025
Know What You Don't Know: Uncertainty Calibration of Process Reward Models
Know What You Don't Know: Uncertainty Calibration of Process Reward Models
Young-Jin Park
Kristjan Greenewald
Kaveh Alim
Hao Wang
Navid Azizan
LRM
60
0
0
11 Jun 2025
FreePRM: Training Process Reward Models Without Ground Truth Process Labels
FreePRM: Training Process Reward Models Without Ground Truth Process Labels
Lin Sun
C. Liu
Xiaofeng Ma
Tao Yang
Weijia Lu
Ning Wu
68
0
0
04 Jun 2025
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Hao Sun
Zile Qiao
Jiayan Guo
Xuanbo Fan
Yingyan Hou
Yong Jiang
Pengjun Xie
Yan Zhang
Fei Huang
Jingren Zhou
OffRL
135
12
0
07 May 2025
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
Chenrui Fan
Ming Li
Lichao Sun
Tianyi Zhou
LRM
136
12
0
09 Apr 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELMLRM
164
4
0
01 Apr 2025
A Survey on Mathematical Reasoning and Optimization with Large Language Models
A Survey on Mathematical Reasoning and Optimization with Large Language Models
Ali Forootani
OffRLLRMAI4CE
120
1
0
22 Mar 2025
Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
Priscylla Silva
Evandro Costa
LRM
77
0
0
18 Mar 2025
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Steve Yang
C. Wang
Yidong Wang
Xiaotao Gu
Minlie Huang
J. Tang
LRMLLMAG
124
2
0
13 Mar 2025
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Ayesha Ishaq
Jean Lahoud
Ketan More
Omkar Thawakar
Ritesh Thawkar
...
Fahad Shahbaz Khan
Hisham Cholakkal
Ivan Laptev
Rao Muhammad Anwer
Salman Khan
LRM
123
4
0
13 Mar 2025
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Tong Wei
Yijun Yang
Junliang Xing
Yuanchun Shi
Zongqing Lu
Deheng Ye
OffRLLRM
83
2
0
11 Mar 2025
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
Jiaxin Ai
Pengfei Zhou
Zhaopan Xu
Ming Li
Fanrui Zhang
...
Jianwen Sun
Yukang Feng
Baojin Huang
Zhongyuan Wang
Kai Zhang
ELM
479
1
0
09 Mar 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Yancheng He
Shilong Li
Jing Liu
Weixun Wang
Xingyuan Bu
...
Zhongyuan Peng
Zhenru Zhang
Zhicheng Zheng
Wenbo Su
Bo Zheng
ELMLRM
168
17
0
26 Feb 2025
Evaluating Robustness of Reward Models for Mathematical Reasoning
Evaluating Robustness of Reward Models for Mathematical Reasoning
Sunghwan Kim
Dongjin Kang
Taeyoon Kwon
Hyungjoo Chae
Jungsoo Won
Dongha Lee
Jinyoung Yeo
76
7
0
02 Oct 2024
Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks
Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks
Huanxuan Liao
Shizhu He
Yao Xu
Yuanzhe Zhang
Kang Liu
Jun Zhao
LRM
165
4
0
20 Sep 2024
BeHonest: Benchmarking Honesty in Large Language Models
BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern
Zhulin Hu
Yuqing Yang
Ethan Chern
Yuan Guo
Jiahe Jin
Binjie Wang
Pengfei Liu
HILMALM
143
5
0
19 Jun 2024
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Zhen Huang
Zengzhi Wang
Shijie Xia
Xuefeng Li
Haoyang Zou
...
Yuxiang Zheng
Shaoting Zhang
Dahua Lin
Yu Qiao
Pengfei Liu
ELMLRM
138
43
0
18 Jun 2024
Can We Verify Step by Step for Incorrect Answer Detection?
Can We Verify Step by Step for Incorrect Answer Detection?
Xin Xu
Shizhe Diao
Can Yang
Yang Wang
LRM
330
15
0
16 Feb 2024
Evaluating LLMs' Mathematical and Coding Competency through
  Ontology-guided Interventions
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Pengfei Hong
Navonil Majumder
Deepanway Ghosal
Somak Aditya
Rada Mihalcea
Soujanya Poria
LRM
99
5
0
17 Jan 2024
Augmenting Math Word Problems via Iterative Question Composing
Augmenting Math Word Problems via Iterative Question Composing
Haoxiong Liu
Yifan Zhang
Yifan Luo
Andrew Chi-Chih Yao
SyDaLRM
151
46
0
17 Jan 2024
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
Zhongshen Zeng
Pengguang Chen
Shu Liu
Haiyun Jiang
Jiaya Jia
ReLMELMLRM
81
23
0
28 Dec 2023
LLMs cannot find reasoning errors, but can correct them given the error
  location
LLMs cannot find reasoning errors, but can correct them given the error location
Gladys Tyen
Hassan Mansoor
Victor Carbune
Peter Chen
Tony Mak
LRM
148
79
0
14 Nov 2023
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai
Xuehai Pan
Ruiyang Sun
Jiaming Ji
Xinbo Xu
Mickel Liu
Yizhou Wang
Yaodong Yang
141
364
0
19 Oct 2023
Llemma: An Open Language Model For Mathematics
Llemma: An Open Language Model For Mathematics
Zhangir Azerbayev
Hailey Schoelkopf
Keiran Paster
Marco Dos Santos
Stephen Marcus McAleer
Albert Q. Jiang
Jia Deng
Stella Biderman
Sean Welleck
CLL
126
303
0
16 Oct 2023
Let's reward step by step: Step-Level reward model as the Navigators for
  Reasoning
Let's reward step by step: Step-Level reward model as the Navigators for Reasoning
Qianli Ma
Haotian Zhou
Tingkai Liu
Jianbo Yuan
Pengfei Liu
Yang You
Hongxia Yang
LRM
95
54
0
16 Oct 2023
Mistral 7B
Mistral 7B
Albert Q. Jiang
Alexandre Sablayrolles
A. Mensch
Chris Bamford
Devendra Singh Chaplot
...
Teven Le Scao
Thibaut Lavril
Thomas Wang
Timothée Lacroix
William El Sayed
MoELRM
157
2,263
0
10 Oct 2023
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language
  Models
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
L. Yu
Weisen Jiang
Han Shi
Jincheng Yu
Zhengying Liu
Yu Zhang
James T. Kwok
Zheng Li
Adrian Weller
Weiyang Liu
OSLMLRM
122
395
0
21 Sep 2023
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Haipeng Luo
Qingfeng Sun
Can Xu
Pu Zhao
Jian-Guang Lou
...
Xiubo Geng
Qingwei Lin
Shifeng Chen
Yansong Tang
Dongmei Zhang
LRMOSLM
303
468
0
18 Aug 2023
ARB: Advanced Reasoning Benchmark for Large Language Models
ARB: Advanced Reasoning Benchmark for Large Language Models
Tomohiro Sawada
Daniel Paleka
Alexander Havrilla
Pranav Tadepalli
Paula Vidas
Alexander Kranias
John J. Nay
Kshitij Gupta
Aran Komatsuzaki
ELMLRM
81
39
0
25 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MHALM
537
12,130
0
18 Jul 2023
Interpretable Math Word Problem Solution Generation Via Step-by-step
  Planning
Interpretable Math Word Problem Solution Generation Via Step-by-step Planning
Mengxue Zhang
Zichao Wang
Zhichao Yang
Weiqi Feng
Andrew Lan
LRM
69
18
0
01 Jun 2023
Let's Verify Step by Step
Let's Verify Step by Step
Hunter Lightman
V. Kosaraju
Yura Burda
Harrison Edwards
Bowen Baker
Teddy Lee
Jan Leike
John Schulman
Ilya Sutskever
K. Cobbe
ALMOffRLLRM
242
1,241
0
31 May 2023
ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
Archiki Prasad
Swarnadeep Saha
Xiang Zhou
Joey Tianyi Zhou
LRM
114
50
0
21 Apr 2023
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.6K
14,852
0
15 Mar 2023
A Survey of Deep Learning for Mathematical Reasoning
A Survey of Deep Learning for Mathematical Reasoning
Pan Lu
Liang Qiu
Wenhao Yu
Sean Welleck
Kai-Wei Chang
ReLMLRM
133
150
0
20 Dec 2022
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
O. Yu. Golovneva
Moya Chen
Spencer Poff
Martin Corredor
Luke Zettlemoyer
Maryam Fazel-Zarandi
Asli Celikyilmaz
ReLMLRM
104
152
0
15 Dec 2022
Solving math word problems with process- and outcome-based feedback
Solving math word problems with process- and outcome-based feedback
J. Uesato
Nate Kushman
Ramana Kumar
Francis Song
Noah Y. Siegel
L. Wang
Antonia Creswell
G. Irving
I. Higgins
FaMLReLMAIMatLRM
135
362
0
25 Nov 2022
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of
  Chain-of-Thought
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Abulhair Saparov
He He
ELMLRMReLM
301
315
0
03 Oct 2022
Solving Quantitative Reasoning Problems with Language Models
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz
Anders Andreassen
David Dohan
Ethan Dyer
Henryk Michalewski
...
Theo Gutman-Solo
Yuhuai Wu
Behnam Neyshabur
Guy Gur-Ari
Vedant Misra
ReLMELMLRM
230
865
0
29 Jun 2022
Training Verifiers to Solve Math Word Problems
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLMOffRLLRM
425
4,609
0
27 Oct 2021
A Study of Automatic Metrics for the Evaluation of Natural Language
  Explanations
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
Miruna Clinciu
Arash Eshghi
H. Hastie
105
57
0
15 Mar 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
Basel Alomair
Jacob Steinhardt
ReLMFaML
239
2,415
0
05 Mar 2021
1