MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

17 February 2025
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
    LRM

Papers citing "MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs"

49 / 49 papers shown
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
ReLM
VLM
OffRL
AI4TS
LRM
370
1,692
0
22 Jan 2025
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Iman Mirzadeh
Keivan Alizadeh
Hooman Shahrokhi
Oncel Tuzel
Samy Bengio
Mehrdad Farajtabar
AIMat
LRM
100
176
0
07 Oct 2024
Models Can and Should Embrace the Communicative Nature of Human-Generated Math
Sasha Boguraev
Ben Lipkin
Leonie Weissweiler
Kyle Mahowald
88
1
0
25 Sep 2024
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
Tian Ye
Zicheng Xu
Yuanzhi Li
Zeyuan Allen-Zhu
ReLM
LRM
51
54
0
29 Jul 2024
Reliable Reasoning Beyond Natural Language
Nasim Borazjanizadeh
Steven T Piantadosi
LRM
ReLM
48
5
0
16 Jul 2024
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
Akshara Prabhakar
Thomas Griffiths
R. Thomas McCoy
LRM
88
20
0
01 Jul 2024
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
Shubhra Mishra
Gabriel Poesia
Belinda Mo
Noah D. Goodman
64
3
0
01 Jul 2024
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
Bowen Jiang
Yangxinyu Xie
Zhuoqun Hao
Xiaomeng Wang
Tanwi Mallick
Weijie J. Su
Camillo J Taylor
Dan Roth
LRM
106
51
0
16 Jun 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
...
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
ALM
LRM
ELM
85
105
0
01 May 2024
Many-Shot In-Context Learning
Rishabh Agarwal
Avi Singh
Lei M. Zhang
Bernd Bohnet
Luis Rosias
...
John D. Co-Reyes
Eric Chu
Feryal M. P. Behbahani
Aleksandra Faust
Hugo Larochelle
ReLM
OffRL
BDL
86
115
0
17 Apr 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Zhiqing Sun
Longhui Yu
Yikang Shen
Weiyang Liu
Yiming Yang
Sean Welleck
Chuang Gan
79
68
0
14 Mar 2024
Case-Based or Rule-Based: How Do Transformers Do the Math?
Yi Hu
Xiaojuan Tang
Haotong Yang
Muhan Zhang
LRM
72
25
0
27 Feb 2024
Let's Learn Step by Step: Enhancing In-Context Learning Ability with Curriculum Learning
Yinpeng Liu
Jiawei Liu
Xiang Shi
Qikai Cheng
Yong Huang
Wei Lu
59
29
0
16 Feb 2024
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
Yinya Huang
Xiaohan Lin
Zhengying Liu
Qingxing Cao
Huajian Xin
Haiming Wang
Zhenguo Li
Linqi Song
Xiaodan Liang
ALM
80
38
0
14 Feb 2024
Premise Order Matters in Reasoning with Large Language Models
Xinyun Chen
Ryan A. Chi
Xuezhi Wang
Denny Zhou
ReLM
LRM
130
31
0
14 Feb 2024
Limits of Transformer Language Models on Learning to Compose Algorithms
Jonathan Thomm
Aleksandar Terzić
Giacomo Camposampiero
Michael Hersche
Bernhard Schölkopf
Abbas Rahimi
92
7
0
08 Feb 2024
Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
Andreas Opedal
Alessandro Stolfo
Haruki Shirakami
Ying Jiao
Ryan Cotterell
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
57
16
0
31 Jan 2024
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Peter Hase
Mohit Bansal
Peter Clark
Sarah Wiegreffe
105
33
0
12 Jan 2024
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns
Pavel Izmailov
Jan Hendrik Kirchner
Bowen Baker
Leo Gao
...
Adrien Ecoffet
Manas Joglekar
Jan Leike
Ilya Sutskever
Jeff Wu
ELM
88
291
0
14 Dec 2023
Investigating Data Contamination in Modern Benchmarks for Large Language Models
Chunyuan Deng
Yilun Zhao
Xiangru Tang
Mark B. Gerstein
Arman Cohan
AAML
ELM
64
59
0
16 Nov 2023
A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models
Tiwalayo Eisape
MH Tessler
Ishita Dasgupta
Fei Sha
Sjoerd van Steenkiste
Tal Linzen
ReLM
LRM
108
10
0
01 Nov 2023
What's In My Big Data?
Yanai Elazar
Akshita Bhagia
Ian H. Magnusson
Abhilasha Ravichander
Dustin Schwenk
...
Luca Soldaini
Sameer Singh
Hanna Hajishirzi
Noah A. Smith
Jesse Dodge
42
95
0
31 Oct 2023
NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
Oscar Sainz
Jon Ander Campos
Iker García-Ferrero
Julen Etxaniz
Oier López de Lacalle
Eneko Agirre
69
180
0
27 Oct 2023
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
L. Yu
Weisen Jiang
Han Shi
Jincheng Yu
Zhengying Liu
Yu Zhang
James T. Kwok
Zheng Li
Adrian Weller
Weiyang Liu
OSLM
LRM
98
386
0
21 Sep 2023
Lost in the Middle: How Language Models Use Long Contexts
Nelson F. Liu
Kevin Lin
John Hewitt
Ashwin Paranjape
Michele Bevilacqua
Fabio Petroni
Percy Liang
RALM
106
1,600
0
06 Jul 2023
World Models for Math Story Problems
Andreas Opedal
Niklas Stoehr
Abulhair Saparov
Mrinmaya Sachan
ReLM
74
12
0
07 Jun 2023
Faith and Fate: Limits of Transformers on Compositionality
Nouha Dziri
Ximing Lu
Melanie Sclar
Xiang Lorraine Li
Liwei Jiang
...
Sean Welleck
Xiang Ren
Allyson Ettinger
Zaïd Harchaoui
Yejin Choi
ReLM
LRM
133
377
0
29 May 2023
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Abulhair Saparov
Richard Yuanzhe Pang
Vishakh Padmakumar
Nitish Joshi
Seyed Mehran Kazemi
Najoung Kim
He He
ELM
LRM
73
94
0
24 May 2023
Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks
Alon Jacovi
Avi Caciularu
Omer Goldman
Yoav Goldberg
47
105
0
17 May 2023
How Do In-Context Examples Affect Compositional Generalization?
Shengnan An
Zeqi Lin
Qiang Fu
B. Chen
Nanning Zheng
Jian-Guang Lou
Dongmei Zhang
75
54
0
08 May 2023
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Miles Turpin
Julian Michael
Ethan Perez
Sam Bowman
ReLM
LRM
80
431
0
07 May 2023
An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)
Paulo Shakarian
Abhinav Koyyalamudi
Noel Ngu
Lakshmivihari Mareedu
74
66
0
23 Feb 2023
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Ana Brassard
Masashi Yoshikawa
Keisuke Sakaguchi
Kentaro Inui
CoGe
51
11
0
15 Feb 2023
Large Language Models Can Be Easily Distracted by Irrelevant Context
Freda Shi
Xinyun Chen
Kanishka Misra
Nathan Scales
David Dohan
Ed H. Chi
Nathanael Schärli
Denny Zhou
ReLM
RALM
LRM
101
587
0
31 Jan 2023
Measuring Progress on Scalable Oversight for Large Language Models
Sam Bowman
Jeeyoon Hyun
Ethan Perez
Edwin Chen
Craig Pettit
...
Tristan Hume
Yuntao Bai
Zac Hatfield-Dodds
Benjamin Mann
Jared Kaplan
ALM
ELM
72
129
0
04 Nov 2022
A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models
Alessandro Stolfo
Zhijing Jin
Kumar Shridhar
Bernhard Schölkopf
Mrinmaya Sachan
ELM
OOD
LRM
69
66
0
21 Oct 2022
Transformers Learn Shortcuts to Automata
Bingbin Liu
Jordan T. Ash
Surbhi Goel
A. Krishnamurthy
Cyril Zhang
OffRL
LRM
131
175
0
19 Oct 2022
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Abulhair Saparov
He He
ELM
LRM
ReLM
231
305
0
03 Oct 2022
Complexity-Based Prompting for Multi-Step Reasoning
Yao Fu
Hao-Chun Peng
Ashish Sabharwal
Peter Clark
Tushar Khot
ReLM
LRM
215
436
0
03 Oct 2022
Exploring Length Generalization in Large Language Models
Cem Anil
Yuhuai Wu
Anders Andreassen
Aitor Lewkowycz
Vedant Misra
V. Ramasesh
Ambrose Slone
Guy Gur-Ari
Ethan Dyer
Behnam Neyshabur
ReLM
LRM
87
169
0
11 Jul 2022
Unveiling Transformers with LEGO: a synthetic reasoning task
Yi Zhang
A. Backurs
Sébastien Bubeck
Ronen Eldan
Suriya Gunasekar
Tal Wagner
LRM
90
91
0
09 Jun 2022
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Sewon Min
Xinxi Lyu
Ari Holtzman
Mikel Artetxe
M. Lewis
Hannaneh Hajishirzi
Luke Zettlemoyer
LLMAG
LRM
163
1,485
0
25 Feb 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
814
9,387
0
28 Jan 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
285
4,408
0
27 Oct 2021
Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks
Avi Schwarzschild
Eitan Borgnia
Arjun Gupta
Furong Huang
U. Vishkin
Micah Goldblum
Tom Goldstein
74
75
0
08 Jun 2021
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu
Max Bartolo
Alastair Moore
Sebastian Riedel
Pontus Stenetorp
AILaw
LRM
403
1,185
0
18 Apr 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
118
446
0
18 Apr 2021
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela
Max Bartolo
Yixin Nie
Divyansh Kaushik
Atticus Geiger
...
Pontus Stenetorp
Robin Jia
Joey Tianyi Zhou
Christopher Potts
Adina Williams
201
407
0
07 Apr 2021
Are NLP Models really able to Solve Simple Math Word Problems?
Arkil Patel
S. Bhattamishra
Navin Goyal
ReLM
LRM
89
837
0
12 Mar 2021