ResearchTrend.AI
arXiv:2503.05891

MastermindEval: A Simple But Scalable Reasoning Benchmark

7 March 2025
Jonas Golde
Patrick Haller
Fabio Barth
Alan Akbik
LRM, ReLM, ELM

Papers citing "MastermindEval: A Simple But Scalable Reasoning Benchmark"

22 / 22 papers shown
Title
Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities
Weixiang Zhao
Xingyu Sui
Jiahe Guo
Yulin Hu
Yang Deng
Yanyan Zhao
Bing Qin
Wanxiang Che
Tat-Seng Chua
Ting Liu
ELM, LRM
123
9
0
23 Mar 2025
GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Wenye Lin
Jonathan Roberts
Yunhan Yang
Samuel Albanie
Zongqing Lu
Kai Han
LRM, ELM
128
1
0
18 Dec 2024
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui
Yiming Liu
Jiale Cheng
Xiaotao Gu
Xiao-Yang Liu
Hongning Wang
Yuxiao Dong
Jie Tang
Minlie Huang
ELM, LLMAG, LRM
97
7
0
28 Aug 2024
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell
Jaehoon Lee
Kelvin Xu
Aviral Kumar
LRM
245
702
0
06 Aug 2024
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Bahare Fatemi
Mehran Kazemi
Anton Tsitsulin
Karishma Malkan
Jinyeong Yim
John Palowitch
Sungyong Seo
Jonathan J. Halcrow
Bryan Perozzi
LRM
100
39
0
13 Jun 2024
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
Mihir Parmar
Nisarg Patel
Neeraj Varshney
Mutsumi Nakamura
Man Luo
Santosh Mashetty
Arindam Mitra
Chitta Baral
LRM, ReLM, ELM
213
31
0
23 Apr 2024
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Taicheng Guo
Preslav Nakov
Yaqi Wang
Ruidi Chang
Shichao Pei
Nitesh Chawla
Olaf Wiest
Xiangliang Zhang
LLMAG, LM&Ro, AI4CE, LRM
164
333
0
21 Jan 2024
The Rise and Potential of Large Language Model Based Agents: A Survey
Zhiheng Xi
Wenxiang Chen
Xin Guo
Wei He
Yiwen Ding
...
Wenjuan Qin
Yongyan Zheng
Xipeng Qiu
Xuanjing Huang
Tao Gui
LM&MA, LM&Ro, 3DV, AI4CE
174
958
0
14 Sep 2023
BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information
Mehran Kazemi
Quan Yuan
Deepti Bhatia
Najoung Kim
Xin Xu
Vaiva Imbrasaite
Deepak Ramachandran
LRM
100
50
0
13 Jun 2023
Faith and Fate: Limits of Transformers on Compositionality
Nouha Dziri
Ximing Lu
Melanie Sclar
Xiang Lorraine Li
Liwei Jiang
...
Sean Welleck
Xiang Ren
Allyson Ettinger
Zaïd Harchaoui
Yejin Choi
ReLM, LRM
201
387
0
29 May 2023
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun
Nathan Scales
Nathanael Schärli
Sebastian Gehrmann
Yi Tay
...
Aakanksha Chowdhery
Quoc V. Le
Ed H. Chi
Denny Zhou
Jason W. Wei
ALM, ELM, LRM, ReLM
286
1,143
0
17 Oct 2022
FOLIO: Natural Language Reasoning with First-Order Logic
Simeng Han
Hailey Schoelkopf
Yilun Zhao
Zhenting Qi
Martin Riddell
...
Yingbo Zhou
Caiming Xiong
Rex Ying
Arman Cohan
Dragomir R. Radev
ReLM, LRM
131
109
0
02 Sep 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro, LRM, AI4CE, ReLM
1.0K
9,796
0
28 Jan 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM, OffRL, LRM
425
4,608
0
27 Oct 2021
SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning
Roshanak Mirzaee
Hossein Rajaby Faghihi
Qiang Ning
Parisa Kordjamshidi
56
83
0
12 Apr 2021
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva
Daniel Khashabi
Elad Segal
Tushar Khot
Dan Roth
Jonathan Berant
RALM
372
742
0
06 Jan 2021
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
Jian Liu
Leyang Cui
Hanmeng Liu
Dandan Huang
Yile Wang
Yue Zhang
RALM
137
382
0
16 Jul 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
1.1K
42,687
0
28 May 2020
ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
Weihao Yu
Zihang Jiang
Yanfei Dong
Jiashi Feng
LRM
173
255
0
11 Feb 2020
WIQA: A dataset for "What if..." reasoning over procedural text
Niket Tandon
Bhavana Dalvi
Keisuke Sakaguchi
Antoine Bosselut
Peter Clark
83
102
0
10 Sep 2019
CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text
Koustuv Sinha
Shagun Sodhani
Jin Dong
Joelle Pineau
William L. Hamilton
91
211
0
16 Aug 2019
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor
Jonathan Herzig
Nicholas Lourie
Jonathan Berant
RALM
172
1,754
0
02 Nov 2018