L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution


28 March 2025
Simeng Sun
Cheng-Ping Hsieh
Faisal Ladhak
Erik Arakelyan
Santiago Akle Serano
Boris Ginsburg
    ReLM
    ELM
    LRM

Papers citing "L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution"

38 / 38 papers shown
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
Yang Zhou
Hongyi Liu
Zhuoming Chen
Yuandong Tian
Beidi Chen
LRM
88
11
0
07 Feb 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
ReLM
VLM
OffRL
AI4TS
LRM
328
1,641
0
22 Jan 2025
Artificial Expert Intelligence through PAC-reasoning
Shai Shalev-Shwartz
Amnon Shashua
Gal Beniamini
Yoav Levine
Or Sharir
Noam Wies
Ido Ben-Shaul
Tomer Nussbaum
Shir Granot Peled
LRM
82
1
0
03 Dec 2024
GPT-4o System Card
OpenAI
Aaron Hurst
Adam Lerer
Adam P. Goucher
...
Yuchen He
Yuchen Zhang
Yujia Jin
Yunxing Dai
Yury Malkov
MLLM
170
893
0
25 Oct 2024
FLARE: Faithful Logic-Aided Reasoning and Exploration
Erik Arakelyan
Pasquale Minervini
Pat Verga
Patrick Lewis
Isabelle Augenstein
ReLM
LRM
121
2
0
14 Oct 2024
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Iman Mirzadeh
Keivan Alizadeh
Hooman Shahrokhi
Oncel Tuzel
Samy Bengio
Mehrdad Farajtabar
AIMat
LRM
92
173
0
07 Oct 2024
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Ippei Fujisawa
Sensho Nobe
Hiroki Seto
Rina Onda
Yoshiaki Uchida
Hiroki Ikoma
Pei-Chun Chien
Ryota Kanai
LRM
57
4
0
04 Oct 2024
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui
Yiming Liu
Jiale Cheng
Xiaotao Gu
Xiao-Yang Liu
Hongning Wang
Yuxiao Dong
Jie Tang
Minlie Huang
ELM
LLMAG
LRM
58
7
0
28 Aug 2024
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Xinyi Chen
Baohao Liao
Jirui Qi
Panagiotis Eustratiadis
Christof Monz
Arianna Bisazza
Maarten de Rijke
ALM
ELM
LRM
56
6
0
28 Jun 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
110
176
0
22 Jun 2024
The CLRS-Text Algorithmic Reasoning Language Benchmark
Larisa Markeeva
Sean McLeish
Borja Ibarz
Wilfried Bounsi
Olga Kozlova
Alex Vitvitskyi
Charles Blundell
Tom Goldstein
Avi Schwarzschild
Petar Veličković
LRM
60
15
0
06 Jun 2024
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Yuntian Deng
Yejin Choi
Stuart M. Shieber
ReLM
LRM
59
72
0
23 May 2024
NExT: Teaching Large Language Models to Reason about Code Execution
Ansong Ni
Miltiadis Allamanis
Arman Cohan
Yinlin Deng
Kensen Shi
Charles Sutton
Pengcheng Yin
ReLM
LRM
61
42
0
23 Apr 2024
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Hyungjoo Chae
Yeonghyeon Kim
Seungone Kim
Kai Tzu-iunn Ong
Beong-woo Kwak
...
Seonghwan Kim
Taeyoon Kwon
Jiwan Chung
Youngjae Yu
Jinyoung Yeo
LRM
ReLM
53
16
0
03 Apr 2024
Reasoning Runtime Behavior of a Program with LLM: How Far Are We?
Junkai Chen
Zhiyuan Pan
Xing Hu
Zhenhao Li
Ge Li
Xin Xia
LRM
70
26
0
25 Mar 2024
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain
King Han
Alex Gu
Wen-Ding Li
Fanjia Yan
Tianjun Zhang
Sida I. Wang
Armando Solar-Lezama
Koushik Sen
Ion Stoica
ELM
84
391
0
12 Mar 2024
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Zhiyuan Li
Hong Liu
Denny Zhou
Tengyu Ma
LRM
AI4CE
48
123
0
20 Feb 2024
Code Simulation Challenges for Large Language Models
Emanuele La Malfa
Christoph Weinhuber
Orazio Torre
Fangru Lin
Samuele Marro
Anthony Cohn
Nigel Shadbolt
Michael Wooldridge
LLMAG
LRM
33
8
0
17 Jan 2024
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Alex Gu
Baptiste Rozière
Hugh Leather
Armando Solar-Lezama
Gabriel Synnaeve
Sida I. Wang
ELM
ALM
LRM
37
109
0
05 Jan 2024
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
Chengshu Li
Jacky Liang
Andy Zeng
Xinyun Chen
Karol Hausman
Dorsa Sadigh
Sergey Levine
Fei-Fei Li
Fei Xia
Brian Ichter
LLMAG
LRM
65
81
0
07 Dec 2023
CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
Weixiang Yan
Haitian Liu
Yunkun Wang
Yunzhe Li
Qian Chen
...
Tingyu Lin
Weishan Zhao
Li Zhu
Hari Sundaram
Shuiguang Deng
ELM
LRM
66
37
0
14 Nov 2023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
70
572
0
10 Oct 2023
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
Ansong Ni
Pengcheng Yin
Yilun Zhao
Chen Wei
Yanjun Wang
...
Mingyuan Zhang
Chen Change Loy
Yingbo Zhou
Dragomir R. Radev
Arman Cohan
ELM
59
19
0
29 Sep 2023
Predicting Code Coverage without Execution
Michele Tufano
Shubham Chandel
Anisha Agarwal
Neel Sundaresan
Colin B. Clement
24
8
0
25 Jul 2023
Teaching Arithmetic to Small Transformers
Nayoung Lee
Kartik K. Sreenivasan
Jason D. Lee
Kangwook Lee
Dimitris Papailiopoulos
LRM
54
89
0
07 Jul 2023
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Tianyang Liu
Canwen Xu
Julian McAuley
ALM
78
166
0
05 Jun 2023
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Abulhair Saparov
Richard Yuanzhe Pang
Vishakh Padmakumar
Nitish Joshi
Seyed Mehran Kazemi
Najoung Kim
He He
ELM
LRM
63
94
0
24 May 2023
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning
Liangming Pan
Alon Albalak
Xinyi Wang
William Yang Wang
ReLM
LRM
AI4CE
113
262
0
20 May 2023
Code Execution with Pre-trained Language Models
Chenxiao Liu
Shuai Lu
Weizhu Chen
Daxin Jiang
Alexey Svyatkovskiy
Shengyu Fu
Neel Sundaresan
Nan Duan
ELM
69
25
0
08 May 2023
Transformers learn in-context by gradient descent
J. Oswald
Eyvind Niklasson
E. Randazzo
João Sacramento
A. Mordvintsev
A. Zhmoginov
Max Vladymyrov
MLT
91
488
0
15 Dec 2022
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Wenhu Chen
Xueguang Ma
Xinyi Wang
William W. Cohen
ReLM
ReCod
LRM
135
808
0
22 Nov 2022
Teaching Algorithmic Reasoning via In-context Learning
Hattie Zhou
Azade Nova
Hugo Larochelle
Rameswar Panda
Behnam Neyshabur
Hanie Sedghi
LRM
ReLM
68
114
0
15 Nov 2022
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Abulhair Saparov
He He
ELM
LRM
ReLM
210
303
0
03 Oct 2022
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Anders Andreassen
Guy Gur-Ari
Henryk Michalewski
Jacob Austin
...
Aitor Lewkowycz
Maarten Bosma
D. Luan
Charles Sutton
Augustus Odena
ReLM
LRM
166
741
0
30 Nov 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
D. Song
Jacob Steinhardt
ReLM
FaML
156
2,249
0
05 Mar 2021
Neural Execution Engines: Learning to Execute Subroutines
Yujun Yan
Kevin Swersky
Danai Koutra
Parthasarathy Ranganathan
Milad Hashemi
NAI
45
42
0
15 Jun 2020
Neural Execution of Graph Algorithms
Petar Velickovic
Rex Ying
Matilde Padovano
R. Hadsell
Charles Blundell
GNN
78
168
0
23 Oct 2019
Learning to Execute
Wojciech Zaremba
Ilya Sutskever
ODL
87
559
0
17 Oct 2014