2402.14809
Cited By
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
22 February 2024
Zicheng Lin
Zhibin Gou
Tian Liang
Ruilin Luo
Haowei Liu
Yujiu Yang
LRM
Papers citing "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"
28 / 28 papers shown
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq Joty
ELM
ALM
LRM
110
3
0
21 Apr 2025
FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
Gabriel Recchia
Chatrik Singh Mangat
Issac Li
Gayatri Krishnakumar
ALM
163
0
0
29 Mar 2025
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
Zhiyu Yang
Shuo Wang
Yukun Yan
Yang Deng
64
0
0
28 Mar 2025
Rolling Forward: Enhancing LightGCN with Causal Graph Convolution for Credit Bond Recommendation
Ashraf Ghiye
Baptiste Barreau
Laurent Carlier
Michalis Vazirgiannis
90
3
0
18 Mar 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Yancheng He
Shilong Li
Jing Liu
Weixun Wang
Xingyuan Bu
...
Zhongyuan Peng
Zhenru Zhang
Zhicheng Zheng
Wenbo Su
Bo Zheng
ELM
LRM
105
13
0
26 Feb 2025
Evaluating Step-by-step Reasoning Traces: A Survey
Jinu Lee
Julia Hockenmaier
LRM
ELM
98
2
0
17 Feb 2025
Improving Video Generation with Human Feedback
Jie Liu
Gongye Liu
Jiajun Liang
Ziyang Yuan
Xiaokun Liu
...
Pengfei Wan
Di Zhang
Kun Gai
Yujiu Yang
Wanli Ouyang
VGen
EGVM
111
22
0
23 Jan 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Ruilin Luo
Zhuofan Zheng
Yifan Wang
Xinzhe Ni
Zicheng Lin
...
Yiyao Yu
C. Shi
Ruihang Chu
Jin Zeng
Yujiu Yang
LRM
116
19
0
08 Jan 2025
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Mingyang Song
Zhaochen Su
Xiaoye Qu
Jiawei Zhou
Yu Cheng
LRM
115
37
0
06 Jan 2025
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Chujie Zheng
Zizhuo Zhang
Beichen Zhang
Runji Lin
Keming Lu
Bowen Yu
Dayiheng Liu
Jingren Zhou
Junyang Lin
LRM
157
71
0
09 Dec 2024
Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu
Zhengxing Chen
Aston Zhang
L Tan
Chenguang Zhu
...
Suchin Gururangan
Chao-Yue Zhang
Melanie Kambadur
Dhruv Mahajan
Rui Hou
LRM
ALM
137
23
0
25 Nov 2024
The Generative AI Paradox: "What It Can Create, It May Not Understand"
Peter West
Ximing Lu
Nouha Dziri
Faeze Brahman
Linjie Li
...
Khyathi Chandu
Benjamin Newman
Pang Wei Koh
Allyson Ettinger
Yejin Choi
AIMat
81
78
0
31 Oct 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
312
4,253
0
09 Jun 2023
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Zhibin Gou
Zhihong Shao
Yeyun Gong
Yelong Shen
Yujiu Yang
Nan Duan
Weizhu Chen
KELM
LRM
61
382
0
19 May 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
171
1,603
0
15 Dec 2022
Language Models are Multilingual Chain-of-Thought Reasoners
Freda Shi
Mirac Suzgun
Markus Freitag
Xuezhi Wang
Suraj Srivats
...
Yi Tay
Sebastian Ruder
Denny Zhou
Dipanjan Das
Jason W. Wei
ReLM
LRM
218
362
0
06 Oct 2022
Complexity-Based Prompting for Multi-Step Reasoning
Yao Fu
Hao-Chun Peng
Ashish Sabharwal
Peter Clark
Tushar Khot
ReLM
LRM
199
433
0
03 Oct 2022
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
Pan Lu
Liang Qiu
Kai-Wei Chang
Ying Nian Wu
Song-Chun Zhu
Tanmay Rajpurohit
Peter Clark
Ashwin Kalyan
ReLM
LRM
140
289
0
29 Sep 2022
Self-critiquing models for assisting human evaluators
William Saunders
Catherine Yeh
Jeff Wu
Steven Bills
Long Ouyang
Jonathan Ward
Jan Leike
ALM
ELM
65
300
0
12 Jun 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
422
6,202
0
05 Apr 2022
Cut the CARP: Fishing for zero-shot story evaluation
Shahbuland Matiana
J. Smith
Ryan Teehan
Louis Castricato
Stella Biderman
Leo Gao
Spencer Frazier
99
16
0
06 Oct 2021
Program Synthesis with Large Language Models
Jacob Austin
Augustus Odena
Maxwell Nye
Maarten Bosma
Henryk Michalewski
...
Ellen Jiang
Carrie J. Cai
Michael Terry
Quoc V. Le
Charles Sutton
ELM
AIMat
ReCod
ALM
180
1,925
0
16 Aug 2021
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
205
5,454
0
07 Jul 2021
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva
Daniel Khashabi
Elad Segal
Tushar Khot
Dan Roth
Jonathan Berant
RALM
330
715
0
06 Jan 2021
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor
Jonathan Herzig
Nicholas Lourie
Jonathan Berant
RALM
140
1,716
0
02 Nov 2018
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang
Peng Qi
Saizheng Zhang
Yoshua Bengio
William W. Cohen
Ruslan Salakhutdinov
Christopher D. Manning
RALM
147
2,635
0
25 Sep 2018
FEVER: a large-scale dataset for Fact Extraction and VERification
James Thorne
Andreas Vlachos
Christos Christodoulopoulos
Arpit Mittal
HILM
121
1,646
0
14 Mar 2018
Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
Wang Ling
Dani Yogatama
Chris Dyer
Phil Blunsom
AIMat
76
724
0
11 May 2017