arXiv:2504.01848
PaperBench: Evaluating AI's Ability to Replicate AI Research
2 April 2025
Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
Communities: ALM, ELM
Papers citing "PaperBench: Evaluating AI's Ability to Replicate AI Research" (16 of 16 papers shown)
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
Authors: Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, ..., Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton
Communities: LLMAG
28 / 0 / 0 · 17 Jun 2025

xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations
Authors: Kaiyuan Chen, Y. Ren, Yang Liu, Xiaobo Hu, Haotong Tian, ..., Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo
24 / 0 / 0 · 16 Jun 2025

EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements
Authors: Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha
20 / 0 / 0 · 10 Jun 2025

The AI Imperative: Scaling High-Quality Peer Review in Machine Learning
Authors: Qiyao Wei, Samuel Holt, Jing Yang, Markus Wulfmeier, M. Schaar
14 / 0 / 0 · 09 Jun 2025

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
Authors: Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, Nick Haber
20 / 0 / 0 · 02 Jun 2025

AI Scientists Fail Without Strong Implementation Capability
Authors: Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang
Communities: ELM
68 / 0 / 0 · 02 Jun 2025

AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage
Authors: Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, Maosong Sun
75 / 1 / 0 · 27 May 2025

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Authors: Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
91 / 0 / 0 · 26 May 2025

Next Token Prediction Is a Dead End for Creativity
Authors: Ibukun Olatunji, Mark Sheppard
24 / 1 / 0 · 25 May 2025

Value-Guided Search for Efficient Chain-of-Thought Reasoning
Authors: Kaiwen Wang, Jin Peng Zhou, Jonathan D. Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun
Communities: LRM
90 / 1 / 0 · 23 May 2025

Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems
Authors: Jiayi Geng, Howard Chen, Dilip Arumugam, Thomas L. Griffiths
101 / 0 / 0 · 23 May 2025

When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
Authors: Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, ..., Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, Stella Biderman
104 / 1 / 0 · 17 May 2025

HealthBench: Evaluating Large Language Models Towards Improved Human Health
Authors: Rahul Arora, Jason W. Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, ..., Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, K. Singhal
Communities: LM&MA, AI4MH, ELM
122 / 6 / 0 · 13 May 2025

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Communities: AI4CE
138 / 5 / 0 · 24 Apr 2025

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, L. Zhang, ..., Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
115 / 7 / 0 · 03 Apr 2025

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Authors: Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
Communities: ELM
87 / 31 / 0 · 11 Jun 2024