Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2310.06770
Cited By
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
10 October 2023
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?"
50 / 100 papers shown
Title
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
25
0
0
13 May 2025
Measuring General Intelligence with Generated Games
Vivek Verma
David Huang
William Chen
Dan Klein
Nicholas Tomlin
ReLM
ELM
LM&MA
LRM
48
0
0
12 May 2025
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Rushi Qiang
Yuchen Zhuang
Yinghao Li
D. Kilman
Rongzhi Zhang
...
Ian Shu-Hei Wong
Sherry Yang
Percy Liang
Chao Zhang
Bo Dai
ELM
39
0
0
12 May 2025
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
Kai Xu
YiWei Mao
XinYi Guan
ZiLong Feng
38
0
0
12 May 2025
HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics
Lennart Luettgau
Harry Coppock
Magda Dubois
Christopher Summerfield
Cozmin Ududec
31
0
0
08 May 2025
Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale
Jiale Liu
Yifan Zeng
Shaokun Zhang
Chi Zhang
Malte Højmark-Bertelsen
Marie Normann Gadeberg
H. Wang
Qingyun Wu
41
0
0
06 May 2025
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Zimu Lu
Y. Yang
Houxing Ren
Haotian Hou
Han Xiao
Ke Wang
Weikang Shi
Aojun Zhou
Mingjie Zhan
H. Li
LLMAG
45
0
0
06 May 2025
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo
Jiajun Xu
Yi Zhang
Jiaxi Song
Haoyang Peng
...
Yongming Rao
Houwen Peng
Han Hu
Gordon Wetzstein
Shi-Min Hu
ELM
LRM
57
0
0
04 May 2025
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks
Vishnu Sarukkai
Zhiqiang Xie
Kayvon Fatahalian
LLMAG
75
0
0
01 May 2025
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
D. Sculley
Will Cukierski
Phil Culliton
Sohier Dane
Maggie Demkin
...
Addison Howard
Paul Mooney
Walter Reade
Megan Risdal
Nate Keating
31
0
0
01 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
J. H. Liu
Zhiguang Han
...
Beibin Li
Chi Wang
H. Wang
Y. Chen
Qingyun Wu
49
0
0
30 Apr 2025
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
Sizhe Wang
Z. Wang
Dongsheng Ma
Yongan Yu
Rui Ling
Z. Li
Feiyu Xiong
W. Zhang
LRM
60
0
0
30 Apr 2025
CrashFixer: A crash resolution agent for the Linux kernel
Alex Mathai
Chenxi Huang
Suwei Ma
Jihwan Kim
Hailie Mitchell
Aleksandr Nogikh
Petros Maniatis
Franjo Ivančić
Junfeng Yang
Baishakhi Ray
62
0
0
29 Apr 2025
Turing Machine Evaluation for Large Language Model
Haitao Wu
Zongbo Han
Huaxi Huang
Changqing Zhang
ELM
LRM
62
0
0
29 Apr 2025
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Shangyu Li
Juyong Jiang
Tiancheng Zhao
Jiasi Shen
49
0
0
29 Apr 2025
Towards Automated Scoping of AI for Social Good Projects
Jacob Emmerson
Rayid Ghani
Zheyuan Ryan Shi
133
0
0
28 Apr 2025
APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries
Huajian Xin
Luming Li
Xiaoran Jin
Jacques Fleuriot
Wenda Li
AIMat
52
0
0
27 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
1
0
26 Apr 2025
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Z. Wang
K. Wang
Q. Wang
Pingyue Zhang
Linjie Li
...
Jiajun Wu
L. Fei-Fei
Lijuan Wang
Yejin Choi
Manling Li
86
3
0
24 Apr 2025
Efficient Pretraining Length Scaling
Bohong Wu
Shen Yan
Sijun Zhang
Jianqiao Lu
Yutao Zeng
Ya Wang
Xun Zhou
127
0
0
21 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting
Pedro Medeiros
Jon G Sanders
Nathaniel Li
Long Phan
Karam Elabd
Lennart Justen
Dan Hendrycks
Seth Donoughe
ELM
55
2
0
21 Apr 2025
CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations
Man Ho Adrian Lam
Chaozheng Wang
Jen-tse Huang
M. Lyu
LRM
34
0
0
19 Apr 2025
SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow
Timothy Bula
Saurabh Pujar
Luca Buratti
Mihaela A. Bornea
Avirup Sil
LLMAG
39
0
0
11 Apr 2025
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Anna Goldie
Azalia Mirhoseini
Hao Zhou
Irene Cai
Christopher D. Manning
SyDa
OffRL
ReLM
LRM
109
3
0
07 Apr 2025
Frontier AI's Impact on the Cybersecurity Landscape
Wenbo Guo
Yujin Potter
Tianneng Shi
Zhun Wang
Andy Zhang
Dawn Song
52
1
0
07 Apr 2025
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Weiwei Sun
Shengyu Feng
Shanda Li
Yiming Yang
LLMAG
45
1
0
06 Apr 2025
Inference-Time Scaling for Generalist Reward Modeling
Zijun Liu
P. Wang
R. Xu
Shirong Ma
Chong Ruan
Peng Li
Yang Janet Liu
Y. Wu
OffRL
LRM
46
11
0
03 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Z. Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
95
3
0
01 Apr 2025
Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models
Irtaza Khalid
Amir Masoud Nourollah
Steven Schockaert
LRM
40
0
0
30 Mar 2025
L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution
Simeng Sun
Cheng-Ping Hsieh
Faisal Ladhak
Erik Arakelyan
Santiago Akle Serano
Boris Ginsburg
ReLM
ELM
LRM
133
0
0
28 Mar 2025
CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Yuxuan Zhu
Antony Kellermann
Dylan Bowman
Philip Li
Akul Gupta
...
Avi Dhir
Sudhit Rao
Kaicheng Yu
Twm Stone
Daniel Kang
LLMAG
ELM
72
3
0
21 Mar 2025
Measuring AI Ability to Complete Long Tasks
Thomas Kwa
Ben West
Joel Becker
Amy Deng
Katharyn Garcia
...
Lucas Jun Koba Sato
H. Wijk
Daniel M. Ziegler
Elizabeth Barnes
Lawrence Chan
ELM
77
6
0
18 Mar 2025
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri
Melissa Z. Pan
Shuyi Yang
Lakshya A Agrawal
Bhavya Chopra
...
Dan Klein
Kannan Ramchandran
Matei A. Zaharia
Joseph E. Gonzalez
Ion Stoica
LLMAG
Presented at
ResearchTrend Connect | LLMAG
on
23 Apr 2025
127
8
0
17 Mar 2025
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
Hao Cui
Zahra Shamsi
Gowoon Cheon
Xuejian Ma
Shutong Li
...
Eun-Ah Kim
M. Brenner
Viren Jain
Sameera Ponda
Subhashini Venugopalan
ELM
LRM
52
0
0
14 Mar 2025
LocAgent: Graph-Guided LLM Agents for Code Localization
Zhaoling Chen
Xiangru Tang
Gangda Deng
Fang Wu
Jialong Wu
Zhiwei Jiang
Viktor Prasanna
Arman Cohan
Xingyao Wang
LLMAG
93
3
0
12 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
73
2
0
10 Mar 2025
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Wei Li
Xin Zhang
Zhongxin Guo
Shaoguang Mao
Wen Luo
Guangyue Peng
Yangyu Huang
Houfeng Wang
Scarlett Li
57
0
0
09 Mar 2025
Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators
Blaine Quackenbush
P. Atzberger
3DPC
AI4CE
65
0
0
06 Mar 2025
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Peiding Wang
L. Zhang
Fang Liu
Lin Shi
Minxiao Li
Bo Shen
An Fu
ELM
LRM
140
0
0
05 Mar 2025
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
Ludovico Mitchener
Jon M. Laurent
Benjamin Tenmann
Siddharth Narayanan
Geemi P Wellawatte
A. White
Lorenzo Sani
Samuel G. Rodriques
LLMAG
LM&MA
ELM
62
3
0
28 Feb 2025
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
Marthe Ballon
Andres Algaba
Vincent Ginis
LRM
ReLM
38
4
0
24 Feb 2025
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Patrick Tser Jern Kon
Jiachen Liu
Qiuyi Ding
Yiming Qiu
Zhenning Yang
Yibo Huang
Jayanth Srinivasa
Myungjin Lee
Mosharaf Chowdhury
Ang Chen
56
3
0
22 Feb 2025
Measuring AI agent autonomy: Towards a scalable approach with code inspection
Peter Cihon
Merlin Stein
Gagan Bansal
Sam Manning
Kevin Xu
38
0
0
21 Feb 2025
Forecasting Frontier Language Model Agent Capabilities
Govind Pimpale
Axel Højmark
Jérémy Scheurer
Marius Hobbhahn
LLMAG
ELM
41
1
0
21 Feb 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
Wenyue Hua
Tyler Wong
Sun Fei
Liangming Pan
Adam Jardine
William Yang Wang
LRM
73
2
0
20 Feb 2025
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Michael Luo
Xiaoxiang Shi
Colin Cai
Tianjun Zhang
Justin Wong
...
Chi Wang
Yanping Huang
Zhifeng Chen
Joseph E. Gonzalez
Ion Stoica
55
3
0
20 Feb 2025
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou
Qian Liu
Fan Zhou
Changyu Chen
Zili Wang
...
Tianyu Pang
Chao Du
Xinyi Wan
Wei Lu
Min Lin
103
1
0
18 Feb 2025
AIDE: AI-Driven Exploration in the Space of Code
Zhengyao Jiang
Dominik Schmidt
Dhruv Srikanth
Dixing Xu
Ian Kaplan
Deniss Jacenko
Yuxiang Wu
67
5
0
18 Feb 2025
SWE-Lancer: Can Frontier LLMs Earn
1
M
i
l
l
i
o
n
f
r
o
m
R
e
a
l
−
W
o
r
l
d
F
r
e
e
l
a
n
c
e
S
o
f
t
w
a
r
e
E
n
g
i
n
e
e
r
i
n
g
?
1 Million from Real-World Freelance Software Engineering?
1
M
i
ll
i
o
n
f
ro
m
R
e
a
l
−
W
or
l
d
F
ree
l
an
ce
S
o
f
tw
a
re
E
n
g
in
eer
in
g
?
Samuel Miserendino
M. Wang
Tejal Patwardhan
Johannes Heidecke
43
18
0
17 Feb 2025
Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Dylan Zhang
Justin Wang
Tianran Sun
38
1
0
17 Feb 2025
1
2
Next