2501.01257
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
2 January 2025
Shanghaoran Quan
Jiaxi Yang
Bowen Yu
Jian Xu
Dayiheng Liu
An Yang
Xuancheng Ren
Bofei Gao
Yibo Miao
Yunlong Feng
Zhaoxiang Wang
Jian Yang
Zeyu Cui
Yang Fan
Wenjie Qu
Binyuan Hui
Junyang Lin
ALM
ELM
LRM
Papers citing
"CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings"
17 / 17 papers shown
Title
The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
Yunho Jin
Gu-Yeon Wei
David Brooks
LRM
7
0
0
20 May 2025
OSS-Bench: Benchmark Generator for Coding LLMs
Yuancheng Jiang
Roland Yap
Zhenkai Liang
12
0
0
18 May 2025
Qwen3 Technical Report
An Yang
A. Li
Baosong Yang
Beichen Zhang
Binyuan Hui
...
Zekun Wang
Zeyu Cui
Zhenru Zhang
Zhenhong Zhou
Zihan Qiu
LLMAG
OSLM
LRM
50
10
0
14 May 2025
Evaluating LLM Metrics Through Real-World Capabilities
Justin K Miller
Wenjia Tang
ELM
ALM
52
0
0
13 May 2025
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
Sizhe Wang
Ziwen Wang
Dongsheng Ma
Yongan Yu
Rui Ling
Zhiyu Li
Feiyu Xiong
Wentao Zhang
LRM
65
0
0
30 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
96
2
0
26 Apr 2025
Scaling Laws For Scalable Oversight
Joshua Engels
David D. Baek
Subhash Kantamneni
Max Tegmark
ELM
77
0
0
25 Apr 2025
CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
Anirudh Khatry
Robert Zhang
Jia Pan
Ziteng Wang
Qiaochu Chen
Greg Durrett
Isil Dillig
39
0
0
21 Apr 2025
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
Yunhui Xia
Wei Shen
Yan Wang
Jason Klein Liu
Huifeng Sun
Siyue Wu
Jian Hu
Xiaolong Xu
AI4TS
30
1
0
20 Apr 2025
HoarePrompt: Structural Reasoning About Program Correctness in Natural Language
Dimitrios Stamatios Bouras
Yihan Dai
Tairan Wang
Yingfei Xiong
Sergey Mechtaev
LRM
53
0
0
25 Mar 2025
RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
Linxi Liang
Jing Gong
Mingwei Liu
Chong Wang
Guangsheng Ou
Yanlin Wang
Xin Peng
Zibin Zheng
ALM
69
0
0
21 Mar 2025
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol
Roham Koohestani
Philippe de Bekker
M. Izadi
VLM
50
0
0
07 Mar 2025
ProBench: Benchmarking Large Language Models in Competitive Programming
Lei Yang
Renren Jin
Ling Shi
Jianxiang Peng
Yue Chen
Deyi Xiong
ReLM
ELM
LRM
61
2
0
28 Feb 2025
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
Axel Backlund
Lukas Petersson
LLMAG
RALM
63
1
0
20 Feb 2025
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino
Ming Wang
Tejal Patwardhan
Johannes Heidecke
43
18
0
17 Feb 2025
RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation
C. Zhou
Xinyu Zhang
Dandan Song
Xiancai Chen
Wanli Gu
Huipeng Ma
Yuhang Tian
Mengdi Zhang
Linmei Hu
63
1
0
13 Feb 2025
GLoRE: Evaluating Logical Reasoning of Large Language Models
Hanmeng Liu
Zhiyang Teng
Ruoxi Ning
Jian Liu
Qiji Zhou
Yuexin Zhang
Yue Zhang
ReLM
ELM
LRM
70
8
0
13 Oct 2023