Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2410.09247
Cited By
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
11 October 2024
Jacob Haimes
Cenny Wenner
Kunvar Thaman
Vassil Tashev
Clement Neo
Esben Kran
Jason Schreiber
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts"
4 / 4 papers shown
Title
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun
Han Wang
Dongbai Li
Gang Wang
Huan Zhang
AAML
60
0
0
20 Mar 2025
DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran
Hieu Minh "Jord" Nguyen
Akash Kundu
Sami Jawhar
Jinsuk Park
Mateusz Maria Jurewicz
59
1
0
13 Mar 2025
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
67
4
0
24 Oct 2024
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Andrey Anurin
Jonathan Ng
Kibo Schaffer
Jason Schreiber
Esben Kran
ELM
40
5
0
10 Oct 2024
1