2505.18102
Cited By
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
23 May 2025
Takashi Ishida
Thanawat Lodkaew
Ikko Yamane
Papers citing
"How Can I Publish My LLM Benchmark Without Giving the True Answers Away?"
31 / 31 papers shown
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
D. Sculley
Will Cukierski
Phil Culliton
Sohier Dane
Maggie Demkin
...
Addison Howard
Paul Mooney
Walter Reade
Megan Risdal
Nate Keating
64
1
0
01 May 2025
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Weihao Xuan
Rui Yang
Heli Qi
Qingcheng Zeng
Yunze Xiao
...
Edison Marrese-Taylor
Shijian Lu
Yusuke Iwasawa
Yutaka Matsuo
Irene Li
ELM
149
6
0
13 Mar 2025
BIG-Bench Extra Hard
Mehran Kazemi
Bahare Fatemi
Hritik Bansal
John Palowitch
Chrysovalantis Anastasiou
...
Kate Olszewska
Yi Tay
Vinh Q. Tran
Quoc V. Le
Orhan Firat
ELM
LRM
235
10
0
26 Feb 2025
Do Large Language Model Benchmarks Test Reliability?
Joshua Vendrow
Edward Vendrow
Sara Beery
Aleksander Madry
159
5
0
05 Feb 2025
Evolutionary Optimization of Model Merging Recipes
Takuya Akiba
Makoto Shing
Yujin Tang
Qi Sun
David Ha
MoMe
228
115
0
28 Jan 2025
Assessing Contamination in Large Language Models: Introducing the LogProber method
Nicolas Yax
Pierre-Yves Oudeyer
Stefano Palminteri
64
5
0
26 Aug 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlíček
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
95
223
0
25 Jun 2024
Data Contamination Can Cross Language Barriers
Feng Yao
Yufan Zhuang
Zihao Sun
Sunan Xu
Animesh Kumar
Jingbo Shang
66
10
0
19 Jun 2024
Are We Done with MMLU?
Aryo Pradipta Gema
Joshua Ong Jun Leang
Giwon Hong
Alessio Devoto
Alberto Carlo Maria Mancino
...
R. McHardy
Joshua Harris
Jean Kaddour
Emile van Krieken
Pasquale Minervini
ELM
104
39
0
06 Jun 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
...
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
ALM
LRM
ELM
67
100
0
01 May 2024
Investigating Data Contamination for Pre-training Language Models
Minhao Jiang
Ken Ziyu Liu
Ming Zhong
Rylan Schaeffer
Siru Ouyang
Jiawei Han
Sanmi Koyejo
58
67
0
11 Jan 2024
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH
ELM
72
627
0
20 Nov 2023
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
Shahriar Golchin
Mihai Surdeanu
53
26
0
10 Nov 2023
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Shuo Yang
Wei-Lin Chiang
Lianmin Zheng
Joseph E. Gonzalez
Ion Stoica
ALM
51
122
0
08 Nov 2023
Proving Test Set Contamination in Black Box Language Models
Yonatan Oren
Nicole Meister
Niladri Chatterji
Faisal Ladhak
Tatsunori B. Hashimoto
HILM
58
139
0
26 Oct 2023
Detecting Pretraining Data from Large Language Models
Weijia Shi
Anirudh Ajith
Mengzhou Xia
Yangsibo Huang
Daogao Liu
Terra Blevins
Danqi Chen
Luke Zettlemoyer
MIALM
59
177
0
25 Oct 2023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
63
529
0
10 Oct 2023
Time Travel in LLMs: Tracing Data Contamination in Large Language Models
Shahriar Golchin
Mihai Surdeanu
88
98
0
16 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
249
11,636
0
18 Jul 2023
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM
PILM
1.0K
12,840
0
27 Feb 2023
Editing Models with Task Arithmetic
Gabriel Ilharco
Marco Tulio Ribeiro
Mitchell Wortsman
Suchin Gururangan
Ludwig Schmidt
Hannaneh Hajishirzi
Ali Farhadi
KELM
MoMe
MU
154
474
0
08 Dec 2022
Is the Performance of My Deep Network Too Good to Be True? A Direct Approach to Estimating the Bayes Error in Binary Classification
Takashi Ishida
Ikko Yamane
Nontawat Charoenphakdee
Gang Niu
Masashi Sugiyama
BDL
UQCV
58
16
0
01 Feb 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
214
4,175
0
27 Oct 2021
Evaluating State-of-the-Art Classification Models Against Bayes Optimality
Ryan Theisen
Huan Wang
Lav Varshney
Caiming Xiong
R. Socher
27
11
0
07 Jun 2021
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
Aida Amini
Saadia Gabriel
Shanchuan Lin
Rik Koncel-Kedziorski
Yejin Choi
Hannaneh Hajishirzi
AIMat
ReLM
AI4CE
98
553
0
30 May 2019
Model Similarity Mitigates Test Set Overuse
Horia Mania
John Miller
Ludwig Schmidt
Moritz Hardt
Benjamin Recht
51
51
0
29 May 2019
Cold Case: The Lost MNIST Digits
Chhavi Yadav
Léon Bottou
37
105
0
25 May 2019
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark
Kenton Lee
Ming-Wei Chang
Tom Kwiatkowski
Michael Collins
Kristina Toutanova
191
1,475
0
24 May 2019
Do ImageNet Classifiers Generalize to ImageNet?
Benjamin Recht
Rebecca Roelofs
Ludwig Schmidt
Vaishaal Shankar
OOD
SSeg
VLM
96
1,693
0
13 Feb 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
789
7,080
0
20 Apr 2018
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM
RALM
LRM
113
2,474
0
14 Mar 2018