arXiv:2311.01964
Cited By
Don't Make Your LLM an Evaluation Benchmark Cheater
3 November 2023
Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
ELM
Papers citing "Don't Make Your LLM an Evaluation Benchmark Cheater" (38 / 38 papers shown)
RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, A. Nori, Rahul Sharma, Amit Sharma, Javier González
LRM · 30 · 0 · 0 · 18 Jun 2025

Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
KELM · 88 · 0 · 0 · 04 Jun 2025

DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures
Yu He, Yingxi Li, Colin White, Ellen Vitercik
ELM, LRM · 26 · 0 · 0 · 29 May 2025

Evaluation and Incident Prevention in an Enterprise AI Assistant
Akash Maharaj, David Arbour, Daniel Lee, Uttaran Bhattacharya, Anup B. Rao, Austin Zane, Avi Feller, Kun Qian, Yunyao Li
69 · 0 · 0 · 11 Apr 2025

Large Language Models Could Be Rote Learners
Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin
ELM · 440 · 0 · 0 · 11 Apr 2025

A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models
Yuantao Zhang, Zhankui Yang
AAML · 78 · 0 · 0 · 05 Apr 2025

Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets
Preetam Prabhu Srikar Dammu, Himanshu Naidu, Chirag Shah
168 · 1 · 0 · 06 Mar 2025

CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models
Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun, Kede Ma, Ying Wei
138 · 0 · 0 · 06 Mar 2025

PiCO: Peer Review in LLMs based on the Consistency Optimization
Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yu Wang, Ming Pang, Li Yuan
ALM · 215 · 9 · 0 · 24 Feb 2025

Evaluation of Deep Audio Representations for Hearables
Fabian Gröger, Pascal Baumann, Ludovic Amruthalingam, Laurent Simon, Ruksana Giurda, Simone Lionetti
125 · 0 · 0 · 10 Feb 2025

Unbiased Evaluation of Large Language Models from a Causal Perspective
Meilin Chen, Jian Tian, Liang Ma, Di Xie, Weijie Chen, Jiang Zhu
ALM, ELM · 166 · 0 · 0 · 10 Feb 2025

Real-time Fake News from Adversarial Feedback
Sanxing Chen, Yukun Huang, Bhuwan Dhingra
86 · 0 · 0 · 31 Dec 2024

AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, William Yang Wang
283 · 13 · 0 · 18 Dec 2024

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
ELM · 196 · 1 · 0 · 10 Dec 2024

Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina
Yuan Gao, Dokyun Lee, Gordon Burtch, Sina Fazelpour
LRM · 193 · 14 · 0 · 25 Oct 2024

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu, Özlem Uzuner, Meliha Yetisgen, Fei Xia
120 · 8 · 0 · 24 Oct 2024

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
AAML, CoGe, VLM · 234 · 31 · 0 · 18 Oct 2024

Detecting Training Data of Large Language Models via Expectation Maximization
Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, Miguel Ballesteros, William Yang Wang
MIALM · 271 · 4 · 2 · 10 Oct 2024

Fine-tuning can Help Detect Pretraining Data from Large Language Models
Han Zhang, Songxin Zhang, Bingyi Jing, Hongxin Wei
152 · 1 · 0 · 09 Oct 2024

Training on the Benchmark Is Not All You Need
Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, Min Yang
150 · 6 · 0 · 03 Sep 2024

Bringing AI Participation Down to Scale: A Comment on OpenAI's Democratic Inputs to AI Project
David Moats, Chandrima Ganguly
VLM · 61 · 0 · 0 · 16 Jul 2024

Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
Biqing Qi, Kaiyan Zhang, Kai Tian, Haoxiang Li, Zhang-Ren Chen, Sihang Zeng, Ermo Hua, Hu Jinfang, Bowen Zhou
LM&MA · 127 · 18 · 0 · 12 Jul 2024

Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt
ELM · 154 · 9 · 1 · 10 Jul 2024

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian, Shunji Wan, Claudia Tang, Youzhi Wang, Xuanming Zhang, Maximillian Chen, Zhou Yu
AAML · 93 · 12 · 0 · 25 Jun 2024

Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu, Shuhao Guan, Derek Greene, Mohand-Tahar Kechadi
ELM, ALM · 94 · 56 · 0 · 06 Jun 2024

Easy Problems That LLMs Get Wrong
Sean Williams, James Huckle
LRM · 160 · 14 · 0 · 30 May 2024

Exploring Subjectivity for more Human-Centric Assessment of Social Biases in Large Language Models
Paula Akemi Aoyagui, Sharon Ferguson, Anastasia Kuzminykh
81 · 0 · 0 · 17 May 2024

Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Yeqi Gao, Yuzhou Gu, Zhao Song
75 · 0 · 0 · 09 May 2024

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
96 · 22 · 0 · 25 Apr 2024

RAM: Towards an Ever-Improving Memory System by Learning from Communications
Jiaqi Li, Xiaobo Wang, Wentao Ding, Zihao Wang, Yipeng Kang, Zixia Jia, Zilong Zheng
110 · 3 · 0 · 18 Apr 2024

Sampling-based Pseudo-Likelihood for Membership Inference Attacks
Masahiro Kaneko, Youmi Ma, Yuki Wata, Naoaki Okazaki
82 · 9 · 0 · 17 Apr 2024

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
Kehua Feng, Keyan Ding, Hongzhi Tan, Kede Ma, Zhihua Wang, ..., Yuzhou Cheng, Ge Sun, Guozhou Zheng, Qiang Zhang, H. Chen
128 · 13 · 0 · 10 Apr 2024

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu
147 · 76 · 0 · 25 Mar 2024

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida I. Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica
ELM · 151 · 448 · 0 · 12 Mar 2024

Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Huan Ma, Yan Zhu, Changqing Zhang, Peilin Zhao, Baoyuan Wu, Long-Kai Huang, Qinghua Hu, Bing Wu
VLM · 146 · 2 · 0 · 01 Mar 2024

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan
KELM · 59 · 10 · 0 · 19 Feb 2024

Institutional Platform for Secure Self-Service Large Language Model Exploration
V. Bumgardner, Mitchell A. Klusty, W. V. Logan, Samuel E. Armstrong, Caylin D. Hickey, Jeff Talbert
140 · 1 · 0 · 01 Feb 2024

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang
CLL, KELM · 211 · 319 · 0 · 17 Aug 2023