2406.04244
Benchmark Data Contamination of Large Language Models: A Survey
6 June 2024
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
Papers citing
"Benchmark Data Contamination of Large Language Models: A Survey"
40 / 40 papers shown
MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Joseph Peper
Wenzhao Qiu
Ali Payani
Lu Wang
10
0
0
17 Jun 2025
OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics
Yaoming Zhu
Junxin Wang
Yiyang Li
Lin Qiu
Zongyu Wang
...
Xuezhi Cao
Yuhuai Wei
Mingshi Wang
Xunliang Cai
Rong Ma
LRM
108
0
0
12 Jun 2025
Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering
Niruthiha Selvanayagam
22
0
0
05 Jun 2025
RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Yixiao Zeng
Tianyu Cao
Danqing Wang
Xinran Zhao
Zimeng Qiu
Morteza Ziyadi
Tongshuang Wu
Lei Li
RALM
41
0
0
01 Jun 2025
DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures
Yu He
Yingxi Li
Colin White
Ellen Vitercik
ELM
LRM
21
0
0
29 May 2025
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Shuhao Guan
Moule Lin
Cheng Xu
Xinyi Liu
Jinman Zhao
Jiexin Fan
Qi Xu
Derek Greene
57
2
0
26 May 2025
AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models
Miguel Angel Peñaloza Perez
Bruno Lopez Orozco
Jesus Tadeo Cruz Soto
Michelle Bruno Hernandez
Miguel Angel Alvarado Gonzalez
Sandra Malagon
LRM
ELM
20
0
0
25 May 2025
Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis
Niklas Holzner
Sebastian Maier
Stefan Feuerriegel
37
0
0
22 May 2025
Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
Federico Ranaldi
Andrea Zugarini
Leonardo Ranaldi
Fabio Massimo Zanzotto
39
0
0
21 May 2025
An Empirical Study of Many-to-Many Summarization with Large Language Models
Jiaan Wang
Fandong Meng
Zengkui Sun
Yunlong Liang
Yuxuan Cao
Jiarong Xu
Haoxiang Shi
Jie Zhou
45
0
0
19 May 2025
Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions
James D. Finch
Yasasvi Josyula
Jinho Choi
74
0
0
25 Apr 2025
Automatically Generating Rules of Malicious Software Packages via Large Language Model
XiangRui Zhang
HaoYu Chen
YongZhong He
Wenjia Niu
Qiang Li
69
0
0
24 Apr 2025
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Yejun Yoon
Jaeyoon Jung
Seunghyun Yoon
Kunwoo Park
59
0
0
19 Apr 2025
From Stability to Inconsistency: A Study of Moral Preferences in LLMs
Monika Jotautaite
Mary Phuong
Chatrik Singh Mangat
Maria Angelica Martinez
56
0
0
08 Apr 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
I. Gevers
Victor De Marez
Luna De Bruyne
Walter Daelemans
65
0
0
31 Mar 2025
Unlocking the Potential of Past Research: Using Generative AI to Reconstruct Healthcare Simulation Models
Thomas Monks
Alison Harper
Amy Heather
94
0
0
27 Mar 2025
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun
Han Wang
Dongbai Li
Gang Wang
Huan Zhang
AAML
92
0
0
20 Mar 2025
Framing the Game: How Context Shapes LLM Decision-Making
Isaac Robinson
John Burden
68
0
0
05 Mar 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
104
0
0
23 Feb 2025
Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Leena Mathur
Marian Qian
Paul Pu Liang
Louis-Philippe Morency
LRM
473
4
0
21 Feb 2025
Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance
Guangxiang Zhao
Saier Hu
Xiaoqi Jian
Jinzhu Wu
Yuhan Wu
Change Jia
Lin Sun
Xiangzheng Zhang
173
1
0
18 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez-Llorca
ELM
266
6
0
10 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
170
30
0
03 Feb 2025
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son
Hyunwoo Ko
Dasol Choi
LRM
ReLM
131
1
0
10 Jan 2025
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Hui Dai
Ryan Teehan
Mengye Ren
KELM
ELM
AIFin
36
1
0
13 Nov 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
114
8
0
24 Oct 2024
CAP: Data Contamination Detection via Consistency Amplification
Yi Zhao
Jing Li
Linyi Yang
50
1
0
19 Oct 2024
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Varun Gumma
Anandhita Raghunath
Mohit Jain
Sunayana Sitaram
LM&MA
52
1
0
17 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem
Pouya Pezeshkpour
Hayate Iso
Seiji Maekawa
Nikita Bhutani
Estevam R. Hruschka
HILM
128
3
0
17 Oct 2024
In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
Alireza Shamshiri
Kyeong Rok Ryu
June Young Park
LLMAG
57
1
0
15 Oct 2024
Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Rushang Karia
Daniel Bramblett
D. Dobhal
Siddharth Srivastava
ELM
LRM
112
0
0
11 Oct 2024
Fine-tuning can Help Detect Pretraining Data from Large Language Models
Han Zhang
Songxin Zhang
Bingyi Jing
Hongxin Wei
140
1
0
09 Oct 2024
How Much Can We Forget about Data Contamination?
Sebastian Bordt
Suraj Srinivas
Valentyn Boreiko
U. V. Luxburg
133
2
0
04 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
139
9
0
30 Sep 2024
System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam
J. D. Winter
Dimitra Dodou
Y. B. Eisma
VLM
ELM
LRM
ReLM
40
11
0
19 Sep 2024
MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
Chunyang Li
Hao Peng
Xiaozhi Wang
Yunjia Qi
Lei Hou
Bin Xu
Juanzi Li
HILM
100
1
0
22 Jul 2024
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian
Shunji Wan
Claudia Tang
Youzhi Wang
Xuanming Zhang
Maximillian Chen
Zhou Yu
AAML
93
12
0
25 Jun 2024
UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models
Zhanyue Qin
Haochuan Wang
Deyuan Liu
Ziyang Song
Cunhang Fan
...
Zhen Lei
Zhiying Tu
Dianhui Chu
Xiaoyan Yu
Dianbo Sui
ELM
LRM
91
2
0
24 Jun 2024
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
Chunyuan Deng
Yilun Zhao
Yuzhao Heng
Yitong Li
Jiannan Cao
Xiangru Tang
Arman Cohan
86
15
0
20 Jun 2024
DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection
Xiao Yu
Yuang Qi
Kejiang Chen
Guoqiang Chen
Xi Yang
Pengyuan Zhu
Xiuwei Shang
Weiming Zhang
Neng H. Yu
DeLMO
73
11
0
21 May 2023