ResearchTrend.AI

Investigating Data Contamination in Modern Benchmarks for Large Language Models (arXiv:2311.09783)

16 November 2023
Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark B. Gerstein, Arman Cohan
AAML, ELM

Papers citing "Investigating Data Contamination in Modern Benchmarks for Large Language Models"

18 papers shown
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Nonnenmacher, ..., Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, Michael Borowitz
AI4MH, LM&MA, ELM
09 May 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz, Oleg Zendel, P. Bailey, Charles L. A. Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell
ELM
27 Apr 2025
HalluLens: LLM Hallucination Benchmark
Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, Pascale Fung
HILM
24 Apr 2025
Repurposing the scientific literature with vision-language models
Anton Alyakin, Jaden Stryker, Daniel Alber, Karl L. Sangwon, Brandon Duderstadt, ..., Laura Snyder, Eric Leuthardt, Douglas Kondziolka, Eric Karl Oermann
26 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
LRM
17 Feb 2025
Mind the Data Gap: Bridging LLMs to Enterprise Data Integration
Moe Kayali, Fabian Wenz, Nesime Tatbul, Çağatay Demiralp
31 Dec 2024
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
D. Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang
06 Nov 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu, Özlem Uzuner, Meliha Yetisgen, Fei Xia
24 Oct 2024
How Much Can We Forget about Data Contamination?
Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, U. V. Luxburg
04 Oct 2024
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
10 Sep 2024
Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho
HILM, ELM, AILaw
30 May 2024
Do LLMs Dream of Ontologies?
Marco Bombieri, Paolo Fiorini, Simone Paolo Ponzetto, M. Rospocher
CLL
26 Jan 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
AAML
25 Jan 2024
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
ELM
03 Nov 2023
Data Contamination Through the Lens of Time
Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley
16 Oct 2023
Can we trust the evaluation on ChatGPT?
Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn
ELM, ALM, LLMAG, AI4MH, LRM
22 Mar 2023
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM
04 Mar 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat
31 Dec 2020