Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

6 February 2024
Simone Balloccu
Patrícia Schmidtová
Mateusz Lango
Ondrej Dusek
    SILM
    ELM
    PILM

Papers citing "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs"

43 / 43 papers shown
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
D. Sculley
Will Cukierski
Phil Culliton
Sohier Dane
Maggie Demkin
...
Addison Howard
Paul Mooney
Walter Reade
Megan Risdal
Nate Keating
57
1
0
01 May 2025
Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
Jędrzej Warczyński
Mateusz Lango
Ondrej Dusek
53
0
0
28 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
91
23
0
03 Feb 2025
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
Weiwei Sun
Lingyong Yan
Xinyu Ma
Shuaiqiang Wang
Pengjie Ren
Zhumin Chen
Dawei Yin
Zhaochun Ren
RALM
ALM
ELM
LRM
LM&MA
130
304
0
31 Dec 2024
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie
Yangsibo Huang
Chiyuan Zhang
Da Yu
Xinyun Chen
Bill Yuchen Lin
Bo Li
Badih Ghazi
Ravi Kumar
LRM
74
33
0
30 Oct 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
76
5
0
24 Oct 2024
Fine-tuning can Help Detect Pretraining Data from Large Language Models
Han Zhang
Songxin Zhang
Bingyi Jing
Hongxin Wei
77
1
0
09 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
82
7
0
30 Sep 2024
Leveraging Open-Source Large Language Models for Native Language Identification
Yee Man Ng
Ilia Markov
58
2
0
15 Sep 2024
Large Language Models Can Better Understand Knowledge Graphs Than We Thought
Xinbang Dai
Yuncheng Hua
Tongtong Wu
Yang Sheng
Qiu Ji
Guilin Qi
106
0
0
18 Feb 2024
Generating Faithful Text From a Knowledge Graph with Noisy Reference Text
Tahsina Hashem
Weiqing Wang
Derry Wijaya
Mohammed Eunus Ali
Yuan-Fang Li
39
3
0
12 Aug 2023
Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph
Jiashuo Sun
Chengjin Xu
Lumingyuan Tang
Saizhuo Wang
Chen Lin
Yeyun Gong
Lionel M. Ni
H. Shum
Jian Guo
LRM
55
77
0
15 Jul 2023
Neural Machine Translation Data Generation and Augmentation using ChatGPT
Wayne Yang
Garrett Nicolai
78
7
0
11 Jul 2023
Multilingual Language Models are not Multicultural: A Case Study in Emotion
Shreya Havaldar
Sunny Rai
Bhumika Singhal
Langchen Liu
Sharath Chandra Guntuku
Lyle Ungar
61
60
0
03 Jul 2023
UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?
Junda Wang
Zonghai Yao
Avijit Mitra
Samuel Osebe
Zhichao Yang
Hongfeng Yu
LM&MA
MedIm
75
14
0
29 Jun 2023
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Wei Ping
Weixin Chen
Hengzhi Pei
Chulin Xie
Mintong Kang
...
Zinan Lin
Yuk-Kit Cheng
Sanmi Koyejo
D. Song
Yue Liu
55
405
0
20 Jun 2023
Evaluation of Question Generation Needs More References
Shinhyeok Oh
Hyojun Go
Hyeongdon Moon
Yunsung Lee
Myeongho Jeong
Hyun Seung Lee
Seungtaek Choi
ELM
46
8
0
26 May 2023
Enabling Large Language Models to Generate Text with Citations
Tianyu Gao
Howard Yen
Jiatong Yu
Danqi Chen
LM&MA
HILM
67
336
0
24 May 2023
Does ChatGPT have Theory of Mind?
B. Holterman
Kees van Deemter
LRM
AI4CE
44
23
0
23 May 2023
GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts
Jessica Nayeli López Espejel
E. Ettifouri
Mahaman Sanoussi Yahaya Alassan
El Mehdi Chouham
Walid Dahhane
ELM
LRM
44
88
0
21 May 2023
StructGPT: A General Framework for Large Language Model to Reason over Structured Data
Jinhao Jiang
Kun Zhou
Zican Dong
Keming Ye
Wayne Xin Zhao
Ji-Rong Wen
LRM
LMTD
RALM
75
276
0
16 May 2023
Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
Jinyang Li
Binyuan Hui
Ge Qu
Jiaxi Yang
Binhua Li
...
Guoliang Li
Kevin C. C. Chang
Fei Huang
Reynold Cheng
Yongbin Li
LMTD
74
382
0
04 May 2023
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
Emre Kıcıman
Robert Osazuwa Ness
Amit Sharma
Chenhao Tan
LRM
ELM
65
269
0
28 Apr 2023
Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness
Bo Li
Gexiang Fang
Yang Yang
Quansen Wang
Wei Ye
Wen Zhao
Shikun Zhang
ELM
AI4MH
69
160
0
23 Apr 2023
Is ChatGPT Equipped with Emotional Dialogue Capabilities?
Weixiang Zhao
Yanyan Zhao
Xin Lu
Shilong Wang
Yanpeng Tong
Bing Qin
LLMAG
AI4MH
84
58
0
19 Apr 2023
Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation
Tao Fang
Shu Yang
Kaixin Lan
Derek F. Wong
Jinpeng Hu
Lidia S. Chao
Yue Zhang
AI4MH
LRM
ELM
KELM
46
108
0
04 Apr 2023
Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure
Philipp E. Koralus
Vincent Wang-Maścianica
LRM
15
13
0
30 Mar 2023
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
E. Azarnasab
Faisal Ahmed
Zicheng Liu
Ce Liu
Michael Zeng
Lijuan Wang
ReLM
KELM
LRM
53
372
0
20 Mar 2023
Capabilities of GPT-4 on Medical Challenge Problems
Harsha Nori
Nicholas King
S. McKinney
Dean Carignan
Eric Horvitz
LM&MA
ELM
AI4MH
64
786
0
20 Mar 2023
An Empirical Study of Pre-trained Language Models in Simple Knowledge Graph Question Answering
Nan Hu
Yike Wu
Guilin Qi
Dehai Min
Jiaoyan Chen
Jeff Z. Pan
Z. Ali
ELM
AI4MH
40
38
0
18 Mar 2023
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Jiaan Wang
Yunlong Liang
Fandong Meng
Zengkui Sun
Haoxiang Shi
Zhixu Li
Jinan Xu
Jianfeng Qu
Jie Zhou
LM&MA
ELM
ALM
AI4MH
82
458
0
07 Mar 2023
How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks
Xuanting Chen
Junjie Ye
Can Zu
Nuo Xu
Rui Zheng
Minlong Peng
Jie Zhou
Tao Gui
Qi Zhang
Xuanjing Huang
AI4MH
ELM
46
81
0
01 Mar 2023
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
Jindong Wang
Xixu Hu
Wenxin Hou
Hao Chen
Runkai Zheng
...
Weirong Ye
Xiubo Geng
Binxing Jiao
Yue Zhang
Xingxu Xie
AI4MH
83
227
0
22 Feb 2023
ChatGPT: Jack of all trades, master of none
Jan Kocoń
Igor Cichecki
Oliwier Kaszyca
Mateusz Kochanek
Dominika Szydło
...
Maciej Piasecki
Lukasz Radliński
Konrad Wojtasik
Stanislaw Woźniak
Przemyslaw Kazienko
AI4MH
77
544
0
21 Feb 2023
Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT
Qihuang Zhong
Liang Ding
Juhua Liu
Bo Du
Dacheng Tao
AI4MH
86
241
0
19 Feb 2023
Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
Fan Huang
Haewoon Kwak
Jisun An
AI4MH
45
260
0
11 Feb 2023
Mathematical Capabilities of ChatGPT
Simon Frieder
Luca Pinchetti
Alexis Chevalier
Ryan-Rhys Griffiths
Tommaso Salvatori
Thomas Lukasiewicz
P. Petersen
Julius Berner
ELM
AI4MH
94
412
0
31 Jan 2023
Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis
Ruibo Tu
Chao Ma
Cheng Zhang
ELM
CML
23
43
0
24 Jan 2023
Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text
Qianhui Wu
Huiqiang Jiang
Haonan Yin
Börje F. Karlsson
Chin-Yew Lin
72
10
0
21 Nov 2022
Large Language Models Meet Harry Potter: A Bilingual Dataset for Aligning Dialogue Agents with Characters
Nuo Chen
Yan Wang
Haiyun Jiang
Deng Cai
Yuhan Li
Ziyang Chen
Longyue Wang
Jia Li
39
8
0
13 Nov 2022
LaMDA: Language Models for Dialog Applications
R. Thoppilan
Daniel De Freitas
Jamie Hall
Noam M. Shazeer
Apoorv Kulshreshtha
...
Blaise Aguera-Arcas
Claire Cui
M. Croak
Ed H. Chi
Quoc Le
ALM
96
1,577
0
20 Jan 2022
SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
Bogdan Gliwa
Iwona Mochol
M. Biesek
A. Wawer
92
624
0
27 Nov 2019
Quantifying the Carbon Emissions of Machine Learning
Alexandre Lacoste
A. Luccioni
Victor Schmidt
Thomas Dandres
75
688
0
21 Oct 2019