arXiv:2411.10842
CODECLEANER: Elevating Standards with A Robust Data Contamination Mitigation Toolkit
16 November 2024
Jialun Cao
Songqiang Chen
Wuqi Zhang
Hau Ching Lo
Shing-Chi Cheung
Papers citing
"CODECLEANER: Elevating Standards with A Robust Data Contamination Mitigation Toolkit"
17 / 17 papers shown
Concerned with Data Contamination? Assessing Countermeasures in Code Language Model
Jialun Cao
Wuqi Zhang
Shing-Chi Cheung
60
20
0
25 Mar 2024
Datasets for Large Language Models: A Comprehensive Survey
Yang Liu
Jiahuan Cao
Chongyu Liu
Kai Ding
Lianwen Jin
AILaw
72
70
0
28 Feb 2024
Refactoring Programs Using Large Language Models with Few-Shot Examples
Atsushi Shirafuji
Yusuke Oda
Jun Suzuki
Makoto Morishita
Yutaka Watanobe
64
38
0
20 Nov 2023
Detecting Pretraining Data from Large Language Models
Weijia Shi
Anirudh Ajith
Mengzhou Xia
Yangsibo Huang
Daogao Liu
Terra Blevins
Danqi Chen
Luke Zettlemoyer
MIALM
93
201
0
25 Oct 2023
Membership Inference Attacks against Language Models via Neighbourhood Comparison
Justus Mattern
Fatemehsadat Mireshghallah
Zhijing Jin
Bernhard Schölkopf
Mrinmaya Sachan
Taylor Berg-Kirkpatrick
MIALM
98
190
0
29 May 2023
Data Portraits: Recording Foundation Model Training Data
Marc Marone
Benjamin Van Durme
203
30
0
06 Mar 2023
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Kushal Tirumala
Aram H. Markosyan
Luke Zettlemoyer
Armen Aghajanyan
TDI
112
197
0
22 May 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
Jeff Dean
Slav Petrov
Noah Fiedel
PILM
LRM
535
6,301
0
05 Apr 2022
Program Synthesis with Large Language Models
Jacob Austin
Augustus Odena
Maxwell Nye
Maarten Bosma
Henryk Michalewski
...
Ellen Jiang
Carrie J. Cai
Michael Terry
Quoc V. Le
Charles Sutton
ELM
AIMat
ReCod
ALM
216
2,009
0
16 Aug 2021
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
238
5,675
0
07 Jul 2021
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
Dawn Song
Jacob Steinhardt
ELM
AIMat
ALM
274
707
0
20 May 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
122
452
0
18 Apr 2021
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
Dawn Song
Úlfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
517
1,956
0
14 Dec 2020
SemMT: A Semantic-based Testing Approach for Machine Translation Systems
Jialun Cao
Meiziniu Li
Yeting Li
Ming Wen
Haiming Chen
68
35
0
03 Dec 2020
It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
Timo Schick
Hinrich Schütze
132
976
0
15 Sep 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
904
42,463
0
28 May 2020
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
Nicholas Carlini
Chang Liu
Úlfar Erlingsson
Jernej Kos
Dawn Song
165
1,150
0
22 Feb 2018