ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2202.06539
  4. Cited By
Deduplicating Training Data Mitigates Privacy Risks in Language Models

Deduplicating Training Data Mitigates Privacy Risks in Language Models

14 February 2022
Nikhil Kandpal
Eric Wallace
Colin Raffel
    PILM
    MU
ArXivPDFHTML

Papers citing "Deduplicating Training Data Mitigates Privacy Risks in Language Models"

50 / 212 papers shown
Title
Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft
  Prompting and Calibrated Confidence Estimation
Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation
Zhexin Zhang
Jiaxin Wen
Minlie Huang
38
29
0
10 Jul 2023
CHORUS: Foundation Models for Unified Data Discovery and Exploration
CHORUS: Foundation Models for Unified Data Discovery and Exploration
Moe Kayali
A. Lykov
Ilias Fountalis
N. Vasiloglou
Dan Olteanu
Dan Suciu
25
21
0
16 Jun 2023
Inverse Scaling: When Bigger Isn't Better
Inverse Scaling: When Bigger Isn't Better
I. R. McKenzie
Alexander Lyzhov
Michael Pieler
Alicia Parrish
Aaron Mueller
...
Yuhui Zhang
Zhengping Zhou
Najoung Kim
Sam Bowman
Ethan Perez
27
126
0
15 Jun 2023
Membership Inference Attacks against Language Models via Neighbourhood
  Comparison
Membership Inference Attacks against Language Models via Neighbourhood Comparison
Justus Mattern
Fatemehsadat Mireshghallah
Zhijing Jin
Bernhard Schölkopf
Mrinmaya Sachan
Taylor Berg-Kirkpatrick
MIALM
23
167
0
29 May 2023
Scaling Data-Constrained Language Models
Scaling Data-Constrained Language Models
Niklas Muennighoff
Alexander M. Rush
Boaz Barak
Teven Le Scao
Aleksandra Piktus
Nouamane Tazi
S. Pyysalo
Thomas Wolf
Colin Raffel
ALM
35
200
0
25 May 2023
Training Data Extraction From Pre-trained Language Models: A Survey
Training Data Extraction From Pre-trained Language Models: A Survey
Shotaro Ishihara
26
46
0
25 May 2023
Quantifying Association Capabilities of Large Language Models and Its
  Implications on Privacy Leakage
Quantifying Association Capabilities of Large Language Models and Its Implications on Privacy Leakage
Hanyin Shao
Jie Huang
Shen Zheng
Kevin Chen-Chuan Chang
PILM
22
25
0
22 May 2023
The Role of Data Curation in Image Captioning
The Role of Data Curation in Image Captioning
Wenyan Li
Jonas F. Lotz
Chen Qiu
Desmond Elliott
DiffM
37
6
0
05 May 2023
Mitigating Approximate Memorization in Language Models via Dissimilarity
  Learned Policy
Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy
Aly M. Kassem
26
2
0
02 May 2023
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Kent K. Chang
Mackenzie Cramer
Sandeep Soni
David Bamman
RALM
145
111
0
28 Apr 2023
Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in
  Self-supervised Learning
Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Casey Meehan
Florian Bordes
Pascal Vincent
Kamalika Chaudhuri
Chuan Guo
34
18
0
26 Apr 2023
Towards Mode Balancing of Generative Models via Diversity Weights
Towards Mode Balancing of Generative Models via Diversity Weights
Sebastian Berns
S. Colton
Christian Guckelsberger
21
6
0
24 Apr 2023
Emergent and Predictable Memorization in Large Language Models
Emergent and Predictable Memorization in Large Language Models
Stella Biderman
USVSN Sai Prashanth
Lintang Sutawika
Hailey Schoelkopf
Quentin G. Anthony
Shivanshu Purohit
Edward Raf
29
116
0
21 Apr 2023
An Evaluation on Large Language Model Outputs: Discourse and
  Memorization
An Evaluation on Large Language Model Outputs: Discourse and Memorization
Adrian de Wynter
Xun Wang
Alex Sokolov
Qilong Gu
Si-Qing Chen
ELM
84
32
0
17 Apr 2023
Foundation Models and Fair Use
Foundation Models and Fair Use
Peter Henderson
Xuechen Li
Dan Jurafsky
Tatsunori Hashimoto
Mark A. Lemley
Percy Liang
33
119
0
28 Mar 2023
Language Model Behavior: A Comprehensive Survey
Language Model Behavior: A Comprehensive Survey
Tyler A. Chang
Benjamin Bergen
VLM
LRM
LM&MA
27
103
0
20 Mar 2023
SemDeDup: Data-efficient learning at web-scale through semantic
  deduplication
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas
Kushal Tirumala
Daniel Simig
Surya Ganguli
Ari S. Morcos
23
162
0
16 Mar 2023
Secret-Keeping in Question Answering
Secret-Keeping in Question Answering
Nathaniel W. Rollings
Kent O'Sullivan
Sakshum Kulshrestha
KELM
30
0
0
16 Mar 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurenccon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
...
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
AI4CE
AILaw
44
163
0
07 Mar 2023
A Pathway Towards Responsible AI Generated Content
A Pathway Towards Responsible AI Generated Content
Chen Chen
Jie Fu
Lingjuan Lyu
49
71
0
02 Mar 2023
The ROOTS Search Tool: Data Transparency for LLMs
The ROOTS Search Tool: Data Transparency for LLMs
Aleksandra Piktus
Christopher Akiki
Paulo Villegas
Hugo Laurenccon
Gérard Dupont
A. Luccioni
Yacine Jernite
Anna Rogers
VLM
35
29
0
27 Feb 2023
On Provable Copyright Protection for Generative Models
On Provable Copyright Protection for Generative Models
Nikhil Vyas
Sham Kakade
Boaz Barak
24
87
0
21 Feb 2023
Bounding Training Data Reconstruction in DP-SGD
Bounding Training Data Reconstruction in DP-SGD
Jamie Hayes
Saeed Mahloujifar
Borja Balle
AAML
FedML
33
39
0
14 Feb 2023
Bag of Tricks for Training Data Extraction from Language Models
Bag of Tricks for Training Data Extraction from Language Models
Weichen Yu
Tianyu Pang
Qian Liu
Chao Du
Bingyi Kang
Yan Huang
Min-Bin Lin
Shuicheng Yan
21
47
0
09 Feb 2023
Machine Learning for Synthetic Data Generation: A Review
Machine Learning for Synthetic Data Generation: A Review
Ying-Cheng Lu
Minjie Shen
Huazheng Wang
Xiao Wang
Capucine Van Rechem
Tianfan Fu
Wenqi Wei
SyDa
42
140
0
08 Feb 2023
Analyzing Leakage of Personally Identifiable Information in Language
  Models
Analyzing Leakage of Personally Identifiable Information in Language Models
Nils Lukas
A. Salem
Robert Sim
Shruti Tople
Lukas Wutschitz
Santiago Zanella Béguelin
PILM
24
211
0
01 Feb 2023
FLAME: A small language model for spreadsheet formulas
FLAME: A small language model for spreadsheet formulas
Harshit Joshi
Abishai Ebenezer
J. Cambronero
Sumit Gulwani
Aditya Kanade
Vu Le
Ivan Radivcek
Gust Verbruggen
LMTD
37
12
0
31 Jan 2023
Extracting Training Data from Diffusion Models
Extracting Training Data from Diffusion Models
Nicholas Carlini
Jamie Hayes
Milad Nasr
Matthew Jagielski
Vikash Sehwag
Florian Tramèr
Borja Balle
Daphne Ippolito
Eric Wallace
DiffM
63
569
0
30 Jan 2023
Generalization on the Unseen, Logic Reasoning and Degree Curriculum
Generalization on the Unseen, Logic Reasoning and Degree Curriculum
Emmanuel Abbe
Samy Bengio
Aryo Lotfi
Kevin Rizk
LRM
39
49
0
30 Jan 2023
Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and
  Toxicity
Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
Terry Yue Zhuo
Yujin Huang
Chunyang Chen
Zhenchang Xing
SILM
33
102
0
30 Jan 2023
Training Data Influence Analysis and Estimation: A Survey
Training Data Influence Analysis and Estimation: A Survey
Zayd Hammoudeh
Daniel Lowd
TDI
29
82
0
09 Dec 2022
The Stack: 3 TB of permissively licensed source code
The Stack: 3 TB of permissively licensed source code
Denis Kocetkov
Raymond Li
Loubna Ben Allal
Jia Li
Chenghao Mou
...
Sean M. Hughes
Thomas Wolf
Dzmitry Bahdanau
Leandro von Werra
H. D. Vries
58
308
0
20 Nov 2022
Large Language Models Struggle to Learn Long-Tail Knowledge
Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal
H. Deng
Adam Roberts
Eric Wallace
Colin Raffel
RALM
KELM
41
382
0
15 Nov 2022
Preventing Verbatim Memorization in Language Models Gives a False Sense
  of Privacy
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy
Daphne Ippolito
Florian Tramèr
Milad Nasr
Chiyuan Zhang
Matthew Jagielski
Katherine Lee
Christopher A. Choquette-Choo
Nicholas Carlini
PILM
MU
23
58
0
31 Oct 2022
You can't pick your neighbors, or can you? When and how to rely on
  retrieval in the $k$NN-LM
You can't pick your neighbors, or can you? When and how to rely on retrieval in the kkkNN-LM
Andrew Drozdov
Shufan Wang
Razieh Rahimi
Andrew McCallum
Hamed Zamani
Mohit Iyyer
RALM
119
17
0
28 Oct 2022
Synthetic Text Generation with Differential Privacy: A Simple and
  Practical Recipe
Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe
Xiang Yue
Huseyin A. Inan
Xuechen Li
Girish Kumar
Julia McAnallen
Hoda Shajari
Huan Sun
David Levitan
Robert Sim
58
79
0
25 Oct 2022
Language Generation Models Can Cause Harm: So What Can We Do About It?
  An Actionable Survey
Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar
Vidhisha Balachandran
Lucille Njoo
Antonios Anastasopoulos
Yulia Tsvetkov
ELM
77
85
0
14 Oct 2022
Noise-Robust De-Duplication at Scale
Noise-Robust De-Duplication at Scale
Emily Silcock
Luca DÁmico-Wong
Jinglin Yang
Melissa Dell
SyDa
33
20
0
09 Oct 2022
Knowledge Unlearning for Mitigating Privacy Risks in Language Models
Knowledge Unlearning for Mitigating Privacy Risks in Language Models
Joel Jang
Dongkeun Yoon
Sohee Yang
Sungmin Cha
Moontae Lee
Lajanugen Logeswaran
Minjoon Seo
KELM
PILM
MU
147
191
0
04 Oct 2022
WeLM: A Well-Read Pre-trained Language Model for Chinese
WeLM: A Well-Read Pre-trained Language Model for Chinese
Hui Su
Xiao Zhou
Houjin Yu
Xiaoyu Shen
Yuwen Chen
Zilin Zhu
Yang Yu
Jie Zhou
37
23
0
21 Sep 2022
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq
  Model
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Saleh Soltan
Shankar Ananthakrishnan
Jack G. M. FitzGerald
Rahul Gupta
Wael Hamza
...
Mukund Sridhar
Fabian Triefenbach
Apurv Verma
Gokhan Tur
Premkumar Natarajan
54
82
0
02 Aug 2022
Combing for Credentials: Active Pattern Extraction from Smart Reply
Combing for Credentials: Active Pattern Extraction from Smart Reply
Bargav Jayaraman
Esha Ghosh
Melissa Chase
Sambuddha Roy
Wei Dai
David E. Evans
SILM
20
8
0
14 Jul 2022
Pile of Law: Learning Responsible Data Filtering from the Law and a
  256GB Open-Source Legal Dataset
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Peter Henderson
M. Krass
Lucia Zheng
Neel Guha
Christopher D. Manning
Dan Jurafsky
Daniel E. Ho
AILaw
ELM
131
97
0
01 Jul 2022
Measuring Forgetting of Memorized Training Examples
Measuring Forgetting of Memorized Training Examples
Matthew Jagielski
Om Thakkar
Florian Tramèr
Daphne Ippolito
Katherine Lee
...
Eric Wallace
Shuang Song
Abhradeep Thakurta
Nicolas Papernot
Chiyuan Zhang
TDI
56
102
0
30 Jun 2022
Emergent Abilities of Large Language Models
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
60
2,344
0
15 Jun 2022
Privacy Leakage in Text Classification: A Data Extraction Approach
Privacy Leakage in Text Classification: A Data Extraction Approach
Adel M. Elmahdy
Huseyin A. Inan
Robert Sim
27
13
0
09 Jun 2022
Provably Confidential Language Modelling
Provably Confidential Language Modelling
Xuandong Zhao
Lei Li
Yu-Xiang Wang
MU
21
15
0
04 May 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
90
801
0
14 Apr 2022
InCoder: A Generative Model for Code Infilling and Synthesis
InCoder: A Generative Model for Code Infilling and Synthesis
Daniel Fried
Armen Aghajanyan
Jessy Lin
Sida I. Wang
Eric Wallace
Freda Shi
Ruiqi Zhong
Wen-tau Yih
Luke Zettlemoyer
M. Lewis
SyDa
28
626
0
12 Apr 2022
PaLM: Scaling Language Modeling with Pathways
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
94
6,015
0
05 Apr 2022
Previous
12345
Next