2407.16607
Cited By
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
23 July 2024
J. Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
Papers citing "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?"
16 / 16 papers shown
Title
SuperBPE: Space Travel for Language Models
Alisa Liu, J. Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi
96 | 7 | 0 | 17 Mar 2025

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
154 | 1 | 0 | 18 Dec 2024

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
59 | 15 | 0 | 15 Mar 2024

Coercing LLMs to do and reveal (almost) anything
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
AAML
67 | 48 | 0 | 21 Feb 2024

Detecting Pretraining Data from Large Language Models
Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
MIALM
59 | 183 | 0 | 25 Oct 2023

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, ..., Denny Zhou, Jason W. Wei, Kevin Robinson, David M. Mimno, Daphne Ippolito
78 | 160 | 0 | 22 May 2023

GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black, Stella Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, ..., Shivanshu Purohit, Laria Reynolds, J. Tow, Benqi Wang, Samuel Weinbach
151 | 824 | 0 | 14 Apr 2022

Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks
Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, Reza Shokri
MIALM
50 | 161 | 0 | 08 Mar 2022

Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, Chiyuan Zhang
PILM
100 | 614 | 0 | 15 Feb 2022

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat
425 | 2,081 | 0 | 31 Dec 2020

Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
MLAU, SILM
432 | 1,906 | 0 | 14 Dec 2020

Label-Only Membership Inference Attacks
Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot
MIACV, MIALM
78 | 505 | 0 | 28 Jul 2020

Exploiting Unintended Feature Leakage in Collaborative Learning
Luca Melis, Congzheng Song, Emiliano De Cristofaro, Vitaly Shmatikov
FedML
136 | 1,471 | 0 | 10 May 2018

Membership Inference Attacks against Machine Learning Models
Reza Shokri, M. Stronati, Congzheng Song, Vitaly Shmatikov
SLR, MIALM, MIACV
228 | 4,103 | 0 | 18 Oct 2016

Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, Alexandra Birch
191 | 7,729 | 0 | 31 Aug 2015

Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers
G. Ateniese, G. Felici, L. Mancini, A. Spognardi, Antonio Villani, Domenico Vitali
72 | 459 | 0 | 19 Jun 2013