Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

23 July 2024

Yejin Choi

Papers citing "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?"

Title | Authors | Tags | Counts | Date
SuperBPE: Space Travel for Language Models | Alisa Liu, J. Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi | | 96 7 0 | 17 Mar 2025
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation | Vera Neplenbroek, Arianna Bisazza, Raquel Fernández | | 154 1 0 | 18 Dec 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling | Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer | | 59 15 0 | 15 Mar 2024
Coercing LLMs to do and reveal (almost) anything | Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein | AAML | 67 48 0 | 21 Feb 2024
Detecting Pretraining Data from Large Language Models | Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer | MIALM | 59 183 0 | 25 Oct 2023
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity | Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, ..., Denny Zhou, Jason W. Wei, Kevin Robinson, David M. Mimno, Daphne Ippolito | | 78 160 0 | 22 May 2023
GPT-NeoX-20B: An Open-Source Autoregressive Language Model | Sid Black, Stella Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, ..., Shivanshu Purohit, Laria Reynolds, J. Tow, Benqi Wang, Samuel Weinbach | | 151 824 0 | 14 Apr 2022
Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks | Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, Reza Shokri | MIALM | 50 161 0 | 08 Mar 2022
Quantifying Memorization Across Neural Language Models | Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, Chiyuan Zhang | PILM | 100 614 0 | 15 Feb 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy | AIMat | 425 2,081 0 | 31 Dec 2020
Extracting Training Data from Large Language Models | Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel | MLAU, SILM | 432 1,906 0 | 14 Dec 2020
Label-Only Membership Inference Attacks | Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot | MIACV, MIALM | 78 505 0 | 28 Jul 2020
Exploiting Unintended Feature Leakage in Collaborative Learning | Luca Melis, Congzheng Song, Emiliano De Cristofaro, Vitaly Shmatikov | FedML | 136 1,471 0 | 10 May 2018
Membership Inference Attacks against Machine Learning Models | Reza Shokri, M. Stronati, Congzheng Song, Vitaly Shmatikov | SLR, MIALM, MIACV | 228 4,103 0 | 18 Oct 2016
Neural Machine Translation of Rare Words with Subword Units | Rico Sennrich, Barry Haddow, Alexandra Birch | | 191 7,729 0 | 31 Aug 2015
Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers | G. Ateniese, G. Felici, L. Mancini, A. Spognardi, Antonio Villani, Domenico Vitali | | 72 459 0 | 19 Jun 2013