Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.01481
Cited By
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
2 June 2023
Aleksandra Piktus
Odunayo Ogundepo
Christopher Akiki
Akintunde Oladipo
Xinyu Crystina Zhang
Hailey Schoelkopf
Stella Biderman
Martin Potthast
Jimmy J. Lin
CVBM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration"
6 / 6 papers shown
Title
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou
Yubo Chen
Pengfei Cao
Kang Liu
Jun Zhao
Shengping Liu
29
3
0
21 Nov 2023
Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers
Odunayo Ogundepo
Xinyu Crystina Zhang
Jimmy J. Lin
20
2
0
11 Oct 2022
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
593
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
282
1,996
0
31 Dec 2020
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,824
0
14 Dec 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020
1