Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2109.00698
Cited By
v1
v2 (latest)
An Empirical Exploration in Quality Filtering of Text Data
2 September 2021
Leo Gao
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"An Empirical Exploration in Quality Filtering of Text Data"
6 / 6 papers shown
Title
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar
Shrimai Prabhumoye
Pritam Gundecha
Bo Liu
Aastha Jhunjhunwala
Zhilin Wang
M. Patwary
Mohammad Shoeybi
Bryan Catanzaro
122
10
0
08 Jul 2024
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
115
34
0
30 May 2024
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
L. Lucy
Suchin Gururangan
Luca Soldaini
Emma Strubell
David Bamman
Lauren Klein
Jesse Dodge
117
17
0
12 Jan 2024
Will we run out of data? Limits of LLM scaling based on human-generated data
Pablo Villalobos
A. Ho
J. Sevilla
T. Besiroglu
Lennart Heim
Marius Hobbhahn
ALM
102
125
0
26 Oct 2022
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Yamini Bansal
Behrooz Ghorbani
Ankush Garg
Biao Zhang
M. Krikun
Colin Cherry
Behnam Neyshabur
Orhan Firat
108
49
0
04 Feb 2022
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan
Dallas Card
Sarah K. Drier
E. K. Gade
Leroy Z. Wang
Zeyu Wang
Luke Zettlemoyer
Noah A. Smith
270
81
0
25 Jan 2022
1