arXiv:2105.02732
What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus
6 May 2021
A. Luccioni
J. Viviano
Papers citing "What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus"
18 / 68 papers shown
BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Rifat Shahriyar
AIMat
LM&MA
39
28
0
23 May 2022
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Yacine Jernite
Huu Nguyen
Stella Biderman
A. Rogers
Maraim Masoud
...
Jorg Frohberg
Aaron Gokaslan
Peter Henderson
Rishi Bommasani
Margaret Mitchell
26
52
0
04 May 2022
PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained Language Model
Fei Mi
Yitong Li
Yulong Zeng
Jingyan Zhou
Yasheng Wang
Chuanfei Xu
Lifeng Shang
Xin Jiang
Shiqi Zhao
Qun Liu
ALM
45
18
0
31 Mar 2022
Probing Pre-Trained Language Models for Cross-Cultural Differences in Values
Arnav Arora
Lucie-Aimée Kaffee
Isabelle Augenstein
VLM
34
123
0
25 Mar 2022
Impact of Pretraining Term Frequencies on Few-Shot Reasoning
Yasaman Razeghi
Robert L Logan IV
Matt Gardner
Sameer Singh
ReLM
LRM
32
150
0
15 Feb 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann
Elizabeth Clark
Thibault Sellam
ELM
AI4CE
69
184
0
14 Feb 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Angelina McMillan-Major
Zaid Alyafeai
Stella Biderman
Kimbo Chen
F. Toni
...
Aitor Soroa Etxabe
Pedro Ortiz Suarez
Zeerak Talat
Daniel Alexander van Strien
Yacine Jernite
40
14
0
25 Jan 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
39
153
0
17 Jan 2022
CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Yuan-Fang Li
Yong-Bin Kang
Rifat Shahriyar
RALM
ELM
40
37
0
16 Dec 2021
Est-ce que vous compute? Code-switching, cultural identity, and AI
Arianna Falbo
Travis LaCroix
16
8
0
15 Dec 2021
A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication
A. Luccioni
Frances Corry
H. Sridharan
Mike Ananny
J. Schultz
Kate Crawford
54
29
0
18 Oct 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
34
425
0
18 Apr 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
20
267
0
22 Mar 2021
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick
Sahana Udupa
Hinrich Schütze
259
374
0
28 Feb 2021
BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Kazi Samin Mubasshir
Md. Saiful Islam
Anindya Iqbal
M. Rahman
Rifat Shahriyar
SSL
VLM
25
166
0
01 Jan 2021
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,824
0
14 Dec 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
223
618
0
03 Sep 2019