What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

6 May 2021
A. Luccioni
J. Viviano

Papers citing "What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus"

18 / 68 papers shown
BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Rifat Shahriyar
AIMat
LM&MA
39
28
0
23 May 2022
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Yacine Jernite
Huu Nguyen
Stella Biderman
A. Rogers
Maraim Masoud
...
Jorg Frohberg
Aaron Gokaslan
Peter Henderson
Rishi Bommasani
Margaret Mitchell
26
52
0
04 May 2022
PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained Language Model
Fei Mi
Yitong Li
Yulong Zeng
Jingyan Zhou
Yasheng Wang
Chuanfei Xu
Lifeng Shang
Xin Jiang
Shiqi Zhao
Qun Liu
ALM
45
18
0
31 Mar 2022
Probing Pre-Trained Language Models for Cross-Cultural Differences in Values
Arnav Arora
Lucie-Aimée Kaffee
Isabelle Augenstein
VLM
34
123
0
25 Mar 2022
Impact of Pretraining Term Frequencies on Few-Shot Reasoning
Yasaman Razeghi
Robert L Logan IV
Matt Gardner
Sameer Singh
ReLM
LRM
32
150
0
15 Feb 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann
Elizabeth Clark
Thibault Sellam
ELM
AI4CE
69
184
0
14 Feb 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Angelina McMillan-Major
Zaid Alyafeai
Stella Biderman
Kimbo Chen
F. Toni
...
Aitor Soroa Etxabe
Pedro Ortiz Suarez
Zeerak Talat
Daniel Alexander van Strien
Yacine Jernite
40
14
0
25 Jan 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
39
153
0
17 Jan 2022
CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Yuan-Fang Li
Yong-Bin Kang
Rifat Shahriyar
RALM
ELM
40
37
0
16 Dec 2021
Est-ce que vous compute? Code-switching, cultural identity, and AI
Arianna Falbo
Travis LaCroix
16
8
0
15 Dec 2021
A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication
A. Luccioni
Frances Corry
H. Sridharan
Mike Ananny
J. Schultz
Kate Crawford
54
29
0
18 Oct 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
34
425
0
18 Apr 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
20
267
0
22 Mar 2021
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick
Sahana Udupa
Hinrich Schütze
259
374
0
28 Feb 2021
BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla
Abhik Bhattacharjee
Tahmid Hasan
Wasi Uddin Ahmad
Kazi Samin Mubasshir
Md. Saiful Islam
Anindya Iqbal
M. Rahman
Rifat Shahriyar
SSL
VLM
25
166
0
01 Jan 2021
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,824
0
14 Dec 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
223
618
0
03 Sep 2019