ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2104.08758 · Cited By
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

18 April 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
    AILaw

Papers citing "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"

34 / 84 papers shown
Auditing large language models: a three-layered approach
Jakob Mökander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
48
196
0
16 Feb 2023
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
Alexandra Chronopoulou
Matthew E. Peters
Alexander Fraser
Jesse Dodge
MoMe
32
66
0
14 Feb 2023
Trustworthy Social Bias Measurement
Rishi Bommasani
Percy Liang
27
10
0
20 Dec 2022
Can Current Task-oriented Dialogue Models Automate Real-world Scenarios in the Wild?
Sang-Woo Lee
Sungdong Kim
Donghyeon Ko
Dong-hyun Ham
Youngki Hong
...
Wangkyo Jung
Kyunghyun Cho
Donghyun Kwak
H. Noh
W. Park
51
1
0
20 Dec 2022
Striving for data-model efficiency: Identifying data externalities on group performance
Esther Rolf
Ben Packer
Alex Beutel
Fernando Diaz
TDI
30
2
0
11 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
118
2,315
0
09 Nov 2022
Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs
Maarten Sap
Ronan Le Bras
Daniel Fried
Yejin Choi
27
209
0
24 Oct 2022
Self-Adaptive Named Entity Recognition by Retrieving Unstructured Knowledge
Kosuke Nishida
Naoki Yoshinaga
Kyosuke Nishida
30
2
0
14 Oct 2022
Noise-Robust De-Duplication at Scale
Emily Silcock
Luca D'Amico-Wong
Jinglin Yang
Melissa Dell
SyDa
39
20
0
09 Oct 2022
Re-contextualizing Fairness in NLP: The Case of India
Shaily Bhatt
Sunipa Dev
Partha P. Talukdar
Shachi Dave
Vinodkumar Prabhakaran
32
54
0
25 Sep 2022
Language models show human-like content effects on reasoning tasks
Ishita Dasgupta
Andrew Kyle Lampinen
Stephanie C. Y. Chan
Hannah R. Sheahan
Antonia Creswell
D. Kumaran
James L. McClelland
Felix Hill
ReLM
LRM
30
181
0
14 Jul 2022
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Peter Henderson
M. Krass
Lucia Zheng
Neel Guha
Christopher D. Manning
Dan Jurafsky
Daniel E. Ho
AILaw
ELM
131
97
0
01 Jul 2022
Towards WinoQueer: Developing a Benchmark for Anti-Queer Bias in Large Language Models
Virginia K. Felkner
Ho-Chun Herbert Chang
Eugene Jang
Jonathan May
OSLM
29
8
0
23 Jun 2022
Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias
Yarden Tal
Inbal Magar
Roy Schwartz
19
33
0
20 Jun 2022
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Maribeth Rauh
John F. J. Mellor
J. Uesato
Po-Sen Huang
Johannes Welbl
...
Amelia Glaese
G. Irving
Iason Gabriel
William S. Isaac
Lisa Anne Hendricks
33
49
0
16 Jun 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
31
4
0
24 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
99
26
0
13 May 2022
On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Roy Schwartz
Gabriel Stanovsky
37
26
0
27 Apr 2022
Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
Terra Blevins
Luke Zettlemoyer
40
85
0
17 Apr 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
99
802
0
14 Apr 2022
Revisiting Transformer-based Models for Long Document Classification
Xiang Dai
Ilias Chalkidis
S. Darkner
Desmond Elliott
VLM
18
68
0
14 Apr 2022
Considerations for Multilingual Wikipedia Research
Isaac Johnson
Emily A. Lescak
24
3
0
05 Apr 2022
Designing Word Filter Tools for Creator-led Comment Moderation
Shagun Jhaver
Quan Ze Chen
Detlef Knauss
Amy X. Zhang
13
78
0
17 Feb 2022
Efficient Hierarchical Domain Adaptation for Pretrained Language Models
Alexandra Chronopoulou
Matthew E. Peters
Jesse Dodge
33
42
0
16 Dec 2021
Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey
Bonan Min
Hayley L Ross
Elior Sulem
Amir Pouran Ben Veyseh
Thien Huu Nguyen
Oscar Sainz
Eneko Agirre
Ilana Heintz
Dan Roth
LM&MA
VLM
AI4CE
83
1,035
0
01 Nov 2021
Risks of AI Foundation Models in Education
Su Lin Blodgett
Michael A. Madaio
UQCV
29
14
0
19 Oct 2021
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation
Benjamin Muller
Luca Soldaini
Rik Koncel-Kedziorski
Eric Lind
Alessandro Moschitti
LRM
38
7
0
14 Oct 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
282
1,996
0
31 Dec 2020
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao
Adam Fisch
Danqi Chen
243
1,924
0
31 Dec 2020
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,824
0
14 Dec 2020
When does MAML Work the Best? An Empirical Study on Model-Agnostic Meta-Learning in NLP Applications
Zequn Liu
Ruiyi Zhang
Yiping Song
Wei Ju
Ming Zhang
38
8
0
24 May 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng
Kai-Wei Chang
Premkumar Natarajan
Nanyun Peng
223
618
0
03 Sep 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
299
6,984
0
20 Apr 2018