ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2503.10789
  4. Cited By

Data Caricatures: On the Representation of African American Language in Pretraining Corpora

13 March 2025
Nicholas Deas
Blake Vente
Amith Ananthram
Jessica A. Grieser
D. Patton
Shana Kleiner
James Shepard
Kathleen McKeown
ArXiv (abs)PDFHTML

Papers citing "Data Caricatures: On the Representation of African American Language in Pretraining Corpora"

19 / 19 papers shown
Title
Finding A Voice: Evaluating African American Dialect Generation for Chatbot Technology
Finding A Voice: Evaluating African American Dialect Generation for Chatbot Technology
Sarah E. Finch
Ellie S. Paek
Sejung Kwon
Ikseon Choi
Jessica Wells
Rasheeta Chandler
Jinho D. Choi
129
3
0
08 Jan 2025
Linguistic Bias in ChatGPT: Language Models Reinforce Dialect
  Discrimination
Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
Eve Fleisig
G. Smith
Madeline Bossi
Ishita Rustagi
Xavier Yin
Dan Klein
88
28
0
13 Jun 2024
How to Train Data-Efficient LLMs
How to Train Data-Efficient LLMs
Noveen Sachdeva
Benjamin Coleman
Wang-Cheng Kang
Jianmo Ni
Lichan Hong
Ed H. Chi
James Caverlee
Julian McAuley
D. Cheng
85
64
0
15 Feb 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model
  Pretraining Research
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini
Rodney Michael Kinney
Akshita Bhagia
Dustin Schwenk
David Atkinson
...
Hanna Hajishirzi
Iz Beltagy
Dirk Groeneveld
Jesse Dodge
Kyle Lo
92
278
0
31 Jan 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu
Sewon Min
Luke Zettlemoyer
Yejin Choi
Hannaneh Hajishirzi
115
60
0
30 Jan 2024
Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in
  Low-Resource English Varieties
Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties
Tessa Masis
A. Neal
Lisa Green
Brendan O'Connor
61
9
0
15 Sep 2022
PaLM: Scaling Language Modeling with Pathways
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILMLRM
522
6,293
0
05 Apr 2022
What Does it Mean for a Language Model to Preserve Privacy?
What Does it Mean for a Language Model to Preserve Privacy?
Hannah Brown
Katherine Lee
Fatemehsadat Mireshghallah
Reza Shokri
Florian Tramèr
PILM
92
243
0
11 Feb 2022
Annotators with Attitudes: How Annotator Beliefs And Identities Bias
  Toxic Language Detection
Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
Maarten Sap
Swabha Swayamdipta
Laura Vianna
Xuhui Zhou
Yejin Choi
Noah A. Smith
81
283
0
15 Nov 2021
Mitigating Racial Biases in Toxic Language Detection with an
  Equity-Based Ensemble Framework
Mitigating Racial Biases in Toxic Language Detection with an Equity-Based Ensemble Framework
Matan Halevy
Camille Harris
A. Bruckman
Diyi Yang
A. Howard
93
36
0
27 Sep 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean
  Crawled Corpus
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
118
448
0
18 Apr 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
458
2,120
0
31 Dec 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
873
42,379
0
28 May 2020
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
Su Lin Blodgett
Solon Barocas
Hal Daumé
Hanna M. Wallach
157
1,248
0
28 May 2020
Demoting Racial Bias in Hate Speech Detection
Demoting Racial Bias in Hate Speech Detection
Mengzhou Xia
Anjalie Field
Yulia Tsvetkov
64
122
0
25 May 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
470
20,317
0
23 Oct 2019
Racial Bias in Hate Speech and Abusive Language Detection Datasets
Racial Bias in Hate Speech and Abusive Language Detection Datasets
Thomas Davidson
Debasmita Bhattacharya
Ingmar Weber
109
459
0
29 May 2019
Racial Disparity in Natural Language Processing: A Case Study of Social
  Media African-American English
Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English
Su Lin Blodgett
Brendan O'Connor
77
148
0
30 Jun 2017
Bag of Tricks for Efficient Text Classification
Bag of Tricks for Efficient Text Classification
Armand Joulin
Edouard Grave
Piotr Bojanowski
Tomas Mikolov
VLM
179
4,630
0
06 Jul 2016
1