Consent in Crisis: The Rapid Decline of the AI Data Commons

20 July 2024
Shayne Longpre
Robert Mahari
Ariel N. Lee
Campbell Lund
Hamidah Oderinwale
William Brannon
Nayan Saxena
Naana Obeng-Marnu
Tobin South
Cole J. Hunter
Kevin Klyman
Christopher Klamm
Hailey Schoelkopf
Nikhil Singh
Manuel Cherep
Ahmad Anis
An Dinh
Caroline Chitongo
Da Yin
Damien Sileo
Deividas Mataciunas
Diganta Misra
Emad A. Alghamdi
Enrico Shippole
Jianguo Zhang
Joanna Materzynska
Kun Qian
Kush Tiwary
Lester James V. Miranda
Manan Dey
Minnie Liang
Mohammed Hamdy
Niklas Muennighoff
Seonghyeon Ye
Seungone Kim
Shrestha Mohanty
Vipul Gupta
Vivek Sharma
Vu Minh Chien
Xuhui Zhou
Yizhi Li
Caiming Xiong
Luis Villa
Stella Biderman
Hanlin Li
Daphne Ippolito
Sara Hooker
Jad Kabbara
Sandy Pentland
arXiv: 2407.14933

Papers citing "Consent in Crisis: The Rapid Decline of the AI Data Commons"

36 / 36 papers shown
A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Jean-Philippe Corbeil
Amin Dada
Jean-Michel Attendu
Asma Ben Abacha
Alessandro Sordoni
Lucas Caccia
François Beaulieu
Thomas Lin
Jens Kleesiek
Paul Vozila
LM&MA
87
0
0
15 May 2025
Beyond Public Access in LLM Pre-Training Data
Sruly Rosenblat
Tim O'Reilly
Ilan Strauss
MLAU
110
0
0
24 Apr 2025
Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
Judy Hanwen Shen
Carlos Guestrin
132
1
0
09 Apr 2025
How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Ruohao Guo
Wei Xu
Alan Ritter
64
3
0
12 Mar 2025
Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman
Rishi Bommasani
Dan Hendrycks
Ariel Herbert-Voss
Yacine Jernite
Aviya Skowron
Andrew Trask
140
1
0
23 Feb 2025
SoK: Decentralized AI (DeAI)
Zhipeng Wang
Rui Sun
Elizabeth Lui
Vatsal Shah
Xihan Xiong
Jiahao Sun
Davide Crapis
William Knottenbelt
160
1
0
26 Nov 2024
DAWN-ICL: Strategic Planning of Problem-solving Trajectories for Zero-Shot In-Context Learning
Xinyu Tang
Xiaolei Wang
Wayne Xin Zhao
Ji-Rong Wen
85
5
0
26 Oct 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
218
2
0
26 Oct 2024
Acceptable Use Policies for Foundation Models
Kevin Klyman
60
14
0
29 Aug 2024
Contrastive Learning from Synthetic Audio Doppelgängers
Manuel Cherep
Nikhil Singh
64
1
0
09 Jun 2024
YODAS: Youtube-Oriented Dataset for Audio and Speech
Xinjian Li
Shinnosuke Takamichi
Takaaki Saeki
William Chen
Sayaka Shiota
Shinji Watanabe
73
20
0
02 Jun 2024
WildChat: 1M ChatGPT Interaction Logs in the Wild
Wenting Zhao
Xiang Ren
Jack Hessel
Claire Cardie
Yejin Choi
Yuntian Deng
67
208
0
02 May 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini
Rodney Michael Kinney
Akshita Bhagia
Dustin Schwenk
David Atkinson
...
Hanna Hajishirzi
Iz Beltagy
Dirk Groeneveld
Jesse Dodge
Kyle Lo
78
265
0
31 Jan 2024
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
Sewon Min
Suchin Gururangan
Eric Wallace
Hannaneh Hajishirzi
Noah A. Smith
Luke Zettlemoyer
AILaw
61
66
0
08 Aug 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
156
1,583
0
15 Dec 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
317
2,364
0
09 Nov 2022
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao
Thomas Wang
Daniel Hesslow
Lucile Saulnier
Stas Bekman
...
Lintang Sutawika
Jaesung Tae
Zheng-Xin Yong
Julien Launay
Iz Beltagy
MoE
AI4CE
250
107
0
27 Oct 2022
Interactive Model Cards: A Human-Centered Approach to Model Documentation
Anamaria Crisan
Margaret Drouhard
Jesse Vig
Nazneen Rajani
HAI
59
89
0
05 May 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
143
820
0
14 Apr 2022
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Mahima Pushkarna
Andrew Zaldivar
Oddur Kjartansson
AI4TS
69
209
0
03 Apr 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
146
1,915
0
29 Mar 2022
Quantifying Memorization Across Neural Language Models
Nicholas Carlini
Daphne Ippolito
Matthew Jagielski
Katherine Lee
Florian Tramèr
Chiyuan Zhang
PILM
93
603
0
15 Feb 2022
Datasheet for the Pile
Stella Biderman
Kieran Bicheno
Leo Gao
78
35
0
13 Jan 2022
Ethical and social risks of harm from Language Models
Laura Weidinger
John F. J. Mellor
Maribeth Rauh
Conor Griffin
J. Uesato
...
Lisa Anne Hendricks
William S. Isaac
Sean Legassick
G. Irving
Iason Gabriel
PILM
73
1,009
0
08 Dec 2021
Datasets: A Community Library for Natural Language Processing
Quentin Lhoest
Albert Villanova del Moral
Yacine Jernite
A. Thakur
Patrick von Platen
...
Thibault Goehringer
Victor Mustar
François Lagunas
Alexander M. Rush
Thomas Wolf
155
596
0
07 Sep 2021
Understanding Gender and Racial Disparities in Image Recognition Models
Rohan Mahadev
Anindya Chakravarti
FaML
16
3
0
20 Jul 2021
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
Guoguo Chen
Shuzhou Chai
Guan-Bo Wang
Jiayu Du
Weiqiang Zhang
...
Xuchen Yao
Yongqing Wang
Yujun Wang
Zhao You
Zhiyong Yan
93
360
0
13 Jun 2021
Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Jack Bandy
Nicholas Vincent
58
57
0
11 May 2021
What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus
A. Luccioni
J. Viviano
56
116
0
06 May 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
72
437
0
18 Apr 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
49
276
0
22 Mar 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
409
2,051
0
31 Dec 2020
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
103
1,192
0
07 Jun 2019
Datasheets for Datasets
Timnit Gebru
Jamie Morgenstern
Briana Vecchione
Jennifer Wortman Vaughan
Hanna M. Wallach
Hal Daumé
Kate Crawford
231
2,158
0
23 Mar 2018
Annotation Artifacts in Natural Language Inference Data
Suchin Gururangan
Swabha Swayamdipta
Omer Levy
Roy Schwartz
Samuel R. Bowman
Noah A. Smith
117
1,167
0
06 Mar 2018
Moments in Time Dataset: one million videos for event understanding
Mathew Monfort
A. Andonian
Bolei Zhou
K. Ramakrishnan
Sarah Adel Bargal
...
L. Brown
Quanfu Fan
Dan Gutfreund
Carl Vondrick
A. Oliva
90
543
0
09 Jan 2018