ResearchTrend.AI
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

25 October 2023
Shayne Longpre
Robert Mahari
Anthony Chen
Naana Obeng-Marnu
Damien Sileo
William Brannon
Niklas Muennighoff
Nathan Khazam
Jad Kabbara
Kartik Perisetla
Xinyi Wu
Enrico Shippole
K. Bollacker
Tongshuang Wu
Luis Villa
Sandy Pentland
Sara Hooker

Papers citing "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI"

39 / 39 papers shown
TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation
Wiebke Hutiri
Mircea Cimpoi
M. Scheuerman
Victoria Matthews
Alice Xiang
149
0
0
23 May 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
155
2
0
24 Feb 2025
Ward: Provable RAG Dataset Inference via LLM Watermarks
Nikola Jovanović
Robin Staab
Maximilian Baader
Martin Vechev
445
4
0
04 Oct 2024
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia
Rahmad Mahendra
Salsabil Maulana Akbar
Lester James V. Miranda
Jennifer Santoso
...
Genta Indra Winata
Ruochen Zhang
Fajri Koto
Zheng-Xin Yong
Samuel Cahyawijaya
194
14
0
14 Jun 2024
OctoPack: Instruction Tuning Code Large Language Models
Niklas Muennighoff
Qian Liu
A. Zebaze
Qinkai Zheng
Binyuan Hui
Terry Yue Zhuo
Swayam Singh
Xiangru Tang
Leandro von Werra
Shayne Longpre
VLM, ALM
119
136
0
14 Aug 2023
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
Sewon Min
Suchin Gururangan
Eric Wallace
Hannaneh Hajishirzi
Noah A. Smith
Luke Zettlemoyer
AILaw
83
67
0
08 Aug 2023
Large Language Models
Michael R Douglas
LLMAG, LM&MA
140
642
0
11 Jul 2023
Are ChatGPT and Other Similar Systems the Modern Lernaean Hydras of AI?
Dimitrios Ioannidis
J. Kepner
Andrew Bowne
Harriet S. Bryant
25
1
0
15 Jun 2023
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Ziyang Luo
Can Xu
Pu Zhao
Qingfeng Sun
Xiubo Geng
Wenxiang Hu
Chongyang Tao
Jing Ma
Qingwei Lin
Daxin Jiang
ELM, SyDa, ALM
113
687
0
14 Jun 2023
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
Vijay Viswanathan
Luyu Gao
Tongshuang Wu
Pengfei Liu
Graham Neubig
67
13
0
26 May 2023
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil
Tianjun Zhang
Xin Wang
Joseph E. Gonzalez
ELM, CLL, ALM, SyDa
90
566
0
24 May 2023
Symbol tuning improves in-context learning in language models
Jerry W. Wei
Le Hou
Andrew Kyle Lampinen
Xiangning Chen
Da Huang
...
Xinyun Chen
Yifeng Lu
Denny Zhou
Tengyu Ma
Quoc V. Le
LRM
71
80
0
15 May 2023
tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation
Damien Sileo
54
11
0
14 Jan 2023
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Or Honovich
Thomas Scialom
Omer Levy
Timo Schick
ALM
126
375
0
19 Dec 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
401
2,394
0
09 Nov 2022
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao
Thomas Wang
Daniel Hesslow
Lucile Saulnier
Stas Bekman
...
Lintang Sutawika
Jaesung Tae
Zheng-Xin Yong
Julien Launay
Iz Beltagy
MoE, AI4CE
285
109
0
27 Oct 2022
Scaling Instruction-Finetuned Language Models
Hyung Won Chung
Le Hou
Shayne Longpre
Barret Zoph
Yi Tay
...
Jacob Devlin
Adam Roberts
Denny Zhou
Quoc V. Le
Jason W. Wei
ReLM, LRM
208
3,150
0
20 Oct 2022
Interactive Model Cards: A Human-Centered Approach to Model Documentation
Anamaria Crisan
Margaret Drouhard
Jesse Vig
Nazneen Rajani
HAI
73
89
0
05 May 2022
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Yacine Jernite
Huu Nguyen
Stella Biderman
A. Rogers
Maraim Masoud
...
Jorg Frohberg
Aaron Gokaslan
Peter Henderson
Rishi Bommasani
Margaret Mitchell
57
52
0
04 May 2022
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Yizhong Wang
Swaroop Mishra
Pegah Alipoormolabashi
Yeganeh Kordi
Amirreza Mirzaei
...
Chitta Baral
Yejin Choi
Noah A. Smith
Hannaneh Hajishirzi
Daniel Khashabi
ELM
123
858
0
16 Apr 2022
Scalable Training of Language Models using JAX pjit and TPUv4
Joanna Yoo
Kuba Perlin
Siddhartha Rao Kamalakara
J. Araújo
VLM
45
10
0
13 Apr 2022
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Mahima Pushkarna
Andrew Zaldivar
Oddur Kjartansson
AI4TS
92
218
0
03 Apr 2022
Quantifying Memorization Across Neural Language Models
Nicholas Carlini
Daphne Ippolito
Matthew Jagielski
Katherine Lee
Florian Tramèr
Chiyuan Zhang
PILM
124
631
0
15 Feb 2022
Datasheet for the Pile
Stella Biderman
Kieran Bicheno
Leo Gao
88
36
0
13 Jan 2022
Building Legal Datasets
Jerrold Soh
ELM, AILaw
130
3
0
03 Nov 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
355
1,708
0
15 Oct 2021
Datasets: A Community Library for Natural Language Processing
Quentin Lhoest
Albert Villanova del Moral
Yacine Jernite
A. Thakur
Patrick von Platen
...
Thibault Goehringer
Victor Mustar
François Lagunas
Alexander M. Rush
Thomas Wolf
218
614
0
07 Sep 2021
Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei
Maarten Bosma
Vincent Zhao
Kelvin Guu
Adams Wei Yu
Brian Lester
Nan Du
Andrew M. Dai
Quoc V. Le
ALM, UQCV
230
3,782
0
03 Sep 2021
Understanding Gender and Racial Disparities in Image Recognition Models
Rohan Mahadev
Anindya Chakravarti
FaML
28
3
0
20 Jul 2021
Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Jack Bandy
Nicholas Vincent
65
57
0
11 May 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
120
448
0
18 Apr 2021
Detoxifying Language Models Risks Marginalizing Minority Voices
Albert Xu
Eshaan Pathak
Eric Wallace
Suchin Gururangan
Maarten Sap
Dan Klein
69
129
0
13 Apr 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
56
278
0
22 Mar 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
475
2,120
0
31 Dec 2020
Datasheets for Datasets
Timnit Gebru
Jamie Morgenstern
Briana Vecchione
Jennifer Wortman Vaughan
Hanna M. Wallach
Hal Daumé
Kate Crawford
285
2,195
0
23 Mar 2018
Annotation Artifacts in Natural Language Inference Data
Suchin Gururangan
Swabha Swayamdipta
Omer Levy
Roy Schwartz
Samuel R. Bowman
Noah A. Smith
155
1,180
0
06 Mar 2018
No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World
S. Shankar
Yoni Halpern
Eric Breck
James Atwood
Jimbo Wilson
D. Sculley
71
296
0
22 Nov 2017
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar
Jian Zhang
Konstantin Lopyrev
Percy Liang
RALM
316
8,169
0
16 Jun 2016
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Jason Weston
Antoine Bordes
S. Chopra
Alexander M. Rush
Bart van Merriënboer
Armand Joulin
Tomas Mikolov
LRM, ELM
150
1,181
0
19 Feb 2015