2406.06574
Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame
3 June 2024
Charles de Dampierre
Andrei Mogoutov
Nicolas Baumard
Papers citing "Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame"
5 / 5 papers shown
Title
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
115 | 333 | 0
05 Feb 2024

Healthsheet: Development of a Transparency Artifact for Health Datasets
Negar Rostamzadeh, Diana Mincu, Subhrajit Roy, A. Smart, Lauren Wilcox, Mahima Pushkarna, Jessica Schrouff, Razvan Amironesei, Nyalleng Moorosi, Katherine A. Heller
39 | 62 | 0
26 Feb 2022

Deduplicating Training Data Makes Language Models Better
Katherine Lee, Daphne Ippolito, A. Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
SyDa
242 | 593 | 0
14 Jul 2021

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat
279 | 1,996 | 0
31 Dec 2020

Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov, Kai Chen, G. Corrado, J. Dean
3DV
281 | 31,267 | 0
16 Jan 2013