Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame

3 June 2024

Papers citing "Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame"


BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
115 | 333 | 0 | 05 Feb 2024

Healthsheet: Development of a Transparency Artifact for Health Datasets
Negar Rostamzadeh, Diana Mincu, Subhrajit Roy, A. Smart, Lauren Wilcox, Mahima Pushkarna, Jessica Schrouff, Razvan Amironesei, Nyalleng Moorosi, Katherine A. Heller
39 | 62 | 0 | 26 Feb 2022

Deduplicating Training Data Makes Language Models Better
Katherine Lee, Daphne Ippolito, A. Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
SyDa | 242 | 593 | 0 | 14 Jul 2021

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat | 279 | 1,996 | 0 | 31 Dec 2020

Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov, Kai Chen, G. Corrado, J. Dean
3DV | 281 | 31,267 | 0 | 16 Jan 2013