v1v2 (latest)

Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data

24 June 2023

Papers citing "Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data"

24 / 24 papers shown

Title
Do we really have to filter out random noise in pre-training data for language models? Jinghan Ru Yuxin Xie Xianwei Zhuang Yuguo Yin Zhihui Guo Zhiming Liu Qianli Ren Yuexian Zou 178 6 0 10 Feb 2025
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment Elyas Obbad Iddah Mlauzi Brando Miranda Rylan Schaeffer Kamal Obbad Suhana Bedi Sanmi Koyejo CVBM 110 0 0 23 Oct 2024
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World Joshua Kazdan Rylan Schaeffer Apratim Dey Matthias Gerstgrasser Rafael Rafailov D. Donoho Sanmi Koyejo 117 17 0 22 Oct 2024
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey Guanqiao Qu Qiyuan Chen Wei Wei Zheng Lin Xianhao Chen Kaibin Huang 127 56 0 09 Jul 2024
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity Shayne Longpre Gregory Yauney Emily Reif Katherine Lee Adam Roberts ... Denny Zhou Jason W. Wei Kevin Robinson David M. Mimno Daphne Ippolito 106 166 0 22 May 2023
The Vendi Score: A Diversity Evaluation Metric for Machine Learning Dan Friedman Adji Bousso Dieng EGVM 152 127 0 05 Oct 2022
Scaling Laws for a Multi-Agent Reinforcement Learning Model Oren Neumann C. Gros 81 27 0 29 Sep 2022
Emergent Abilities of Large Language Models Jason W. Wei Yi Tay Rishi Bommasani Colin Raffel Barret Zoph ... Tatsunori Hashimoto Oriol Vinyals Percy Liang J. Dean W. Fedus ELM ReLM LRM 288 2,511 0 15 Jun 2022
PaLM: Scaling Language Modeling with Pathways Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra ... Kathy Meier-Hellstern Douglas Eck J. Dean Slav Petrov Noah Fiedel PILM LRM 529 6,293 0 05 Apr 2022
Training Compute-Optimal Large Language Models Jordan Hoffmann Sebastian Borgeaud A. Mensch Elena Buchatskaya Trevor Cai ... Karen Simonyan Erich Elsen Jack W. Rae Oriol Vinyals Laurent Sifre AI4TS 208 1,980 0 29 Mar 2022
An Explanation of In-context Learning as Implicit Bayesian Inference Sang Michael Xie Aditi Raghunathan Percy Liang Tengyu Ma ReLM BDL VPVLM LRM 216 763 0 03 Nov 2021
Mastering Atari Games with Limited Data Weirui Ye Shao-Wei Liu Thanard Kurutach Pieter Abbeel Yang Gao VLM 122 240 0 30 Oct 2021
Evaluating Large Language Models Trained on Code Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Pondé ... Bob McGrew Dario Amodei Sam McCandlish Ilya Sutskever Wojciech Zaremba ELM ALM 236 5,647 0 07 Jul 2021
Scaling Scaling Laws with Board Games Andrew Jones 55 43 0 07 Apr 2021
Proof Artifact Co-training for Theorem Proving with Language Models Jesse Michael Han Jason M. Rute Yuhuai Wu Edward W. Ayers Stanislas Polu AIMat 99 126 0 11 Feb 2021
Scaling Laws for Transfer Danny Hernandez Jared Kaplan T. Henighan Sam McCandlish 90 250 0 02 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 475 2,120 0 31 Dec 2020
Random Network Distillation as a Diversity Metric for Both Image and Text Generation Liam H. Fowl Micah Goldblum Arjun Gupta Amr Sharaf Tom Goldstein EGVM 29 3 0 13 Oct 2020
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 611 4,905 0 23 Jan 2020
A Constructive Prediction of the Generalization Error Across Scales Jonathan S. Rosenfeld Amir Rosenfeld Yonatan Belinkov Nir Shavit 105 215 0 27 Sep 2019
Improved Precision and Recall Metric for Assessing Generative Models Tuomas Kynkaanniemi Tero Karras S. Laine J. Lehtinen Timo Aila EGVM 105 865 0 15 Apr 2019
Task2Vec: Task Embedding for Meta-Learning Alessandro Achille Michael Lam Rahul Tewari Avinash Ravichandran Subhransu Maji Charless C. Fowlkes Stefano Soatto Pietro Perona SSL 77 315 0 10 Feb 2019
Assessing Generative Models via Precision and Recall Mehdi S. M. Sajjadi Olivier Bachem Mario Lucic Olivier Bousquet Sylvain Gelly EGVM 82 581 0 31 May 2018
Pointer Sentinel Mixture Models Stephen Merity Caiming Xiong James Bradbury R. Socher RALM 338 2,898 0 26 Sep 2016