2306.13840
Cited By
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
24 June 2023
Alycia Lee
Brando Miranda
Sudharsan Sundar
Sanmi Koyejo
Papers citing
"Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data"
24 / 24 papers shown
Do we really have to filter out random noise in pre-training data for language models?
Jinghan Ru
Yuxin Xie
Xianwei Zhuang
Yuguo Yin
Zhihui Guo
Zhiming Liu
Qianli Ren
Yuexian Zou
178
6
0
10 Feb 2025
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
Elyas Obbad
Iddah Mlauzi
Brando Miranda
Rylan Schaeffer
Kamal Obbad
Suhana Bedi
Sanmi Koyejo
CVBM
110
0
0
23 Oct 2024
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
Joshua Kazdan
Rylan Schaeffer
Apratim Dey
Matthias Gerstgrasser
Rafael Rafailov
D. Donoho
Sanmi Koyejo
117
17
0
22 Oct 2024
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
Guanqiao Qu
Qiyuan Chen
Wei Wei
Zheng Lin
Xianhao Chen
Kaibin Huang
127
56
0
09 Jul 2024
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre
Gregory Yauney
Emily Reif
Katherine Lee
Adam Roberts
...
Denny Zhou
Jason W. Wei
Kevin Robinson
David M. Mimno
Daphne Ippolito
106
166
0
22 May 2023
The Vendi Score: A Diversity Evaluation Metric for Machine Learning
Dan Friedman
Adji Bousso Dieng
EGVM
152
127
0
05 Oct 2022
Scaling Laws for a Multi-Agent Reinforcement Learning Model
Oren Neumann
C. Gros
81
27
0
29 Sep 2022
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
288
2,511
0
15 Jun 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
529
6,293
0
05 Apr 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
208
1,980
0
29 Mar 2022
An Explanation of In-context Learning as Implicit Bayesian Inference
Sang Michael Xie
Aditi Raghunathan
Percy Liang
Tengyu Ma
ReLM
BDL
VPVLM
LRM
216
763
0
03 Nov 2021
Mastering Atari Games with Limited Data
Weirui Ye
Shao-Wei Liu
Thanard Kurutach
Pieter Abbeel
Yang Gao
VLM
122
240
0
30 Oct 2021
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
236
5,647
0
07 Jul 2021
Scaling Scaling Laws with Board Games
Andrew Jones
55
43
0
07 Apr 2021
Proof Artifact Co-training for Theorem Proving with Language Models
Jesse Michael Han
Jason M. Rute
Yuhuai Wu
Edward W. Ayers
Stanislas Polu
AIMat
99
126
0
11 Feb 2021
Scaling Laws for Transfer
Danny Hernandez
Jared Kaplan
T. Henighan
Sam McCandlish
90
250
0
02 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
475
2,120
0
31 Dec 2020
Random Network Distillation as a Diversity Metric for Both Image and Text Generation
Liam H. Fowl
Micah Goldblum
Arjun Gupta
Amr Sharaf
Tom Goldstein
EGVM
29
3
0
13 Oct 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
611
4,905
0
23 Jan 2020
A Constructive Prediction of the Generalization Error Across Scales
Jonathan S. Rosenfeld
Amir Rosenfeld
Yonatan Belinkov
Nir Shavit
105
215
0
27 Sep 2019
Improved Precision and Recall Metric for Assessing Generative Models
Tuomas Kynkaanniemi
Tero Karras
S. Laine
J. Lehtinen
Timo Aila
EGVM
105
865
0
15 Apr 2019
Task2Vec: Task Embedding for Meta-Learning
Alessandro Achille
Michael Lam
Rahul Tewari
Avinash Ravichandran
Subhransu Maji
Charless C. Fowlkes
Stefano Soatto
Pietro Perona
SSL
77
315
0
10 Feb 2019
Assessing Generative Models via Precision and Recall
Mehdi S. M. Sajjadi
Olivier Bachem
Mario Lucic
Olivier Bousquet
Sylvain Gelly
EGVM
82
581
0
31 May 2018
Pointer Sentinel Mixture Models
Stephen Merity
Caiming Xiong
James Bradbury
R. Socher
RALM
338
2,898
0
26 Sep 2016