ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2407.01492
  4. Cited By
RegMix: Data Mixture as Regression for Language Model Pre-training
v1v2 (latest)

RegMix: Data Mixture as Regression for Language Model Pre-training

1 July 2024
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min Lin
    MoE
ArXiv (abs)PDFHTML

Papers citing "RegMix: Data Mixture as Regression for Language Model Pre-training"

47 / 97 papers shown
Title
The Fine Line: Navigating Large Language Model Pretraining with
  Down-streaming Capability Analysis
The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis
Chen Yang
Junzhuo Li
Xinyao Niu
Xinrun Du
Songyang Gao
...
Stephen W. Huang
Shawn Yue
Wenhu Chen
Jie Fu
Ge Zhang
78
2
0
01 Apr 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
147
76
0
25 Mar 2024
Language models scale reliably with over-training and on downstream
  tasks
Language models scale reliably with over-training and on downstream tasks
S. Gadre
Georgios Smyrnis
Vaishaal Shankar
Suchin Gururangan
Mitchell Wortsman
...
Y. Carmon
Achal Dave
Reinhard Heckel
Niklas Muennighoff
Ludwig Schmidt
ALMELMLRM
183
48
0
13 Mar 2024
SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large
  Language Models by Summarizing Training Trajectories of Small Models
SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Yu Yang
Siddhartha Mishra
Jeffrey N Chiang
Baharan Mirzasoleiman
98
24
0
12 Mar 2024
Balanced Data Sampling for Language Model Training with Clustering
Balanced Data Sampling for Language Model Training with Clustering
Yunfan Shao
Linyang Li
Zhaoye Fei
Hang Yan
Dahua Lin
Xipeng Qiu
95
12
0
22 Feb 2024
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
  Improves LLM Generalization
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization
Xuxi Chen
Zhendong Wang
Daouda Sow
Junjie Yang
Tianlong Chen
Yingbin Liang
Mingyuan Zhou
Zhangyang Wang
88
7
0
22 Feb 2024
Smaller Language Models are capable of selecting Instruction-Tuning
  Training Data for Larger Language Models
Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models
Dheeraj Mekala
Alex Nguyen
Jingbo Shang
ALM
68
21
0
16 Feb 2024
How to Train Data-Efficient LLMs
How to Train Data-Efficient LLMs
Noveen Sachdeva
Benjamin Coleman
Wang-Cheng Kang
Jianmo Ni
Lichan Hong
Ed H. Chi
James Caverlee
Julian McAuley
D. Cheng
97
64
0
15 Feb 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia
Sadhika Malladi
Suchin Gururangan
Sanjeev Arora
Danqi Chen
164
247
0
06 Feb 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model
  Pretraining Research
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini
Rodney Michael Kinney
Akshita Bhagia
Dustin Schwenk
David Atkinson
...
Hanna Hajishirzi
Iz Beltagy
Dirk Groeneveld
Jesse Dodge
Kyle Lo
127
282
0
31 Jan 2024
DsDm: Model-Aware Dataset Selection with Datamodels
DsDm: Model-Aware Dataset Selection with Datamodels
Logan Engstrom
Axel Feldmann
Aleksander Madry
OODD
110
61
0
23 Jan 2024
TinyLlama: An Open-Source Small Language Model
TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang
Guangtao Zeng
Tianduo Wang
Wei Lu
ALMLRM
221
409
0
04 Jan 2024
What Makes Good Data for Alignment? A Comprehensive Study of Automatic
  Data Selection in Instruction Tuning
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Wei Liu
Weihao Zeng
Keqing He
Yong Jiang
Junxian He
ALM
132
239
0
25 Dec 2023
Paloma: A Benchmark for Evaluating Language Model Fit
Paloma: A Benchmark for Evaluating Language Model Fit
Ian H. Magnusson
Akshita Bhagia
Valentin Hofmann
Luca Soldaini
A. Jha
...
Iz Beltagy
Hanna Hajishirzi
Noah A. Smith
Kyle Richardson
Jesse Dodge
180
27
0
16 Dec 2023
Efficient Online Data Mixing For Language Model Pre-Training
Efficient Online Data Mixing For Language Model Pre-Training
Alon Albalak
Liangming Pan
Colin Raffel
Wenjie Wang
101
46
0
05 Dec 2023
Data Diversity Matters for Robust Instruction Tuning
Data Diversity Matters for Robust Instruction Tuning
Alexander Bukharin
Tuo Zhao
164
44
0
21 Nov 2023
Self-Influence Guided Data Reweighting for Language Model Pre-training
Self-Influence Guided Data Reweighting for Language Model Pre-training
Megh Thakkar
Tolga Bolukbasi
Sriram Ganapathy
Shikhar Vashishth
Sarath Chandar
Partha P. Talukdar
MILM
111
26
0
02 Nov 2023
DoGE: Domain Reweighting with Generalization Estimation
DoGE: Domain Reweighting with Generalization Estimation
Simin Fan
Matteo Pagliardini
Martin Jaggi
77
44
0
23 Oct 2023
Skill-it! A Data-Driven Skills Framework for Understanding and Training
  Language Models
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
Mayee F. Chen
Nicholas Roberts
Kush S. Bhatia
Jue Wang
Ce Zhang
Frederic Sala
Christopher Ré
SyDa
88
66
0
26 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MHALM
539
12,132
0
18 Jul 2023
Scaling Data-Constrained Language Models
Scaling Data-Constrained Language Models
Niklas Muennighoff
Alexander M. Rush
Boaz Barak
Teven Le Scao
Aleksandra Piktus
Nouamane Tazi
S. Pyysalo
Thomas Wolf
Colin Raffel
ALM
191
226
0
25 May 2023
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Percy Liang
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMeMoE
152
205
0
17 May 2023
Data Selection for Language Models via Importance Resampling
Data Selection for Language Models via Importance Resampling
Sang Michael Xie
Shibani Santurkar
Tengyu Ma
Percy Liang
131
196
0
06 Feb 2023
Training Trajectories of Language Models Across Scales
Training Trajectories of Language Models Across Scales
Mengzhou Xia
Mikel Artetxe
Chunting Zhou
Xi Lin
Ramakanth Pasunuru
Danqi Chen
Luke Zettlemoyer
Ves Stoyanov
AIFinLRM
98
64
0
19 Dec 2022
The Stack: 3 TB of permissively licensed source code
The Stack: 3 TB of permissively licensed source code
Denis Kocetkov
Raymond Li
Loubna Ben Allal
Jia Li
Chenghao Mou
...
Sean M. Hughes
Thomas Wolf
Dzmitry Bahdanau
Leandro von Werra
H. D. Vries
111
339
0
20 Nov 2022
Prioritized Training on Points that are Learnable, Worth Learning, and
  Not Yet Learnt
Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
Sören Mindermann
J. Brauner
Muhammed Razzak
Mrinank Sharma
Andreas Kirsch
...
Benedikt Höltgen
Aidan Gomez
Adrien Morisot
Sebastian Farquhar
Y. Gal
131
165
0
14 Jun 2022
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black
Stella Biderman
Eric Hallahan
Quentin G. Anthony
Leo Gao
...
Shivanshu Purohit
Laria Reynolds
J. Tow
Benqi Wang
Samuel Weinbach
189
841
0
14 Apr 2022
Training Compute-Optimal Large Language Models
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
217
1,993
0
29 Mar 2022
Scaling Language Models: Methods, Analysis & Insights from Training
  Gopher
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae
Sebastian Borgeaud
Trevor Cai
Katie Millican
Jordan Hoffmann
...
Jeff Stanway
L. Bennett
Demis Hassabis
Koray Kavukcuoglu
G. Irving
199
1,327
0
08 Dec 2021
Training Verifiers to Solve Math Word Problems
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLMOffRLLRM
431
4,609
0
27 Oct 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
484
2,128
0
31 Dec 2020
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELMRALM
396
4,585
0
07 Sep 2020
LogiQA: A Challenge Dataset for Machine Reading Comprehension with
  Logical Reasoning
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
Jian Liu
Leyang Cui
Hanmeng Liu
Dandan Huang
Yile Wang
Yue Zhang
RALM
137
382
0
16 Jul 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
1.2K
42,714
0
28 May 2020
SuperGlue: Learning Feature Matching with Graph Neural Networks
SuperGlue: Learning Feature Matching with Graph Neural Networks
Paul-Edouard Sarlin
Daniel DeTone
Tomasz Malisiewicz
Andrew Rabinovich
3DPCOffRL
193
1,958
0
26 Nov 2019
PIQA: Reasoning about Physical Commonsense in Natural Language
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk
Rowan Zellers
Ronan Le Bras
Jianfeng Gao
Yejin Choi
OODLRM
350
1,853
0
26 Nov 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
881
20,448
0
23 Oct 2019
HellaSwag: Can a Machine Really Finish Your Sentence?
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers
Ari Holtzman
Yonatan Bisk
Ali Farhadi
Yejin Choi
229
2,537
0
19 May 2019
CommonsenseQA: A Question Answering Challenge Targeting Commonsense
  Knowledge
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor
Jonathan Herzig
Nicholas Lourie
Jonathan Berant
RALM
172
1,754
0
02 Nov 2018
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book
  Question Answering
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov
Peter Clark
Tushar Khot
Ashish Sabharwal
133
1,573
0
08 Sep 2018
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
  Understanding
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
1.3K
7,212
0
20 Apr 2018
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning
  Challenge
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELMRALMLRM
249
2,679
0
14 Mar 2018
Crowdsourcing Multiple Choice Science Questions
Crowdsourcing Multiple Choice Science Questions
Johannes Welbl
Nelson F. Liu
Matt Gardner
AI4Ed
146
522
0
19 Jul 2017
RACE: Large-scale ReAding Comprehension Dataset From Examinations
RACE: Large-scale ReAding Comprehension Dataset From Examinations
Guokun Lai
Qizhe Xie
Hanxiao Liu
Yiming Yang
Eduard H. Hovy
ELM
315
1,360
0
15 Apr 2017
Understanding Black-box Predictions via Influence Functions
Understanding Black-box Predictions via Influence Functions
Pang Wei Koh
Percy Liang
TDI
234
2,918
0
14 Mar 2017
Pointer Sentinel Mixture Models
Pointer Sentinel Mixture Models
Stephen Merity
Caiming Xiong
James Bradbury
R. Socher
RALM
399
2,905
0
26 Sep 2016
The LAMBADA dataset: Word prediction requiring a broad discourse context
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno
Germán Kruszewski
Angeliki Lazaridou
Q. N. Pham
Raffaella Bernardi
Sandro Pezzelle
Marco Baroni
Gemma Boleda
Raquel Fernández
196
727
0
20 Jun 2016
Previous
12