2407.01492
Cited By
RegMix: Data Mixture as Regression for Language Model Pre-training
1 July 2024
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min Lin
MoE
Papers citing
"RegMix: Data Mixture as Regression for Language Model Pre-training"
50 / 97 papers shown
Title
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang
Howe Tissue
Lu Wang
Xipeng Qiu
120
1
0
12 Jun 2025
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
Yiqing Liang
Jielin Qiu
Wenhao Ding
Zuxin Liu
James Tompkin
Mengdi Xu
Mengzhou Xia
Zhengzhong Tu
Laixi Shi
Jiacheng Zhu
OffRL
128
0
0
30 May 2025
Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning
Wanyun Xie
F. Tonin
Volkan Cevher
36
0
0
30 May 2025
Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives
Yajiao Liu
Congliang Chen
Junchi Yang
Ruoyu Sun
MoMe
41
0
0
27 May 2025
GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining
Simin Fan
Maria Ios Glarou
Martin Jaggi
VLM
75
0
0
26 May 2025
Do Large Language Models (Really) Need Statistical Foundations?
Weijie Su
274
0
0
25 May 2025
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
Xinran Gu
Kaifeng Lyu
Jiazheng Li
Jingzhao Zhang
83
0
0
23 May 2025
Model-Free Graph Data Selection under Distribution Shift
Ting-Wei Li
Ruizhong Qiu
Hanghang Tong
OOD
61
0
0
22 May 2025
IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment
Chenlin Ming
Chendi Qu
Mengzhang Cai
Qizhi Pei
Zhuoshi Pan
Yu Li
Xiaoming Duan
Lijun Wu
Zeang Sheng
69
0
0
19 May 2025
Qwen3 Technical Report
An Yang
A. Li
Baosong Yang
Beichen Zhang
Binyuan Hui
...
Zekun Wang
Zeyu Cui
Zhenru Zhang
Zhenhong Zhou
Zihan Qiu
LLMAG
OSLM
LRM
118
100
0
14 May 2025
Guiding Data Collection via Factored Scaling Curves
Lihan Zha
Apurva Badithela
Michael Zhang
Justin Lidard
Jeremy Bao
Emily Zhou
David Snyder
Allen Z. Ren
Dhruv Shah
Anirudha Majumdar
OffRL
140
2
0
12 May 2025
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Kai Hua
Steven Wu
Ge Zhang
Ke Shen
LRM
85
0
0
12 May 2025
Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang
Howe Tissue
Lu Wang
Linjing Li
D. Zeng
CLL
80
0
0
12 May 2025
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge
Tzu-Heng Huang
John Cooper
Avi Trost
Ziyi Chu
Satya Sai Srinath Namburi GNVV
Ziyang Cai
Kendall Park
Nicholas Roberts
Frederic Sala
107
1
0
01 May 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
134
1
0
23 Apr 2025
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Wenke Xia
Ruoxuan Feng
Dong Wang
Di Hu
91
1
0
20 Apr 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Yucheng Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
151
0
0
19 Apr 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao
Yu Yang
Y. Fu
Xin Dong
Jane Polak Scowcroft
...
Hongxu Yin
M. Patwary
Yingyan
Jan Kautz
Pavlo Molchanov
122
2
0
17 Apr 2025
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Abhay Kumar
Louis Owen
Nilabhra Roy Chowdhury
Fabian Güra
VLM
90
1
0
03 Apr 2025
Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
Thomson Yen
Andrew Siah
Haozhe Chen
Tianyi Peng
Daniel Guetta
Hongseok Namkoong
83
0
0
26 Mar 2025
Teaching LMMs for Image Quality Scoring and Interpreting
Zicheng Zhang
H. Wu
Ziheng Jia
Weisi Lin
Guangtao Zhai
129
2
0
12 Mar 2025
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Emmy Liu
Amanda Bertsch
Lintang Sutawika
Lindia Tjuatja
Patrick Fernandes
...
Siyang Song
Carolin (Haas) Lawrence
Aditi Raghunathan
Kiril Gashteovski
Graham Neubig
275
3
0
05 Mar 2025
SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
Xiangyu Xi
Deyang Kong
Jian Yang
Jiawei Yang
Zheyu Chen
Wei Wang
Jinqiao Wang
Xunliang Cai
Shikun Zhang
Wei Ye
113
0
0
03 Mar 2025
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Jake Poznanski
Jon Borchardt
Jason Dunkelberger
Regan Huff
Daniel Lin
Aman Rangapur
Christopher Wilhelm
Kyle Lo
Luca Soldaini
174
7
0
25 Feb 2025
Unsupervised Topic Models are Data Mixers for Pre-training Language Models
Jiahui Peng
Xinlin Zhuang
Qiu Jiantao
Ren Ma
Jing Yu
Tianyi Bai
Zeang Sheng
102
3
0
24 Feb 2025
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou
Qian Liu
Fan Zhou
Changyu Chen
Zili Wang
...
Tianyu Pang
Chao Du
Xinyi Wan
Wei Lu
Min Lin
243
3
0
18 Feb 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
224
15
0
17 Feb 2025
How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines
Ayan Sengupta
Tanmoy Chakraborty
170
0
0
17 Feb 2025
MixMin: Finding Data Mixtures via Convex Minimization
Anvith Thudi
Evianne Rovers
Yangjun Ruan
Tristan Thrush
Chris J. Maddison
111
0
0
14 Feb 2025
Bag of Tricks for Inference-time Computation of LLM Reasoning
Fan Liu
Wenshuo Chao
Naiqiang Tan
Hao Liu
OffRL
LRM
173
5
0
11 Feb 2025
Scaling Laws for Precision
Tanishq Kumar
Zachary Ankner
Benjamin Spector
Blake Bordelon
Niklas Muennighoff
Mansheej Paul
Cengiz Pehlevan
Christopher Ré
Aditi Raghunathan
AIFin
MoMe
106
29
0
07 Nov 2024
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James V. Miranda
Yizhong Wang
Yanai Elazar
Sachin Kumar
Valentina Pyatkin
Faeze Brahman
Noah A. Smith
Hannaneh Hajishirzi
Pradeep Dasigi
138
12
0
24 Oct 2024
Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong
Muhammad Usama Saleem
Korrawe Karunratanakul
Pu Wang
Hongfei Xue
Chong Chen
Chuan Guo
Junli Cao
J. Ren
Sergey Tulyakov
VGen
92
7
0
14 Oct 2024
Scaling Laws for Predicting Downstream Performance in LLMs
Yangyi Chen
Binxuan Huang
Yifan Gao
Zhengyang Wang
Jingfeng Yang
Heng Ji
LRM
142
12
0
11 Oct 2024
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
Tianyi Bai
Ling Yang
Zhen Hao Wong
Fupeng Sun
Jiahui Peng
...
Lijun Wu
Jiantao Qiu
Wentao Zhang
Binhang Yuan
Conghui He
LLMAG
79
6
0
10 Oct 2024
Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
Tianjian Li
Haoran Xu
Weiting Tan
Kenton Murray
Daniel Khashabi
158
1
0
06 Oct 2024
Dynamic Gradient Alignment for Online Data Mixing
Simin Fan
David Grangier
Pierre Ablin
61
5
0
03 Oct 2024
Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush
Christopher Potts
Tatsunori Hashimoto
109
22
0
09 Sep 2024
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Zichun Yu
Spandan Das
Chenyan Xiong
128
37
0
10 Jun 2024
D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
Haoran Que
Jiaheng Liu
Ge Zhang
Chenchen Zhang
Xingwei Qu
...
Jie Fu
Wenbo Su
Jiamang Wang
Lin Qu
Bo Zheng
CLL
150
17
0
03 Jun 2024
gzip Predicts Data-dependent Scaling Laws
Rohan Pandey
82
11
0
26 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
139
45
0
26 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
198
63
3
23 May 2024
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Feiyang Kang
H. Just
Yifan Sun
Himanshu Jahagirdar
Yuanzhi Zhang
Rongxing Du
Anit Kumar Sahu
Ruoxi Jia
102
22
0
05 May 2024
Text Quality-Based Pruning for Efficient Training of Language Models
Vasu Sharma
Karthik Padthe
Newsha Ardalani
Kushal Tirumala
Russell Howes
...
Po-Yao Huang
Shang-Wen Li
Armen Aghajanyan
Gargi Ghosh
Luke Zettlemoyer
120
6
0
26 Apr 2024
OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
Sachin Mehta
Mohammad Hossein Sekhavat
Qingqing Cao
Maxwell Horton
Yanzi Jin
...
Iman Mirzadeh
Mahyar Najibi
Dmitry Belenko
Peter Zatloukal
Mohammad Rastegari
OSLM
AIFin
108
61
0
22 Apr 2024
Compression Represents Intelligence Linearly
Yuzhen Huang
Jinghan Zhang
Zifei Shan
Junxian He
82
29
0
15 Apr 2024
Rho-1: Not All Tokens Are What You Need
Zheng-Wen Lin
Zhibin Gou
Yeyun Gong
Xiao Liu
Yelong Shen
...
Chen Lin
Yujiu Yang
Jian Jiao
Nan Duan
Weizhu Chen
CLL
160
75
0
11 Apr 2024
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic
Sachin Goyal
Pratyush Maini
Zachary Chase Lipton
Aditi Raghunathan
J. Zico Kolter
105
46
0
10 Apr 2024
Sailor: Open Language Models for South-East Asia
Longxu Dou
Qian Liu
Guangtao Zeng
Jia Guo
Jiahui Zhou
Wei Lu
Min Lin
LRM
106
9
0
04 Apr 2024