ResearchTrend.AI
RegMix: Data Mixture as Regression for Language Model Pre-training

1 July 2024
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min Lin
    MoE

Papers citing "RegMix: Data Mixture as Regression for Language Model Pre-training"

43 / 43 papers shown
Qwen3 Technical Report
A. Yang
A. Li
Baosong Yang
Beichen Zhang
Binyuan Hui
...
Zekun Wang
Zeyu Cui
Z. Zhang
Zhenhong Zhou
Zihan Qiu
LLMAG
OSLM
LRM
42
0
0
14 May 2025
Guiding Data Collection via Factored Scaling Curves
Lihan Zha
Apurva Badithela
Michael Zhang
Justin Lidard
Jeremy Bao
Emily Zhou
David Snyder
Allen Z. Ren
Dhruv Shah
Anirudha Majumdar
OffRL
34
0
0
12 May 2025
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Kai Hua
Steven Wu
Ge Zhang
Ke Shen
LRM
28
0
0
12 May 2025
Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang
Howe Tissue
Lu Wang
Linjing Li
D. Zeng
CLL
29
0
0
12 May 2025
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge
Tzu-Heng Huang
John Cooper
Avi Trost
Ziyi Chu
Satya Sai Srinath Namburi GNVV
Ziyang Cai
Kendall Park
Nicholas Roberts
Frederic Sala
53
0
0
01 May 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
66
1
0
23 Apr 2025
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Wenke Xia
Ruoxuan Feng
Dong Wang
Di Hu
32
0
0
20 Apr 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Y. Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
50
0
0
19 Apr 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao
Yu Yang
Y. Fu
Xin Dong
Dan Su
...
Hongxu Yin
M. Patwary
Yingyan
Jan Kautz
Pavlo Molchanov
38
0
0
17 Apr 2025
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Abhay Kumar
Louis Owen
Nilabhra Roy Chowdhury
Fabian Güra
VLM
40
1
0
03 Apr 2025
Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
Thomson Yen
Andrew Siah
Haozhe Chen
Tianyi Peng
Daniel Guetta
Hongseok Namkoong
48
0
0
26 Mar 2025
Teaching LMMs for Image Quality Scoring and Interpreting
Zicheng Zhang
H. Wu
Ziheng Jia
Weisi Lin
Guangtao Zhai
62
1
0
12 Mar 2025
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
E. Liu
Amanda Bertsch
Lintang Sutawika
Lindia Tjuatja
Patrick Fernandes
...
Shri Kiran Srinivasan
Carolin (Haas) Lawrence
Aditi Raghunathan
Kiril Gashteovski
Graham Neubig
90
0
0
05 Mar 2025
SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
Xiangyu Xi
Deyang Kong
Jian Yang
Jiawei Yang
Z. Chen
Wei Wang
J. T. Wang
Xunliang Cai
Shikun Zhang
Wei Ye
65
0
0
03 Mar 2025
Unsupervised Topic Models are Data Mixers for Pre-training Language Models
Jiahui Peng
Xinlin Zhuang
Qiu Jiantao
Ren Ma
Jing Yu
Tianyi Bai
Conghui He
38
0
0
24 Feb 2025
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou
Qian Liu
Fan Zhou
Changyu Chen
Zili Wang
...
Tianyu Pang
Chao Du
Xinyi Wan
Wei Lu
Min Lin
106
1
0
18 Feb 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
102
15
0
17 Feb 2025
How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines
Ayan Sengupta
Yash Goel
Tanmoy Chakraborty
50
0
0
17 Feb 2025
Bag of Tricks for Inference-time Computation of LLM Reasoning
Fan Liu
Wenshuo Chao
Naiqiang Tan
Hao Liu
OffRL
LRM
79
3
0
11 Feb 2025
PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts
Zeman Li
Yuan Deng
Peilin Zhong
Meisam Razaviyayn
Vahab Mirrokni
MoMe
75
1
0
10 Feb 2025
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James Validad Miranda
Yizhong Wang
Yanai Elazar
Sachin Kumar
Valentina Pyatkin
Faeze Brahman
Noah A. Smith
Hannaneh Hajishirzi
Pradeep Dasigi
47
8
0
08 Jan 2025
Scaling Laws for Precision
Tanishq Kumar
Zachary Ankner
Benjamin Spector
Blake Bordelon
Niklas Muennighoff
Mansheej Paul
C. Pehlevan
Christopher Ré
Aditi Raghunathan
AIFin
MoMe
46
13
0
07 Nov 2024
ControlMM: Controllable Masked Motion Generation
Ekkasit Pinyoanuntapong
Muhammad Usama Saleem
Korrawe Karunratanakul
Pu Wang
Hongfei Xue
Cheng Chen
Chuan Guo
Junli Cao
J. Ren
Sergey Tulyakov
VGen
37
4
0
14 Oct 2024
Scaling Laws for Predicting Downstream Performance in LLMs
Yangyi Chen
Binxuan Huang
Yifan Gao
Zhengyang Wang
Jingfeng Yang
Heng Ji
LRM
47
8
0
11 Oct 2024
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Tianyi Bai
Ling Yang
Zhen Hao Wong
Jiahui Peng
Xinlin Zhuang
...
Lijun Wu
Jiantao Qiu
Wentao Zhang
Binhang Yuan
Conghui He
LLMAG
23
4
0
10 Oct 2024
Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
Tianjian Li
Haoran Xu
Weiting Tan
Kenton Murray
Daniel Khashabi
35
1
0
06 Oct 2024
Dynamic Gradient Alignment for Online Data Mixing
Simin Fan
David Grangier
Pierre Ablin
31
3
0
03 Oct 2024
Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush
Christopher Potts
Tatsunori Hashimoto
32
17
0
09 Sep 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
49
36
0
26 May 2024
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Feiyang Kang
H. Just
Yifan Sun
Himanshu Jahagirdar
Yuanzhi Zhang
Rongxing Du
Anit Kumar Sahu
Ruoxi Jia
56
17
0
05 May 2024
Text Quality-Based Pruning for Efficient Training of Language Models
Vasu Sharma
Karthik Padthe
Newsha Ardalani
Kushal Tirumala
Russell Howes
...
Po-Yao Huang
Shang-Wen Li
Armen Aghajanyan
Gargi Ghosh
Luke Zettlemoyer
46
6
0
26 Apr 2024
OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
Sachin Mehta
Mohammad Hossein Sekhavat
Qingqing Cao
Maxwell Horton
Yanzi Jin
...
Iman Mirzadeh
Mahyar Najibi
Dmitry Belenko
Peter Zatloukal
Mohammad Rastegari
OSLM
AIFin
38
50
0
22 Apr 2024
Compression Represents Intelligence Linearly
Yuzhen Huang
Jinghan Zhang
Zifei Shan
Junxian He
50
26
0
15 Apr 2024
Sailor: Open Language Models for South-East Asia
Longxu Dou
Qian Liu
Guangtao Zeng
Jia Guo
Jiahui Zhou
Wei Lu
Min Lin
LRM
32
7
0
04 Apr 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
49
62
0
25 Mar 2024
Language models scale reliably with over-training and on downstream tasks
S. Gadre
Georgios Smyrnis
Vaishaal Shankar
Suchin Gururangan
Mitchell Wortsman
...
Y. Carmon
Achal Dave
Reinhard Heckel
Niklas Muennighoff
Ludwig Schmidt
ALM
ELM
LRM
108
40
0
13 Mar 2024
SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Yu Yang
Siddhartha Mishra
Jeffrey N Chiang
Baharan Mirzasoleiman
40
17
0
12 Mar 2024
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization
Xuxi Chen
Zhendong Wang
Daouda Sow
Junjie Yang
Tianlong Chen
Yingbin Liang
Mingyuan Zhou
Zhangyang Wang
34
5
0
22 Feb 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia
Sadhika Malladi
Suchin Gururangan
Sanjeev Arora
Danqi Chen
80
186
0
06 Feb 2024
Paloma: A Benchmark for Evaluating Language Model Fit
Ian H. Magnusson
Akshita Bhagia
Valentin Hofmann
Luca Soldaini
A. Jha
...
Iz Beltagy
Hanna Hajishirzi
Noah A. Smith
Kyle Richardson
Jesse Dodge
132
21
0
16 Dec 2023
Data Diversity Matters for Robust Instruction Tuning
Alexander Bukharin
Tuo Zhao
79
35
0
21 Nov 2023
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
256
1,996
0
31 Dec 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,959
0
20 Apr 2018