ResearchTrend.AI
Improving Pretraining Data Using Perplexity Correlations

9 September 2024
Tristan Thrush
Christopher Potts
Tatsunori Hashimoto

Papers citing "Improving Pretraining Data Using Perplexity Correlations"

16 / 16 papers shown
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
66
1
0
23 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
42
0
0
01 Apr 2025
Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
Thomson Yen
Andrew Siah
Haozhe Chen
Tianyi Peng
Daniel Guetta
Hongseok Namkoong
53
0
0
26 Mar 2025
Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning
Peiyi Lin
Fukai Zhang
Kai Niu
Hao Fu
CLL
64
0
0
20 Mar 2025
Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations
Da Ma
Gonghu Shang
Zhi Chen
L. Qin
Yijie Luo
Lei Pan
Shuai Fan
Lu Chen
Kai Yu
46
0
0
19 Mar 2025
Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Kashun Shum
Yuanmin Huang
Hongjian Zou
Qi Ding
Yixuan Liao
Xiao Chen
Qian Liu
Junxian He
67
2
0
02 Mar 2025
Predicting Emergent Capabilities by Finetuning
Charlie Snell
Eric Wallace
Dan Klein
Sergey Levine
ELM
LRM
84
5
0
25 Nov 2024
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
Michael Pieler
Marco Bellagente
H. Teufel
Duy Phung
Nathan Cooper
...
Reshinth Adithyan
Zaid Alyafeai
Nikhil Pinnaparaju
Maksym Zhuravinskyi
Carlos Riquelme
32
1
0
28 Oct 2024
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Clara Na
Ian H. Magnusson
A. Jha
Tom Sherborne
Emma Strubell
Jesse Dodge
Pradeep Dasigi
MoMe
38
5
0
21 Oct 2024
Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches
Kosuke Akimoto
Masafumi Oyamada
26
0
0
16 Oct 2024
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Tianyi Bai
Ling Yang
Zhen Hao Wong
Jiahui Peng
Xinlin Zhuang
...
Lijun Wu
Jiantao Qiu
Wentao Zhang
Binhang Yuan
Conghui He
LLMAG
23
4
0
10 Oct 2024
Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach
Dongyue Li
Ziniu Zhang
Lu Wang
Hongyang R. Zhang
43
0
0
28 Sep 2024
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu
Liangyu Chen
Sara Ahmadian
Marzieh Fadaee
37
7
0
17 Sep 2024
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Prashant Kodali
Anmol Goel
Likhith Asapu
Vamshi Krishna Bonagiri
Anirudh Govil
Monojit Choudhury
Manish Shrivastava
Ponnurangam Kumaraguru
44
0
0
09 May 2024
OLMo: Accelerating the Science of Language Models
Dirk Groeneveld
Iz Beltagy
Pete Walsh
Akshita Bhagia
Rodney Michael Kinney
...
Jesse Dodge
Kyle Lo
Luca Soldaini
Noah A. Smith
Hanna Hajishirzi
OSLM
141
359
0
01 Feb 2024
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020