Does your data spark joy? Performance gains from domain upsampling at the end of training
Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle
arXiv:2406.03476 · 5 June 2024

Papers citing "Does your data spark joy? Performance gains from domain upsampling at the end of training" (19 papers shown)
1. Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
   Paul Kassianik, Baturay Saglam, Alexander Chen, Blaine Nelson, Anu Vellore, ..., Hyrum Anderson, Kojin Oshiba, Omar Santos, Yaron Singer, Amin Karbasi
   28 Apr 2025

2. Trillion 7B Technical Report
   Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin
   21 Apr 2025

3. CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
   Shizhe Diao, Yu Yang, Y. Fu, Xin Dong, Dan Su, ..., Hongxu Yin, M. Patwary, Yingyan, Jan Kautz, Pavlo Molchanov
   17 Apr 2025

4. Steering off Course: Reliability Challenges in Steering Language Models
   Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, Sachin Kumar
   06 Apr 2025

5. Measurement of LLM's Philosophies of Human Nature
   Minheng Ni, Ennan Wu, Zidong Gong, Z. Yang, Linjie Li, Chung-Ching Lin, Kevin Qinghong Lin, Lijuan Wang, Wangmeng Zuo
   03 Apr 2025

6. Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
   Thomson Yen, Andrew Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong
   26 Mar 2025

7. EuroBERT: Scaling Multilingual Encoders for European Languages
   Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André F. T. Martins, Ayoub Hammal, ..., Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo
   07 Mar 2025

8. olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
   Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini
   25 Feb 2025

9. Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
   Steven Feng, Shrimai Prabhumoye, Kezhi Kong, Dan Su, M. Patwary, M. Shoeybi, Bryan Catanzaro
   18 Dec 2024

10. Predicting Emergent Capabilities by Finetuning
    Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine
    25 Nov 2024

11. Sparse Upcycling: Inference Inefficient Finetuning
    Sasha Doubov, Nikhil Sardana, Vitaliy Chiley
    13 Nov 2024

12. Zyda-2: a 5 Trillion Token High-Quality Dataset
    Yury Tokpanov, Paolo Glorioso, Quentin Anthony, Beren Millidge
    09 Nov 2024

13. BSM: Small but Powerful Biological Sequence Model for Genes and Proteins
    Weixi Xiang, Xueting Han, Xiujuan Chai, Jing Bai
    15 Oct 2024

14. Unsupervised Data Validation Methods for Efficient Model Training
    Yurii Paniv
    10 Oct 2024

15. DEPT: Decoupled Embeddings for Pre-training Language Models
    Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, William F. Shen, Xinchi Qiu, Dongqi Cai, Yan Gao, Nicholas D. Lane
    07 Oct 2024

16. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
    Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
    06 Aug 2024

17. Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
    Jupinder Parmar, Sanjev Satheesh, M. Patwary, M. Shoeybi, Bryan Catanzaro
    09 Jul 2024

18. Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
    Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro von Werra, Martin Jaggi
    28 May 2024

19. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
    Ofir Press, Noah A. Smith, M. Lewis
    27 Aug 2021