ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.06604
  4. Cited By
Do we really have to filter out random noise in pre-training data for language models?

Do we really have to filter out random noise in pre-training data for language models?

10 February 2025
Jinghan Ru
Yuxin Xie
Xianwei Zhuang
Yuguo Yin
Zhihui Guo
Zhiming Liu
Qianli Ren
Yuexian Zou
ArXivPDFHTML

Papers citing "Do we really have to filter out random noise in pre-training data for language models?"

50 / 97 papers shown
Title
Multiscale Adaptive Conflict-Balancing Model For Multimedia Deepfake Detection
Multiscale Adaptive Conflict-Balancing Model For Multimedia Deepfake Detection
Zihan Xiong
Xiaohua Wu
Lei Chen
Fangqi Lou
42
0
0
19 May 2025
Open Set Domain Adaptation with Vision-language models via Gradient-aware Separation
Open Set Domain Adaptation with Vision-language models via Gradient-aware Separation
Haoyang Chen
VLM
52
0
0
16 May 2025
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Dongchao Yang
Songxiang Liu
Haohan Guo
Jiankun Zhao
Yuanyuan Wang
...
Xubo Liu
Xueyuan Chen
Xu Tan
Xixin Wu
Helen Meng
168
1
0
14 Apr 2025
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Dongchao Yang
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
96
3
0
03 Apr 2025
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
MLLM
VLM
LRM
133
8
0
21 Jan 2025
A Note on Shumailov et al. (2024): `AI Models Collapse When Trained on
  Recursively Generated Data'
A Note on Shumailov et al. (2024): `AI Models Collapse When Trained on Recursively Generated Data'
Ali Borji
78
44
0
16 Oct 2024
LLM-PCGC: Large Language Model-based Point Cloud Geometry Compression
LLM-PCGC: Large Language Model-based Point Cloud Geometry Compression
Yuqi Ye
Wei Gao
51
1
0
16 Aug 2024
Leveraging Web-Crawled Data for High-Quality Fine-Tuning
Leveraging Web-Crawled Data for High-Quality Fine-Tuning
Jing Zhou
Chenglin Jiang
Wei Shen
Xiao Zhou
Xiaonan He
ALM
56
4
0
15 Aug 2024
UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot
  Audio Task Learner
UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner
Dongchao Yang
Haohan Guo
Yuanyuan Wang
Rongjie Huang
Xiang Li
Xu Tan
Xixin Wu
Helen Meng
AuLLM
57
16
0
14 Jun 2024
Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for
  Vision Transformer
Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer
Qihang Fan
Huaibo Huang
Mingrui Chen
Ran He
66
3
0
22 May 2024
Vision Transformer with Sparse Scan Prior
Vision Transformer with Sparse Scan Prior
Qihang Fan
Huaibo Huang
Mingrui Chen
Ran He
ViT
63
5
0
22 May 2024
Talking Nonsense: Probing Large Language Models' Understanding of
  Adversarial Gibberish Inputs
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova
James Zou
AAML
70
5
0
26 Apr 2024
On Training Data Influence of GPT Models
On Training Data Influence of GPT Models
Qingyi Liu
Yekun Chai
Shuohuan Wang
Yu Sun
Qiwei Peng
Keze Wang
Hua Wu
TDI
AI4CE
45
6
0
11 Apr 2024
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Zeyuan Allen-Zhu
Yuanzhi Li
KELM
23
66
0
08 Apr 2024
How Bad is Training on Synthetic Data? A Statistical Analysis of
  Language Model Collapse
How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
M. Seddik
Suei-Wen Chen
Soufiane Hayou
Pierre Youssef
Merouane Debbah
74
35
0
07 Apr 2024
IntactKV: Improving Large Language Model Quantization by Keeping Pivot
  Tokens Intact
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
Ruikang Liu
Haoli Bai
Haokun Lin
Yuening Li
Han Gao
Zheng-Jun Xu
Lu Hou
Jun Yao
Chun Yuan
MQ
35
29
0
02 Mar 2024
A Comprehensive Evaluation of Quantization Strategies for Large Language
  Models
A Comprehensive Evaluation of Quantization Strategies for Large Language Models
Renren Jin
Jiangcun Du
Wuwei Huang
Wei Liu
Jian Luan
Bin Wang
Deyi Xiong
MQ
44
32
0
26 Feb 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model
  Pretraining Research
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini
Rodney Michael Kinney
Akshita Bhagia
Dustin Schwenk
David Atkinson
...
Hanna Hajishirzi
Iz Beltagy
Dirk Groeneveld
Jesse Dodge
Kyle Lo
63
265
0
31 Jan 2024
Preparing Lessons for Progressive Training on Language Models
Preparing Lessons for Progressive Training on Language Models
Yu Pan
Ye Yuan
Yichun Yin
Jiaxin Shi
Zenglin Xu
Ming Zhang
Lifeng Shang
Xin Jiang
Qun Liu
55
9
0
17 Jan 2024
What's In My Big Data?
What's In My Big Data?
Yanai Elazar
Akshita Bhagia
Ian H. Magnusson
Abhilasha Ravichander
Dustin Schwenk
...
Luca Soldaini
Sameer Singh
Hanna Hajishirzi
Noah A. Smith
Jesse Dodge
25
93
0
31 Oct 2023
Reusing Pretrained Models by Multi-linear Operators for Efficient
  Training
Reusing Pretrained Models by Multi-linear Operators for Efficient Training
Yu Pan
Ye Yuan
Yichun Yin
Zenglin Xu
Lifeng Shang
Xin Jiang
Qun Liu
66
16
0
16 Oct 2023
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Dongchao Yang
Jinchuan Tian
Xuejiao Tan
Rongjie Huang
Songxiang Liu
...
Jiang Bian
Xixin Wu
Zhou Zhao
Shinji Watanabe
Helen M. Meng
CVBM
AuLLM
67
122
0
01 Oct 2023
Understanding and Mitigating the Label Noise in Pre-training on
  Downstream Tasks
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
Hao Chen
Jindong Wang
Ankit Shah
Ran Tao
Hongxin Wei
Berfin cSimcsek
Masashi Sugiyama
Bhiksha Raj
60
27
0
29 Sep 2023
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Zeyuan Allen-Zhu
Yuanzhi Li
KELM
75
149
0
25 Sep 2023
RMT: Retentive Networks Meet Vision Transformers
RMT: Retentive Networks Meet Vision Transformers
Qihang Fan
Huaibo Huang
Mingrui Chen
Hongmin Liu
Ran He
ViT
70
78
0
20 Sep 2023
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data
  Selection for Instruction Tuning
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
Ming Li
Yong Zhang
Zhitao Li
Jiuhai Chen
Lichang Chen
Ning Cheng
Jianzong Wang
Dinesh Manocha
Jing Xiao
98
194
0
23 Aug 2023
Beyond Scale: the Diversity Coefficient as a Data Quality Metric
  Demonstrates LLMs are Pre-trained on Formally Diverse Data
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
Alycia Lee
Brando Miranda
Sudharsan Sundar
Sanmi Koyejo
81
17
0
24 Jun 2023
Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards
  Simpler Subnetworks
Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks
F. Chen
D. Kunin
Atsushi Yamamura
Surya Ganguli
70
27
0
07 Jun 2023
Toward Understanding Generative Data Augmentation
Toward Understanding Generative Data Augmentation
Chenyu Zheng
Guoqiang Wu
Chongxuan Li
46
28
0
27 May 2023
A Pretrainer's Guide to Training Data: Measuring the Effects of Data
  Age, Domain Coverage, Quality, & Toxicity
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre
Gregory Yauney
Emily Reif
Katherine Lee
Adam Roberts
...
Denny Zhou
Jason W. Wei
Kevin Robinson
David M. Mimno
Daphne Ippolito
76
154
0
22 May 2023
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Percy Liang
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMe
MoE
89
195
0
17 May 2023
FedSOV: Federated Model Secure Ownership Verification with Unforgeable
  Signature
FedSOV: Federated Model Secure Ownership Verification with Unforgeable Signature
Wenyuan Yang
Gongxi Zhu
Yuguo Yin
Hanlin Gu
Lixin Fan
Qiang Yang
Xiaochun Cao
FedML
26
6
0
10 May 2023
FedZKP: Federated Model Ownership Verification with Zero-knowledge Proof
FedZKP: Federated Model Ownership Verification with Zero-knowledge Proof
Wenyuan Yang
Yuguo Yin
Gongxi Zhu
Hanlin Gu
Lixin Fan
Xiaochun Cao
Qiang Yang
FedML
35
9
0
08 May 2023
HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio
  Codec
HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec
Dongchao Yang
Songxiang Liu
Rongjie Huang
Jinchuan Tian
Chao Weng
Yuexian Zou
175
126
0
04 May 2023
ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and
  Multilingual Natural Language Generation
ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation
Bang-ju Yang
Fenglin Liu
Yuexian Zou
Xian Wu
Yaowei Wang
David Clifton
45
9
0
11 Mar 2023
Imbalanced Open Set Domain Adaptation via Moving-threshold Estimation
  and Gradual Alignment
Imbalanced Open Set Domain Adaptation via Moving-threshold Estimation and Gradual Alignment
Jinghan Ru
Jun Tian
Zhekai Du
Chengwei Xiao
Jingjing Li
Jikang Cheng
68
12
0
08 Mar 2023
Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves
  Generalization
Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization
Xingxuan Zhang
Renzhe Xu
Han Yu
Hao Zou
Peng Cui
41
40
0
03 Mar 2023
LLaMA: Open and Efficient Foundation Language Models
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM
PILM
817
12,840
0
27 Feb 2023
A Comprehensive Survey on Source-free Domain Adaptation
A Comprehensive Survey on Source-free Domain Adaptation
Zhiqi Yu
Jingjing Li
Zhekai Du
Lei Zhu
Jikang Cheng
TTA
91
99
0
23 Feb 2023
Data Selection for Language Models via Importance Resampling
Data Selection for Language Models via Importance Resampling
Sang Michael Xie
Shibani Santurkar
Tengyu Ma
Percy Liang
86
186
0
06 Feb 2023
Revisiting Discriminative vs. Generative Classifiers: Theory and
  Implications
Revisiting Discriminative vs. Generative Classifiers: Theory and Implications
Chenyu Zheng
Guoqiang Wu
Fan Bao
Yue Cao
Chongxuan Li
Jun Zhu
BDL
50
30
0
05 Feb 2023
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with
  Natural Language Style Prompt
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
Dongchao Yang
Songxiang Liu
Rongjie Huang
Chao Weng
Helen Meng
DiffM
VLM
57
94
0
31 Jan 2023
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Sanghyun Woo
Shoubhik Debnath
Ronghang Hu
Xinlei Chen
Zhuang Liu
In So Kweon
Saining Xie
SyDa
128
760
0
02 Jan 2023
Noise-aware Learning from Web-crawled Image-Text Data for Image
  Captioning
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Woohyun Kang
Jonghwan Mun
Sungjun Lee
Byungseok Roh
VLM
37
18
0
27 Dec 2022
Is Out-of-Distribution Detection Learnable?
Is Out-of-Distribution Detection Learnable?
Zhen Fang
Yixuan Li
Jie Lu
Jiahua Dong
Bo Han
Feng Liu
OODD
63
125
0
26 Oct 2022
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
  Language Models
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Hong Liu
Sang Michael Xie
Zhiyuan Li
Tengyu Ma
AI4CE
79
50
0
25 Oct 2022
LAION-5B: An open large-scale dataset for training next generation
  image-text models
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann
Romain Beaumont
Richard Vencu
Cade Gordon
Ross Wightman
...
Srivatsa Kundurthy
Katherine Crowson
Ludwig Schmidt
R. Kaczmarczyk
J. Jitsev
VLM
MLLM
CLIP
127
3,355
0
16 Oct 2022
Diffsound: Discrete Diffusion Model for Text-to-sound Generation
Diffsound: Discrete Diffusion Model for Text-to-sound Generation
Dongchao Yang
Jianwei Yu
Helin Wang
Wen Wang
Chao Weng
Yuexian Zou
Dong Yu
DiffM
51
298
0
20 Jul 2022
Improving Pre-trained Language Model Fine-tuning with Noise Stability
  Regularization
Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization
Hang Hua
Xingjian Li
Dejing Dou
Chengzhong Xu
Jiebo Luo
78
15
0
12 Jun 2022
Dataset Pruning: Reducing Training Data by Examining Generalization
  Influence
Dataset Pruning: Reducing Training Data by Examining Generalization Influence
Shuo Yang
Zeke Xie
Hanyu Peng
Minjing Xu
Mingming Sun
P. Li
DD
164
111
0
19 May 2022
12
Next