ResearchTrend.AI

Pre-training Distillation for Large Language Models: A Design Space Exploration
arXiv:2410.16215 · 21 October 2024
Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jing Zhang, Lei Hou, Juanzi Li

Papers citing "Pre-training Distillation for Large Language Models: A Design Space Exploration"

21 / 21 papers shown
1. Constraint Back-translation Improves Complex Instruction Following of Large Language Models
   Yunjia Qi, Hao Peng, Xinyu Wang, Bin Xu, Lei Hou, Juanzi Li
   96 · 3 · 0 | 31 Oct 2024

2. Knowledge Distillation Based on Transformed Teacher Matching
   Kaixiang Zheng, En-Hui Yang
   69 · 20 · 0 | 17 Feb 2024

3. Orca 2: Teaching Small Language Models How to Reason
   Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, ..., Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, Ahmed Hassan Awadallah
   ReLM, LRM | 72 · 141 · 0 | 18 Nov 2023

4. NormKD: Normalized Logits for Knowledge Distillation
   Zhihao Chi, Tu Zheng, Hengjia Li, Zheng Yang, Boxi Wu, Binbin Lin, D. Cai
   54 · 14 · 0 | 01 Aug 2023

5. Training language models to follow instructions with human feedback
   Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
   OSLM, ALM | 837 · 12,893 · 0 | 04 Mar 2022

6. Training Verifiers to Solve Math Word Problems
   K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
   ReLM, OffRL, LRM | 241 · 4,392 · 0 | 27 Oct 2021

7. Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation
   Taehyeon Kim, Jaehoon Oh, Nakyil Kim, Sangwook Cho, Se-Young Yun
   51 · 235 · 0 | 19 May 2021

8. DynaBERT: Dynamic BERT with Adaptive Width and Depth
   Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu
   MQ | 76 · 322 · 0 | 08 Apr 2020

9. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
   Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou
   MQ | 104 · 816 · 0 | 06 Apr 2020

10. FastBERT: a Self-distilling BERT with Adaptive Inference Time
    Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju
    84 · 359 · 0 | 05 Apr 2020

11. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
    VLM | 129 · 1,265 · 0 | 25 Feb 2020

12. Uninformed Students: Student-Teacher Anomaly Detection with Discriminative Latent Embeddings
    Paul Bergmann, Michael Fauser, David Sattlegger, C. Steger
    72 · 662 · 0 | 06 Nov 2019

13. Contrastive Representation Distillation
    Yonglong Tian, Dilip Krishnan, Phillip Isola
    144 · 1,048 · 0 | 23 Oct 2019

14. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
    Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
    223 · 7,498 · 0 | 02 Oct 2019

15. TinyBERT: Distilling BERT for Natural Language Understanding
    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, F. Wang, Qun Liu
    VLM | 94 · 1,860 · 0 | 23 Sep 2019

16. Adaptive Regularization of Labels
    Qianggang Ding, Sifan Wu, Hao Sun, Jiadong Guo, Shutao Xia
    ODL | 51 · 29 · 0 | 15 Aug 2019

17. When Does Label Smoothing Help?
    Rafael Müller, Simon Kornblith, Geoffrey E. Hinton
    UQCV | 189 · 1,943 · 0 | 06 Jun 2019

18. A Study of BFLOAT16 for Deep Learning Training
    Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, K. Banerjee, ..., Sudarshan Srinivasan, Abhisek Kundu, M. Smelyanskiy, Bharat Kaul, Pradeep Dubey
    MQ | 78 · 346 · 0 | 29 May 2019

19. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
    VLM, SSL, SSeg | 1.7K · 94,729 · 0 | 11 Oct 2018

20. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
    Sergey Zagoruyko, N. Komodakis
    116 · 2,578 · 0 | 12 Dec 2016

21. Adam: A Method for Stochastic Optimization
    Diederik P. Kingma, Jimmy Ba
    ODL | 1.7K · 150,006 · 0 | 22 Dec 2014