Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
arXiv 2404.19319 · 30 April 2024
Minh Duc Bui, Fabian David Schmidt, Goran Glavaš, Katharina von der Wense
Papers citing "Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget" (10 of 10 papers shown)
Probing Across Time: What Does RoBERTa Know and When? (16 Apr 2021) [KELM]
Leo Z. Liu, Yizhong Wang, Jungo Kasai, Hannaneh Hajishirzi, Noah A. Smith

How to Train BERT with an Academic Budget (15 Apr 2021)
Peter Izsak, Moshe Berchansky, Omer Levy

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices (06 Apr 2020) [MQ]
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (25 Feb 2020) [VLM]
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

Scaling Laws for Neural Language Models (23 Jan 2020)
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (02 Oct 2019)
Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

TinyBERT: Distilling BERT for Natural Language Understanding (23 Sep 2019) [VLM]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, F. Wang, Qun Liu

Neural Network Acceptability Judgments (31 May 2018)
Alex Warstadt, Amanpreet Singh, Samuel R. Bowman

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (18 Apr 2017)
Adina Williams, Nikita Nangia, Samuel R. Bowman

SQuAD: 100,000+ Questions for Machine Comprehension of Text (16 Jun 2016) [RALM]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang