How to Train BERT with an Academic Budget
Peter Izsak, Moshe Berchansky, Omer Levy
15 April 2021 (arXiv:2104.07705)

Papers citing "How to Train BERT with an Academic Budget"

50 of 72 citing papers shown
Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation
Hannes Waldetoft, Jakob Torgander, Måns Magnusson
05 May 2025
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, ..., Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell
10 Apr 2025
Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon, Avishai Elmakies, Yossi Adi
19 Feb 2025
A distributional simplicity bias in the learning dynamics of transformers
Riccardo Rende, Federica Gerace, A. Laio, Sebastian Goldt
17 Feb 2025
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, ..., Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli
18 Dec 2024
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Gavia Gray, Aman Tiwari, Shane Bergsma, Joel Hestness
01 Nov 2024
$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick
30 Oct 2024
Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization
Zilong Li
19 Oct 2024
Exploring the Benefit of Activation Sparsity in Pre-training
Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou
04 Oct 2024
Expanding Expressivity in Transformer Models with MöbiusAttention
Anna-Maria Halacheva, M. Nayyeri, Steffen Staab
08 Sep 2024
Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers
Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu
13 Jul 2024
Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization
Partha Chakraborty, Venkatraman Arumugam, M. Nagappan
25 Jun 2024
Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
Minh Duc Bui, Fabian David Schmidt, Goran Glavaš, K. Wense
30 Apr 2024
PeLLE: Encoder-based language models for Brazilian Portuguese based on open data
Guilherme Lamartine de Mello, Marcelo Finger, F. Serras, M. Carpi, Marcos Menon Jose, Pedro Henrique Domingues, Paulo Cavalim
29 Feb 2024
Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Mahdi Karami, Ali Ghodsi
28 Feb 2024
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning
Nik Vaessen, David A. van Leeuwen
21 Feb 2024
The Compute Divide in Machine Learning: A Threat to Academic Contribution and Scrutiny?
T. Besiroglu, S. Bergerson, Amelia Michael, Lennart Heim, Xueyun Luo, Neil Thompson
04 Jan 2024
MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
Jacob P. Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, D. Khudia, Jonathan Frankle
29 Dec 2023
Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
28 Dec 2023
Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Rui Pan, Yuxing Liu, Xiaoyu Wang, Tong Zhang
22 Dec 2023
CLIMB: Curriculum Learning for Infant-inspired Model Building
Richard Diehl Martinez, Zébulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, P. Buttery, Lisa Beinborn
15 Nov 2023
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew
Eylon Gueta, Omer Goldman, Reut Tsarfaty
01 Nov 2023
A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
22 Oct 2023
A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers
Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière
09 Oct 2023
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
Che Liu, Sibo Cheng, Chong Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci
17 Jul 2023
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner
12 Jul 2023
Biomedical Language Models are Robust to Sub-optimal Tokenization
Bernal Jiménez Gutiérrez, Huan Sun, Yu-Chuan Su
30 Jun 2023
Surveying (Dis)Parities and Concerns of Compute Hungry NLP Research
Ji-Ung Lee, Haritz Puerto, Betty van Aken, Yuki Arase, Jessica Zosa Forde, ..., Andreas Rucklé, Iryna Gurevych, Roy Schwartz, Emma Strubell, Jesse Dodge
29 Jun 2023
Lost in Translation: Large Language Models in Non-English Content Analysis
Gabriel Nicholas, Aliya Bhatia
12 Jun 2023
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, ..., Barlas Oğuz, Muhammad Abdul-Mageed, L. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra
08 Jun 2023
Data-Efficient French Language Modeling with CamemBERTa
Wissam Antoun, Benoît Sagot, Djamé Seddah
02 Jun 2023
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank
24 May 2023
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma
23 May 2023
Cuttlefish: Low-Rank Model Training without All the Tuning
Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos
04 May 2023
Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
Junmo Kang, Wei-ping Xu, Alan Ritter
02 May 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
17 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, Dacheng Tao
07 Apr 2023
Do Transformers Parse while Predicting the Masked Word?
Haoyu Zhao, A. Panigrahi, Rong Ge, Sanjeev Arora
14 Mar 2023
The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment
Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell
13 Feb 2023
Data Selection for Language Models via Importance Resampling
Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang
06 Feb 2023
Which Model Shall I Choose? Cost/Quality Trade-offs for Text Classification Tasks
Shi Zong, Joshua Seltzer, Jia-Yu Pan, Kathy Cheng, Jimmy J. Lin
17 Jan 2023
NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith
11 Jan 2023
Does compressing activations help model parallel training?
S. Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman
06 Jan 2023
Cramming: Training a Language Model on a Single GPU in One Day
Jonas Geiping, Tom Goldstein
28 Dec 2022
Pretraining Without Attention
Junxiong Wang, J. Yan, Albert Gu, Alexander M. Rush
20 Dec 2022
ExtremeBERT: A Toolkit for Accelerating Pretraining of Customized BERT
Rui Pan, Shizhe Diao, Jianlin Chen, Tong Zhang
30 Nov 2022
Word-Level Representation From Bytes For Language Modeling
Chul Lee, Qipeng Guo, Xipeng Qiu
23 Nov 2022
Training a Vision Transformer from scratch in less than 24 hours with 1 GPU
Saghar Irandoust, Thibaut Durand, Yunduz Rakhmangulova, Wenjie Zi, Hossein Hajimirsadeghi
09 Nov 2022
Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
Baohao Liao, David Thulke, Sanjika Hewavitharana, Hermann Ney, Christof Monz
09 Nov 2022
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma
25 Oct 2022