Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
    ODL
arXiv: 1904.00962 (abs) · PDF · HTML · GitHub (1698★)

Papers citing "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"

50 / 611 papers shown
Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
Shaoshuai Shi
Lin Zhang
Yue Liu
123
9
0
14 Jul 2021
Automated Learning Rate Scheduler for Large-batch Training
Chiheon Kim
Saehoon Kim
Jongmin Kim
Donghoon Lee
Sungwoong Kim
57
20
0
13 Jul 2021
KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks
J. G. Pauloski
Qi Huang
Lei Huang
Shivaram Venkataraman
Kyle Chard
Ian Foster
Zhao-jie Zhang
86
29
0
04 Jul 2021
ResIST: Layer-Wise Decomposition of ResNets for Distributed Training
Chen Dun
Cameron R. Wolfe
C. Jermaine
Anastasios Kyrillidis
95
21
0
02 Jul 2021
What can linear interpolation of neural network loss landscapes tell us?
Tiffany J. Vlaar
Jonathan Frankle
MoMe
78
28
0
30 Jun 2021
High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
Ashok Cutkosky
Harsh Mehta
83
62
0
28 Jun 2021
AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks
Alexandra Peste
Eugenia Iofinova
Adrian Vladu
Dan Alistarh
AI4CE
427
72
0
23 Jun 2021
Secure Distributed Training at Scale
Eduard A. Gorbunov
Alexander Borzunov
Michael Diskin
Max Ryabinin
FedML
90
15
0
21 Jun 2021
Multirate Training of Neural Networks
Tiffany J. Vlaar
Benedict Leimkuhler
55
4
0
20 Jun 2021
Distributed Deep Learning in Open Collaborations
Michael Diskin
Alexey Bukhtiyarov
Max Ryabinin
Lucile Saulnier
Quentin Lhoest
...
Denis Mazur
Ilia Kobelev
Yacine Jernite
Thomas Wolf
Gennady Pekhimenko
FedML
134
59
0
18 Jun 2021
Large-Scale Chemical Language Representations Capture Molecular Structure and Properties
Jerret Ross
Brian M. Belgodere
Vijil Chenthamarakshan
Inkit Padhi
Youssef Mroueh
Payel Das
AI4CE
91
305
0
17 Jun 2021
On Large-Cohort Training for Federated Learning
Zachary B. Charles
Zachary Garrett
Zhouyuan Huo
Sergei Shmulyian
Virginia Smith
FedML
79
114
0
15 Jun 2021
Pre-Trained Models: Past, Present and Future
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin, MQ, AI4MH
179
865
0
14 Jun 2021
On the Convergence of Differentially Private Federated Learning on Non-Lipschitz Objectives, and with Normalized Client Updates
Rudrajit Das
Abolfazl Hashemi
Sujay Sanghavi
Inderjit S. Dhillon
FedML
85
4
0
13 Jun 2021
A Pseudo Label-wise Attention Network for Automatic ICD Coding
Yifan Wu
Min Zeng
Ying Yu
Min Li
63
12
0
12 Jun 2021
Federated Learning with Buffered Asynchronous Aggregation
John Nguyen
Kshitiz Malik
Hongyuan Zhan
Ashkan Yousefpour
Michael G. Rabbat
Mani Malek
Dzmitry Huba
FedML
101
316
0
11 Jun 2021
A Coupled Design of Exploiting Record Similarity for Practical Vertical Federated Learning
Zhaomin Wu
Qinbin Li
Bingsheng He
FedML
73
20
0
11 Jun 2021
Generative Models as a Data Source for Multiview Representation Learning
Ali Jahanian
Xavier Puig
Yonglong Tian
Phillip Isola
101
129
0
09 Jun 2021
Learning Multilingual Representation for Natural Language Understanding with Enhanced Cross-Lingual Supervision
Yinpeng Guo
Liangyou Li
Xin Jiang
Qun Liu
55
0
0
09 Jun 2021
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
Jannik Kossen
Neil Band
Clare Lyle
Aidan Gomez
Tom Rainforth
Y. Gal
OOD, 3DPC
133
142
0
04 Jun 2021
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
Xiangning Chen
Cho-Jui Hsieh
Boqing Gong
ViT
117
330
0
03 Jun 2021
A Generalizable Approach to Learning Optimizers
Diogo Almeida
Clemens Winter
Jie Tang
Wojciech Zaremba
AI4CE
97
29
0
02 Jun 2021
LRTuner: A Learning Rate Tuner for Deep Neural Networks
Nikhil Iyer
V. Thejas
Nipun Kwatra
Ramachandran Ramjee
Muthian Sivathanu
ODL
55
1
0
30 May 2021
Tesseract: Parallelize the Tensor Parallelism Efficiently
Boxiang Wang
Qifan Xu
Zhengda Bian
Yang You
VLM, GNN
44
35
0
30 May 2021
Maximizing Parallelism in Distributed Training for Huge Neural Networks
Zhengda Bian
Qifan Xu
Boxiang Wang
Yang You
MoE
63
48
0
30 May 2021
Knowledge Inheritance for Pre-trained Language Models
Yujia Qin
Yankai Lin
Jing Yi
Jiajie Zhang
Xu Han
...
Yusheng Su
Zhiyuan Liu
Peng Li
Maosong Sun
Jie Zhou
VLM
85
50
0
28 May 2021
Hierarchical Transformer Encoders for Vietnamese Spelling Correction
H. Tran
C. Dinh
Long Phan
S. T. Nguyen
62
12
0
28 May 2021
Accelerating Gossip SGD with Periodic Global Averaging
Yiming Chen
Kun Yuan
Yingya Zhang
Pan Pan
Yinghui Xu
W. Yin
79
44
0
19 May 2021
SHARE: a System for Hierarchical Assistive Recipe Editing
Shuyang Li
Yufei Li
Jianmo Ni
Julian McAuley
52
20
0
17 May 2021
Compressed Communication for Distributed Training: Adaptive Methods and System
Yuchen Zhong
Cong Xie
Shuai Zheng
Yanghua Peng
74
9
0
17 May 2021
Drill the Cork of Information Bottleneck by Inputting the Most Important Data
Xinyu Peng
Jiawei Zhang
Feiyue Wang
Li Li
46
6
0
15 May 2021
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
Abhinav Jangda
Jun Huang
Guodong Liu
Amir Hossein Nodehi Sabet
Saeed Maleki
Youshan Miao
Madan Musuvathi
Todd Mytkowicz
Olli Saarikivi
77
64
0
12 May 2021
ReadTwice: Reading Very Large Documents with Memories
Yury Zemlyanskiy
Joshua Ainslie
Michiel de Jong
Philip Pham
Ilya Eckstein
Fei Sha
AIMat, RALM
88
18
0
10 May 2021
Graph Inference Representation: Learning Graph Positional Embeddings with Anchor Path Encoding
Yuheng Lu
Jinpeng Chen
Chuxiong Sun
Jie Hu
GNN
30
2
0
09 May 2021
ResMLP: Feedforward networks for image classification with data-efficient training
Hugo Touvron
Piotr Bojanowski
Mathilde Caron
Matthieu Cord
Alaaeldin El-Nouby
...
Gautier Izacard
Armand Joulin
Gabriel Synnaeve
Jakob Verbeek
Hervé Jégou
VLM
140
675
0
07 May 2021
Initialization and Regularization of Factorized Neural Layers
M. Khodak
Neil A. Tenenholtz
Lester W. Mackey
Nicolò Fusi
159
57
0
03 May 2021
Forming Ensembles at Runtime: A Machine Learning Approach
T. Bures
I. Gerostathopoulos
P. Hnetynka
J. Pacovský
19
6
0
30 Apr 2021
DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training
Kun Yuan
Yiming Chen
Xinmeng Huang
Yingya Zhang
Pan Pan
Yinghui Xu
W. Yin
MoE
115
64
0
24 Apr 2021
ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training
Chia-Yu Chen
Jiamin Ni
Songtao Lu
Xiaodong Cui
Pin-Yu Chen
...
Naigang Wang
Swagath Venkataramani
Vijayalakshmi Srinivasan
Wei Zhang
K. Gopalakrishnan
79
67
0
21 Apr 2021
Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model
P. Kummervold
Javier de la Rosa
Freddy Wetjen
Svein Arne Brygfjeld
107
56
0
19 Apr 2021
How to Train BERT with an Academic Budget
Peter Izsak
Moshe Berchansky
Omer Levy
148
119
0
15 Apr 2021
Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing
Akshat Shrivastava
P. Chuang
Arun Babu
Shrey Desai
Abhinav Arora
Alexander Zotov
Ahmed Aly
78
21
0
15 Apr 2021
Demystifying BERT: Implications for Accelerator Design
Suchita Pati
Shaizeen Aga
Nuwan Jayasena
Matthew D. Sinclair
LLMAG
91
17
0
14 Apr 2021
Large-Scale Contextualised Language Modelling for Norwegian
Andrey Kutuzov
Jeremy Barnes
Erik Velldal
Lilja Ovrelid
Stephan Oepen
84
38
0
13 Apr 2021
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
Conglong Li
A. A. Awan
Hanlin Tang
Samyam Rajbhandari
Yuxiong He
120
33
0
13 Apr 2021
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
Dheevatsa Mudigere
Y. Hao
Jianyu Huang
Zhihao Jia
Andrew Tulloch
...
Ajit Mathews
Lin Qiao
M. Smelyanskiy
Bill Jia
Vijay Rao
115
155
0
12 Apr 2021
An Empirical Study of Training Self-Supervised Vision Transformers
Xinlei Chen
Saining Xie
Kaiming He
ViT
185
1,875
0
05 Apr 2021
Physics-informed neural networks for the shallow-water equations on the sphere
Alexander Bihlo
R. Popovych
81
79
0
01 Apr 2021
Patch Craft: Video Denoising by Deep Modeling and Patch Matching
Gregory Vaksman
Michael Elad
P. Milanfar
50
66
0
25 Mar 2021
AutoMix: Unveiling the Power of Mixup for Stronger Classifiers
Zicheng Liu
Siyuan Li
Di Wu
Jianzhu Guo
Zhiyuan Chen
Lirong Wu
Stan Z. Li
121
78
0
24 Mar 2021