1904.00962
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
ODL
Github (1698★)
Papers citing
"Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"
50 / 611 papers shown
Title
Pretraining the Noisy Channel Model for Task-Oriented Dialogue
Qi Liu
Lei Yu
Laura Rimell
Phil Blunsom
113
26
0
18 Mar 2021
Large Batch Simulation for Deep Reinforcement Learning
Brennan Shacklett
Erik Wijmans
Aleksei Petrenko
Manolis Savva
Dhruv Batra
V. Koltun
Kayvon Fatahalian
3DV
OffRL
AI4CE
93
26
0
12 Mar 2021
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
J. Clark
Dan Garrette
Iulia Turc
John Wieting
129
224
0
11 Mar 2021
Better SGD using Second-order Momentum
Hoang Tran
Ashok Cutkosky
ODL
54
12
0
04 Mar 2021
Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
Max Ryabinin
Eduard A. Gorbunov
Vsevolod Plokhotnyuk
Gennady Pekhimenko
135
36
0
04 Mar 2021
Perceiver: General Perception with Iterative Attention
Andrew Jaegle
Felix Gimeno
Andrew Brock
Andrew Zisserman
Oriol Vinyals
João Carreira
VLM
ViT
MDE
218
1,029
0
04 Mar 2021
Lost in Pruning: The Effects of Pruning Neural Networks beyond Test Accuracy
Lucas Liebenwein
Cenk Baykal
Brandon Carter
David K Gifford
Daniela Rus
AAML
84
74
0
04 Mar 2021
Acceleration via Fractal Learning Rate Schedules
Naman Agarwal
Surbhi Goel
Cyril Zhang
80
18
0
01 Mar 2021
On the Utility of Gradient Compression in Distributed Training Systems
Saurabh Agarwal
Hongyi Wang
Shivaram Venkataraman
Dimitris Papailiopoulos
111
47
0
28 Feb 2021
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
Jeremy M. Cohen
Simran Kaur
Yuanzhi Li
J. Zico Kolter
Ameet Talwalkar
ODL
142
279
0
26 Feb 2021
MARINA: Faster Non-Convex Distributed Learning with Compression
Eduard A. Gorbunov
Konstantin Burlachenko
Zhize Li
Peter Richtárik
113
110
0
15 Feb 2021
Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu
Jeremy Bernstein
M. Meister
Yisong Yue
ODL
129
26
0
14 Feb 2021
Optimizing Inference Performance of Transformers on CPUs
D. Dice
Alex Kogan
64
16
0
12 Feb 2021
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
Zachary Nado
Justin M. Gilmer
Christopher J. Shallue
Rohan Anil
George E. Dahl
ODL
106
27
0
12 Feb 2021
High-Performance Large-Scale Image Recognition Without Normalization
Andrew Brock
Soham De
Samuel L. Smith
Karen Simonyan
VLM
337
525
0
11 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
571
3,917
0
11 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
150
117
0
31 Jan 2021
Zero-shot Generalization in Dialog State Tracking through Generative Question Answering
Shuyang Li
Jin Cao
Mukund Sridhar
Henghui Zhu
Shang-Wen Li
Wael Hamza
Julian McAuley
BDL
74
46
0
20 Jan 2021
Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration
Congliang Chen
Li Shen
Fangyu Zou
Wei Liu
81
29
0
14 Jan 2021
EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets
Xiaohan Chen
Yu Cheng
Shuohang Wang
Zhe Gan
Zhangyang Wang
Jingjing Liu
143
100
0
31 Dec 2020
AraGPT2: Pre-Trained Transformer for Arabic Language Generation
Wissam Antoun
Fady Baly
Hazem M. Hajj
VLM
71
106
0
31 Dec 2020
Universal Sentence Representation Learning with Conditional Masked Language Model
Ziyi Yang
Yinfei Yang
Daniel Cer
Jax Law
Eric F. Darve
SSL
95
58
0
28 Dec 2020
Sub-Linear Memory: How to Make Performers SLiM
Valerii Likhosherstov
K. Choromanski
Jared Davis
Xingyou Song
Adrian Weller
82
19
0
21 Dec 2020
Attention over learned object embeddings enables complex visual reasoning
David Ding
Felix Hill
Adam Santoro
Malcolm Reynolds
M. Botvinick
OCL
114
71
0
15 Dec 2020
GottBERT: a pure German Language Model
Raphael Scheible
Fabian Thomczyk
P. Tippmann
V. Jaravine
M. Boeker
VLM
61
81
0
03 Dec 2020
CPM: A Large-scale Generative Chinese Pre-trained Language Model
Zhengyan Zhang
Xu Han
Hao Zhou
Pei Ke
Yuxian Gu
...
Wentao Han
Jie Tang
Juan-Zi Li
Xiaoyan Zhu
Maosong Sun
74
119
0
01 Dec 2020
Self supervised contrastive learning for digital histopathology
Ozan Ciga
Tony Xu
Anne L. Martel
SSL
180
319
0
27 Nov 2020
Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup
Cheng Yang
Shengnan Wang
Chao Yang
Yuechuan Li
Ru He
Jingqiao Zhang
85
25
0
27 Nov 2020
Improving Layer-wise Adaptive Rate Methods using Trust Ratio Clipping
Jeffrey Fong
Siwei Chen
Kaiqi Chen
33
2
0
27 Nov 2020
Adam$^+$: A Stochastic Method with Adaptive Variance Reduction
Mingrui Liu
Wei Zhang
Francesco Orabona
Tianbao Yang
64
28
0
24 Nov 2020
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
Jianan Wang
Boyang Albert Li
Xiangyu Fan
Jing-Hua Lin
Yanwei Fu
54
2
0
15 Nov 2020
TLab: Traffic Map Movie Forecasting Based on HR-NET
Fanyou Wu
Yang Liu
Zhiyuan Liu
X. Qu
R. Gazo
E. Haviarova
37
5
0
13 Nov 2020
Morphological Disambiguation from Stemming Data
Antoine Nzeyimana
28
6
0
11 Nov 2020
Explainable COVID-19 Detection Using Chest CT Scans and Deep Learning
H. Alshazly
C. Linse
Erhardt Barth
T. Martinetz
85
162
0
09 Nov 2020
Exploring the limits of Concurrency in ML Training on Google TPUs
Sameer Kumar
James Bradbury
C. Young
Yu Emma Wang
Anselm Levskaya
...
Tao Wang
Tayo Oguntebi
Yazhou Zu
Yuanzhong Xu
Andy Swing
BDL
AIMat
MoE
LRM
64
27
0
07 Nov 2020
Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour
Arissa Wongpanich
Hieu H. Pham
J. Demmel
Mingxing Tan
Quoc V. Le
Yang You
Sameer Kumar
78
8
0
30 Oct 2020
Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung
Dan Garrette
Kiat Chuan Tan
Jason Riesa
VLM
131
65
0
24 Oct 2020
DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries
Aditi Chaudhary
K. Raman
Krishna Srinivasan
Jiecao Chen
88
25
0
23 Oct 2020
Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning
Sungkyun Chang
Donmoon Lee
Jeongsoon Park
Hyungui Lim
Kyogu Lee
Karam Ko
Yoonchang Han
103
35
0
22 Oct 2020
Towards Fully Bilingual Deep Language Modeling
Li-Hsin Chang
S. Pyysalo
Jenna Kanerva
Filip Ginter
67
3
0
22 Oct 2020
Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters
Shaoshuai Shi
Xianhao Zhou
Shutao Song
Xingyao Wang
Zilin Zhu
...
Chenyang Guo
Bo Yang
Zhibo Chen
Yongjian Wu
Xiaowen Chu
GNN
81
56
0
20 Oct 2020
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri
Olivier Ferret
Thomas Lavergne
Hiroshi Noji
Pierre Zweigenbaum
Junichi Tsujii
167
162
0
20 Oct 2020
How much progress have we made in neural network training? A New Evaluation Protocol for Benchmarking Optimizers
Yuanhao Xiong
Xuanqing Liu
Li-Cheng Lan
Yang You
Si Si
Cho-Jui Hsieh
OOD
104
1
0
19 Oct 2020
Permutationless Many-Jet Event Reconstruction with Symmetry Preserving Attention Networks
M. Fenton
Alexander Shmakov
Ta-Wei Ho
S. Hsu
D. Whiteson
Pierre Baldi
93
39
0
19 Oct 2020
RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering
Yingqi Qu
Yuchen Ding
Jing Liu
Kai Liu
Ruiyang Ren
Xin Zhao
Daxiang Dong
Hua Wu
Haifeng Wang
RALM
OffRL
283
618
0
16 Oct 2020
Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
Thibault Sellam
Amy Pu
Hyung Won Chung
Sebastian Gehrmann
Qijun Tan
Markus Freitag
Dipanjan Das
Ankur P. Parikh
VLM
78
37
0
08 Oct 2020
Which *BERT? A Survey Organizing Contextualized Encoders
Patrick Xia
Shijie Wu
Benjamin Van Durme
62
50
0
02 Oct 2020
Normalization Techniques in Training DNNs: Methodology, Analysis and Application
Lei Huang
Jie Qin
Yi Zhou
Fan Zhu
Li Liu
Ling Shao
AI4CE
176
278
0
27 Sep 2020
HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
Yifan Ding
Nicholas Botzer
Tim Weninger
VLM
MoE
37
7
0
25 Sep 2020
VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware
Andrew Or
Haoyu Zhang
M. Freedman
73
10
0
20 Sep 2020