1904.00962
Cited By
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
ODL
Github (1698★)
Papers citing
"Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"
50 / 611 papers shown
Title
Unit Scaling: Out-of-the-Box Low-Precision Training
Charlie Blake
Douglas Orr
Carlo Luschi
MQ
85
7
0
20 Mar 2023
CerviFormer: A Pap-smear based cervical cancer classification method using cross attention and latent transformer
Bhaswati Singha Deo
M. Pal
P. Panigrahi
A. Pradhan
MedIm
50
25
0
17 Mar 2023
Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel
Andrey Kutuzov
Lilja Øvrelid
Erik Velldal
106
32
0
17 Mar 2023
MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
Keno K. Bressem
Jens-Michalis Papaioannou
Paul Grundmann
Florian Borchert
Lisa Christine Adams
...
Moritz Augustin
Lennart Grosser
Marcus R. Makowski
Hugo J. W. L. Aerts
Alexander Loser
AI4MH
63
34
0
14 Mar 2023
InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning
Ziheng Qin
Kaidi Wang
Zangwei Zheng
Jianyang Gu
Xiang Peng
...
Daquan Zhou
Lei Shang
Baigui Sun
Xuansong Xie
Yang You
190
53
0
08 Mar 2023
Judging Adam: Studying the Performance of Optimization Methods on ML4SE Tasks
D. Pasechnyuk
Anton Prazdnichnykh
Mikhail Evtikhiev
T. Bryksin
71
1
0
06 Mar 2023
What Is Missing in IRM Training and Evaluation? Challenges and Solutions
Yihua Zhang
Pranay Sharma
Parikshit Ram
Min-Fong Hong
Kush R. Varshney
Sijia Liu
84
13
0
04 Mar 2023
Learning to Grow Pretrained Models for Efficient Transformer Training
Peihao Wang
Yikang Shen
Lucas Torroba Hennigen
P. Greengard
Leonid Karlinsky
Rogerio Feris
David D. Cox
Zhangyang Wang
Yoon Kim
75
56
0
02 Mar 2023
CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis
Ji-Hoon Kim
Hongying Yang
Yooncheol Ju
Il-Hwan Kim
Byeong-Yeol Kim
81
9
0
28 Feb 2023
BrainBERT: Self-supervised representation learning for intracranial recordings
Christopher Wang
Vighnesh Subramaniam
A. Yaari
Gabriel Kreiman
Boris Katz
Ignacio Cases
Andrei Barbu
MedIm
SSL
102
41
0
28 Feb 2023
Spatial Bias for Attention-free Non-local Neural Networks
Junhyung Go
Jongbin Ryu
SSL
69
10
0
24 Feb 2023
DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining
Lin Zhang
Shaoshuai Shi
Xiaowen Chu
Wei Wang
Yue Liu
Chengjian Liu
77
11
0
24 Feb 2023
Practical Knowledge Distillation: Using DNNs to Beat DNNs
Chungman Lee
Pavlos Anastasios Apostolopulos
Igor L. Markov
FedML
67
1
0
23 Feb 2023
Equivariant Polynomials for Graph Neural Networks
Omri Puny
Derek Lim
B. Kiani
Haggai Maron
Y. Lipman
96
33
0
22 Feb 2023
mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization
Kayhan Behdin
Qingquan Song
Aman Gupta
S. Keerthi
Ayan Acharya
Borja Ocejo
Gregory Dexter
Rajiv Khanna
D. Durfee
Rahul Mazumder
AAML
71
7
0
19 Feb 2023
Learning Language Representations with Logical Inductive Bias
Jianshu Chen
NAI
AI4CE
LRM
55
3
0
19 Feb 2023
Improving Training Stability for Multitask Ranking Models in Recommender Systems
Jiaxi Tang
Yoel Drori
Daryl Chang
M. Sathiamoorthy
Justin Gilmer
Li Wei
Xinyang Yi
Lichan Hong
Ed H. Chi
100
10
0
17 Feb 2023
G-Signatures: Global Graph Propagation With Randomized Signatures
Bernhard Schafl
Lukas Gruber
Johannes Brandstetter
Sepp Hochreiter
164
2
0
17 Feb 2023
SWIFT: Expedited Failure Recovery for Large-scale DNN Training
Keon Jang
Hassan M. G. Wassel
Behnam Montazeri
Michael Ryan
David Wetherall
61
8
0
13 Feb 2023
Constrained Empirical Risk Minimization: Theory and Practice
Eric Marcus
Ray Sheombarsing
Jan-Jakob Sonke
Jonas Teuwen
82
1
0
09 Feb 2023
DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule
Maor Ivgi
Oliver Hinder
Y. Carmon
ODL
159
66
0
08 Feb 2023
Generalizing Neural Wave Functions
Nicholas Gao
Stephan Günnemann
69
24
0
08 Feb 2023
Clinical BioBERT Hyperparameter Optimization using Genetic Algorithm
N. Kollapally
J. Geller
36
2
0
08 Feb 2023
Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion
Ashok Cutkosky
Harsh Mehta
Francesco Orabona
109
34
0
07 Feb 2023
A Survey on Efficient Training of Transformers
Bohan Zhuang
Jing Liu
Zizheng Pan
Haoyu He
Yuetian Weng
Chunhua Shen
132
50
0
02 Feb 2023
A Survey of Deep Learning: From Activations to Transformers
Johannes Schneider
Michalis Vlachos
ViT
MedIm
AI4TS
AI4CE
112
10
0
01 Feb 2023
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
Chen Chen
Bowen Zhang
Liangliang Cao
Jiguang Shen
Tom Gunter
Albin Madappally Jose
Alexander Toshev
Jonathon Shlens
Ruoming Pang
Yinfei Yang
VLM
3DV
63
16
0
30 Jan 2023
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
Max Ryabinin
Tim Dettmers
Michael Diskin
Alexander Borzunov
MoE
111
38
0
27 Jan 2023
On the Importance of Noise Scheduling for Diffusion Models
Ting Chen
DiffM
127
161
0
26 Jan 2023
Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning
Mingyu Derek Ma
Jiun-Yu Kao
Shuyang Gao
Arpit Gupta
Di Jin
Tagyoung Chung
Nanyun Peng
77
7
0
26 Jan 2023
Embodied Agents for Efficient Exploration and Smart Scene Description
Roberto Bigazzi
Marcella Cornia
S. Cascianelli
Lorenzo Baraldi
Rita Cucchiara
LM&Ro
71
7
0
17 Jan 2023
MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
Ondřej Plátek
Ondrej Dusek
63
2
0
17 Jan 2023
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling
Keyu Tian
Yi Jiang
Qishuai Diao
Chen Lin
Liwei Wang
Zehuan Yuan
99
106
0
09 Jan 2023
Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation
Xiao-Tong Yuan
P. Li
93
2
0
09 Jan 2023
Cramming: Training a Language Model on a Single GPU in One Day
Jonas Geiping
Tom Goldstein
MoE
122
91
0
28 Dec 2022
Scalable Adaptive Computation for Iterative Generation
Allan Jabri
David Fleet
Ting-Li Chen
DiffM
91
117
0
22 Dec 2022
Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint
Borui Zhang
Wenzhao Zheng
Jie Zhou
Jiwen Lu
AAML
90
7
0
18 Dec 2022
From Xception to NEXcepTion: New Design Decisions and Neural Architecture Search
Hadar Shavit
Filip Jatelnicki
Pol Mor-Puigventós
W. Kowalczyk
49
2
0
16 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
111
30
0
14 Dec 2022
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
Chaoyang He
Shuai Zheng
Aston Zhang
George Karypis
Trishul Chilimbi
Mahdi Soltanolkotabi
Salman Avestimehr
MoE
52
1
0
10 Dec 2022
Deep Incubation: Training Large Models by Divide-and-Conquering
Zanlin Ni
Yulin Wang
Jiangwei Yu
Haojun Jiang
Yu Cao
Gao Huang
VLM
101
11
0
08 Dec 2022
Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization
Kayhan Behdin
Qingquan Song
Aman Gupta
D. Durfee
Ayan Acharya
S. Keerthi
Rahul Mazumder
AAML
55
5
0
07 Dec 2022
PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices
Kazuki Osawa
Shigang Li
Torsten Hoefler
AI4CE
89
26
0
25 Nov 2022
A Self-Attention Ansatz for Ab-initio Quantum Chemistry
Ingrid von Glehn
J. Spencer
David Pfau
71
68
0
24 Nov 2022
Differentially Private Image Classification from Features
Harsh Mehta
Walid Krichene
Abhradeep Thakurta
Alexey Kurakin
Ashok Cutkosky
118
8
0
24 Nov 2022
Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
Alexander Nikulin
Vladislav Kurenkov
Denis Tarasov
Dmitry Akimov
Sergey Kolesnikov
OffRL
93
15
0
20 Nov 2022
SeDR: Segment Representation Learning for Long Documents Dense Retrieval
Junying Chen
Qingcai Chen
Dongfang Li
Yutao Huang
67
6
0
20 Nov 2022
VeLO: Training Versatile Learned Optimizers by Scaling Up
Luke Metz
James Harrison
C. Freeman
Amil Merchant
Lucas Beyer
...
Naman Agrawal
Ben Poole
Igor Mordatch
Adam Roberts
Jascha Narain Sohl-Dickstein
143
60
0
17 Nov 2022
How to Fine-Tune Vision Models with SGD
Ananya Kumar
Ruoqi Shen
Sébastien Bubeck
Suriya Gunasekar
VLM
136
31
0
17 Nov 2022
Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
Baohao Liao
David Thulke
Sanjika Hewavitharana
Hermann Ney
Christof Monz
75
9
0
09 Nov 2022