ResearchTrend.AI
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
    ODL

Papers citing "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"

50 / 611 papers shown
Towards Controllable Agent in MOBA Games with Generative Modeling
Shubao Zhang
68
0
0
15 Dec 2021
AI and extreme scale computing to learn and infer the physics of higher order gravitational wave modes of quasi-circular, spinning, non-precessing binary black hole mergers
Asad Khan
E. A. Huerta
Prayush Kumar
76
5
0
13 Dec 2021
Injecting Semantic Concepts into End-to-End Image Captioning
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lin Liang
Zhe Gan
Lijuan Wang
Yezhou Yang
Zicheng Liu
ViT VLM
86
91
0
09 Dec 2021
Extending AdamW by Leveraging Its Second Moment and Magnitude
Guoqiang Zhang
Niwa Kenta
W. Kleijn
55
3
0
09 Dec 2021
Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud
A. Mensch
Jordan Hoffmann
Trevor Cai
Eliza Rutherford
...
Simon Osindero
Karen Simonyan
Jack W. Rae
Erich Elsen
Laurent Sifre
KELM RALM
303
1,109
0
08 Dec 2021
Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup
Siyuan Li
Zicheng Liu
Zedong Wang
Di Wu
Zihan Liu
Stan Z. Li
111
27
0
30 Nov 2021
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
111
12
0
24 Nov 2021
CGX: Adaptive System Support for Communication-Efficient Deep Learning
I. Markov
Hamidreza Ramezanikebrya
Dan Alistarh
GNN
82
5
0
16 Nov 2021
A Histopathology Study Comparing Contrastive Semi-Supervised and Fully Supervised Learning
Lantian Zhang
M. Amgad
L. Cooper
SSL
52
3
0
10 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM CLIP
115
644
0
09 Nov 2021
A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks
Daniel Nichols
Siddharth Singh
Shuqing Lin
A. Bhatele
OOD
64
9
0
09 Nov 2021
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
Xingcheng Yao
Yanan Zheng
Xiaocong Yang
Zhilin Yang
86
46
0
07 Nov 2021
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
Sanjith Athlur
Nitika Saran
Muthian Sivathanu
Ramachandran Ramjee
Nipun Kwatra
GNN
118
84
0
07 Nov 2021
Large-Scale Deep Learning Optimizations: A Comprehensive Survey
Xiaoxin He
Fuzhao Xue
Xiaozhe Ren
Yang You
90
15
0
01 Nov 2021
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems
S. Farrell
M. Emani
J. Balma
L. Drescher
Aleksandr Drozd
...
Akihiro Tabuchi
V. Vishwanath
Mohamed Wahib
Masafumi Yamazaki
Junqi Yin
VLM
78
37
0
21 Oct 2021
Asynchronous Decentralized Distributed Training of Acoustic Models
Xiaodong Cui
Wei Zhang
Abdullah Kayi
Mingrui Liu
Ulrich Finkler
Brian Kingsbury
G. Saon
David S. Kung
63
3
0
21 Oct 2021
Dual Encoding U-Net for Spatio-Temporal Domain Shift Frame Prediction
Jay Santokhi
Dylan Hillier
Yiming Yang
Joned Sarwar
A. Jordán
Emil Hewage
AI4CE
56
1
0
21 Oct 2021
AdamD: Improved bias-correction in Adam
J. S. John
ODL
10
0
0
20 Oct 2021
Layer-wise Adaptive Model Aggregation for Scalable Federated Learning
Sunwoo Lee
Tuo Zhang
Chaoyang He
Salman Avestimehr
FedML
87
51
0
19 Oct 2021
bert2BERT: Towards Reusable Pretrained Language Models
Cheng Chen
Yichun Yin
Lifeng Shang
Xin Jiang
Yujia Qin
Fengyu Wang
Zhi Wang
Xiao Chen
Zhiyuan Liu
Qun Liu
VLM
85
64
0
14 Oct 2021
Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers
Yujing Ma
Florin Rusu
Kesheng Wu
A. Sim
104
3
0
13 Oct 2021
Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
Zhuosheng Zhang
Hanqing Zhang
Keming Chen
Yuhang Guo
Jingyun Hua
Yulong Wang
Ming Zhou
VLM
110
72
0
13 Oct 2021
Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
Nicholas Gao
Stephan Günnemann
86
40
0
11 Oct 2021
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning
Shaohua Wu
Xudong Zhao
Tong Yu
Rongguo Zhang
C. Shen
...
Feng Li
Hong Zhu
Jiangang Luo
Liang Xu
Xuanwei Zhang
ALM
71
61
0
10 Oct 2021
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining
Junyang Lin
An Yang
Jinze Bai
Chang Zhou
Le Jiang
...
Jie Zhang
Yong Li
Wei Lin
Jingren Zhou
Hongxia Yang
MoE
163
43
0
08 Oct 2021
Speeding up Deep Model Training by Sharing Weights and Then Unsharing
Shuo Yang
Le Hou
Xiaodan Song
Qiang Liu
Denny Zhou
150
9
0
08 Oct 2021
EF21 with Bells & Whistles: Six Algorithmic Extensions of Modern Error Feedback
Ilyas Fatkhullin
Igor Sokolov
Eduard A. Gorbunov
Zhize Li
Peter Richtárik
132
47
0
07 Oct 2021
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao
Karan Samel
Binghong Chen
Le Song
ViT LM&Ro
100
30
0
02 Oct 2021
Layer-wise and Dimension-wise Locally Adaptive Federated Learning
Belhal Karimi
Ping Li
Xiaoyun Li
FedML
122
3
0
01 Oct 2021
ResNet strikes back: An improved training procedure in timm
Ross Wightman
Hugo Touvron
Hervé Jégou
AI4TS
308
500
0
01 Oct 2021
Hierarchical Character Tagger for Short Text Spelling Error Correction
Mengyi Gao
Canran Xu
Peng Shi
VLM 3DV
82
6
0
29 Sep 2021
Stochastic Training is Not Necessary for Generalization
Jonas Geiping
Micah Goldblum
Phillip E. Pope
Michael Moeller
Tom Goldstein
179
76
0
29 Sep 2021
AdaInject: Injection Based Adaptive Gradient Descent Optimizers for Convolutional Neural Networks
S. Dubey
S. H. Shabbeer Basha
S. Singh
B. B. Chaudhuri
ODL
110
9
0
26 Sep 2021
LOTR: Face Landmark Localization Using Localization Transformer
Ukrit Watchareeruetai
Benjaphan Sommanna
Sanjana Jain
Pavit Noinongyao
Ankush Ganguly
Aubin Samacoits
Samuel W. F. Earp
Nakarin Sritrakool
ViT
96
13
0
21 Sep 2021
MURAL: Multimodal, Multitask Retrieval Across Languages
Aashi Jain
Mandy Guo
Krishna Srinivasan
Ting-Li Chen
Sneha Kudugunta
Chao Jia
Yinfei Yang
Jason Baldridge
VLM
171
52
0
10 Sep 2021
Toward Communication Efficient Adaptive Gradient Method
Xiangyi Chen
Xiaoyun Li
P. Li
FedML
85
42
0
10 Sep 2021
On the validity of pre-trained transformers for natural language processing in the software engineering domain
Julian von der Mosel
Alexander Trautsch
Steffen Herbold
74
68
0
10 Sep 2021
Learning the Physics of Particle Transport via Transformers
O. Pastor-Serrano
Zoltán Perkó
MedIm
98
14
0
08 Sep 2021
SHAQ: Single Headed Attention with Quasi-Recurrence
Nashwin Bharwani
Warren Kushner
Sangeet Dandona
Ben Schreiber
32
0
0
18 Aug 2021
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
Conglong Li
Minjia Zhang
Yuxiong He
80
38
0
13 Aug 2021
PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval
Ruiyang Ren
Shangwen Lv
Yingqi Qu
Jing Liu
Wayne Xin Zhao
Qiaoqiao She
Hua Wu
Haifeng Wang
Ji-Rong Wen
212
94
0
13 Aug 2021
Logit Attenuating Weight Normalization
Aman Gupta
R. Ramanath
Jun Shi
Anika Ramachandran
Sirou Zhou
Mingzhou Zhou
S. Keerthi
81
1
0
12 Aug 2021
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
Katikapalli Subramanyam Kalyan
A. Rajasekharan
S. Sangeetha
VLM LM&MA
113
270
0
12 Aug 2021
Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
Chen Sun
Shenggui Li
Jinyue Wang
Jun Yu
114
48
0
08 Aug 2021
Large-Scale Differentially Private BERT
Rohan Anil
Badih Ghazi
Vineet Gupta
Ravi Kumar
Pasin Manurangsi
96
139
0
03 Aug 2021
LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization
Weidong Guo
Mingjun Zhao
Lusheng Zhang
Di Niu
Jinwen Luo
Zhenhua Liu
Zhenyang Li
J. Tang
55
8
0
02 Aug 2021
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle
Sebastian Borgeaud
Jean-Baptiste Alayrac
Carl Doersch
Catalin Ionescu
...
Olivier J. Hénaff
M. Botvinick
Andrew Zisserman
Oriol Vinyals
João Carreira
MLLM VLM GNN
188
585
0
30 Jul 2021
Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization
Chiyuan Zhang
M. Raghu
Jon M. Kleinberg
Samy Bengio
OOD
113
32
0
27 Jul 2021
Go Wider Instead of Deeper
Fuzhao Xue
Ziji Shi
Futao Wei
Yuxuan Lou
Yong Liu
Yang You
ViT MoE
100
84
0
25 Jul 2021
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
Shigang Li
Torsten Hoefler
GNN AI4CE LRM
130
138
0
14 Jul 2021