Large Batch Optimization for Deep Learning: Training BERT in 76 minutes [ODL]
Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
1 April 2019 · arXiv:1904.00962 (v5, latest)
ArXiv (abs) · PDF · HTML · GitHub (1698★)
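
For context, this paper introduces the LAMB optimizer, which makes the large-batch BERT training in the title possible by rescaling an Adam-style update for each layer with a trust ratio ||w|| / ||update||. Below is a minimal NumPy sketch of that layer-wise update, for illustration only: it omits trust-ratio clipping and layer exclusions, and the function name, defaults, and variable names are ours rather than the paper's.

    import numpy as np

    def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-6, weight_decay=0.01):
        # One LAMB-style update for a single layer's weights w given gradient g.
        # Adam-style first and second moments, with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Raw update direction, with decoupled weight decay added in.
        update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
        # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(update)
        trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
        return w - lr * trust_ratio * update, m, v

    # Example: a few steps on a toy quadratic loss 0.5 * ||w||^2 (gradient = w).
    w = np.ones(4)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, 6):
        w, m, v = lamb_step(w, w.copy(), m, v, t)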

Papers citing "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"

Showing 50 of 611 citing papers.

A Study on Transformer Configuration and Training Objective
Fuzhao Xue, Jianghai Chen, Aixin Sun, Xiaozhe Ren, Zangwei Zheng, Xiaoxin He, Yongming Chen, Xin Jiang, Yang You
21 May 2022 · 92 · 9 · 0

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Sadhika Malladi, Kaifeng Lyu, A. Panigrahi, Sanjeev Arora
20 May 2022 · 172 · 47 · 0

Large Neural Networks Learning from Scratch with Very Few Data and without Explicit Regularization [SSL, VLM]
C. Linse, T. Martinetz
18 May 2022 · 50 · 4 · 0

On Distributed Adaptive Optimization with Gradient Compression
Xiaoyun Li, Belhal Karimi, Ping Li
11 May 2022 · 80 · 27 · 0

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
Mingrui Liu, Zhenxun Zhuang, Yunwei Lei, Chunyang Liao
10 May 2022 · 79 · 20 · 0

Large Scale Transfer Learning for Differentially Private Image Classification
Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, Ashok Cutkosky
06 May 2022 · 95 · 41 · 0

Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES [UQLM]
Felix Stahlberg, Shankar Kumar
02 May 2022 · 126 · 12 · 0

MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, Xin Jin
30 Apr 2022 · 92 · 39 · 0

Instilling Type Knowledge in Language Models via Multi-Task QA [KELM]
Shuyang Li, Mukund Sridhar, Chandan Prakash, Jin Cao, Wael Hamza, Julian McAuley
28 Apr 2022 · 79 · 7 · 0

Transformers in Time-series Analysis: A Tutorial [AI4TS]
Sabeen Ahmed, Ian E. Nielsen, Aakash Tripathi, Shamoon Siddiqui, Ghulam Rasool, R. Ramachandran
28 Apr 2022 · 88 · 167 · 0

ALBETO and DistilBETO: Lightweight Spanish Language Models
J. Canete, S. Donoso, Felipe Bravo-Marquez, Andrés Carvallo, Vladimir Araujo
19 Apr 2022 · 78 · 21 · 0

DeiT III: Revenge of the ViT [ViT]
Hugo Touvron, Matthieu Cord, Hervé Jégou
14 Apr 2022 · 134 · 418 · 0

CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU [VLM]
Zangwei Zheng, Peng Xu, Xuan Zou, Da Tang, Zhen Li, ..., Xiangzhuo Ding, Fuzhao Xue, Ziheng Qing, Youlong Cheng, Yang You
13 Apr 2022 · 86 · 7 · 0

PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems
Yuanxing Zhang, Langshi Chen, Siran Yang, Man Yuan, Hui-juan Yi, ..., Yong Li, Dingyang Zhang, Wei Lin, Lin Qu, Bo Zheng
11 Apr 2022 · 88 · 32 · 0

Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results
T. Ridnik, Hussam Lawen, Emanuel Ben-Baruch, Asaf Noy
07 Apr 2022 · 114 · 11 · 0

AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio [ODL]
Jun Lu
02 Apr 2022 · 36 · 4 · 0

Supervised Robustness-preserving Data-free Neural Network Pruning [AAML]
Mark Huasong Meng, Guangdong Bai, Sin Gee Teo, Jin Song Dong
02 Apr 2022 · 98 · 4 · 0

Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models [UQLM]
Felix Stahlberg, Ilia Kulikov, Shankar Kumar
01 Apr 2022 · 145 · 10 · 0

Exploiting Explainable Metrics for Augmented SGD [AAML]
Mahdi S. Hosseini, Mathieu Tuli, Konstantinos N. Plataniotis
31 Mar 2022 · 68 · 3 · 0

FLUTE: A Scalable, Extensible Framework for High-Performance Federated Learning Simulations [FedML]
Mirian Hipolito Garcia, Andre Manoel, Daniel Madrigal Diaz, Fatemehsadat Mireshghallah, Robert Sim, Dimitrios Dimitriadis
25 Mar 2022 · 92 · 57 · 0

Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers [LM&Ro]
A. Bucker, Luis F. C. Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, Rogerio Bonatti
25 Mar 2022 · 113 · 49 · 0

A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range [ODL]
Guoqiang Zhang, Kenta Niwa, W. Kleijn
24 Mar 2022 · 84 · 2 · 0

Token Dropping for Efficient BERT Pretraining
Le Hou, Richard Yuanzhe Pang, Dinesh Manocha, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou
24 Mar 2022 · 85 · 46 · 0

Practical tradeoffs between memory, compute, and performance in learned optimizers
Luke Metz, C. Freeman, James Harrison, Niru Maheswaranathan, Jascha Narain Sohl-Dickstein
22 Mar 2022 · 157 · 32 · 0

Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression
Xiaosu Zhu, Jingkuan Song, Lianli Gao, Fengcai Zheng, Hengtao Shen
21 Mar 2022 · 57 · 64 · 0

Harnessing Hard Mixed Samples with Decoupled Regularizer
Zicheng Liu, Siyuan Li, Ge Wang, Cheng Tan, Lirong Wu, Stan Z. Li
21 Mar 2022 · 141 · 19 · 0

CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation [SSL]
Renhao Wang, Hang Zhao, Yang Gao
17 Mar 2022 · 91 · 1 · 0

Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers [OffRL]
Guodong Zhang, Aleksandar Botev, James Martens
15 Mar 2022 · 83 · 28 · 0

Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park
15 Mar 2022 · 57 · 12 · 0

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [VLM]
Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung
12 Mar 2022 · 104 · 94 · 0

ELLE: Efficient Lifelong Pre-training for Emerging Data
Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
12 Mar 2022 · 125 · 74 · 0

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang, J. E. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, J. Pachocki, Weizhu Chen, Jianfeng Gao
07 Mar 2022 · 150 · 168 · 0

SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models
Liang Wang, Wei Zhao, Zhuoyu Wei, Jingming Liu
04 Mar 2022 · 109 · 187 · 0

LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models
Mojan Javaheripi, Gustavo de Rosa, Subhabrata Mukherjee, S. Shah, Tomasz Religa, C. C. T. Mendes, Sébastien Bubeck, F. Koushanfar, Debadeepta Dey
04 Mar 2022 · 68 · 20 · 0

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours [AI4CE]
Shenggan Cheng, Xuanlei Zhao, Guangyang Lu, Bin-Rui Li, Zhongming Yu, Tian Zheng, R. Wu, Xiwen Zhang, Jian Peng, Yang You
02 Mar 2022 · 86 · 32 · 0

MERIt: Meta-Path Guided Contrastive Learning for Logical Reasoning [LRM]
Fangkai Jiao, Yangyang Guo, Xuemeng Song, Liqiang Nie
01 Mar 2022 · 61 · 37 · 0

DeepTx: Deep Learning Beamforming with Channel Prediction
Janne M. J. Huttunen, D. Korpi, Mikko Honkala
16 Feb 2022 · 85 · 15 · 0

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark [VLM]
Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, ..., Runhu Huang, Wei Zhang, Xingda Jiang, Chunjing Xu, Hang Xu
14 Feb 2022 · 187 · 95 · 0

Online Decision Transformer [OffRL]
Qinqing Zheng, Amy Zhang, Aditya Grover
11 Feb 2022 · 95 · 209 · 0

Optimal Algorithms for Decentralized Stochastic Variational Inequalities
D. Kovalev, Aleksandr Beznosikov, Abdurakhmon Sadiev, Michael Persiianov, Peter Richtárik, Alexander Gasnikov
06 Feb 2022 · 98 · 36 · 0

Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Surinder Kumar
02 Feb 2022 · 94 · 30 · 0

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
Minghao Yan, Nicholas Meisburger, Tharun Medini, Anshumali Shrivastava
29 Jan 2022 · 88 · 6 · 0

ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise
Minjia Zhang, U. Niranjan, Yuxiong He
29 Jan 2022 · 69 · 1 · 0

Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices
Mikołaj Małkiński, Jacek Mańdziuk
28 Jan 2022 · 225 · 43 · 0

Revisiting RCAN: Improved Training for Image Super-Resolution [SupR]
Zudi Lin, Prateek Garg, Atmadeep Banerjee, Salma Abdel Magid, Deqing Sun, Yulun Zhang, Luc Van Gool, D. Wei, Hanspeter Pfister
27 Jan 2022 · 103 · 59 · 0

One Student Knows All Experts Know: From Sparse to Dense [MoMe, MoE]
Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You
26 Jan 2022 · 97 · 21 · 0

AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
Xiaofan Zhang, Zongwei Zhou, Deming Chen, Yu Emma Wang
21 Jan 2022 · 81 · 11 · 0

Near-Optimal Sparse Allreduce for Distributed Deep Learning
Shigang Li, Torsten Hoefler
19 Jan 2022 · 64 · 53 · 0

Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits [FedML]
Sunwoo Lee, Anit Kumar Sahu, Chaoyang He, Salman Avestimehr
11 Jan 2022 · 69 · 19 · 0

Augmenting Convolutional networks with attention-based aggregation [ViT]
Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou
27 Dec 2021 · 119 · 49 · 0