Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
    ODL
ArXiv (abs) | PDF | HTML | GitHub (1698★)

Papers citing "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"

50 / 611 papers shown
Title
ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding
Novendra Setyawan
Ghufron Wahyu Kurniawan
Chi-Chia Sun
Jun-Wei Hsieh
Hui-Kai Su
W. Kuo
ViTMoE
94
0
0
22 Mar 2024
PETScML: Second-order solvers for training regression problems in Scientific Machine Learning
Stefano Zampini
Umberto Zerbinati
George Turkiyyah
David E. Keyes
67
5
0
18 Mar 2024
VIHE: Virtual In-Hand Eye Transformer for 3D Robotic Manipulation
Weiyao Wang
Yutian Lei
Shiyu Jin
Gregory D. Hager
Liangjun Zhang
92
3
0
18 Mar 2024
ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
Guanxing Lu
Shiyi Zhang
Ziwei Wang
Changliu Liu
Jiwen Lu
Yansong Tang
117
57
0
13 Mar 2024
Intra-video Positive Pairs in Self-Supervised Learning for Ultrasound
Blake Vanberlo
Alexander Wong
Jesse Hoey
R. Arntfield
71
2
0
12 Mar 2024
A Tutorial on the Pretrain-Finetune Paradigm for Natural Language Processing
Yu Wang
Wen Qu
92
0
0
04 Mar 2024
Never-Ending Behavior-Cloning Agent for Robotic Manipulation
Wenqi Liang
Gan Sun
Qian He
Yu Ren
Jiahua Dong
Yang Cong
LM&Ro
90
1
0
01 Mar 2024
Pre-training Differentially Private Models with Limited Public Data
Zhiqi Bu
Xinwei Zhang
Mingyi Hong
Sheng Zha
George Karypis
121
4
0
28 Feb 2024
Stable LM 2 1.6B Technical Report
Marco Bellagente
J. Tow
Dakota Mahan
Duy Phung
Maksym Zhuravinskyi
...
Paulo Rocha
Harry Saini
H. Teufel
Niccolò Zanichelli
Carlos Riquelme
OSLM
109
58
0
27 Feb 2024
Towards Optimal Learning of Language Models
Yuxian Gu
Li Dong
Y. Hao
Qingxiu Dong
Minlie Huang
Furu Wei
106
7
0
27 Feb 2024
Pfeed: Generating near real-time personalized feeds using precomputed embedding similarities
B. Gebre
Karoliina Ranta
S. V. D. Elzen
Ernst Kuiper
Thijs Baars
Tom Heskes
82
1
0
25 Feb 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Ziheng Jiang
Yanghua Peng
Yinmin Zhong
Qi Huang
Yangrui Chen
...
Zhe Li
X. Jia
Jia-jun Ye
Xin Jin
Xin Liu
LRM
126
124
0
23 Feb 2024
Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer
Yanjun Zhao
Sizhe Dang
Haishan Ye
Guang Dai
Yi Qian
Ivor W. Tsang
179
13
0
23 Feb 2024
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
Willi Menapace
Aliaksandr Siarohin
Ivan Skorokhodov
Ekaterina Deyneka
Tsai-Shien Chen
...
Yuwei Fang
A. Stoliar
Elisa Ricci
Jian Ren
Sergey Tulyakov
VGen
136
62
0
22 Feb 2024
Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers
Markus Hiller
Krista A. Ehinger
Tom Drummond
118
4
0
19 Feb 2024
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
Tim Tsz-Kit Lau
Han Liu
Mladen Kolar
ODL
85
6
0
17 Feb 2024
Switch EMA: A Free Lunch for Better Flatness and Sharpness
Siyuan Li
Zicheng Liu
Juanxi Tian
Ge Wang
Zedong Wang
...
Cheng Tan
Tao Lin
Yang Liu
Baigui Sun
Stan Z. Li
66
6
0
14 Feb 2024
Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Daniel Beaglehole
Ioannis Mitliagkas
Atish Agarwala
MLT
99
2
0
07 Feb 2024
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Quan-Sen Sun
Jinsheng Wang
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Xinlong Wang
VLM CLIP MLLM
139
49
0
06 Feb 2024
Breaking MLPerf Training: A Case Study on Optimizing BERT
Yongdeok Kim
Jaehyung Ahn
Myeongwoo Kim
Changin Choi
Heejae Kim
...
Xiongzhan Linghu
Jingkun Ma
Lin Chen
Yuehua Dai
Sungjoo Yoo
65
0
0
04 Feb 2024
ALERT-Transformer: Bridging Asynchronous and Synchronous Machine Learning for Real-Time Event-based Spatio-Temporal Data
Carmen Martin-Turrero
Maxence Bouvier
Manuel Breitenstein
Pietro Zanuttigh
Vincent Parret
83
4
0
02 Feb 2024
Comparative Study of Large Language Model Architectures on Frontier
Shantia Yarahmadian
A. Bose
Guojing Cong
Richard Yamada
Quentin Anthony
ELM
83
7
0
01 Feb 2024
Making Parametric Anomaly Detection on Tabular Data Non-Parametric Again
Hugo Thimonier
Fabrice Popineau
Arpad Rimmel
Bich-Liên Doan
94
2
0
30 Jan 2024
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo
Furong Xu
Hanxiao Zhang
Wang Ren
Ziping Ma
Lin Ju
Jian Wang
Jingdong Chen
Ming Yang
VLM MLLM
68
3
0
29 Jan 2024
TraKDis: A Transformer-based Knowledge Distillation Approach for Visual Reinforcement Learning with Application to Cloth Manipulation
Wei Chen
Nicolás Rojas
98
7
0
24 Jan 2024
MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
Kaan Ozkara
Can Karakus
Parameswaran Raman
Mingyi Hong
Shoham Sabach
Branislav Kveton
Volkan Cevher
101
4
0
17 Jan 2024
GD doesn't make the cut: Three ways that non-differentiability affects neural network training
Siddharth Krishna Kumar
AAML
81
3
0
16 Jan 2024
(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum
Anh Dang
Reza Babanezhad
Sharan Vaswani
67
0
0
12 Jan 2024
FourCastNeXt: Optimizing FourCastNet Training for Limited Compute
Edison Guo
Maruf Ahmed
Yue Sun
Rui Yang
Harrison Cook
Tennessee Leeuwenburg
Ben Evans
45
1
0
10 Jan 2024
Robust Calibration For Improved Weather Prediction Under Distributional Shift
Sankalp Gilda
Neel Bhandari
Wendy Mak
Andrea Panizza
UQCV OOD
38
1
0
08 Jan 2024
Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization
Min-Kook Suh
Seung-Woo Seo
ODL
74
0
0
06 Jan 2024
Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices
A. Menon
Unnikrishnan Menon
Kailash Ahirwar
66
1
0
03 Jan 2024
Noise-free Optimization in Early Training Steps for Image Super-Resolution
MinKyu Lee
Jae-Pil Heo
69
5
0
29 Dec 2023
Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation
Zixian Guo
Yuxiang Wei
Ming-Yu Liu
Zhilong Ji
Jinfeng Bai
Yiwen Guo
Wangmeng Zuo
VLM
104
9
0
26 Dec 2023
Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Boyao Wang
Yuxing Liu
Xiaoyu Wang
Tong Zhang
49
5
0
22 Dec 2023
Critic-Guided Decision Transformer for Offline Reinforcement Learning
Yuanfu Wang
Chao Yang
Yinghong Wen
Yu Liu
Yu Qiao
OffRL
104
12
0
21 Dec 2023
XLand-MiniGrid: Scalable Meta-Reinforcement Learning Environments in JAX
Alexander Nikulin
Vladislav Kurenkov
Ilya Zisman
Artem Agarkov
Viacheslav Sinii
Sergey Kolesnikov
125
30
0
19 Dec 2023
Sentiment analysis in Tourism: Fine-tuning BERT or sentence embeddings concatenation?
Ibrahim Bouabdallaoui
Fatima Guerouate
Samya Bouhaddour
C. Saadi
Mohammed Sbihi
57
0
0
12 Dec 2023
RankMatch: A Novel Approach to Semi-Supervised Label Distribution Learning Leveraging Inter-label Correlations
Kouzhiqiang Yucheng Xie
Jing Wang
Yuheng Jia
Boyu Shi
Xin Geng
61
1
0
11 Dec 2023
Analyzing and Improving the Training Dynamics of Diffusion Models
Tero Karras
M. Aittala
J. Lehtinen
Janne Hellsten
Timo Aila
S. Laine
153
204
0
05 Dec 2023
Industrial Internet of Things Intelligence Empowering Smart Manufacturing: A Literature Review
Yujiao Hu
Qingmin Jia
Yuao Yao
Yong Lee
Mengjie Lee
Chenyi Wang
Xiaomao Zhou
Renchao Xie
Feng Yu
76
44
0
02 Dec 2023
Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training
Yefan Zhou
Tianyu Pang
Keqin Liu
Charles H. Martin
Michael W. Mahoney
Yaoqing Yang
148
12
0
01 Dec 2023
Generalisable Agents for Neural Network Optimisation
Kale-ab Tessera
C. Tilbury
Sasha Abramowitz
Ruan de Kock
Omayma Mahjoub
Benjamin Rosman
Sara Hooker
Arnu Pretorius
AI4CE
74
0
0
30 Nov 2023
RETSim: Resilient and Efficient Text Similarity
Marina Zhang
Owen Vallis
Aysegul Bumin
Tanay Vakharia
Elie Bursztein
136
1
0
28 Nov 2023
Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions
Xinhong Chen
Zongxi Li
Yaowei Wang
Haoran Xie
Jianping Wang
Qing Li
49
0
0
28 Nov 2023
Model-aware 3D Eye Gaze from Weak and Few-shot Supervisions
Nikola Popovic
Dimitrios Christodoulou
D. Paudel
Xi Wang
Luc Van Gool
94
0
0
20 Nov 2023
Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling
Naoki Sato
Hideaki Iiduka
86
3
0
15 Nov 2023
ViR: Towards Efficient Vision Retention Backbones
Ali Hatamizadeh
Michael Ranzinger
Shiyi Lan
Jose M. Alvarez
Sanja Fidler
Jan Kautz
GNN
40
2
0
30 Oct 2023
Deep-learning-based decomposition of overlapping-sparse images: application at the vertex of neutrino interactions
Saúl Alonso-Monsalve
D. Sgalaberna
Xingyu Zhao
Adrien Molines
C. Mcgrew
A. Rubbia
70
0
0
30 Oct 2023
On the accuracy and efficiency of group-wise clipping in differentially private optimization
Zhiqi Bu
Ruixuan Liu
Yu Wang
Sheng Zha
George Karypis
VLM
78
4
0
30 Oct 2023