1904.00962
Cited By
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
1 April 2019
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
ODL
Github (1698★)
Papers citing
"Large Batch Optimization for Deep Learning: Training BERT in 76 minutes"
50 / 611 papers shown
Title
An Adaptive Method Stabilizing Activations for Enhanced Generalization
Hyunseok Seung
Jaewoo Lee
Hyunsuk Ko
ODL
37
0
0
10 Jun 2025
Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection
Ruiying Lu
Jinhan Liu
Chuan Du
D. Guo
OOD
AAML
68
0
0
03 Jun 2025
Taming LLMs by Scaling Learning Rates with Gradient Grouping
Siyuan Li
Juanxi Tian
Zedong Wang
Xin Jin
Zicheng Liu
Wentao Zhang
Dan Xu
52
0
0
01 Jun 2025
SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
Yehonathan Refael
Guy Smorodinsky
Tom Tirer
Ofir Lindenbaum
48
0
0
30 May 2025
On the Convergence Analysis of Muon
Wei Shen
Ruichuan Huang
Minhui Huang
Cong Shen
Jiawei Zhang
64
0
0
29 May 2025
DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models
Alex Iacob
Lorenzo Sani
M. Safaryan
Paris Giampouras
Samuel Horváth
...
Meghdad Kurmanji
Preslav Aleksandrov
William F. Shen
Xinchi Qiu
Nicholas D. Lane
OffRL
112
0
0
28 May 2025
Deep Learning-Based Forecasting of Boarding Patient Counts to Address ED Overcrowding
Orhun Vural
Bunyamin Ozaydin
Khalid Y. Aram
James Booth
Brittany F. Lindsey
58
0
0
20 May 2025
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma
Nolan Dey
Gurpreet Gosal
Gavia Gray
Daria Soboleva
Joel Hestness
80
2
0
19 May 2025
A Physics-Inspired Optimizer: Velocity Regularized Adam
Pranav Vaidhyanathan
Lucas Schorling
Natalia Ares
Michael A. Osborne
ODL
81
0
0
19 May 2025
On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm
Huan Li
Yiming Dong
Zhouchen Lin
79
0
0
17 May 2025
Pretraining Large Brain Language Model for Active BCI: Silent Speech
Jinzhao Zhou
Zehong Cao
Yiqun Duan
Connor Barkley
Daniel Leong
...
Ziyi Zhao
T. Do
Yu-Cheng Chang
Sheng-Fu Liang
Chin-Teng Lin
112
1
0
29 Apr 2025
Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching
Junn Yong Loo
Michelle Adeline
Julia Kaiwen Lau
Fang Yu Leong
Hwa Hui Tew
Arghya Pal
Vishnu Monn Baskaran
Chee-Ming Ting
Raphaël C.-W. Phan
BDL
109
0
0
22 Apr 2025
AlphaGrad: Non-Linear Gradient Normalization Optimizer
Soham Sane
ODL
151
0
0
22 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
331
9
0
17 Apr 2025
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Mingyu Liang
Hiwot Tadese Kassa
Wenyin Fu
Brian Coutinho
Louis Feng
Christina Delimitrou
40
0
0
12 Apr 2025
Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware
Ching-Yi Lin
Sahil Shah
MQ
140
0
0
11 Apr 2025
Neural Encoding and Decoding at Scale
Yizi Zhang
Yanchen Wang
Mehdi Azabou
Alexandre Andre
Zixuan Wang
Hanrui Lyu
International Brain Laboratory
Eva L. Dyer
Liam Paninski
Cole Hurwitz
AI4CE
170
1
0
11 Apr 2025
The Efficacy of Semantics-Preserving Transformations in Self-Supervised Learning for Medical Ultrasound
Blake Vanberlo
Alexander Wong
Jesse Hoey
R. Arntfield
90
0
0
10 Apr 2025
Spectral-Adaptive Modulation Networks for Visual Perception
Guhnoo Yun
J. Yoo
Kijung Kim
Jeongho Lee
Paul Hongsuck Seo
Dong Hwan Kim
129
0
0
31 Mar 2025
ASGO: Adaptive Structured Gradient Optimization
Kang An
Yuxing Liu
Boyao Wang
Shiqian Ma
Tong Zhang
ODL
157
5
0
26 Mar 2025
Show and Segment: Universal Medical Image Segmentation via In-Context Learning
Yunhe Gao
Di Liu
Zhuowei Li
You Li
DongDong Chen
Mu Zhou
Dimitris N. Metaxas
VLM
88
0
0
25 Mar 2025
Structured Preconditioners in Adaptive Optimization: A Unified Analysis
Shuo Xie
Tianhao Wang
Sashank J. Reddi
Sanjiv Kumar
Zhiyuan Li
87
4
0
13 Mar 2025
ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Tobias Christian Nauen
Brian B. Moser
Federico Raue
Stanislav Frolov
Andreas Dengel
ViT
187
0
0
12 Mar 2025
Variational Bayesian Pseudo-Coreset
Hyungi Lee
Seanie Lee
Juho Lee
BDL
73
0
0
28 Feb 2025
Self-Adjust Softmax
Chuanyang Zheng
Yihang Gao
Guoxuan Chen
Han Shi
Jing Xiong
Xiaozhe Ren
Chao Huang
Xin Jiang
Zhiyu Li
Yu Li
90
1
0
25 Feb 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Shane Bergsma
Nolan Dey
Gurpreet Gosal
Gavia Gray
Daria Soboleva
Joel Hestness
109
8
0
21 Feb 2025
Preconditioned Inexact Stochastic ADMM for Deep Model
Shenglong Zhou
Ouya Wang
Ziyan Luo
Yongxu Zhu
Geoffrey Ye Li
97
0
0
15 Feb 2025
Gradient Multi-Normalization for Stateless and Scalable LLM Training
M. Scetbon
Chao Ma
Wenbo Gong
Edward Meeds
196
1
0
10 Feb 2025
Model Diffusion for Certifiable Few-shot Transfer Learning
Fady Rezk
Royson Lee
Henry Gouk
Timothy M. Hospedales
Minyoung Kim
150
0
0
10 Feb 2025
Importance Sampling via Score-based Generative Models
Heasung Kim
Taekyun Lee
Hyeji Kim
Gustavo de Veciana
MedIm
DiffM
210
0
0
07 Feb 2025
Celo: Training Versatile Learned Optimizers on a Compute Diet
A. Moudgil
Boris Knyazev
Guillaume Lajoie
Eugene Belilovsky
454
0
0
22 Jan 2025
A Hessian-informed hyperparameter optimization for differential learning rate
Shiyun Xu
Zhiqi Bu
Yiliang Zhang
Ian Barnett
131
1
0
12 Jan 2025
AdaPRL: Adaptive Pairwise Regression Learning with Uncertainty Estimation for Universal Regression Tasks
Fuhang Liang
Rucong Xu
Deng Lin
OOD
95
0
0
10 Jan 2025
Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models
Bahman Torkamandi
AI4CE
101
0
0
08 Jan 2025
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau
Weijian Li
Chenwei Xu
Han Liu
Mladen Kolar
473
0
0
30 Dec 2024
Self-supervised Spatial-Temporal Learner for Precipitation Nowcasting
Haotian Li
A. Siebes
S. Mehrkanoon
SSL
92
0
0
20 Dec 2024
A stochastic first-order method with multi-extrapolated momentum for highly smooth unconstrained optimization
Chuan He
143
0
0
19 Dec 2024
Mojito: Motion Trajectory and Intensity Control for Video Generation
Xuehai He
Shuohang Wang
Jianwei Yang
Xiaoxia Wu
Yansen Wang
Kuan-Chieh Wang
Z. Zhan
Olatunji Ruwase
Yelong Shen
Xinze Wang
VGen
242
2
0
12 Dec 2024
AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation
Guanxing Lu
Tengbo Yu
Haoyuan Deng
Season Si Chen
Yansong Tang
Ziwei Wang
171
3
0
09 Dec 2024
Convolutional Neural Networks Do Work with Pre-Defined Filters
C. Linse
Erhardt Barth
T. Martinetz
149
5
0
27 Nov 2024
Improving OOD Generalization of Pre-trained Encoders via Aligned Embedding-Space Ensembles
Shuman Peng
Arash Khoeini
Sharan Vaswani
Martin Ester
159
0
0
20 Nov 2024
Adaptive Consensus Gradients Aggregation for Scaled Distributed Training
Yoni Choukroun
Shlomi Azoulay
P. Kisilev
93
0
0
06 Nov 2024
LASER: Attention with Exponential Transformation
Sai Surya Duvvuri
Inderjit Dhillon
55
1
0
05 Nov 2024
Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
Junjiao Tian
Chengyue Huang
Z. Kira
65
2
0
03 Nov 2024
Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers
Kai Yan
Alex Schwing
Yu-Xiong Wang
OffRL
OnRL
83
0
0
31 Oct 2024
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Atli Kosson
Bettina Messmer
Martin Jaggi
AI4CE
73
5
0
31 Oct 2024
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization
Jui-Nan Yen
Si Si
Zhao Meng
Felix X. Yu
Sai Surya Duvvuri
Inderjit Dhillon
Cho-Jui Hsieh
Sanjiv Kumar
90
5
0
27 Oct 2024
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
Avinash Maurya
Jie Ye
M. Rafique
Franck Cappello
Bogdan Nicolae
78
1
0
26 Oct 2024
Leaky ReLUs That Differ in Forward and Backward Pass Facilitate Activation Maximization in Deep Neural Networks
C. Linse
Erhardt Barth
Thomas Martinetz
70
1
0
22 Oct 2024
Rethinking generalization of classifiers in separable classes scenarios and over-parameterized regimes
Julius Martinetz
C. Linse
Thomas Martinetz
91
0
0
22 Oct 2024