Adam-mini: Use Fewer Learning Rates To Gain More
24 June 2024
Yushun Zhang
Congliang Chen
Ziniu Li
Tian Ding
Chenwei Wu
Yinyu Ye
Zhi-Quan Luo
Ruoyu Sun
Papers citing "Adam-mini: Use Fewer Learning Rates To Gain More"

34 papers shown
ShiQ: Bringing back Bellman to LLMs
Pierre Clavier
Nathan Grinsztajn
Raphaël Avalos
Yannis Flet-Berliac
Irem Ergun
...
Eugene Tarassov
Olivier Pietquin
Pierre Harvey Richemond
Florian Strub
Matthieu Geist
OffRL
12
0
0
16 May 2025
Gaussian Weight Sampling for Scalable, Efficient and Stable Pseudo-Quantization Training
Myeonghwan Ahn
Sungjoo Yoo
MQ
17
0
0
16 May 2025
Towards Quantifying the Hessian Structure of Neural Networks
Zhaorui Dong
Yushun Zhang
Zhi-Quan Luo
Jianfeng Yao
Ruoyu Sun
31
0
0
05 May 2025
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics
Cong Xu
Wenbin Liang
Mo Yu
Anan Liu
Kaipeng Zhang
Lizhuang Ma
Yufei Guo
Jun Wang
Wenbo Zhang
MQ
57
0
0
01 May 2025
ASGO: Adaptive Structured Gradient Optimization
Kang An
Yuxing Liu
Rui Pan
Shiqian Ma
D. Goldfarb
Tong Zhang
ODL
97
2
0
26 Mar 2025
When Can You Get Away with Low Memory Adam?
Dayal Singh Kalra
John Kirchenbauer
M. Barkeshli
Tom Goldstein
74
0
0
03 Mar 2025
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
Jinbo Wang
Mingze Wang
Zhanpeng Zhou
Junchi Yan
Weinan E
Lei Wu
89
1
0
26 Feb 2025
COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Liming Liu
Zhenghao Xu
Zixuan Zhang
Hao Kang
Zichong Li
Chen Liang
Weizhu Chen
T. Zhao
146
1
0
24 Feb 2025
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang
Haotian Hu
Zhenyu (Allen) Zhang
Gaojie Jin
Xianrui Li
...
Tianlong Chen
Lu Liu
Qingsong Wen
Zhangyang Wang
Shiwei Liu
MQ
39
0
0
24 Feb 2025
Gradient Multi-Normalization for Stateless and Scalable LLM Training
M. Scetbon
Chao Ma
Wenbo Gong
Edward Meeds
99
1
0
10 Feb 2025
SubTrack your Grad: Gradient Subspace Tracking for Memory and Time Efficient Full-Parameter LLM Training
Sahar Rajabi
Nayeema Nonta
Sirisha Rambhatla
90
0
0
03 Feb 2025
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Akiyoshi Tomihari
Issei Sato
ODL
61
1
0
31 Jan 2025
360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation
Hamed Firooz
Maziar Sanjabi
Adrian Englhardt
Aman Gupta
Ben Levine
...
Xiaoling Zhai
Ya Xu
Yu Wang
Yun Dai
ALM
47
3
0
27 Jan 2025
FOCUS: First Order Concentrated Updating Scheme
Yizhou Liu
Ziming Liu
Jeff Gore
ODL
108
1
0
21 Jan 2025
A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
Kaiyuan Tian
Linbo Qiao
Baihui Liu
Gongqingjian Jiang
Dongsheng Li
36
0
0
21 Jan 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Tianjin Huang
Ziquan Zhu
Gaojie Jin
Lu Liu
Zhangyang Wang
Shiwei Liu
44
1
0
12 Jan 2025
GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection
Xutao Liao
Shaohui Li
Yuhui Xu
Zhi Li
Y. Liu
You He
VLM
59
3
0
31 Dec 2024
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li
Lu Yin
Shiwei Liu
70
4
0
18 Dec 2024
No More Adam: Learning Rate Scaling at Initialization is All You Need
Minghao Xu
Lichuan Xiang
Xu Cai
Hongkai Wen
80
2
0
16 Dec 2024
FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training
Philip Zmushko
Aleksandr Beznosikov
Martin Takáč
Samuel Horváth
44
0
0
12 Nov 2024
Gradient Methods with Online Scaling
Wenzhi Gao
Ya-Chi Chu
Yinyu Ye
Madeleine Udell
38
1
0
04 Nov 2024
Understanding Adam Requires Better Rotation Dependent Assumptions
Lucas Maes
Tianyue H. Zhang
Alexia Jolicoeur-Martineau
Ioannis Mitliagkas
Damien Scieur
Simon Lacoste-Julien
Charles Guille-Escuret
38
3
0
25 Oct 2024
LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
Thomas Robert
M. Safaryan
Ionut-Vlad Modoranu
Dan Alistarh
ODL
36
2
0
21 Oct 2024
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec
Felix Dangel
Sidak Pal Singh
38
7
0
14 Oct 2024
HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation
Xinyu Zhou
Simin Fan
Martin Jaggi
TDI
31
0
0
07 Oct 2024
Simulating Dynamic Tumor Contrast Enhancement in Breast MRI using Conditional Generative Adversarial Networks
Richard Osuala
Smriti Joshi
Apostolia Tsirikoglou
Lidia Garrucho
Walter H. L. Pinaya
Daniel M. Lang
J. Schnabel
Oliver Díaz
Karim Lekadir
MedIm
38
0
0
27 Sep 2024
MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning
Yupeng Chen
Senmiao Wang
Zhihang Lin
Yushun Zhang
Tian Ding
Ruoyu Sun
CLL
80
1
0
30 Jul 2024
Parameter-Efficient Fine-Tuning via Circular Convolution
Aochuan Chen
Jiashun Cheng
Zijing Liu
Ziqi Gao
Fugee Tsung
Yu Li
Jia Li
56
2
0
27 Jul 2024
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao
Depen Morwani
David Brandfonbrener
Nikhil Vyas
Sham Kakade
50
17
0
10 Jul 2024
Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise
Kwangjun Ahn
Zhiyu Zhang
Yunbum Kook
Yan Dai
45
11
0
02 Feb 2024
Neglected Hessian component explains mysteries in Sharpness regularization
Yann N. Dauphin
Atish Agarwala
Hossein Mobahi
FAtt
46
7
0
19 Jan 2024
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner
Jacques Chen
J. Lavington
Mark W. Schmidt
40
67
0
27 Apr 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
339
12,003
0
04 Mar 2022
Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu
Jeremy Bernstein
M. Meister
Yisong Yue
ODL
43
26
0
14 Feb 2021