Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule
Nikhil Iyer, V. Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu
arXiv:2003.03977 (9 March 2020)
Papers citing "Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule" (10 of 10 papers shown):

Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation
Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo (23 Apr 2024)

Large Learning Rates Improve Generalization: But How Large Are We Talking About?
E. Lobacheva, Eduard Pockonechnyy, M. Kodryan, Dmitry Vetrov (19 Nov 2023)

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner (12 Jul 2023)

Relaxed Attention for Transformer Models
Timo Lohrenz, Björn Möller, Zhengyang Li, Tim Fingscheidt (20 Sep 2022)

Distance Learner: Incorporating Manifold Prior to Model Training
Aditya Chetan, Nipun Kwatra (14 Jul 2022)

Efficient Multi-Purpose Cross-Attention Based Image Alignment Block for Edge Devices
Bahri Batuhan Bilecen, Alparslan Fisne, Mustafa Ayazoglu (01 Jun 2022)

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Zhiyuan Li, Tianhao Wang, Sanjeev Arora (13 Oct 2021)

Ranger21: a synergistic deep learning optimizer
Less Wright, Nestor Demeure (25 Jun 2021)

A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay
L. Smith (26 Mar 2018)

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang (15 Sep 2016)