Improving Layer-wise Adaptive Rate Methods using Trust Ratio Clipping

27 November 2020

Papers citing "Improving Layer-wise Adaptive Rate Methods using Trust Ratio Clipping"


The Disharmony between BN and ReLU Causes Gradient Explosion, but is Offset by the Correlation between Activations
Inyoung Paik, Jaesik Choi
26 | 0 | 0 | 23 Apr 2023

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE | 245 | 1,836 | 0 | 17 Sep 2019