
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Papers citing "Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers"
Title |
---|
Mistral 7B Albert Q. Jiang Alexandre Sablayrolles A. Mensch Chris Bamford Devendra Singh Chaplot ...Teven Le Scao Thibaut Lavril Thomas Wang Timothée Lacroix William El Sayed |