AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes

arXiv:2404.09679 · 15 April 2024
Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou

Papers citing "AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes"

10 / 10 papers shown
An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training
Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
42 · 5 · 0 · 19 Dec 2023
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
S. Tyagi, Prateek Sharma
54 · 22 · 0 · 20 May 2023
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, G. Ganger, Eric Xing
VLM · 45 · 180 · 0 · 27 Aug 2020
A Survey on Distributed Machine Learning
Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer
OOD · 81 · 703 · 0 · 20 Dec 2019
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
ODL · 319 · 42,038 · 0 · 03 Dec 2019
Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD
Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, P. Nagpurkar
53 · 196 · 0 · 03 Mar 2018
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Chengyue Wu, Song Han, Huizi Mao, Yu Wang, W. Dally
114 · 1,399 · 0 · 05 Dec 2017
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, M. Andreetto, Hartwig Adam
3DH · 1.1K · 20,747 · 0 · 17 Apr 2017
TensorFlow: A system for large-scale machine learning
Martín Abadi, P. Barham, Jianmin Chen, Zhiwen Chen, Andy Davis, ..., Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, Xiaoqiang Zhang
GNN, AI4CE · 370 · 18,300 · 0 · 27 May 2016
Revisiting Distributed Synchronous SGD
Jianmin Chen, Xinghao Pan, R. Monga, Samy Bengio, Rafal Jozefowicz
66 · 799 · 0 · 04 Apr 2016