Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge

Frequent node and link changes in edge AI clusters disrupt distributed training, while traditional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and ill-suited to the chaotic, self-governed edge. This paper proposes Chaos, a resilient and scalable edge distributed training system with built-in self-healing and autoscaling. It speeds up scale-out through multi-neighbor replication with fast shard scheduling, allowing a new node to pull the latest training state from nearby neighbors in parallel while balancing the traffic load among them. It also uses a cluster monitor to track resource and topology changes to inform scheduler decisions, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling without a central administrator. Extensive experiments show that Chaos consistently achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 1 millisecond, enabling smoother handling of node joins, exits, and failures. It also delivers the lowest idle time, demonstrating superior resource efficiency and scalability as the cluster grows.
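
As a rough illustration of the multi-neighbor replication idea described above, the sketch below greedily assigns state shards to neighbors so that estimated transfer times stay balanced, then pulls all shards in parallel. This is only a plausible reading of the abstract, not the paper's implementation; the names Neighbor, schedule_shards, pull_state, and the fetch RPC placeholder are all hypothetical.

# Illustrative sketch only: a greedy, load-balanced shard scheduler and a
# parallel puller, loosely modeled on the abstract's description of Chaos.
# None of these names come from the paper itself.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Neighbor:
    node_id: str
    bandwidth_mbps: float              # estimated link bandwidth to this neighbor
    assigned: list = field(default_factory=list)

def schedule_shards(shard_sizes_mb: list[float], neighbors: list[Neighbor]) -> None:
    """Greedily assign each shard to the neighbor whose queue, including this
    shard, would finish soonest (LPT-style makespan balancing)."""
    for shard_id, size in sorted(enumerate(shard_sizes_mb),
                                 key=lambda x: -x[1]):   # largest shards first
        best = min(neighbors,
                   key=lambda n: (sum(shard_sizes_mb[s] for s in n.assigned) + size)
                                 / n.bandwidth_mbps)
        best.assigned.append(shard_id)

def pull_state(neighbors: list[Neighbor], fetch) -> dict:
    """Pull assigned shards from all neighbors in parallel and merge them into
    one state dict. fetch(node_id, shard_id) is a placeholder RPC that returns
    a (shard_id, tensor) pair."""
    state = {}
    with ThreadPoolExecutor(max_workers=len(neighbors)) as pool:
        futures = [pool.submit(fetch, n.node_id, s)
                   for n in neighbors for s in n.assigned]
        for f in futures:
            shard_id, tensor = f.result()
            state[shard_id] = tensor
    return state

The greedy longest-processing-time assignment is one standard way to approximate the "balance the traffic load among neighbors" goal; the paper's actual shard scheduler may differ.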
@article{feng2025_2505.12815,
  title={Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge},
  author={Wenjiao Feng and Rongxing Xiao and Zonghang Li and Hongfang Yu and Gang Sun and Long Luo and Mohsen Guizani and Qirong Ho},
  journal={arXiv preprint arXiv:2505.12815},
  year={2025}
}