SemiFL: Communication Efficient Semi-Supervised Federated Learning with
Unlabeled Clients
Federated Learning (FL) enables training machine learning models using the computation and private data of many distributed clients, such as smartphones and IoT devices. Most existing work on FL assumes that clients hold ground-truth labels. In many practical scenarios, however, clients lack task-specific labels, e.g., due to a lack of expertise. This work considers a server that hosts a labeled dataset and wishes to leverage clients with unlabeled data for supervised learning. We propose a new FL framework, referred to as SemiFL, to address Semi-Supervised Federated Learning (SSFL). In SemiFL, clients hold completely unlabeled data, while the server has a small amount of labeled data. SemiFL is communication efficient because clients train on their unlabeled data for multiple local epochs between communication rounds. We demonstrate several strategies in SemiFL that enhance efficiency and prediction, and develop intuition for why they work. In particular, we provide a theoretical analysis of the use of strong data augmentation for semi-supervised learning, which may be of independent interest. Extensive empirical evaluations demonstrate that SemiFL significantly improves the performance of a labeled server with unlabeled clients. Moreover, SemiFL outperforms many existing SSFL methods and performs competitively with state-of-the-art FL and centralized SSL results. For instance, in a standard communication-efficient setting, SemiFL achieves 93% accuracy on CIFAR10 with only 4000 labeled samples at the server. This accuracy is within 2% of the result obtained by training on all 50000 labeled samples, and improves upon existing SSFL methods by 30%.
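The strong-data-augmentation strategy analyzed in the abstract is, in spirit, the confidence-thresholded pseudo-labeling used in consistency-based SSL methods such as FixMatch: predictions on a weakly augmented view of an unlabeled sample become hard pseudo-labels, and the model is trained to reproduce them on a strongly augmented view. A minimal sketch of that loss (function name, list-based logits, and the 0.95 threshold are illustrative assumptions, not details from the paper):

```python
import math

def pseudo_label_loss(weak_logits, strong_logits, threshold=0.95):
    """Confidence-thresholded pseudo-label loss on an unlabeled batch.

    weak_logits / strong_logits: per-sample class logits (lists of floats)
    for weakly and strongly augmented views of the same unlabeled inputs.
    Hypothetical sketch; not the paper's implementation.
    """
    def softmax(row):
        m = max(row)  # subtract max for numerical stability
        exps = [math.exp(z - m) for z in row]
        total = sum(exps)
        return [e / total for e in exps]

    losses = []
    for weak, strong in zip(weak_logits, strong_logits):
        probs = softmax(weak)
        confidence = max(probs)
        if confidence < threshold:
            continue  # skip low-confidence samples entirely
        pseudo = probs.index(confidence)  # hard pseudo-label from weak view
        # cross-entropy between the pseudo-label and the strong-view prediction
        losses.append(-math.log(softmax(strong)[pseudo]))
    return sum(losses) / len(losses) if losses else 0.0
```

Only samples whose weak-view prediction clears the confidence threshold contribute to the loss, which is what makes the pseudo-labels reliable enough to train on.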