Federated Instruction Tuning of LLMs with Domain Coverage Augmentation

30 September 2024

Zezhou Wang

Yaxin Du

Zhuzhong Qian

Siheng Chen

FedML

ArXiv (abs)PDF HTML Github

Main:7 Pages

10 Figures

Bibliography:3 Pages

8 Tables

Appendix:11 Pages

Abstract

Federated Domain-specific Instruction Tuning (FedDIT) utilizes limited cross-client private data alongside server-side public data for instruction augmentation, ultimately enhancing model performance within specific domains. While the factors affecting FedDIT remain unclear and existing instruction augmentation methods mainly focus on the centralized setting without considering the distributed environment. Our experiments reveal that the cross-client domain coverage, rather than data heterogeneity, drives model performance in FedDIT. In response, we propose FedDCA, which optimizes domain coverage through greedy client center selection and retrieval-based augmentation. To alleviate client-side computational burdens, FedDCA $^*$ uses heterogeneous encoders with server-side feature alignment. Extensive experiments across four distinct domains (code, medical, financial, and mathematical) substantiate the effectiveness of both methods. Additionally, we investigate privacy preservation against memory extraction attacks utilizing varying amounts of public data. Results show no significant correlation between the volume of public data and the privacy-preserving capability. However, as the fine-tuning round increases, the risk of privacy leakage reduces or converges.

View on arXiv

Comments on this paper