Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates

Abstract

We study the private empirical risk minimization (ERM) problem for losses satisfying the $(\gamma,\kappa)$-Kurdyka-Łojasiewicz (KL) condition. The Polyak-Łojasiewicz (PL) condition is the special case of this condition with $\kappa=2$. Specifically, we study this problem under the constraint of $\rho$ zero-concentrated differential privacy (zCDP). When $\kappa\in[1,2]$ and the loss function is Lipschitz and smooth over a sufficiently large region, we provide a new algorithm based on variance reduced gradient descent that achieves the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ on the excess empirical risk, where $n$ is the dataset size and $d$ is the dimension. We further show that this rate is nearly optimal. When $\kappa \geq 2$ and the loss is instead Lipschitz and weakly convex, we show it is possible to achieve the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ with a private implementation of the proximal point method. When the KL parameters are unknown, we provide a novel modification and analysis of the noisy gradient descent algorithm and show that this algorithm adaptively achieves a rate of $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{\frac{2\kappa}{4-\kappa}}\big)$, which is nearly optimal when $\kappa = 2$. We further show that, without assuming the KL condition, the same gradient descent algorithm can achieve fast convergence to a stationary point when the gradient stays sufficiently large during the run of the algorithm. Specifically, we show that this algorithm can approximate stationary points of Lipschitz, smooth (and possibly nonconvex) objectives with a rate as fast as $\tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$ and never worse than $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$. The latter rate matches the best known rate for methods that do not rely on variance reduction.
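For readers unfamiliar with the noisy gradient descent baseline that the abstract refers to, the following is a minimal sketch of the textbook version under $\rho$-zCDP. It is not the paper's modified algorithm, and it omits variance reduction and the proximal point method entirely. The function names (`gaussian_sigma`, `clip`, `noisy_gradient_descent`), the fixed step size `lr`, and the even split of the privacy budget across steps are all assumptions made for illustration. The sketch relies only on two standard facts: the Gaussian mechanism with $\ell_2$-sensitivity $\Delta$ and noise scale $\sigma$ satisfies $\frac{\Delta^2}{2\sigma^2}$-zCDP, and zCDP parameters add under composition.

```python
import numpy as np


def gaussian_sigma(lipschitz, n, rho, steps):
    """Noise scale so that `steps` noisy mean-gradient releases are rho-zCDP in total.

    Assumes per-example gradients have norm at most `lipschitz`, so replacing one
    record changes the mean gradient by at most 2 * lipschitz / n in L2 norm.
    """
    sensitivity = 2.0 * lipschitz / n
    # Each step gets budget rho / steps; solve sensitivity^2 / (2 sigma^2) = rho / steps.
    return sensitivity * np.sqrt(steps / (2.0 * rho))


def clip(g, bound):
    """Scale g down so that its L2 norm is at most `bound`."""
    norm = np.linalg.norm(g)
    return g * min(1.0, bound / max(norm, 1e-12))


def noisy_gradient_descent(grad_fn, data, w0, lipschitz, rho, steps, lr, seed=0):
    """Plain noisy gradient descent on the empirical risk (illustrative only).

    grad_fn(w, x) is a hypothetical callback returning the gradient of the loss
    at parameters w on a single example x.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    sigma = gaussian_sigma(lipschitz, len(data), rho, steps)
    for _ in range(steps):
        # Clipping enforces the Lipschitz bound used in the sensitivity calculation.
        grads = np.stack([clip(grad_fn(w, x), lipschitz) for x in data])
        noisy_grad = grads.mean(axis=0) + rng.normal(0.0, sigma, size=w.shape)
        w = w - lr * noisy_grad
    return w
```

As a usage example, calling `noisy_gradient_descent` with `grad_fn = lambda w, x: w - x` minimizes the average of $\frac{1}{2}\|w - x\|^2$ over the dataset, so the iterates approach the data mean up to the injected noise. The noise scale grows like $\frac{\sqrt{T}}{n\sqrt{\rho}}$ per coordinate for $T$ steps, which is where the $\frac{\sqrt{d}}{n\sqrt{\rho}}$ factor in the stated rates originates.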
