
IDKM: Memory Efficient Neural Network Quantization via Implicit, Differentiable k-Means

Abstract

Compressing large neural networks with minimal performance loss is crucial to enabling their deployment on edge devices. Cho et al. (2022) proposed a weight quantization method that uses an attention-based clustering algorithm called differentiable $k$-means (DKM). Despite achieving state-of-the-art results, DKM's performance is constrained by its heavy memory dependency. We propose an implicit, differentiable $k$-means algorithm (IDKM), which eliminates the major memory restriction of DKM. Let $t$ be the number of $k$-means iterations, $m$ be the number of weight vectors, and $b$ be the number of bits per cluster address. IDKM reduces the overall memory complexity of a single $k$-means layer from $\mathcal{O}(t \cdot m \cdot 2^b)$ to $\mathcal{O}(m \cdot 2^b)$. We also introduce a variant, IDKM with Jacobian-Free Backpropagation (IDKM-JFB), for which the time complexity of the gradient calculation is independent of $t$ as well. We provide a proof of concept of our methods by showing that, under the same settings, IDKM achieves comparable performance to DKM with less compute time and less memory. We also use IDKM and IDKM-JFB to quantize a large neural network, ResNet-18, on hardware where DKM cannot train at all.
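To make the memory distinction concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) contrasting an unrolled, DKM-style soft k-means loop, whose autograd graph grows with the number of iterations t, with a Jacobian-free variant in the spirit of IDKM-JFB that backpropagates only through the final iteration. The function names, the temperature tau, and the toy shapes are illustrative assumptions.

```python
# Hypothetical sketch: unrolled soft k-means vs. a Jacobian-free variant.
import torch

def soft_kmeans_step(W, C, tau=1e-2):
    """One attention-style soft-assignment / centroid update.
    W: (m, d) weight vectors, C: (2**b, d) centroids."""
    d2 = torch.cdist(W, C) ** 2                 # (m, 2**b) squared distances
    A = torch.softmax(-d2 / tau, dim=1)         # soft cluster assignments
    C_new = (A.t() @ W) / (A.sum(dim=0, keepdim=True).t() + 1e-8)
    return C_new, A

def dkm_like(W, C, t=10):
    # Unrolled loop: autograd keeps every iteration's (m, 2**b) assignment
    # matrix, giving O(t * m * 2**b) memory for the backward pass.
    for _ in range(t):
        C, A = soft_kmeans_step(W, C)
    return A @ C                                # soft-quantized weights

def idkm_jfb_like(W, C, t=10):
    # Iterate toward the fixed point without tracking gradients, then take a
    # single differentiable step there, so memory stays O(m * 2**b).
    with torch.no_grad():
        for _ in range(t - 1):
            C, _ = soft_kmeans_step(W, C)
    C, A = soft_kmeans_step(W, C)               # only this step is in the graph
    return A @ C

# Toy usage: m=1024 weight vectors of dimension 4, 2**b = 16 clusters (b=4).
W = torch.randn(1024, 4, requires_grad=True)
C0 = W.detach()[torch.randperm(1024)[:16]]
idkm_jfb_like(W, C0).sum().backward()           # gradient flows through one step only
```

In this sketch the Jacobian-free variant approximates the fixed-point gradient by treating the converged iteration as if it were a single layer, which is why its gradient cost is independent of the number of k-means iterations.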
