An Efficient k-modes Algorithm for Clustering Categorical Datasets

Abstract

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but it does not apply to categorical-valued observations. The k-modes algorithm addresses this lacuna by replacing the Euclidean distance with the Hamming distance and the means with the modes in the k-means objective function. We provide a novel, computationally efficient implementation of k-modes, called OTQT. We prove that OTQT finds updates, undetectable to existing k-modes algorithms, that improve the objective function. Thus, although slightly slower per iteration owing to its algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. As a result, we recommend OTQT as the preferred, default algorithm for all k-modes implementations. We also examine five initialization methods and three types of K-selection methods, many of them novel or novel applications to k-modes. By examining performance on real and simulated datasets, we show that simple random initialization is the best initializer and that a novel K-selection method is more accurate than methods adapted from k-means.
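
To make the k-modes objective concrete, the following is a minimal sketch of a baseline, Lloyd-style k-modes iteration: assign each observation to the nearest mode under the Hamming distance, then recompute each cluster's coordinate-wise mode. It is not OTQT, whose update rule is not described in the abstract. The function names (`hamming`, `column_modes`, `k_modes`), the tuple-based data layout, and the tie-breaking behavior are all assumptions made for illustration; the random initialization simply draws K observations as starting modes.

```python
import random
from collections import Counter


def hamming(a, b):
    """Count the coordinates at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent category in each coordinate of `rows`."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    """Baseline alternating k-modes (illustrative, not OTQT).

    data: list of equal-length tuples of categorical values.
    Returns (labels, modes, cost), where cost is the total Hamming
    distance from each observation to its assigned mode.
    Assumes max_iter >= 1 and k <= len(data).
    """
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # random initialization from the observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under the Hamming distance.
        new_labels = [
            min(range(k), key=lambda j: hamming(x, modes[j])) for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes would not change
        labels = new_labels
        # Update step: coordinate-wise mode of each non-empty cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost


# Hypothetical toy dataset with three categorical coordinates.
toy = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
labels, modes, cost = k_modes(toy, k=2)
print(labels, modes, cost)
```

Like k-means, this alternating scheme only reaches a local optimum that depends on the starting modes, which is why the choice of initializer and the possibility of additional improving updates matter.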
