An Efficient k-modes Algorithm for Clustering Categorical Datasets

Abstract

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data, but it cannot be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure, and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of k-modes, OTQT, and prove that it can find clusterings superior to those of existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel and all appropriate for k-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best initializer, that a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.
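To make the modified objective concrete, the following is a minimal sketch of a k-modes cost and one Lloyd-style update. It assumes the simple matching (Hamming) dissimilarity, which is the usual choice for k-modes but is not specified in the abstract. The function names are illustrative, and this is not the OTQT algorithm described in the paper.

```python
from collections import Counter


def hamming(x, y):
    """Simple matching dissimilarity: number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of categorical tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def kmodes_cost(data, labels, modes):
    """k-modes objective: total dissimilarity of each point to its cluster mode."""
    return sum(hamming(x, modes[c]) for x, c in zip(data, labels))


def lloyd_step(data, modes):
    """One assignment + update pass (Lloyd-style, assumed here for illustration)."""
    # Assign each observation to the nearest mode; no pairwise
    # dissimilarities between observations are ever computed.
    labels = [
        min(range(len(modes)), key=lambda k: hamming(x, modes[k]))
        for x in data
    ]
    # Recompute each mode from its members; keep the old mode if a cluster is empty.
    new_modes = []
    for k, old in enumerate(modes):
        members = [x for x, c in zip(data, labels) if c == k]
        new_modes.append(column_modes(members) if members else old)
    return labels, new_modes


if __name__ == "__main__":
    data = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
    modes = [("a", "x"), ("b", "z")]  # hypothetical initial modes
    labels, modes = lloyd_step(data, modes)
    print(labels, modes, kmodes_cost(data, labels, modes))
```

Each pass costs on the order of n × K × p coordinate comparisons for n observations, K clusters, and p categorical variables, which illustrates why the approach scales without an n × n dissimilarity matrix.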
