An Efficient $k$ -modes Algorithm for Clustering Categorical Datasets

6 June 2020

Abstract

Mining clusters from datasets is an important endeavor in many applications. The $k$ -means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data but can not be applied to categorical-valued observations. The $k$ -modes algorithm addresses this lacuna by taking the $k$ -means objective function, replacing the dissimilarity measure and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both $k$ -modes and $k$ -means are scalable, because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of $k$ -modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of $K$ -selection methods, many of them novel, and all appropriate for $k$ -modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best intializer, a novel $K$ -selection method is more accurate than two methods adapted from $k$ -means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.

View on arXiv

Comments on this paper

An Efficient kkk-modes Algorithm for Clustering Categorical Datasets

An Efficient $k$ -modes Algorithm for Clustering Categorical Datasets