
An Efficient k-modes Algorithm for Clustering Categorical Datasets

Abstract

Mining clusters from data is an important endeavor in many applications. The k-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but it does not apply to categorical-valued observations. The k-modes method addresses this lacuna by replacing the Euclidean distance with the Hamming distance and the means with the modes in the k-means objective function. We provide a novel, computationally efficient implementation of k-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing k-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for k-modes optimization.
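
To make the k-modes objective concrete, the following is a minimal Python sketch of the quantity being minimized: the total Hamming distance from each observation to the coordinate-wise mode of its assigned cluster. It illustrates only the objective function, not OTQT or any specific k-modes update rule. The helper names (`hamming`, `column_modes`, `kmodes_objective`) and the toy dataset are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter

def hamming(x, y):
    """Number of coordinates at which two categorical records differ."""
    return sum(a != b for a, b in zip(x, y))

def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of categorical records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def kmodes_objective(data, labels, modes):
    """Total Hamming distance from each record to the mode of its cluster."""
    return sum(hamming(x, modes[c]) for x, c in zip(data, labels))

# Hypothetical toy data: five records with three categorical attributes.
data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "small", "square"),
]
labels = [0, 0, 1, 1, 1]  # an arbitrary assignment to k = 2 clusters

# For a fixed assignment, the modes minimize the within-cluster Hamming cost.
modes = [
    column_modes([x for x, c in zip(data, labels) if c == k])
    for k in range(2)
]
cost = kmodes_objective(data, labels, modes)
print(modes, cost)
```

A k-modes algorithm alternates between choosing cluster assignments and recomputing modes so that this cost does not increase; the sketch evaluates the cost for a single fixed assignment only.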
