Expectation Distance-based Distributional Clustering for Noise-Robustness

Abstract

This paper presents a clustering technique that reduces susceptibility to data noise by learning and clustering the data distributions and then assigning each data item to the cluster of its distribution, which in turn reduces the impact of noise on the clustering results. The method introduces a new distance among distributions, the expectation distance (denoted ED), that goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted $W_2$ for 2-Wasserstein): the latter essentially depends only on the marginal distributions, while the former also employs information about the joint distributions. Using the ED, the paper extends classical $K$-means and $K$-medoids clustering to operate over data distributions (rather than raw data) and also introduces $K$-medoids using $W_2$. The paper also presents closed-form expressions for the $W_2$ and ED distance measures. Implementation results for clustering real-world weather data and stock data with the proposed ED and $W_2$ distance measures are also presented, which involves efficiently extracting and using the underlying data distributions (Gaussians for weather data, lognormals for stock data). The results show striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. Distribution-based clustering not only offers higher accuracy but also lowers computation time, owing to its reduced time complexity.
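To make the idea of clustering distributions rather than raw data concrete, the following is a minimal sketch, not the paper's implementation. It assumes each data group is summarized by a one-dimensional Gaussian, for which the 2-Wasserstein distance has the well-known closed form $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$, and it uses a plain alternating $K$-medoids loop. The function names (`w2_gaussian_1d`, `fit_gaussians`, `k_medoids`) are hypothetical, and the ED is not sketched because its definition is not given in the abstract.

```python
import numpy as np


def w2_gaussian_1d(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) (1-D closed form)."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


def fit_gaussians(groups):
    """Summarize each group of raw samples by its (mean, standard deviation)."""
    return np.array([[np.mean(g), np.std(g)] for g in groups])


def k_medoids(params, k, n_iter=100, seed=0):
    """Alternating K-medoids over rows of `params`, each row being (mean, std).

    Assumption: for 1-D Gaussians, W2 equals the Euclidean distance
    in (mean, std) space, so the pairwise matrix is computed that way.
    """
    n = len(params)
    diff = params[:, None, :] - params[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise W2 distances

    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)

    for _ in range(n_iter):
        # Assignment step: each distribution joins its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        # Update step: the new medoid minimizes total distance within its cluster.
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids

    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels
```

In this sketch, each raw data item would inherit the label of the group whose fitted distribution it belongs to. The clustering loop runs over the number of distributions rather than the number of raw samples, which is one way to read the abstract's remark about reduced time complexity.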
