
Semi-supervised clustering for de-duplication

Abstract

Data de-duplication is the task of detecting multiple records in a database that correspond to the same real-world entity. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities in different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with edges labelled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (the correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labelled $0$ and other edges being labelled $1$. Under the promise that the edge difference between $G$ and $G^*$ is "small", we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. [Ashtiani et al., 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle that answers whether two points belong to the same cluster or to different clusters. We further prove that even with access to a same-cluster oracle, the promise version is NP-Hard as long as the number of queries to the oracle is not too large ($o(n)$, where $n$ is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mathcal{F}$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.
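To make the objective concrete, the following minimal sketch computes the correlation loss exactly as worded above: it counts $0$ edges inside a cluster plus $1$ edges between clusters. It is an illustration only, not the algorithm from the paper. The function name `correlation_loss`, the dictionary-based graph encoding, and the four-record toy instance are all assumptions made for this example.

```python
from itertools import combinations


def correlation_loss(edge_labels, assignment):
    """Count 0-edges within clusters plus 1-edges across clusters.

    edge_labels: dict mapping frozenset({u, v}) -> 0 or 1 for every pair
                 of vertices (a complete labelled graph; hypothetical encoding).
    assignment:  dict mapping each vertex -> cluster identifier.
    """
    loss = 0
    for u, v in combinations(sorted(assignment), 2):
        same_cluster = assignment[u] == assignment[v]
        label = edge_labels[frozenset((u, v))]
        if same_cluster and label == 0:
            loss += 1  # a 0 edge inside a cluster
        elif not same_cluster and label == 1:
            loss += 1  # a 1 edge between two clusters
    return loss


# Toy complete graph on four records (made-up labels for illustration).
edge_labels = {
    frozenset(("a", "b")): 1,
    frozenset(("c", "d")): 1,
    frozenset(("a", "c")): 0,
    frozenset(("a", "d")): 0,
    frozenset(("b", "c")): 1,
    frozenset(("b", "d")): 0,
}

two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}

print(correlation_loss(edge_labels, two_clusters))  # 1
print(correlation_loss(edge_labels, one_cluster))   # 3
print(correlation_loss(edge_labels, singletons))    # 3
```

On this toy instance, grouping {a, b} and {c, d} pays only for the single $1$ edge between b and c, whereas merging everything or separating everything each costs three edges.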
