Published: October 22, 2006

**Keywords:**clustering, approximation algorithm, random sampling, polynomial-time approximation scheme

**Categories:**algorithms, approximation algorithms, polynomial time approximation scheme, random sampling, clustering

**ACM Classification:**F.2.2, G.1.2, G.1.6

**AMS Classification:**68W25, 05C85

**Abstract:**
[Plain Text Version]

We continue the investigation of problems concerning
*correlation clustering* or *clustering with qualitative
information*, which is a clustering formulation that has been
studied recently (Bansal, Blum, Chawla (2004), Charikar, Guruswami,
Wirth (FOCS'03), Charikar, Wirth (FOCS'04), Alon et al. (STOC'05)).
In this problem, we are given a complete graph on $n$ nodes (which
correspond to nodes to be clustered) whose edges are labeled $+$
(for similar pairs of items) and $-$ (for dissimilar pairs of
items). Thus our input consists of only qualitative information on
similarity and no quantitative distance measure between items. The
quality of a clustering is measured in terms of its number of
agreements, which is simply the number of edges it correctly
classifies, that is the sum of number of $-$ edges whose endpoints
it places in different clusters plus the number of $+$ edges both of
whose endpoints it places within the same cluster.

In this paper, we study the problem of finding clusterings that
maximize the number of agreements, and the complementary
minimization version where we seek clusterings that minimize the
number of disagreements. We focus on the situation when the number
of clusters is stipulated to be a *small constant* $k$. Our
main result is that for every $k$, there is a polynomial time
approximation scheme for both maximizing agreements and minimizing
disagreements. (The problems are NP-hard for every $k \geq 2$.) The
main technical work is for the minimization version, as the PTAS for
maximizing agreements follows along the lines of the property tester
for Max $k$-CUT by Goldreich, Goldwasser, Ron (1998).

In contrast, when the number of clusters is not specified, the problem of minimizing disagreements was shown to be APX-hard (Chawla, Guruswami, Wirth (FOCS'03)), even though the maximization version admits a PTAS.