A Constant-Factor Approximation Algorithm for Co-clustering

Abstract: Co-clustering is the simultaneous partitioning of the rows and the columns of a matrix such that the blocks induced by the row/column partitions are good clusters. Motivated by several applications in text mining, market-basket analysis, and bioinformatics, this problem has attracted a lot of attention in the past few years. Unfortunately, to date, most of the algorithmic work on this problem has been heuristic in nature. In this work we obtain the first approximation algorithms for the co-clustering problem. Our algorithms are simple and provide constant-factor approximations to the optimum. We also show that co-clustering is NP-hard, thereby complementing our algorithmic result.


Introduction
Clustering is a fundamental primitive in many data-analysis applications, including information retrieval, databases, text and data mining, bioinformatics, market-basket analysis, and so on [12, 20]. The central objective in clustering is the following: given a set of points and a pairwise distance measure, partition the set into clusters such that points that are close to each other according to the distance measure occur together in a cluster and points that are far away from each other occur in different clusters. This objective sounds straightforward, but it is not easy to state a universal desideratum for clustering: Kleinberg showed in a reasonable axiomatic framework that clustering is an impossible problem to solve [22]. In general, clustering objectives tend to be application specific, exploiting the underlying structure in the data and imposing additional structure on the clusters themselves.
In several applications, the data itself has a lot of structure, which may be hard to capture using a traditional clustering objective. Consider the example of a Boolean matrix whose rows correspond to keywords and whose columns correspond to advertisers, and an entry is one if and only if the advertiser has placed a bid on the keyword. The goal is to cluster both the advertisers and the keywords. One way to accomplish this would be to independently cluster the advertisers and the keywords using the standard notion of clustering: cluster similar advertisers and cluster similar keywords. However (even though for some criteria this might be a reasonable solution, as we argue subsequently in this work), such an endeavor might fail to elicit subtle structures that might exist in the data: perhaps there are two disjoint sets of advertisers A_1, A_2 and keywords K_1, K_2 such that each advertiser in A_i bids on each keyword in K_j if and only if i = j. In an extreme case, maybe there is a combinatorial decomposition of the matrix into blocks such that each block is either almost full or almost empty. To be able to discover such structure, the clustering objective has to simultaneously intertwine the information about both the advertisers and the keywords that is present in the matrix. This is precisely what is achieved by co-clustering [17, 7]; other terms for co-clustering include biclustering, bidimensional clustering, and subspace clustering.
In the simplest version of (k, ℓ)-co-clustering, we are given a matrix of numbers and two integers k and ℓ. The goal is to partition the rows into k clusters and the columns into ℓ clusters such that the sum-squared deviation from the mean within each "block" induced by the row-column partition is minimized. This definition, along with different objectives, is made precise in Section 2. Co-clustering has received a lot of attention in recent years, with several applications in text mining [10, 15], market-basket data analysis, image, speech, and video analysis, and bioinformatics [7, 8, 23]; see the paper by Banerjee et al. [3] and the survey by Madeira and Oliveira [25].
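As an illustration of this objective (not code from the paper), the following sketch computes the sum-squared deviation from the block means for given row and column partitions; the function name `coclustering_cost` and the toy matrix are our own.

```python
import numpy as np

def coclustering_cost(A, row_labels, col_labels, k, l):
    """Sum of squared deviations from the block mean, i.e., |A - RMC|_F^2."""
    cost = 0.0
    for I in range(k):
        rows = np.flatnonzero(row_labels == I)
        for J in range(l):
            cols = np.flatnonzero(col_labels == J)
            if rows.size and cols.size:
                block = A[np.ix_(rows, cols)]
                cost += ((block - block.mean()) ** 2).sum()
    return cost

# A matrix with a perfect 2x2 block structure has zero cost.
A = np.array([[5., 5., 0., 0.],
              [5., 5., 0., 0.],
              [1., 1., 9., 9.]])
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1, 1])
print(coclustering_cost(A, rows, cols, 2, 2))  # -> 0.0
```

Note that the cost depends only on the two partitions: once they are fixed, the best "representative" of each block is its mean.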
Even though co-clustering has been extensively studied in many application areas, very little is known about it from an algorithmic angle. Very special variants of co-clustering are known to be NP-hard [18]. A natural generalization of the k-means algorithm to co-clustering is known to converge [3]. Apart from these, most of the algorithmic work done on co-clustering has been heuristic in nature, with no proven guarantees of performance.

Main contributions
In this paper we address the problem of co-clustering from an algorithmic point of view. Our main contribution is the first constant-factor approximation algorithm for the (k, ℓ)-co-clustering problem. Our algorithm is simple and builds upon approximation algorithms for a variant of the k-median problem, which we call k-means_p. The algorithm works for a standard set of matrix norms and produces a 3α-approximate solution, where α is the approximation factor for the k-means_p problem; for the latter, we obtain a constant-factor approximation by extending known results on the k-median problem. We next consider the important special case of the Frobenius norm and obtain a (√2 + ε)-approximation algorithm when k and ℓ are fixed, by exploiting the geometry of the space and the results on the k-means problem.
We complement these results by considering the extreme cases of ℓ = 1 and ℓ = n^ε, where the matrix is of size m × n and ε is any fixed value in (0, 1). We show that while the (k, 1)-co-clustering problem can be solved exactly in time O(mn + m^2 k), the (k, n^ε)-co-clustering problem is NP-hard for k ≥ 2 under the ℓ_1 and ℓ_2 norms.
In the data-mining and machine-learning literature there exists a variety of co-clustering problems, depending on the precise objective function one is trying to optimize; we give some examples in Section 2. The version that we address in this paper is a natural one, and at the same time it possesses a structure that allows us to approximately solve it by decomposing it into single-dimensional clustering problems. This is not true for other types of objectives. For instance, if one considers the residue given by (2.2) (for more discussion about the relevant concepts refer to Section 2), then one can create counterexamples showing that the problem is not decomposable; solving such cases is interesting future work.

Related work
Research on clustering has a long and varied history, with work ranging from approximation algorithms to axiomatic developments of the objective functions [19, 12, 22, 20, 36, 16]. The problem of co-clustering itself has found growing applications in several practical fields, for example, simultaneously clustering words and documents in information retrieval [10], clustering genes and expression data for biological data analysis [7, 34], clustering users and products for recommendation systems [1], and so on. The exact objective function, and the corresponding definition of co-clustering, varies depending on the type of structure we want to extract from the data. The hardness of the co-clustering problem depends on the exact merit function used [18]. Consequently, work on co-clustering has mostly focused on heuristics that work well in practice. Excellent references on such methods are the surveys by Madeira and Oliveira [25] and Tanay, Sharan, and Shamir [32]. Banerjee et al. [3] unified a number of merit functions for the co-clustering problem under the general setting of Bregman divergences and gave a k-means-style algorithm that is guaranteed to monotonically decrease the merit function. Our objective function for the p = 2 case is precisely the ‖·‖_F merit function to which their results apply.
There is little work along the lines of approximation algorithms for co-clustering problems. The closest algorithmic work relates to finding cliques and dense bipartite subgraphs [28]. These variants are often hard even to approximate to within a constant factor. Hassanpour [18] showed that a version of the co-clustering problem that finds homogeneous submatrices is hard. Feige and Kogan [13] showed that the problem of finding the maximum biclique cannot be approximated to within 2^((log n)^δ) for some δ = δ(ε) > 0, assuming that 3-SAT cannot be solved in 2^(n^(3/4+ε)) deterministic time for some fixed ε > 0.
Concurrently with the conference publication of our work, Puolamäki et al. [30] published results on the co-clustering problem for objective functions of the same form that we study. They analyze the same algorithm for two cases: the ℓ_1 norm for 0/1-valued matrices and the ℓ_2 norm for real-valued matrices. In the first case they obtain a better approximation factor than ours (2.414α as opposed to 3α, where α is the best approximation factor for one-sided clustering). On the other hand, our result is more general, as it holds for any ℓ_p norm and for real-valued matrices. Their ℓ_2 result is the same as ours (a √2·α-approximation) and their proof is similar (although presented differently). Subsequent to the conference publication of our work, Jegelka et al. [21] showed that our algorithm can be extended to tensor clustering and to general metrics.

Preliminaries and problem definition
In this section we briefly mention some of the variants of the objective function that have been proposed in the co-clustering literature and are close to the ones we use in this work. Other commonly used objectives are based on information-theoretic quantities.
Cho et al. [8] define two different co-clustering objectives, based on two different definitions of residue. For every element a_{ij} that belongs to the (I, J)-co-cluster, define its residue h_{ij} either as

h_{ij} = a_{ij} − a_{IJ},    (2.1)

or as

h_{ij} = a_{ij} − a_{iJ} − a_{Ij} + a_{IJ},    (2.2)

where a_{IJ} is the average of all the entries in the co-cluster, a_{iJ} is the mean of all the entries in row i whose columns belong to J, and a_{Ij} is the mean of all the entries in column j whose rows belong to I.
Having defined the (two different) residues, the goal is to minimize some norm of the residue matrix H = (h_{ij}). The norm most commonly used in the literature is the Frobenius norm ‖·‖_F, defined as the square root of the sum of the squares of the elements. One can attempt to minimize some other norm; for example, Yang et al. [35] attempt to find (possibly overlapping) clusters (I, J) that minimize the mean squared residue of the entries in I × J.
More generally, one can define the norm

|H|_p = (Σ_{i,j} |h_{ij}|^p)^{1/p}.    (2.3)

Note that the Frobenius norm is the special case p = 2.
In this work we study the general case of norms of the form of (2.3), for p ≥ 1, using the residue definition of (2.1); designing approximation algorithms for the residue definition corresponding to (2.2) is an open problem. Specifically, we assume that the data is given in the form of a matrix A ∈ R^{m×n}. We denote the ith row of A by A_i and the jth column of A by A^j. The aim in co-clustering is to simultaneously cluster the rows and columns of A, so as to optimize the difference between A and the clustered matrix. Formally, we want to compute a k-partitioning I = {I_1, ..., I_k} of the set of rows {1, ..., m} and an ℓ-partitioning J = {J_1, ..., J_ℓ} of the set of columns {1, ..., n}. For any such partition I = {I_1, ..., I_k} of rows (or columns), a convenient way to describe it is in terms of a clustering index matrix, say R = {r_{ij}} ∈ R^{m×k}, such that each column of R is the index vector of the corresponding part of the partition I, that is, r_{iI} = 1 if i ∈ I and 0 otherwise (see Figure 1). (Note that we slightly abuse notation and use the partition as an index: the element r_{i I_s} refers to the element r_{is} of matrix R, and it equals 1 if i ∈ I_s and 0 otherwise.) We define a similar index matrix C = {c_{ij}} ∈ R^{ℓ×n} for the column partition J in the same fashion: c_{Jj} = 1 if j ∈ J and 0 otherwise. The co-clustering solution is then completely defined by the three matrices R ∈ R^{m×k}, M ∈ R^{k×ℓ} (to be defined below), and C ∈ R^{ℓ×n} (see Figure 1). For each row-cluster and column-cluster tuple (I, J), we refer to the set of indices I × J as a block.
The clustering error associated with the co-clustering (I, J) is defined to be the quantity

|A − RMC|_p,    (2.4)

where M = {μ_{IJ}} is defined as the matrix in R^{k×ℓ} that minimizes this quantity. Let m_I = |I| be the size of the row cluster I and n_J = |J| the size of the column cluster J. By the definition of |·|_p, we can write (see also Figure 1)

|A − RMC|_p^p = Σ_{I∈I} Σ_{J∈J} |A_{IJ} − R_I M C_J|_p^p = Σ_{I∈I} Σ_{J∈J} |A_{IJ} − μ_{IJ} 1_{m_I×n_J}|_p^p,

where each A_{IJ} ∈ R^{m_I×n_J} is the submatrix of A induced by the block (I, J), each R_I ∈ R^{m_I×k} is the submatrix of R formed by the rows in I (which means that if I is the jth row cluster, then the jth column of R_I is all ones and the remaining entries are 0), each C_J ∈ R^{ℓ×n_J} is the submatrix of C formed by the columns in J (which means that if J is the ith column cluster, then the ith row of C_J is all ones and the remaining entries are 0), each μ_{IJ} ∈ R is the center element associated with co-cluster (I, J), and 1_{m_I×n_J} is the m_I × n_J matrix with all elements equal to 1. Two special cases are of particular interest to us: p = 1 and p = 2. For p = 2, the matrix norm |·|_p is the well-known Frobenius norm ‖·‖_F, and the value μ_{IJ} is the simple average of the corresponding block. For p = 1, the norm is the sum of the absolute values of the entries of the matrix, and the corresponding μ_{IJ} is the median of the entries in the block.
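The claim that the optimal block center is the mean for p = 2 and the median for p = 1 is easy to check numerically; the grid search below is purely illustrative (our own code, not from the paper).

```python
import numpy as np

block = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # entries of one co-cluster block

# Evaluate candidate centers on a fine grid and compare minimizers
# against the mean (squared error) and the median (absolute error).
grid = np.linspace(block.min(), block.max(), 10001)

sq_err = ((block[:, None] - grid[None, :]) ** 2).sum(axis=0)
abs_err = np.abs(block[:, None] - grid[None, :]).sum(axis=0)

best_sq = grid[sq_err.argmin()]
best_abs = grid[abs_err.argmin()]

print(best_sq, block.mean())       # grid minimizer is (approximately) the mean
print(best_abs, np.median(block))  # grid minimizer is (approximately) the median
```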
In general, the numbers of row and column clusters, k and ℓ, are part of the input. In some cases one can obtain stronger results for constant k and ℓ; when this is the case we mention it explicitly.

Algorithm
In this section we give a simple algorithm for co-clustering with the objective function defined in (2.4). We first present the algorithm and then show that for the general |·|_p norm it gives a constant-factor approximation. We then give a tighter analysis for the simpler case of |·|_2 (i.e., the Frobenius norm), showing that we get a (√2 + ε)-approximation when k, ℓ are constant. Before analyzing the algorithm we need to show how to solve the subproblems of steps 1 and 2. For this we use some results for "standard" one-sided clustering, and we develop some new ones needed to solve the problem for any ℓ_p norm.
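Based on the description of Algorithm 1 (step 1: one-sided row clustering, step 2: one-sided column clustering, then block centers), a minimal sketch might look as follows. We substitute a plain Lloyd's k-means with deterministic farthest-first seeding for the α-approximate k-means_p subroutine; all function names here are our own, and this is an illustration rather than the paper's implementation.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with farthest-first seeding; a stand-in
    for any alpha-approximate one-sided clustering routine."""
    centers = [X[0]]
    for _ in range(1, k):
        # next seed: the point farthest from the current centers
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def co_cluster(A, k, l):
    """Sketch of Algorithm 1: cluster rows (step 1) and columns (step 2)
    independently, then take block means as the centers (step 3)."""
    row_labels = lloyd_kmeans(A, k)
    col_labels = lloyd_kmeans(A.T, l)
    M = np.zeros((k, l))
    for I in range(k):
        for J in range(l):
            block = A[np.ix_(row_labels == I, col_labels == J)]
            if block.size:
                M[I, J] = block.mean()
    return row_labels, col_labels, M

# A matrix with planted 2x2 block structure is recovered exactly.
A = np.array([[5., 5., 0., 0.],
              [5., 5., 0., 0.],
              [0., 0., 5., 5.],
              [0., 0., 5., 5.]])
row_labels, col_labels, M = co_cluster(A, 2, 2)
print(row_labels, col_labels)
```

The point of the sketch is the reduction: the co-clustering is assembled from two independent one-sided clusterings, which is exactly the structure the analysis below exploits.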

One-sided clustering
In the standard clustering problem, we are given n points in a metric space, possibly R^d, and an objective function that measures the quality of any given clustering of the points. Various such objective functions have been extensively used in practice and have been analyzed in the theoretical computer science literature (k-center, k-median, k-means, etc.). As an aid to our co-clustering algorithm, we are particularly interested in the following setting of the problem, which we call k-means_p. Given a set of vectors a_1, a_2, ..., a_n ∈ R^d, the standard ℓ_p distance ‖·‖_p, and an integer k, we first define the cost of a partitioning I = {I_1, ..., I_k} of {1, ..., n} as follows. For each cluster I ∈ I, the center of the cluster I is defined to be the vector μ_I that minimizes

Σ_{i∈I} ‖a_i − μ_I‖_p^p.    (3.1)

The cost of the clustering I is then defined to be the sum of the distances of each point to the corresponding cluster center, raised to the power 1/p, that is,

cost(I) = (Σ_{I∈I} Σ_{i∈I} ‖a_i − μ_I‖_p^p)^{1/p}.

The goal of the k-means_p problem is to minimize this cost. In general the number of clusters k is part of the input. We also consider the case that k is constant, where we can obtain stronger results; when this is the case we mention it explicitly.
In matrix notation, the points define a matrix A = [a_1, ..., a_n]^T ∈ R^{n×d}. We represent each clustering I = {I_1, ..., I_k} of the n points in R^d by a clustering index matrix R ∈ R^{n×k}. Each column of the matrix R is the index vector of the corresponding cluster: R_{iI} = 1 if a_i belongs to cluster I, and 0 otherwise (see Figure 2). Similarly, the matrix M ∈ R^{k×d} holds the centers of the clusters, that is, M = [μ_1, ..., μ_k]^T. Thus the aim is to find the clustering index matrix R that minimizes |A − RM|_p, where M is defined as the matrix in R^{k×d} that minimizes this quantity. Let m_I be the size of the cluster I, A_I ∈ R^{m_I×d} the corresponding submatrix of A, and R_I ∈ R^{m_I×k} the corresponding submatrix of R. Also let A_i be the ith row vector of A.
THEORY OF COMPUTING, Volume 8 (2012), pp. 597-622
Of particular interest to us is the value p = 2. In that case, the center μ_I of each cluster is the average of all the points A_i in that cluster (the average minimizes the expression (3.1) for p = 2). This is the well-known k-means problem, and it admits a polynomial-time (1 + ε)-approximation algorithm when k is fixed.

Theorem 3.1 ([24]). For any fixed k and any fixed ε > 0, there is a polynomial-time algorithm that achieves a (1 + ε)-factor approximation for the k-means problem, even if the dimension is part of the input.
In Theorem 3.4 we show that there exists a constant-factor approximation algorithm for the k-means_p problem for any p (and k). Our proof is based on the analysis of the similar k-median problem by Charikar et al. [6]. In the k-median problem we are given a set A = {a_1, ..., a_n} of points, the cost d(a_i, a_j) of assigning point a_i to point a_j, and an integer k > 0, and the goal is to find a k-partitioning I of {1, ..., n} minimizing the cost

Σ_{I∈I} Σ_{i∈I} d(a_i, μ^med_I),

where the k-median center μ^med_I of cluster I is the input point minimizing this sum, that is, μ^med_I = argmin_{a_j : j∈I} Σ_{i∈I} d(a_i, a_j). In the metric case, d(·, ·) is assumed to be symmetric and to satisfy the triangle inequality. Unfortunately, in our case we use the norm ℓ_p^p, and the triangle inequality does not hold for p > 1. However, as we show next in Lemma 3.3, it does hold approximately. This allows us to use the following result of Charikar et al. [6], which is a simple extension of their constant-factor approximation for the metric k-median problem.

Theorem 3.2 ([6]). There is a polynomial-time algorithm that achieves an 8δ^2-approximation to the metric k-median problem if the costs satisfy a δ-approximate triangle inequality for some δ ≥ 1, that is, d(a_i, a_o) ≤ δ (d(a_i, a_j) + d(a_j, a_o)) for all i, j, o.
Next we prove Lemma 3.3, which shows that the distance measure induced by ℓ_p^p satisfies an approximate version of the triangle inequality.
Lemma 3.3. For any u, v, w ∈ R^d and any p ≥ 1,

‖u − w‖_p^p ≤ 2^{p−1} (‖u − v‖_p^p + ‖v − w‖_p^p),

that is, ‖·‖_p^p satisfies a δ-approximate triangle inequality with δ = 2^{p−1}.

Proof. Note that the function f(x) = |x|^p is convex for p ≥ 1, and hence |a + b|^p ≤ 2^{p−1}(|a|^p + |b|^p) for all reals a, b. Let u = (u_1, ..., u_d), and similarly for v and w. For each coordinate i we have

|u_i − w_i|^p = |(u_i − v_i) + (v_i − w_i)|^p ≤ 2^{p−1}(|u_i − v_i|^p + |v_i − w_i|^p),

and summing over i proves the claim.

Now we can state and prove the theorem for the one-sided clustering.
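The δ = 2^{p−1} approximate triangle inequality of Lemma 3.3 can be sanity-checked numerically; the following sketch (our own code) samples random triples of vectors and verifies the bound for several values of p.

```python
import numpy as np

rng = np.random.default_rng(1)

def lp_p(x, p):
    """The distance measure ||x||_p^p used as the clustering cost."""
    return (np.abs(x) ** p).sum()

# Lemma 3.3: ||u - w||_p^p <= 2^(p-1) * (||u - v||_p^p + ||v - w||_p^p).
for p in (1.0, 1.5, 2.0, 3.0):
    for _ in range(1000):
        u, v, w = rng.normal(size=(3, 5))
        lhs = lp_p(u - w, p)
        rhs = 2 ** (p - 1) * (lp_p(u - v, p) + lp_p(v - w, p))
        assert lhs <= rhs + 1e-9
print("delta = 2^(p-1) approximate triangle inequality holds on all samples")
```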
Theorem 3.4. There is a polynomial-time algorithm that achieves a 16-approximation to the k-means_p problem for any p ≥ 1.
Proof. The proof is based on the result of Charikar et al. described in Theorem 3.2. We note two issues that make our problem different from the standard metric k-median: (i) we have d(a_i, a_j) = ‖a_i − a_j‖_p^p, which does not satisfy the triangle inequality, and (ii) the centers in k-median are chosen from the given set A (i.e., μ^med_I ∈ A), whereas the centers in k-means_p are not necessarily from A. Note also a small asymmetry between the cost functions of the k-means_p and k-median objectives: in the former the sum of the distances is raised to the power 1/p, unlike in the latter.
To address issue (i), as we showed in Lemma 3.3, the distance measure ‖·‖_p^p satisfies an approximate triangle inequality, and hence we can appeal to Theorem 3.2. We address issue (ii) as follows. Let I* be the optimal clustering according to the k-means_p objective (i.e., the cluster centers can be arbitrary) and let J* be the optimal clustering according to the k-median objective (i.e., the cluster centers belong to A). In Lemma 3.5, proven directly after this proof, we show that the loss from requiring the cluster centers to belong to A can be bounded.
Let c(·) denote the k-median cost with distances measured in ℓ_p^p. By Theorem 3.2 with δ = 2^{p−1}, the algorithm returns a clustering Î with c(Î) ≤ 8δ^2 c(J*), and by Lemma 3.5, c(J*) ≤ 2^p c(I*). Therefore c(Î) ≤ 8 · 2^{2(p−1)} · 2^p c(I*) = 2^{3p+1} c(I*). Since the k-means_p cost is the (1/p)th power of this quantity, the resulting approximation factor is (2^{3p+1})^{1/p} = 2^{3+1/p} ≤ 16.

We now prove the lemma used in the preceding theorem.
Lemma 3.5. If the costs are measured in ℓ_p^p, then c(J*) ≤ 2^p c(I*).
Proof. First, we claim that for each cluster I ∈ I*, with center μ_I, there is some a ∈ A such that

Σ_{i∈I} ‖a_i − a‖_p^p ≤ 2^p Σ_{i∈I} ‖a_i − μ_I‖_p^p.

The proof of the claim is by a standard averaging argument over the points in cluster I: there exists an element a_j, j ∈ I, for which ‖a_j − μ_I‖_p^p ≤ (1/|I|) Σ_{i∈I} ‖a_i − μ_I‖_p^p. By Lemma 3.3,

Σ_{i∈I} ‖a_i − a_j‖_p^p ≤ 2^{p−1} Σ_{i∈I} (‖a_i − μ_I‖_p^p + ‖μ_I − a_j‖_p^p) ≤ 2^{p−1} (Σ_{i∈I} ‖a_i − μ_I‖_p^p + |I| · ‖a_j − μ_I‖_p^p) ≤ 2^p Σ_{i∈I} ‖a_i − μ_I‖_p^p,

and thus we have found the desired element a = a_j ∈ A for cluster I. By summing this over all the clusters in I*, we have shown that each cluster center in I* can be replaced by some point in A while increasing the clustering cost by at most a factor 2^p. The lemma then follows from the optimality of J*.
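Lemma 3.5's claim, that restricting centers to the input points loses at most a factor 2^p, can likewise be probed numerically for p = 2 (illustrative code of our own; the optimal unrestricted center is the mean in this case).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 2.0

def cluster_cost(points, center):
    """Cost of a single cluster with the given center, in l_p^p."""
    return (np.abs(points - center) ** p).sum()

worst = 0.0
for _ in range(2000):
    pts = rng.normal(size=rng.integers(2, 8))
    # Optimal unrestricted center for p = 2 is the mean.
    opt = cluster_cost(pts, pts.mean())
    # Best center restricted to the input points (k-median style).
    restricted = min(cluster_cost(pts, a) for a in pts)
    if opt > 1e-12:
        worst = max(worst, restricted / opt)
print(worst)  # stays below 2^p = 4, as the lemma guarantees
```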

Constant-factor approximation
We now show that the co-clustering returned by Algorithm 1 is a constant-factor approximation to the optimum.
Theorem 3.6. Given an algorithm for obtaining an α-approximation to the k-means_p problem, the algorithm Co-cluster (Algorithm 1) returns a co-clustering that is a 3α-approximation to an optimal co-clustering of A.
Proof. Let (I*, J*) be an optimal co-clustering solution and define the corresponding index matrices to be R* and C*, respectively. Furthermore, let Î* be an optimal one-sided row clustering and Ĵ* an optimal one-sided column clustering, and define the index matrix R̂* from the clustering Î* and the index matrix Ĉ* from the clustering Ĵ*. This means that there is a matrix M*_R ∈ R^{k×n} such that |A − R̂* M*_R|_p is minimized over all index matrices representing k row clusters; similarly, there is a matrix M*_C ∈ R^{m×ℓ} such that |A − M*_C Ĉ*|_p is minimized over all index matrices representing ℓ column clusters. The algorithm Co-cluster uses approximate solutions to the one-sided row and column clustering problems to compute partitionings Î and Ĵ. Let R̂ be the clustering index matrix corresponding to this row clustering and M̂_R the corresponding matrix of centers; similarly, let Ĉ, M̂_C be the corresponding matrices for the column clustering constructed by Co-cluster. By the assumptions of the theorem,

|A − R̂ M̂_R|_p ≤ α |A − R̂* M*_R|_p  and, similarly,  |A − M̂_C Ĉ|_p ≤ α |A − M*_C Ĉ*|_p.

For the co-clustering (Î, Ĵ) that the algorithm computes, define the center matrix M̃ ∈ R^{k×ℓ} by letting each entry μ̃_{IJ} be the value that minimizes the ℓ_p error of the corresponding block. We now show that the co-clustering (Î, Ĵ) with the center matrix M̃ is a 3α-approximate solution. First, we lower bound the cost of the optimal co-clustering solution by the costs of the optimal one-sided clusterings. Since (R̂*, M*_R) is the optimal row clustering, we have

|A − R̂* M*_R|_p ≤ |A − R* M* C*|_p,    (3.8)

and similarly, since (Ĉ*, M*_C) is the optimal column clustering, |A − M*_C Ĉ*|_p ≤ |A − R* M* C*|_p. Consider now a particular block (I, J) ∈ Î × Ĵ. Note that (R̂ M̂_R)_{ij} = (R̂ M̂_R)_{i′j} for all i, i′ ∈ I; we denote this common value by r̂_{Ij} = (R̂ M̂_R)_{ij}, and we also denote ĉ_{iJ} = (M̂_C Ĉ)_{ij} for j ∈ J. Then for all i ∈ I we obtain the chain of inequalities (3.9), whose last step is an application of the triangle inequality.
Summing (3.9) over all blocks and combining with the bounds above gives |A − R̂ M̃ Ĉ|_p ≤ 3α |A − R* M* C*|_p, which proves the theorem. By combining the above with Theorem 3.4 we obtain the following.
Corollary 3.7. There is a polynomial-time algorithm that returns a (k, ℓ)-co-clustering that is a 48-factor approximation to the optimum under the |·|_p norm.

A tighter analysis for the Frobenius norm
A commonly used instance of our objective function is the case p = 2, the Frobenius norm. The results of the previous section give a (3 + ε)-approximation in this case when k, ℓ are constants. It turns out, however, that we can exploit the particular structure of the Frobenius norm and obtain a better approximation factor.
To restate the problem, we want to compute clustering matrices R ∈ R^{m×k} and C ∈ R^{ℓ×n}, with R_{iI} = 1 if i ∈ I and 0 otherwise, and C_{Jj} = 1 if j ∈ J and 0 otherwise (see Section 2 for more details), such that ‖A − RMC‖_F is minimized, where M = {μ_{IJ}} ∈ R^{k×ℓ} contains the averages of the blocks, that is,

μ_{IJ} = (1 / (m_I n_J)) Σ_{i∈I} Σ_{j∈J} a_{ij},

where m_I is the size of row cluster I and n_J is the size of column cluster J. We show the following theorem.
Theorem 3.8. Given an α-approximation algorithm for the k-means clustering problem, the algorithm Co-cluster (Algorithm 1) computes a √2·α-approximate solution to the co-clustering problem with the ‖·‖_F objective function.
Proof. Define R̃ = {r̃_{ij}} ∈ R^{m×k} similarly to R, but with the values scaled down according to the cluster sizes. Specifically, r̃_{iI} = 1/√m_I if i ∈ I, and 0 otherwise. Likewise, define C̃ = {c̃_{ij}} ∈ R^{ℓ×n} with c̃_{Jj} = 1/√n_J if j ∈ J, and 0 otherwise. Then notice that we can write RMC = R̃R̃^T A C̃^T C̃.
If we also consider the one-sided clusterings (RM_R and M_C C), then we can likewise write RM_R = R̃R̃^T A and M_C C = A C̃^T C̃.
We define P_R = R̃R̃^T. Then P_R is a projection matrix. To see why, notice first that R̃ has orthonormal columns: (R̃^T R̃)_{II} = Σ_{i∈I} (1/√m_I)^2 = 1 and (R̃^T R̃)_{IJ} = 0 for I ≠ J, thus R̃^T R̃ = I_k. Therefore P_R P_R = R̃(R̃^T R̃)R̃^T = P_R, hence P_R is a projection matrix. Define P_R^⊥ = I_m − P_R, the projection orthogonal to P_R. Similarly we define the projection matrices P_C = C̃^T C̃ and P_C^⊥ = I_n − P_C. In general, in the rest of the section, P_X and P_X^⊥ refer to the projection matrices that correspond to the clustering matrix X.
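The projection property of P_R is easy to verify on a small example (toy data of our own); applying P_R to a matrix replaces each row by the average of the rows in its cluster.

```python
import numpy as np

# Scaled indicator matrix R~ for row clusters {0, 1} and {2}: its columns
# are orthonormal, so P_R = R~ R~^T is an orthogonal projection.
R_tilde = np.array([[1 / np.sqrt(2), 0.0],
                    [1 / np.sqrt(2), 0.0],
                    [0.0,            1.0]])
P_R = R_tilde @ R_tilde.T

assert np.allclose(R_tilde.T @ R_tilde, np.eye(2))  # orthonormal columns
assert np.allclose(P_R @ P_R, P_R)                  # idempotent: a projection

A = np.array([[1.0, 3.0],
              [5.0, 7.0],
              [2.0, 2.0]])
print(P_R @ A)  # rows 0 and 1 are both replaced by their average [3., 5.]
```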
We can then state the problem as finding projections of the form P_R = R̃R̃^T and P_C = C̃^T C̃ that minimize ‖A − P_R A P_C‖_F^2, under the constraint that R̃ and C̃ are of the form described previously.

Let R*, C* be the optimal co-clustering solution, R̂*, Ĉ* the optimal one-sided clusterings, and R̂, Ĉ the one-sided row and column clusterings that are α-approximate to the optimal ones. By the Pythagorean theorem (applied to every column separately, since the square of the Frobenius norm is just the sum of the squared column lengths) and the fact that the projection matrices P_R̂ and P_R̂^⊥ are orthogonal to each other, and similarly P_Ĉ and P_Ĉ^⊥, the error ‖A − P_R̂ A P_Ĉ‖_F^2 decomposes into the error of the row clustering plus the error the column clustering incurs on the row-clustered matrix. Without loss of generality we assume that the row clustering has the smaller error (otherwise we can consider A^T). Combining the resulting inequalities (3.11) and (3.12) with (3.8) for p = 2 (squared and rewritten in the notation of this section), which says that the error of the optimal one-sided clustering is bounded by the error of the optimal co-clustering, yields the claimed √2·α bound.

The (k, 1)-co-clustering problem

Lemma 3.11. Given m real values, an optimal (k, 1)-co-clustering of the values can be computed in time O(m^2 k).

Proof. The proof is based on Brucker's algorithm for one-dimensional clustering [5]; we present it for completeness. Since A is just a set of real values, a (k, 1)-clustering of A corresponds to a partition of those values into k clusters. The proof is based on the following fact: if a cluster in the optimal solution contains the values a_i and a_j, then it also contains all the values in between. The proof of the fact is by contradiction: assume that there are two clusters whose values are not separated; then we can replace these two clusters by two clusters whose values are separated, and the cost will be lower. This fact implies that we can solve the problem using dynamic programming. Assume that the sorted values of A are {a_1, a_2, ..., a_m}. We define C(i, r) to be the cost of an optimal r-clustering of {a_1, ..., a_i}. Then, knowing C(j, r − 1) for all j ≤ i allows us to compute C(i, r) by considering all the possible positions at which the rth cluster can begin:

C(i, r) = min_{1 ≤ j ≤ i} { C(j − 1, r − 1) + c(j, i) },

where c(j, i) is the cost of the single cluster containing the values {a_j, a_{j+1}, ..., a_i}.
We need to compute O(mk) values C(i, r), so the space needed is O(mk), and each value takes O(m) time to compute, so the total time required is O(m^2 k).
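A direct implementation of this dynamic program might look as follows (our own code, using prefix sums so that each single-cluster cost c(j, i) takes O(1) time after O(m) preprocessing; the minimization over j is still O(m) per entry, so the total running time remains O(m^2 k)).

```python
import numpy as np

def optimal_1d_clustering(values, k):
    """O(m^2 k) dynamic program for optimal k-clustering of reals
    under squared error (the Lemma 3.11 / Brucker-style recurrence)."""
    a = np.sort(np.asarray(values, dtype=float))
    m = len(a)
    pre = np.concatenate([[0.0], np.cumsum(a)])        # prefix sums
    pre2 = np.concatenate([[0.0], np.cumsum(a ** 2)])  # prefix sums of squares

    def c(j, i):
        # Squared-error cost of the single cluster {a_j, ..., a_i} (0-based, inclusive).
        n = i - j + 1
        s = pre[i + 1] - pre[j]
        return (pre2[i + 1] - pre2[j]) - s * s / n

    INF = float("inf")
    C = np.full((m + 1, k + 1), INF)  # C[i][r]: best cost of r clusters on first i values
    C[0][0] = 0.0
    for i in range(1, m + 1):
        for r in range(1, min(i, k) + 1):
            # try every possible start position of the r-th cluster
            C[i][r] = min(C[j][r - 1] + c(j, i - 1) for j in range(r - 1, i))
    return C[m][k]

print(optimal_1d_clustering([1.0, 1.1, 5.0, 5.1, 9.0], 3))
```

On the example, the optimal 3-clustering is {1.0, 1.1}, {5.0, 5.1}, {9.0}, with total cost 0.01.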
We now use this lemma to solve the problem optimally for a general A ∈ R^{m×n}, under the Frobenius norm. The algorithm is simple. Let μ_i = (1/n) Σ_{j=1}^n a_{ij} be the mean of row i. Write a_{ij} = μ_i + ε_{ij}, and note that for all i we have Σ_{j=1}^n ε_{ij} = 0. The algorithm then runs the dynamic-programming algorithm on the vector of the row means and returns the clustering produced.
Algorithm 2: Co-cluster-DP(A, k)
Require: Matrix A ∈ R^{m×n}, number of row clusters k.
1: Create the vector v = (μ_1, μ_2, ..., μ_m), where μ_i = (1/n) Σ_{j=1}^n a_{ij}.
2: Run the dynamic-programming algorithm of Lemma 3.11 on v and let I be the resulting k-clustering.
3: return (I, {1, ..., n}).

Theorem 3.12. Let A ∈ R^{m×n} and let I be the clustering produced under the ‖·‖_F norm by Algorithm 2. Then I has optimal cost. The running time of the algorithm is O(mn + m^2 k).
Proof. Consider the cost of a given cluster I with |I| = m_I rows. The mean μ_I of the cluster equals μ_I = (1/m_I) Σ_{i∈I} μ_i. The cost of the cluster is

Σ_{i∈I} Σ_{j=1}^n (a_{ij} − μ_I)^2 = Σ_{i∈I} Σ_{j=1}^n (μ_i + ε_{ij} − μ_I)^2 = Σ_{i∈I} Σ_{j=1}^n ε_{ij}^2 + n Σ_{i∈I} (μ_i − μ_I)^2,

since Σ_{j=1}^n ε_{ij} = 0 for all i, as we mentioned previously, so the cross terms vanish. Therefore, the cost of the entire clustering I = {I_1, ..., I_k} is

Σ_{i,j} ε_{ij}^2 + n Σ_{I∈I} Σ_{i∈I} (μ_i − μ_I)^2 = Σ_{i,j} ε_{ij}^2 + n (Σ_i μ_i^2 − Σ_{I∈I} m_I μ_I^2).    (3.15)

Consider now the one-dimensional problem of (k, 1)-clustering only the row means μ_i. The cost of a given cluster I is Σ_{i∈I} (μ_i − μ_I)^2 = Σ_{i∈I} μ_i^2 − m_I μ_I^2, and thus the cost of the clustering is Σ_i μ_i^2 − Σ_{I∈I} m_I μ_I^2. Compare the cost of this clustering with that of (3.15). In both cases the optimal row clustering is the one that maximizes the term Σ_{I∈I} m_I μ_I^2, as all the other terms are independent of the clustering. Thus we can optimally solve the problem for A ∈ R^{m×n} by solving the problem on the means vector. The time needed to create the vector of means is O(mn), and by applying Lemma 3.11 we conclude that we can solve the problem in time O(mn + m^2 k).
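The argument above says that the full (k, 1) cost of A and the one-dimensional cost on the row means differ only by terms independent of the clustering, so both objectives are minimized by the same row partition. This can be checked by brute force on a small random matrix (illustrative code of our own):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 4))
mu = A.mean(axis=1)  # row means: the vector that Algorithm 2 clusters

def full_cost(parts):
    """(k, 1)-co-clustering cost of A under ||.||_F^2 for a row partition."""
    return sum(((A[list(I)] - A[list(I)].mean()) ** 2).sum() for I in parts if I)

def means_cost(parts):
    """One-dimensional squared-error cost on the row means."""
    return sum(((mu[list(I)] - mu[list(I)].mean()) ** 2).sum() for I in parts if I)

def partition(mask):
    # split rows {0, ..., 4} into two parts according to the bits of mask
    return [[i for i in range(5) if (mask >> i) & 1 == b] for b in (0, 1)]

best_full = min(range(1, 2 ** 4), key=lambda m: full_cost(partition(m)))
best_means = min(range(1, 2 ** 4), key=lambda m: means_cost(partition(m)))
print(best_full == best_means)  # the same partition minimizes both objectives
```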

Hardness of the objective function
In this section we show that the problem of co-clustering an m × n matrix A is NP-hard when the number of clusters on the column side is at least n^ε, for any fixed 0 < ε < 1. While there are several results in the literature that show hardness of similar problems [31, 18, 4, 29], we are not aware of any previous result that proves the hardness of co-clustering for the objectives that we study in this paper.
Theorem 4.1. The problem of finding an optimal (k, ℓ)-co-clustering under the ℓ_1 norm of a matrix A ∈ R^{m×n} is NP-hard for (k, ℓ) = (k, n^ε), for any k ≥ 2 and any fixed 0 < ε < 1.
Proof. The proof contains several steps. First we reduce the one-sided s-median problem (where s = n/3 + o(n)) under the ℓ_1 norm to the (2, n/3 + o(n))-co-clustering problem for A ∈ R^{2×n}. We then reduce the latter problem to the case of A ∈ R^{m×n} and (k, n/3 + o(n)), and this, finally, to the case of (k, n^ε)-co-clustering. We now proceed with the details.
Hardness of (2, n/3 + o(n)). Megiddo and Supowit [27] show that the (one-sided) continuous p-median problem (i.e., cluster centers can be arbitrary points in space) is NP-hard under the ℓ_1 norm in R^2. By looking carefully at the pertinent proof we can see that the problem remains hard even if we restrict the number of clusters to be n/3 + o(n); here n is the number of points. Assume that we have such a problem instance of n points {b_j = (b_{j1}, b_{j2}); j = 1, ..., n} that we want to assign to ℓ clusters, ℓ = n/3 + o(n), so as to minimize the ℓ_1 norm. Specifically, we want to compute a partition J = {J_1, ..., J_ℓ} of {1, ..., n} and points μ_1, ..., μ_ℓ ∈ R^2 such that the objective

Σ_{J∈J} Σ_{j∈J} ‖b_j − μ_J‖_1    (4.1)

is minimized. We construct a co-clustering instance by forming the matrix A ∈ R^{2×n} with A_{ij} = b_{ji}, for i = 1, 2 and j = 1, ..., n (so the jth column of A is the point b_j), which we want to (2, ℓ)-co-cluster. Solving this problem is equivalent to solving the one-sided clustering problem. To provide the details: there is only one row clustering, I = {{1}, {2}} (strictly speaking, there is also the clustering {{1, 2}, {}}, which has higher cost unless b_{j1} = b_{j2} for all j). Consider the column clustering J = {J_1, ..., J_ℓ} and the corresponding center matrix M ∈ R^{2×ℓ}. The cost of the solution equals

Σ_{J∈J} Σ_{j∈J} (|b_{j1} − M_{1J}| + |b_{j2} − M_{2J}|).

Note that this expression is minimized when the point (M_{1J}, M_{2J}) is the coordinate-wise median of the points b_j, j ∈ J, in which case the cost equals that of (4.1) if the partitioning J is the same. Thus a solution to the co-clustering problem induces a solution to the one-sided problem. Therefore, solving the (2, ℓ)-co-clustering problem in R^{2×n} is NP-hard.
Hardness of (k, n/3 + o(n)), k > 2. The next step is to show that it is NP-hard to (k, ℓ)-co-cluster any matrix of dimensions m × n for any m, n, k > 2 and ℓ = n/3 + o(n). To prove this we use the previous hardness result for (2, ℓ)-co-clustering in R^{2×n}. We start with an instance of the (2, ℓ)-co-clustering problem in R^{2×n} given by a matrix A as described previously, and we create a matrix Ã = {Ã_{ij}} ∈ R^{m×n} by adding to the matrix A a total of m − 2 rows in which every entry equals 4B, where B is sufficiently large (say B > 8m · max{A_{ij}}).
Indeed, we can achieve a solution with the same cost as (4.2) by using the same column partitioning J together with a row partitioning that puts each of rows 1 and 2 into its own cluster and clusters the rest of the rows (where all the values equal 4B) arbitrarily. Notice that this is an optimal solution, since any other row clustering has the following form: either it puts rows 1 and 2 in the same cluster, or it puts row 1 or row 2 in the same cluster as some of the other rows. In the first case, let I with 1, 2 ∈ I be the row cluster and J be any column cluster. Then, if M_{IJ} is the median value of the entries in the cluster, we have that M_{IJ} > B/2, since half of the entries are at least B. Thus there will be a cell (1, j), j ∈ J, for which the residue is at least M_{IJ} − max{A_{ij}} > B/4, since B > 4 max{A_{ij}}. In the second case, since half of the elements of the cluster have value 4B, the median is at least 2B; therefore there is a residue of a cell Ã_{ij}, for i ∈ {1, 2} and j ∈ J, that is at least 2B − max{A_{ij}} ≥ B. Thus, in both cases, we incur a penalty of at least B/4, which is larger than the cost of (4.2), for B > 8m · max{A_{ij}}.
Hardness of (k, n^ε). The final step is to reduce a problem instance of finding a (k, ℓ′)-co-clustering of a matrix A′ ∈ R^{m×n′}, with ℓ′ = n′/3 + o(n′), to a problem instance of finding a (k, ℓ)-co-clustering of a matrix A ∈ R^{m×n}, with ℓ = n^ε, for any fixed 0 < ε < 1. The construction is similar to the previous one. Let A′ = {A′_{ij}}. Define n = (ℓ′ + 1)^{1/ε} and let A ∈ R^{m×n}. For n′ > 3^{1/(1−ε)} we have n > n′. Furthermore, n = O(n′^{1/ε}), hence the reduction is polynomial time. For 1 ≤ j ≤ n′, define A_{ij} = A′_{ij}, and for j > n′, define A_{ij} = B′, where B′ is some sufficiently large value (e.g., B′ > 4mn · max_{ij} |A′_{ij}|). Now we only need to prove that the optimal solution of a (k, ℓ′ + 1) = (k, n^ε)-co-clustering of A corresponds to the optimal solution of the (k, ℓ′)-co-clustering of A′. Assume that the optimal solution for matrix A′ is given by the partitions I′ = {I_1, ..., I_k} and J′ = {J_1, ..., J_{ℓ′}}. The cost of this solution is

    C′(I′, J′) = ∑_{I ∈ I′} ∑_{J ∈ J′} ∑_{i ∈ I} ∑_{j ∈ J} |A′_{ij} − M_{IJ}|,

where M_{IJ} is defined as the median of the values {A′_{ij}; i ∈ I, j ∈ J}.
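The effect of the padding columns can also be verified by brute force on a small example. The sketch below uses illustrative sizes (not the (ℓ′ + 1)^{1/ε} of the actual reduction) and checks that in the optimal co-clustering the B′ columns form a cluster of their own, so the solution restricts to an optimal co-clustering of A′:

```python
from statistics import median

# Toy check of the padding construction: columns of a huge constant B' are
# forced into their own cluster.  Sizes here are illustrative only.

def partitions(items, q):
    """All partitions of `items` into exactly q nonempty (unordered) parts."""
    if q == 1:
        yield [list(items)]
        return
    if len(items) == q:
        yield [[x] for x in items]
        return
    head, rest = items[0], list(items[1:])
    for p in partitions(rest, q - 1):
        yield [[head]] + p
    for p in partitions(rest, q):
        for i in range(len(p)):
            yield p[:i] + [[head] + p[i]] + p[i + 1:]

def cost(M, row_part, col_part):
    """l1 co-clustering cost with the optimal (median) value per block."""
    total = 0.0
    for I in row_part:
        for J in col_part:
            block = [M[i][j] for i in I for j in J]
            med = median(block)
            total += sum(abs(v - med) for v in block)
    return total

Aprime = [[0, 0, 10, 10],
          [0, 0, 10, 10]]
Bp = 4 * 2 * 6 * 10 + 1                     # B' > 4 m n max|A'_ij|
A = [row + [Bp, Bp] for row in Aprime]      # two padded columns, indices 4, 5

rows = [[0], [1]]                           # the only 2-partition of 2 rows
best_cost, best_cols = min((cost(A, rows, C), C)
                           for C in partitions(list(range(6)), 3))
assert sorted(map(sorted, best_cols)) == [[0, 1], [2, 3], [4, 5]]
assert best_cost == 0.0                     # restricts to the optimum of A'
```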
Let us compute the optimal solution for the (k, ℓ′ + 1)-co-clustering of A. First note that we can compute a solution (I, J) with cost C′(I′, J′). We let I = I′, and for J = {J_1, ..., J_{ℓ′+1}} we set J_j = J′_j for j ≤ ℓ′, and J_{ℓ′+1} = {n′ + 1, n′ + 2, ..., n}. For the matrix M we have M_{I J_j} = M_{I J′_j} for j ≤ ℓ′ and M_{I J_{ℓ′+1}} = B′.

In this work we give the first approximation algorithms for the co-clustering problem; our algorithms are simple and provide constant-factor approximations with respect to the optimum. We also show that the co-clustering problem is NP-hard, for a wide range of the input parameters. Finally, as a byproduct, we introduce the k-means_p problem, which generalizes the k-median and k-means problems, and give a constant-factor approximation algorithm for it.
Our work leads to several interesting questions. In Section 4 we showed that the co-clustering problem is NP-hard if ℓ = Ω(n^ε), under the ℓ1 and ℓ2 norms. The restriction on ℓ arises from the fact that the base hardness result for k-median by Megiddo and Supowit [27] only applies to Ω(n) clusters. Recent results by Dasgupta [9] and Aloise et al. [2] showed that k-means is hard already for k = 2 when the number of dimensions is part of the input. Mahajan et al. [26] and Vattani [33] showed hardness for constant dimension when k is part of the input. Analogous results for k-median remain unresolved. For co-clustering, even the hardness of the (2, 2) or the (O(1), O(1)) cases is, as far as we know, open. While we conjecture that these cases are hard, we do not yet have a proof. It would also be interesting to extend the hardness results to any ℓ_p norm, p ≥ 1; relevant here are the hardness results for k-center [14]. Similarly, it would be interesting to see whether one can obtain a PTAS for co-clustering under some norms.
Another question is whether the problem becomes easy for matrices A having a particular structure. For instance, if A is symmetric and k = ℓ, is it the case that the optimal co-clustering is also symmetric? The answer turns out to be negative, even if we restrict ourselves to 0/1-matrices, and the counterexample reveals some of the difficulty in co-clustering. Consider the matrix

    A = ( 1 1 0
          1 1 1
          0 1 1 ).

We are interested in a (2, 2)-co-clustering, say using ‖·‖_F. There are three symmetric solutions, I = J = {{1, 2}, {3}}, I = J = {{2, 3}, {1}}, and I = J = {{1, 3}, {2}}, and all have a cost of 1. Instead, the nonsymmetric solution (I, J) = ({{1}, {2, 3}}, {{1, 2}, {3}}) has a cost of √3/2. Therefore, even for symmetric matrices, one-sided clustering cannot be used to obtain the optimal co-clustering.
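The displayed matrix had to be reconstructed from the costs quoted in the text; the sketch below verifies that it indeed has those costs: every symmetric (2, 2)-co-clustering costs 1 under the Frobenius norm, while the nonsymmetric one costs √3/2.

```python
# Verify the (reconstructed) symmetric counterexample: all three symmetric
# (2,2)-co-clusterings cost 1, the nonsymmetric one costs sqrt(3)/2.

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

def sq_cost(row_part, col_part):
    """Squared Frobenius cost; each block is summarized by its mean."""
    total = 0.0
    for I in row_part:
        for J in col_part:
            block = [A[i][j] for i in I for j in J]
            mu = sum(block) / len(block)
            total += sum((v - mu) ** 2 for v in block)
    return total

# The three symmetric solutions (0-indexed):
sym = [[[0, 1], [2]], [[1, 2], [0]], [[0, 2], [1]]]
assert all(abs(sq_cost(P, P) - 1.0) < 1e-12 for P in sym)

# The nonsymmetric solution ({{1},{2,3}}, {{1,2},{3}}), 0-indexed:
nonsym = sq_cost([[0], [1, 2]], [[0, 1], [2]])
assert abs(nonsym - 0.75) < 1e-12          # cost sqrt(0.75) = sqrt(3)/2 < 1
```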
Another interesting direction is to find approximation algorithms for other commonly used objective functions for the co-clustering problem; our techniques do not seem to apply directly to any of them. As we mentioned before, the work of Banerjee et al. [3] unifies a number of such objectives and gives an expectation-maximization-style heuristic for these merit functions. It would be interesting to see whether, given an approximation algorithm for the clustering problem under a Bregman divergence, one can construct a co-clustering approximation algorithm from it. Another objective function for which our approach is not immediately applicable is (2.3) with the residual definition of (2.2). For several problem domains this class of objective functions may be more appropriate than the one we analyze here.

Figure 1: An example of co-clustering, where rows and columns that appear in the same cluster are placed next to each other. Here, A_{IJ} is represented by R_I M C_J = µ_{IJ} · 1_{m_I × n_J} as in (2.5); for instance, the 2 × 2 submatrix A_{I_2 J_3} is represented by R_{I_2} M C_{J_3} = µ_{I_2 J_3} · 1_{2×2}.

Algorithm 1: Co-cluster(A, k, ℓ)
Require: Matrix A ∈ R^{m×n}, number of row clusters k, number of column clusters ℓ.
1: Compute an α-approximate clustering of the row vectors with k clusters. Let Î be the resulting clustering.
2: Compute an α-approximate clustering of the column vectors with ℓ clusters. Let Ĵ be the resulting clustering.
3: return (Î, Ĵ).
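Algorithm 1 treats the one-sided clustering step as a black box: any α-approximate k-clustering routine can be plugged in. The sketch below is a minimal runnable version; as a stand-in for the black box it uses plain Lloyd iterations with farthest-first seeding, which by itself carries no approximation guarantee.

```python
# Runnable sketch of Algorithm 1: cluster the rows and the columns of A
# independently.  The one-sided clusterer here (Lloyd + farthest-first
# seeding) is a stand-in for any alpha-approximate clustering routine.

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lloyd(vectors, k, iters=20):
    # Farthest-first seeding, then standard Lloyd refinement.
    centers = [list(vectors[0])]
    while len(centers) < k:
        centers.append(list(max(vectors,
                                key=lambda v: min(dist2(v, c) for c in centers))))
    assign = [0] * len(vectors)
    for _ in range(iters):
        for idx, v in enumerate(vectors):
            assign[idx] = min(range(k), key=lambda c: dist2(v, centers[c]))
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centers[c] = [sum(coord) / len(members) for coord in zip(*members)]
    clusters = [[] for _ in range(k)]
    for i, c in enumerate(assign):
        clusters[c].append(i)
    return [c for c in clusters if c]

def cocluster(A, k, l):
    rows = A
    cols = [list(c) for c in zip(*A)]      # column vectors of A
    return lloyd(rows, k), lloyd(cols, l)  # cluster each side independently

A = [[0, 0, 9, 9],
     [0, 0, 9, 9],
     [9, 9, 0, 0]]
I, J = cocluster(A, 2, 2)
assert sorted(map(sorted, I)) == [[0, 1], [2]]
assert sorted(map(sorted, J)) == [[0, 1], [2, 3]]
```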

Figure 2: An example of a row clustering, where rows and columns that appear in the same cluster are placed next to each other. Here, A_I is represented by R_I · M as in (3.2).

Define c(I*) to be the cost

    c(I*) = ∑_{I ∈ I*} ∑_{i ∈ I} ‖a_i − µ_I‖_p^p,

with µ_I defined as in (3.1). Notice that this is the cost of the k-means_p objective raised to the pth power. Define also c(J*) to be the cost of the k-median objective:

    c(J*) = ∑_{J ∈ J*} ∑_{j ∈ J} ‖a^j − µ_J‖_p,

with µ_J defined as in (3.3). Using δ = 2^{p−1} from Lemma 3.3, the algorithm in Theorem 3.2 outputs a clustering J whose k-means_p cost is bounded by 2^{(2p+1)/p} c(J*)^{1/p}. Using Lemma 3.5, we then conclude that the k-means_p cost of I* is bounded by 2^{(3p+1)/p} c(I*)^{1/p} ≤ 16 (c(I*))^{1/p}.

THEORY OF COMPUTING, Volume 8 (2012), pp. 597–622
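The k-means_p objective above generalizes k-median (p = 1, coordinate-wise median centers) and k-means (p = 2, mean centers). A minimal sketch of the cost computation, restricted to p ∈ {1, 2} where the per-coordinate optimum has a closed form:

```python
from statistics import median

# Sketch of the k-means_p cost for p in {1, 2}: the optimal center is the
# coordinate-wise median (p = 1) or mean (p = 2); the reported cost is
# (sum over clusters of sum_i ||a_i - mu_I||_p^p) ^ (1/p).

def center(vectors, p):
    cols = list(zip(*vectors))
    if p == 1:
        return [median(c) for c in cols]   # k-median
    return [sum(c) / len(c) for c in cols]  # k-means

def kmeans_p_cost(vectors, partition, p):
    total = 0.0
    for I in partition:
        mu = center([vectors[i] for i in I], p)
        total += sum(sum(abs(a - b) ** p for a, b in zip(vectors[i], mu))
                     for i in I)
    return total ** (1.0 / p)

pts = [(0.0, 0.0), (2.0, 0.0)]
# p = 1: center (1.0, 0.0), cost |0-1| + |2-1| = 2
assert abs(kmeans_p_cost(pts, [[0, 1]], 1) - 2.0) < 1e-9
# p = 2: center (1.0, 0.0), cost sqrt(1 + 1) = sqrt(2)
assert abs(kmeans_p_cost(pts, [[0, 1]], 2) - 2 ** 0.5) < 1e-9
```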
ANIRBAN DASGUPTA did his undergraduate studies at the Computer Science department of IIT Kharagpur, and joined the Cornell CS department as a graduate student in 2000. Anirban finished his Ph.D. in 2006 under the supervision of John Hopcroft, having worked on spectral methods for learning mixtures of distributions. Since then, Anirban has been employed as a scientist at Yahoo! Research. His research interests span linear-algebraic techniques for information retrieval, algorithmic game theory, modeling of and algorithms for social networks, and the design and analysis of randomized and approximation algorithms in general.

RAVI KUMAR has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. He obtained his Ph.D. in Computer Science from Cornell University in 1998, under the guidance of Ronitt Rubinfeld; his thesis was on program checking. His primary interests are web and data mining, social networks, algorithms for large data sets, and theory of computation.