Fast Nonnegative Matrix Tri-Factorization for Large-Scale Data Co-Clustering

Hua Wang, Feiping Nie, Heng Huang, Fillia Makedon
Department of Computer Science and Engineering
University of Texas at Arlington, Arlington, Texas 76019, USA
huawangcs@gmail.com, feipingnie@gmail.com, heng@uta.edu, makedon@uta.edu

Abstract

Nonnegative Matrix Factorization (NMF) based co-clustering methods have attracted increasing attention in recent years because of their mathematical elegance and encouraging empirical results. However, the algorithms used to solve NMF problems usually involve intensive matrix multiplications, which make them computationally inefficient. In this paper, instead of constraining the factor matrices of NMF to be nonnegative as in existing methods, we propose a novel Fast Nonnegative Matrix Tri-factorization (FNMTF) approach that constrains them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the optimization problem of our approach can be decoupled into subproblems of much smaller size that require far fewer matrix multiplications, so our approach works well for large-scale input data. Moreover, the resulting factor matrices directly assign cluster labels to data points and features, due to the nature of indicator matrices. In addition, by exploiting the manifold structures in both the data and feature spaces, we further introduce the Locality Preserved FNMTF (LP-FNMTF) approach, which improves clustering performance. Promising results in extensive experimental evaluations validate the effectiveness of the proposed methods.

1 Introduction

Clustering, which partitions a data set into different groups in an unsupervised manner, is one of the most fundamental topics in statistical learning. Most traditional clustering algorithms are designed for one-side clustering [Nie et al., 2009], i.e., they cluster either data points or features. However, in many real-world applications, the analysis calls for two-side clustering results, i.e., grouping the data points and the features simultaneously, e.g., "documents" and "words" in document analysis, "users" and "items" in collaborative filtering, "samples" and "genes" in microarray data analysis, etc. Typically, instead of being independent, the clustering tasks on data and features are closely correlated, and it is challenging for traditional clustering algorithms to exploit this interdependence efficiently. Consequently, co-clustering techniques, which aim to cluster both data and features simultaneously by leveraging the interrelations between them, have been proposed in recent research. To name a few, Dhillon [Dhillon, 2001] introduced a bipartite spectral graph partitioning approach to co-cluster words and documents; Cho et al. [Cho et al., 2004] proposed to co-cluster experimental conditions and genes in microarray data by minimizing the sum-squared residue; Long et al. [Long et al., 2006] presented a relation summary network model to co-cluster heterogeneous data on a k-partite graph; and so on.

More recently, Ding et al. [Ding et al., 2005] explored the relationships between Nonnegative Matrix Factorization (NMF) [Lee and Seung, 1999; 2001] and K-means/spectral clustering, and proposed to use Nonnegative Matrix Tri-factorization (NMTF) [Ding et al., 2006] to co-cluster words and documents at the same time.
Due to its mathematical elegance and encouraging empirical results, the NMTF method has been further developed to address various aspects of co-clustering [Wang et al., 2008; Li et al., 2010; Gu and Zhou, 2009; Ding et al., 2010]. However, a notorious bottleneck of NMTF based co-clustering approaches is their slow computational speed, caused by the intensive matrix multiplications involved in each iteration of the solution algorithms, which makes these approaches hard to apply to large-scale data in real-world applications. In this paper, we propose a novel Fast Nonnegative Matrix Tri-factorization (FNMTF) approach to efficiently conduct co-clustering on large-scale data. Our new algorithms are interesting from the following perspectives:

• Instead of enforcing traditional nonnegative constraints on the factor matrices of NMTF, we constrain them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the clustering results of our approach are readily stored in the resulting factor matrices. In contrast, existing NMF based methods require an extra post-processing step to extract cluster structures from the factor matrices, which often leads to non-unique clustering results.

• Due to the nature of indicator matrices, the optimization problem of our approach can be decoupled into subproblems of much smaller size, and the decoupled subproblems involve far fewer matrix multiplications. Therefore, our approach is computationally efficient and scales well to large-scale input data.

• Taking into account the manifold structures in both the data and feature spaces, we further develop a Locality Preserved FNMTF (LP-FNMTF) approach that incorporates manifold regularization. An efficient algorithm to optimize the objective with quick convergence is presented.

• Promising experimental results on five benchmark data sets show that our approaches not only are faster than state-of-the-art co-clustering methods but also achieve competitive clustering performance.

Notations and problem formalization. Throughout this paper, we write matrices as boldface uppercase letters and vectors as boldface lowercase letters. Given a matrix M = (m_{ij}), its i-th row and j-th column are denoted as m_{i·} and m_{·j}, respectively.

Traditional clustering methods focus on one-side clustering, i.e., clustering the data side based on the similarities along the feature side. In the co-clustering problem, we cluster data points based on the distributions of features, and meanwhile cluster features based on the distributions of the data points. Formally, given a data set X = \{x_{\cdot i} \in \mathbb{R}^d\}_{i=1}^{n}, we write X = [x_{\cdot 1}, \ldots, x_{\cdot n}] = [x_{1\cdot}^T, \ldots, x_{d\cdot}^T]^T. Our goal is to group the data points \{x_{\cdot 1}, \ldots, x_{\cdot n}\} into c clusters \{C_j\}_{j=1}^{c}, and simultaneously group the features \{x_{1\cdot}, \ldots, x_{d\cdot}\} into m clusters \{W_j\}_{j=1}^{m}. We use a partition matrix G = [g_{1\cdot}^T, \ldots, g_{n\cdot}^T]^T \in \{0,1\}^{n \times c} to represent the clustering result of the data points, such that g_{ij} = 1 if x_{\cdot i} belongs to cluster C_j and g_{ij} = 0 otherwise. Similarly, we use another partition matrix F = [f_{1\cdot}^T, \ldots, f_{d\cdot}^T]^T \in \{0,1\}^{d \times m} to represent the clustering results of the features. We call F and G cluster indicator matrices, because each of their rows, i.e., f_{i\cdot} (1 ≤ i ≤ d) or g_{i\cdot} (1 ≤ i ≤ n), has one and only one element equal to 1 to indicate the cluster membership, while the remaining elements are 0. We denote the set of all cluster indicator matrices as Ψ.
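As a concrete illustration of the notation, the following minimal numpy sketch builds a cluster indicator matrix in Ψ from a vector of cluster labels; the helper name labels_to_indicator is ours and not part of the paper.

```python
import numpy as np

def labels_to_indicator(labels, n_clusters):
    """Build a cluster indicator matrix in Psi: each row has exactly one 1."""
    indicator = np.zeros((len(labels), n_clusters))
    indicator[np.arange(len(labels)), labels] = 1
    return indicator

# Example: n = 5 data points assigned to c = 3 clusters.
G = labels_to_indicator(np.array([0, 2, 1, 0, 2]), 3)   # G is a 5 x 3 indicator matrix
assert (G.sum(axis=1) == 1).all()                        # one and only one 1 per row
```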
2 Fast nonnegative matrix tri-factorization (FNMTF) for co-clustering

In this section, we first briefly review the background of co-clustering using NMTF, which motivates the optimization objective of our approach. After that, an efficient algorithm to solve our objective is introduced.

2.1 Objective of FNMTF approach

K-means clustering is a standard clustering method in statistical learning, which minimizes the following objective:

J_1 = \sum_{j=1}^{c} \sum_{x_{\cdot i} \in C_j} \|x_{\cdot i} - c_{\cdot j}\|^2 = \sum_{j=1}^{c} \sum_{i=1}^{n} g_{ij} \|x_{\cdot i} - c_{\cdot j}\|^2, \quad \text{s.t. } G \in \Psi^{n \times c},   (1)

where ‖·‖ denotes the Frobenius norm of a matrix and c_{\cdot j} is the j-th centroid of the data set. Because G ∈ Ψ is a cluster indicator matrix, minimizing J_1 is a combinatorial optimization problem, which is hard to solve in general. Therefore, the minimization of J_1 is often relaxed to the maximization of the following objective [Zha et al., 2001; Ding and He, 2004]:

J_2 = \mathrm{tr}\left(G^T X^T X G\right), \quad \text{s.t. } G^T G = I.   (2)

Note that G in J_2 is no longer an indicator matrix, but an arbitrary orthonormal matrix.

Recently, Ding et al. [Ding et al., 2005] proved the equivalence between the relaxed objective of K-means clustering in Eq. (2) and the NMF objective when orthonormal constraints are enforced on the factor matrices, which minimizes:

J_3 = \|X - F G^T\|^2, \quad \text{s.t. } F \ge 0, \; G \ge 0, \; F^T F = I, \; G^T G = I,   (3)

where X \in \mathbb{R}_+^{d \times n}, F \in \mathbb{R}_+^{d \times c} and G \in \mathbb{R}_+^{n \times c}, and J_3 aims to approximate the nonnegative data matrix X by the product of F and G. The orthonormal constraints ensure the uniqueness (up to a permutation) of the solution, and together with the nonnegativity constraints make the resulting F and G approximate the K-means clustering results on both features and data points (called "soft labels") [Ding et al., 2005; 2006]. The latter property, simultaneously clustering the rows (features) and the columns (data points) of an input data matrix, is one of the main strengths of the NMF defined in Eq. (3).

Because the two-factor NMF in Eq. (3) is restrictive and often gives a rather poor low-rank matrix approximation, one more factor S \in \mathbb{R}_+^{m \times c} was introduced to absorb the different scales of X, F and G. This leads to NMTF [Ding et al., 2006]: X \approx F S G^T, where F \in \mathbb{R}_+^{d \times m} and G \in \mathbb{R}_+^{n \times c}. S provides increased degrees of freedom such that the low-rank matrix representation remains accurate, while F gives row clusters and G gives column clusters. In order to achieve additional flexibility in clustering scenarios, the nonnegativity constraint on X (and thereby on S) can be relaxed [Ding et al., 2010], which leads to the semi-NMTF problem of minimizing the following objective:

J_4 = \|X - F S G^T\|^2, \quad \text{s.t. } F \ge 0, \; G \ge 0, \; F^T F = I, \; G^T G = I.   (4)

Despite its mathematical elegance, Eq. (4) suffers from two problems that impede its practical use. First, similar to Eq. (2), the relaxations on F and G mean that the immediate outputs of Eq. (4) are not cluster labels; an additional post-processing step is required, which often leads to non-unique solutions. Second, and more importantly, Eq. (4) is usually solved by alternating iterative algorithms, and each iteration involves intensive matrix multiplications [Ding et al., 2005; 2006; 2010; Wang et al., 2008; Li et al., 2010; Gu and Zhou, 2009]. As a result, it is infeasible to apply such algorithms to large-scale real-world data due to the expensive computational cost.

In order to tackle these difficulties, instead of solving the relaxed clustering problems as in Eqs. (2)–(4), we solve the original clustering problem similar to Eq. (1). Specifically, we constrain the factor matrices of NMTF to be cluster indicator matrices and minimize the following objective:

J_5 = \|X - F S G^T\|^2, \quad \text{s.t. } F \in \Psi^{d \times m}, \; G \in \Psi^{n \times c}.   (5)

We call Eq. (5) the proposed Fast Nonnegative Matrix Tri-factorization (FNMTF) approach.

Note that the orthonormal constraints on F and G are removed in our objective, as their purposes (unique solution and labeling approximation) are automatically accomplished by the new constraints. Surprisingly, with these new constraints, though more stringent, the computational speed of our approach can be significantly improved, as shown theoretically shortly in this section and empirically later in Section 4.
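To make the objective in Eq. (5) concrete, here is a minimal numpy sketch (ours, not from the paper) that evaluates J_5 for given factors; the function name fnmtf_objective is introduced only for illustration.

```python
import numpy as np

def fnmtf_objective(X, F, S, G):
    """Squared Frobenius reconstruction error ||X - F S G^T||^2 of Eq. (5).

    X: d x n data matrix; F: d x m and G: n x c cluster indicator matrices;
    S: m x c factor absorbing the different scales of X, F and G.
    """
    return np.linalg.norm(X - F @ S @ G.T, 'fro') ** 2
```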
2.2 An efficient optimization algorithm

Following standard optimization procedures, we alternately solve for the three variables F, S and G of J_5 in Eq. (5).

First, fixing F and G, and setting the derivative of J_5 with respect to S to zero, we have

S = \left(F^T F\right)^{-1} F^T X G \left(G^T G\right)^{-1}.   (6)

Second, when F and S are fixed, the optimization problem for G can be decoupled, and we solve the following simpler problem for each i (1 ≤ i ≤ n):

\min_{G \in \Psi} \|x_{\cdot i} - F S g_{i\cdot}^T\|^2.   (7)

Because g_{i\cdot} (1 ≤ i ≤ n) ∈ Ψ^{1×c} is a cluster indicator vector in which one and only one element is 1 and the rest are zeros, the solution to Eq. (7) can be easily obtained by:

g_{ij} = 1 if j = \arg\min_k \|x_{\cdot i} - \tilde{f}_{\cdot k}\|^2, and g_{ij} = 0 otherwise,   (8)

where \tilde{F} = F S and \tilde{f}_{\cdot k} is the k-th column of \tilde{F}. Note that Eq. (8) simply enumerates the c vector norms and picks the minimum one, without involving any matrix multiplication.

Finally, when G and S are fixed, the optimization problem for F can be similarly decoupled, and we solve the following simpler problem for each j (1 ≤ j ≤ d):

\min_{F \in \Psi} \|x_{j\cdot} - f_{j\cdot} S G^T\|^2.   (9)

Again, since f_{j\cdot} (1 ≤ j ≤ d) ∈ Ψ^{1×m} is a cluster indicator vector for the feature side, the solution to Eq. (9) is:

f_{jl} = 1 if l = \arg\min_k \|x_{j\cdot} - \tilde{g}_{k\cdot}\|^2, and f_{jl} = 0 otherwise,   (10)

where \tilde{G}^T = S G^T and \tilde{g}_{k\cdot} is the k-th row of \tilde{G}^T.

The procedure to solve J_5 is summarized in Algorithm 1. Due to the nature of alternating optimization, Algorithm 1 is guaranteed to converge to a local minimum (existing NMF algorithms [Ding et al., 2005; 2006; 2010] also converge only to a local minimum, because the objectives J_3 and J_4 are not convex in both variables F and G); the proof is skipped due to the space limit. As can be seen, in step 2 and step 3 the solutions are obtained by enumerating vector norms, which is much faster than the matrix multiplications used in the existing NMF methods. Upon solution, G directly gives the clustering results of the data points, and F directly gives the clustering results of the features.

Algorithm 1: Algorithm to solve J_5 in Eq. (5).
Input: Data matrix X = [x_{\cdot 1}, \ldots, x_{\cdot n}] ∈ R^{d×n}.
Initialize G ∈ Ψ^{n×c} and F ∈ Ψ^{d×m} with arbitrary cluster indicator matrices;
repeat
  1. Calculate S by Eq. (6);
  2. Calculate G by Eq. (8);
  3. Calculate F by Eq. (10);
until convergence;
Output: Indicator matrices G for data point clustering and F for feature clustering.
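The following is a compact numpy sketch of Algorithm 1 under the update rules in Eqs. (6), (8) and (10). It is a sketch rather than the authors' implementation: the random indicator initialization and the use of pseudo-inverses to guard against temporarily empty clusters are our assumptions.

```python
import numpy as np

def fnmtf(X, m, c, n_iter=100, seed=0):
    """Sketch of Algorithm 1: alternate the updates of S (Eq. 6), G (Eq. 8), F (Eq. 10).

    X is a d x n data matrix; returns the indicator matrices F (d x m, feature
    clusters), the scale matrix S (m x c), and G (n x c, data point clusters).
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    F = np.eye(m)[rng.integers(m, size=d)]   # arbitrary initial indicator matrices
    G = np.eye(c)[rng.integers(c, size=n)]
    for _ in range(n_iter):
        # Eq. (6); pinv instead of inv guards against empty clusters (our assumption)
        S = np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G)
        # Eq. (8): assign each data point to the nearest column of F~ = F S
        Ft = F @ S                                                     # d x c
        dist_g = ((X[:, :, None] - Ft[:, None, :]) ** 2).sum(axis=0)   # n x c
        G = np.eye(c)[np.argmin(dist_g, axis=1)]
        # Eq. (10): assign each feature to the nearest row of G~^T = S G^T
        Gt = S @ G.T                                                   # m x n
        dist_f = ((X[:, None, :] - Gt[None, :, :]) ** 2).sum(axis=2)   # d x m
        F = np.eye(m)[np.argmin(dist_f, axis=1)]
    return F, S, G
```

A call such as F, S, G = fnmtf(X, m=10, c=10) would return the feature and data cluster indicator matrices directly, which reflects the point made above that no post-processing step is needed to read off cluster labels.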
3 Locality preserved FNMTF (LP-FNMTF)

Recent research has shown that many real-world data are sampled from nonlinear manifolds embedded in the high-dimensional ambient space [Belkin and Niyogi, 2002]. However, similar to traditional NMF and NMTF, the proposed FNMTF approach assumes that the data points and features are sampled from Euclidean spaces, and it fails to discover the intrinsic geometrical and discriminative structures of the data and features. Therefore, we further develop our FNMTF approach and propose the Locality Preserved FNMTF (LP-FNMTF) approach, which enforces two geometrically based regularizers on both the data and feature sides.

3.1 Manifold regularization and the optimization objective of LP-FNMTF method

Because we co-cluster an input data matrix along both the data and feature dimensions, we consider two undirected graphs, one constructed from the data points, denoted as G_d, and the other from the features, denoted as G_f. The corresponding affinity matrices W_d and W_f can either be computed from the input data matrix X (e.g., as in [Gu and Zhou, 2009]) or obtained from prior knowledge. According to the manifold assumption [Belkin and Niyogi, 2002], the regularization terms that measure the smoothness with respect to the intrinsic manifolds of the data points and features are given by [Cai et al., 2008; Gu and Zhou, 2009]:

\min_{G \in \Psi} \mathrm{tr}\left(G^T L_d G\right), \quad \text{and} \quad \min_{F \in \Psi} \mathrm{tr}\left(F^T L_f F\right),   (11)

where L_d = I - D_d^{-1/2} W_d D_d^{-1/2} is the normalized graph Laplacian of G_d and D_d is the degree matrix of G_d; similarly, L_f = I - D_f^{-1/2} W_f D_f^{-1/2}, where D_f is the degree matrix of G_f.

Incorporating Eq. (11) into Eq. (5), the objective of the LP-FNMTF approach is to minimize:

J_6 = \|X - F S G^T\|^2 + \alpha \, \mathrm{tr}\left(G^T L_d G\right) + \beta \, \mathrm{tr}\left(F^T L_f F\right), \quad \text{s.t. } F \in \Psi^{d \times m}, \; G \in \Psi^{n \times c},   (12)

where α and β are regularization parameters that balance the co-clustering reconstruction error in the first term against the labeling smoothness in the data point space and the feature space in the second and third terms, respectively.

Because F and G are constrained to be cluster indicator matrices, it is difficult to solve Eq. (12) in general. Hence we simplify this problem using the following proposition.

Proposition 1. Given a symmetric matrix A and its eigendecomposition A = P Σ P^T, where Σ ∈ R^{c×c} is a diagonal matrix whose diagonal elements are the c largest eigenvalues and P is the corresponding eigenvector matrix, the following two optimization problems are equivalent:

(P1): \min_{C \in \Psi} \mathrm{tr}\left[C^T (I - A) C\right],   (13)

(P2): \min_{C \in \Psi, \; Q^T Q = I} \|C - B Q\|^2,   (14)

where Q is an arbitrary orthonormal matrix and

B = P \Sigma^{1/2}.   (15)

Proof. (P1) is equivalent to \max_{C \in \Psi} \mathrm{tr}\left(C^T A C\right), which is further equivalent to \min_{C \in \Psi} \|C C^T - A\|^2. By definition, the low-rank approximation of A is given by B Q (B Q)^T, and thus (P1) becomes \min_{C \in \Psi, Q^T Q = I} \|C C^T - B Q (B Q)^T\|^2. C approximating B Q is equivalent to C C^T approximating B Q (B Q)^T. Hence, solving (P1) in Eq. (13) can reasonably be transformed into solving (P2) in Eq. (14), which completes the proof of Proposition 1. □

Applying Proposition 1 to Eq. (12), the objective of our LP-FNMTF approach is transformed into minimizing:

J_7 = \|X - F S G^T\|^2 + \alpha \|G - B_d Q_d\|^2 + \beta \|F - B_f Q_f\|^2, \quad \text{s.t. } F \in \Psi^{d \times m}, \; G \in \Psi^{n \times c}, \; Q_d^T Q_d = I, \; Q_f^T Q_f = I,   (16)

where B_d and B_f are computed from L_d and L_f following the procedure described in Proposition 1.
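As an illustration of how B_d and B_f can be obtained, the sketch below (ours, not the authors' code) forms the normalized affinity A = D^{-1/2} W D^{-1/2}, takes its c largest eigenpairs, and returns B = P Σ^{1/2} as in Proposition 1; the guard for isolated nodes and the clipping of negative eigenvalues before the square root are our assumptions.

```python
import numpy as np

def locality_basis(W, c):
    """B = P Sigma^{1/2} from Proposition 1, given a symmetric affinity matrix W.

    A = D^{-1/2} W D^{-1/2} is the matrix whose normalized Laplacian L = I - A
    appears in Eq. (11); P holds the eigenvectors of the c largest eigenvalues.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard for isolated nodes
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
    P = eigvecs[:, -c:]                                  # eigenvectors of the c largest eigenvalues
    Sigma = np.clip(eigvals[-c:], 0.0, None)             # clip tiny negatives before the square root
    return P * np.sqrt(Sigma)                            # B = P Sigma^{1/2}
```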
3.2 Optimization algorithm

Again, we use the alternating iterative method to solve Eq. (16). We first introduce the following theorem.

Theorem 1. Let H = B^T C and let the Singular Value Decomposition (SVD) of H be given by H = U Λ V^T. Fixing C and B, the optimal Q for problem (P2) defined in Eq. (14) is given by Q = U V^T.

Proof. When C is fixed, problem (P2) in Eq. (14) is equivalent to \max_{Q^T Q = I} \mathrm{tr}\left(Q^T H\right). We have

\mathrm{tr}\left(Q^T H\right) = \mathrm{tr}\left(Q^T U \Lambda V^T\right) = \mathrm{tr}\left(\Lambda V^T Q^T U\right) = \mathrm{tr}\left(\Lambda Z\right) = \sum_i \lambda_{ii} z_{ii},

where Z = V^T Q^T U, and λ_{ii} and z_{ii} are the (i, i)-th entries of Λ and Z, respectively. Note that Z is orthonormal, i.e., Z^T Z = I, and thus z_{ii} ≤ 1. On the other hand, λ_{ii} ≥ 0 because λ_{ii} is a singular value of H. Therefore, \mathrm{tr}\left(Q^T H\right) = \sum_i \lambda_{ii} z_{ii} \le \sum_i \lambda_{ii}, and the equality holds when z_{ii} = 1 (1 ≤ i ≤ c), that is, when Z = I. Recalling that Z = V^T Q^T U, the solution to \max_{Q^T Q = I} \mathrm{tr}\left(Q^T H\right), and hence to (P2), is Q = U Z^T V^T = U V^T. Theorem 1 is proved. □

Now we solve Eq. (16). First, fixing F and G, and setting the derivative of J_7 with respect to S to zero, we obtain

S = \left(F^T F\right)^{-1} F^T X G \left(G^T G\right)^{-1}.   (17)

Second, fixing F, G and S, we can decouple Eq. (16) into the following two subproblems:

\min_{Q_d^T Q_d = I} \|G - B_d Q_d\|^2, \quad \min_{Q_f^T Q_f = I} \|F - B_f Q_f\|^2.   (18)

Applying Theorem 1, Q_d = U_d V_d^T, where U_d and V_d are obtained by the SVD of B_d^T G; and Q_f = U_f V_f^T, where U_f and V_f are obtained by the SVD of B_f^T F.

Third, we fix S, F and Q_d to update G. Because G is a cluster indicator matrix, Eq. (16) decouples into the following simpler problem for each i (1 ≤ i ≤ n):

\min_{G \in \Psi} \|x_{\cdot i} - \tilde{F} g_{i\cdot}^T\|^2 + \alpha \|g_{i\cdot} - (\tilde{b}_d)_{i\cdot}\|^2,   (19)

where \tilde{F} = F S and (\tilde{b}_d)_{i\cdot} denotes the i-th row of \tilde{B}_d = B_d Q_d. Thus, the solution is obtained by

g_{ij} = 1 if j = \arg\min_k \left( \|x_{\cdot i} - \tilde{f}_{\cdot k}\|^2 - 2\alpha (\tilde{B}_d)_{ik} \right), and g_{ij} = 0 otherwise.   (20)

Finally, fixing S, G and Q_f, and letting \tilde{B}_f = B_f Q_f and \tilde{G}^T = S G^T with \tilde{g}_{k\cdot} the k-th row of \tilde{G}^T, we similarly obtain F as:

f_{jl} = 1 if l = \arg\min_k \left( \|x_{j\cdot} - \tilde{g}_{k\cdot}\|^2 - 2\beta (\tilde{B}_f)_{jk} \right), and f_{jl} = 0 otherwise.   (21)

The procedure to solve J_7 is summarized in Algorithm 2. Again, because step 4 and step 5 only involve enumerating vector norms without matrix multiplication, our algorithm is computationally efficient. Empirical results show that the convergence of our algorithm is fast, which makes it feasible to solve large-scale real-world problems with our approach in practice.

Algorithm 2: Algorithm to solve J_7 in Eq. (16).
Input: Data matrix X = [x_{\cdot 1}, \ldots, x_{\cdot n}] ∈ R^{d×n}.
1. Initialize G ∈ Ψ^{n×c} and F ∈ Ψ^{d×m} with arbitrary cluster indicator matrices;
2. Calculate B_d and B_f from L_d and L_f following the description of Proposition 1;
repeat
  1. Calculate S by Eq. (17);
  2. Calculate Q_d = U_d V_d^T, where U_d and V_d are obtained by the SVD of B_d^T G;
  3. Calculate Q_f = U_f V_f^T, where U_f and V_f are obtained by the SVD of B_f^T F;
  4. Calculate G by Eq. (20);
  5. Calculate F by Eq. (21);
until convergence;
Output: Cluster indicator matrices G and F for the data and feature clustering tasks, respectively.
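Two of the updates above have short closed forms that are easy to get wrong by a sign or a transpose, so here is a small numpy sketch of them (ours, under the notation above): the orthogonal step of Theorem 1 used in steps 2 and 3 of Algorithm 2, and the regularized data-cluster assignment of Eq. (20); the function names are our own.

```python
import numpy as np

def optimal_rotation(B, C):
    """Theorem 1: the orthonormal Q minimizing ||C - B Q||^2 is Q = U V^T,
    where U Lambda V^T is the SVD of H = B^T C."""
    U, _, Vt = np.linalg.svd(B.T @ C)
    return U @ Vt

def assign_data_clusters(X, F_tilde, Bd_tilde, alpha):
    """Eq. (20): give data point i the cluster k minimizing
    ||x_i - f~_k||^2 - 2 * alpha * (B~_d)_{ik}; returns the n x c indicator G."""
    dist = ((X[:, :, None] - F_tilde[:, None, :]) ** 2).sum(axis=0)   # n x c distances
    labels = np.argmin(dist - 2.0 * alpha * Bd_tilde, axis=1)
    return np.eye(F_tilde.shape[1])[labels]
```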
4 Experiments

In this section, we evaluate the proposed FNMTF and LP-FNMTF approaches, and compare them against state-of-the-art (co-)clustering methods, including the Semi-NMF (SNMF) [Ding et al., 2010], Orthogonal NMTF (ONMTF) [Ding et al., 2006], Graph regularized NMF (GNMF) [Cai et al., 2008] and Dual Regularized Co-clustering (DRCC) [Gu and Zhou, 2009] methods. We also report the clustering results of the K-means and NMF [Lee and Seung, 2001] methods as baselines.

In our experiments, we use 5 data sets to evaluate the compared methods, which are summarized in Table 1. The first four are widely used as benchmarks in the clustering literature [Ding et al., 2006; Cai et al., 2008; Gu and Zhou, 2009; Ding et al., 2010]. The last one has a very large sample size and feature size [Chen et al., 2010]. In order to run the experiments on contemporary computers, for the RCV1 data set, following previous studies, we remove the keywords (features) appearing fewer than 100 times in the corpus, which leaves 1979 (out of 47236) keywords in our experiments.

Table 1: Description of experimental data sets.
Data set   # samples   # features   # classes
Coil20     1140        1024         20
WebKB      4199        1000         4
WebACE     2340        1000         20
CSTR       476         1000         4
RCV1       193844      1979         103

To evaluate the clustering results, we adopt three widely used standard metrics: clustering accuracy [Cai et al., 2008], normalized mutual information (NMI) [Cai et al., 2008] and cluster purity [Ding et al., 2006].
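The paper does not spell out how these metrics are computed, so the sketch below shows one common way to obtain them, using scipy's Hungarian matching for accuracy and scikit-learn for NMI; treat the helper names and the exact conventions as our assumptions rather than the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy after the best one-to-one matching of predicted clusters to
    true classes (Hungarian algorithm on the confusion matrix)."""
    k = max(y_true.max(), y_pred.max()) + 1
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[p, t] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize matched counts
    return confusion[rows, cols].sum() / len(y_true)

def cluster_purity(y_true, y_pred):
    """Fraction of points that fall in the majority true class of their cluster."""
    matched = 0
    for cluster in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == cluster], return_counts=True)
        matched += counts.max()
    return matched / len(y_true)

# NMI is available directly as normalized_mutual_info_score(y_true, y_pred).
```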
4.1 Clustering results

Experimental setup. Because each clustering algorithm has one or more parameters to be tuned, in order to compare fairly, we run these algorithms under different parameter settings and select the best average result for each one. We set the number of clusters to the true number of classes for all clustering algorithms on all data sets. For the co-clustering methods, including ONMTF, DRCC and our two methods, the number of feature clusters is set to be the same as that of data clusters, i.e., m = c. For the manifold regularized methods, including GNMF, DRCC and our LP-FNMTF method, we construct the nearest-neighbor graph following [Gu and Zhou, 2009], where the neighborhood size for graph construction is set by searching the grid {1, 2, ..., 10}, and the regularization parameters (i.e., α and β in Eq. (16)) are set by searching the grid {0.1, 1, 10, 100, 500, 1000}. Given the number of clusters, no parameter selection is needed for the K-means, NMF and SNMF methods.

Results. Under each parameter setting of every compared method, we repeat clustering 50 times and compute the average result. We report the best average result for each method on the five data sets in Table 2.

Table 2: Clustering results measured by accuracy/NMI/purity of the compared methods.
Data    Metric    Kmeans  NMF    SNMF   ONMTF  GNMF   DRCC   FNMTF  LP-FNMTF
Coil20  Accuracy  0.495   0.487  0.527  0.635  0.665  0.680  0.696  0.725
        NMI       0.489   0.479  0.511  0.561  0.549  0.566  0.584  0.621
        Purity    0.441   0.437  0.468  0.478  0.481  0.483  0.497  0.512
WebKB   Accuracy  0.698   0.668  0.621  0.685  0.717  0.725  0.774  0.792
        NMI       0.467   0.427  0.418  0.455  0.458  0.487  0.501  0.522
        Purity    0.601   0.595  0.604  0.664  0.672  0.674  0.696  0.712
WebACE  Accuracy  0.526   0.514  0.527  0.635  0.665  0.680  0.696  0.714
        NMI       0.519   0.512  0.538  0.587  0.556  0.571  0.604  0.611
        Purity    0.479   0.481  0.491  0.487  0.511  0.493  0.517  0.532
CSTR    Accuracy  0.763   0.759  0.699  0.771  0.742  0.812  0.894  0.847
        NMI       0.654   0.668  0.614  0.673  0.635  0.681  0.753  0.722
        Purity    0.612   0.587  0.614  0.645  0.637  0.656  0.701  0.682
RCV1    Accuracy  0.168   0.156  0.173  0.196  0.201  0.213  0.238  0.241
        NMI       0.274   0.274  0.283  0.301  0.312  0.318  0.339  0.342
        Purity    0.134   0.121  0.139  0.146  0.152  0.159  0.172  0.179

From the results in Table 2, we can see that the proposed FNMTF and LP-FNMTF methods consistently outperform the other compared methods, sometimes very significantly, which demonstrates the advantage of our approaches in terms of clustering performance. A more careful examination of the results shows that the co-clustering methods, including ONMTF, DRCC and our proposed FNMTF and LP-FNMTF methods, generally achieve better clustering results, which is consistent with the widely accepted hypothesis that clustering of features can help clustering of data points. Finally, the LP-FNMTF method is superior to the FNMTF method on all the data sets except CSTR. This indicates that exploiting the geometric structures in the data and feature spaces can indeed improve clustering performance, which verifies the manifold assumption and confirms the correctness of our algorithms.

4.2 Studies of computational speeds

In this subsection, we evaluate the computational speeds of the compared (co-)clustering methods. All our experiments are performed on a Dell PowerEdge 2900 server, which has two quad-core Intel Xeon 5300-series processors at 3.0 GHz and 48 GB of memory.

We first examine the convergence rates of the compared methods. We repeat clustering 50 times with each method, using its optimal parameters on each data set. The average numbers of iterations of each method on each data set are reported in Table 3. The results show that the proposed FNMTF and LP-FNMTF approaches require far fewer iterations to converge, and therefore they are more computationally efficient.

Table 3: Average numbers of iterations to converge of the compared clustering methods.
Method     Coil20  WebKB  WebACE  CSTR  RCV1
Kmeans     26.4    29.2   28.3    24.2  87.1
NMF        30.3    40.2   38.5    30.1  92.5
SNMF       37.7    45.5   44.1    36.3  98.6
ONMTF      41.2    50.3   49.2    40.3  104.4
GNMF       50.2    62.1   61.1    46.8  112.5
DRCC       51.6    64.4   62.3    47.9  129.1
FNMTF      14.1    16.5   15.9    14.3  45.2
LP-FNMTF   15.2    17.2   16.3    15.6  48.1

In addition, we also report the average convergence time of the compared methods on the WebKB and RCV1 data sets in Figure 1.

[Figure 1: Convergence time (ms) of the compared clustering methods on (a) the WebKB data set and (b) the RCV1 data set.]

From the results, we can see that our FNMTF and LP-FNMTF methods are only slower than the K-means method, while being much faster than all the other state-of-the-art clustering methods. These results are consistent with our theoretical analysis that our methods operate on subproblems of much smaller size and use far fewer matrix multiplications. Therefore, our approaches are suitable for clustering on large-scale data. The results on the three other, smaller data sets are not shown due to the space limit; the same observations hold for them.

5 Conclusions

In this work, we proposed a novel Fast Nonnegative Matrix Tri-factorization (FNMTF) method to simultaneously cluster both the data side and the feature side of an input data matrix. We adopt the idea of NMF/NMTF based co-clustering methods, but constrain the factor matrices to be cluster indicator matrices, a special type of nonnegative matrix. Through the new constraints, the optimization problem of our method is decoupled into a number of much smaller subproblems that require far fewer matrix multiplications than the existing NMF based co-clustering algorithms, which makes our approaches particularly useful for real-world large-scale data. We further developed our method to incorporate manifold information and proposed the Locality Preserved FNMTF (LP-FNMTF) method. We conducted extensive experiments on benchmark data sets that demonstrate promising results of our methods, which is consistent with our theoretical analysis.

Acknowledgments

This research was supported by NSF-CNS 0923494, NSF-IIS 1041637, NSF-CNS 1035913.

References

[Belkin and Niyogi, 2002] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.

[Cai et al., 2008] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In ICDM, 2008.
[Chen et al., 2010] W.Y. Chen, Y. Song, H. Bai, C.J. Lin, and E.Y. Chang. Parallel spectral clustering in distributed systems. IEEE TPAMI, 2010.

[Cho et al., 2004] H. Cho, I.S. Dhillon, Y. Guan, and S. Sra. Minimum sum-squared residue co-clustering of gene expression data. In SDM, 2004.

[Dhillon, 2001] I.S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In SIGKDD, 2001.

[Ding and He, 2004] C. Ding and X. He. K-means clustering via principal component analysis. In ICML, 2004.

[Ding et al., 2005] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, 2005.

[Ding et al., 2006] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In SIGKDD, 2006.

[Ding et al., 2010] C. Ding, T. Li, and M.I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE TPAMI, 32(1):45–55, 2010.

[Gu and Zhou, 2009] Q. Gu and J. Zhou. Co-clustering on manifolds. In SIGKDD, 2009.

[Lee and Seung, 1999] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[Lee and Seung, 2001] D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.

[Li et al., 2010] T. Li, V. Sindhwani, C. Ding, and Y. Zhang. Bridging domains with words: Opinion analysis with matrix tri-factorizations. In SDM, 2010.

[Long et al., 2006] B. Long, X. Wu, Z.M. Zhang, and P.S. Yu. Unsupervised learning on k-partite graphs. In SIGKDD, 2006.

[Nie et al., 2009] Feiping Nie, Dong Xu, Ivor W. Tsang, and Changshui Zhang. Spectral embedded clustering. In IJCAI, pages 1181–1186, 2009.

[Wang et al., 2008] F. Wang, T. Li, and C. Zhang. Semi-supervised clustering via matrix factorization. In SDM, 2008.

[Zha et al., 2001] Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming Gu. Spectral relaxation for k-means clustering. In NIPS, 2001.