Fast Nonnegative Matrix Tri-Factorization for Large-Scale Data Co-Clustering

Hua Wang, Feiping Nie, Heng Huang, Fillia Makedon
Department of Computer Science and Engineering
University of Texas at Arlington, Arlington, Texas 76019, USA
huawangcs@gmail.com, feipingnie@gmail.com, heng@uta.edu, makedon@uta.edu

Abstract

Nonnegative Matrix Factorization (NMF) based co-clustering methods have attracted increasing attention in recent years because of their mathematical elegance and encouraging empirical results. However, the algorithms used to solve NMF problems usually involve intensive matrix multiplications, which make them computationally inefficient. In this paper, instead of constraining the factor matrices of NMF to be nonnegative as in existing methods, we propose a novel Fast Nonnegative Matrix Tri-factorization (FNMTF) approach that constrains them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the optimization problem of our approach can be decoupled into subproblems of much smaller size that require far fewer matrix multiplications, so our approach works well for large-scale input data. Moreover, the resulting factor matrices directly assign cluster labels to data points and features, due to the nature of indicator matrices. In addition, by exploiting the manifold structures in both the data and feature spaces, we further introduce the Locality Preserved FNMTF (LP-FNMTF) approach, which improves clustering performance. Promising results in extensive experimental evaluations validate the effectiveness of the proposed methods.

1 Introduction

Clustering, which partitions a data set into different groups in an unsupervised manner, is one of the most fundamental topics in statistical learning. Most traditional clustering algorithms are designed for one-side clustering [Nie et al., 2009], i.e., they cluster either data points or features. However, in many real-world applications, the analysis calls for two-side clustering results, i.e., grouping the data points and the features simultaneously, e.g., "documents" and "words" in document analysis, "users" and "items" in collaborative filtering, "samples" and "genes" in microarray data analysis, etc. Typically, instead of being independent, the clustering tasks on data and features are closely correlated, and it is challenging for traditional clustering algorithms to exploit this interdependence efficiently. Consequently, co-clustering techniques, which aim to cluster both data and features simultaneously by leveraging the interrelations between them, have been proposed in recent research. To name a few, Dhillon [Dhillon, 2001] introduced a bipartite spectral graph partitioning approach to co-cluster words and documents; Cho et al. [Cho et al., 2004] proposed to co-cluster experimental conditions and genes in microarray data by minimizing the sum-squared residue; Long et al. [Long et al., 2006] presented a relation summary network model to co-cluster heterogeneous data on a k-partite graph; and so on.

More recently, Ding et al. [Ding et al., 2005] explored the relationships between Nonnegative Matrix Factorization (NMF) [Lee and Seung, 1999; 2001] and K-means/spectral clustering, and proposed to use Nonnegative Matrix Tri-factorization (NMTF) [Ding et al., 2006] to co-cluster words and documents at the same time.
Due to its mathematical elegance and encouraging empirical results, the NMTF method has been further developed to address various aspects of co-clustering [Wang et al., 2008; Li et al., 2010; Gu and Zhou, 2009; Ding et al., 2010]. However, a notorious bottleneck of NMTF based co-clustering approaches is their slow computational speed, caused by the intensive matrix multiplications involved in each iteration of the solution algorithms, which makes these approaches hard to apply to large-scale data in real-world applications. In this paper, we propose a novel Fast Nonnegative Matrix Tri-factorization (FNMTF) approach to efficiently conduct co-clustering on large-scale data. Our new algorithms are interesting from the following perspectives:

• Instead of enforcing traditional nonnegative constraints on the factor matrices of NMTF, we constrain them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the clustering results of our approach are readily stored in the resulting factor matrices. In contrast, existing NMF based methods require an extra post-processing step to extract cluster structures from the factor matrices, which often leads to non-unique clustering results.

• Due to the nature of indicator matrices, the optimization problem of our approach can be decoupled into subproblems of much smaller size, and the decoupled subproblems involve far fewer matrix multiplications. Therefore, our approach is computationally efficient and scales well to large-scale input data.

• Taking into account the manifold structures in both the data and feature spaces, we further develop a Locality Preserved FNMTF (LP-FNMTF) approach that incorporates manifold regularization. An efficient algorithm to optimize the objective with quick convergence is presented.

• Promising experimental results on five benchmark data sets show that our approaches not only are faster than state-of-the-art co-clustering methods but also achieve competitive clustering performance.

Notations and problem formalization. Throughout this paper, we write matrices as boldface uppercase letters and vectors as boldface lowercase letters. Given a matrix M = (m_{ij}), its i-th row and j-th column are denoted as m_{i·} and m_{·j}, respectively.

Traditional clustering methods focus on one-side clustering, i.e., clustering the data side based on the similarities along the feature side. In the co-clustering problem, we cluster data points based on the distributions of features, and meanwhile cluster features based on the distributions of the data points. Formally, given a data set X = \{x_{\cdot i} \in \mathbb{R}^d\}_{i=1}^{n}, we write X = [x_{\cdot 1}, \ldots, x_{\cdot n}] = [x_{1\cdot}^T, \ldots, x_{d\cdot}^T]^T. Our goal is to group the data points \{x_{\cdot 1}, \ldots, x_{\cdot n}\} into c clusters \{C_j\}_{j=1}^{c}, and simultaneously group the features \{x_{1\cdot}, \ldots, x_{d\cdot}\} into m clusters \{W_j\}_{j=1}^{m}. We use a partition matrix G = [g_{1\cdot}^T, \ldots, g_{n\cdot}^T]^T \in \{0,1\}^{n \times c} to represent the clustering result of the data points, such that g_{ij} = 1 if x_{\cdot i} belongs to cluster C_j and g_{ij} = 0 otherwise. Similarly, we use another partition matrix F = [f_{1\cdot}^T, \ldots, f_{d\cdot}^T]^T \in \{0,1\}^{d \times m} to represent the clustering results of the features. We call F and G cluster indicator matrices, because each of their rows, i.e., f_{i\cdot} (1 ≤ i ≤ d) or g_{i\cdot} (1 ≤ i ≤ n), has one and only one element equal to 1 to indicate the cluster membership, while the remaining elements are 0. We denote the set of all cluster indicator matrices as Ψ.
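As a concrete illustration of the notation, the following minimal numpy sketch builds a cluster indicator matrix in Ψ from a vector of cluster labels; the helper name labels_to_indicator is ours and not part of the paper.

```python
import numpy as np

def labels_to_indicator(labels, n_clusters):
    """Build a cluster indicator matrix in Psi: each row has exactly one 1."""
    indicator = np.zeros((len(labels), n_clusters))
    indicator[np.arange(len(labels)), labels] = 1
    return indicator

# Example: n = 5 data points assigned to c = 3 clusters.
G = labels_to_indicator(np.array([0, 2, 1, 0, 2]), 3)   # G is a 5 x 3 indicator matrix
assert (G.sum(axis=1) == 1).all()                        # one and only one 1 per row
```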
2 Fast nonnegative matrix tri-factorization (FNMTF) for co-clustering

In this section, we first briefly review the background of co-clustering using NMTF, which motivates the optimization objective of our approach. After that, an efficient algorithm to solve our objective is introduced.

2.1 Objective of FNMTF approach

K-means clustering is a standard clustering method in statistical learning, which minimizes the following objective:

J_1 = \sum_{j=1}^{c} \sum_{x_{\cdot i} \in C_j} \|x_{\cdot i} - c_{\cdot j}\|^2 = \sum_{j=1}^{c} \sum_{i=1}^{n} g_{ij} \|x_{\cdot i} - c_{\cdot j}\|^2, \quad \text{s.t. } G \in \Psi^{n \times c},   (1)

where ‖·‖ denotes the Frobenius norm of a matrix and c_{\cdot j} is the j-th centroid of the data set. Because G ∈ Ψ is a cluster indicator matrix, minimizing J_1 is a combinatorial optimization problem, which is hard to solve in general. Therefore, the minimization of J_1 is often relaxed to the maximization of the following objective [Zha et al., 2001; Ding and He, 2004]:

J_2 = \mathrm{tr}\left(G^T X^T X G\right), \quad \text{s.t. } G^T G = I.   (2)

Note that G in J_2 is no longer an indicator matrix, but an arbitrary orthonormal matrix.

Recently, Ding et al. [Ding et al., 2005] proved the equivalence between the relaxed objective of K-means clustering in Eq. (2) and the NMF objective when orthonormal constraints are enforced on the factor matrices, which minimizes:

J_3 = \|X - F G^T\|^2, \quad \text{s.t. } F \ge 0, \; G \ge 0, \; F^T F = I, \; G^T G = I,   (3)

where X \in \mathbb{R}_+^{d \times n}, F \in \mathbb{R}_+^{d \times c} and G \in \mathbb{R}_+^{n \times c}, and J_3 aims to approximate the nonnegative data matrix X by the product of F and G. The orthonormal constraints ensure the uniqueness (up to a permutation) of the solution, and together with the nonnegativity constraints make the resulting F and G approximate the K-means clustering results on both features and data points (called "soft labels") [Ding et al., 2005; 2006]. The latter property, simultaneously clustering the rows (features) and the columns (data points) of an input data matrix, is one of the main strengths of the NMF defined in Eq. (3).

Because the two-factor NMF in Eq. (3) is restrictive and often gives a rather poor low-rank matrix approximation, one more factor S \in \mathbb{R}_+^{m \times c} was introduced to absorb the different scales of X, F and G. This leads to NMTF [Ding et al., 2006]: X \approx F S G^T, where F \in \mathbb{R}_+^{d \times m} and G \in \mathbb{R}_+^{n \times c}. S provides increased degrees of freedom such that the low-rank matrix representation remains accurate, while F gives row clusters and G gives column clusters. In order to achieve additional flexibility in clustering scenarios, the nonnegativity constraint on X (and thereby on S) can be relaxed [Ding et al., 2010], which leads to the semi-NMTF problem of minimizing the following objective:

J_4 = \|X - F S G^T\|^2, \quad \text{s.t. } F \ge 0, \; G \ge 0, \; F^T F = I, \; G^T G = I.   (4)

Despite its mathematical elegance, Eq. (4) suffers from two problems that impede its practical use. First, similar to Eq. (2), the relaxations on F and G mean that the immediate outputs of Eq. (4) are not cluster labels; an additional post-processing step is required, which often leads to non-unique solutions. Second, and more importantly, Eq. (4) is usually solved by alternating iterative algorithms, and each iteration involves intensive matrix multiplications [Ding et al., 2005; 2006; 2010; Wang et al., 2008; Li et al., 2010; Gu and Zhou, 2009]. As a result, it is infeasible to apply such algorithms to large-scale real-world data due to the expensive computational cost.

In order to tackle these difficulties, instead of solving the relaxed clustering problems as in Eqs. (2)–(4), we solve the original clustering problem similar to Eq. (1). Specifically, we constrain the factor matrices of NMTF to be cluster indicator matrices and minimize the following objective:

J_5 = \|X - F S G^T\|^2, \quad \text{s.t. } F \in \Psi^{d \times m}, \; G \in \Psi^{n \times c}.   (5)

We call Eq. (5) the proposed Fast Nonnegative Matrix Tri-factorization (FNMTF) approach.

Note that the orthonormal constraints on F and G are removed in our objective, as their purposes (unique solution and labeling approximation) are automatically accomplished by the new constraints. Surprisingly, with these new constraints, though more stringent, the computational speed of our approach can be significantly improved, as shown theoretically shortly in this section and empirically later in Section 4.
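To make the objective in Eq. (5) concrete, here is a minimal numpy sketch (ours, not from the paper) that evaluates J_5 for given factors; the function name fnmtf_objective is introduced only for illustration.

```python
import numpy as np

def fnmtf_objective(X, F, S, G):
    """Squared Frobenius reconstruction error ||X - F S G^T||^2 of Eq. (5).

    X: d x n data matrix; F: d x m and G: n x c cluster indicator matrices;
    S: m x c factor absorbing the different scales of X, F and G.
    """
    return np.linalg.norm(X - F @ S @ G.T, 'fro') ** 2
```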
2.2 An efficient optimization algorithm

Following standard optimization procedures, we alternately solve for the three variables F, S and G of J_5 in Eq. (5).

First, fixing F and G, and setting the derivative of J_5 with respect to S to zero, we have

S = \left(F^T F\right)^{-1} F^T X G \left(G^T G\right)^{-1}.   (6)

Second, when F and S are fixed, the optimization problem for G can be decoupled, and we solve the following simpler problem for each i (1 ≤ i ≤ n):

\min_{G \in \Psi} \|x_{\cdot i} - F S g_{i\cdot}^T\|^2.   (7)

Because g_{i\cdot} (1 ≤ i ≤ n) ∈ Ψ^{1×c} is a cluster indicator vector in which one and only one element is 1 and the rest are zeros, the solution to Eq. (7) can be easily obtained by:

g_{ij} = 1 if j = \arg\min_k \|x_{\cdot i} - \tilde{f}_{\cdot k}\|^2, and g_{ij} = 0 otherwise,   (8)

where \tilde{F} = F S and \tilde{f}_{\cdot k} is the k-th column of \tilde{F}. Note that Eq. (8) simply enumerates the c vector norms and picks the minimum one, without involving any matrix multiplication.

Finally, when G and S are fixed, the optimization problem for F can be similarly decoupled, and we solve the following simpler problem for each j (1 ≤ j ≤ d):

\min_{F \in \Psi} \|x_{j\cdot} - f_{j\cdot} S G^T\|^2.   (9)

Again, since f_{j\cdot} (1 ≤ j ≤ d) ∈ Ψ^{1×m} is a cluster indicator vector for the feature side, the solution to Eq. (9) is:

f_{jl} = 1 if l = \arg\min_k \|x_{j\cdot} - \tilde{g}_{k\cdot}\|^2, and f_{jl} = 0 otherwise,   (10)

where \tilde{G}^T = S G^T and \tilde{g}_{k\cdot} is the k-th row of \tilde{G}^T.

The procedure to solve J_5 is summarized in Algorithm 1. Due to the nature of alternating optimization, Algorithm 1 is guaranteed to converge to a local minimum (existing NMF algorithms [Ding et al., 2005; 2006; 2010] also converge only to a local minimum, because the objectives J_3 and J_4 are not convex in both variables F and G); the proof is skipped due to the space limit. As can be seen, in step 2 and step 3 the solutions are obtained by enumerating vector norms, which is much faster than the matrix multiplications used in the existing NMF methods. Upon solution, G directly gives the clustering results of the data points, and F directly gives the clustering results of the features.

Algorithm 1: Algorithm to solve J_5 in Eq. (5).
Input: Data matrix X = [x_{\cdot 1}, \ldots, x_{\cdot n}] ∈ R^{d×n}.
Initialize G ∈ Ψ^{n×c} and F ∈ Ψ^{d×m} with arbitrary cluster indicator matrices;
repeat
  1. Calculate S by Eq. (6);
  2. Calculate G by Eq. (8);
  3. Calculate F by Eq. (10);
until convergence;
Output: Indicator matrices G for data point clustering and F for feature clustering.
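The following is a compact numpy sketch of Algorithm 1 under the update rules in Eqs. (6), (8) and (10). It is a sketch rather than the authors' implementation: the random indicator initialization and the use of pseudo-inverses to guard against temporarily empty clusters are our assumptions.

```python
import numpy as np

def fnmtf(X, m, c, n_iter=100, seed=0):
    """Sketch of Algorithm 1: alternate the updates of S (Eq. 6), G (Eq. 8), F (Eq. 10).

    X is a d x n data matrix; returns the indicator matrices F (d x m, feature
    clusters), the scale matrix S (m x c), and G (n x c, data point clusters).
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    F = np.eye(m)[rng.integers(m, size=d)]   # arbitrary initial indicator matrices
    G = np.eye(c)[rng.integers(c, size=n)]
    for _ in range(n_iter):
        # Eq. (6); pinv instead of inv guards against empty clusters (our assumption)
        S = np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G)
        # Eq. (8): assign each data point to the nearest column of F~ = F S
        Ft = F @ S                                                     # d x c
        dist_g = ((X[:, :, None] - Ft[:, None, :]) ** 2).sum(axis=0)   # n x c
        G = np.eye(c)[np.argmin(dist_g, axis=1)]
        # Eq. (10): assign each feature to the nearest row of G~^T = S G^T
        Gt = S @ G.T                                                   # m x n
        dist_f = ((X[:, None, :] - Gt[None, :, :]) ** 2).sum(axis=2)   # d x m
        F = np.eye(m)[np.argmin(dist_f, axis=1)]
    return F, S, G
```

A call such as F, S, G = fnmtf(X, m=10, c=10) would return the feature and data cluster indicator matrices directly, which reflects the point made above that no post-processing step is needed to read off cluster labels.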
3 Locality preserved FNMTF (LP-FNMTF)

Recent research has shown that many real-world data are sampled from nonlinear manifolds embedded in the high-dimensional ambient space [Belkin and Niyogi, 2002]. However, similar to traditional NMF and NMTF, the proposed FNMTF approach assumes that the data points and features are sampled from Euclidean spaces, and it fails to discover the intrinsic geometrical and discriminative structures of the data and features. Therefore, we further develop our FNMTF approach and propose the Locality Preserved FNMTF (LP-FNMTF) approach, which enforces two geometrically based regularizers on both the data and feature sides.

3.1 Manifold regularization and the optimization objective of LP-FNMTF method

Because we co-cluster an input data matrix along both the data and feature dimensions, we consider two undirected graphs, one constructed from the data points, denoted as G_d, and the other from the features, denoted as G_f. The corresponding affinity matrices W_d and W_f can either be computed from the input data matrix X (e.g., as in [Gu and Zhou, 2009]) or obtained from prior knowledge. According to the manifold assumption [Belkin and Niyogi, 2002], the regularization terms that measure the smoothness with respect to the intrinsic manifolds of the data points and features are given by [Cai et al., 2008; Gu and Zhou, 2009]:

\min_{G \in \Psi} \mathrm{tr}\left(G^T L_d G\right), \quad \text{and} \quad \min_{F \in \Psi} \mathrm{tr}\left(F^T L_f F\right),   (11)

where L_d = I - D_d^{-1/2} W_d D_d^{-1/2} is the normalized graph Laplacian of G_d and D_d is the degree matrix of G_d; similarly, L_f = I - D_f^{-1/2} W_f D_f^{-1/2}, where D_f is the degree matrix of G_f.

Incorporating Eq. (11) into Eq. (5), the objective of the LP-FNMTF approach is to minimize:

J_6 = \|X - F S G^T\|^2 + \alpha \, \mathrm{tr}\left(G^T L_d G\right) + \beta \, \mathrm{tr}\left(F^T L_f F\right), \quad \text{s.t. } F \in \Psi^{d \times m}, \; G \in \Psi^{n \times c},   (12)

where α and β are regularization parameters that balance the co-clustering reconstruction error in the first term against the labeling smoothness in the data point space and the feature space in the second and third terms, respectively.

Because F and G are constrained to be cluster indicator matrices, it is difficult to solve Eq. (12) in general. Hence we simplify this problem using the following proposition.

Proposition 1. Given a symmetric matrix A and its eigendecomposition A = P Σ P^T, where Σ ∈ R^{c×c} is a diagonal matrix whose diagonal elements are the c largest eigenvalues and P is the corresponding eigenvector matrix, the following two optimization problems are equivalent:

(P1): \min_{C \in \Psi} \mathrm{tr}\left[C^T (I - A) C\right],   (13)

(P2): \min_{C \in \Psi, \; Q^T Q = I} \|C - B Q\|^2,   (14)

where Q is an arbitrary orthonormal matrix and

B = P \Sigma^{1/2}.   (15)

Proof. (P1) is equivalent to \max_{C \in \Psi} \mathrm{tr}\left(C^T A C\right), which is further equivalent to \min_{C \in \Psi} \|C C^T - A\|^2. By definition, the low-rank approximation of A is given by B Q (B Q)^T, and thus (P1) becomes \min_{C \in \Psi, Q^T Q = I} \|C C^T - B Q (B Q)^T\|^2. C approximating B Q is equivalent to C C^T approximating B Q (B Q)^T. Hence, solving (P1) in Eq. (13) can reasonably be transformed into solving (P2) in Eq. (14), which completes the proof of Proposition 1. □

Applying Proposition 1 to Eq. (12), the objective of our LP-FNMTF approach is transformed into minimizing:

J_7 = \|X - F S G^T\|^2 + \alpha \|G - B_d Q_d\|^2 + \beta \|F - B_f Q_f\|^2, \quad \text{s.t. } F \in \Psi^{d \times m}, \; G \in \Psi^{n \times c}, \; Q_d^T Q_d = I, \; Q_f^T Q_f = I,   (16)

where B_d and B_f are computed from L_d and L_f following the procedure described in Proposition 1.
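As an illustration of how B_d and B_f can be obtained, the sketch below (ours, not the authors' code) forms the normalized affinity A = D^{-1/2} W D^{-1/2}, takes its c largest eigenpairs, and returns B = P Σ^{1/2} as in Proposition 1; the guard for isolated nodes and the clipping of negative eigenvalues before the square root are our assumptions.

```python
import numpy as np

def locality_basis(W, c):
    """B = P Sigma^{1/2} from Proposition 1, given a symmetric affinity matrix W.

    A = D^{-1/2} W D^{-1/2} is the matrix whose normalized Laplacian L = I - A
    appears in Eq. (11); P holds the eigenvectors of the c largest eigenvalues.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard for isolated nodes
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
    P = eigvecs[:, -c:]                                  # eigenvectors of the c largest eigenvalues
    Sigma = np.clip(eigvals[-c:], 0.0, None)             # clip tiny negatives before the square root
    return P * np.sqrt(Sigma)                            # B = P Sigma^{1/2}
```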
3.2 Optimization algorithm

Again, we use the alternating iterative method to solve Eq. (16). We first introduce the following theorem.

Theorem 1. Let H = B^T C and let the Singular Value Decomposition (SVD) of H be given by H = U Λ V^T. Fixing C and B, the optimal Q for problem (P2) defined in Eq. (14) is given by Q = U V^T.

Proof. When C is fixed, problem (P2) in Eq. (14) is equivalent to \max_{Q^T Q = I} \mathrm{tr}\left(Q^T H\right). We have

\mathrm{tr}\left(Q^T H\right) = \mathrm{tr}\left(Q^T U \Lambda V^T\right) = \mathrm{tr}\left(\Lambda V^T Q^T U\right) = \mathrm{tr}\left(\Lambda Z\right) = \sum_i \lambda_{ii} z_{ii},

where Z = V^T Q^T U, and λ_{ii} and z_{ii} are the (i, i)-th entries of Λ and Z, respectively. Note that Z is orthonormal, i.e., Z^T Z = I, and thus z_{ii} ≤ 1. On the other hand, λ_{ii} ≥ 0 because λ_{ii} is a singular value of H. Therefore, \mathrm{tr}\left(Q^T H\right) = \sum_i \lambda_{ii} z_{ii} \le \sum_i \lambda_{ii}, and the equality holds when z_{ii} = 1 (1 ≤ i ≤ c), that is, when Z = I. Recalling that Z = V^T Q^T U, the solution to \max_{Q^T Q = I} \mathrm{tr}\left(Q^T H\right), and hence to (P2), is Q = U Z^T V^T = U V^T. Theorem 1 is proved. □

Now we solve Eq. (16). First, fixing F and G, and setting the derivative of J_7 with respect to S to zero, we obtain

S = \left(F^T F\right)^{-1} F^T X G \left(G^T G\right)^{-1}.   (17)

Second, fixing F, G and S, we can decouple Eq. (16) into the following two subproblems:

\min_{Q_d^T Q_d = I} \|G - B_d Q_d\|^2, \quad \min_{Q_f^T Q_f = I} \|F - B_f Q_f\|^2.   (18)

Applying Theorem 1, Q_d = U_d V_d^T, where U_d and V_d are obtained by the SVD of B_d^T G; and Q_f = U_f V_f^T, where U_f and V_f are obtained by the SVD of B_f^T F.

Third, we fix S, F and Q_d to update G. Because G is a cluster indicator matrix, Eq. (16) decouples into the following simpler problem for each i (1 ≤ i ≤ n):

\min_{G \in \Psi} \|x_{\cdot i} - \tilde{F} g_{i\cdot}^T\|^2 + \alpha \|g_{i\cdot} - (\tilde{b}_d)_{i\cdot}\|^2,   (19)

where \tilde{F} = F S and (\tilde{b}_d)_{i\cdot} denotes the i-th row of \tilde{B}_d = B_d Q_d. Thus, the solution is obtained by

g_{ij} = 1 if j = \arg\min_k \left( \|x_{\cdot i} - \tilde{f}_{\cdot k}\|^2 - 2\alpha (\tilde{B}_d)_{ik} \right), and g_{ij} = 0 otherwise.   (20)

Finally, fixing S, G and Q_f, and letting \tilde{B}_f = B_f Q_f and \tilde{G}^T = S G^T with \tilde{g}_{k\cdot} the k-th row of \tilde{G}^T, we similarly obtain F as:

f_{jl} = 1 if l = \arg\min_k \left( \|x_{j\cdot} - \tilde{g}_{k\cdot}\|^2 - 2\beta (\tilde{B}_f)_{jk} \right), and f_{jl} = 0 otherwise.   (21)

The procedure to solve J_7 is summarized in Algorithm 2. Again, because step 4 and step 5 only involve enumerating vector norms without matrix multiplication, our algorithm is computationally efficient. Empirical results show that the convergence of our algorithm is fast, which makes it feasible to solve large-scale real-world problems with our approach in practice.

Algorithm 2: Algorithm to solve J_7 in Eq. (16).
Input: Data matrix X = [x_{\cdot 1}, \ldots, x_{\cdot n}] ∈ R^{d×n}.
1. Initialize G ∈ Ψ^{n×c} and F ∈ Ψ^{d×m} with arbitrary cluster indicator matrices;
2. Calculate B_d and B_f from L_d and L_f following the description of Proposition 1;
repeat
  1. Calculate S by Eq. (17);
  2. Calculate Q_d = U_d V_d^T, where U_d and V_d are obtained by the SVD of B_d^T G;
  3. Calculate Q_f = U_f V_f^T, where U_f and V_f are obtained by the SVD of B_f^T F;
  4. Calculate G by Eq. (20);
  5. Calculate F by Eq. (21);
until convergence;
Output: Cluster indicator matrices G and F for the data and feature clustering tasks, respectively.
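Two of the updates above have short closed forms that are easy to get wrong by a sign or a transpose, so here is a small numpy sketch of them (ours, under the notation above): the orthogonal step of Theorem 1 used in steps 2 and 3 of Algorithm 2, and the regularized data-cluster assignment of Eq. (20); the function names are our own.

```python
import numpy as np

def optimal_rotation(B, C):
    """Theorem 1: the orthonormal Q minimizing ||C - B Q||^2 is Q = U V^T,
    where U Lambda V^T is the SVD of H = B^T C."""
    U, _, Vt = np.linalg.svd(B.T @ C)
    return U @ Vt

def assign_data_clusters(X, F_tilde, Bd_tilde, alpha):
    """Eq. (20): give data point i the cluster k minimizing
    ||x_i - f~_k||^2 - 2 * alpha * (B~_d)_{ik}; returns the n x c indicator G."""
    dist = ((X[:, :, None] - F_tilde[:, None, :]) ** 2).sum(axis=0)   # n x c distances
    labels = np.argmin(dist - 2.0 * alpha * Bd_tilde, axis=1)
    return np.eye(F_tilde.shape[1])[labels]
```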
4 Experiments

In this section, we evaluate the proposed FNMTF and LP-FNMTF approaches, and compare them against state-of-the-art (co-)clustering methods, including the Semi-NMF (SNMF) [Ding et al., 2010], Orthogonal NMTF (ONMTF) [Ding et al., 2006], Graph regularized NMF (GNMF) [Cai et al., 2008] and Dual Regularized Co-clustering (DRCC) [Gu and Zhou, 2009] methods. We also report the clustering results of the K-means and NMF [Lee and Seung, 2001] methods as baselines.

In our experiments, we use 5 data sets to evaluate the compared methods, which are summarized in Table 1. The first four are widely used as benchmarks in the clustering literature [Ding et al., 2006; Cai et al., 2008; Gu and Zhou, 2009; Ding et al., 2010]. The last one has a very large sample size and feature size [Chen et al., 2010]. In order to run the experiments on contemporary computers, for the RCV1 data set, following previous studies, we remove the keywords (features) appearing fewer than 100 times in the corpus, which leaves 1979 (out of 47236) keywords in our experiments.

Table 1: Description of experimental data sets.
Data set   # samples   # features   # classes
Coil20     1140        1024         20
WebKB      4199        1000         4
WebACE     2340        1000         20
CSTR       476         1000         4
RCV1       193844      1979         103

To evaluate the clustering results, we adopt three widely used standard metrics: clustering accuracy [Cai et al., 2008], normalized mutual information (NMI) [Cai et al., 2008] and cluster purity [Ding et al., 2006].
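The paper does not spell out how these metrics are computed, so the sketch below shows one common way to obtain them, using scipy's Hungarian matching for accuracy and scikit-learn for NMI; treat the helper names and the exact conventions as our assumptions rather than the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy after the best one-to-one matching of predicted clusters to
    true classes (Hungarian algorithm on the confusion matrix)."""
    k = max(y_true.max(), y_pred.max()) + 1
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[p, t] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize matched counts
    return confusion[rows, cols].sum() / len(y_true)

def cluster_purity(y_true, y_pred):
    """Fraction of points that fall in the majority true class of their cluster."""
    matched = 0
    for cluster in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == cluster], return_counts=True)
        matched += counts.max()
    return matched / len(y_true)

# NMI is available directly as normalized_mutual_info_score(y_true, y_pred).
```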
4.1 Clustering results

Experimental setup. Because each clustering algorithm has one or more parameters to be tuned, in order to compare fairly, we run these algorithms under different parameter settings and select the best average result for each one. We set the number of clusters to the true number of classes for all clustering algorithms on all data sets. For the co-clustering methods, including ONMTF, DRCC and our two methods, the number of feature clusters is set to be the same as that of data clusters, i.e., m = c. For the manifold regularized methods, including GNMF, DRCC and our LP-FNMTF method, we construct the nearest-neighbor graph following [Gu and Zhou, 2009], where the neighborhood size for graph construction is set by searching the grid {1, 2, ..., 10}, and the regularization parameters (i.e., α and β in Eq. (16)) are set by searching the grid {0.1, 1, 10, 100, 500, 1000}. Given the number of clusters, no parameter selection is needed for the K-means, NMF and SNMF methods.

Results. Under each parameter setting of every compared method, we repeat clustering 50 times and compute the average result. We report the best average result for each method on the five data sets in Table 2.

Table 2: Clustering results measured by accuracy/NMI/purity of the compared methods.
Data    Metric    Kmeans  NMF    SNMF   ONMTF  GNMF   DRCC   FNMTF  LP-FNMTF
Coil20  Accuracy  0.495   0.487  0.527  0.635  0.665  0.680  0.696  0.725
        NMI       0.489   0.479  0.511  0.561  0.549  0.566  0.584  0.621
        Purity    0.441   0.437  0.468  0.478  0.481  0.483  0.497  0.512
WebKB   Accuracy  0.698   0.668  0.621  0.685  0.717  0.725  0.774  0.792
        NMI       0.467   0.427  0.418  0.455  0.458  0.487  0.501  0.522
        Purity    0.601   0.595  0.604  0.664  0.672  0.674  0.696  0.712
WebACE  Accuracy  0.526   0.514  0.527  0.635  0.665  0.680  0.696  0.714
        NMI       0.519   0.512  0.538  0.587  0.556  0.571  0.604  0.611
        Purity    0.479   0.481  0.491  0.487  0.511  0.493  0.517  0.532
CSTR    Accuracy  0.763   0.759  0.699  0.771  0.742  0.812  0.894  0.847
        NMI       0.654   0.668  0.614  0.673  0.635  0.681  0.753  0.722
        Purity    0.612   0.587  0.614  0.645  0.637  0.656  0.701  0.682
RCV1    Accuracy  0.168   0.156  0.173  0.196  0.201  0.213  0.238  0.241
        NMI       0.274   0.274  0.283  0.301  0.312  0.318  0.339  0.342
        Purity    0.134   0.121  0.139  0.146  0.152  0.159  0.172  0.179

From the results in Table 2, we can see that the proposed FNMTF and LP-FNMTF methods consistently outperform the other compared methods, sometimes very significantly, which demonstrates the advantage of our approaches in terms of clustering performance. A more careful examination of the results shows that the co-clustering methods, including ONMTF, DRCC and our proposed FNMTF and LP-FNMTF methods, generally achieve better clustering results, which is consistent with the widely accepted hypothesis that clustering of features can help clustering of data points. Finally, the LP-FNMTF method is superior to the FNMTF method on all the data sets except CSTR. This indicates that exploiting the geometric structures in the data and feature spaces can indeed improve clustering performance, which verifies the manifold assumption and confirms the correctness of our algorithms.

4.2 Studies of computational speeds

In this subsection, we evaluate the computational speeds of the compared (co-)clustering methods. All our experiments are performed on a Dell PowerEdge 2900 server, which has two quad-core Intel Xeon 5300-series processors at 3.0 GHz and 48 GB of memory.

We first examine the convergence rates of the compared methods. We repeat clustering 50 times with each method, using its optimal parameters on each data set. The average numbers of iterations of each method on each data set are reported in Table 3. The results show that the proposed FNMTF and LP-FNMTF approaches require far fewer iterations to converge, and therefore they are more computationally efficient.

Table 3: Average numbers of iterations to converge of the compared clustering methods.
Method     Coil20  WebKB  WebACE  CSTR  RCV1
Kmeans     26.4    29.2   28.3    24.2  87.1
NMF        30.3    40.2   38.5    30.1  92.5
SNMF       37.7    45.5   44.1    36.3  98.6
ONMTF      41.2    50.3   49.2    40.3  104.4
GNMF       50.2    62.1   61.1    46.8  112.5
DRCC       51.6    64.4   62.3    47.9  129.1
FNMTF      14.1    16.5   15.9    14.3  45.2
LP-FNMTF   15.2    17.2   16.3    15.6  48.1

In addition, we also report the average convergence time of the compared methods on the WebKB and RCV1 data sets in Figure 1.

[Figure 1: Convergence time (ms) of the compared clustering methods on (a) the WebKB data set and (b) the RCV1 data set.]

From the results, we can see that our FNMTF and LP-FNMTF methods are only slower than the K-means method, while being much faster than all the other state-of-the-art clustering methods. These results are consistent with our theoretical analysis that our methods operate on subproblems of much smaller size and use far fewer matrix multiplications. Therefore, our approaches are suitable for clustering on large-scale data. The results on the three other, smaller data sets are not shown due to the space limit; the same observations hold for them.

5 Conclusions

In this work, we proposed a novel Fast Nonnegative Matrix Tri-factorization (FNMTF) method to simultaneously cluster both the data side and the feature side of an input data matrix. We adopt the idea of NMF/NMTF based co-clustering methods, but constrain the factor matrices to be cluster indicator matrices, a special type of nonnegative matrix. Through the new constraints, the optimization problem of our method is decoupled into a number of much smaller subproblems that require far fewer matrix multiplications than the existing NMF based co-clustering algorithms, which makes our approaches particularly useful for real-world large-scale data. We further developed our method to incorporate manifold information and proposed the Locality Preserved FNMTF (LP-FNMTF) method. We conducted extensive experiments on benchmark data sets that demonstrate promising results of our methods, which is consistent with our theoretical analysis.

Acknowledgments

This research was supported by NSF-CNS 0923494, NSF-IIS 1041637, NSF-CNS 1035913.

References

[Belkin and Niyogi, 2002] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.

[Cai et al., 2008] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In ICDM, 2008.
[Chen et al., 2010] W.Y. Chen, Y. Song, H. Bai, C.J. Lin, and E.Y. Chang. Parallel spectral clustering in distributed systems. IEEE TPAMI, 2010.

[Cho et al., 2004] H. Cho, I.S. Dhillon, Y. Guan, and S. Sra. Minimum sum-squared residue co-clustering of gene expression data. In SDM, 2004.

[Dhillon, 2001] I.S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In SIGKDD, 2001.

[Ding and He, 2004] C. Ding and X. He. K-means clustering via principal component analysis. In ICML, 2004.

[Ding et al., 2005] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, 2005.

[Ding et al., 2006] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In SIGKDD, 2006.

[Ding et al., 2010] C. Ding, T. Li, and M.I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE TPAMI, 32(1):45–55, 2010.

[Gu and Zhou, 2009] Q. Gu and J. Zhou. Co-clustering on manifolds. In SIGKDD, 2009.

[Lee and Seung, 1999] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[Lee and Seung, 2001] D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.

[Li et al., 2010] T. Li, V. Sindhwani, C. Ding, and Y. Zhang. Bridging domains with words: Opinion analysis with matrix tri-factorizations. In SDM, 2010.

[Long et al., 2006] B. Long, X. Wu, Z.M. Zhang, and P.S. Yu. Unsupervised learning on k-partite graphs. In SIGKDD, 2006.

[Nie et al., 2009] Feiping Nie, Dong Xu, Ivor W. Tsang, and Changshui Zhang. Spectral embedded clustering. In IJCAI, pages 1181–1186, 2009.

[Wang et al., 2008] F. Wang, T. Li, and C. Zhang. Semi-supervised clustering via matrix factorization. In SDM, 2008.

[Zha et al., 2001] Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming Gu. Spectral relaxation for k-means clustering. In NIPS, 2001.