Variance enhanced K-medoid clustering

Por-Shen Lai, Hsin-Chia Fu
Department of Computer Science, National Chiao-Tung University, Hsin-Chu 300, Taiwan, ROC

Expert Systems with Applications 38 (2011) 764–775
Keywords: Clustering; K-medoid; Data variance; Polygon model; Image similarity
doi:10.1016/j.eswa.2010.07.030
This research was supported in part by the National Science Council under Grant NSC 95-2221-E-009-218.
Corresponding author. E-mail addresses: porshenlai@yahoo.com (P.-S. Lai), hcfu@csie.nctu.edu.tw (H.-C. Fu).
Abstract

This paper proposes new variance enhanced clustering methods that improve the popular K-medoid algorithm by incorporating variance information into data clustering. Since measuring the similarity between data objects is simpler than mapping data objects to data points in feature space, these pairwise-similarity-based clustering algorithms can greatly reduce the difficulty of developing clustering-based pattern recognition applications. A web-based image clustering system has been developed to demonstrate the clustering power and significance of the proposed methods. Synthetic numerical data and a real-world image collection are used to evaluate the performance of the proposed methods on the prototype system. As shown in the web demonstration, the proposed variance enhanced K-medoid model groups similar images into clusters with various variances according to the distribution of image similarity values.

© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
Clustering techniques have been widely used in many applications, such as pattern recognition, data mining (Pao, Chen, Lai, Xu, & Fu, 2008), user interface design, and so on. Using clustering methods (Fu et al., 2000; Haykin, 1994) to approximate the data distribution of target objects (patterns) in feature space, the obtained data clusters can be used to recognize target objects by checking whether a new object falls in a cluster or not. Besides, clustering methods (Guyon & Elisseeff, 2003; Schapire & Singer, 1999) are also useful in finding patterns of labeled data groups for data mining applications. Using labeled training objects, supervised clustering methods (Haykin, 1994) learn rules to partition data objects into clusters (Pao, Chuang, Xu, & Fu, 2008). On the other hand, unsupervised clustering methods (Comaniciu & Meer, 2002; Hastie, Tibshirani, & Friedman, 2001b, chap. 14.3) partition the feature space to distribute data objects into disconnected groups.
Generally, most clustering methods apply feature extraction techniques to represent the original data objects in numerical form as data points in feature space. Then, the data points are partitioned into clusters according to their distribution in feature space. Therefore, the performance of data clustering depends highly on the quality of the extracted features.
K-means (Hastie, Tibshirani, & Friedman, 2001c, chap. 14.3.6), a popular clustering algorithm, groups data objects into clusters according to the norm between extracted feature vectors. In other words, using the K-means clustering method, the data objects in a cluster are represented by the mean of the feature vectors of its data points, and each object is assigned to the cluster with the closest mean. Since the K-means algorithm does not use variance information to cluster data points, the K-means model is inappropriate for clustering data points or objects having different variances in their data distribution. The mixture Gaussian model (Bilmes, 1997; Hastie, Tibshirani, & Friedman, 2001d, chap. 14.3.7) represents a data cluster by a mean vector and a covariance matrix, which, respectively, describe the center location and the variance of the data distribution in each orientation. Therefore, the mixture Gaussian model is capable of clustering data points of non-uniform variances.
Representing data objects by their feature vectors makes the design of a clustering method simpler. However, extracting representative feature vectors for data objects is still difficult. For instance, although the similarity between two images can be measured based on visual features, such as color, texture, or the spatial distribution of pixels, creating feature vectors whose pairwise norms represent the pairwise similarities between images is still difficult. Therefore, several methods (Hastie, Tibshirani, & Friedman, 2001a, chap. 14.3.10) have been proposed to cluster data objects using the pairwise similarities between data objects instead of measuring the distance between their extracted feature vectors.
K-medoid algorithm (Hastie et al., 2001a) has been widely used
to cluster data points according to their pairwise similarity values.
Given similarity values between every pair of data objects, an
iterative learning process can group the data objects into clusters so as to maximize the sum of similarity values within each cluster. That is, when the sum of similarity values is maximized, the K-medoid algorithm groups data objects with high mutual similarity into the same cluster.
The decision boundaries generated by a given K-medoid model are the perpendicular bisector hyperplanes of the line segments from the medoid of one cluster to another. That is, the variance of the data distribution in each cluster is assumed to be uniform. However, the variances along different orientations of the data distribution in a cluster may differ. In this paper, the concept of different variances is incorporated into the original K-medoid model. Therefore, a new variance enhanced K-medoid model is proposed to group data objects into clusters with varying variances.
In addition, in order to use the new K-medoid model to cluster similar images, a similarity measurement between two images is also proposed and briefed as follows. First, color information is used to segment an image into regions of similar color. Then, the similarity between two images is measured by computing the bitmap difference of their coverage regions. Based on the proposed methods, a web-based prototype system (The demonstration of variance enhanced K-medoid model, http://www.csie.nctu.edu.tw/~pslai/ASKMDDemo), which displays variable-sized thumbnails of images, is developed and tested. The prototype system groups similar images in a cluster, and the image at the cluster center (medoid) is shown in a larger size to depict each group, while the other similar images in the group are displayed together in a smaller size to save display space on screen.
This paper is organized as follows. The K-medoid algorithm is introduced in Section 2. The variance enhanced K-medoid clustering is proposed in Section 3. Section 4 introduces the proposed methods for the similarity measurement of an image collection. The web-based prototype system is then demonstrated and evaluated in Section 5. Finally, concluding remarks are drawn in Section 6.
2. K-medoid
Given a data point set P, the medoid of P is the point p_m with the smallest sum of distances from p_m to all the other points p_i in P. The medoid p_m of P is mathematically defined as follows:

p_m = argmin_{p_i ∈ P} Σ_{p_j ∈ P} N(p_i, p_j),

where N(p_i, p_j) is the norm between two points p_i and p_j.
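Given only a pairwise norm matrix, the medoid defined above can be computed directly. The following sketch (the function and variable names are our own, not from the paper) illustrates this in Python:

```python
import numpy as np

def medoid(dist):
    """Index of the medoid: the point minimizing the sum of norms
    N(p_i, p_j) to all the other points, given the pairwise matrix dist."""
    dist = np.asarray(dist, dtype=float)
    return int(np.argmin(dist.sum(axis=1)))

# Example: five 1-D points and their pairwise absolute differences.
pts = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
d = np.abs(pts[:, None] - pts[None, :])
print(medoid(d))  # → 2 (the point at 2.0 minimizes the total distance)
```

Note that only `d` is needed; the coordinates `pts` are used here merely to build an example norm matrix.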
For one-dimensional data, the medoid of P is actually the median. For a data point set P, let Q(x) be the sum of distances from a point x to all the other points p_i in P. Q(x) can be formulated as follows:

Q(x) = Σ_{p_i ∈ P} N(p_i, x).
Let m_d be the medoid of P, and let p_i be any other data point in P. According to the definition of the medoid, Q(m_d) − Q(p_i) must be smaller than or equal to zero.

Fig. 1. The relation between the medoid and the median for one-dimensional data points. Let m be the median and m_d the medoid. Since Q attains its minimum at the medoid m_d, the medoid m_d is equal to the median m.
Fig. 1 illustrates the relation between the median and the medoid. Let m be the median and m_d be the medoid of P, and assume m_d ≥ m. For each point p_i in P, we may observe the following.

- When p_i is located between the median m and the medoid m_d, then N(p_i, m_d) = N(m, m_d) − N(m, p_i).
- When the median m is located between p_i and the medoid m_d, N(p_i, m_d) = N(p_i, m) + N(m, m_d).
- When the medoid m_d is located between the median m and p_i, N(m_d, p_i) = N(m, p_i) − N(m, m_d).
Q(m_d) − Q(m) can be rewritten as follows:

Q(m_d) − Q(m)
= Σ_{p ∈ (−∞, m]} N(m, m_d) − Σ_{p ∈ [m_d, ∞)} N(m, m_d) + Σ_{p ∈ (m, m_d)} ( N(p, m_d) − N(m, p) )
= Σ_{p ∈ (−∞, m]} N(m, m_d) − Σ_{p ∈ [m_d, ∞)} N(m, m_d) + Σ_{p ∈ (m, m_d)} ( N(p, m_d) − N(m, m_d) + N(p, m_d) )
= Σ_{p ∈ (−∞, m]} N(m, m_d) − Σ_{p ∈ (m, ∞)} N(m, m_d) + 2 Σ_{p ∈ (m, m_d)} N(p, m_d)
= N(m, m_d) + 2 Σ_{p ∈ (m, m_d)} N(p, m_d),

where the last step uses the fact that m is the median of an odd number of data points, so the interval (−∞, m] contains exactly one more point than (m, ∞), and the first two sums combine into a single N(m, m_d) term.

Every term on the right-hand side is non-negative, so Q(m_d) − Q(m) ≥ 0. On the other hand, Q(m_d) is the minimum of Q (the definition of the medoid), so Q(m_d) − Q(m) ≤ 0. Both conditions can hold only when N(m, m_d) = 0; that is, the medoid m_d must be equal to the median m. The same result holds when m_d is smaller than or equal to m. Therefore, for one-dimensional data, the medoid is equal to the median.
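This equivalence is easy to check numerically. The snippet below (our own illustration, not from the paper) verifies that the 1-D medoid coincides with the median for an odd number of distinct points:

```python
import numpy as np

# For an odd number of distinct 1-D points, the medoid (the minimizer of
# the summed absolute distances) is exactly the middle order statistic.
rng = np.random.default_rng(0)
pts = np.sort(rng.normal(size=101))
d = np.abs(pts[:, None] - pts[None, :])
medoid_value = pts[np.argmin(d.sum(axis=1))]
print(medoid_value == np.median(pts))  # → True
```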
Intuitively, the medoid is more likely than the mean to be located in a high-density region of a data set. Therefore, the medoid may be a better representative of a data set.
Similar to the popular K-means algorithm, K-medoid (Hastie et al., 2001a) can be used to cluster data points into K groups. Unlike the K-means algorithm, K-medoid does not depend on the coordinates or values of each data point in feature space; the only data required for medoid estimation are the norms between every pair of data elements.

Suppose the data points in P = {p_1, ..., p_i, ..., p_N} are partitioned into K clusters. By assigning each data point p_i to the cluster with the minimal norm between p_i and that cluster's medoid, a data set P can be iteratively partitioned into K clusters.
2.1. The learning algorithm of K-medoid clustering
The K-medoid cluster learning algorithm (Hastie et al., 2001a) is briefly stated as follows. Let N(p_i, p_j) be the norm between p_i and p_j in data set P, and let K be the number of medoids.

Step 1. Given a current set of cluster centers M^t = {m^t_1, ..., m^t_K}, the current cluster assignment C^t(p_i) of each data point p_i ∈ P is evaluated by computing the closest cluster center for each data point p_i as follows:

C^t(p_i) = argmin_{m^t_j ∈ M^t} N(m^t_j, p_i),

where N(m^t_j, p_i) is the norm between m^t_j and p_i, and C^t(p_i) is the current cluster assignment for p_i.
Step 2. Given a set of current cluster assignments C^t, the new cluster centers M^{t+1} = {m^{t+1}_1, ..., m^{t+1}_j, ..., m^{t+1}_K} are estimated by computing the medoid among the data points with the same assignment as follows:
m^{t+1}_j = argmin_{p_x : C^t(p_x) = m^t_j} Σ_{p_y : C^t(p_y) = m^t_j} N(p_x, p_y),

where C^t(p_i) is the cluster assignment for p_i.
Step 3. Iterate steps 1 and 2 until the assignments no longer change.
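Steps 1–3 can be sketched as follows. This is a minimal coordinate-free implementation; the deterministic initialization and all names are our own assumptions:

```python
import numpy as np

def k_medoid(dist, k, n_iter=100):
    """Alternate Step 1 (assign each point to its nearest medoid) and
    Step 2 (recompute each medoid from its members) until the assignments
    stabilize (Step 3). Only the pairwise norm matrix `dist` is used."""
    dist = np.asarray(dist, dtype=float)
    medoids = np.arange(k)  # simplistic initialization: the first k points
    for _ in range(n_iter):
        assign = np.argmin(dist[:, medoids], axis=1)          # Step 1
        new_medoids = medoids.copy()
        for j in range(k):                                     # Step 2
            members = np.where(assign == j)[0]
            if members.size:
                sub = dist[np.ix_(members, members)]
                new_medoids[j] = members[np.argmin(sub.sum(axis=1))]
        if np.array_equal(new_medoids, medoids):               # Step 3
            break
        medoids = new_medoids
    return medoids, assign

# Two well-separated 1-D groups: {0, 1, 2} and {10, 11, 12}.
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
d = np.abs(pts[:, None] - pts[None, :])
meds, assign = k_medoid(d, 2)
print(sorted(meds))  # → [1, 4], the middle point of each group
```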
2.2. Self-growing K-medoid

Generally speaking, asking a user to assign the number of required clusters is not practical, since the number of clusters usually reflects the nature of the data distribution and is often unknown to the user.

In order to perform data clustering without knowing the number of clusters, the self-growing K-medoid clustering algorithm is proposed. Given the maximal acceptable dis-similarity between a data point and its cluster center, the proposed method iteratively increases the number of clusters until the dis-similarity between every data point and its associated cluster center is lower than the given threshold.
The basic concept of the proposed method can be illustrated by the example shown in Fig. 2. The y-axis shows the weight value of each point, and the x-axis is the location of the data points. Based on the K-medoid clustering algorithm, the point p3 is first selected to be the first cluster center c1. The weights of the data points are then updated according to their norms with respect to the selected cluster center. Specifically, the weights of data points that are close to the selected cluster center are decreased more than the weights of data points that are far away from it. Then, another data point, located in the middle of the data points with larger weights, is selected as the second cluster center c2. After clustering all data points with the two selected cluster centers, the weights of the data points are updated again. The clustering process is repeated, decreasing the weights of the data points, until all weights are lower than a given threshold.
Fig. 2. A 1-D data distribution example is used to illustrate the idea of the self-growing K-medoid clustering algorithm. In each diagram, the height (w) of a sample point represents its weight value. (1) At first, the middle point p3 is selected as the first cluster center c1. (2) By decreasing the weights of the data points close to the first cluster center c1, point p5, which is the middle point of the data points with large weight values, is selected to be the new cluster center c2. (3) After refining the locations of the cluster centers by the standard K-medoid clustering algorithm, the weights (the dis-similarity between each data point and its closest cluster center) of all data points are less than a given threshold.

To formulate this idea, a weighted medoid estimation is defined as follows:

p_n = argmin_{p_j ∈ P} ( w_j Σ_{p_i ∈ P} w_i N(p_j, p_i) ).
The weighted medoid estimation organizes a new cluster center for the data points with high weight values.

Combining the weighted medoid estimation and the standard K-medoid algorithm, the self-growing K-medoid algorithm is introduced as follows:

Step 1. Initialize the weight value w_i of each data point p_i to 1.
Step 2. Use the weighted medoid estimation to create a new cluster center.
Step 3. Apply the standard K-medoid clustering algorithm to adjust the locations of the cluster centers p_m.
Step 4. Recompute the weight w_i of each data point p_i as follows:

w_i = 1 − ( 1 + min_{p_m ∈ M} N(p_i, p_m) )^{−1}.
Step 5. Iteratively repeat steps 2–4 until all weight values w_i are smaller than a given threshold.

Using the proposed self-growing K-medoid algorithm, the number of required medoids can be determined such that the dis-similarity between each data point and its closest cluster center is less than a given threshold.
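A compact sketch of the whole self-growing loop is given below. Note two simplifying assumptions of ours: the weighted medoid here minimizes the plain weighted sum Σ_i w_i N(p_j, p_i) (dropping the extra w_j factor of the printed formula, which would always re-select an existing, zero-weight center), and the refinement step reuses the standard K-medoid iteration of Section 2.1.

```python
import numpy as np

def weighted_medoid(dist, w):
    # New center: the point minimizing the weighted sum of norms, which
    # lands in the middle of the remaining high-weight points (Step 2).
    return int(np.argmin((dist * w).sum(axis=1)))

def refine(dist, medoids, n_iter=100):
    # Standard K-medoid refinement (Section 2.1, steps 1-2) -- Step 3.
    medoids = list(medoids)
    for _ in range(n_iter):
        assign = np.argmin(dist[:, medoids], axis=1)
        new = []
        for j in range(len(medoids)):
            members = np.where(assign == j)[0]
            if members.size == 0:
                new.append(medoids[j])
                continue
            sub = dist[np.ix_(members, members)]
            new.append(int(members[np.argmin(sub.sum(axis=1))]))
        if new == medoids:
            break
        medoids = new
    return medoids

def self_growing_k_medoid(dist, threshold, max_clusters=50):
    dist = np.asarray(dist, dtype=float)
    w = np.ones(dist.shape[0])                      # Step 1
    medoids = []
    while w.max() >= threshold and len(medoids) < max_clusters:
        center = weighted_medoid(dist, w)           # Step 2
        if center in medoids:
            break                                   # no new point to cover
        medoids = refine(dist, medoids + [center])  # Step 3
        nearest = dist[:, medoids].min(axis=1)      # Step 4
        w = 1.0 - 1.0 / (1.0 + nearest)
    return medoids                                  # Step 5: loop condition

pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
d = np.abs(pts[:, None] - pts[None, :])
print(sorted(self_growing_k_medoid(d, threshold=0.6)))  # → [1, 4]
```

With a threshold of 0.6, one center cannot bring every weight below the threshold, so a second center grows automatically and the refinement settles on the middle point of each group.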
2.3. Discussion

Based on the mathematical representation of two clusters, a decision boundary can be determined to separate the data points into two parts. As shown in Fig. 3(a), the decision boundary between two K-medoid clusters is the perpendicular bisector of the line segment from one cluster center to the other. Since the data variance is not considered, these decision boundaries may not separate the data clusters properly. To achieve a better separation of the data points, variance values should be considered in determining the decision
Fig. 3. A comparison among three kinds of decision boundaries for clustering models. The shaded areas are the distributed regions of the data points. (a) The decision boundary is the perpendicular bisector of the line segment between two cluster centers. The variance of the data distribution is not used in creating the data clusters. (b) A single variance for every direction is used for each cluster. For two clusters, the ratio between the two clusters' variances is equal to the ratio between the distances from the cluster centers to the decision boundary. (c) For each pair of clusters, the location of the decision boundary is decided according to the data variances along the line segment between the two cluster centers.
boundaries. A variance enhanced K-medoid is illustrated in Fig. 3(b). By aggregating the norms between the data points and their cluster center (medoid), the variance of each cluster is estimated. Again, since the orientation of the variance of the data distribution is still not considered, the decision boundaries may not partition the feature space well enough. For instance, without considering variance orientation, a cluster with widely distributed data in one direction and very little spread in a perpendicular direction may result in a large overall variance along the perpendicular direction. Fig. 3(c) shows the decision boundaries corresponding to data clusters with multiple variances considered along several orientations. Based on the locations of data points in feature space, a K-means model can be improved by including a covariance matrix for each cluster, e.g., the Gaussian mixture model. Since the location information of data points is not used in the K-medoid model, improving the K-medoid model using the covariance matrix method seems impracticable. How to include the oriented variance in the K-medoid method, and how to estimate the variances in various orientations, thus become essential issues for a variance enhanced K-medoid clustering algorithm. The detailed description of this method is given in Section 3.
3. Variance enhanced K-medoid clustering
According to the discussion in Section 2.3, multiple variances along several orientations are required for a K-medoid based clustering method. However, since the locations of data points in feature space are not available, estimating variance orientation using a covariance matrix seems impractical. Therefore, multiple variances along the line segments between cluster centers are proposed instead. Note that the variance along a line segment between cluster centers should be measured using the data points located along that line segment. In the example shown in Fig. 4, four cluster centers M1, M2, M3, and M4 are included, and the b_i's are the decision boundaries between the line segments from M1 to M_i for i = 2 to 4. For instance, the variance along M1M3 associated with cluster center M1 is estimated from the data points located in the gray-shaded region. In general, decision boundaries segment the feature space into a polygon-shaped region for each cluster in a data set. The polygon-shaped region for a cluster can be further divided into several smaller pyramid-shaped segments, whose tops are the cluster centers and whose bottom hyperplanes lie along the decision boundaries. The variance along the line segment M1M3 is measured using the data points located in the pyramid-shaped segment whose bottom edge lies along the decision boundary b3, which partitions the two data clusters with centers at M1 and M3, respectively.

Fig. 4. Given four clusters with centers at M1, M2, M3, and M4. For the center M1, there are three associated decision boundaries b2, b3, and b4. The variance between M1 and M3 is computed using the data points inside the pyramid-shaped segment whose top is located at the cluster center M1 and whose bottom edge lies along the decision boundary b3.
To select appropriate data points for the computation of variance along the line segments between cluster centers, a polygon-shaped data model, called the Polygon descriptor (Lai & Fu, 2007, 2010), is adopted and briefly described in Section 3.1.

In the rest of this section, the Polygon descriptor is introduced in Section 3.1. Then, the method that segments each cluster into sub-clusters, called side-clusters, along each line segment between cluster centers is described in Section 3.2. Using the projection method described in Section 3.3, the projection distribution of the data points in each side-cluster can be used to adjust the decision boundary between every pair of clusters by the method proposed in Section 3.4. Finally, the variance enhanced K-medoid clustering method is presented in Section 3.5.
3.1. Polygon-shaped data model
As shown in Fig. 3, decision boundaries segment the feature space into several polygon-shaped regions (data clusters). The Polygon descriptor (Lai & Fu, 2007, 2010) was originally proposed for modeling a polygon-shaped data distribution.

A Polygon descriptor uses a reference center and N normal vectors to describe a polygon-shaped data distribution. A normal vector represents: (1) the normal direction of a hyperplane that encloses the convex unit and (2) the distance from the reference center to that hyperplane. Fig. 5(a) shows an exemplar Polygon descriptor for the representation of a 2D data distribution in a convex unit.
The reference center is located at (20, 20), and five normal vectors
are
10
0
� �
;
5
5
� �
;
�5
5
� �
;
�10
0
� �
; and
0
�10
� �
. By applying a
1-D probability function to each normal vectors, a polygon-shaped
probability model is created as shown in Fig. 5(b). The polygon-
shaped probability model (distribution) shown in Fig. 5 is centered
at (20, 20), and the variance corresponds to each normal vector is
the length of each normal vectors, respectively.
According to the Polygon descriptor, one center point and M normal vectors are used to describe an M-sided polygon. The center point is used as the reference origin of the coordinate system of a Polygon descriptor, and the M normal vectors span the polygon-shaped coverage area of the data points. Since the normal vectors are orthogonal to the boundaries of a data distribution, the orientations of the boundaries can be represented by their normal vectors. Besides, the length of a normal vector indicates the variance of the data distribution along that normal vector.

Fig. 6 shows the components of a Polygon descriptor. Data points are partitioned into clusters, called side-clusters, according to the length and orientation of each normal vector. As shown in Fig. 6, for a data distribution with M normal vectors, M side-clusters are determined according to the M boundary edges. The side-cluster corresponding to each boundary edge can be used to measure the variance of the data points along the corresponding normal vector.
3.2. Side-cluster segmentation
The M normal vectors of a Polygon descriptor can be used to further segment a data cluster into M side-clusters. Let {a_1, ..., a_j, ..., a_M} be the M normal vectors of a Polygon descriptor. Given a set P of data points p_i, a sub-cluster assignment C(p_i) for a data point p_i can be formulated as follows:
Fig. 5. (a) An example of the proposed Polygon descriptor, whose center is at (20, 20) and whose normal vectors (drawn as solid arrows) to the surrounding hyperplanes are (10, 0)^T, (5, 5)^T, (−5, 5)^T, (−10, 0)^T, and (0, −10)^T. (b) The pentagon-shaped probability distribution for the Polygon descriptor of (a).
Fig. 6. A diagram showing the components of an exemplar Polygon descriptor. One center point C is used as the reference origin of the coordinate system of the Polygon descriptor. M (= 3) normal vectors are emitted from the origin in M directions. According to the lengths and orientations of the M normal vectors, M side-clusters are determined, corresponding to the boundary edges. That is, the clustering of a side-cluster is determined by the lengths and orientations of the normal vectors, which can be estimated according to the statistical properties of the data points in each cluster.
C(p_i) := argmax_{j = 1, ..., M} ( p_i · a_j ) / ( a_j · a_j ),
Fig. 7. Let a_1 and a_2 be the normal vectors of side-clusters C1 and C2, respectively. The dashed line shows the decision boundary between C1 and C2. For the data vector p_1, which goes from the cluster center O to the data point p1 in C1, its projection ratio on a_1 is larger than the projection ratio on a_2, i.e., (p_1 · a_1)/‖a_1‖² > (p_1 · a_2)/‖a_2‖². Similarly, the projection ratio of the data vector p_2 in C2 on a_2 is larger than its projection ratio on a_1, i.e., (p_2 · a_1)/‖a_1‖² < (p_2 · a_2)/‖a_2‖².
where C(p_i) is the cluster assignment for p_i, and p_i is the vector from the reference center to p_i. According to the cluster assignments, there are M side-clusters; how the data points are partitioned into side-clusters is described in the following paragraph.

Fig. 7 illustrates the basic concept of how a side-cluster is formed. Let a_1 and a_2 be the normal vectors of boundary edges b1 and b2, respectively. The decision boundary between the two side-clusters C1 and C2 is drawn as a dashed line; it is a hyperplane passing through the origin point O and the intersection point I of the two boundary edges b1 and b2. For a vector p_1 from the reference center to a point p1 in C1, its projection ratio on a_1 is larger than the projection ratio on a_2. That is,
( p_1 · a_1 ) / ‖a_1‖² > ( p_1 · a_2 ) / ‖a_2‖².
By segmenting the data points of a cluster into side-clusters corresponding to the line segments between cluster centers, and by projecting the data points of a side-cluster onto the corresponding line segment between cluster centers, a 1-D data distribution can be created. The 1-D data distribution can be used to measure the variance along the line segment between cluster centers, or to adjust the location of the decision boundary between clusters. However, since the coordinates of data points are absent in the K-medoid model, a new method is proposed in Section 3.3 to estimate the projection length without knowing the coordinates of the data points.
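For intuition, the assignment rule C(p_i) = argmax_j (p_i · a_j)/(a_j · a_j) can be tried on the pentagon of Fig. 5. The sketch below assumes explicit coordinates purely for illustration; the actual method obtains projections without coordinates, and the function name is our own:

```python
import numpy as np

def side_cluster(p, normals):
    """Assign the data vector p (taken from the reference center) to the
    side-cluster whose normal vector a_j maximizes (p . a_j)/(a_j . a_j)."""
    normals = np.asarray(normals, dtype=float)
    ratios = normals @ np.asarray(p, dtype=float) / (normals * normals).sum(axis=1)
    return int(np.argmax(ratios))

# The five normal vectors of the pentagon example in Fig. 5.
A = [(10, 0), (5, 5), (-5, 5), (-10, 0), (0, -10)]
print(side_cluster((9, -1), A))  # → 0: the point lies toward (10, 0)
print(side_cluster((1, 9), A))   # → 1: the point lies toward (5, 5)
```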
3.3. Projections on the center-to-center link
Computing the variance along a normal vector using projection ratios requires the coordinate values of data points in feature space. In order to compute the projection ratio on normal vectors without knowing the coordinates of data points in feature space, a new projection method is proposed.
As shown in Fig. 8, given a point p and two clusters with center points m1 and m2, let ‖m1p‖, ‖m2p‖, and ‖m1m2‖ be the norms from m1 to p, from m2 to p, and from m1 to m2, respectively. Let d be the projection of p on the line through m1 and m2; ‖m1d‖ is then the projection length of the vector from m1 to p onto that line. As shown in Fig. 8, the following relations hold:

‖m1p‖² = ‖m1d‖² + ‖pd‖²,
‖m2p‖² = ( ‖m1m2‖ − ‖m1d‖ )² + ‖pd‖².

Then, the length ‖m1d‖ can be derived as follows:

( ‖m1m2‖ − ‖m1d‖ )² = ‖m2p‖² − ‖m1p‖² + ‖m1d‖²,
‖m1m2‖² − 2 ‖m1m2‖ ‖m1d‖ = ‖m2p‖² − ‖m1p‖²,
‖m1d‖ = ( ‖m1m2‖² − ‖m2p‖² + ‖m1p‖² ) / ( 2 ‖m1m2‖ ).
By projecting the data points onto the line segment between two cluster centers, the 1-D data distribution along the line segment becomes available. Based on this 1-D data distribution, the decision boundary can then be adjusted to maximize the decision margin between the two estimated clusters.
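The derivation above needs only the three pairwise norms, so the projection length can be computed without any coordinates. A minimal sketch (the names are ours):

```python
import math

def projection_length(d12, d1p, d2p):
    """||m1 d|| from the three norms alone:
    (||m1 m2||^2 - ||m2 p||^2 + ||m1 p||^2) / (2 ||m1 m2||)."""
    return (d12 ** 2 - d2p ** 2 + d1p ** 2) / (2.0 * d12)

# Sanity check against coordinates: m1 = (0, 0), m2 = (10, 0), p = (3, 4).
d12 = 10.0
d1p = math.hypot(3, 4)       # = 5
d2p = math.hypot(10 - 3, 4)  # = sqrt(65)
print(projection_length(d12, d1p, d2p))  # → 3.0, the x-coordinate of p
```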
3.4. Decision boundaries adjustment
Using the projection method proposed in Section 3.3, the data points in a cluster can be segmented into several side-clusters corresponding to the line segments between each pair of cluster centers. The projections of the data points of a side-cluster onto the corresponding line segment between cluster centers form a 1-D data distribution. Based on this 1-D distribution, the variance of the data points along the line segment between cluster centers can be measured. Since the purpose of measuring the variance is to adjust the location of the decision boundary, the explicit estimation of the variance can be avoided by directly adjusting the location of the decision boundary according to the projected 1-D data distribution.

Fig. 8. m1 and m2 are two cluster centers, p is a data point, and d is the projection point of p on the line segment m1m2.
As shown in Fig. 9, given two cluster centers m1 and m2, projecting the data points p_i (shown as black square dots) onto the line segment m1m2 yields a series of 1-D location values (shown as black triangle dots). The following iterative procedure is proposed to find a decision boundary such that the distance between the means of the two partitioned groups is maximized. The 1-D location values are partitioned into two groups, g1 and g2, according to the decision boundary b12. The decision boundary b12 is determined to maximize the distance between l1 and l2, where l1 and l2 are the means of the data groups g1 and g2, respectively.

Let a be the target location of the decision boundary. The projection ratio h(p_i) for a point p_i is defined as

h(p_i) = ( m1p_i · m1m2 ) / ( m1m2 · m1m2 ).
Actually, h(p_i) is the projection length of the vector from m1 to p_i onto m1m2, divided by the distance between the two cluster centers. Then, a can be evaluated according to the following formula:

a = argmax_a [ ( Σ_{h(p_i) > a} h(p_i) ) / ( Σ_{h(p_i) > a} 1 ) − ( Σ_{h(p_i) ≤ a} h(p_i) ) / ( Σ_{h(p_i) ≤ a} 1 ) ],

where the first term is the average projection ratio over the points with h(p_i) > a, and the second term is the average projection ratio over the points with h(p_i) ≤ a.

For each pair of clusters C_i and C_j with j > i, the data points in C_i and C_j
are projected onto the line segment m_i m_j, where m_i and m_j are the medoids of C_i and C_j, respectively. Then, the line segment between C_i and C_j is partitioned into 10 intervals. By counting the number of points located in each interval, a histogram is drawn. Based on the histogram, the decision boundary can be located so that the distance between the means of the two data sets separated by the decision boundary is maximized. All data points are then redistributed using the adjusted decision boundaries. The process repeats until the locations of the decision boundaries converge, such that the distance between l1 and l2 is maximized.
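The boundary search over projection ratios can be sketched as below. Following the text, candidate locations come from partitioning the projected range into 10 intervals; the function name and the tie-breaking rule are our own choices:

```python
import numpy as np

def best_boundary(h, n_bins=10):
    """Return the boundary location a (a projection ratio) that maximizes
    mean({h_i > a}) - mean({h_i <= a}), scanning the interior edges of
    `n_bins` equal intervals over the projected values h."""
    h = np.asarray(h, dtype=float)
    candidates = np.linspace(h.min(), h.max(), n_bins + 1)[1:-1]
    best_a, best_gap = None, -np.inf
    for a in candidates:
        hi, lo = h[h > a], h[h <= a]
        if hi.size == 0 or lo.size == 0:
            continue  # a boundary must leave points on both sides
        gap = hi.mean() - lo.mean()
        if gap > best_gap:
            best_a, best_gap = a, gap
    return best_a

# Projected ratios forming two groups along the center-to-center link.
h = [0.1, 0.2, 0.3, 0.8, 0.9]
a = best_boundary(h)
print(0.3 < a < 0.8)  # → True: the boundary falls in the gap between groups
```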
Fig. 9. Data points p_i are projected onto the line segment between cluster centers m1 and m2. The decision boundary is located at the place that partitions the projected data points (triangle points) into the two groups with the farthest mean locations l1 and l2.

Fig. 10. The flow chart of variance enhanced K-medoid clustering. At first, the general K-medoid algorithm is applied. According to the resulting clusters, the cluster-to-cluster variance is estimated. Then, the weights of the data points are updated, and a new cluster is estimated based on the weighted data points. These steps are repeated until all weights are smaller than a given threshold ρ.
3.5. Variance enhanced K-medoid clustering method

The flow chart of variance enhanced K-medoid clustering is shown in Fig. 10. At first, one medoid is estimated. Then, the number of medoids is gradually increased until the distance from each data point to its closest cluster center is smaller than a given threshold. The positions of the medoids are first estimated using the general K-medoid algorithm. Then, the 1-D data distributions are established by projecting the data points of each side-cluster onto the corresponding line segment between cluster centers. According to these 1-D data distributions, the locations of the decision boundaries are adjusted to redistribute the data points among the clusters. After the positions of the medoids converge, the weight w_i of each point p_i is calculated by the following formula:

w_i = 1 − ( 1 + min_{p_m ∈ M} sim(p_i, p_m) )^{−1},

where M is the set of estimated medoids, and sim(p_i, p_m) is the projected similarity defined as follows:

sim(p_i, p_m) = max_{p_j ∈ M, p_j ≠ p_m} ( p_i · (p_j − p_m) ) / ( a(p_j − p_m) · a(p_j − p_m) ),

where a(p_j − p_m) is the vector from p_m to the decision boundary between the clusters associated with p_m and p_j. Based on these weights, a new medoid is created to cover the highly weighted data points. By repeating these processes, the number of medoids gradually increases until the maximal weight is lower than a given threshold.
With the proposed variance enhanced K-medoid, the location of the decision boundary between each pair of cluster centers is now related to the data distribution along the line segment between the centers. A real-world application, photo album preview with variable-sized thumbnails, is proposed and implemented in Section 4 to illustrate the performance of the proposed variance enhanced K-medoid clustering method.
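The weight computation above can be sketched directly from the two formulas. The snippet below is a minimal illustration under our own assumptions, not the authors' implementation: the helper names are ours, and the decision boundary between two centers is simply placed at the segment midpoint (boundary_scale = 0.5), whereas the proposed method adjusts each boundary from the projected 1-D distributions.

```python
import numpy as np

def projected_similarity(p, m, medoids, boundary_scale=0.5):
    # sim(p, m): projection of (p - m) onto the vector a(q - m) from m to the
    # decision boundary, normalized by that vector's squared length.
    # boundary_scale = 0.5 places every boundary at the segment midpoint,
    # a simplifying assumption for this sketch.
    best = -np.inf
    for q in medoids:
        if np.array_equal(q, m):
            continue
        a = boundary_scale * (q - m)
        best = max(best, np.dot(p - m, a) / np.dot(a, a))
    return best

def weights(points, medoids):
    # w_i = 1 - (1 + min_m sim(p_i, m))^{-1}: points that project far past
    # their decision boundaries receive weights close to 1.
    w = []
    for p in points:
        s = min(projected_similarity(p, m, medoids) for m in medoids)
        w.append(1.0 - 1.0 / (1.0 + s))
    return np.array(w)

# two medoids on the x-axis; points sitting on a medoid get weight 0
pts = np.array([[0.0, 0.0], [0.1, 0.0], [4.0, 0.0]])
med = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
print(weights(pts, med))
```

Points near a medoid project only slightly toward any boundary, so their weights stay near zero and no new medoid is spawned for them.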
4. Real-world application
In this section, variance enhanced K-medoid is used to intelligently create image thumbnails, where "intelligently" means that the size and the display order of the image thumbnails are decided without human intervention. Generally, an image gallery uses thumbnails to provide a quick preview of each photo (Pao et al., 2008). However, even though a thumbnail is downsized from the original photo, the display area is often still too small to contain all thumbnails. If all thumbnails must share one size, selecting a proper size is difficult: the thumbnails may be too small to clearly represent the original images, or too large to fit them all in one page. Therefore, variable-sized thumbnails are a better alternative for fitting all thumbnails into a page without losing too much detail.
Since many images in an image gallery may be similar to each other, the images can be partitioned into groups according to their similarity. For each group, the most representative image can be displayed as a normal-sized thumbnail, and the rest of the similar images can be shown as smaller thumbnails.
A detailed description of the proposed system is given in the following sections. The color information of an image is used to create four bitmaps that show the distribution of its dominant color components. Based on these bitmaps, the similarity between two images is measured. Then, the proposed variance enhanced K-medoid is applied to cluster all images into groups.
4.1. Color clustering
Given an image I, K dominant color components are selected to establish K bitmaps, which show the coverage of the data distribution of each dominant color component. The K-means algorithm is used to cluster the color values of the pixels p_{x,y} ∈ I. Let K = 4; then four dominant colors m_i, for i = 1, . . ., 4, are estimated by the K-means algorithm, and four bitmaps I(x, y)_i are created as follows:
I(x, y)_i = \begin{cases} 1, & \text{if } \arg\min_{m_k} \mathrm{norm}(m_k, c_{x,y}) = m_i, \\ 0, & \text{otherwise,} \end{cases} \quad \text{for } i = 1, \ldots, 4,

where I(x, y)_i is the bitmap created using dominant color m_i, and c_{x,y} is the color value of the pixel at (x, y).
An example is shown in Fig. 11. The size of each image is normalized to 80 × 80 at first. Then, the K-means algorithm is used to cluster the pixels of the original image into four color component groups according to their color values, and four bitmaps are established to show the distribution of the data points in each group. These four bitmaps are used to measure the similarity between images (Sections 4.2 and 4.3).
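As a concrete sketch of this step, the snippet below builds K = 4 dominant-color bitmaps with a plain NumPy K-means. The helper names (`kmeans`, `dominant_color_bitmaps`) and the toy block-colored image are our own illustration, not the paper's code, and for brevity the 80 × 80 normalization step is omitted.

```python
import numpy as np

def kmeans(colors, k=4, iters=20, seed=0):
    # Plain K-means on an (N, 3) array of pixel colors.
    rng = np.random.default_rng(seed)
    centers = colors[rng.choice(len(colors), size=k, replace=False)].copy()
    labels = np.zeros(len(colors), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(colors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = colors[labels == j].mean(axis=0)
    return labels, centers

def dominant_color_bitmaps(image, k=4):
    # One binary bitmap per dominant color: I(x, y)_i = 1 iff the pixel's
    # nearest cluster center is the i-th dominant color.
    h, w, _ = image.shape
    labels, _ = kmeans(image.reshape(-1, 3).astype(float), k)
    labels = labels.reshape(h, w)
    return [(labels == j).astype(np.uint8) for j in range(k)]

# toy 8x8 image made of four flat color blocks
img = np.zeros((8, 8, 3))
img[:4, :4] = [255, 0, 0]
img[:4, 4:] = [0, 255, 0]
img[4:, :4] = [0, 0, 255]
img[4:, 4:] = [255, 255, 0]
bitmaps = dominant_color_bitmaps(img, k=4)
```

Because every pixel is assigned to exactly one dominant color, the four bitmaps always partition the image, whatever the clustering quality.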
4.2. Similarity measurement of two bitmaps
Given two normalized bitmaps that show the coverage areas of the data distribution of a specified color component, the similarity S(A, B) between two bitmaps A and B is defined as follows:

S(A, B) = \frac{\|\mathcal{A} \cap \mathcal{B}\|}{\|\mathcal{A} \cup \mathcal{B}\|},

where \mathcal{A} and \mathcal{B} are the sets containing the pixels with value 1 in bitmaps A and B, respectively. As shown in Fig. 12, the similarity is defined as the ratio between the overlapped area (d) and the overall area (a ∪ d ∪ b) of A (a ∪ d) and B (b ∪ d), where a, b, and d represent the shaded areas in Fig. 12.
Since four bitmaps are extracted to represent the color characteristics of a color image, the similarity between two images can be estimated based on the group similarity S_g between the two groups of bitmaps. The computation of the group similarity is performed on a best match between the elements of the bitmap groups, which can be found by the method proposed in Section 4.3. According to the best match, the group similarity S_g is defined as follows:

S_g = \sum_i \frac{\|\mathcal{A}_i\| + \|\mathcal{B}_i\|}{\sum_j \left( \|\mathcal{A}_j\| + \|\mathcal{B}_j\| \right)} \, S(A_i, B_i),
Fig. 11. Using the K-means algorithm, the pixels of an original image, for example in (a), are clustered into four dominant groups according to their color values. Then, four bitmaps, for example (b), are established to show the data distribution of the pixels in each dominant group.
Fig. 12. Given two coverage areas of data distributions a and b, the similarity between the two coverage areas is defined according to how much of the two areas is overlapped (d).
where \mathcal{A}_i and \mathcal{B}_i are the sets containing the pixels with value 1 in bitmaps A_i and B_i, respectively. That is, the group similarity S_g is the weighted sum of the similarities between each pair of bitmaps, weighted by each pair's share of the total number of set pixels.
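For illustration, both measures above can be written directly from their definitions. This is our minimal sketch (the function names are ours), and it assumes the bitmaps have already been matched one-to-one so that A_i pairs with B_i; the matching itself is the subject of Section 4.3.

```python
import numpy as np

def bitmap_similarity(A, B):
    # S(A, B) = |A ∩ B| / |A ∪ B| over the pixels set to 1 (a Jaccard index).
    inter = np.logical_and(A, B).sum()
    union = np.logical_or(A, B).sum()
    return inter / union if union else 1.0

def group_similarity(As, Bs):
    # S_g: similarities of matched pairs, each weighted by its share of the
    # total number of set pixels in both groups.
    sizes = [int(A.sum()) + int(B.sum()) for A, B in zip(As, Bs)]
    total = sum(sizes)
    return sum((sz / total) * bitmap_similarity(A, B)
               for sz, A, B in zip(sizes, As, Bs))

A = np.array([[1, 1], [0, 0]], dtype=np.uint8)
B = np.array([[1, 0], [0, 0]], dtype=np.uint8)
print(bitmap_similarity(A, B))  # 1 overlapping pixel out of 2 in the union
```

The pair weights sum to one, so S_g stays in [0, 1] whenever every pairwise S(A_i, B_i) does.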
4.3. Similarity matching between bitmap groups
For each image, four bitmaps are extracted to represent its color characteristics. To measure the similarity between two images, a one-to-one and onto match (Liu, 1985) between the elements of the two bitmap groups is performed by the following proposed method.
Given two bitmap groups \mathcal{A} = \{A_1, \ldots, A_N\} and \mathcal{B} = \{B_1, \ldots, B_N\}, where A_i and B_i are bitmaps, an N × N matrix M is formed in the following manner:

M = \begin{bmatrix} m_{11} & \cdots & m_{1N} \\ \vdots & \ddots & \vdots \\ m_{N1} & \cdots & m_{NN} \end{bmatrix}, \quad \text{where } m_{ij} = S(A_i, B_j).
An element m_{ij} represents the similarity value when A_i and B_j are considered to be matched with each other. Thus, finding a one-to-one and onto match between \mathcal{A} and \mathcal{B} is equivalent to selecting N elements of M, one from each column, such that no two elements are selected from the same row. To obtain the largest overall similarity S_g(\mathcal{A}, \mathcal{B}) of a one-to-one and onto match between the two bitmap groups, the N elements should be selected such that their sum is maximized. In order to regularize the computing procedure, row exchanges are performed to align the N selected elements on the diagonal. Thus, the similarity value between bitmap groups can be computed as the sum of the diagonal elements.
Fig. 13 shows the flow chart of the proposed method, which maximizes the sum of the diagonal elements by properly exchanging rows. Given a similarity matrix M, first find an element m_{ij} with the largest value in M. Then, swap the ith row and the jth row so that the element m_{ij} moves to position (j, j), and mark all elements in the jth row and jth column. Next, find the element with the maximal value among the unmarked elements and swap the corresponding rows so that it is moved to a diagonal position. Repeat the same process until all elements are marked.
Fig. 13. The flow chart of the proposed matching method. At the beginning, a similarity matrix is constructed. The order of the rows is then initialized by greedily exchanging the largest elements to diagonal positions. Then, further checking steps are applied to the diagonal elements to maximize the sum of the diagonal elements.

Although the procedure stated above is efficient, its results may not be good enough. Therefore, a checking step is applied to each pair of rows and columns to see whether any possible row exchange can further maximize the sum of the diagonal elements in the similarity matrix M. The checking step is described as follows:
IF mii + mjj > mij + mji
THEN exchange the ith and jth rows.
This checking step is performed repeatedly until no further row exchange can increase the sum of the diagonal elements. In other words, a near-maximum similarity value between bitmap groups \mathcal{A} and \mathcal{B} is achieved.
An example of the one-to-one onto match method is shown in Fig. 14. At first, since m_{11} = 178 is the largest value in the matrix, the elements in the first row and first column are marked. Since m_{23} = 169 is the largest value among the unmarked elements, the second and third rows are exchanged so that it moves to a diagonal position, and its row and column are then marked as well. A similar process is repeated until all columns are marked. In the checking step, the second and fourth rows are exchanged, since 80 + 143 > 60 + 162.
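The greedy ordering and the checking step can be sketched as follows. This is our reconstruction of the procedure (the function name and the permutation bookkeeping are ours), not the authors' code.

```python
import numpy as np

def ost_match(M):
    # Greedy row ordering (OST): repeatedly move the largest element whose
    # row and column are both unmarked onto the diagonal by a row swap, then
    # apply pairwise row exchanges while m_ii + m_jj < m_ij + m_ji.
    # Returns the row permutation and the resulting diagonal sum.
    M = M.astype(float).copy()
    n = M.shape[0]
    perm = list(range(n))  # perm[r] = original index of the row now at r
    marked = set()         # marked row/column indices (always identical sets)
    for _ in range(n):
        best, bi, bj = -np.inf, -1, -1
        for i in range(n):
            if i in marked:
                continue
            for j in range(n):
                if j in marked:
                    continue
                if M[i, j] > best:
                    best, bi, bj = M[i, j], i, j
        M[[bi, bj]] = M[[bj, bi]]  # element (bi, bj) lands at (bj, bj)
        perm[bi], perm[bj] = perm[bj], perm[bi]
        marked.add(bj)
    # checking step: exchange rows whenever it raises the diagonal sum
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                if M[i, i] + M[j, j] < M[i, j] + M[j, i]:
                    M[[i, j]] = M[[j, i]]
                    perm[i], perm[j] = perm[j], perm[i]
                    improved = True
    return perm, float(np.trace(M))
```

On the matrix [[3, 2], [2, 0]] the greedy pass alone keeps the diagonal sum at 3, and the checking step then exchanges the two rows to reach 4, mirroring the 80 + 143 > 60 + 162 exchange in the example above.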
In order to evaluate the performance of the proposed one-to-one and onto match algorithm, synthetic test data sets are generated by a random number generator. To evaluate similarity matrices of different sizes, 10 groups of test data sets with row dimensions from 2 to 11 are generated. For each group, random seeds from 0 to 5999 are used to generate the test data.
The test results are listed in Table 1. OST (optimized sorting test) denotes the proposed method, and TAT (testing all test) denotes an exhaustive search method that validates all possible solutions. The TAT results are considered as
Fig. 14. An example of the proposed matching method. The similarity matrix constructed from the pair-wise similarity values is shown in (a). At first, an element with the largest value among the unmarked elements in the similarity matrix is selected, the selected element is swapped to a diagonal position, and the elements in its row and column are marked; this is iterated until all elements are marked. The result is shown in (b). Then, the checking steps are applied to see whether any row exchanges can be applied to maximize the sum of the diagonal elements. The final results are shown in (c).
Table 1
The performance comparison of the proposed one-to-one match algorithm (OST) and the exhaustive search method (TAT). To evaluate similarity matrices of different sizes, 10 groups of test data sets with row dimensions from 2 to 11 are generated. The data listed in the table are the total computing times in milliseconds spent by the matching process; TAT/OST is the ratio of the times spent by the two methods.

Matrix dimension   TAT (ms)        OST (ms)   TAT/OST
2                  206             175        1.2
3                  295             214        1.4
4                  855             216        4.0
5                  5440            228        23.9
6                  40,853          232        176.1
7                  349,117         255        1369.1
8                  3,582,596       301        11,902.3
9                  39,197,125      327        119,868.9
10                 469,318,748     362        1,296,460.6
11                 6,219,635,476   408        15,244,204.6
the ground truth. The performance is measured as the total computing time, in milliseconds, to process all the test data. Apparently, the proposed method is quite efficient. Moreover, on all 60,000 test cases (10 groups of 6000 random seeds), the proposed method generates the same results as the ground truth.
4.4. Image clustering by the enhanced K-medoid method
Given an image set \mathcal{I} = \{I_1, \ldots, I_N\}, a matrix \mathcal{M} is constructed by measuring the similarity between each pair of images using the methods proposed in Sections 4.1, 4.2, and 4.3. The image similarity matrix \mathcal{M} is defined as follows:
\mathcal{M} = \begin{bmatrix} m_{11} & \cdots & m_{1N} \\ \vdots & \ddots & \vdots \\ m_{N1} & \cdots & m_{NN} \end{bmatrix}, \quad \text{where } m_{ij} = S_g(I_i, I_j).
Fig. 15. According to the web-based image clustering prototype system, the proposed method successfully clusters similar images together. In this demonstration, the similarity threshold is set as 0.35. That is, the similarity between a data object and the representative object (cluster center) of a cluster should be larger than 0.65 (= 1 − 0.35).
Then, the proposed variance enhanced K-medoid model clusters a set
of images using the pair-wise image similarities Sg represented in
the image similarity matrix M. The image corresponding to the
medoid of a data cluster is selected as the representative image
for each image cluster.
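As a small illustration of this last step, the sketch below picks the representative (medoid) of a cluster from a pairwise similarity matrix. The helper name and the toy matrix are hypothetical, and the full variance enhanced clustering loop of Section 3.5 is omitted.

```python
import numpy as np

def medoid_index(M, members):
    # Within a cluster (a list of indices into the similarity matrix M),
    # the medoid is the member with the largest total similarity to the
    # other members; its image becomes the cluster's representative.
    sub = M[np.ix_(members, members)]
    return members[int(sub.sum(axis=1).argmax())]

# hypothetical similarity matrix for four images (m_ij = Sg(Ii, Ij))
M = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.7, 0.2],
              [0.8, 0.7, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])
rep = medoid_index(M, [0, 1, 2])
```

Maximizing total similarity to the other members is the similarity-matrix counterpart of minimizing total distance, so the selected image is the one most representative of its cluster.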
5. System demonstration
A web-based prototype system (http://www.csie.nctu.edu.tw/~pslai/ASKMDDemo) is implemented to demonstrate the proposed method for public trial and testing. Fig. 15 shows the web-based user interface for image clustering. As shown in Fig. 15, a user may select a test media set (http://www.youtube.com/watch?v=MgpzUo_kbFY) from the selector at the top-right corner of the web interface. The threshold, which represents the maximal allowed difference between data objects and their cluster center, is selected by clicking at a proper position on the ruler at the top-left corner. Then, the images in the test media set are clustered, and the resulting clusters are presented in order of cluster size. For each cluster, the representative image is displayed at a larger size, and the rest of the images in the cluster are shown as smaller-sized thumbnails.
Figs. 15–17 show the test results using three different thresholds on the same media data set. In general, using a higher threshold may reduce the number of clusters. More specifically, enlarging the similarity threshold may reduce the similarity among the data elements in certain clusters, as those clusters receive more not-so-similar data elements; that is, the maximum allowed variance is increased. However, for data clusters containing quite similar data elements, enlarging the similarity threshold only slightly increases the number of less similar data elements, so the maximal allowed variance has little influence on these clusters. In addition, noisy data elements are clustered into certain less-similar clusters instead of spreading over all clusters.
Fig. 16. By increasing the similarity threshold to 0.4, certain data clusters receive several not-so-similar data objects (red frame A). However, the data clusters that previously contained quite similar data elements still have almost the same similarity among their members (clusters marked with B). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)
Fig. 17. By further increasing the similarity threshold to 0.45, the data clusters that previously contained some not-so-similar objects grow to accept more dissimilar objects (red frame A). However, the data clusters that previously contained quite similar objects still contain very similar objects (clusters marked with B). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)
6. Conclusion
Generally speaking, grouping similar data together helps users browse data efficiently. However, the similarity of data is difficult to define; often, the definition of similarity between data objects depends on the application. On the other hand, most clustering algorithms are developed for data sets that are represented as data points in a feature space. That is, before clustering similar data, a feature extraction process is applied to each data object to represent it as a point in feature space. However, since the definition of similarity differs across applications, mapping data objects into a feature space reasonably can be difficult and complicated. Most of the time, proper feature extraction becomes a bottleneck in application system development.
Instead of mapping data objects to points in a feature space, another way to express the similarity among data objects is to measure their pairwise similarity. Describing the similarity between two data objects is more intuitive than mapping data objects into a feature space, since computing the similarity between a pair of data objects only requires the relationship between those two objects.
A popular clustering algorithm, called K-medoid, is often used to cluster data objects using the pairwise similarities between data objects. The K-medoid clustering algorithm groups data objects into clusters by maximizing the sum of the similarities between the data objects and their cluster centers. That is, every cluster is implicitly assumed to have the same variance.
In this paper, variance enhanced K-medoid clustering is proposed to introduce the idea of data variance into the original K-medoid algorithm. In addition, multiple variances are used to describe the data variances along the line segments between cluster centers; that is, the variance of a cluster along different orientations is considered.
An intelligent image thumbnail system is prototyped to demonstrate the proposed variance enhanced K-medoid clustering method. For each image, the regions covered by four dominant colors are segmented first. The differences between the four dominant color regions of two images are measured as the similarity between the images. Then, the variance enhanced K-medoid model is applied to these pairwise image similarities to group the images into clusters, and the image at each cluster center (medoid) is used as the key image of its group. The web-based thumbnail demonstration shows that the system successfully groups similar images together and that the key image is a good representative of the images in its cluster. By increasing the similarity threshold of the proposed clustering method, the number of clusters decreases. Clusters that contained quite similar images before the change only change slightly, i.e., they receive a few more images, whereas clusters that contained some not-so-similar images grow bigger because more outlier images are received. Such properties of the proposed model are useful in applications such as selecting representative frame sets from video for video summarization.
References
Bilmes, J. (1997). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Tech. Rep. ICSI-TR-97-021, University of California, Berkeley.
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature
space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(5), 603–619.
Fu, H.-C., Lai, P.S., Lou, R.S., & Pao, H.T. (2000). Face detection and eye localization by
neural network based color segmentation. In: Neural networks for signal
processing X, 2000. Proceedings of the 2000 IEEE signal processing society
workshop (Vol. 2, pp. 507–516).
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection.
The Journal of Machine Learning Research, 3, 1157–1182.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Haykin, S. (1994). Neural Networks, A Comprehensive Foundation. Macmillan College
Publishing Company, Inc..
Lai, P.-S. & Fu, H.-C. (2007). A polygon description based similarity measurement of
stock market behavior. In: IEEE congress on evolutionary computation, 2007 (CEC
2007) Singapore (pp. 806–812).
Lai, P.-S., & Fu, H.-C. (2010). A polygon description based similarity measurement of
stock market behavior. Expert System with Applications, 37(2), 1113–1123.
Liu, C. L. (1985). Elements of discrete mathematics. New York: McGraw-Hill. Ch. 4, p.
126.
Pao, H., Chen, Y., Lai, P., Xu, Y., & Fu, H.-C. (2008). Constructing and application of
multimedia tv-news archives. Expert Systems with Applications, 35, 1444–1450.
Pao, H., Chuang, S., Xu, Y., & Fu, H.-C. (2008). An EM based multiple instance learning method for image classification. Expert Systems with Applications, 35, 1468–1472.
Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-
rated predictions. Machine Learning, 80–91.
The demonstration of variance enhanced K-medoid model. Available from http://www.csie.nctu.edu.tw/~pslai/ASKMDDemo/.
The test video. Available from http://www.youtube.com/watch?v=MgpzUo_kbFY.