International Journal of Advanced Network, Monitoring and Controls Volume 03, No.03, 2018 DOI: 10.21307/ijanmc-2019-002

Application of K-means Algorithm in Geological Disaster Monitoring System

Wang Jianguo
College of Computer Science and Engineering, Xi'an Technological University
No.2 Middle Xuefu Road, Weiyang District, Xi'an, 710021, China
e-mail: wjg_xit@126.com

Xue Linyao *
College of Computer Science and Engineering, Xi'an Technological University
No.2 Middle Xuefu Road, Weiyang District, Xi'an, 710021, China
e-mail: 1525610807@qq.com

Abstract—The K-means algorithm is one of the most important unsupervised machine learning methods in clustering; it divides a dataset into k subclasses that differ from each other as much as possible. Because the K-means algorithm is simple and efficient, it is applied in data mining, knowledge discovery and other fields. This paper proposes the CMU-kmeans algorithm, which combines an improved UPGMA algorithm with an improved Canopy algorithm. The experimental results show that the algorithm can not only determine the number k of initial clustering centers adaptively, but also avoid the influence of noise data and edge data. In addition, the improved algorithm avoids the effect that random selection of the initial centers has on the clustering, so the result reflects the actual distribution of the dataset.

Keywords-Clustering Analysis; CMU-kmeans Algorithm; Geological Disaster Monitoring Data

I. INTRODUCTION

Geological disasters cause heavy casualties; the main triggers include landslides, debris flows and rainfall. These disasters damage public facilities on scales large and small and bring great losses to people and their property, and such cases remain common in China. Faced with this severe threat, the state and the government have invested considerable human and material resources in the prevention and control of geological disasters and have achieved remarkable results. With the progress of technology and the rapid development of information technology, many new detection devices, such as GPS, secondary sound wave monitoring and radar, have been put into real-time geological disaster monitoring. As detection technology develops, the amount of monitoring data grows by leaps and bounds, and the data types become more and more complex. K-means is a classic clustering algorithm that is widely used in industrial and commercial applications; as is well known, it has both notable advantages and notable disadvantages. In this paper, we mainly study the optimization of the initial clustering centers and the avoidance of blindness in the choice of k, and we propose the CMU-kmeans algorithm. The data source of the study is the historical data collected by a geological disaster monitoring system: 2000 records are randomly selected from the rainfall data of different areas in Shaanxi Province as the research object, serving as a representative sample for the improved K-means clustering algorithm. The experimental results show that the improved algorithm not only eliminates the sensitivity to the initial input and improves the stability and effectiveness of the algorithm, but also determines the number k of initial clustering centers automatically, which improves the simplicity and operability of the algorithm.

A. Overview of K-means algorithm
The K-means algorithm is a classical unsupervised clustering algorithm. Its purpose is to divide a given dataset containing n objects into k clusters so that objects within a cluster are as similar as possible, while objects in different clusters are as dissimilar as possible. Let the sample set be X = {x1, x2, x3, ..., xn}, where n is the number of samples. The idea of the K-means algorithm is as follows: k data objects are randomly selected from the sample set X as the initial clustering centers; each data object is then allocated to the most similar cluster according to its similarity to each of the k clustering centers; the mean of each new cluster is recalculated and taken as the next clustering center; and the process is repeated until the cluster centers no longer change after an update, that is, until the criterion function E converges. The goal is to maximize the similarity of objects within a cluster and minimize the similarity of objects in different clusters. The degree of similarity between data objects can be determined by calculating the Euclidean distance between them. For an m-dimensional real vector space, the Euclidean distance between two points is defined as formula (1):

$d(x, y) = \sqrt{\sum_{i=1}^{m}(x_i - y_i)^2}$  (1)

Here, $x_i$ and $y_i$ are the attribute values of x and y respectively, and the criterion function is defined as formula (2):

$E = \sum_{j=1}^{k}\sum_{x \in c_j}\|x - \bar{x}_j\|^2$  (2)

Here, k is the total number of clusters, and $\bar{x}_j$ is the center of cluster $c_j$. The flow of the K-means algorithm is shown in Fig. 1.

Figure 1. K-means clustering algorithm flow chart
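To make the iteration in Fig. 1 concrete, the following is a minimal Python sketch of the procedure just described: random initial centers, Euclidean assignment by formula (1), and mean updates until the centers stop changing. The function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means as described above: random initial centers,
    Euclidean assignment, mean update until convergence."""
    rng = np.random.default_rng(seed)
    # Randomly select k distinct data objects as the initial clustering centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Allocate each object to the nearest center (Euclidean distance, formula (1)).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recalculate the mean of each new cluster as the next clustering center;
        # an empty cluster keeps its previous center.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Stop when the centers no longer change, i.e. the criterion E has converged.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For instance, `centers, labels = kmeans(np.random.rand(200, 4), k=3)` partitions 200 four-dimensional points into three clusters; the random initialization in the first line is exactly the step the rest of this paper seeks to improve.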
B. Research status quo of K-means algorithm

For its advantages, the K-means algorithm has been widely used in practice, but it has many shortcomings as well. In order to obtain a better clustering effect, many researchers have explored ways of improving K-means, especially its weakness in selecting the initial points. Duan Guiqin [1] optimizes the initial clustering centers with a method based on the product of the mean and the maximum distance: the algorithm first adds the data objects that are farthest apart in the sample set to the clustering center set, and then repeatedly adds the data object that maximizes the product of the mean and the distance to the current clustering centers, which improves the accuracy. Yi Baolin et al. [2] proposed another improved K-means algorithm, which first calculates the density of the region to which each data object belongs and then selects k points in the high-density regions as the initial centers; the experimental results show that the algorithm reduces the impact of the initial center points. Yiu-Ming Cheung [3] and others proposed a new clustering technique called the k*-means algorithm. It consists of two separate steps: in the first step a center point is provided for each cluster, and in the second step the units are adjusted through adaptive learning rules. The algorithm overcomes the initial-center sensitivity of K-means and the blindness of the choice of k, but its computation is complicated. Xie et al. [4] proposed a k-means algorithm that optimizes the initial clustering centers by using the minimum variance based on the compactness information of the sample space distribution. The algorithm chooses the samples with the smallest variance that lie a certain distance away from each other as the initial clustering centers. Liu Jiaxing et al. [5] proposed a radius-based k-means+λ algorithm: when the initial center points are selected, the distance ratios between points are calculated from the λ parameter, and within a circle of a specific radius an initial center point is selected according to the distance ratio; the algorithm performs better in error rate and running time. Ren Jiangtao [6] proposed an improved K-means algorithm for text clustering that uses feature selection and dimension reduction, sparse vector selection, and an initial center point search based on density and spreading, which improves clustering accuracy, stability and other aspects.

C. The performance analysis of K-means algorithm

The K-means clustering algorithm uses the Euclidean distance to measure the distance between sample points. For convex, spherical data distributions the clustering effect is good, and the algorithm has been widely used in many fields. However, the Euclidean distance criterion also has limitations: for more complicated or non-convex data, the clustering effect is often unsatisfactory. During the iterative process, if the termination criterion is not met, the cluster centers are recalculated as means, an operation that also improves the convergence rate of the algorithm. In summary, the K-means clustering algorithm has the following advantages and disadvantages.

1) The main advantages of the K-means algorithm:

a) The K-means clustering algorithm has high stability and scalability, and its clustering effect is good.

b) The results are intuitive and easy to understand. When the target data are numerical, their geometric meaning is very clear; when clustering images and texts, the extracted eigenvalues can be regarded as clustering result values for ease of interpretation.

c) When dealing with numerical data sets, the order of the input data does not affect the clustering result.

d) It judges well data sets whose clusters are convex.

2) The main shortcomings of the K-means algorithm:

a) The value of k in the K-means algorithm needs to be given in advance. According to the predetermined k, the samples are divided into k classes so that the sum of squared distances from all samples to their cluster centers is minimized.

b) The clustering results depend heavily on the selection of the initial clustering centers. The K-means algorithm selects the initial clustering centers at random; if they are chosen improperly, it is difficult to obtain an ideal clustering effect. This dependence on the initial values may lead to instability of the clustering results, and the algorithm easily falls into a local rather than the global optimum.

c) It is sensitive to noise and isolated points.

d) The time complexity of the algorithm is large.

II. IMPROVEMENT OF K-MEANS ALGORITHM AND ITS APPLICATION

Aiming at the shortcomings of the traditional K-means algorithm, this paper mainly improves the optimization of the initial clustering centers to enhance the clustering effect.

A. The selection of data objects in cluster analysis
When selecting data, preliminary data are collected first; the characteristics of the data are then studied to identify data quality and to form a basic view of, or hypotheses about, the implied information, so as to single out the subset of data of interest. The segmentation variables of the data objects determine the formation of the clusters, which in turn affects the correct interpretation of the clustering results and ultimately affects the stability of the clusters after new data objects are added. Before K-means clustering data mining, the sample data set relevant to the clustering analysis should be extracted from the original data object set; it is not necessary to use all the historical data. In addition, attention should be paid to data quality: only high-quality data can lead to correct analysis conclusions and provide a scientific basis for clustering.

The source of this research object is the historical monitoring data of the geological disaster monitoring system. From the geological monitoring records of 2016 to 2017, 2000 rainfall records randomly selected from different regions are taken as the object of study, serving as a representative sample for the improved K-means clustering algorithm. The sample data attributes are shown in Table I:

TABLE I. THE SAMPLE DATA ATTRIBUTES

Field number  Field name   Field code  Type of data
1             Id           xx          Number
2             Sno          yy          Varchar
3             Type         type        Varchar
4             Gettime      time        Datetime
5             Alarm Level  alarm       Integer
6             Value        value       Double
7             Day Value    d_value     Double

For the cluster analysis, the data attributes of the geological hazard monitoring system above clearly contain redundant ones that do not contribute to an objective cluster analysis, so the redundant attributes should be eliminated. Finally, only four data object attributes reflecting the characteristics of the rainfall data are selected as the research object. The optimized data attributes are shown in Table II:

TABLE II. THE OPTIMIZED DATA ATTRIBUTES

Field number  Field name  Field code  Type of data
1             Id          xx          Number
2             Sno         yy          Varchar
3             Gettime     time        Datetime
4             Day Value   d_value     Double

B. Improvement of K-means algorithm

From the above study of the status quo, it is not difficult to see that most of the improvements above optimize only a single defect of the traditional K-means algorithm. Although these improvements optimize the algorithm to some extent, many shortcomings remain. Given the characteristics of the rainfall data of the geological disaster monitoring system above, the K-means algorithm is very sensitive to the initialization centers: a poorly chosen initial clustering center easily drives the clustering result into a local optimum, and the influence of isolated points is large. In this paper, simple random sampling is first used to reduce the scale of the original data set, and then the improved UPGMA algorithm and the improved Canopy algorithm are combined to propose the CMU-kmeans algorithm. The improved algorithm selects the k points that are farthest from each other within high-density regions as the initial clustering centers, according to the regional density of each data object, so that the improved K-means algorithm can produce high-quality clustering results. The results show that the sensitivity to the initial centers is eliminated, and the stability and validity of the algorithm are improved.
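As a small illustration of the sampling step just mentioned, the following is a minimal sketch assuming the rainfall records are held in a NumPy array; the function name and the sampling fraction are illustrative, not values from the paper.

```python
import numpy as np

def reduce_by_random_sampling(data, fraction=0.1, seed=0):
    """Simple random sampling to reduce the scale of the original data set.

    `data` is an (n, m) array of rainfall records; `fraction` is the share
    of records kept for the initialization stage (an illustrative value).
    """
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(data) * fraction))
    idx = rng.choice(len(data), size=n_keep, replace=False)
    return data[idx]
```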
1) Improved UPGMA algorithm

a) The basic idea of the improved UPGMA algorithm

At the beginning of the UPGMA algorithm, each data object in the sample data set is considered a separate class; the distance between every two data objects is calculated to obtain the distance matrix, the two closest data objects are merged into a new subclass, and the process is repeated. The UPGMA algorithm stops when no new class is generated or the stop condition is satisfied. It can be observed that the first subclasses are usually located in the dense areas of the data set, so the subclass centers selected by this algorithm can serve as candidate points of the initial clustering centers for the next step. In this way, the selection of the initial clustering centers is optimized and its accuracy is improved. The distance between two data objects is measured by the Euclidean distance formula, as in formula (3):

$d(X_i, X_j) = \sqrt{\sum_{k=1}^{m}(X_{ik} - X_{jk})^2}$  (3)

Here, Xi and Xj represent data objects in the sample data set:

Xi = {Xi1, Xi2, ..., Xik, ..., Xim}, k = 1, 2, ..., m
Xj = {Xj1, Xj2, ..., Xjk, ..., Xjm}, k = 1, 2, ..., m

The center of a subclass is calculated as in formula (4):

$Z = \frac{1}{n}\sum_{j=1}^{n} X_j$  (4)

Here, n refers to the number of data objects contained in a subclass, and Xj refers to a data object in the subclass.

b) The description of the improved UPGMA algorithm

Input: all data in the sample data set; parameters m, p and a sequence Q;
Output: initial clustering center candidate points.

(1) Set each data object as a separate class;
(2) Calculate the distance between every two data objects, merge the nearest two classes into a new subclass, and determine whether subclasses containing no more than m% of the total amount of data continue to be produced; if not, go to (4);
(3) For (i = 1 to maxcluster) {
      for (j = i + 1 to maxcluster) {
        If the number of data objects in subclasses i and j is less than or equal to m% of the total amount of data, calculate the distance between them to obtain the distance matrix.
      }
    }
    Find the nearest two subclasses i and j, merge them into a new subclass, add the new subclass to the end of the sequence Q, and go to (2);
(4) Select the first p% of subclasses in the sequence Q as the candidate subclasses, and calculate the centers of all candidate subclasses as the initial clustering center candidate points.

Using the advantage of the improved UPGMA hierarchical clustering algorithm, the dense regions of the data set can be found, which prevents edge data and noise data from becoming initial center candidate points. At the same time, considering the relative density of regions, new clustering conditions and filter conditions are proposed to modify the traditional UPGMA algorithm, so that the generation of subtrees can be stopped at different clustering levels to adapt to the actual density distribution of the data set. But the improved UPGMA algorithm also has shortcomings: if the values of m% and p% are not set properly, the selected initial clustering center candidate points may be too dense and concentrated. The Canopy algorithm, which introduces the idea of the maximum-minimum distance, can select data points that are far apart from each other, and thus makes up for this deficiency of the improved UPGMA algorithm.
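The following is a minimal Python sketch of steps (1)-(4) under two stated assumptions: the distance between subclasses is approximated here by the Euclidean distance between their centers (formula (4)) rather than by full average linkage, and `m_pct`/`p_pct` stand for the paper's m% and p%. All names and default values are illustrative.

```python
import numpy as np
from itertools import combinations

def upgma_candidates(X, m_pct=5.0, p_pct=20.0):
    """Sketch of the improved UPGMA step: repeatedly merge the two nearest
    small subclasses, record each merge in the sequence Q, and return the
    centers of the first p% of Q as initial-center candidate points."""
    limit = m_pct / 100.0 * len(X)           # subclasses above m% stop merging
    clusters = [[i] for i in range(len(X))]  # step (1): each object is its own class
    Q = []                                   # sequence of produced subclasses

    def center(c):
        return X[c].mean(axis=0)             # subclass center, formula (4)

    while True:
        # Steps (2)-(3): only pairs of subclasses no larger than m% may merge.
        pairs = [(a, b) for a, b in combinations(range(len(clusters)), 2)
                 if len(clusters[a]) <= limit and len(clusters[b]) <= limit]
        if not pairs:
            break                            # no small subclass can still merge
        a, b = min(pairs, key=lambda p:      # nearest pair by Euclidean distance
                   np.linalg.norm(center(clusters[p[0]]) - center(clusters[p[1]])))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        Q.append(merged)                     # add the new subclass to the end of Q
    # Step (4): the first p% of subclasses in Q supply the candidate centers.
    keep = max(1, int(p_pct / 100.0 * len(Q)))
    return np.array([center(c) for c in Q[:keep]])
```

Because the earliest merges happen between the closest points, the front of Q corresponds to the densest regions, which is why taking its first p% keeps candidates away from edge and noise data.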
Therefore, it is necessary to introduce the Canopy algorithm to ensure that the distribution of the initial clustering centers is decentralized, so that it correctly reflects the data distribution of the original data set.

2) Improved Canopy algorithm

In order to keep the clustering process from becoming locally optimal, the Canopy step should make the spacing between center points as large as possible. The maximum-minimum distance method [30] is a test-based algorithm in the field of pattern recognition. Its basic idea is to take objects that are as far apart as possible as cluster centers, trying to obtain a better initial division. The algorithm not only determines the number k of initial clusters intelligently, but also improves the efficiency of dividing the initial data set.

a) The description of the improved Canopy algorithm

The Euclidean distance is used to measure the degree of dissimilarity between data objects. Let the data set be S = {X1, X2, ..., Xn}, and let the initial cluster center set be V = {v1, v2, ...}. The improved Canopy algorithm is described as follows:

Input: the initial clustering center candidate points output by the improved UPGMA algorithm; the parameter θ;
Output: the optimized initial clustering centers.

(1) Arbitrarily select a data object from the data set S as the first cluster center v1 and put it into V;
(2) Calculate the distances between v1 and all the remaining data objects in S, and put the farthest data object into V as the second cluster center v2;
(3) For each data object Xi remaining in S, calculate its distance to every cluster center in V, and take the smallest of these distances, denoted Min(Di);
(4) Select the maximum among all the Min(Di), denoted Max(Min(Di)), and regard the corresponding data object Xi as the candidate cluster center; then judge it by the discriminant Max(Min(Di)) > θ||v1 - v2||. If the condition is satisfied, Xi is added to the initial clustering center set V; otherwise go to (6);
(5) Go to (3);
(6) Output the optimized initial clustering centers.

The most critical step in the improved Canopy algorithm is step (4), which takes the data object corresponding to Max(Min(Di)) as the candidate for the new clustering center, thus avoiding the selection of a point that is close to one existing clustering center even though it is far from the others. Therefore, the algorithm ensures that each new clustering center is far from all existing clustering centers.

b) The analysis of advantages and disadvantages of the improved Canopy algorithm

The improved Canopy algorithm can use the k data objects farthest from each other in the data set as the initial clustering centers, avoiding a distribution of initial clustering centers that is too concentrated and dense. On the one hand, however, it may select noise data and edge data, making the algorithm prone to fall into a local optimum instead of the global optimum. On the other hand, if the sample size of the whole data set is n, then each time a new cluster center is to be found, the database must be scanned once to find, for each object, the distance to its nearest existing cluster center, and scanned again to obtain the maximum-minimum distance, so a total of about 2n distance calculations are needed. The time complexity of the improved Canopy algorithm is therefore O(nk) when k clustering centers are to be found in the end. Thus the computational complexity depends on the size of n; large databases usually contain thousands of objects, so if the improved Canopy algorithm is applied to the original data set directly, the implementation efficiency is low and the required storage space increases significantly.
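The sketch below illustrates steps (1)-(6), assuming the candidate points from the improved UPGMA step are passed in as a NumPy array; `theta` and all names are illustrative placeholders, not values from the paper.

```python
import numpy as np

def canopy_maxmin_centers(candidates, theta=0.5, seed=0):
    """Sketch of the improved Canopy step: pick centers by the maximum-minimum
    distance rule until the discriminant Max(Min(Di)) > theta * ||v1 - v2|| fails."""
    rng = np.random.default_rng(seed)
    S = list(range(len(candidates)))
    # (1) Arbitrarily pick a candidate as the first center v1.
    v = [S.pop(int(rng.integers(len(S))))]
    # (2) The candidate farthest from v1 becomes the second center v2.
    d_to_v1 = np.linalg.norm(candidates[S] - candidates[v[0]], axis=1)
    v.append(S.pop(int(d_to_v1.argmax())))
    ref = np.linalg.norm(candidates[v[0]] - candidates[v[1]])  # ||v1 - v2||
    while S:
        # (3) Min(Di): each remaining point's distance to its nearest chosen center.
        d = np.linalg.norm(candidates[S][:, None, :] - candidates[v][None, :, :], axis=2)
        min_d = d.min(axis=1)
        # (4) Take the point with the largest Min(Di) as the next candidate center.
        i = int(min_d.argmax())
        if min_d[i] > theta * ref:
            v.append(S.pop(i))   # discriminant satisfied: accept the new center
        else:
            break                # (6) otherwise stop; k is determined automatically
    return candidates[v]         # optimized initial clustering centers, k = len(v)
```

Note how k falls out of the loop rather than being preset: the procedure keeps accepting centers only while the farthest remaining point is still sufficiently far, relative to theta, from every accepted center.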
3) CMU-kmeans algorithm

Generally, to fully reflect the distribution of data in the entire data set, every cluster center should lie in a high-density area of the data set and the centers should be dispersed as much as possible. Based on these considerations, this paper proposes the CMU-kmeans algorithm, which combines the improved UPGMA algorithm and the improved Canopy algorithm to obtain optimized initial clustering centers, and then applies these centers in the k-means algorithm to enhance the clustering effect. The improved UPGMA algorithm is used to find the high-density regions, so that the selected initial clustering center candidate points stay away from noise data and edge data; the improved Canopy algorithm is used to prevent the initial clustering center distribution from being too concentrated and dense, ensuring that the distances between the cluster center points are large, which fully reflects the overall distribution of the data set. Therefore, the improved UPGMA algorithm and the improved Canopy algorithm complement each other, so that the initial clustering centers selected by the algorithm are far apart from each other and all lie in high-density regions of the data set. To sum up, the CMU-kmeans algorithm is as follows.

a) The initialization of the cluster centers:
• Improved UPGMA algorithm: obtain the initial clustering center candidate points;
• Improved Canopy algorithm: obtain the appropriate initial clustering centers;
b) K-means algorithm iteration;
c) The assessment of the clustering results.

As shown in Fig. 2, the framework of the CMU-kmeans algorithm is divided into three phases. The first stage is the initialization optimization algorithm, which is the most important part of the improvement; its purpose is to intelligently capture the optimal initial clustering seeds and the optimal initial number of clusters of the data set distribution. The second stage is the main body of the algorithm, in which the K-means algorithm clusters the whole data set and obtains the clustering result. The third stage is the experimental evaluation, which verifies the validity of the proposed CMU-kmeans algorithm. A sketch of the whole pipeline follows the figure.

Figure 2. CMU-kmeans algorithm framework
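To show how the three stages fit together, the following sketch composes the illustrative functions defined in the earlier sections (reduce_by_random_sampling, upgma_candidates, canopy_maxmin_centers); it is a schematic of the pipeline under those assumptions, not the authors' implementation, and every parameter value is a placeholder.

```python
import numpy as np

def cmu_kmeans(X, m_pct=5.0, p_pct=20.0, theta=0.5, sample_fraction=0.1):
    """Schematic CMU-kmeans pipeline built from the sketch functions above."""
    # Stage 1a: reduce the data scale by simple random sampling, then find
    # candidate centers in the high-density regions (improved UPGMA).
    sample = reduce_by_random_sampling(X, fraction=sample_fraction)
    candidates = upgma_candidates(sample, m_pct=m_pct, p_pct=p_pct)
    # Stage 1b: spread the centers out with the max-min rule (improved Canopy);
    # this also fixes the number of clusters k automatically.
    centers = canopy_maxmin_centers(candidates, theta=theta)
    k = len(centers)
    # Stage 2: run standard K-means on the whole data set, seeded with the
    # optimized initial centers instead of random ones.
    labels = iterate_from_centers(X, centers)
    return k, labels

def iterate_from_centers(X, centers, max_iter=100):
    """K-means iteration from given initial centers (cf. the sketch in Sec. I.A)."""
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```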
The CMU-kmeans algorithm proposed in this paper can effectively reduce the dependency of the k-means algorithm on the selection of the initial clustering centers. For data sets with uneven data distribution, on the one hand it avoids initial clustering centers that are too dense; on the other hand it avoids initial clustering centers that are too scattered, or that even include noise data and edge data, which improves the stability and validity of the algorithm. At the same time, the number k of initial clustering centers can be determined automatically without being preset, which improves the simplicity and operability of the algorithm.

III. EXPERIMENT ANALYSIS

A. Experimental description

The data sets selected for the experiment are the rainfall data collected by the geological hazard detection system and the rainfall data set with artificial noise added. The experimental environment is: Intel(R) Core(TM) i3-2330M, 4G RAM, 250G hard disk, Win7 operating system. In order to verify the validity and stability of the algorithm, the traditional K-means clustering algorithm, the improved Canopy algorithm and the CMU-kmeans algorithm are compared on the rainfall data set. The clustering result of the traditional k-means algorithm is an average of 10 executions. The performance of the algorithms is evaluated by the precision and the recall of the clustering results.

B. Performance evaluation criteria

The traditional k-means algorithm, the improved Canopy algorithm and the CMU-kmeans algorithm proposed in this paper are evaluated by the commonly used measures of clustering quality, namely precision and recall, defined as follows:

P(i, j) = precision(i, j) = Ni,j / Ni  (5)
R(i, j) = recall(i, j) = Ni,j / Nj  (6)

Here, Ni,j represents the number of objects of class i in cluster j; Ni is the number of all objects in class i; Nj is the number of all objects in cluster j.
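A minimal sketch of the two measures exactly as defined in (5) and (6); the label arrays and the function name are illustrative.

```python
import numpy as np

def precision_recall(true_class, cluster, i, j):
    """Precision and recall for class i vs. cluster j, formulas (5) and (6).

    `true_class` and `cluster` are equal-length integer label arrays.
    """
    true_class = np.asarray(true_class)
    cluster = np.asarray(cluster)
    n_ij = np.sum((true_class == i) & (cluster == j))  # objects of class i in cluster j
    n_i = np.sum(true_class == i)                      # all objects in class i
    n_j = np.sum(cluster == j)                         # all objects in cluster j
    return n_ij / n_i, n_ij / n_j                      # P(i, j), R(i, j)
```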
C. Experimental content and structure analysis

Table III below shows the detailed experimental results of the three algorithms on the rainfall data set of the geological disaster monitoring system.

TABLE III. DETAILED EXPERIMENTAL RESULTS ON THE RAINFALL DATASET

              k-means algorithm     Improved Canopy       CMU-kmeans algorithm
Rainfall set  precision  recall     precision  recall     precision  recall
1st           25.423     26.125     50.799     61.078     56.939     65.783
2nd           24.287     25.365     52.975     63.288     56.423     64.921
3rd           25.61      18.864     48.895     58.887     57.413     66.174
4th           27.143     26.143     53.425     63.683     57.682     68.108
5th           22.365     25.31      50.073     58.404     56.163     65.063
6th           18.102     26.421     50.444     64.65      56.468     65.224
7th           25.326     24.623     49.362     57.338     58.921     67.638
8th           28.325     26.852     49.975     60.075     56.239     66.405
9th           26.562     28.154     54.267     62.392     58.341     66.423
10th          23.985     26.523     51.445     60.651     57.267     65.392
average       25.013     25.938     51.666     61.045     57.186     66.113

As can be seen from the table, over the ten experiments the values of the two performance evaluation criteria (precision and recall) vary greatly for the traditional k-means algorithm, showing a very unstable state. Taking precision as an example, the minimum of the ten experimental results is 18.102 and the maximum is 28.325, a difference of 10.223; the corresponding range of the recall is 9.290. The results of the improved Canopy algorithm are better: from the table, the range of its precision over the ten runs is 5.372 and the range of its recall is 7.312. For the CMU-kmeans algorithm, the values of the two performance evaluation criteria are obviously improved and remain stable: the minimum precision of the ten results is 56.163 and the maximum is 58.921, a difference of 2.758, and the range of the recall is 3.187. In order to present the experimental results more intuitively, the ten results are shown as line charts below to compare the stability and accuracy of the three algorithms.

Figure 3. The precision and recall values of the traditional k-means algorithm

Figure 4. The precision and recall values of the improved Canopy algorithm

Figure 5. The precision and recall values of the CMU-kmeans algorithm

It can be seen from Fig. 3, Fig. 4 and Fig. 5 that the clustering effect of the improved Canopy algorithm is better than that of the traditional k-means algorithm, and the CMU-kmeans algorithm improves further on the improved Canopy algorithm. The fluctuations of the precision and recall values of the CMU-kmeans algorithm are small and their values are obviously higher, so the improvement is significant.

IV. CONCLUSION

The CMU-kmeans algorithm improves the clustering effect, makes the performance stable, and obviously reduces the computational complexity compared with the traditional k-means algorithm and the improved Canopy algorithm. The algorithm can also adaptively determine the number k of initial clustering centers, avoid the influence of noise data, edge data and the random selection of initial clustering centers, and reflect well the actual distribution of the clustering centers in the dataset.

REFERENCES

[1] Zhai D H, Yu J, Gao F, et al. K-means text clustering algorithm based on initial cluster centers selection according to maximum distance [J]. Application Research of Computers, 2014, 31(3): 379-382.
[2] Yi Baolin, Qiao Haiquan, Yang Fan, Xu Chenwei. An improved initialization center algorithm for K-means clustering [C]. Computational Intelligence and Software Engineering, 2010, pp. 1-4.
[3] Redmond S J, Heneghan C. A method for initialising the K-means clustering algorithm using kd-trees [J]. Pattern Recognition Letters, 2007, 28(8): 965-973.
[4] Liu J X, Zhu G H, Xi M. A k-means algorithm based on the radius [J]. Journal of Guilin University of Electronic Technology, 2013, 33(2): 134-138.
[5] Habibpour R, Khalipour K. A new k-means and K-nearest-neighbor algorithms for text document clustering [J]. International Journal of Academic Research Part A, 2014, 6(3): 79-84.
[6] Liao Shu-Hsien, Chu Pei-Hui, Hsiao Pei-Yuan. Data mining techniques and applications - A decade review from 2000 to 2011 [J]. Expert Systems with Applications, 2012(12).
[7] Wu Ying, Yao Chunlong. Application of improved K-means clustering algorithm in transit data collection [C]. 2010 3rd International Conference on Biomedical Engineering and Informatics (BMEI), 2010.
[8] Zhou A W, Yu Y F. The research about clustering algorithm of K-means [J]. Computer Technology and Development, 2011, 21(2): 62-65.
[9] Duan G Q. Auto generation cloud optimization based on genetic algorithm [J]. Computer and Digital Engineering, 2015, 43(3): 379-382.
[10] Wang C L, Zhang J X. Improved k-means algorithm based on latent Dirichlet allocation for text clustering [J]. Journal of Computer Applications, 2014, 34(1): 249-254.
[11] Deepa V K, Geetha J R R. Rapid development of applications in data mining [C]. Green High Performance Computing (ICGHPC), 2013, pp. 1-4.
[12] Sharma S, Agrawal J, Agarwal S, et al. Machine learning techniques for data mining: A survey [C]. Computational Intelligence and Computing Research (ICCIC), 2013, pp. 1-6.
[13] Abubaker M, Ashour W. Efficient data clustering algorithms: improvements over k-means [J]. International Journal of Intelligent Systems and Applications (IJISA), 2013(3).
[14] Fahad A, Alshatri N, Tari Z, Alamri A. A survey of clustering algorithms for big data: taxonomy and empirical analysis [C]. Emerging Topics in Computing, 2014: 267-279.
[15] Abubaker M, Ashour W. Efficient data clustering algorithms: improvements over k-means [J]. International Journal of Intelligent Systems and Applications, 2013(3): 37-49.
[16] Tang Zhaoxia, Zhang Hui. Improved K-means clustering algorithm based on genetic algorithm [C]. Telkomnika Indonesian Journal of Electrical Engineering, 2014, pp. 1917-1923.