International Journal of Advanced Network Monitoring and Controls, Volume 02, No.2, 2017

Improved K-means Algorithm Based on Optimizing Initial Cluster Centers and Its Application

Xue Linyao, Wang Jianguo
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, 710032, China
Email: xuelinyaoyao@foxmail.com

Abstract. Data mining is the process of grouping or partitioning large and complex data, and clustering analysis is an important research field within it. The K-means algorithm is among the most widely used unsupervised machine learning methods for clustering: it divides the data into k subclasses that differ strongly from each other and, by iterating, minimizes the distance between each data object and the center of its subclass. Because the K-means algorithm is simple and efficient, it is applied in data mining, knowledge discovery, and other fields. However, the algorithm has inherent shortcomings: the value of K must be given in advance, and the clustering result depends heavily on the selection of the initial cluster centers. To adapt to the clustering of historical data from a geological disaster monitoring system, this paper presents a method for optimizing the initial cluster centers together with a method for handling isolated points. The experimental results show that the improved K-means algorithm is better than traditional clustering in accuracy and stability, and its results are closer to the actual data distribution.

Keywords: Clustering analysis, improved K-means algorithm, geological disaster monitoring data

1. Introduction

The occurrence of geological disasters causes great casualties; the main causes include landslides, debris flow, rainfall, and so on.
These disasters damage local public facilities on scales large and small and bring great harm to people and their property, and there are still many such cases in China. Faced with so severe a threat, the state and the government have invested considerable human and material resources in the prevention and control of geological disasters, with remarkable results. With the progress of technology and the rapid development of information technology, many new detection devices, such as GPS, secondary sound wave monitoring, and radar, have been put into real-time geological disaster monitoring. As detection technology develops, the amount of monitoring data grows by leaps and bounds, and the data types become more and more complex. The K-means algorithm is a classic clustering algorithm that is widely applied in industry and commerce; as is well known, it has both many advantages and many disadvantages. Research on the deficiencies of the K-means algorithm falls into two branches: 1) determining the number of initial cluster centers K; 2) the choice of the initial cluster centers. This paper mainly studies the latter and proposes a new initial cluster center algorithm. The data source of the study is the historical data collected by a geological disaster monitoring system: 2000 records were randomly selected from the rainfall data of different areas of Shaanxi Province as the research object, serving as a representative sample for the improved K-means clustering algorithm. The experimental results show that the algorithm is better than traditional clustering in accuracy and stability, and its results are closer to the actual data distribution.

2.
Brief Introduction and Research Status of the K-means Algorithm

2.1 Overview of the K-means algorithm

The K-means algorithm is a classical unsupervised clustering algorithm. Its purpose is to divide a given data set containing N objects into K clusters so that objects within a cluster are as similar as possible, while objects in different clusters are as dissimilar as possible. Let the sample set be X = {x1, x2, x3, ..., xn}, where n is the number of samples. The idea of the K-means algorithm is as follows. First, k data objects are randomly selected from the sample set X as the initial cluster centers. Second, each data object is assigned, according to its degree of similarity to the k cluster centers, to the most similar cluster. Then the mean of each new cluster is recalculated and used as the cluster center for the next iteration, and this process is repeated until the cluster centers no longer change, that is, until the criterion function E converges.

The goal is to maximize the similarity of objects within a cluster and minimize the similarity of objects in different clusters. The degree of similarity between data objects can be measured by their Euclidean distance. For an n-dimensional real vector space, the Euclidean distance between two points x and y is defined as:

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )    (1)

Here x_i and y_i are the attribute values of x and y respectively. The criterion function is defined as:

E = Σ_{i=1}^{k} Σ_{x ∈ c_i} d(x, μ_i)^2    (2)

where k is the total number of clusters and μ_i is the center of cluster c_i. The flow of the K-means algorithm is shown in Figure 1.

Figure 1. K-means flow

2.2 Research status of the K-means algorithm

Because of its advantages, the K-means algorithm has been widely used in practice, but it also has many shortcomings. To obtain better clustering results, many researchers have explored ways of improving K-means.
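The standard iteration described above, with the Euclidean distance of Eq. (1) and the criterion function E of Eq. (2), can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation; the function name and parameters (max_iter, seed) are our own, and empty clusters are handled by keeping the old center.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial centers, nearest-center
    assignment, mean update, stop when centers no longer change."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when the centers no longer change (E has converged).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Criterion function E: sum of squared distances to assigned centers.
    E = ((X - centers[labels]) ** 2).sum()
    return centers, labels, E
```

Because the initial centers are drawn at random, repeated runs with different seeds can converge to different local optima, which is exactly the sensitivity that Section 3 addresses.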
Aiming at the shortcomings of the K-means algorithm in selecting the initial points, many scholars have proposed improved methods. Duan Guiqin [1] optimizes the initial cluster centers using a product of the mean and the maximum distance: the algorithm first adds the pair of data objects farthest apart in the sample set to the cluster-center set, and then repeatedly adds the data object farthest from the mean of the current cluster centers, which improves accuracy. Yi Baolin et al. [2] proposed another improved K-means algorithm, which first calculates the density of the region to which each data object belongs and then selects k points in high-density regions as the initial centers; their experiments show that the algorithm reduces the influence of the initial center points. Yiu-Ming Cheung [3] and others proposed a new clustering technique called the K*-means algorithm, which consists of two separate steps: a center point is provided for each cluster in the first step, and the units are then adjusted through adaptive learning rules in the second step. The algorithm overcomes the K-means shortcomings of initial-center sensitivity and blindness in the K value, but its computation is complicated. Xie et al. [4] proposed a k-means algorithm that optimizes the initial cluster centers by minimum variance, based on compactness information about the sample-space distribution; it chooses as initial centers the samples that have the smallest variance and lie a certain distance away from each other. Liu Jiaxing et al. [5] proposed a radius-based k-means+λ algorithm: when selecting the initial cluster centers, the distance ratios between points are calculated from the λ parameter within a circle of a specific radius, and an initial center point is selected according to the distance ratio; the algorithm performs better in error rate and running time.
Ren Jiangtao [6] proposed an improved K-means algorithm for text clustering that uses feature selection and dimension reduction, sparse vector selection, and a density- and spread-based search for the initial center points; it improves clustering accuracy, stability, and other aspects.

2.3 Analysis of the shortcomings of the K-means algorithm

1) The K value in the K-means algorithm needs to be given in advance. According to the K value determined in advance, the samples are divided into K classes so that the sum of squared distances from all samples to their cluster centers is minimized.

2) The clustering result depends heavily on the selection of the initial cluster centers. The K-means algorithm selects the initial cluster centers randomly; if they are chosen improperly, it is difficult to obtain an ideal clustering result. This dependence on the initial values can make the clustering results unstable, and the algorithm easily falls into a local optimum rather than the global optimum.

3) The algorithm is sensitive to noise and isolated points.

3. Improvement of the K-means Algorithm and Its Application

3.1 Selection of data objects for cluster analysis

Preliminary data are collected first; the characteristics of the data are then examined to assess data quality and to form basic observations about, or hypotheses on, the information implied by the subsets of data of interest. The segmentation variables of the data objects determine how the clusters form, which in turn affects the correct interpretation of the clustering results and ultimately the stability of the clusters after new data objects are added.
Before K-means clustering, the sample data set relevant to the clustering analysis should be extracted from the original data object set; it is not necessary to use all the historical data. In addition, attention must be paid to data quality: only high-quality data lead to correct analytical conclusions and provide a scientific basis for clustering. The research object comes from the historical monitoring data of the geological disaster monitoring system. From the geological monitoring records of 2015 to 2016, 2000 records of rainfall data, sampled randomly from different regions, were selected as a representative sample for the improved K-means clustering algorithm. The sample data attributes are shown in Table 1:

Table 1 The sample data attributes

Field number | Field name  | Field code | Type of data
1            | Id          | xx         | Number
2            | Sno         | yy         | Varchar
3            | Type        | type       | Varchar
4            | Gettime     | time       | Datetime
5            | Alarm Level | alarm      | Integer
6            | Value       | value      | Double
7            | Day Value   | d_value    | Double

For cluster analysis, some of the data attributes of the geological hazard monitoring system above are clearly redundant and lack objectivity as cluster analysis data, so they should be eliminated. Finally, only four data object attributes reflecting the characteristics of the rainfall data are retained as the research object. The optimized data attributes are shown in Table 2:

Table 2 The optimized data attributes

Field number | Field name | Field code | Type of data
1            | Id         | xx         | Number
2            | Sno        | yy         | Varchar
3            | Gettime    | time       | Datetime
4            | Day Value  | d_value    | Double

3.2 Improvement of the K-means algorithm

Given the characteristics of the rainfall data of the geological disaster monitoring system above, the K-means algorithm is very sensitive to the initial centers: a poorly chosen initial cluster center easily drives the clustering result into a local optimum, and the influence of isolated points is large.
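The attribute reduction from Table 1 to Table 2 amounts to projecting each record onto the four retained fields. A minimal sketch follows; the field codes come from the tables, but the dict-per-record layout and the function name are illustrative assumptions, not the monitoring system's actual interface.

```python
# Field codes of the four retained attributes (Id, Sno, Gettime, Day Value),
# per Table 2. The remaining fields (type, alarm, value) are dropped.
KEPT_FIELDS = ("xx", "yy", "time", "d_value")

def reduce_record(record):
    """Project one monitoring record onto the attributes that
    characterise the rainfall data, discarding redundant fields."""
    return {field: record[field] for field in KEPT_FIELDS}
```

Only the numeric day value ("d_value") enters the distance computation; "xx", "yy", and "time" identify and time-stamp the record.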
The improvement is based on the observation that the cluster with the largest variance can be split into two clusters with smaller variance; on this basis an initialization-center algorithm is proposed, together with a method for handling isolated points. The idea of the algorithm is as follows: first find the two sample points farthest from each other and take them as the initial center points; then assign every other sample point to the cluster of its nearest center; next, use the number of points within each cluster to decide whether the corresponding initial cluster center is an isolated point; finally, select the next cluster to be split according to the within-cluster variance and update the initial cluster centers according to fixed rules. These steps are repeated until the required number of cluster centers is reached.

1) Initial cluster center selection algorithm

Let X = {x1, x2, x3, ..., xn}, where n is the number of samples; d(xi, xj) (i, j ∈ {1, 2, ..., n}) is the Euclidean distance between data points xi and xj; ci (i ∈ {1, 2, ..., n}) is a cluster center; Q is the set of data objects to be split; and S is the number of cluster centers. The initial cluster center selection algorithm is as follows:

Input: data set X, number of clusters k, threshold u
Output: cluster center set C and isolated point set D

(1) Let Q = X = {x1, x2, x3, ..., xn}; S = 0;
(2) Calculate the Euclidean distance d(xi, xj) between each pair of data points in Q, find the two points xi, xj that are farthest apart, mark them as ci, cj, and let: S = S + 2; Qi = {xp | d(xp, xi) < d(xp, xj), xp ∈ Q}, Qj = {xp | d(xp, xj) < d(xp, xi), xp ∈ Q}; that is, Q is divided by the two centers, and Qi and Qj become the split clusters.
(3) If the number of data objects in Qi or Qj is less than u, the selected initial center xi or xj is an isolated point. Remove xi or xj from Q, remove ci or cj from the set C, add xi or xj to D, and return to step (2);
(4) If the number of data objects in the set C is less than k, find the split cluster Qp with the largest variance, let Q = Qp and S = S − 1, remove cp from the set C, and return to step (2);
(5) Calculate the mean of all the objects in each split cluster; the resulting means are the k initial cluster centers.

2) Improved K-means algorithm

Let the data set be X = {x1, x2, x3, ..., xn} with n objects. Cold,i denotes the i-th cluster center of the previous round and Cnew,i the new cluster center computed in the current round. The algorithm is described as follows:

Input: data set X, number of clusters k, threshold u
Output: k clusters

(1) Call the improved initialization center selection algorithm to obtain the initial centers; if there are isolated points, place them alone in one class so that they do not participate in the subsequent clustering;
(2) Calculate the distance between every data object and the k cluster centers, and assign each object to the nearest cluster;
(3) Calculate the mean of each cluster to obtain the new round of cluster centers;
(4) If the criterion function E no longer changes between rounds, that is, Cnew,i = Cold,i for all i, the iteration terminates; otherwise return to step (2).

4. Experiment Analysis

4.1 Experimental description

The data sets selected for the experiment are the rainfall data collected by the geological hazard detection system and the same rainfall data set after artificial noise is added. The experimental environment is: Intel(R) Core(TM) i3-2330M, 4G RAM, 250G hard disk, Windows 7 operating system.
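As a concrete reading of steps (1)-(5) above, the initialization can be sketched in Python as follows. Here "variance" is taken to mean the within-cluster sum of squared deviations from the mean, and the function and variable names are ours rather than the paper's; the bookkeeping of S is left implicit in the length of the cluster list.

```python
import numpy as np

def init_centers(X, k, u):
    """Sketch of the splitting-based initialization: returns k initial
    centers and the isolated points that were set aside."""
    D = []          # isolated point set D
    Q = X.copy()    # data objects currently being split
    clusters = []   # finished split clusters (arrays of points)
    while True:
        # Step (2): find the two points of Q that are farthest apart ...
        dists = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=2)
        i, j = np.unravel_index(dists.argmax(), dists.shape)
        # ... and split Q between them by nearest-seed assignment.
        near_i = dists[:, i] < dists[:, j]
        Qi, Qj = Q[near_i], Q[~near_i]
        # Step (3): a side with fewer than u objects marks its seed as an
        # isolated point; set the seed aside and redo the split.
        if len(Qi) < u or len(Qj) < u:
            seed = i if len(Qi) < u else j
            D.append(Q[seed])
            Q = np.delete(Q, seed, axis=0)
            continue
        clusters.extend([Qi, Qj])
        if len(clusters) >= k:
            break
        # Step (4): the split cluster with the largest within-cluster
        # variance becomes the next set Q to be split.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        Q = clusters.pop(int(np.argmax(sse)))
    # Step (5): the k initial centers are the means of the split clusters.
    centers = np.array([c.mean(axis=0) for c in clusters[:k]])
    return centers, np.array(D)
```

The centers returned here would replace the random initialization of the standard algorithm, with the isolated points in D held out of the subsequent iterations as step (1) of the improved K-means prescribes.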
In order to verify the validity and stability of the improved algorithm, the original k-means algorithm, the algorithm of literature [4], and the improved algorithm are analyzed and compared on the rainfall data set. To further verify the superiority of the algorithm in dealing with isolated points, it is also compared with the other algorithms on the rainfall data set with added noise. Changes in the clustering criterion function and the clustering time are used to evaluate the clustering results.

4.2 Experimental results and analysis

The clustering criterion function of each algorithm decreases as the cluster centers are adjusted, until final convergence; the lower and tighter a curve, the higher the accuracy of the corresponding clustering result, and vice versa. Figure 2 compares how the criterion function values of the traditional k-means algorithm and the improved algorithm change as the cluster centroids are adjusted. In order to test the speed of the improved algorithm, three samples were randomly selected from the historical data of the geological hazard system, with sample sizes of 5000, 10000, and 18000 respectively. The experimental results are shown in Figure 3.
Figure 2. Comparison of the criterion function change trends
Figure 3. Comparison of clustering times on the artificial data sets

According to the comparison of the criterion function trends on the rainfall data set in Figure 2, the clustering criterion function of the improved algorithm is superior to that of the traditional k-means algorithm: because the data objects in the optimized clusters are more compact and more independent in each iteration, the criterion function value is significantly lower than that of the traditional k-means algorithm, which further validates the superiority of this algorithm. Figure 3 shows that the traditional k-means algorithm runs fastest, and the speed of this algorithm is slightly lower than that of the algorithm of [4].

5. Conclusion

Aiming at the instability of clustering results caused by the random selection of initial centers in the traditional k-means algorithm, and at the effect of isolated points on the clustering results, this paper exploits two properties: sample points close to each other tend to belong to the same cluster, and the cluster with the largest variance can be split into two clusters with relatively small variance. On this basis, a k-means clustering algorithm that optimizes the initial cluster centers is proposed. Simulation experiments on the geological hazard system data and on artificial data sets with the same proportion of noise show that, compared with the traditional k-means algorithm and the other two optimized initial-center algorithms, the proposed algorithm improves accuracy and reduces clustering error. However, the initialization step of the algorithm is somewhat complicated and takes too much time in selecting the centers; in future work it will be further improved and tested in all respects.
References
[1] Zhai D H, Yu J, Gao F, et al. K-means text clustering algorithm based on initial cluster centers selection according to maximum distance [J]. Application Research of Computers, 2014, 31(3): 379-382.
[2] Yi B L, Qiao H Q, Yang F, Xu C W. An improved initialization center algorithm for K-means clustering [C]. Computational Intelligence and Software Engineering, 2010: 1-4.
[3] Redmond S J, Heneghan C. A method for initializing the K-means clustering algorithm using kd-trees [J]. Pattern Recognition Letters, 2007, 28(8): 965-973.
[4] Liu J X, Zhu G H, Xi M. A k-means algorithm based on the radius [J]. Journal of Guilin University of Electronic Technology, 2013, 33(2): 134-138.
[5] Habibpour R, Khalipour K. A new k-means and K-nearest-neighbor algorithms for text document clustering [J]. International Journal of Academic Research Part A, 2014, 6(3): 79-84.
[6] Liao S H, Chu P H, Hsiao P Y. Data mining techniques and applications - a decade review from 2000 to 2011 [J]. Expert Systems with Applications, 2012(12).
[7] Wu Y, Yao C L. Application of improved K-means clustering algorithm in transit data collection [C]. 2010 3rd International Conference on Biomedical Engineering and Informatics (BMEI), 2010.
[8] Zhou A W, Yu Y F. The research about clustering algorithm of K-means [J]. Computer Technology and Development, 2011, 21(2): 62-65.
[9] Duan G Q. Auto generation cloud optimization based on genetic algorithm [J]. Computer and Digital Engineering, 2015, 43(3): 379-382.
[10] Wang C L, Zhang J X. Improved k-means algorithm based on latent Dirichlet allocation for text clustering [J]. Journal of Computer Applications, 2014, 34(1): 249-254.
[11] Deepa V K, Geetha J R R. Rapid development of applications in data mining [C]. Green High Performance Computing (ICGHPC), 2013: 1-4.
[12] Sharma S, Agrawal J, Agarwal S, et al. Machine learning techniques for data mining: a survey [C]. Computational Intelligence and Computing Research (ICCIC), 2013: 1-6.
[13] Abubaker M, Ashour W. Efficient data clustering algorithms: improvements over K-means [J]. International Journal of Intelligent Systems and Applications (IJISA), 2013(3).
[14] Fahad A, Alshatri N, Tari Z, Alamri A. A survey of clustering algorithms for big data: taxonomy and empirical analysis [C]. Emerging Topics in Computing, 2014: 267-279.
[15] Abubaker M, Ashour W. Efficient data clustering algorithms: improvements over k-means [J]. International Journal of Intelligent Systems and Applications, 2013(3): 37-49.
[16] Tang Z X, Zhang H. Improved K-means clustering algorithm based on genetic algorithm [J]. Telkomnika Indonesian Journal of Electrical Engineering, 2014: 1917-1923.
[17] De Soete G. Optimal variable weighting for ultrametric and additive tree clustering [J]. Quality and Quantity, 1986(2).