key: cord-0497672-lgl4i6pw authors: Zubair, Md.; Iqbal, MD.Asif; Shil, Avijeet; Haque, Enamul; Hoque, Mohammed Moshiul; Sarker, Iqbal H. title: An Efficient K-means Clustering Algorithm for Analysing COVID-19 date: 2020-12-21 journal: nan DOI: nan sha: 48d83f356b30669f134fca3e28f43c3f74120a69 doc_id: 497672 cord_uid: lgl4i6pw COVID-19 hits the world like a storm by arising pandemic situations for most of the countries around the world. The whole world is trying to overcome this pandemic situation. A better health care quality may help a country to tackle the pandemic. Making clusters of countries with similar types of health care quality provides an insight into the quality of health care in different countries. In the area of machine learning and data science, the K-means clustering algorithm is typically used to create clusters based on similarity. In this paper, we propose an efficient K-means clustering method that determines the initial centroids of the clusters efficiently. Based on this proposed method, we have determined health care quality clusters of countries utilizing the COVID-19 datasets. Experimental results show that our proposed method reduces the number of iterations and execution time to analyze COVID-19 while comparing with the traditional k-means clustering algorithm. COVID-19 was detected near the end of 2019 from the Chinese city of Wuhan [1] . The virus spread rapidly by transmission throughout the whole world within a short period of time. The World Health Organization (WHO) had to declare COVID-19 pandemic. To eradicate the pandemic situation, different countries have taken initiatives to remold their health care quality. Clustering the countries with similar types of health care quality helps us to know about the health care quality comparing with the rest of the counties. In the area of machine learning and data science, both supervised learning and unsupervised learning are popular to solve various kinds of real world problems [2] [3] . In the case of clustering, unsupervised machine learning algorithms are being used [4] [5] . The unsupervised machine learning algorithm typically identifies insight structures of the data from unlabelled data contained in the dataset. The clustering algorithm finds and divides the data points according to the similarity of the hidden structures of the dataset [6] . K-Means [5] clustering is one of the most important and widely used unsupervised machine learning algorithms as it is widely used to identify the hidden structure automatically. The algorithm is used to find the number of clusters so that it is possible to divide the unlabelled dataset into subgroups. It is done by calculating the distances from a centroid of a cluster. K-means algorithm is an NP-Hard problem [7] . Efficiently estimating the initial centroids is a difficult problem. At the initial state, we need to fix the coordinates of the initial centroid for finding the number of clusters. In the traditional K-Means clustering algorithm, initial centroids usually being selected randomly. Thus, determining the initial coordinate points of centroids can play a signifcant role in clustering, in which we are interested in. The key contributions of our work are as follows -• Our proposed method efficiently selects the k-means clustering algorithm's centroids that provides an optimum number of constant iteration as well as execution time. • We have clustered similar types of countries according to the health care quality during the COVID-19 pandemic. • We have tested our model with a COVID-19 real-world dataset to show the efficiency of our model with the existing models. In the following, several related works have been discussed in Section 2. We present our proposed methodology in Section 3. In Section 4, we have shown the experimental results and comparative analysis. In Section 5 we highlight some key points and concludes this paper. Several approaches have been proposed to find the initial cluster centroids more efficiently. In this section, we bring some of these works. M. S. Rahman et al. [8] , provides a centroids selection method based on radial and angular coordinates. However, the number of iterations of his proposed idea is not constant for all the instances. Also, the execution time of that method increases drastically by the increase of the cluster number. A. Kumar et al. also proposed to find initial centroids based on the dissimilarity tree [9] . Although this method improves kmeans clustering, the execution time is not exalted significantly. Also, the smaller datasets that are used for experiment results can not provide insight into a large dataset. In [10] , M.S. Mahmud et al. proposed a novel approach to find the initial centroids by calculating the mean of each distance of the data points. It only describes 3 clusters of given three datasets with the execution time. Improvement of execution time is also trivial. The concept is based on the weighted average. In [11] , another approach is made to find the initial centroids efficiently. Here in M. Goyal et al. also tried to find the centroids by dividing the sorted distances with k, the number of equal partitions. No execution time was given for this proposed method. S. Na et al. [12] have proposed the use of two elementary data structures to store cluster labels and the distance of all data items with each iteration. On the next iteration, data of the previous iteration was used. But, execution time was trivial for the dataset experimented by authors. M. A. Lakshmi et al. [13] have proposed a method to find initial centroids by using the nearest neighbor method. They compared their idea by using SSE(Sum of the Squared Differences) with random and kmeans++ initial selection. Their SSE is roughly similar to random and kmeans++ initial selection. Moreover, They did not provide any comparison concerning execution time as well. S. R. Vadyala et al. proposed a combined algorithm with k-means and LSTM to predict the number of confirmed cases of COVID-19 [14] . LSTM is abbreviated as long short-term memory, an artificial recurrent neural network architecture used for Deep learning. In [15] , author A. Poompaavai et al. attempted to identify the affected areas by COVID-19 of India by using the k-means clustering algorithm. Many approaches related to COVID-19 problem, k-means clustering has been used. In [16] , S.K. Sonbhadra et al. proposed a novel bottom-up approach for COVID-19 articles using k-means clustering along with DBSCAN and HAC. In this section, we present our k-means clustering-based COVID-19 analysis to determine the clusters according to the health care quality of the countries. The result of the k-means clustering is effected on the selection or assignment of initial centroids [17] . So, it is necessary to select the centroids more systematically to improve the performance of the k-means clustering algorithm, also the execution time. In our approach, we use Principal Component Analysis (PCA) [18] while determining the centroids. The following flowchart in Figure 1 depicts the overall process of our proposed method. Our proposed method can determine the coordinates of initial centroids which can obtain the coordinates more precisely than the existing methods. Many machine learning algorithms, including both supervised and unsupervised methods have been applied to the COVID-19 dataset [15] . Each attribute of the dataset must contain numerical values for approaching the k-means algorithm [5] . If not then, some pre-processing might be required for handling missing values and non-numeric values. Before going to further methodology, it needs to be ensured that each attributes values in numeric. For creating our model, we have used a few datasets for selecting the features, required for analyzing the health care quality of the countries. The selected datasets are owid-covid-data [19] , covid-19-testing-policy [20] , public-events-covid [20] , covid-containment-and-health-index [20] , inform-covid-indicators [21] . It is worth mentioning that we used the data up to 11 st August 2020. For instance, some of the attributes of the owid-covid-data [19] are shown in Table 1 . Covid- Other datasets also contains such types of features which is required for ensuring the health care quality of a country. These real-world datasets help us to analyze our proposed method for real-world scenarios. Principal component analysis (PCA) is a method, that utilizes an orthogonal transformation to translate a set of observations of potentially correlated variables into a set of values of linearly uncorrelated variables that are called principal components. PCA is widely used in data analysis and for making a predictive model considering dimensionality reduction [22] [18] , that has been taken into account in our approach. On the other hand, the percentile method [23] used in our model divides the whole dataset into 100 different parts. Each part contains 1 percent data of the total dataset. For example, the 25th percentile means this part contains 25 percent data of the total dataset. That implies, using the percentile method, we can split our dataset into different distributions according to our given values. The percentile formula is given below [23] : Here, ρ = The Percentile wants to find, η = Total Number of values, R = Percentile at ρ. After splitting the reduced dimensional dataset through percentile, we then extract the split data from the primary dataset by indexing for each percentile. In this process, we get back the original data. After retrieving the original data for each percentile, we have calculated the mean of each attribute. These means from the split dataset is the initial centroids of our proposed method. After splitting the dataset and calculating the mean according to the subsection 3.3, we have selected the means of each split dataset as a centroid. These centroids are taken into account as initial centroids for the efficient k-means clustering algorithm. At the last step, we have executed our modified k-means algorithm until the centroids converge. Passing our proposed centroids instead of random or kmeans++ centroids through the k-means algorithm we have generated the final clusters [24] . The proposed method always considers the same centroids for each test. The pseudocode of our whole proposed methodology is given in the algorithm 1. In the next section, evaluation and experimental results are discussed. To measure the effectiveness and validate our proposed model for selecting the optimum initial centroids for the k-means clustering algorithm, we have implemented and showed experimental results with real-world dataset. We have used five COVID-19 datasets and merged them to have a handful of features for clustering the countries according to their health quality during COVID-19. We have merged the datasets described in subsection 3.1 according to the country name. With the regular expression, some pre-processing and data cleaning have been conducted in the case of merging the data. We have also handled some missing data consciously. There are so many attributes regarding COVID-19, Algorithm 1 Proposed Methodology 1: Input: A dataset D and Number of clusters K 2: Output: Efficient initial centroids for K clusters 3: procedure Proposed Method(D) 4: All n attributes 1, 2, 3, ..., n of D must be numeric. If there is any nonnumeric attribute just convert it to numeric value. 5: Apply Principal Component Analysis (PCA) with 2 components to the dataset, D. Apply percentile for splitting the whole dataset into K equal parts based on 1st component. 7: Extract the split dataset from primary data by index. 8: Calculate the mean of each attribute of the split datasets. 9: Take the mean of each dataset as the initial clusters centroids, C = c1, c2, ..., ck, where c1, c2, .., ck are the initial centroids for 1st, 2nd, ...., k clusters consecutively. 10: Assign the centroids to the k-means clustering algorithm 11: Applying k-means algorithm with proposed initial centroids. 12: end procedure among them 25 attributes had been selected finally, as these attributes closely signify the health care quality of a country. The attributes represent categorical and numerical values. These are country name, cancellation of public events (due to public health awareness), stringency index 1 , testing policy ( category of testing facility available to the mass people), total positive case per million, new cases per million, total death per million, new deaths per million, cardiovascular death rate, hospital beds available per thousand, life expectancy, inform the COVID-19 risk (rate), hazard and exposure dimension rate, people using at least basic sanitation services (rate), inform vulnerability(rate), inform health conditions (rate), inform epidemic vulnerability (rate), mortality rate, prevalence of undernourishment, lack of coping capacity, access to healthcare, physicians density, current health expenditure per capita, maternal mortality ratio. All of the attributes are closely investigated before feeding the model. As we are going to make clusters for the countries with similar types of healthcare quality, the optimum number of clusters is 4, defined by the elbow method [26] . We have applied Principal Component Analysis (PCA) to convert a high dimensional dataset into two dimensions. The number of clusters is 4 for the COVID-19 dataset. After applying PCA, we have used percentile concepts to divide the whole dataset into K equal parts. For K=4, each portion contains 25% of the dataset. For this purpose, we have calculated the percentile for 25%, 50%, 75%, and 99.9% data. We have divided the data horizontally because it provides a good intuition to the cluster. Figure 2 depicts plotting the data with two dimensional PCA where horizontal lines represents the splitting according to percentile. In table 3 , c1,c2,c3 and c4 represent initial clusters centroids consecutively. In machine learning and data science, computational power is one of the main issues. Because at a time, a computer needs to process a large amount of data. So, reducing computational cost is a big deal. As discussed in the methodology in section 3, we implemented our proposed method. Though many researchers proposed many ideas those are discussed in the related work section. We have compared our method with the best existing kmeans++ method [24] . We have measured the effectiveness of the model with In figure 3 , we show the experimental result of the COVID-19 dataset for 10 tests. In the graph, the green and the red line represents results for the proposed and kmeans++ algorithms consecutively. As our centroid is pre-defined, the number of iterations is constant in every test case. On the contrary, kmeans++ method works with random iterations. For getting insight into the clusters, we have plotted the cluster map in Figure 4 showing similar types of countries in terms of health care quality. The 2D plot also shows the cluster distributions in a two-dimensional space. Before jumping to the demonstration, it is worth mentioning that the model have been tested on the Intel ® Core TM i7-8750H processor. Figure 3 (a) provides the iteration comparison for existing kmeans++ and our proposed method. When our proposed method have been applied, we find a constant number of iterations in each test. But the existing kmeans++ method's iteration is random. It might be varied for different test cases. • Figure 3 (b) provides the execution time comparison for existing kmeans++ and our proposed method. For each test case, our model have executed the k-means clustering algorithm with the shortest time. Based on the experimental results, we claim that our model outperforms in the case of real-world application and reduces the computational power for the kmean clustering algorithm. In this paper, we have presented an efficient clustering method that selects the optimal initial centroids of K-means clustering algorithm. With the help of the proposed method, we have efficiently created the clusters of different countries according to similar health care quality during COVID-19. While experimenting with COVID-19 datasets, our model outperforms in terms of a reduced number of constant iterations, that consequently reduces the execution times as well. Although our proposed model outperforms for real-world scenarios, it might be a little bit diverged if the number of instances of the dataset are high or extremely high. In the future, we have a plan to work on a huge number of instances and build an effective recommendation system. Coronavirus disease 2019 (covid-19): situation report Mobile data science and intelligent apps: Concepts, ai-based modeling and research directions Cybersecurity data science: an overview from machine learning perspective Context-aware rule learning from smartphone data: survey, challenges and future directions Data mining: concepts and techniques Individualized time-series segmentation for mining mobile phone user behavior The hardness of k-means clustering in the plane An initial centroid selection method based on radial and angular coordinates for k-means algorithm A new initial centroid finding method based on dissimilarity tree for k-means algorithm Improvement of k-means clustering algorithm with better initial centroids based on weighted average Improving the initial centroids of k-means clustering algorithm to generalize its applicability Research on k-means clustering algorithm: An improved k-means clustering algorithm Initial centroids for k-means using nearest neighbors and feature means Prediction of the number of covid-19 confirmed cases based on k-means-lstm Clustering study of indian states and union territories affected by coronavirus (covid-19) using k-means algorithm Target specific mining of covid-19 scholarly articles using one-class approach A survey of clustering data mining techniques. In: Grouping multidimensional data Contextpca: Predicting context-aware smartphone apps usage based on machine learning techniques Total covid-19 tests performed by country -humanitarian data exchange Uncover covid-19 challenge kaggle Principal component analysis Statistics notes: quartiles, quintiles, centiles, and other quantiles k-means++: The advantages of careful seeding Coronavirus government response tracker -blavatnik school of government Review on determining number of cluster in k-means clustering