key: cord-1011840-3t2v72e4
authors: Fidan, Huseyin; Erkan Yuksel, Mehmet
title: A comparative study for determining Covid-19 risk levels by unsupervised machine learning methods
date: 2021-11-19
journal: Expert Syst Appl
DOI: 10.1016/j.eswa.2021.116243
sha: 9730be091247462ac7c66cd729ea7c6e7bc832ec
doc_id: 1011840
cord_uid: 3t2v72e4

Governments have imposed restrictions to reduce the spread of Covid-19 and to protect people's health according to regional risk levels. The risk levels of locations are currently determined by threshold values applied to the number of cases per 100,000 people, without considering environmental variables. The purpose of our study is to apply unsupervised machine learning techniques to identify cities with similar risk levels using both case counts and environmental parameters. Hierarchical, partitional, soft, and gray relational clustering algorithms were applied to different datasets created from weekly case counts, population densities, average ages, and air pollution levels. The clustering algorithms were compared using internal validation indexes, and the most successful method was identified. The study revealed that the most successful method for clustering based on case counts is Gray Relational Clustering. The results show that using environmental variables for restriction decisions requires more than 4 clusters for sounder decisions, and that Gray Relational Clustering gives stable results, unlike the other algorithms.

The Covid-19 epidemic has spread rapidly all over the world since 2019 and has adversely affected people's lives. Countries are making great efforts to solve the economic, health, and social problems caused by Covid-19. Attempts are still being made to prevent the spread of the epidemic through personal, regional, and national precautions. In order to prevent the spread of the virus, restrictions are applied regionally.
In some cases, these limitations are expanded throughout the country and lead to curfews. The main criterion for restrictions is the number of Covid-19 cases per 100,000 people. The literature states that environmental variables such as average age, population density, acreage, and air quality affect the spread of the virus and should be included in analyses. In particular, average age (Ferguson et al., 2020; Wang et al., 2020) and crowded environments (Hutchins et al., 2020) increase the risk level of the pandemic. Although a few studies find no significant relationship between air pollution and Covid-19 infection (Bontempi, 2020; Fattorini & Regoli, 2020; Conticini, Frediani & Caro, 2020), other studies emphasize that air pollution raises risk levels and should be included in the analysis (Ciencewicki & Jaspers, 2007; Ye et al., 2016). Because of Covid-19 cases, restrictions are imposed such as curfews for people under age 18 and over age 65, closing restaurants and cafes, banning meetings and demonstrations, closing schools, transitioning to flexible working, banning intercity travel, and general curfews. In Turkey, some decisions are taken by regional governments, while wider restrictions are imposed by the central government. The restrictions are applied based on data announced by the Ministry of Health. In general, only the number of cases per 100,000 people is used as a parameter for restrictions. The criteria used to determine the risk groups by fixed threshold values (FV) in Turkey are given in Table 1. As seen in Table 1, provinces are evaluated in 4 risk groups depending on the number of cases per 100,000 people. Restrictions are applied by the government according to risk levels. For example, a province with between 0 and 20 cases, which means no restrictions, is defined as low risk and colored blue.
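As a minimal sketch of how such fixed-threshold (FV) grouping works: only the 0-20 "blue" band is stated in the text, so the remaining cut-offs below are hypothetical placeholders, not the values from Table 1.

```python
# Sketch of the fixed-threshold (FV) grouping described above.
# Only the blue band (0-20 cases per 100,000) is stated in the text;
# YELLOW_MAX and ORANGE_MAX are assumed for illustration only.
BLUE_MAX = 20      # stated in the text: 0-20 => low risk (blue)
YELLOW_MAX = 50    # hypothetical cut-off
ORANGE_MAX = 100   # hypothetical cut-off

def fv_risk_group(cases_per_100k: float) -> str:
    """Map a weekly case rate to a color-coded risk group."""
    if cases_per_100k <= BLUE_MAX:
        return "blue"      # low risk, no restrictions
    if cases_per_100k <= YELLOW_MAX:
        return "yellow"
    if cases_per_100k <= ORANGE_MAX:
        return "orange"
    return "red"           # highest risk

print(fv_risk_group(12.5))   # a province in the low-risk band
```

The drawback discussed next follows directly from this shape: every change in the epidemic's scale forces new hard-coded cut-offs.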
The other risk groups are colored yellow, orange, and red, respectively. Using FV can be thought of as a clustering technique, but it is not a sound clustering approach. New threshold values must be determined whenever the number of cases increases, so the method cannot be used efficiently. Frequent changes in threshold values create confusion in determining risk levels, lead to unsound groupings, and decrease the effectiveness of restrictions. In addition, the method cannot be applied to datasets that include environmental variables. For these reasons, the FV method does not offer a sustainable clustering approach. In this study, unsupervised machine learning techniques were applied to obtain regional risk groups instead of FV. Clustering, defined as the grouping of similar data, is an important research area in machine learning (Han et al., 2012). Clustering techniques that can effectively identify regions with similar characteristics according to risk levels can make significant contributions to restriction decisions. The literature shows that clustering analyses are rarely used in determining risk levels related to Covid-19 cases. Hierarchical clustering (HC), K-Means (KM), and Fuzzy C-Means (FCM) are among the most widely used algorithms for clustering analysis, applied according to dataset characteristics (Peters et al., 2013). In particular, an insufficient number of items in the dataset decreases efficiency or may even cause failure. Limited samples and the inability to collect detailed data are the main problems in clustering Covid-19 cases. Some studies emphasize that Gray System Theory-based approaches offer much better performance in clustering datasets containing limited data (Fidan & Yuksel, 2020). In this context, our study has two purposes.
The first aim is to identify the unsupervised machine learning algorithm with the least error in determining risk groups. To this end, clustering analysis was performed using the HC, KM, FCM, and Gray Relational Clustering (GRC) algorithms. The second aim is to compare the clustering performances of different datasets created from the number of cases, population density, average age, and air pollution, variables stated in the literature to affect the spread of Covid-19. The clustering algorithms were applied to 4 datasets covering the 81 provinces of Turkey, and their performances were compared by the Silhouette, Calinski-Harabasz, and Dunn indexes. The results revealed that the risk levels of regions can be determined using unsupervised machine learning techniques, and that the most successful algorithm is GRC, which has the highest clustering performance. The Covid-19 outbreak, which spread all over the world shortly after the first cases were seen in China in December 2019, has also become an attractive subject for academic research. Researchers, who had limited data at the beginning of the epidemic, now have much more data about Covid-19. Analyses have been performed using daily data as well as data over longer time periods, covering clustering, classification, and prediction based on the case counts of countries and regions. Arguing that traditional time series algorithms cannot give reliable results due to the different lengths of Covid-19 case series and inconsistent ranges between the data, Zarikas et al. (2020) developed a clustering method based on HC. The researchers performed clustering with the number of cases, active cases per population, and active cases per area for 30 countries, and emphasized that population size and area should also be used in such analyses. Adam et al.
(2020), who conducted a cluster analysis according to the transmission types of Covid-19 cases, divided the transmission sources of 1,039 confirmed cases into 51 clusters. According to their results, social settings such as cafes, restaurants, meetings, and theaters are the main accelerators of the spread of infection, so the first precaution should be restricting social settings. Maugeri et al. (2020), who carried out regional clustering of Covid-19 cases in Italy, used the HC and KM algorithms. Grouping the regions into 4 clusters, they stated that the KM algorithm is an alternative tool for measuring spread. In another study, HC and KM were applied to multivariate time series; 32-day data was examined, and a close similarity was found between the numbers of cases and deaths (James & Menzies, 2020). Virgantari & Faridhan (2020), who conducted cluster analysis with KM on the number of Covid-19 cases in Indonesia, grouped 34 provinces into 7 clusters using 680 confirmed cases from a single day. Their study, which emphasized that KM is a suitable option, includes no comparison with other algorithms. In another study, Covid-19 case clustering of American states was carried out using daily confirmed case data (Chen et al., 2020). The states were divided into 7 clusters by applying KM to Nonnegative Matrix Factorization coefficients. Applying the same method to case counts on different days, the researchers determined which states should be restricted and which reopened. Stating that hard clustering methods fail on data in intermediate regions, Mahmoudi et al. (2020) suggested that soft clustering yields more successful results on Covid-19 case data. They applied FCM to cluster the virus spread and divided the countries into 3 risk groups. Crnogorac et al.
(2021) carried out Covid-19 case clustering of European countries with KM, HC, and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and compared the performances with the Silhouette metric. According to the comparison results, there were no significant differences between performances, and all three algorithms could be used for clustering Covid-19 cases. In a study investigating the effect of living areas on the spread of Covid-19, spatial clustering analysis was performed on data obtained from the Indian Ministry of Health (Das et al., 2021). It was determined that data on living areas are an important factor in the spread of Covid-19. Kinnunen et al. (2021), who performed a clustering analysis of countries according to the economic policies applied to ease restrictions during the Covid-19 process, used the Gaussian Mixture Model (GMM). They argued that GMM is a viable choice for large datasets, but that fuzzy approaches would be more appropriate for comparative analysis of countries and regions (Kinnunen et al., 2021). A recent study emphasized that using unsupervised machine learning methods for Covid-19 case analyses increases efficiency (Hozumi et al., 2021); it proposed KM for Covid-19 clustering analysis and recommended Uniform Manifold Approximation and Projection (UMAP) for dimension reduction. Overall, the literature shows that the number of cases is the main variable for Covid-19 clustering analysis, environmental parameters are ignored, and HC and KM are generally applied. In addition to hard clustering approaches, some studies suggest soft clustering methods. However, no studies use or suggest the GRC method. GRC has been highlighted as a method producing very sound results in clustering analyses (Fidan, 2020).
It is suggested especially under uncertainty arising from insufficient data (Fidan & Yuksel, 2020). In this context, GRC is a viable option for clustering Covid-19 cases and specifying regional risk levels. The case counts used in this study are the numbers of confirmed Covid-19 cases per 100,000 people on a provincial basis in Turkey, announced by the Ministry of Health for February 20-26, 2021 (Turkish Ministry of Health, 2021). The population density and average age of the provinces were collected from the Turkish Statistical Institute (TUIK) figures announced in 2020 (TUIK, 2021). The PM2.5 index was taken as the basis for air pollution levels, and the data were compiled from IQAir for 2020 (IQAIR, 2021). Data for the first 10 provinces in the dataset are presented in Table 2; the full dataset can be seen in Appendix A. In order to compare clustering performances, 4 datasets were created with different variables, with the aim of determining which variables increase clustering performance. The datasets are given in Table 3:

Ds1: Number of cases
Ds2: Number of cases + Population density
Ds3: Number of cases + Population density + Average age
Ds4: Number of cases + Population density + Average age + Air pollution

Unsupervised machine learning is a technique applied to determine the relationships, similarities, and patterns between values, assuming that some data in a dataset are more similar to each other than to the rest (Alpaydın, 2010). Since there is no supervisor, unsupervised machine learning does not include a training process, which makes the analysis of unlabeled data possible. Because unlabeled data are used, its performance is lower than that of other machine learning techniques. Unsupervised machine learning methods can be categorized into two general groups: association and clustering (Han et al., 2012).
While the association method is used to identify patterns among unlabeled data, the clustering method is for grouping data. In this context, FV, which is used to group risk levels based on the number of Covid-19 cases, is not a valid clustering method in the unsupervised machine learning sense. Because Covid-19 data is unlabeled, clustering is the most appropriate method for determining regional risk groups. Clustering, defined as the grouping of data with similar characteristics, is one of the most important unsupervised machine learning problems (Xu & Tian, 2015). In clustering analysis, similar data should fall in the same groups and dissimilar data in separate groups as far as possible (Han et al., 2012). In other words, clustering is the process of grouping data whose group membership is uncertain, according to their similarities and dissimilarities. The aim of clustering is to discover the natural structure of a dataset according to the distances between data (Mirkin, 2005; Arbelaitz et al., 2013). Distance measures such as Cosine, Euclidean, Manhattan, and Gini are applied to find the differences between the data (Fidan & Yuksel, 2020). There are many clustering methods with different approaches in the literature. Hierarchical, partitional, and soft methods are among the most widely used, and the most preferred basic algorithms of these approaches are HC, KM, and FCM, respectively (Peters et al., 2013). HC is a clustering method with a binary tree structure, performed by determining the closest pairs according to the distances between items. Two methods are used, corresponding to the direction in which the tree is built: agglomerative (bottom-up) and divisive (top-down) (Han et al., 2012). In the agglomerative method, each item in the dataset is initially considered a single cluster. The two closest items, determined by the distance criterion, are combined to form a cluster; the other items remain single clusters.
In the second step, the closest pairs are determined again and combined. This process continues until all items are in a single cluster. In the divisive method, all items are initially placed in a single cluster. The furthest item is removed from the cluster and considered a separate cluster. This splitting continues until each item forms a cluster on its own. The minimum distance between two clusters is given in Eq. (1), and the maximum distance in Eq. (2):

d_min(C_i, C_j) = min {d(p, q) : p ∈ C_i, q ∈ C_j}   (1)
d_max(C_i, C_j) = max {d(p, q) : p ∈ C_i, q ∈ C_j}   (2)

Eq. (1) and Eq. (2) show the minimum and maximum values of the distance d(p, q) between items, where p and q indicate an item pair belonging to clusters C_i and C_j. Following the agglomerative approach, the scheme of HC for a dataset with n items is shown on the dendrogram in Fig. 1. The dendrogram in Fig. 1, with an agglomerative tree structure, shows the HC clustering of a dataset containing n items; clustering is performed by considering Eq. (1). Initially, each item represents a cluster and is labeled a, b, c, etc. The first item is taken as a reference, and the distances to all other items are calculated; the minimum of these distances indicates the closest item to the reference. In Fig. 1, the closest pair is found and combined into a cluster in the first step. The same process is repeated to find the next closest pair, continuing until all items are in one cluster. HC is preferred especially when the number of clusters is uncertain, since it does not require the number of clusters before the analysis. In other words, it provides an advantage when the number of clusters cannot be fixed as a parameter. It is also stated to be more efficient than other clustering algorithms on small datasets (Abbas, 2008).
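The agglomerative procedure above can be sketched in a few lines of single-linkage clustering, i.e. merging by Eq. (1). The case-rate values here are made up for illustration; the paper itself ran HC in R on the provincial dataset.

```python
# Single-linkage agglomerative clustering (Eq. 1): repeatedly merge
# the two clusters whose closest members are nearest to each other.
# The case rates below are made-up illustrative values.
def single_linkage(points, num_clusters):
    clusters = [[p] for p in points]          # each item starts as its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # d_min(Ci, Cj): smallest pairwise distance between the clusters
                d = min(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

rates = [5.0, 7.5, 18.0, 62.0, 66.5, 140.0]
print(single_linkage(rates, 3))  # [[5.0, 7.5, 18.0], [62.0, 66.5], [140.0]]
```

Stopping when a target cluster count is reached is one way to cut the dendrogram; running the loop to a single cluster reproduces the full tree of Fig. 1.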
On the other hand, HC has disadvantages such as long processing time, the inability to undo a merge (Han et al., 2012), and fluctuating performance on small datasets (Fidan & Yuksel, 2020). The KM algorithm was described in MacQueen's 1967 work "Some Methods for Classification and Analysis of Multivariate Observations". KM builds clusters by assigning each item to the closest of k randomly initialized centers (MacQueen, 1967). Distance measures such as Euclidean, Manhattan, and Cosine are used to calculate the distances between members, and the mean of the items in a cluster is taken as the cluster center (Han et al., 2012). Since each cluster is created from the distances of members to its center, an item can be placed in only one cluster. Methods in which an item cannot belong to more than one cluster are called hard clustering (Peters et al., 2013). The KM algorithm splits a dataset of n elements into k groups, where the parameter k determines the number of clusters and k ≤ n. KM aims to minimize the sum of squared distances E from the cluster centers, as shown in Eq. (3):

E = Σ_{i=1..k} Σ_{x ∈ C_i} ||x − μ_i||²   (3)

In Eq. (3), x is an item in cluster C_i, and μ_i represents the arithmetic mean (center) of cluster C_i. After the initial centers are chosen, items are assigned to clusters, the centers are recomputed from the resulting members, and the process is repeated. This re-computation of the centers, called iteration, continues until E reaches a minimum (Peters et al., 2013). The KM algorithm is the most widely used algorithm in the literature because it is simple to implement and fast on small amounts of data. However, it has shortcomings: the main problem is that good results cannot be achieved when the initial centers are chosen poorly (Jain et al., 1999).
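A minimal sketch of the KM iteration described above (Lloyd's algorithm on one-dimensional data). The initial centers and case rates are illustrative assumptions; fixing the centers also sidesteps, for the example's sake, exactly the initialization sensitivity the text notes.

```python
# Minimal K-Means (Eq. 3) on one-dimensional data.
# Initial centers are fixed here for reproducibility; in practice they
# are chosen randomly, which is the sensitivity noted in the text.
def k_means(points, centers, iterations=20):
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step: each center becomes the mean of its members
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

rates = [5.0, 7.5, 18.0, 62.0, 66.5, 140.0]
centers, clusters = k_means(rates, centers=[0.0, 60.0, 150.0])
print(clusters)  # [[5.0, 7.5, 18.0], [62.0, 66.5], [140.0]]
```

With well-separated data the loop converges in a couple of iterations; with poor initial centers it can settle on a worse partition, which is the main drawback cited above.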
Another drawback is that the parameter k is required before the clustering analysis (Fidan & Yuksel, 2020); having to fix k in advance is a problem when the number of clusters is uncertain. The algorithm, which is very fast for small k values, slows down as k increases (Peters et al., 2013). In addition, differences in cluster densities and cluster sizes reduce the efficiency of the KM algorithm (Han et al., 2012). Pioneering research on fuzzy theory in clustering was published in the study "A New Approach to Clustering", which aimed to develop an alternative method for data reduction (Ruspini, 1969). However, the FCM algorithm, the starting point for soft clustering algorithms, was developed by Bezdek. The basic idea of FCM, illustrated in Fig. 2, is that each item has a membership value in the range [0, 1] with respect to every cluster center (Bezdek, 1981). In FCM clustering, the distance between items is measured by Euclidean distance, and the clustering criterion is to minimize the membership-weighted sum of squared distances. Eq. (4) shows the clustering criterion for c clusters of n items:

J_m = Σ_{i=1..n} Σ_{j=1..c} u_ij^m ||x_i − v_j||²   (4)

where u_ij ∈ [0, 1] is the membership degree of item x_i in cluster j, v_j is the center of cluster j, and m ∈ (1, ∞) is called the fuzzifier parameter. As m increases, the fuzziness increases; as m approaches 1, FCM becomes similar to the K-Means algorithm in terms of its results (Wu, 2012). In the literature, m = 2 is recommended for better results (Bezdek, 1981; Peters et al., 2013). The membership u_ij of item x_i in cluster j is calculated from Euclidean distances as in Eq. (5):

u_ij = 1 / Σ_{k=1..c} (||x_i − v_j|| / ||x_i − v_k||)^(2/(m−1))   (5)

u_ij = 1 means that item x_i is definitely in cluster j and cannot be included in another cluster; u_ij = 0 means that item x_i is definitely not in cluster j. Membership values between 0 and 1 (0 < u_ij < 1) indicate the degree of closeness of item x_i to cluster j. Note that the sum of all membership values of an item must equal 1 (Bezdek, 1981).
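The membership computation of Eq. (5) can be sketched as follows, with m = 2 as recommended. The item value and the cluster centers are hypothetical, and only the membership update is shown; a full FCM run would alternate this step with recomputing the centers.

```python
# FCM membership degrees (Eq. 5) for one-dimensional data with m = 2.
# Centers are fixed here to illustrate the membership computation only.
def memberships(x, centers, m=2.0):
    dists = [abs(x - v) for v in centers]
    if 0.0 in dists:                      # item coincides with a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    expo = 2.0 / (m - 1.0)
    return [1.0 / sum((dists[j] / dists[k]) ** expo for k in range(len(centers)))
            for j in range(len(centers))]

u = memberships(30.0, centers=[10.0, 60.0])
print(u)            # the item is closer to the first center
print(sum(u))       # memberships of one item always sum to 1
```

For the values above the item gets membership 9/13 ≈ 0.69 in the first cluster and 4/13 ≈ 0.31 in the second, which is exactly the "partial belonging" that hard clustering cannot express.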
So, if u_ij < 1 for every cluster, the item has at least two membership values greater than 0 and therefore belongs to more than one cluster. Gray System Theory (GST), which first appeared in the literature with Deng's study "Control Problems of Grey Systems", is a method recommended for the analysis of datasets containing small samples and incomplete information (Deng, 1982). In GST, completely unknown information is represented by black and completely known information by white, while partial information is represented by gray (Liu et al., 2012). GST is seen as one of the most successful methods for handling uncertainty arising from insufficient data (Fidan & Yuksel, 2020). Gray Relational Clustering (GRC), developed on the basis of GST, is a clustering method that groups similar observations according to gray relational coefficients (Jin, 1993). Since clusters consist of items grouped according to a fixed rule, the clusters are homogeneous. GRC is an effective clustering method with easy implementation and a flexible structure (Fidan & Yuksel, 2020). In addition, since the number of clusters can be determined after the clustering analysis, it is more realistic than traditional clustering algorithms. It has been stated that GRC, which has only recently appeared in the clustering literature, has higher performance than partitional algorithms (Chang & Yeh, 2005; Wu et al., 2012; Fidan & Yuksel, 2020; Fidan, 2020).

The first step of the GRC method is to create the decision matrix from the dataset to be clustered. The decision matrix X for a dataset with n items and m features is shown in Eq. (6):

X = [x_i(k)],  i = 1..n,  k = 1..m   (6)

Utility-based (larger-is-better) and cost-based (smaller-is-better) normalizations are given in Eq. (7) and Eq. (8), respectively, yielding the normalized matrix X* in Eq. (9):

x*_i(k) = (x_i(k) − min_i x_i(k)) / (max_i x_i(k) − min_i x_i(k))   (7)
x*_i(k) = (max_i x_i(k) − x_i(k)) / (max_i x_i(k) − min_i x_i(k))   (8)
X* = [x*_i(k)]   (9)

The absolute differences matrix in Eq. (11) is obtained by Eq. (10), which calculates the differences between the rows of the normalized matrix and the reference item x*_0:

Δ_0i(k) = |x*_0(k) − x*_i(k)|   (10)
Δ = [Δ_0i(k)]   (11)

For example, if the first item is the reference, the difference for the second item (i = 2) on feature k is |x*_1(k) − x*_2(k)|. After the absolute differences are calculated for all features of i = 2, the process is repeated for i = 3, and so on, until the absolute differences have been computed for all items with respect to the first reference. The gray relational coefficient is then calculated as in Eq. (12):

γ(x*_0(k), x*_i(k)) = (Δ_min + ζΔ_max) / (Δ_0i(k) + ζΔ_max)   (12)

where Δ_min and Δ_max are the smallest and largest entries of the absolute differences matrix. ζ ∈ [0, 1] is called the distinguishing parameter, and a value of 0.5 is recommended (Ertugrul et al., 2016). Since there are as many gray coefficients as the number of features in an item, the gray relational degree of item i is calculated as the arithmetic mean of the coefficients, as shown in Eq. (13):

Γ_i = (1/m) Σ_{k=1..m} γ(x*_0(k), x*_i(k))   (13)

The highest value among the gray relational degrees identifies the item with the strongest relation to the reference series; this item, closest to the reference, forms a cluster with it. The center of the cluster consisting of these two items is the arithmetic mean of its members. Thus, the first cluster is built. After the first clustering, a new decision matrix is created, the next item is taken as a reference, and the process is repeated to build the next cluster.

Validation techniques are used to measure the performance of algorithms. The measurements are classified into two groups, external validation and internal validation, according to whether external data is used (Han et al., 2012). Since internal validation methods use no external data, the performance of a clustering algorithm is judged by the structure of the clusters. The Silhouette index (SI), Calinski-Harabasz index (CH), and Dunn index (DI) are the most commonly applied internal validation methods in the literature (Arbelaitz et al., 2013; Hassani & Seidl, 2017; Gupta & Panda, 2019).
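The GRC steps above (normalization, absolute differences, coefficients, degrees) can be sketched as follows. The two-feature dataset is made up, and only the degree computation for one reference item is shown, not the full iterative cluster-building procedure.

```python
# Gray relational degrees (Eqs. 7, 10, 12, 13) of items against a
# reference row, for a small hypothetical utility-type dataset.
ZETA = 0.5  # distinguishing parameter, as recommended in the text

def gray_relational_degrees(rows, ref_index=0):
    n, m = len(rows), len(rows[0])
    cols = list(zip(*rows))
    # Eq. (7): utility-based min-max normalization per feature
    norm = [[(rows[i][k] - min(cols[k])) / (max(cols[k]) - min(cols[k]))
             for k in range(m)] for i in range(n)]
    ref = norm[ref_index]
    # Eq. (10): absolute differences from the reference row
    diffs = [[abs(ref[k] - norm[i][k]) for k in range(m)] for i in range(n)]
    d_min = min(min(r) for r in diffs)
    d_max = max(max(r) for r in diffs)
    # Eqs. (12)-(13): coefficients per feature, then their mean per item
    return [sum((d_min + ZETA * d_max) / (d + ZETA * d_max) for d in row) / m
            for row in diffs]

data = [[20.0, 110.0], [25.0, 120.0], [90.0, 300.0]]  # made-up (cases, density)
degrees = gray_relational_degrees(data, ref_index=0)
print(degrees)  # the reference item always has degree 1.0
```

The item with the highest degree (other than the reference itself) would be merged with the reference to form the first cluster, after which the procedure repeats on a new decision matrix.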
SI is widely preferred in the literature since it considers compactness and separation together in determining clustering performance (Chaimontree et al., 2010; Liu et al., 2010; Arbelaitz et al., 2013; Mahi et al., 2018; Gupta & Panda, 2019). The SI of an item i is calculated as in Eq. (14):

s(i) = (b(i) − a(i)) / max(a(i), b(i))   (14)

where a(i) is the mean distance of item i to the other items in its own cluster, b(i) is the smallest mean distance of item i to the items of any other cluster, and distances are measured with the Euclidean metric d(x, y) = √((x_1 − y_1)² + (x_2 − y_2)² + … + (x_m − y_m)²) (Han, Kamber and Pei, 2012). The calculated SI value satisfies −1 ≤ s ≤ 1; the closer the value is to +1, the higher the performance of the cluster analysis (Rousseeuw, 1987). The CH index is an internal validation method calculated as the ratio of the separation between clusters to the dispersion within clusters (Calinski & Harabasz, 1974). The dispersion value is the distance of cluster items from their cluster center; the separation value is the distance of the cluster centers from the center of the entire dataset. The CH index for n items and k clusters is given in Eq. (17):

CH = (SS_B / (k − 1)) / (SS_W / (n − k))   (17)

where SS_B is the inter-cluster divergence and SS_W is the intra-cluster divergence: SS_B = Σ_j n_j ||c_j − c||² and SS_W = Σ_j Σ_{x ∈ C_j} ||x − c_j||². Here n_j and c_j represent the number of items and the arithmetic mean (centroid) of cluster j, c represents the centroid of the entire dataset, and x is an item of the cluster. A high CH value means the clustering algorithm has high performance (Liu et al., 2010), and the maximum CH value indicates the optimal number of clusters (Arbelaitz et al., 2013; Kettani et al., 2015). The CH index, also called the Variance Ratio Criterion, is stated in some studies to be widely used because it gives more consistent results than other indexes, is easy to implement, and has a low computational cost (Milligan & Cooper, 1985; Kettani et al., 2015; Harsh & Ball, 2016). DI is another internal validation index, measuring clustering performance by the ratio of inter-cluster separation to intra-cluster compactness (Dunn, 1973).
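As a small sketch of Eq. (14) on one-dimensional distances; the cluster assignments below are hypothetical.

```python
# Silhouette value of one item (Eq. 14) for one-dimensional clusters.
def silhouette(item, own, others):
    # a(i): mean distance to the other members of the item's own cluster
    a = sum(abs(item - p) for p in own if p != item) / (len(own) - 1)
    # b(i): smallest mean distance to the members of any other cluster
    b = min(sum(abs(item - p) for p in c) / len(c) for c in others)
    return (b - a) / max(a, b)

own_cluster = [5.0, 7.5, 18.0]
other_clusters = [[62.0, 66.5], [140.0]]
s = silhouette(5.0, own_cluster, other_clusters)
print(round(s, 3))  # 0.869: the item sits well inside its own cluster
```

Averaging s(i) over all items gives the dataset-level SI used for the comparisons in this study; a value near +1 indicates compact, well-separated clusters.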
While the minimum distance between items in different clusters is the basis of the separation between clusters, the maximum cluster diameter represents the compactness (Arbelaitz et al., 2013). In other words, the DI given in Eq. (18) is the ratio of the minimum distance between items of different clusters to the maximum cluster diameter:

DI = min_{i ≠ j} d(C_i, C_j) / max_k diam(C_k)   (18)

where d(C_i, C_j) = min {d(x_1, x_2) : x_1 ∈ C_i, x_2 ∈ C_j} for items x_1 and x_2 in different clusters. A high DI value means high clustering performance, and the number of clusters with the maximum DI value is optimal (Arbelaitz et al., 2013). In the experimental study, hierarchical, partitional, soft clustering, and gray relational methods were used: agglomerative HC for hierarchical, KM for partitional, FCM for soft clustering, and GRC. R was used for the HC, KM, and FCM algorithms; the GRC algorithm was implemented in C#. The validation values of FV were found to be 0.67 for SI, 362.18 for CH, and 0.046 for DI. Validation values according to the number of clusters are shown in Table 4. The DI value of FV is almost the same as the DI value of GRC. These results show that GRC demonstrates higher clustering performance on Ds1 than FV and the other algorithms. For the clustering analysis of Ds1, the algorithm performances by number of clusters are given in Fig. 3 for SI, Fig. 4 for CH, and Fig. 5 for DI. In all metrics, the most successful algorithm for all cluster counts on Ds1 is GRC. The risk groups of the provinces determined by FV and GRC for Ds1 are given in Table 5.
Group 2, FV: P1, P3, P7, P11, P13, P15, P21, P23, P24, P25, P29, P30, P31, P32, P33, P37, P39, P42, P43, P45, P46, P50, P55, P56, P61, P70, P76, P80
Group 2, GRC: P1, P3, P7, P14, P17, P15, P23, P24, P25, P29, P30, P33, P37, P39, P42, P43, P46, P56, P70, P76
Group 3, FV: P8, P9, P10, P16, P19, P22, P27, P40, P41, P44, P47, P48, P49, P51, P52, P54, P58, P59, P62, P73, P79, P81
Group 3, GRC: P2, P5, P6, P8, P9, P10, P11, P13, P16, P19, P20, P21, P22, P27, P28, P31, P32, P35, P40, P41, P44, P47, P48, P49, P51, P52, P53, P54, P55, P58, P59, P61, P62, P66, P73, P79, P80, P81
Group 4, FV: P2, P5, P6, P12, P20, P28, P34, P35, P53, P63, P64, P65, P66, P67, P69, P74, P75
Group 4, GRC: P12, P34, P63, P64, P65, P67, P69, P74, P75

As in Fidan & Yuksel (2020), our study emphasizes that GRC is the most stable algorithm for determining regional Covid-19 risk levels. One of the precautions for controlling the Covid-19 epidemic, which has harmful effects on people globally, is to impose restrictions. These restrictions, which include limitations on social and economic life, are generally decided according to the regional number of cases per 100,000 people. Determining risk levels and grouping cities by fixed values can be seen as a kind of clustering, but it is not a valid clustering approach. This study aimed to demonstrate that it is more realistic to use unsupervised machine learning techniques to determine the locations to be restricted. Algorithms from the 4 clustering approaches, namely hierarchical, partitional, soft, and gray relational, were applied to the 4 datasets created from the number of cases, population density, average age, and air pollution of the provinces. Clustering performances were measured using SI, CH, and DI. The traditional algorithms performed worse because a dataset containing only the number of cases holds too little information. For clustering based on the number of cases alone, GRC proved to be the most successful algorithm.
In this context, this study reveals that GRC is a more suitable option than FV for determining the areas to be restricted according to the number of Covid-19 cases. The study also emphasizes that the number of clusters becomes an important issue when variables beyond the number of cases are used in determining restrictions. It was observed that a higher number of clusters increases clustering performance; in this context, it is appropriate to identify as many risk groups as practical for restrictions, so that sounder and more effective restriction decisions can be made to reduce the spread of Covid-19. When clustering datasets with environmental variables to determine restriction regions, GRC, HC, or KM can be used for 4 or fewer clusters (k ≤ 4), but GRC should be chosen for 5 or more clusters (k ≥ 5). Accordingly, this study recommends that governments define at least 5 risk levels for restrictions and group the regions using GRC. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Comparisons between data clustering algorithms
Clustering and superspreading potential of SARS-CoV-2 infections in Hong Kong
Introduction to Machine Learning
An extensive comparative study of cluster validity indices
Pattern Recognition with Fuzzy Objective Algorithms
First data analysis about possible COVID-19 virus airborne diffusion due to air particulate matter (PM): The case of Lombardy (Italy)
A dendrite method for cluster analysis
Advanced Data Mining and Applications. Lecture Notes in Computer Science, 6440
Grey relational analysis based approach for data clustering
Clustering US States by Time Series of COVID-19 New Case Counts with Non-negative Matrix Factorization
Air pollution and respiratory viral infection
Can atmospheric pollution be considered a co-factor in extremely high level of SARS-CoV-2 lethality in Northern Italy?
Clustering of European countries and territories based on cumulative relative number of COVID 19 patients in 2020
Living environment matters: Unravelling the spatial clustering of COVID-19 hotspots in Kolkata megacity, India. Sustainable Cities and Society, Article 102577
Control problems of grey systems
A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters
Grey relational analysis approach in academic performance comparison of university: A case study of Turkish universities
Role of the chronic air pollution levels in the Covid-19 outbreak risk in Italy
Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
Grey Relational Classification of Consumers' Textual Evaluations in E-Commerce
A Novel Short Text Clustering Model Based on Grey System Theory
Clustering Validation of CLARA and K-Means Using Silhouette & DUNN Measures on Iris Dataset
Data Mining: Concepts and Techniques
Automatic k-expectation maximization (A k-EM) algorithm for data mining applications
Using internal evaluation measures to validate the quality of diverse stream clustering algorithms
UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets
COVID-19 Mitigation Behaviors by Age Group - United States
Data Clustering: A Review
Cluster-based dual evolution for multivariate time series: Analyzing COVID-19
Grey relational clustering method and its application
Ak-means: An automatic clustering algorithm based on K-means
Dynamic indexing and clustering of government strategies to mitigate Covid-19
A brief introduction to grey systems theory
Understanding of Internal Clustering Validation Measures
Some Methods for Classification and Analysis of Multivariate Observations
The Silhouette Index and the K-Harmonic Means algorithm for Multispectral Satellite Images Clustering
Fuzzy clustering method to compare the spread rate of Covid-19 in the high risks countries
Clustering Approach to Classify Italian Regions and Provinces Based on Prevalence and Trend of SARS-CoV-2 Cases
An examination of procedures for determining the number of clusters in a dataset
Clustering for Data Mining: A Data Recovery Approach
Soft clustering - Fuzzy and rough approaches and their extensions and derivatives
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis
A New Approach to Clustering
K-Means Clustering of COVID-19 Cases in Indonesia's Provinces
Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China
Analysis of parameter selections for fuzzy c-means
Applying hierarchical grey relation clustering analysis to geographical information systems: A case study of the hospitals in Taipei City
A Comprehensive Survey of Clustering Algorithms
Haze is a risk factor contributing to the rapid spread of respiratory syncytial virus in children
Clustering analysis of countries using the COVID-19 cases dataset