key: cord-0783426-j219i9nl
authors: Cerqueti, Roy; Ficcadenti, Valerio
title: Combining rank-size and k-means for clustering countries over the COVID-19 new deaths per million
date: 2022-03-11
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2022.111975
sha: 52334872b97fbe81cbbb5e53d726f0aec2abb78c
doc_id: 783426
cord_uid: j219i9nl

This paper deals with the cluster analysis of selected countries based on COVID-19 new deaths per million data. We implement a statistical procedure that combines a rank-size exploration and a k-means approach for clustering. Specifically, we first carry out a best-fit exercise on a suitable polynomial rank-size law at an individual country level; then, we cluster the considered countries by adopting a k-means clustering procedure based on the calibrated best-fit parameters. The investigated countries are selected considering those with a high value for the Healthcare Access and Quality Index to make a consistent analysis and reduce biases from the data collection phase. Interesting results emerge from the meaningful interpretation of the parameters of the best-fit curves; in particular, we show some relevant properties of the considered countries when dealing with the days with the highest number of new daily deaths per million and waves. Moreover, the exploration of the obtained clusters allows explaining some common countries' features.

The spatio-temporal patterns of COVID-19 represent one of the most relevant themes for statistical research nowadays, given the crucial relevance of the pandemic disease in contexts of society such as economics and, of course, health. This pandemic has heterogeneous implications on countries and regional realities. The most common example we can mention is given by the different applications of the so-called non-pharmaceutical interventions in the preliminary phases (see, e.g. Flaxman et al., 2020; Tian et al., 2021).
These differences must be included in the premise of an effective exploration of COVID-19 repercussions. Some authors deal with forecasting exercises of the future evolution of deaths and infections dynamics (see, e.g., Bertozzi et al., 2020; Moein et al., 2021; Nabi, 2020; Tang et al., 2021; Prasanth et al., 2021). In this respect, Ioannidis et al. (2020) state that the reliability of the predictions related to COVID-19 is debatable for several reasons, including the high sensitivity of the estimates to the employed methodology. [...] based on latent variables and estimating it through a Markov chain Monte Carlo (MCMC) algorithm. Still in the context of MCMC, Lee et al. (2021) discuss the propagation of COVID-19 in Scotland by adopting a Bayesian-type framework. In Schneble et al. (2021), the registered death counts related to COVID-19 are modelled to monitor the dynamic behaviour of the infections on a small-area level in Germany.

Differently from the studies above, we consider a selection of countries and explore their daily data on COVID-19 new deaths per million. We combine a rank-size best-fit exercise (the size being the considered variable) and a cluster analysis of k-means type. We show that a rank-size law of third-degree polynomial type provides high-quality goodness-of-fit results. The calibrated parameters feed the k-means cluster analysis based on a Euclidean distance, with k = 3 (the reasons for this choice are presented in Section 2). In this paper, we do not intend to propose a new method for clustering COVID-19 data in terms of countries, as Zubair et al. (2020) have done; instead, we consider a statistical clustering technique that is well known in the literature and widely used, also in operations research, and apply it to a novel, relevant problem, with highly informative results.
To assure data reliability and to reduce possible sources of bias in the data collection, countries are selected by taking those with a high value of the Healthcare Access and Quality Index (HAQ hereafter, see Barber et al., 2017). Moreover, to avoid distortions in the best-fit procedure, we have removed the outliers at a country level during the data pre-treatment phase. The interpretation of the results is grounded on the meaning of the calibrated parameters in terms of the polynomial curve shape; the analysis of the obtained clusters highlights regularities and deviations among the considered countries.

Other contributions present a cluster analysis of the data related to COVID-19 and are summarised in Table 1. For instance, James and Menzies (2020) and Rios et al. (2021) employ k-means and hierarchical clustering, respectively, to analyze public policies over time (the former) and to forecast pandemic waves (the latter). In these studies, time is considered because of the purposes of the research. Similarly, Li et al. (2021) [...] Siddiqui et al. (2020) consider the relationship between temperature in Chinese areas and the spreading of the disease to cluster China's territories. A similar exercise is done by Vadyala et al. (2021), where humidity is also considered, but for exploring Louisiana's pandemic-related data. Kiaghadi et al. (2020) consider a larger number of variables. The authors included in the clustering elements like "Access [...]

Table 1 (excerpt). Reference | Clustering method | Data
[...] Lopes (2020) | hierarchical | COVID-19 cases in multiple countries
Kumar (2020) | hierarchical | COVID-19 cases, deaths and recovery in India
Li et al. (2021) | k-means | COVID-19 cases in China
Rizvi et al. (2021) | k-means | COVID-19 cases and deaths in multiple countries

[...] (2021) apply rank-size distributions to model the spatio-temporal evolution of COVID-19 in USA and China, but not directly with data regarding deaths or new cases.
More generally, the use of rank-size laws is typically driven by robust compliance of the data with theoretical models (see, e.g., the work of Ficcadenti et al., 2019, for text analysis or Ficcadenti and Cerqueti, 2017, for earthquake cost evaluations based on rank-size laws); namely, when the best fit is appropriate, the goodness of fit must be excellent. This is the case in this paper, as presented in the section devoted to the results. Among other studies attempting to model similar information related to COVID-19, one can notice, for example, Table 1 by Tuli et al. (2020), where the $R^2$ values are lower than those usually obtained with rank-size best fits. In addition, Machado and Lopes (2020) write in the section "Regression models for describing the spread of COVID-19" that "a single model with a limited number of parameters is not able to fit well the time series [...] for all countries". So, even if they have found some goodness of fit comparable to that expected for rank-size compliance (e.g., they report an $R^2 = 0.99$ for Italian and Chinese data), the issue of identifying a model that works well for all the countries remains open. It also involves some considerations about over-fitting the data with many parameters and the increasing computational complexity in fitting and then clustering. Therefore, another advantage of the approach proposed in the present study is the capacity of rank-size relationships to create a unified environment where comparisons are possible. Namely, we can fit each country's data and compare the results, ensuring that the best-fit capacity does not affect the clustering activity. The rank-size approach has the advantage (in this case) of allowing the analysis without the temporal feature of the data. Namely, in sorting the observations of new deaths per million and ranking them, the dates on which the casualties occurred are no longer relevant to reach conclusions regarding the countries.
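The loss of the temporal dimension under ranking can be illustrated with a minimal sketch. The helper name `rank_size` and the numbers are hypothetical and serve only as an illustration; they are not taken from the dataset used in the paper.

```python
import numpy as np


def rank_size(series):
    """Return (ranks, sizes): sizes sorted in decreasing order, rank 1 = largest."""
    sizes = np.sort(np.asarray(series, dtype=float))[::-1]
    ranks = np.arange(1, sizes.size + 1)
    return ranks, sizes


# Hypothetical daily new deaths per million, in chronological order.
daily = np.array([0.2, 1.5, 0.9, 3.1, 0.0, 2.4])

# The same observations with the dates shuffled.
shuffled = np.random.default_rng(0).permutation(daily)

ranks_a, sizes_a = rank_size(daily)
ranks_b, sizes_b = rank_size(shuffled)

# Both orderings lead to the same ranked sizes.
same_profile = np.array_equal(sizes_a, sizes_b)
```

Both orderings give the same ranked sizes, so the dates carry no weight in the subsequent fit.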
In this way, the issues presented by Middelburg and Rosendaal (2020) and [...] Besides, the study advances a unique combination of rank-size and clustering analysis to evaluate past realizations of COVID-19 patterns. To the best of our knowledge, this is novel in the literature. The rest of the paper is organised as follows. Section 2 describes the considered dataset and presents the methodologies employed for the analysis. Section 3 contains the empirical results, along with a discussion of them.

The time series of daily new deaths per million by country has been downloaded from Roser et al. (2020). The data source collects a comprehensive set of variables describing many features related to COVID-19, and it has been employed in several authoritative studies (see, e.g., Zhao et al., 2020; Hasell et al., 2020; Berg et al., 2020). Each country has a specific reference period, depending on the registered beginning of the pandemic propagation. All the investigated periods end on April 18th, 2021, when the data were retrieved, while the starting points are reported in the last column of Table 2. Countries are rather heterogeneous in terms of health care standards. This might create different reporting practices, especially at the beginning of the pandemic (see, for example, McDonell, 2020). To overcome this potential bias and obtain a more reliable dataset, we have chosen countries with a high level of HAQ, presented by Barber et al. (2017). Such an indicator is [...] Table 2 for the details. The same table contains the main descriptive statistics of the considered data at a country level for a more detailed overview.

We have preprocessed the dataset to make the best fit more effective and avoid distortions. First, we have removed the outliers in each analyzed series by applying an interquartile method.
Namely, for each series, we have calculated the 75th ($Q_3$) and 25th ($Q_1$) percentiles; then, we eliminated all the observations lying outside the range $[Q_1 - 1.5\,(Q_3 - Q_1),\; Q_3 + 1.5\,(Q_3 - Q_1)]$. In doing so, we have addressed the so-called "king and vice-roy" and "queen and harem" effects, i.e. the deviations due to outliers at low and high ranks in the rank-size analysis (see, e.g. Ausloos, 2014; Cerqueti and Ausloos, 2015; Ficcadenti et al., 2020). The country data present tails on the right side only, so the eliminated observations always lie beyond the upper limit.

After many trials of different functional forms, such as the Zipf-Mandelbrot law (see Mandelbrot, 1953, 1961) and the Universal Law (see Ausloos and Cerqueti, 2016), we have used for each country the following third-degree polynomial relationship which, as we will see, gives satisfactory best-fit outcomes:

$$z = a + b\,r + c\,r^{2} + d\,r^{3}, \qquad (1)$$

where $z$ is the size, $r$ represents the rank related to size $z$, and $a$, $b$, $c$ and $d$ are real parameters to be calibrated. To implement the best-fit procedure, we have used the Scikit-learn Python library (Pedregosa et al., 2011), which estimates the parameters "by adding higher-order polynomial terms of existing data features as new features in the dataset", as reported by Bisong (2019). Once the best-fit procedure is performed, each country $i = 1, \dots, N$ is associated with four calibrated parameters, collected in a vector $\hat{x}_i = (\hat{a}_i, \hat{b}_i, \hat{c}_i, \hat{d}_i)$. Such parameters have been used in the countries' clustering procedure by adopting the "k-means++" algorithm of Scikit-learn (see Pedregosa et al., 2011; Arthur and Vassilvitskii, 2006). The effect of the random initialization has been tested, and it did not impact the results. To implement the k-means++ procedure, the parameters have been standardized over the considered countries. We set [...] Table 3; as can be noted, they straightforwardly suggest k = 3 as the best option.
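The preprocessing, fitting and clustering steps described above can be sketched with Scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`iqr_filter`, `fit_rank_size`, `cluster_parameters`), the choice of `LinearRegression` on `PolynomialFeatures`, the `n_init` and `seed` values, and the synthetic series are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def iqr_filter(values, whisker=1.5):
    """Drop observations outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    keep = (values >= q1 - whisker * spread) & (values <= q3 + whisker * spread)
    return values[keep]


def fit_rank_size(values):
    """Fit z = a + b*r + c*r^2 + d*r^3 to the ranked sizes; return (params, R^2)."""
    sizes = np.sort(np.asarray(values, dtype=float))[::-1]  # rank 1 = largest size
    ranks = np.arange(1, sizes.size + 1, dtype=float).reshape(-1, 1)
    # Columns r, r^2, r^3; the intercept a is handled by LinearRegression.
    features = PolynomialFeatures(degree=3, include_bias=False).fit_transform(ranks)
    model = LinearRegression().fit(features, sizes)
    b, c, d = model.coef_
    params = np.array([model.intercept_, b, c, d])
    return params, model.score(features, sizes)


def cluster_parameters(param_matrix, k=3, seed=0):
    """Standardize the (countries x 4) parameter matrix and run k-means++."""
    scaled = StandardScaler().fit_transform(param_matrix)
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    return kmeans.fit_predict(scaled)


# Synthetic, purely illustrative series: an exact cubic rank-size profile.
r = np.arange(1, 201, dtype=float)
synthetic = 10 - 0.1 * r + 4e-4 * r**2 - 6e-7 * r**3
params, r2 = fit_rank_size(iqr_filter(synthetic))
```

In an actual application, `fit_rank_size` would be called once per country and the resulting rows stacked into the matrix passed to `cluster_parameters`.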
Indeed, obtaining adjacent neighbourhoods of clusters representing different countries' structures and regimes is preferred. So, taking k = 3 means selecting the condition where the distance between clusters is the minimum. Hence, the groups are close to each other, and the variance within the clusters is maximum, ensuring wider cluster perimeters, namely a higher probability that countries behaving similarly fall in the same cluster's area.

The proposed clustering algorithm selects the three clusters' centroids that minimize the within-clusters sum-of-squares criterion:

$$\min_{\mu_0, \mu_1, \mu_2} \; \sum_{i=1}^{N} \; \min_{J \in \{0, 1, 2\}} \left\| \hat{x}_i - \mu_J \right\|^{2}, \qquad (2)$$

where $\hat{x}_i = (\hat{a}_i, \hat{b}_i, \hat{c}_i, \hat{d}_i)$ is the vector of standardized calibrated parameters of country $i$ and $\| x - \mu \|$ is the Euclidean distance between the four-dimensional vectors $x$ and $\mu$. Let us denote the centroids coming from the optimization procedure in (2) by $\mu_0^{*}$, $\mu_1^{*}$, $\mu_2^{*}$; they are associated with the clusters labelled "0", "1" and "2", respectively. Such [...], for $i = 1, \dots, N$ and $J = 0, 1, 2$. To summarise the proposed procedure, we report in pseudo-code what is described above in Algorithm 1.

Algorithm 1: Summary of the rank-size analysis and k-means clustering.

Table 3: Evaluation of the best k for the k-means cluster analysis. The references for the indexes used are listed here in the same order as they appear in the table: Rousseeuw (1987); Davies and Bouldin (1979); Dunn (1974); Caliński and Harabasz (1974).

3 Results

The results of the best fit for Eq. (1) are reported in the first six columns of Table 4. The $R^2$ and the RMSE are outstanding, proving the ability of Eq. (1) to represent the rank-size relationship. The interpretation of the calibrated parameters $\hat{a}$, $\hat{b}$, $\hat{c}$ and $\hat{d}$ leads to relevant comments related to the considered countries. The parameter $\hat{a}$ is the intercept of the best-fit curve with the y-axis. Hence, such a parameter is positively influenced by the highest level of new daily deaths per million experienced by the countries. We observe the maximum value of $\hat{a}$ held by Hungary and the minimum one by Australia.
Differently, $\hat{b}$ is associated with the slope of the decay; thus, not unexpectedly, it is always negative in our case. In particular, at low ranks, a high absolute value of $\hat{b}$ stands for a steep curve, while $\hat{b}$ close to zero means that the curve is rather flat. The difference between such cases is the distance between the sizes at the low ranks, which is large for the steep cases and small for the flat ones. We observe that countries experiencing a single pandemic wave with low daily deaths have similar values. This explains why such countries have parameter $\hat{b}$ much closer to zero (see the case of Australia in Figure 1); on the contrary, the maximum absolute value of $\hat{b}$ is scored by Slovenia.

The concavity is driven by $\hat{c}$, which is positive or negative according to a convex or concave shape of the curve at low ranks, respectively. In our case, such a calibrated parameter is always positive, except for Czechia and Italy. Therefore, for all the other countries, the decrements of the low-rank sizes decrease as the rank grows. This means that the highest numbers of daily deaths per million form a peak in the overall distribution. The flatter shape of the concave curve at low ranks for Italy and Czechia points to more homogeneous values of the daily deaths per million at low ranks. Such behaviours are amplified as the absolute value of $\hat{c}$ increases. We notice that Slovenia scores the maximum value of such a parameter in this respect.

In conclusion, an easy computation gives that the rank $r^{*} = -\hat{c}/(3\hat{d})$ represents the unique inflection point of the best-fit curve, where a change of concavity is observed. This quantity is reported in the column "Inflection point" of Table 4. The highest value of the rank associated with the inflection point is scored by Czechia, while the lowest is scored by Slovenia. Such values further confirm the aforementioned logic, with Czechia having experienced a more recent and prolonged critical situation than Slovenia (see Figure 1).
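The inflection rank discussed above follows from the second derivative of the fitted curve. The short derivation below assumes that Eq. (1) is the cubic $z(r) = a + b\,r + c\,r^{2} + d\,r^{3}$ with $\hat{d} \neq 0$, and writes $r^{*}$ for the inflection rank.

```latex
\begin{align}
z(r)   &= \hat{a} + \hat{b}\,r + \hat{c}\,r^{2} + \hat{d}\,r^{3},\\
z'(r)  &= \hat{b} + 2\hat{c}\,r + 3\hat{d}\,r^{2},\\
z''(r) &= 2\hat{c} + 6\hat{d}\,r,\\
z''(r^{*}) = 0 \;&\Longrightarrow\; r^{*} = -\frac{\hat{c}}{3\hat{d}}.
\end{align}
```

Since $z''$ is linear in $r$, it changes sign exactly once, so the inflection point is unique; it falls at a positive rank only when $\hat{c}$ and $\hat{d}$ have opposite signs.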
The standardised parameters $\hat{a}$, $\hat{b}$, $\hat{c}$ and $\hat{d}$ are employed to feed the clustering algorithm. The resulting clusters are reported in the column "Clusters" in Table 4 which, jointly with Figures 1, 2 and 3, further informs about the features captured in fitting the data with Eq. (1). The cluster identified with the colour blue and the number "0" mainly contains countries with a relatively low number of deaths per million; Australia, Japan and South Korea record the lowest losses. This is further confirmed by sorting the results by $\hat{a}$. It is relevant to point out that the same [...] Despite the outlier removal procedure, these relatively small countries suffered losses with high daily peaks. Finally, the green cluster identified with the number "2" has countries sitting in the middle of the distribution when values are sorted by $\hat{b}$. The same holds when they are ordered by $\hat{c}$, except for Czechia and Italy, which present the lowest values of $\hat{c}$ even though they belong to cluster number 2. To justify such a result, we look at the "Inflection point" column in Table 4 [...] Table 4).

We have provided a reasonable interpretation of the four estimated parameters in Eq. (1), hence capturing insightful information regarding the COVID-19 severity in the countries. In this respect, the clustering activity is grounded on the parameters calibrated from the best-fit exercise, and we group countries according to the features of the phenomenon captured by the parameters of such a rank-size function. The main determinants are the days with peaks of deaths, the steadiness of the number of casualties, the endured COVID-19 waves and other elements that affect the shape of the ranked data. In Figure 3 a visual representation of the cluster profiles is reported, and in Table 4 [...]

It is important to notice that the rank-size approach allows for evaluations of the overall phenomenon without referring to specific periods and time ranges.
In doing so, the proposed approach is free from biases associated with the time inconsistency of the data at country levels. Hence, we do not need here to implement time-based normalising procedures, which would demand an additional transformation of the data as reported by Zarikas et al. (2020) and Middelburg and Rosendaal (2020).

Table 4: Estimated parameters and clusters for each country. The last two columns respectively represent the rank at which the best fit of Eq. (1) presents a change in concavity and the maximum rank obtained for that country, namely the length of the series.

[...] (1) and the countries' respective parameters $\hat{a}$, $\hat{b}$, $\hat{c}$, and $\hat{d}$ from [...]

References

The application of K-means clustering for province clustering in Indonesia of the risk of the COVID-19 pandemic based on COVID-19 data
k-means++: The advantages of careful seeding
Zipf-Mandelbrot-Pareto model for co-authorship popularity
A universal rank-size law
Healthcare access and quality index based on mortality from causes amenable to personal health care in 195 countries and territories
A spatio-temporal model based on discrete latent variables for the analysis of COVID-19 incidence
Mandated bacillus calmette-guérin (BCG) vaccination predicts flattened curves for the spread of COVID-19
The challenges of modeling and forecasting the spread of COVID-19
Building Machine Learning and Deep Learning Models on Google Cloud Platform, chapter Linear Regression
A dendrite method for cluster analysis
Evidence of economic regularities and disparities of Italian regions from aggregated tax income size data
A cluster separation measure
Well-separated clusters and optimal fuzzy partitions
Earthquakes economic costs through rank-size laws
A joint text mining-rank size investigation of the rhetoric structures of the US Presidents' speeches
Words ranking and Hirsch index for identifying the core of the hapaxes in political texts
Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe
A cross-country database of COVID-19 testing
COVID-19 Cases and Deaths in Southeast Asia Clustering using K-Means Algorithm
Forecasting for COVID-19 has failed
Cluster-based dual evolution for multivariate time series: Analyzing COVID-19
A power-law-based approach to mapping COVID-19 cases in the United States
On the authenticity of COVID-19 case figures
Assessing COVID-19 risk, vulnerability and infection prevalence in communities
Monitoring novel corona virus (COVID-19) infections in India by cluster analysis
Quantifying the small-area spatio-temporal dynamics of the Covid-19 pandemic in Scotland during a period with limited testing capacity
Efficient management strategy of COVID-19 patients based on cluster analysis and clinical decision tree classification
Rare and extreme events: the case of COVID-19 pandemic
An informational theory of the statistical structure of language
On the theory of word frequencies and on related Markovian models of discourse
Coronavirus: Sharp increase in deaths and cases in Hubei
COVID-19: How to make between-country comparisons
Inefficiency of SIR models in forecasting COVID-19 epidemic: a case study of Isfahan
Forecasting COVID-19 pandemic: A data-driven analysis
Scikit-learn: Machine Learning in Python
Forecasting spread of COVID-19 using google trends: A hybrid GWO-deep learning approach
Country transition index based on hierarchical clustering to predict next COVID-19 waves
Clustering of countries for COVID-19 cases based on disease prevalence, health systems and environmental indicators
Coronavirus pandemic (COVID-19)
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis
Nowcasting fatal COVID-19 infections on a regional level in Germany
Correlation between temperature and COVID-19 (suspected, confirmed and death) cases based on machine learning analysis
Spatiotemporal evolution of COVID-19 infection and detection within night light networks: comparative analysis of USA and China
The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases
The Effects of Stringent and Mild Interventions for Coronavirus Pandemic
Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing
Prediction of the number of covid-19 confirmed cases based on K-means-LSTM. Array
Modeling the Epidemic Growth of Preprints on COVID-19 and SARS-CoV-2
Clustering analysis of countries using the COVID-19 cases dataset
Generalized k-means in GLMs with applications to the outbreak of COVID-19 in the united states
Time to lead the prevention and control of public health emergencies by informatics technologies in an information era