key: cord-1037003-rbe3pge5
authors: da Costa, Joaquim Pinto; Garcia, André
title: New confinement index and new perspective for comparing countries - COVID-19
date: 2021-08-17
journal: Comput Methods Programs Biomed
DOI: 10.1016/j.cmpb.2021.106346
sha: cb9bc86a980fe3567daf1ea456ab3afeaa804e0e
doc_id: 1037003
cord_uid: rbe3pge5

Background and objective: In the difficult problem of comparing countries regarding their lockdown measures or the deaths caused by COVID-19, there is still no agreement on the best strategy to follow. Thus, we propose a new way of comparing countries that avoids the main difficulties of the comparison by using three-dimensional trajectories for this type of data. Methods: We introduce a new index to analyze the level of confinement that each country was subject to over time, based on the Community Mobility Reports published by Google and built using Principal Component Analysis. Subsequently, by using longitudinal clustering, we divide the European countries into similar groups according to COVID-19 deaths and to the confinement index. To make the most of the clustering methods, we first resort to artificial longitudinal data to evaluate both the methods and the indices. Results: By using artificial data, we find that the Calinski–Harabasz index outperformed other internal indices in indicating the real number of clusters. The tests also suggested that K-means with Euclidean distance was the best method among the ones studied. In the application to both the mobility and fatalities datasets, we found two groups in each one. Conclusions: Our analysis shows that northern European countries had more mobility during the first confinement and that the deaths caused by COVID-19 started to drop around the 40th day after the first death.

Over the years, the occurrence and development of many diseases have been increasing [1] [2] [3] [4], leading to the need for cohort studies in fields like genetics and epidemiology. This type of study is a particular form of longitudinal study, in which individuals are observed over a certain period, gathering what is known as longitudinal data. Observations can be seen as repeated measurements over time, so that each individual produces their own trajectory in a geometric plane.

Univariate time series data typically arise from the collection of several data points over time from a single source, whereas longitudinal data typically arise from gathering observations over time from several sources. Thus, the collection of longitudinal data is naturally much more costly than the collection of time series or cross-sectional data [5]. However, it has become widely available in both developed and developing countries [6]. Longitudinal studies are mainly encountered in the health sciences. Nonetheless, they can also be found in areas like climatology [7], business [8], or the banking industry [9].

Corresponding author: J.P. da Costa (e-mail: jpcosta@fc.up.pt).

In addition, longitudinal studies allow the researcher to separate aging effects (changes over time within individuals) from cohort effects (differences between subjects at baseline). Such cohort effects are often mistaken for changes occurring within individuals. With the absence of longitudinal data, one cannot distinguish these two alternatives [10, 11] .

In many situations, such as clinical trials, there is a natural heterogeneity among individuals in terms of how diseases develop and progress. This heterogeneity is due to many factors, but mainly to genetics [12] [13] [14]. Hence, it is important to study and analyze the existence of homogeneous groups among the individuals' trajectories. This helps to identify relevant trajectory patterns and to track subjects who are very likely to be involved in the same or similar processes. This process of finding homogeneity is not a novelty in the medical field and is often called disease subtyping, which is, by definition, the task of identifying sub-populations of similar patients that can guide treatment decisions for a given individual [15]. One efficient approach to detecting patterns and structures in the data is cluster analysis. The relevant groups obtained by clustering share common characteristics and play an important role in the composition of the data. Although there is no general consensus on how to categorize clustering methods, they can be organized into 5 categories [16]: Partitioning, Hierarchical, Density-based, Grid-based, and Model-based.

Because model-based algorithms have a wide range of procedures, it is also common to classify all of the clustering methods as either model-based or non-parametric [17] [18] [19] . Model-based approaches assume that a mixture of underlying probability distributions generates the data and that it can be described using a standard statistical model [20] [21] [22] [23] . They are gaining popularity and have been applied in several contexts of longitudinal data over the years [21] [22] [23] [24] [25] . Unlike model-based approaches, non-parametric clustering methods have no assumptions on how the data was generated and explicitly focus on defining similarity between subjects and clusters. The K-means algorithm is by far the most used nonparametric method and has already been extended and adapted to longitudinal data [17, 19, 26, 27] .

Finally, most clustering algorithms require prior knowledge of the number of clusters as an input. However, since there is no reliable method to determine the "optimum" number of clusters in a dataset [28], various internal clustering validity indices have been proposed. These indices measure the goodness of a clustering structure without external information [29]. The number of clusters is chosen by executing the clustering algorithm several times, with a different number of clusters in each run, and keeping the run with the best index value. Still, this issue remains mostly unsolved.

In late 2019, a local outbreak of pneumonia of originally unknown causes was reported in Wuhan, China, and was quickly determined to be caused by a novel coronavirus, which has subsequently affected multiple countries worldwide [30, 31] . It has since been identified as a zoonotic coronavirus, resembling MERS coronavirus and SARS coronavirus and designated as COVID-19 [32] . This type of pneumonia, caused by the SARS-CoV-2 virus, is a highly infectious disease, and the ongoing outbreak has been declared by the World Health Organization (WHO) as a public health emergency of international concern, posing a high risk to countries with vulnerable health systems.

It is known that epidemics tend to grow exponentially. Nevertheless, growth cannot be exponential forever. Eventually, the virus will run out of new people to infect or kill, either because most people have already been infected or killed or because we as a society manage to control it. Since countries did not follow the same restrictions or respond equally to the COVID-19 pandemic, different situations were seen throughout the world. For instance, relative to the number of inhabitants, European countries like Italy or Belgium had a high number of deaths, while countries like Turkey or Czechia had relatively low numbers. Thus, it is very pertinent to compare the responses given by different countries in the battle against COVID-19.

Several approaches have been taken to study the trajectories generated by coronavirus infections (or deaths) in an attempt to model or predict future outcomes while a vaccine is yet to be found [33] [34] [35] [36]. In [37], a longitudinal model-based clustering system was applied to the COVID-19 disease trajectories over time to identify "vulnerable" clusters of counties. Even so, as of today, the application of longitudinal clustering to these trajectories is not commonly found in the literature. Besides that, in the course of the COVID-19 pandemic, multiple governmental authorities imposed quarantines in an effort to control the rapid spread of the virus. In late March 2020, around a third of the world's population was under some form of lockdown. Consequently, the economy and other sectors suffered major repercussions due to reduced mobility.

In the analysis of COVID-19 statistics, there are many misconceptions. Some people argue that the number of infections should be divided by the total number of inhabitants of each country, while others argue that it should be divided by the total number of tests. Moreover, the trajectories can also be viewed on a logarithmic scale instead of a linear one, thereby possibly leading to different conclusions [38]. This uncertainty, along with the incorrect use of statistical concepts, has led to some misleading results being shared throughout the media [39]. To make things worse, the available data might not correspond to the real numbers, since not everyone is tested and some asymptomatic individuals might have the virus without even knowing it.

Regarding the comparison between different countries, many studies report the number of cases per million inhabitants [40].

A country capable of testing more of its population will likely detect more cases than a country with less capability. However, even when considering the number of tests $N_T(t, A)$ that a country $A$ performs at time $t$, there is an added complexity. Not all countries use the same type of test, so the number of true positives varies depending on the type [40].

Generally, given the problems with the tests just described, deaths are usually a relatively safe indicator of the actual spread of the pandemic, although they lag behind infections by the time people take to succumb and are affected by how well risk groups are protected. For these reasons, in order to avoid any erroneous results regarding the infections and the number of tests, we decided not to consider them in our analysis. Next, we explain how we compare different countries through a new procedure that uses three-dimensional trajectories to analyze the data.

Given the above difficulties, the data we will analyze concern the number of deaths $N_D(t, A)$ that country $A$ has at time $t$ caused by the new coronavirus between the months of February and June; this type of data seems to us more reliable. Since most of the COVID-19 statistics are public, these data can be downloaded from several websites.

In such analyses, rather than only focusing on absolute numbers, the rate of growth should also be considered. Thus, instead of only considering a variable with the total number of casualties over time, $N_D(t, A)$, we added a variable that, for each day, takes into account the number of deaths in the past week, $N_{D7}(t, A)$. In other words, each point of the first variable $N_D(t, A)$ represents the cumulative number of deaths, whereas the second variable $N_{D7}(t, A)$ represents the total number of new deaths over the past 7 days. This added variable, when compared with the total number of deaths, in which time is implicit, gives a different perspective on the matter.
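The two variables can be derived from a series of daily death counts as in the following minimal sketch. The function name `cumulative_and_weekly` and the daily counts are hypothetical and only serve as an illustration; the first week is assumed to count every death recorded so far.

```python
from itertools import accumulate


def cumulative_and_weekly(daily_deaths, window=7):
    """Return (N_D, N_D7): cumulative deaths and deaths over the last `window` days."""
    cumulative = list(accumulate(daily_deaths))
    weekly = []
    for t, total in enumerate(cumulative):
        # Deaths in the past `window` days equal N_D(t) - N_D(t - window);
        # before a full window is available, all deaths so far are counted.
        earlier = cumulative[t - window] if t >= window else 0
        weekly.append(total - earlier)
    return cumulative, weekly


# Hypothetical daily death counts for one country (illustrative numbers only).
daily = [1, 0, 2, 3, 5, 8, 13, 21, 18, 15]
n_d, n_d7 = cumulative_and_weekly(daily)
```

Plotting `n_d7` against `n_d` gives the kind of two-dimensional trajectory discussed in the text, and adding the day index as a third axis gives the three-dimensional one.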

For instance, let us consider the number of deaths in Portugal. If we observe the graphic with the total number of deaths over time ( Fig. 1 ) and we were around the 30th day since the first death (blue line), one might conclude that the numbers of deaths are still growing at a considerable rate.

However, under the same conditions, but observing the number of deaths over the past week (Fig. 2), one would notice that the growth rate of deaths has already started to slow down. Fig. 3 plots the deaths in the past week against the cumulative number of deaths (in Portugal). As shown, in the beginning the weekly values are much more spaced out, and they tend to condense as the total number of deaths increases. While the total number of deaths rose from 1300 to 1450, the reported weekly deaths were very similar. After that value, the growth rate of deaths rapidly started to decrease. The blue line simultaneously indicates the weekly and total numbers of deaths around the 30th day since the first death. When this curve "closes", that is, when there are no new deaths in the past week, the final curve for the country will "land" at a certain point on the x-axis, which represents the total number of deaths for the respective country. We can see the trajectory made by each country and compare it to other countries until the epidemic is brought under control. In particular, if we choose to divide both variables by the number of inhabitants (here that is not a problem because we are looking at the relation between the two variables), the comparison is even more meaningful.

The majority of coronavirus analyses show infections or deaths plotted against time. However, the virus does not care whether it is April or May. It only cares about how many cases there are today and how many there will be tomorrow, i.e., the total number of infections or deaths and their growth rate. Thus, this added variable, when plotted against the total number of deaths, gives a clearer view of the growth in casualties. Besides, if exponential growth is plotted on a log scale, the exponential effect evens out: if we plot the new cases against the total number of cases on a log scale, exponential growth appears as a straight line. This way, the plot shows that all countries are on their COVID-19 journeys. The disease spreads in the same manner everywhere; only when a country abandons that linear trajectory (the curve drops) does it leave exponential growth. Nevertheless, in our analysis we will not use a log scale, in order to make the different trajectories more visible; otherwise, they would look very much like lines on top of each other, and the visible differences would reduce to the point on the x-axis (total number of deaths) at which each line is abandoned. Now, in order to show the evolution of these two-dimensional trajectories more clearly, we will take time explicitly into consideration, as it is an important element. We will see how the two variables just described (deaths in the past 7 days and the total number of deaths) vary over time, thus considering these three-dimensional trajectories in our comparisons in later sections.
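The straight-line behaviour on a log scale can be checked with a short derivation. It assumes idealized exponential growth of the cumulative count at a constant rate; the symbols $N_0$ (initial count) and $r$ (daily growth rate) are introduced here only for illustration.

```latex
N_D(t) = N_0 e^{rt}
\;\Longrightarrow\;
N_{D7}(t) = N_D(t) - N_D(t-7) = \left(1 - e^{-7r}\right) N_D(t),
\qquad
\log N_{D7}(t) = \log N_D(t) + \log\left(1 - e^{-7r}\right).
```

Under this assumption, the weekly count plotted against the cumulative count on log-log axes is a line of slope 1 whose intercept depends only on $r$. When the growth rate falls, the weekly count drops below that line, which is the departure from the linear trajectory described above.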

In this section, we propose a new confinement index that summarizes the information made available by Google concerning the mobility of people into different places, like parks, shops, homes, etc.

One intriguing aspect to study about the coronavirus pandemic is mobility during the lockdown period. In the course of the COVID-19 pandemic, multiple governmental authorities enacted quarantines in an effort to control the rapid spread of the virus. In late March, around 2.6 billion people worldwide (around a third of the world's population) were under some form of lockdown. Undoubtedly, people had to adapt to new daily routines. In addition to the physical and psychological impacts of the quarantine [41] , the economy and other activities also suffered a big impact.

The Community Mobility Reports (CMR) released by Google aim to provide insights into what has changed in terms of mobility, in each country, in response to policies aimed at combating COVID-19. Using anonymous data provided by geographic location apps such as Google Maps, the company has assembled a routinely updated dataset that shows how people's movements have changed throughout the pandemic. These reports track movement trends over time in several sectors, such as grocery stores, parks, leisure, public transport stations, workplaces, and residential. The data measure the following six mobility changes compared to their baseline value (in percentage):

1. Retail and Recreation ($m_1$): Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.
2. Grocery and Pharmacy ($m_2$): Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies.

For this analysis, we consider several European countries from the 15th of February until the 9th of May. Because the data from Google are arranged on a daily basis, the study consists of 85 days in total (including the 15th of February and the 9th of May). We chose the 9th of May as the last day of the analysis because we want to study mobility during the lockdown period. In mid-May, some of the countries were no longer under a mandatory quarantine, and some governments even encouraged the population to go out and buy local products. Since not all European countries had available data, we could only use 37 of them in the analysis.

To take these six mobility variables into consideration, we defined a Confinement Index (CI) that summarizes them in a single variable explaining about 85% of the information they contain. The CI was built by applying Principal Component Analysis (PCA) to a matrix $M$, shown below, consisting of the average values of each variable for each country. According to both Pearson's criterion and Cattell's criterion, one principal component was enough to summarize the confinement data, with 85% of the variance explained. Let $m_i(t, A)$ denote mobility variable $i$ ($i = 1, \dots, 6$) observed in country $A$ ($A = 1, \dots, 37$) on day $t$ ($t = 1, \dots, 85$), and let $\bar{m}_i(A) = \frac{1}{85} \sum_{t=1}^{85} m_i(t, A)$ be its average over the study period. Thus, $M$ is a $37 \times 6$ matrix given by:

$$M = \begin{pmatrix} \bar{m}_1(1) & \cdots & \bar{m}_6(1) \\ \vdots & \ddots & \vdots \\ \bar{m}_1(37) & \cdots & \bar{m}_6(37) \end{pmatrix}$$

After applying Principal Component Analysis to $M$, we took the coefficients of the first eigenvector obtained and multiplied them by the mobility variables. This was done because the first principal component is often a kind of index, that is, a weighted average of the various variables. The same happened here. In reality, one of the signs was negative because one of the mobility variables (time spent at home) increased, contrary to all the others. Because the first principal component explains a large percentage of the variance, it was not necessary to consider the remaining ones.

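Hence, our index for a country $A$ on a given day $t$ is defined as the weighted sum

$$CI(t, A) = \sum_{i=1}^{6} a_i \, m_i(t, A), \qquad (1)$$

where $a_1, \dots, a_6$ are the coefficients of the first eigenvector, $A = 1, \dots, 37$ and $t = 1, \dots, 85$.

[Table 1 fragment: Hierarchical with DTW distance; M11 Bayesian Hierarchical [43]]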

A high Confinement Index value indicates that the country followed a more restrictive quarantine, i.e., had less mobility. The new mobility metric is not time-invariant, because it changes with time, as do the six "Google variables" that enter its definition. However, the coefficients that define the new mobility metric (Eq. (1)) should be an interpretative measure that does not change every day, although the value of the index does. For instance, by looking at the previous equation, we can see the relative importance of each of the six "Google" variables in that period. Nevertheless, for a different period, we would have to apply the principal component analysis again, as explained above, and would get a different confinement index.
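The construction of the index can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the mobility values are synthetic, the variable order places the residential variable last, PCA is carried out on the covariance matrix (the text does not say whether covariance or correlation was used), and the eigenvector sign is oriented so that a higher index means more staying at home.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Google data: m[A, t, i] is the % change from
# baseline of mobility variable i in country A on day t (values are made up).
n_countries, n_days, n_vars = 37, 85, 6
strictness = rng.uniform(0.2, 1.0, size=(n_countries, 1, 1))
direction = np.array([-60.0, -30.0, -40.0, -55.0, -45.0, 20.0])  # last: residential
m = strictness * direction + rng.normal(0.0, 3.0, size=(n_countries, n_days, n_vars))

# M: 37 x 6 matrix of per-country averages over the study period.
M = m.mean(axis=1)

# PCA via eigen-decomposition of the covariance matrix of M (assumption:
# covariance rather than correlation).
cov = np.cov(M, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
a = eigvecs[:, -1]                       # first principal direction
explained = eigvals[-1] / eigvals.sum()  # share of variance explained

# Eigenvectors are defined up to sign; orient so that the residential
# coefficient is positive (assumed convention: higher index, less mobility).
if a[-1] < 0:
    a = -a

# Confinement index CI[A, t] = sum_i a_i * m_i(t, A).
CI = m @ a
```

Each row of `CI` is one country's confinement trajectory over the 85 days, which is the kind of longitudinal data that the clustering methods are later applied to.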

In the next sections, we will compare the European countries described above as to their confinement index by the use of longitudinal clustering.

Over time, several studies have addressed the clustering of longitudinal data, for example using K-means with Euclidean [17] and other distances [27], using mixture models [25, 42], using hierarchical methods in a Bayesian setting [43], and novel approaches such as the one in [44], where the authors automatically partition a longitudinal dataset into clusters and find an average shape trajectory (or representative curve) for each cluster. Very recently, in [45], the authors formulated a semi-parametric joint model that includes random effects for centres as well as subjects, because although methods for accounting for the association between observation frequency and outcome are available, they do not currently account for clustering within centres. However, there is still no consensus on how to choose the number of clusters, K, as input, especially in the longitudinal setting.

There are essentially two types of indices that can help us in the process of validating a clustering result: internal indices and external indices. In this section, we will analyse a certain number of internal validity indices in order to test which index is best to find the optimal number of clusters, so we can optimize the use of the clustering algorithms. Because there is also uncertainty on the efficacy of the methods, we also compare some of the existing methods using external validity indices. First, let us start by describing the clustering methods that we address in this work, which are displayed in Table 1 .

Internal validity indices aim to discover whether the structure obtained by clustering is intrinsically appropriate for the data. Thus, they can be used to determine the number of clusters needed. To do so, clustering methods are executed with different values of K, followed by a comparison of the obtained index values. The authors of [48] concluded that even though the Gamma and C indices yielded good results, the Calinski-Harabasz index outperformed all of the others. Motivated by this study, Shim et al. [49] compared 16 different indices for non-hierarchical methods and also concluded that the Calinski-Harabasz index was the best among them. This latter study also indicated that the Ray-Turi and Davies-Bouldin indices yield interesting outcomes. More recently, Arbelaitz et al. [50] tested 30 indices and found Silhouette, Davies-Bouldin, and Calinski-Harabasz to be the best ones on their artificial datasets. However, all of these findings were obtained on non-longitudinal artificial data. Thus, in Garcia [51] we present a comparison of 10 different internal indices applied specifically to longitudinal data. These indices are displayed in Table 2, where the term "goal" indicates whether the corresponding index is to be maximized or minimized; e.g., the higher the PBM index the better, whereas for the C index the lowest value is desired.
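As an illustration of how an internal index selects K, the sketch below computes the Calinski-Harabasz index in its standard form, the between-cluster dispersion divided by (K - 1) over the within-cluster dispersion divided by (n - K), with each trajectory treated as a vector. The toy trajectories and the candidate partitions are hypothetical; in practice the partitions would come from running a clustering method once per value of K.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: [B / (K - 1)] / [W / (n - K)]; higher is better."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        centre = members.mean(axis=0)
        # Between-cluster dispersion: size-weighted distance of centre to overall mean.
        between += members.shape[0] * np.sum((centre - overall_mean) ** 2)
        # Within-cluster dispersion: squared distances of members to their centre.
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Six toy trajectories of length 5: three at a low level, three at a high level.
t = np.arange(5)
levels = np.array([0.0, 0.5, 1.0, 10.0, 10.5, 11.0])
X = levels[:, None] + 0.1 * t[None, :]

# Hypothetical partitions, one per candidate number of clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
    4: [0, 0, 1, 2, 2, 3],
}
scores = {k: calinski_harabasz(X, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
```

Here the partition into two clusters obtains the highest score, so `best_k` recovers the number of groups used to build the toy data.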

In [52] the authors empirically evaluate characteristics of a representative set of internal clustering validation indices on many datasets. They concluded that no single index dominates in every context, and that some indices are better suited to particular kinds of data. They found that the WG index would be worth studying more in the future. In addition, PBM was an index with a high and stable performance, along with CH for K-means. Based on their results, DB and RT are not recommended.

After running any type of clustering method, some kind of evaluation is needed to determine whether the method is effective. Knowing that a clustering method is better at identifying the right number of clusters does not by itself indicate better outcomes. A method can correctly guess the number of clusters and still be less accurate than a method that failed to identify the real number of clusters. Identifying the right number of clusters is not the only aim.

External validity indices can be used to compare the true partition with the partition obtained from the clustering and therefore to draw conclusions about the effectiveness of the method. In this segment we make use of 10 external indices, displayed in Table 3, to evaluate the clustering algorithms. Unlike the internal indices, every one of the external indices is to be maximized.
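Two widely used external indices, the Rand index and the Hubert and Arabie adjusted Rand index, can be computed from pair counts as in the sketch below. The function name and the example label vectors are hypothetical, and the sketch does not handle the degenerate case in which both partitions put all items in a single cluster.

```python
from collections import Counter
from math import comb


def rand_indices(true_labels, pred_labels):
    """Return (Rand index, Hubert-Arabie adjusted Rand index) for two partitions."""
    n = len(true_labels)
    pairs = comb(n, 2)
    contingency = Counter(zip(true_labels, pred_labels))
    # Number of item pairs placed together in both, in the true, in the predicted partition.
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    # Rand index: fraction of pairs on which the two partitions agree.
    rand = (pairs + 2 * same_both - same_true - same_pred) / pairs
    # Adjusted Rand index: agreement corrected for its expected value under chance.
    expected = same_true * same_pred / pairs
    max_index = (same_true + same_pred) / 2
    ari = (same_both - expected) / (max_index - expected)
    return rand, ari


truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same partition, different cluster names
imperfect = [0, 0, 1, 1, 1, 1]    # one item assigned to the wrong group

rand_same, ari_same = rand_indices(truth, relabelled)
rand_imp, ari_imp = rand_indices(truth, imperfect)
```

Both indices depend only on which items are grouped together, so renaming the clusters leaves them unchanged, while a misassigned item lowers the adjusted index more sharply than the plain Rand index.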

Regarding the comparison of clustering methods, there have been some studies over time [71, 72], but there is still no agreement on which method to use. Recently, Rodriguez [73] tested several clustering methods, including K-means, Hierarchical, and other probabilistic methods, while Den Teuling et al. [74] investigated the performance of five longitudinal clustering methods. In [75], the authors used clustering indices such as Rand, Jaccard, Fowlkes and Mallows, the Morey and Agresti Adjusted Rand Index (ARI_MA) and the Hubert and Arabie Adjusted Rand Index (ARI_HA). Because the Hubert and Arabie Adjusted Rand Index has been judged in the literature to be a good measure of cluster validity, these authors developed a tool based on it for cluster quality validation in high-throughput analysis.

After running all of the clustering methods on the 12 artificial datasets presented in Fig. 4, while varying the number of clusters from 2 to 6, the Calinski-Harabasz index was the best index, with a total of 59 correct guesses, followed by the Gamma index with 54 (see Table 4). For this reason, we used the Calinski-Harabasz index in the later applications.

For the evaluation of the methods displayed in Table 1, we [51] used the same 12 artificial datasets as above (presented in Fig. 4), while resorting to various indices. For the tests, all of the clustering algorithms were executed with the real number of clusters. To summarize all of the experiments comparing the 10 external indices on the 12 simulated datasets for each of the 11 clustering methods, we present, for each clustering method and each dataset, the mean of all 10 indices, given that the external indices all aim to maximize their value and take similar values. The results, which are in Table 5, show that K-means with Euclidean distance (M1) was the best method overall, followed by Hierarchical with Euclidean distance (M9), and by M2, M8 and M10. K-means with Mahalanobis distance, Profiles I and II (M4 and M5), were the worst methods. The results also seem to indicate that, for these datasets, the non-parametric methods achieved better results than the model-based ones. It should be noted that the K-means algorithm for longitudinal data [17] was executed with its predefined default value of 20 for the number of times that K-means is run (with different starting conditions) for each number of clusters.

This segment focuses on the application of clustering to two different types of data associated with the new coronavirus. One is based on the Community Mobility Reports published by Google, while the other addresses the number of deaths over time. For the first dataset, before the application of clustering, we define a confinement index that takes into account several variables of mobility changes (see above). Regarding the data on deaths, we employ an additional variable and therefore resort to multivariate clustering. The data report the number of deaths caused by the new coronavirus between February and June. Since most of the COVID-19 statistics are public, these data can be downloaded from several websites.

In both cases, we use the Calinski-Harabasz internal index to indicate the number of clusters, as it was shown above to outperform the other indices.

We start by observing the trajectories over time of the new Confinement Index (CI), defined above, for 37 European countries. What stands out immediately from Fig. 5 is that most countries started their confinement in mid-March, although Italy started to confine earlier. It is also clear from the figure that a reasonable number of northern European countries have very low levels of confinement. Sweden obviously does, but others, for instance Denmark, have similar or worse levels of confinement; surprisingly, the numbers relating to the epidemic in Denmark are not at all as bad as in Sweden. That might be an interesting aspect to investigate. Here, we concentrate on applying clustering to these data in order to find groups of countries with similar behavior regarding their confinement levels.

Regarding clustering, after running the methods using the Calinski-Harabasz internal index, we found that all of them, except K-means with the Mahalanobis distance (M3, M4, M5), suggested two clusters. The algorithms M3, M4 and M5, which suggested 5, 6 and 4 groups respectively, use a pre-defined correlation matrix and have different profiles to take into account the behavior of the trajectory. They work by grouping together trajectories with similar behavior (increasing and decreasing at the same time even if taking very different values) rather than trajectories having similar values but behaving very differently.

Despite having different outcomes, the clusters found by K-means with Euclidean and Fréchet distances are quite similar, as are their mean trajectories. Fig. 6 displays the clusters obtained with K-means using the traditional Euclidean distance. Note that each color represents a different cluster, whose mean trajectory is drawn as a thicker line. The mixture models using Gaussian and multivariate t-distributions did not converge for these data, which points to some software limitations.

Taking into account that the majority of the methods suggest two clusters, one might deduce the existence of two tendencies of mobility in the period of 15th February to 9th of May. Most of the countries in the analysis (around 65%) were relatively restrictive during the quarantine while all of the others (around 35%) did not follow such measures. 

As explained above, there are many misconceptions regarding the comparison between different countries in terms of the numbers related to COVID-19 infection. We will therefore use two variables (total number of casualties and number of deaths in the past week) for each day, proposing a new approach that uses these three-dimensional trajectories for the comparison. We thus have to resort to the clustering of multivariate trajectories. This application was possible using the kml3d package [26], which clusters several variable-trajectories jointly using K-means with Euclidean distance. The package also provides tools to visualize interactive 3D dynamic graphs in an R session that can be attached to PDF pages. However, that feature does not work well with all PDF readers. Thus, we display the 2D plot as well as a 3D plot obtained from clustering in a non-interactive way, i.e., separately.
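The idea of clustering several variable-trajectories jointly with K-means and Euclidean distance can be sketched as below. This is not the kml3d implementation: the function `kmeans_joint`, the farthest-point initialisation, the per-variable standardisation and the logistic toy curves are all assumptions made for illustration.

```python
import numpy as np


def kmeans_joint(trajs, k, n_iter=50):
    """Lloyd's k-means on trajectories of shape (n, T, V) with Euclidean distance."""
    n = trajs.shape[0]
    X = trajs.reshape(n, -1)  # stack all time points and variables into one vector
    # Deterministic farthest-point initialisation (an illustrative choice).
    centres = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Reshape the centres back into mean trajectories of shape (k, T, V).
    return labels, centres.reshape(k, *trajs.shape[1:])


# Toy data: cumulative deaths per million follow logistic curves (made up).
t = np.arange(30)
shape = 1.0 / (1.0 + np.exp(-(t - 15) / 3.0))
finals = np.array([450.0, 500.0, 550.0, 30.0, 40.0, 50.0, 60.0, 70.0])
total = finals[:, None] * shape[None, :]
weekly = total - np.concatenate([np.zeros((len(finals), 7)), total[:, :-7]], axis=1)
trajs = np.stack([total, weekly], axis=2)  # (countries, days, variables)

# Standardise each variable so that neither dominates the distance (assumption).
scaled = (trajs - trajs.mean(axis=(0, 1))) / trajs.std(axis=(0, 1))
labels, centres = kmeans_joint(scaled, k=2)
```

With this toy data the three high-mortality curves end up in one cluster and the five low-mortality curves in the other, and each row of `centres` is the mean trajectory of a cluster in both variables.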

Because we are comparing countries that have big differences in population size, we decided to divide the values by their respective number of inhabitants. Although incorporating the population sizes might be problematic, as explained above, that will not be an issue because we are dealing with deaths instead of infections and we are also aiming to visualize the number of deaths in the past week against the total amount of deaths. These two variables do not incorporate time, explicitly, in any of the axes and because time is such an important element for this comparison, we end up using the three-dimensional trajectories of those two variables over time.

Moreover, the virus did not cause the first death on the same day in every country. Still, we have made all of the trajectories start at the same moment, i.e., at the first death in the respective country. In this manner, it is more appropriate to compare the development of the trajectories. Hence, the trajectory lengths vary and depend on the day of the first death.
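The alignment and the division by population can be sketched as follows. The function name, the two series and the population figures are hypothetical and only illustrate why aligned trajectories end up with different lengths.

```python
def align_from_first_death(cumulative_deaths, population):
    """Drop days before the first death and express deaths per million inhabitants."""
    start = next((t for t, d in enumerate(cumulative_deaths) if d > 0), None)
    if start is None:
        return []  # no deaths recorded in the observed period
    scale = 1_000_000 / population
    return [d * scale for d in cumulative_deaths[start:]]


# Two hypothetical countries observed over the same six calendar days.
country_a = align_from_first_death([0, 0, 1, 3, 6, 10], population=10_000_000)
country_b = align_from_first_death([0, 0, 0, 0, 2, 5], population=5_000_000)
```

Both series now start at day 1 since the first death, but the first country contributes four points and the second only two, because its first death occurred later in the observation window.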

It is worth noting that the data are constantly being amended. For instance, Spain announced on the 25th of May a different way of collecting data, counting a death based on when it happened instead of when authorities were notified about it. As a result, the country's death toll dropped slightly.

Running the only multivariate clustering method (kml3d), we discovered 2 clusters in the data. The resulting plots can be seen in the following figures. Again, the color of each trajectory identifies its group, while the thick lines indicate the mean trajectory of each group.

A small group (17.8%) was discovered by the multivariate longitudinal clustering with a higher number of fatalities containing the following countries: Belgium, France, Ireland, Italy, Netherlands, Spain, Sweden, United Kingdom (blue trajectories). The bigger group (82.2%) consists of the rest of the countries and has relatively lower numbers of deaths (red trajectories).

The number of deaths over time is illustrated in the bottom image of Fig. 7. The mean trajectory of the blue cluster grows rapidly at first and then reaches a relatively constant state over time. In the red cluster, the opposite is seen.

Looking closely at each group obtained by clustering the Google Community Mobility Reports data, one can observe that Northern countries like Denmark, Sweden, Norway, and Finland, and some Central European nations like Germany and the Netherlands, belong to the smaller cluster and therefore seem to have had less restrictive measures during the quarantine, in contrast to countries like Portugal, Spain, Italy, the UK, Turkey, or France.

In addition, countries like Denmark and Sweden obtained their lowest confinement index, much lower than the baseline values, in the week of the 18th to the 25th of April, intriguingly already in full pandemic (see Fig. 5 ). This behavior might be explained by the weather during that week: in that period, cities like Stockholm and Copenhagen recorded the maximum temperatures of the whole month. Part of the previous analysis was published in the Portuguese newspaper Público [76] and can also be read on their website.

In the second clustering application, looking at the deaths in the past week in Fig. 7 as well as in Fig. 8 , it is notable that all of the trajectories in the smaller cluster (the one with the higher number of deaths) have already started to show significantly slower growth in deaths. The mean trajectory of that cluster in Fig. 7 suggests that the growth of deaths started to decrease around the 40th day after the first death. The same mean trajectory in Fig. 8 shows that this decrease began at around 370 deaths per million inhabitants.
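One possible way to locate such a turning point numerically is to take the day on which the mean weekly-death trajectory of a cluster reaches its maximum. The sketch below assumes this operationalisation, whereas the values reported above were read from the figures; the function names and toy numbers are hypothetical.

```python
def mean_trajectory(trajectories):
    """Day-by-day mean over the trajectories still observed on that day."""
    longest = max(len(t) for t in trajectories)
    means = []
    for day in range(longest):
        observed = [t[day] for t in trajectories if len(t) > day]
        means.append(sum(observed) / len(observed))
    return means


def peak_day(values):
    """Index of the first maximum (day 0 is the day of the first death)."""
    return max(range(len(values)), key=lambda day: values[day])


# Toy weekly-death trajectories of unequal length for one cluster.
weekly = [
    [1, 4, 9, 7, 3],
    [2, 6, 11, 9],
]
mean_weekly = mean_trajectory(weekly)
turning_point = peak_day(mean_weekly)
```

Averaging only over the trajectories observed on a given day is one simple way to handle unequal lengths; other choices would shift the estimate for the last days.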

In this work, we presented a brief review of some clustering methodologies used in longitudinal data analysis. This study was carried out to determine the best option between nonparametric and model-based methods and to identify which particular method is most reliable. Besides that, given the ongoing debate on the selection of the number of clusters, we also intended to identify the index that performs best at this task, particularly when longitudinal data are used.

Resorting to various clustering validity indices and to artificial longitudinal data, we found that non-parametric methods, in addition to being theoretically less complex, had better results and fewer software limitations. K-means and Hierarchical clustering, both using Euclidean distance, were the methods that yielded the best results, which might explain why they are so popular. Moreover, mixture models with the Cholesky decomposition (M6 and M7) revealed some software limitations, as they did not converge in the CMR application. Regarding the choice of the number of clusters, in the tests on the artificial longitudinal dataset the index that most frequently suggested the correct number was the Calinski-Harabasz index. Clustering solutions are not unique and strongly depend on the investigator's choices. However, with the use of clustering validity indices, those choices might be improved.
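For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom, so that larger values indicate a better-separated partition. The following is a minimal sketch of the standard definition, assuming each trajectory has been flattened into a numeric vector; the toy points are illustrative only.

```python
def calinski_harabasz(vectors, labels):
    """(B / (k - 1)) / (W / (n - k)) for a partition with k clusters of n points."""
    n = len(vectors)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n")

    def centroid(points):
        return [sum(col) / len(points) for col in zip(*points)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    overall = centroid(vectors)
    between = 0.0  # B: size-weighted squared distances of centers to the overall mean
    within = 0.0   # W: squared distances of points to their own cluster center
    for c in clusters:
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(v, center) for v in members)
    if within == 0:
        return float("inf")  # perfectly compact clusters
    return (between / (k - 1)) / (within / (n - k))


points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups the two nearby pairs
bad = calinski_harabasz(points, [0, 1, 0, 1])   # mixes the distant points
```

When choosing the number of clusters, the partition with the highest index value among the candidates is retained.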

From the analysis of the Community Mobility Reports dataset, we discovered that, between the 15th of February and the 9th of May, countries like Turkey, France, Spain, and Portugal had higher levels of confinement compared with countries like Sweden or Denmark. The study also suggests that two tendencies (groups) possibly existed in that period: one major tendency comprising countries with high confinement and a smaller one comprising countries with less restrictive measures. Looking closely at each group, it is evident that the Northern European countries and some Central European nations like Germany and the Netherlands, which belong to the smaller cluster, had more mobility during the lockdown.

In the second dataset, applying multivariate clustering to the fatalities caused by COVID-19 between February and June, we discovered two groups in the data. More explicitly, taking population sizes into account, we found a small group with a higher number of deaths containing 17.8% of the countries. This group includes Belgium, France, Ireland, Italy, the Netherlands, Spain, Sweden, and the UK. However, its mean trajectory also revealed that the growth of deaths in that group started to drop significantly at around the 40th day after the first death. Moreover, the multivariate clustering enabled us to plot the deaths in the past week against the total number of deaths, which shows that this drop began at around 370 deaths per million inhabitants. The second cluster consists of the remaining 82.2% of the countries and exhibits trajectories with a relatively small and steady growth of deaths over time. Even so, this analysis assumes that the available data can be trusted. As said before, there is a lot of complexity and misconception regarding the COVID-19 data.

The content of this article opens several interesting avenues for future research. The analysis conducted in Section 2 concerning the Community Mobility Reports released by Google could be combined with the data on confirmed COVID-19 deaths studied in Section 3.2.2. For instance, a correlation study between these two datasets could be interesting. Additionally, concerning the analysis of the COVID-19 data, the numbers of infections and tests were left out, and they could contribute crucially to the analysis. With the constant amendments to the coronavirus numbers, the data are continuously being updated and future analysis will be needed.

Furthermore, alternative statistical procedures can be applied separately, or combined with clustering, to provide additional knowledge in the analysis of longitudinal data, and in particular COVID-19 longitudinal data. Ultimately, some metrics introduced in this work for the analysis of COVID-19 data can be studied in a statistically rigorous manner, which alone can give rise to a new work.

Authors declare that they have no conflict of interest.

World health organization cancer priorities in developing countries

World health organization (WHO) and international society of hypertension (ISH) risk prediction charts: assessment of cardiovascular risk for prevention and control of cardiovascular disease in low and middle-income countries

Obesity and diabetes in the developing world-A growing challenge

Global rise in human infectious disease outbreaks

Analysis of cross section, time series and panel data with stata 15

Panel data analysis-advantages and challenges

Multiyear climate variability and dengue -El Niño southern oscillation, weather, and dengue incidence in Puerto Rico, Mexico, and Thailand: a longitudinal data analysis

The effect of small business managers' growth motivation on firm growth: a longitudinal study

A longitudinal case study of profitability reporting in a bank

Observational research methods

Longitudinal Data Analysis

Toward understanding and exploiting tumor heterogeneity

Heterogeneity of autoimmune diseases: pathophysiologic insights from genetics and implications for new therapies

Integrating multiple social statuses in health disparities research: the case of lung cancer

Subtyping: what it is and its role in precision medicine

Data Mining: Concepts and Techniques

KmL: a package to cluster longitudinal data

Longitudinal Cluster Analysis with Applications to Growth Trajectories

Clustering of Longitudinal Trajectories Using Correlation Based Distances

Estimation of extended mixed models using latent classes and latent processes: the R package lcmm

Profile clustering in clinical trials with longitudinal and functional data methods

Clustering of longitudinal data by using an extended baseline: a new method for treatment efficacy clustering in longitudinal data

Cluster analysis of longitudinal profiles with subgroups

Proteo-transcriptomic dynamics of cellular response to HIV-1 infection

Clustering gene expression time course data using mixtures of multivariate t -distributions

Kml and kml3d : R packages to cluster longitudinal data

KmlShape: an efficient method to cluster longitudinal data (time-series) according to their shapes

Wiley Series in Probability and Statistics

Evaluation and comparison of gene clustering methods in microarray analysis

Pathological findings of COVID-19 associated with acute respiratory distress syndrome

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study

The reproductive number of COVID-19 is higher compared to SARS coronavirus

Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts

Early dynamics of transmission and control of COVID-19: a mathematical modelling study

Identification of COVID-19 can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine

The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study

County-level longitudinal clustering of COVID-19 mortality to incidence ratio in the United States

Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China

Nota sobre a Utilização Incorreta de Conceitos Estatísticos19

Factos para compreender a epidemia da COVID-19. O que têm de específico as doenças infecciosas? Público (2020)

The outbreak of COVID-19 coronavirus and its impact on global mental health

On model-based clustering, classification, and discriminant analysis model-based approaches

R / BHC : fast Bayesian hierarchical clustering for microarray data

Learning the clustering of longitudinal shape data sets into a mixture of independent or branching trajectories

Clustered longitudinal data subject to irregular observation

Clustering time series gene expression data with TMixClust

A dendrite method for cluster analysis

An examination of procedures for determining the number of clusters in a data set

A comparison study of cluster validity indices using a nonhierarchical clustering algorithm

An extensive comparative study of cluster validity indices

Clustering of Longitudinal Data: Application to COVID-19 Data

Comparison of internal clustering validation indices for prototype-based clustering

Isodata: A Novel Method of Data Analysis and Pattern Classification

Measuring the power of hierarchical cluster analysis

Quadratic assignment as a general data analysis strategy

A cluster separation measure

Silhouettes : a graphical aid to the interpretation and validation of cluster analysis

Model-based Gaussian and non-Gaussian clustering

Determination of number of clusters in K -means clustering and application in colour image segmentation

A collaborative approach to combine multiple learning methods

Nouvelles recherches sur la distribution florale

Die pflanzenassoziationen der pieninen

Measures of the amount of ecologic association between species

The utilization of multiple measurements in problems of biological classification

A computer program for classifying plants

Principles of Numerical Taxonomy, W.H. Freeman & Company

Objective criteria for the evaluation of clustering methods

A method for comparing two hierarchical clusterings

Comparing partitions

Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms

Comparisons among clustering techniques for electricity customer classification

Clustering algorithms: a comparative approach

A comparison of methods for clustering longitudinal data with slowly changing trends

Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes