key: cord-0983429-u9mqg2rf
authors: Milano, Marianna; Cannataro, Mario
title: Statistical and Network-Based Analysis of Italian COVID-19 Data: Communities Detection and Temporal Evolution
date: 2020-06-12
journal: Int J Environ Res Public Health
DOI: 10.3390/ijerph17124182
sha: ac6e8cb623609a60c72a6404f2a43ac799709a4e
doc_id: 983429
cord_uid: u9mqg2rf

The coronavirus disease (COVID-19) outbreak started in Wuhan, China, and it has rapidly spread across the world. Italy is one of the European countries most affected by COVID-19, and it has registered high COVID-19 death rates and the death toll. In this article, we analyzed different Italian COVID-19 data at the regional level for the period 24 February to 29 March 2020. The analysis pipeline includes the following steps. After individuating groups of similar or dissimilar regions with respect to the ten types of available COVID-19 data using statistical test, we built several similarity matrices. Then, we mapped those similarity matrices into networks where nodes represent Italian regions and edges represent similarity relationships (edge length is inversely proportional to similarity). Then, network-based analysis was performed mainly discovering communities of regions that show similar behavior. In particular, network-based analysis was performed by running several community detection algorithms on those networks and by underlying communities of regions that show similar behavior. The network-based analysis of Italian COVID-19 data is able to elegantly show how regions form communities, i.e., how they join and leave them, along time and how community consistency changes along time and with respect to the different available data.

Coronavirus disease, known as COVID-19, emerged in the city of Wuhan, in China, in November 2019 [1] .

The disease is caused by the novel coronavirus Sars-CoV-2 [2] and its clinical manifestations include fever, cough, fatigue, chest distress, diarrhea, nausea, vomiting [3] and also acute respiratory distress syndrome in severe cases [4] .

COVID-19 is characterized by a long incubation period, high infectivity, and different transmission methods [5] . The contagion happens mainly through respiratory and blood contact with the coronavirus.

In a few months, COVID-19 epidemic quickly spread to Asian countries and it reached more than 200 countries in the world, causing tens of thousands of deaths.

On 11 March 2020, COVID-19 disease was declared a pandemic by the World Health Organization. In Italy, COVID-19 was identified in January 2020 [6] and the outbreak started in Lombardi and Veneto at the end of February 2020. From the northern regions of Italy, the disease spread very quickly to the nearest regions and then to the rest ones. Italy was considered one of the main epicenters of the pandemic, with 97,689 infections and 10,799 deaths up to 29th of March. The aim of this study is to provide a graph-based representation of daily data provided by Italian Civil Protection that enables evaluation of which regions show similar behavior and discovery of communities. The data refers to the period 24 February to 29 March 2020. To do this, we designed an analysis pipeline to model Italian COVID-19 data as networks and to perform network-based analysis. At first, for each type of data, we evaluated the similarity among a pair of regions by using statistical tests, and accordingly, we built ten similarity matrices (one for each Italian COVID-19 datum). After that, we mapped the similarity matrices into networks where the nodes represent the Italian region, and the edges connect statistically similar regions. Finally, we evaluated how the networks evolved over the weeks by analyzing the networks at different time points: (i) over the period 24 February to 29 March 2020 (study period); and (ii) in single weeks. Then, network-based analysis was performed mainly to discover communities of regions that show similar behavior. The main contribution of the paper is a network-based representation of COVID-19 diffusion similarity among regions and graph-based visualization to underline similar diffusion regions.

The rest of the paper is organized as follows: Section 1 presents the pipeline to analyze Italian COVID-19 data, Section 2 presents the application of our methodology on Italian COVID-19 data, and Section 3 discusses the results. Finally, Section 4 concludes the paper.

We designed an analysis pipeline with the goal of investigating similarity among Italian regions with respect to data provided by Italian Civil Protection and to identify clusters of regions with similar behavior.

The analysis pipeline includes the following steps:

1.

Building of a similarity matrix. The first step consists of the building of a similarity matrix that records the similarity among a pair of regions with respect to an Italian COVID-19 data measure. The similarity is computed by applying a statistical test. We decided to use the Wilcoxon Sum Rank Test. Therefore, the (i, j) value of the matrix for data k (e.g., swab data) represents the p-value of the Wilcoxon statistical test obtained by performing the test on the swab measures of region i with respect to region j. Lower p-value means that regions are more dissimilar with respect to that measure. Higher p-value means that regions are more similar with respect to that measure. We used the usual significance threshold of 0.05, thus matrices report only p-vales ≥ 0.05, while p-values < 0.05 are mapped to zero.

Mapping similarity matrices to networks. The second step consists of the building networks starting from the similarity matrices. We map each matrix M(i, j) to a network N, where nodes represent the Italian regions and an edge connects two regions (i, j) if the p-value in the similarity matrix is greater than the significance threshold of 0.05. edges are weighted with the p-value.

Temporal analysis of networks. The third step consists of the building of the network at different time intervals. Assuming that the analyzed data presents a temporal evolution, for each one, the corresponding networks at different time points (i.e., at the end of week 1,2, . . . , 5) and for an study period are built. 4.

Community detection. The fourth step consists of the extraction of community on the network by applying an appropriate community detection algorithm. For each network, we extracted subgroups of regions that form a community based on similarity of point of view. The identification of community is performed on the networks related to the study period and for all single week. Then, we extract the communities at different time points, i.e., at the end of the first week, after three weeks, and after five weeks (the study period).

We applied the designed pipeline to analyze the data at different temporal zoom levels e.g., by analyzing the period from 24 February to 29 March 2020 and by focusing on single weeks as well as the entire observation. For convenience, in the rest of paper we refer to the period 24 February to 29 March 2020 as the study period.

The present analysis was carried out on the dataset of COVID-19 updated at the https://github. com/pcm-dpc/COVID-19 database, provided by Italian Civil Protection. The dataset consists of: Swabs, the numbers of test swab carried on positive subjects and on subjects with suspected positivity.

The data is daily provided for each Italian region. The data occupies 47.6 Mbytes of memory.

To build similarity matrices for Italian COVID-19 data, we performed a set of statistical analyses. All analyses are performed by using R software [7] . At first, we computed the main descriptive statistics for all regions in the study period, reported in Table 1 . Figure 1 conveys the evolution of all datasets over days. As a preliminary test, we applied Pearson's chi-square test. The p-value was less than 0.05 for each distribution data, i.e., data was not normally distributed. According to this, we performed the paired comparison and multiple comparison of data by using two non-parametric tests: Wilcoxon Sum Rank test and Kruskal-Wallis test.

As an initial step, we used the Wilcoxon Sum Rank test to carry out an analysis within the same type of data for all weeks and then, for each single week. The Wilcoxon test is a non-parametric test designed to evaluate the difference between two treatments or conditions where the samples are correlated. We applied the Wilcoxon test to perform a pair-wise comparison among regions with the goal of highlighting statistically similar distributions among them. For this reason, we built a similarity matrix for each couple of regions, for each of the available COVID-19 data. Figure 2 reports the heat map of similarity value related to Hospitalized with Symptoms network for all regions in the study period. We reported the heat maps for Italian COVID-19 data in the study period in Appendix A, for the lack of space. In addition, we report the Tables of the similarity values computed for Italian COVID-19 data in the study period and in the single weeks in Appendix A. Results show that according to the type of data, a significant difference exists (p-value less than 0.05) among some regions while for others, it is possible to highlight statistically similar distributions. Also, the significance varies by performing the analysis on whole selected time interval and on single week.

After that, we used the Kruskal-Wallis test performing an analysis on the same type of data for all regions (i.e., carrying out multiple comparisons) for the study period and then, for each single week. The Kruskal-Wallis test is a non-parametric method for analysis of variance used to determine if more samples originate from the same distribution. The results confirmed a significant difference considering all regions on the same type of data for the study period for every single week.

Furthermore, we performed multiple linear regression by considering nine indicators: Hospitalized with Symptoms, Intensive Care, Home Isolation, Total Currently Positive, Discharged/Healed, Deceased, Total Cases, Swabs and two geographic factors: population density and number of intensive care beds for regions. We perform a standardization of variables as a preprocessing step. Data related to population density and intensive care beds for regions are reported in Table 2 . According to multiple linear regression, we built nine models for each piece of Italian COVID-19 data in order to evaluate an outcome of each indicator on the basis of multiple distinct predictor variables i.e., Population Density and Intensive Care Beds. Table 3 reports the p-values associated with the Population Density and Intensive Care Beds and the Multiple R-squared. It is possible to notice that the intensive care beds variable is significantly related to Hospitalized with Symptoms, Intensive Care Home Isolation, Total Currently Positive, Discharged/Healed, Deceased, Total Cases variables with Multiple R-squared greater than 0.5. Instead, population density is significantly related to the Swabs variable. In this case, Multiple R-squared is equal to 0.318 (i.e., 32% of the data is explained by the explanatory variable). These results demonstrate that the population density does not influence Hospitalized with Symptoms, Intensive Care Home Isolation, Total Currently Positive, Discharged/Healed, Deceased, Total Cases variables which can be affected by other factors such as smog or climate as reported in [8, 9] . Abruzzo  121  73  Basilicata  56  49  Bolzano  71  48  Calabria  128  107  Campania  424  350  Emilia  199  650  Friuli  153  494  Lazio  341  540  Liguria  286  75  Lombardi  422  1067  Marche  162  400  Molise  69  30  Piemonte  172  320  Puglia  206  320  Sardegna  68  150  Sicilia  194  441  Toscana  162  447  Trento  79  75  Umbria  104  69  ValleAosta  39  15  Veneto  267  498  Table 3 

To evaluate the evolution of Italian COVID-19 data and evidence which regions show similar behavior, we built networks of each piece of data [10] starting from the result of Wilcoxon test. The nodes of the networks are the Italian regions and the edges link two regions (nodes) with similar trend according to significance level (p-value > 0.05) obtained from the Wilcoxon test, otherwise (p-value < 0.05) there is not connection among nodes. The network analysis is performed using the igraph libraries [11] .

At first, we built ten networks, one for each data (Hospitalized with Symptoms, Intensive Care data, Total Hospitalized, Home Isolation, Total Currently Positive, New Currently Positive, Discharged/Healed, Deceased, Total Cases, Swab) by considering the period 24 February to 29 March. Then, we built the same networks by considering single weeks. The ten networks for the study period and for five weeks are reported in 

Starting from the ten networks related to all five weeks, we wanted to identify which regions form a community from the similarity point of view. For this, we applied the Walktrap community-finding algorithm [12] that identifies densely connected subgraphs, i.e., communities, in a graph via random walks. The idea is that short random walks tend to stay in the same community.

The extracted communities from all Italian COVID-19 networks in the study period are reported in Figure 13 . Figure 3 shows the evolution of Hospitalized with Symptoms network over five weeks. Figure 3a reports the network that represents the behavior of regions respect to number of hospitalized patients up to 29 March, whereas Figure 3b-f represent the networks on each single week. It is possible to notice that the network structure changes according to the analyzed time interval. At the end of the 35th day, the network has all nodes connected with exception of a single node that represents the Basilicata region. Furthermore, it is possible to highlight different community structures consisting of groups of regions with similar trends. Figure 13a reports five communities: the first one consists of Basilicata; the second one is composed by Piemonte, Marche, Emilia, Lombardi, Veneto; the third one is composed by Liguria, Lazio, Toscana; the fourth one is formed by Campania, Puglia, Sicilia, Abruzzo, Valle d'Aosta, Friuli, Trento, Bolzano; the last one is composed by Umbria, Sardegna, Calabria, Molise. Figure 4 represents the evolution of the Intensive Care network. It is possible to notice that Lombardi and Veneto, which are the regions most affected by coronavirus disease with the highest numbers hospitalized in Intensive Care Units, are disconnected in the first week. In the second week, Lombardi and Veneto are connected by an edge that represents a level of similarity, whereas in the fifth week Veneto is linked with Emilia and Lombardi and this group of regions becomes a disconnected component among other regions. Thus, while initially Lombardi and Veneto showed a similar trend of Intensive Care data, after Veneto moved far from Lombardi trend. Furthermore, by analyzing the communities in the Intensive Care network (Figure 13b ), it is possible to demonstrate four subgraphs formed by (i) Lombardi and Veneto, (ii) Umbria and Lazio, (iii) Marche, Emilia, Piemonte, Toscana and (iv) a large module formed by Campania, Sicilia, Sardegna, Abruzzo, Umbria, Calabria, Basilicata, Bolzano, Valle d'Aosta, Friuli, Trento, Molise. Figure 5 shows the Total Hospitalized network evolution. Starting form the first week, the structure of the network has two disconnected nodes representing Lombardi and Veneto, whereas, in the second week, the network evolves by presenting a single disconnected node that represent Emilia, and two connected nodes, Lombardi and Veneto, which in turn are disconnected from dense subgraph. In the third and fourth weeks, the network structure presents all connected components, and a high number of nodes are disconnected in the fifth week. Finally, all regions are connected in the final network. By analyzing the communities detected in Total Hospitalized network (Figure 13c ), it is possible to notice a similarity with respect to those extracted by Hospitalized with Symptoms networks. In fact, there is a correspondence among three communities: (i) Basilicata and Piemonte, (ii) Marche, Emilia, Lombardi, (iii) Veneto and Liguria, Lazio, Toscana. This means that those regions that form the three communities present the same behavior according to the number of the hospitalized patients with symptoms and the total number of hospitalized patients. Figure 6 shows the evolution of the Home Isolation network over five weeks. In the first week, Lombardi, Veneto and Emilia each shows a different trend of the number of subjects in isolation at home, both between them and compared to other regions. Next, in the network formed by considering the all weeks, Lombardi, Veneto, Emilia and Marche formed a subgraph disconnected by a different dense subgraph composed by the rest of the regions. However, Veneto represents a single community in Figure 13d . This means that the behavior of Veneto presents a low similarity with respect to Lombardi, Emilia and Marche despite forming a module. It is possible to notice that Lombardi, Veneto, Emilia form a community in both Total Currently Positive network and New Currently Positive network, Piemonte represents a single community in the Total Currently Positive network, while Piemonte is associated with Marche and Toscana in the New Currently Positive network. Figure 9 represents the Discharged/Healed network over five weeks. In the first week, the network structure presents all nodes connected except for Veneto. In the second week, the Discharged/Healed network is formed by three subgraphs; in the third and fourth weeks the network is very dense; in the fifth week, the network structure is characterized by different disconnected components, and finally, at the end of 35 days the network is composed by a subgraph composed by Lombardi and Veneto and another subgraph highly connected. This means that Lombardi and Veneto have a similar behavior that is different from the rest of the Italian regions. Also, Lombardi and Veneto represent one of the five communities extracted by Discharged/Healed network. The extracted communities are reported in Figure 13g . Figure 10 shows the evolution of Deceased network. The evolution of this network is different from other Italian COVID-19 networks. In fact, in the first week all nodes are disconnected, so all Italian regions present different trends. In the second week, it is possible to notice that Emilia and Marche nodes are disconnected and there is a subgraph composed by Lombardi and Veneto and then there is a large subgraph formed by other regions. In the third week, all nodes present connections. In the fourth and fifth week the Deceased network presents different disconnected components; then, the final network shows a single disconnected node that represents Basilicata. Also, the Basilicata represents a single community of Deceased network; see Figure 13h . The other extracted communities are: (i) Piemonte, Toscana Liguria, Lazio, Friuli, Puglia, Valle d'Aosta, (ii) Lombardi, Veneto, Emilia, Marche, (iii) Sicilia, Molise, Abruzzo, Umbria, Campania, Trento, Bolzano, Calabria, Sardegna. Figure 11 represents the Total Cases network over five weeks. The final network demonstrates that the Italian regions present a significant level of similarity respect to the number of total coronavirus cases because all nodes are connected. Figure 13i shows the communities identified in Total Cases. The first community is composed by Lombardi, Veneto, Emilia, Marche; the second community is composed by Piemonte; the third community is composed by Basilicata, Molise; the fourth community is composed by Toscana Liguria, Lazio, Campania, Friuli, Sicilia Puglia, Valle d'Aosta formed, whereas Abruzzo, Umbria, Trento, Bolzano, Calabria, Sardegna formed the fifth community. Figure 12 shows the evolution of the Swab Network that represents the number of performed swab tests. The network, in the first week, shows Lombardi and Veneto nodes disconnected by other regions. In fact, these ones are the Italian regions that initially performed high number of test swabs. Also, the Veneto region has no connections in the final network and this reflects the policy of Veneto to carry out swab tests on asymptomatic subjects, i.e., it is an outlier with respect to other regions. Figure 13j shows the extracted communities in the Swab network. The first community is composed by Veneto; the second community is composed by Lombardi, Emilia; the third community is composed by Basilicata, Molise; the fourth community is formed by Marche, Toscana, Lazio, Piemonte, Friuli, Valle d'Aosta; the fifth community is formed by Sicilia, Campania, Liguria, Puglia; while Abruzzo, Umbria, Trento, Bolzano, Calabria, Sardegna formed the sixth community.

We want to evaluate: (1) if different data present similar or dissimilar communities and (2) if the communities are similar or dissimilar considering different temporal interval on the same data. The Figures 14-23 report the evolution of the communities related to different data. At the end of three weeks, Veneto, after representing a community in the previous week, moves in another community, whereas Emilia leaves the community with Marche and becomes a single community. Also, some regions migrate from fifth and sixth communities to other communities. Therefore, Figure 14b reports the five extracted communities after three weeks: (i) Lombardi, (ii) Emilia, (iii)Veneto, Marche, Piemonte, Liguria, Toscana, Lazio, (iv) Trento, Bolzano, Abruzzo, Friuli, Sicilia, Puglia, (v) Campania, Umbria, Sardegna, Calabria, Molise, Valle d'Aosta, Basilicata. Finally, Figure 14c reports five communities in the study period: the first one consists of Basilicata, which leaves the previous community and becomes a single one; the second one is composed by Piemonte, Marche, Emilia, Lombardi, Veneto; the third one is composed by Liguria, Lazio, Toscana; the fourth one is formed by Campania, Puglia, Sicilia, Abruzzo, Valle d'Aosta, Friuli, Trento, Bolzano; the last one is composed by Umbria, Sardegna, Calabria, Molise. Figure 15 reports the evolution of Intensive Care Network Communities. It is possible to notice that in the first week (Figure 15a ), there is: one large community formed by Umbria, Lazio, Piemonte, Toscana Campania, Sicilia, Sardegna, Abruzzo, Umbria, Calabria, Basilicata, Bolzano, Valle d'Aosta, Friuli, Trento, Molise; two single communities formed by Lombardi and Emilia; and a small community formed by Emilia and Marche.

After three weeks the number of extracted communities increases; see Figure 15b . In fact, Lombardi, Sardegna, Valle d'Aosta represent three single communities, Veneto and Emilia form a community, as well as, Marche and Piemonte. Then, Liguria, Lazio and Toscana form a six community, and the last two are composed by (i) Umbria, Campania, Molise, Abruzzo, Friuli, Trento, Puglia and (ii) Calabria, Sicilia, Bolzano, Basilicata.

Finally, in the study period, five communities are mined (see Figure 15c ), formed by (i) Lombardi and Veneto, (ii) Liguria and Lazio, (iii) Marche, Emilia, Piemonte, Toscana and (iv) a large module formed by Campania, Sicilia, Sardegna, Abruzzo, Umbria, Calabria, Basilicata, Bolzano, Valle d'Aosta, Friuli, Trento, Molise. Figure 16 reports the evolution of Total Hospitalized Network Communities. Figure 16a shows the six mined communities. The first community is composed by Liguria, Toscana, Piemonte, the second one is formed Sardegna, Umbria, Calabria, Basilicata, Valle d'Aosta, Friuli, Trento, Molise; the third module comprises Lazio, Campania, Sicilia, Abruzzo, Bolzano, Puglia; the fourth community is represented by Marche, Emilia; the fifth is represented by Lombardi and the last one consists of Veneto.

After three weeks (see Figure 16b ) the regions move, with the exception of Lombardi, which continues to represent a single community and the communities become eight. In fact, Emilia becomes a single community; Veneto becomes a community among with Piemonte and Marche; Toscana moves in the community with Liguria; Lazio and Campania forms a new community, as well as, Basilicata and Valle d'Aosta; another community is formed by: Abruzzo, Puglia, Sicilia and the last one is formed by Friuli, Bolzano, Trento, Umbria, Sardegna, Calabria, Molise.

At the end of the study period, the five communities reported in Figure 16b are formed. The first one consists of Basilicata that leaves the previous community and becomes a single one; the second one is composed by Toscana, Liguria and Lazio; Sardegna, Calabria, Molise leaves the previous large community and form a smaller one; by the fourth one is formed by Piemonte, Marche, Emilia, Lombardi, Veneto; the last one is composed by Umbria, Puglia, Sicilia, Abruzzo, Valle d'Aosta, Friuli, Trento, Bolzano. Figure 17 reports the evolution of Home Isolation Network Communities. Figure 17a reports the mined communities at the end of first week. It is possible to notice that Lombardi, Veneto and Emilia form single communities, and then there are three large communities: the first one is represented by Piemonte, Liguria, Marche, Sicilia, Campania; the second one is composed by Puglia, Valle d'Aosta, Toscana, Umbria, Calabria; the third one Trento, Lazio, Abruzzo, Sardegna, Bolzano, Basilicata and Molise, Friuli, Campania.

At the end of three week, Lombardi, Veneto and Emilia move together to form a unique community, whereas, the other regions form new communities, such as (i) Puglia, Trento, Lazio, Umbria, (ii) Calabria, Abruzzo, Valle d'Aosta, Sardegna, Bolzano, Basilicata and Molise, (iii) Sicilia, Toscana, Friuli, Piemonte, Liguria, Marche, Campania (see Figure 17b ). Figure 17c shows the community topology in the study period. Veneto leaves the community among with Lombardi and Emilia and it becomes a single one, whereas, Lombardi, Emilia forms a new module among with Marche. Basilicata and Molise move together to form a unique community. Sardegna, Calabria, Abruzzo, Bolzano form a fourth community. The fifth community is composed by Puglia, Trento, Lazio, Umbria, Sicilia, and the sixth one is represented by Toscana, Piemonte, Valle d'Aosta, Friuli, Liguria, Campania. Figure 18 reports the evolution of Total Currently Positive Communities. At the end of first week, there are eight mined communities, and they are reported in Figure 18a . The first community is composed by Umbria, Sardegna, Basilicata, Molise, Friuli Toscana, Calabria, Valle d'Aosta, Trento; the second one is formed by Bolzano, Lazio, Abruzzo, Puglia; the third module comprises Campania, Sicilia Liguria; Piemonte and Lazio represent the fourth community; the fifth one consists of Marche; the sixth community is represented by Emilia; the seventh is represented by Lombardi and the last one consists of Veneto.

After three weeks, the number of communities (see Figure 18b ) decreases. In fact, it is possible to notice five subgraphs. Emilia joints with Veneto and Lombardi remains single community. Lazio, Sicilia, Friuli, Puglia form a new community; Piemonte, Toscana, Campania, Marche, Liguria, represent a fourth community; Trento, Abruzzo, Umbria, Calabria, Sardegna, Basilicata, Molise, Bolzano, Campania, Valle d'Aosta, form a fifth community;

In the study period, the number of extracted communities further decreases, see Figure 18c The first community is composed by Veneto, Lombardi, Emilia, Marche; the second community is composed by Piemonte; the third community is composed by Basilicata, Molise, Calabria, Sardegna; the fourth community is formed by Toscana, Lazio, Friuli, Valle d'Aosta, Sicilia, Campania, Liguria Puglia, Abruzzo, Umbria, Trento, Bolzano. Figure 19 reports the evolution of New Currently Positive Communities. Figure 19a reports the mined communities at the end of first week. It is possible to notice that Lombardi, Veneto and Emilia form single communities, and then there are two large communities: the first one is represented by Marche, Piemonte, Liguria, Campania, Abruzzo, Toscana; the second one is composed by Puglia, Valle d'Aosta, Umbria, Calabria, Sicilia, Campania, Trento, Lazio, Sardegna, Bolzano, Basilicata and Molise, Friuli.

After three weeks, there remain five extracted communities but the regions that form them vary; see Figure 19b . The first community is composed by Lombardi; the second community is composed by Veneto and Emilia; the third community is composed by Basilicata, Molise, Valle d'Aosta, Sardegna, Campania, Bolzano; the fourth community is formed by Marche, Toscana, Piemonte; the fifth community is formed by Sicilia, Liguria, Puglia, Abruzzo, Umbria, Trento" Calabria, Lazio, Friuli.

In the study period, the number of extracted communities further decreases, see Figure 19c After three weeks the number of extracted communities increases; see Figure 20b . In fact, Lombardi leaves the previous community and becomes a single one; Lazio, Emilia, Liguria and Veneto get together to form a second community; the third is composed by Friuli, Sicilia, Toscana; the last one is formed by Sardegna, Valle d'Aosta, Marche and Piemonte, Umbria, Campania, Molise, Abruzzo, Trento, Puglia, Calabria, Bolzano, Basilicata.

Finally, Figure 20c shows the communities in the study period. It is possible to notice five communities. The first one consists of Lombardi and Veneto; the second one is composed by Emilia, Liguria, Lazio; the third one is composed by Friuli, Campania, Toscana, Sicilia; the fourth one is formed by, Puglia, Abruzzo, Trento; the last one is composed by Umbria, Sardegna, Calabria, Piemonte, Marche, Valle d'Aosta, Bolzano, Basilicata, Molise. Figure 21 reports the evolution of Deceased Communities. Figure 21a reports the mined communities at the end of first week. It is possible to notice that there is: one large community formed by Emilia, Piemonte, Liguria, Campania, Abruzzo, Puglia, Valle d'Aosta, Umbria, Calabria, Sicilia, Campania, Trento, Lazio, Sardegna, Bolzano, Basilicata and Molise, Friuli; two single communities represented by Lombardi and Veneto; and a single community composed by Marche and Toscana.

After three weeks, the number of extracted communities increases. In fact, the regions that form them vary by forming new communities; see Figure 21b . The first community is composed by Lombardi that remains a single community; the second community is composed by Veneto and the third one is composed by Emilia; the fourth community is formed by Basilicata, Molise, Sardegna, Calabria; the fifth community is formed by Sicilia, Liguria, Puglia, Abruzzo, Umbria, Trento, Lazio, Friuli, Valle d'Aosta, Campania, Bolzano; Marche, Toscana, Piemonte.

In the study period, the number of extracted communities decreases (see Figure 21c ) and there are: (i) Basilicata represents a single community, (ii) Piemonte, Toscana Liguria, Lazio, Friuli, Puglia, Valle d'Aosta, (iii) Lombardi, Veneto, Emilia, Marche, (iv) Sicilia, Molise, Abruzzo, Umbria, Campania, Trento, Bolzano, Calabria, Sardegna. Figure 22 reports the evolution of Total Cases Network Communities. Figure 22a shows the five mined communities. The first community is composed by Liguria, Toscana Lazio, Piemonte, Campania, Sicilia; the second one is formed Sardegna, Abruzzo, Umbria, Calabria, Basilicata, Bolzano, Valle d'Aosta, Friuli, Trento, Molise, Puglia; the third module comprises Marche, Emilia; the forth one Lombardi and the last one Veneto.

After three weeks (see Figure 22b ) there are six communities. Veneto becomes a community among with Emilia; Marche moves in the community with Liguria, Toscana Lazio, Piemonte, Campania; and three new modules are formed: the first one is composed by Sicilia, Friuli and Puglia, the second one is composed by Abruzzo, Bolzano, Trento and Umbria, the third one is formed by Basilicata, Sardegna, Valle d'Aosta, Calabria, Molise.

At the end of the study period, there are five communities extracted. Figure 22c reports the communities. The first community is composed by Lombardi, Veneto, Emilia, Marche; the second community is composed by Piemonte; the third community is composed by Basilicata, Molise; the fourth community is composed by Toscana Liguria, Lazio, Campania, Friuli, Sicilia Puglia, Valle d'Aosta formed a community, whereas Abruzzo, Umbria, Trento, Bolzano, Calabria, Sardegna formed the fifth community. Figure 23 reports the evolution of the Swab Network Communities. At the end of first week, there are eight mined communities, and they are reported in Figure 23a . The first community is composed by Umbria, Calabria, Valle d'Aosta, Trento, Sicilia, Abruzzo, Puglia; the second one is formed by Sardegna, Basilicata, Molise, Bolzano; the third module comprises Friuli, Campania, Liguria; Piemonte represents the fourth community; the fifth one consists of Toscana and Lazio; the sixth community is represented by Emilia and Marche; the seventh is represented by Lombardi and the last one consists of Veneto.

After three weeks, the number of communities (see Figure 23b ) decreases. In fact, it is possible to notice six subgraphs. Emilia leaves the previous community and forms a single one. Lombardi and Veneto join together. Toscana and Lazio continue to form a community; Piemonte, Friuli, Campania, Marche, Puglia, Liguria, Sicilia form a fourth community; Trento, Abruzzo, Umbria, Calabria, form a fifth community and the last one is composed by Sardegna, Basilicata, Molise, Bolzano, Campania, Valle d'Aosta;

In the study period, there remain six extracted communities but the regions that form them vary; see Figure 23c . The first community is composed by Veneto; the second community is composed by Lombardi, Emilia; the third community is composed by Basilicata, Molise; the fourth community is formed by Marche, Toscana, Lazio, Piemonte, Friuli, Valle d'Aosta; the fifth community is formed by Sicilia, Campania, Liguria Puglia; while Abruzzo, Umbria, Trento, Bolzano, Calabria, Sardegna formed the sixth community.

By analyzing the results, it is possible to demonstrate that the topology of the communities varies, i.e., the regions join and leave them along time and the community consistency changes along time on the same data. For the communities related to the different available data, it is possible to notice that after the first week, the extracted communities are different. This changes, after analyzing the communities after three weeks. In fact, the Total Currently Positive Network Communities and New Currently Positive Network show similar communities as well as Deceased Network and Total Cases Network. Finally, after five weeks, the topology of communities is different for all Italian COVID-19 networks except for the Hospitalized with Symptoms Network and Total Hospitalized Network, which show similar extracted communities.

In the literature, there are different works that apply graph theory to analyze the COVID-19 pandemic spread. For example, Reich et al. [13] modeled the COVID-19 spread by using a SEIR (Susceptible-Exposed-Infectious-Recovered-Susceptible) agent-based model on a graph, which takes into account several important real-life attributes of COVID-19: super-spreaders, realistic epidemiological parameters of the disease, testing, and quarantine policies. The agent is represented as a node in a graph, and infection between contacts is represented by graph edges. Then, the authors have applied the SEIR model to analyze the disease progression. Herrmann et al. [14] modeled the human interaction according to three different networks, i.e., Scale-free, Mitigation Hub, and Mitigation Random, and they applied the SIS (Susceptible-Infected-Susceptible) model. The authors demonstrated that network topology could improve the predictive power of SIR model of COVID-19 by providing novel insights into the potential strategies and policies for mitigating and suppressing the spread of the virus.

Kuzdeuov et al. [15] implemented a network-based stochastic epidemic simulator that models the movement of a disease through the SEIR states of a population. The nodes of the networks represent an administrative unit of the country, such as a city or region, and the edges between nodes represent transit links of roads railways, and air travel routes to model the mobility of inhabitants among cities. In [16] , Kumar presents a network-based model for predicting the spread of COVID-19, incorporating human mobility through knowledge of migration and air transport.

The work of Wang et al. [17] applied statistical and network analysis on heterogeneous network containing patients and hospitals as nodes and relationships between relatives, friends or colleagues as edges. Network analysis provided important information about patients, hospitals and their relationships and it was able to provide a guidance for the distribution of epidemic prevention materials.

In summary, different works rely on network-based representation for the application of predictive models, whereas only Wang et al. [17] uses statistical and network-based analysis to evaluate an infected cluster of people in different hospitals. To the best of our knowledge, our work is the first study that provides a network-based representation and visualization of COVID-19 data at the regional level and applies network-based analysis to discover communities of regions that show similar behavior.

In conclusion, with this study, we wanted to give a graph-based representation of the COVID-19 measures considering how the regions behaved differently with respect to ten different datasets provided by Italian Civil Protection. It emerged that the regions where the epidemic had a greater impact, such as the Lombardi, Veneto, Piemonte and Emilia, had a different behavior with respect to other regions. This is evident in the community detection in which the regions most affected by the epidemic form individual communities or they are part of the same community. In addition, this study also led to identifying similar behaviors of regions that are geographically distant but that together form community. An example is represented by Calabria, Sardegna, and Molise that represent a cluster in Hospitalized with Symptoms Network, Total Hospitalized Network, Total Currently Positive Network, Discharged/Healed Network, Total Cases, Deceased Network, Intensive Care Network. This can lead the search for indicators that unite the regions such as factors, age structure, health care facilities, and socioeconomic status. Moreover, our visual representation of data can lead the search for indicators that are responsible for community formation i.e., factors common to regions such as age structure, health care facilities, and socioeconomic status. Furthermore, starting from the regions that form communities, it could be possible to plan common interventions such as the increase in intensive care units or the increase in swab tests.

The COVID-19 disease has spread worldwide in a matter of weeks. In Italy, the epidemic of COVID-19 started in the north and quickly involved all regions. In this paper, we evaluated the evolution of Italian COVID-19 data provided daily by Italian Civil Protection. The main goal of this work is the network-based representation of COVID-19 diffusion similarity among regions and graph-based visualization with the aim of underlining similar diffusion regions. We identified similar Italian regions with respect to the available COVID-19 data and we mapped these in different networks. Finally, we performed a network-based analysis to discover communities of regions that show similar behavior. For future work, we plan to extend the study by considering the evolution of the communities at greater time intervals to demonstrate a new pattern of regions with respect to COVID-19 data. 

The authors wish to thank Italian Civil Protection Department for freely providing online COVID-19 data.

The authors declare no conflict of interest.

In this section, we reported the results obtained by applying Wilcoxon Sum Rank test. First, we present Figure A1 that shows the heat map related to results obtained by applying Wilcoxon Sum Rank test in the study period (five weeks). Furthermore, we present different tables that report the results of Wilcoxon Sum Rank test related to Italian COVID-19 data in five temporal intervals: in the study period, in the first week, in the second week, in the third week, in the fourth week, in the fifth week. The tables report the results only with p values > 0.05. We mapped the results only with p values < 0.05 equal to 0.

Tables A1-A6, report the similarity values related to Hospitalized network in the study period, first, second, third, fourth, fifth weeks.

Tables A7-A12, report the similarity values related to Intensive Care network in the study period, first, second, third, fourth, fifth weeks.

Tables A13-A18, report the similarity values related to Total Hospitalized network in the study period, first, second, third, fourth, fifth weeks.

Tables A19-A24, report the similarity values related to Home Isolation network in the study period, first, second, third, fourth, fifth weeks.

Tables A25-A30, report the similarity values related to Total Currently Positive network in the study period, first, second, third, fourth, fifth weeks.

Tables A31-A36, report the similarity values related to New Currently Positive network in the study period, first, second, third, fourth, fifth weeks.

Tables A37-A42, report the similarity values related to Discharged/Healed network in the study period, first, second, third, fourth, fifth weeks.

Tables A43-A48, report the similarity values related to Deceased network in the study period, first, second, third, fourth, fifth weeks.

Tables A49-A54, report the similarity values related to Total Cases in the study period, first, second, third, fourth, fifth weeks.

Tables A55-A60, report the similarity values related to Swabs in the study period, first, second, third, fourth, fifth weeks. Table A1 . Similarity values of Hospitalized with Symptoms network in the study period. Table A2 . Similarity values of Hospitalized with Symptoms network in the first week. Table A4 . Similarity values of Hospitalized with Symptoms network in the third week. Table A6 . Similarity values of Hospitalized with Symptoms network in the fifth week. Table A8 . Similarity values of Intensive Care network in the first week. 14 0.14 0.14 0.14 0.14 0 0.14 1 1 0 0.12 0.14 0.66 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0 Liguria 0.14 0.14 0.14 0.14 0.14 0 0.14 1 1 0 0.12 0.14 0.66 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0 Table A12 . Similarity values of Intensive Care network in the fifth week. Table A16 . Similarity values of Total Hospitalized network in the third week. Table A17 . Similarity values of Total Hospitalized network in the fourth week. Table A18 . Similarity values of Total Hospitalized network in the fifth week. Table A19 . Similarity values of Home Isolation network in the study period. Table A20 . Similarity values of Home Isolation network in the first week. Table A21 . Similarity values of Home Isolation network in the second week. Table A22 . Similarity values of Home Isolation network in the third week. Table A24 . Similarity values of Home Isolation network in the fifth week. Table A26 . Similarity values of Total Currently Positive network in the first week. Table A28 . Similarity values of Total Currently Positive network in the third week. Table A30 . Similarity values of Total Currently Positive network in the fifth week. Table A32 . Similarity values of New Currently Positive network in the first week. Table A34 . Similarity values of New Currently Positive network in the third week. Table A36 . Similarity values of New Currently Positive network in the fifth week. 14 0.14 0.14 0.14 0.14 0.14 0.14 0.16 1 0.12 0.14 0.14 0.14 0.14 0.14 0.67 0.94 0.14 0.14 0.14 0 Table A42 . Similarity values of Discharged/Healed network in the fifth week. Table A48 . Similarity values of Deceased network in the fifth week. Table A54 . Similarity values of Total Cases network in the fifth week. 

Abruzzo 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.11 0 0.06 0 0 0 Basilicata 0 1 0 0 0 0 0 0 0 0 0 0.07 0 0 0 0 0 0 0 0 0 Bolzano 0 0 1 0 0 0 0.9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Calabria 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.14 0 0 0 0 0 0 Campania 0 0 0 0 1 0 0 0 0 0 0 0 0 0.32 0 0.14 0 0.08 0 0 0 Emilia 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Friuli 0 0 0.9 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Lazio 0 0 0 0 0 0 0 1 0.32 0 0.53 0 0 0 0 0 0.07 0 0 0 0 Liguria 0 0 0 0 0 0 0 0.32 1 0 0.62 0 0 0 0 0 0.26 0 0 0 0 Lombardi 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 Marche 0 0 0 0 0 0 0 0.53 0.62 0 1 0 0 0 0 0 0.16 0 0 0 0 Molise 0 0.07 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 Piemonte 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 Puglia 0 0 0 0 0.32 0 0 0 0 0 0 0 0 1 0 0.08 0 0.08 0 0 0 Sardegna 0 0 0 0.14 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 Sicilia 0.11 0 0 0 0.14 0 0 0 0 0 0 0 0 0.08 0 1 0 0.56 0 0 0 Toscana 0 0 0 0 0 0 0 0.07 0.26 0 0.16 0 0 0 0 0 1 0 0 0 0 Trento 0.06 0 0 0 0.08 0 0 0 0 0 0 0 0 0.08 0 0.56 0 1 0 0 0 Umbria 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ValleAosta 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 Veneto 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: Summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention

A novel coronavirus from patients with pneumonia in China

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: A descriptive study

A novel coronavirus outbreak of global health concern

Clinical features of patients infected with 2019 novel coronavirus in

Early phylogenetic estimate of the effective reproduction number of SARS-CoV-2

Computing Languages for

Analysis of the biological effect of city smog extract. III. Comparative investigations on the effect of city smog extracts on cell replication and DNA-synthesis of kidney cells

Spread of SARS-CoV-2 Coronavirus likely to be constrained by climate

Challenges and Opportunities for Visualization and Analysis of Graph-Modeled Medical Data

The igraph software package for complex network research

Computing communities in large networks using random walks

Modeling COVID-19 on a network: Super-spreaders, testing and containment

Using network science to propose strategies for effectively dealing with pandemics: The COVID-19 example

A Network-Based Stochastic Epidemic Simulator: Controlling COVID-19 with Region-Specific Policies

Modeling geographical spread of COVID-19 in India using network-based approach

Statistical and network analysis of 1212 COVID-19 patients in Henan

 Table A57 . Similarity values of Swabs network in the second week. Table A58 . Similarity values of Swabs network in the third week. Table A59 . Similarity values of Swabs network in the fourth week.