key: cord-0753812-78kpohyi authors: Buscema, Paolo Massimo; Torre, Francesca Della; Breda, Marco; Massini, Giulia; Grossi, Enzo title: COVID-19 in Italy and extreme data mining date: 2020-07-25 journal: Physica A DOI: 10.1016/j.physa.2020.124991 sha: 7b31908c9d3ae7e90734bfed9caed9ee16dbcfe2 doc_id: 753812 cord_uid: 78kpohyi
In this article we want to show the potential of an evolutionary algorithm called Topological Weighted Centroid (TWC). This algorithm can obtain new and relevant information from extremely limited and poor datasets. In a world dominated by the concept of big (fat?) data we want to show that it is possible, by necessity or choice, to work profitably even on small data. This peculiarity of the algorithm means that even in the early stages of an epidemic process, when the data are too few to provide sufficient statistics, it is possible to obtain important information. To prove our theory, we addressed one of the most central issues at the moment: the COVID-19 epidemic. In particular, the cases recorded in Italy were selected. Italy seems to have a central role in this epidemic because of the high number of measured infections. Through this innovative artificial intelligence algorithm, we have tried to analyze the evolution of the phenomenon and to predict its future steps using a dataset that contained only the geospatial coordinates (longitude and latitude) of the first recorded cases. Once the coordinates of the places where at least one case of contagion had been officially diagnosed up to February 26th, 2020 had been collected, research and analysis were carried out on: the outbreak point and related heat map (TWC alpha); the probability distribution of the contagion on February 26th (TWC beta); the possible spread of the phenomenon in the immediate future and then in the future of the future (TWC gamma and TWC theta); and how this passage occurred in terms of paths and mutual influence (Theta paths and Markov Machine). 
Finally, a heat map of the possible situation towards the end of the epidemic, in terms of the infectiousness of the areas, was drawn up. The analyses with TWC confirm the assumptions made at the beginning.
On January 9th, 2020, the World Health Organization (WHO) stated that Chinese health authorities had identified a new strain of coronavirus, never before identified in humans, named COVID-19. On January 30th, the Italian Superior Institute of Health (Istituto Superiore di Sanità) confirmed the first two cases of COVID-19 infection in Italy [1]. COVID-19 is having a tremendous impact in Italy from both a health and an economic point of view. According to the General Confederation of Italian Industry, better known as Confindustria, the impact of the coronavirus on the GDP of Italy could be more than -6.0% [2]. At the same time, specific data on the infected cases are withheld for privacy reasons. Consequently, without special permissions and authorizations it is not possible to get data about infected individuals, such as their precise address and other personal and clinical information. These days it seems like the watchword is "big data." Instead, we believe that an innovative and different strategy, extracting everything possible even from a limited dataset, is feasible. As a demonstration, we aim to analyze the (few) publicly available data about the Italian areas of COVID-19 infection in its early stages, in order to estimate the possible outbreak point and the areas and modes of future spread of the virus within Italy. The algorithm used is based on geographic profiling using a topological approach. One of the advantages of this algorithm, called Topological Weighted Centroid (TWC) [3]-[9], compared to standard methods for analyzing diffusion processes, is that it requires very simple data: the coordinates of the places where the events of the process took place, without any other kind of assumption. 
For this reason, it is particularly useful when data availability is poor. In this case, the data used correspond to all places (identified by longitude and latitude) in Italy where at least one case of COVID-19 was detected up to February 26th, 2020. The TWC has already been successfully applied several times to epidemic cases, in particular to Escherichia coli (Germany 2011), Chikungunya Fever (Italy 2007), Foot and Mouth Disease (UK 1968-1969). Although the epidemic has spread worldwide, we chose to focus on the atypical case of Italy for several reasons. On the one hand, the evolution of the epidemic in Italy has taken on a completely different character from the rest of the world, which leads us to think of a local type of propagation of the phenomenon. Moreover, we considered that movements within a single nation have a different structure from international movements. Therefore, we preferred not to mix together information that could be mutually discordant. As an additional motivation, the set of data available on Italy was really small. Only 24 cities were involved up to February 26th, 2020, so the dataset consisted of 24 rows and two columns, latitude and longitude. The model we use in this paper is based on very simple datasets. It considers only the geospatial coordinates (latitude and longitude) of the places where the events occurred. In this case, all the Italian provinces in which at least one case of contagion had occurred were taken into consideration. In fact, the only publicly known data are the cities involved in the COVID-19 infection. The data collected correspond to the episodes of contagion recorded up to February 26th, 2020 (23 provinces and 1 municipality). The information about the contagions was taken from the official Italian civil protection repository [10]. 
During the course of the epidemic, this repository kept a page constantly updated with information about new episodes of contagion. For each city we decided to ignore the frequency of cases and the date. Thus, for every city where COVID-19 is present with at least one case, we only have latitude and longitude. Table 1 shows the names of these 24 locations and the relevant geospatial coordinates. Figure 1 shows where they are located. We must emphasize that the latitude and longitude of each city give only very fuzzy information on the real location of each infected case; the coordinates of Lodi, for example, globally summarize 10 different municipalities spread out around Lodi.
Table 1. Longitude and latitude of the 24 locations (the first entry is Alassio (SV); the remaining rows and values are missing from this text).
We analyzed the small dataset (48 real numbers) using a geographic profiling approach [2] [12] [13] [14]. In particular, as mentioned above, we used the Topological Weighted Centroid (TWC), a special adaptive system with a solid theoretical and mathematical background [3], already successfully applied in various fields and published in numerous papers [4] [5] [6] [7] [8] [9]. The whole TWC theory can be summarized in five main points:
TWC(α) represents a spatial estimate of the point or area where the process under examination originated (outbreak);
TWC(β) represents the current likely distribution of the process under consideration;
TWC(γ) represents the likely future evolution of the TWC(β) distribution, considering the system's self-organizing properties;
TWC(θ) represents a further level of evolution over time, developed from TWC(γ) as the communication and interaction between the observed events stabilizes to become highly organized. TWC(θ) also provides a directed weighted graph in which a hypothesized flow of communication among the points (cities) is represented;
TWC(ι) provides a heat map representing the infectivity rate of the area. 
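The input described above (one longitude/latitude pair per affected place, with case counts and dates discarded) can be sketched as follows. This is only an illustration of the data reduction, not the authors' software: the `CaseRecord` layout, the function names and the sample records are hypothetical placeholders, and the coordinates are not the values of Table 1.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CaseRecord(NamedTuple):
    """One reported episode of contagion (hypothetical record layout)."""
    place: str
    longitude: float
    latitude: float
    cases: int
    date: str


def presence_only(records: Iterable[CaseRecord]) -> Dict[str, Tuple[float, float]]:
    """Keep one (longitude, latitude) pair per place; drop counts and dates."""
    coords: Dict[str, Tuple[float, float]] = {}
    for rec in records:
        # The first record seen for a place fixes its coordinates;
        # later records for the same place add nothing to the dataset.
        coords.setdefault(rec.place, (rec.longitude, rec.latitude))
    return coords


def flatten(coords: Dict[str, Tuple[float, float]]) -> List[float]:
    """Return the dataset as a flat list of real numbers (two per place)."""
    return [value for pair in coords.values() for value in pair]


# Placeholder records for illustration only, not the paper's data.
records = [
    CaseRecord("Place A", 9.0, 45.0, 100, "2020-02-21"),
    CaseRecord("Place A", 9.0, 45.0, 12, "2020-02-24"),
    CaseRecord("Place B", 13.0, 38.0, 1, "2020-02-25"),
]
dataset = presence_only(records)
numbers = flatten(dataset)
```

With 24 places, `flatten` would return 24 × 2 = 48 values, which matches the "48 real numbers" mentioned above. A place with 100 cases and a place with a single case contribute exactly one row each.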
TWC is able to extrapolate time from space, bringing to light fundamental information trapped in the coordinates of events. Similar to core drilling operations in geology, which allow the history of the terrain to be analyzed one layer at a time, the TWC enables the entire process to be analyzed one frame at a time. The main outputs of the different types of TWC are heat maps. For all the mathematical details of the theory, please refer to the cited literature and to the supplementary material.
The outbreak
Recall that TWC(α) represents a spatial estimate of the point or area where the process under examination originated (outbreak). Figure 2 shows this outbreak point. This area is outside the red zone indicated by the Italian authorities. In fact, the Alpha Point is located about 39 km south-east of Codogno, the alleged official outbreak. As seen before, TWC(β) represents the current likely probability distribution of the process under consideration (Figure 4, top left). The Beta Map shows the active area of the epidemic up to February 26th, 2020. According to this advanced algorithm, this area should be considered the real risk area to be monitored. TWC(γ) represents the likely future evolution of the TWC(β) distribution, considering the self-organizing properties of the system (Figure 4, top right). According to the TWC algorithm, the diffusion will contract in the north and expand slightly to the south, in the direction of Florence, and in the west of Emilia-Romagna. TWC(θ) represents a further level of evolution over time, developed from TWC(γ) as the communication and interaction between the observed events stabilizes to become highly organized (Figure 4, bottom left). The epidemic appears to stop. TWC(θ) can also provide a hypothesis on the diffusion path of the epidemic in Italy, according to the data processed (Figure 5). 
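The separation of about 39 km reported above between the Alpha Point and Codogno is a distance between two longitude/latitude pairs. A minimal sketch of one standard way to compute such a distance, the haversine great-circle formula on a spherical Earth, is given below. The text does not state which distance formula the authors used, so the choice of haversine, the function name and the Earth radius constant are assumptions, and the Alpha Point coordinates in the usage lines are a hypothetical placeholder because the paper's values are not given here.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates of Codogno (longitude, latitude).
codogno = (9.70, 45.16)
# Hypothetical placeholder, NOT the Alpha Point computed in the paper.
alpha_point = (10.0, 45.0)
separation_km = haversine_km(*codogno, *alpha_point)
```

The function takes longitude first, to match the column order of Table 1. At these latitudes and over a few tens of kilometres, the spherical approximation differs from an ellipsoidal calculation by well under one percent.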
Through the use of a Markov chain it is possible to generate a directed weighted graph indicating, for each city, the other city from which it was most likely infected (Figure 6). The graph in Figure 6 shows a possible network of influences, reciprocal or not, among the contagions that occurred in Italy. Although the largest number of cases occurred around the city of Lodi, for the system Cremona seems to play a more central role. Cremona and TWC alpha (the estimated outbreak, highlighted in red within the graph) mutually affect each other. The epidemic then spreads to other areas of northern Italy. The southernmost areas of Italy, Rome and then Palermo, seem to have been involved through the cases of Rimini and Ancona, which mutually infected each other. According to the hypothesis of the TWC iota (Figure 4, bottom right), towards its final stages the epidemic will tend to return to the areas from which it originated, but slightly further north. The TWC alpha map seems to have identified an outbreak zone rather close (about 40 km) to the currently estimated epicenter in the municipality of Codogno (Figure 2). The TWC theta elaborations also seem to suggest an area slightly further south-east, approaching Cremona. Moreover, it is not entirely certain that Codogno is actually the main outbreak of the epidemic, as the Italian epidemiologists themselves speak of the presence in Codogno of Case 1, while Case 0 has not yet been found. The result seems very interesting, especially considering that the system only takes into account the coordinates of the events, without knowing the frequency for each coordinate. It should be noted that, up to February 26th, the city of Lodi counted more than 100 cases while Palermo counted only 1. 
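To make the idea of a directed weighted graph derived from a Markov chain more concrete, the sketch below builds a row-stochastic transition matrix from pairwise distances and reads off, for each place, its most probable source. This is not the authors' Markov Machine, whose construction is given in the cited literature; the exponential distance kernel, the `scale_km` parameter, the function names and the planar sample coordinates are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # planar (x, y) position in kilometres (assumed)


def transition_matrix(points: Sequence[Point], scale_km: float = 50.0) -> List[List[float]]:
    """Row-stochastic matrix: P[i][j] is the weight of the edge 'j influenced i'."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are needed")
    matrix = []
    for i in range(n):
        # Closer places get larger weights; a place never influences itself.
        weights = [
            0.0 if i == j else math.exp(-math.dist(points[i], points[j]) / scale_km)
            for j in range(n)
        ]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("all weights underflowed; increase scale_km")
        matrix.append([w / total for w in weights])  # each row sums to 1
    return matrix


def most_likely_source(matrix: List[List[float]], names: Sequence[str]) -> Dict[str, str]:
    """For each place, the place with the highest incoming transition weight."""
    return {
        names[i]: names[max(range(len(row)), key=row.__getitem__)]
        for i, row in enumerate(matrix)
    }


# Placeholder layout for illustration only, not the paper's data.
names = ["A", "B", "C"]
points = [(0.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
P = transition_matrix(points)
sources = most_likely_source(P, names)
```

In this toy layout A and B name each other as the most likely source, which is the kind of reciprocal influence described above for Cremona and TWC alpha, while the distant place C attaches to its nearest neighbour B. Keeping every row of `P`, rather than only the maximum, gives the full weighted graph.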
The Beta map (Figure 4, top left), which indicates the probability distribution of contagion episodes at the time the data were collected, shows that the "hot" zone was, as expected, mainly located in northern Italy, extending towards the Emilia-Romagna region. The Gamma map (Figure 4, top right), which corresponds to the hypothesis of propagation in the immediate future, seems to indicate a certain stability of the epidemic phenomenon. The Theta map (Figure 4, bottom left) would instead seem to indicate a certain regression of the phenomenon which, in the "anterior" future, could return to concentrate in the Lodi area. This trend also seems to be confirmed by the Iota map (Figure 4, bottom right), which further reduces the diameter of the red zone. TWC iota shows some very interesting aspects. The areas with the most intense red color show a peak of infectivity in the northernmost area of Italy. Even considering only very few data, such as the latitude and longitude of cities where at least one case of COVID-19 was detected, it was possible to carry out analyses with interesting results. This shows that, even in the presence of extremely poor data, sufficiently powerful tools make it possible to carry out real data mining that actually yields new useful information. In this case, it was possible to obtain the coordinates and heat map of the area considered to be the outbreak of the epidemic. This point lay less than 40 km from Codogno, currently considered central to the spread of the virus in Italy. Through the TWC beta, gamma and theta, it was possible to build a prediction for the near future and the future of the future. What we do not know yet is the "delta time" between one prediction and another. For this reason, the monitoring activity continues over time. TWC iota shows that the maximum level of infectivity seems to correspond to northern Italy, without spreading too much to the rest of the Italian peninsula. 
Indeed, at the time of submission, i.e. April 2020, there was no massive expansion of contagions in the central-southern areas of Italy.
Italian+economic+outlook+2020_2021_Summary+and+main+conclusions_31032 0_Confindustria.pdf?MOD=AJPERES&CACHEID=ROOTWORKSPACE-01f1ad4b-d609-4728-b975-9e2d4a184e01-n4NCcZg
The Topological Weighted Centroid (TWC): A topological approach to the time-space structure of epidemic and pseudo-epidemic processes
A new algorithm for identifying possible epidemic sources with application to the German Escherichia coli outbreak
Locating the source of public health events using intelligent adaptive systems: 2011 United States listeriosis outbreak linked to whole cantaloupes
Unraveling the space grammar of terrorist attacks: A TWC approach
The Complex Dynamic Evolution of Cultural Vibrancy in the Region of Halland
Novel Applications of Spatial Mapping to Chemical or Biological Outbreaks
Analysis of the Ebola Outbreak in 2014 and 2018 in West Africa and Congo by Using Artificial Adaptive Systems
Environmental criminology
Patterns in crime
Geographic profiling
A new mathematical technique for geographic profiling
PMB: ideation of the mathematics of the algorithms used; software implementation; data management and processing; application of artificial adaptive systems; analysis of results; manuscript preparation.
FDT: data collection, data preparation, analysis of results, manuscript preparation.
MB: analysis of results, manuscript preparation.
GM: analysis of results, manuscript review.
EG: analysis of results, manuscript review.