title: Towards Using Graph Analytics for Tracking Covid-19
authors: El Mouden, Zakariyaa Ait; Taj, Rachida Moulay; Jakimi, Abdeslam; Hajar, Moha
date: 2020-12-31
journal: Procedia Computer Science
DOI: 10.1016/j.procs.2020.10.029

Graph analytics is now considered the state of the art in many applications of community detection. The combination of the graph's definition in mathematics and the graph as an abstract data structure in computer science is the key behind the success of graph-based approaches in machine learning. Several graph-based approaches have been developed, such as shortest path first (SPF) algorithms, subgraph extraction, social media analytics, transportation networks, bioinformatics algorithms, etc. While SPF algorithms are widely used in optimization problems, spectral clustering (SC) algorithms have overcome the limits of most state-of-the-art approaches in community detection. The purpose of this paper is to introduce a graph-based approach to community detection in country-level datasets of the novel coronavirus Covid-19. The motivation behind this work is to overcome the limitations of multiclass classification: as SC is an unsupervised clustering algorithm, there is no need to predefine the output clusters as a preprocessing step. Our proposed approach builds on a previous contribution on the automatic estimation of the number k of output clusters. Based on dynamic statistical data for more than 200 countries, each cluster is expected to group countries having similar behaviors of Covid-19 propagation.

In late December 2019, an increasing number of pneumonia cases was noticed in the city of Wuhan, China [1, 2]. Initially, those cases were attributed to unknown sources, but after one week the novel coronavirus was identified and temporarily named 2019-nCoV [3]. Coronavirus (CoV) is a large family of viruses discovered in the 1930s in mammals and birds; coronaviruses were later discovered in humans in the 1960s. A novel coronavirus (nCoV) is a new strain that has not been previously identified in humans. The novel coronavirus disease was named COVID-19 on February 11, 2020 [2, 4]; it is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). Two months later, the novel virus was characterized as a pandemic, as the statistics exceeded 100,000 cases and 4,000 deaths across 114 different countries according to the World Health Organization (WHO). The symptoms of the COVID-19 disease can be divided into two groups: i) systemic disorders such as fever, cough, fatigue, headache, hemoptysis, acute cardiac injury, hypoxemia, dyspnea, diarrhea and lymphopenia; ii) respiratory disorders such as rhinorrhea, sneezing, sore throat, pneumonia, ground-glass opacities, RNAaemia and acute respiratory distress syndrome [5, 6]. From infection to the first symptoms, the incubation period ranges from 2 to 14 days in most cases, but a maximal incubation period of 27 days was observed on February 22, 2020, and values of 19 and 24 days were noticed afterwards, which shows that the incubation period can vary widely among patients and underlines the seriousness of the coronavirus and its ability to spread.
While searching for vaccines, the majority of countries implemented quarantine and travel restrictions to limit the spread of this novel coronavirus [7, 8]. At the time of writing, the number of COVID-19 cases in the world had reached 6,149,738, including 370,497 deaths (6%) and 2,729,904 recovered cases (44.4%), according to the last update on Worldometers on May 31, 2020. After more than 5 months, China has moved from the top of the table of cases to 17th place, leaving first place to the USA with more than 1,816,122 cases (29.53% of total cases in the world) and 105,584 deaths (25% of total deaths in the world). Our contribution focuses on the use of machine learning algorithms to manipulate Covid-19 data. Existing approaches classify countries according to predefined classes using statistical data as a function of time, a very basic classification that requires defining the classes before running the algorithm. The high success of multiclass classification for Covid-19 medical images [9] does not make this approach applicable to other situations using other formats of data; the use of each approach will be discussed in the related works section. Our work is based on Spectral Clustering (SC), which is a clustering approach and not a classification one; the difference is that, in advance, we do not have any idea about the number or the structure of the output groups. It is the combination of the features that makes a set of countries exhibit similar behaviors and thus form a strong cluster with minimal links to countries from other clusters. SC is not a foreign tool in medicine: many previous works have linked SC to protein-to-protein interactions [10, 11] and medical imaging [12, 13], and we hope that our study will open the door to using SC in tracking epidemics.
The paper is structured as follows. After this introduction, Section 2 details the literature review, Section 3 briefly presents the field of graph analytics and situates our approach within it, Section 4 presents the different processes of our proposed approach, including a graphical abstract, and Section 5 concludes the paper and presents some perspectives for future work. Since the first appearance of the Covid-19 disease, many disciplines have joined the scientific community around Covid-19. Artificial Intelligence (AI), in turn, has contributed various works to support Covid-19 challenges such as modeling, simulation, prediction, social network analytics, Geographic Information Systems (GIS) for spatial segmentation and tracking, etc. In [14], the authors highlight the importance of GIS and big data against Covid-19 challenges, represented for GIS by spatial tracking of confirmed cases, prediction of the epidemic transmission from region to region, and spatial data visualization and segmentation. For big data, the problem is the format of the data collected from different sources, which produces heterogeneous datasets that cannot be processed with traditional techniques; to remedy this issue, the authors argued that it is indispensable for the different stakeholders, especially the government and academics, to discuss the formulation of those data. The study was presented at three scales: individual, group and regional. Another contribution in the same field is [15], where the authors studied the spatial variability of Covid-19 at the level of the continental United States, implementing Multiscale Geographically Weighted Regression for both global and local models. Medical image analysis can be considered the AI field with the highest number of contributions since the first appearance of the novel coronavirus.
In [16], the authors presented a comparative study of seven recent deep learning methods for Covid-19 detection and classification using chest X-ray images. Using the same chest X-ray images, the authors of [17] proposed their deep learning model CovidX-Net, based on Convolutional Neural Networks (CNN), to classify patients as either positive or negative cases; interpreting the results, the authors recommended the use of Dense Convolutional Network models (DenseNet, F1 = 0.91), which give better results in comparison with other models such as InceptionV3 (F1 = 0.67). Neural network models were not limited to chest X-ray image analysis; other works used those models to study the efficiency of quarantine systems in different countries, such as [18], where the authors analyzed the results of quarantine and isolation in China, Italy, South Korea and the USA. The authors concluded that any relaxing of quarantine measures before the estimated dates would lead to a higher spread of the Covid-19 disease in the USA. The study was elaborated on data available from the start of the epidemic to March 2020. On the other hand, spectral clustering has also proven its efficiency against big data challenges, with numerous applications in computer science such as community detection [19, 20], bioinformatics [21] and image processing [22, 23]. Recent works have combined spectral methods with deep learning models, as in [24], where the authors presented a deep clustering approach that clusters data using both neural networks and graph analytics. The image segmentation presented by the authors of [22] is a combination of SC and regularization, which gives good results for high-dimensional image segmentation in comparison with state-of-the-art algorithms.
Graph analytics is a mathematical field of study that regroups all the algorithms and approaches based on graphs, for example discovering meaningful patterns using the mathematical properties of graphs. One of the many powerful sides of graphs is that these data structures can be built from any type of structured, semi-structured or even heterogeneous data (see Fig. 1); graphs can also be imported and exported as objects using different formats. One of the main formats to describe graphs is GEXF (Graph Exchange XML Format); more details about building GEXF graph schemas from relational data can be found in our previous contribution [25]. How easy is it to break the graph by removing a few nodes or edges? How can we compare two or more graphs? These two questions summarize the problems of connectivity analytics. A graph is connected if it contains a path between u and v for each pair of nodes (u, v). In connectivity analytics we study the robustness of a graph, the disconnection of graphs based on their nodes or edges, and similarity analysis (Fig. 2). To compare two graphs, degree histograms are widely used to analyze the connectivity between the nodes. In Fig. 2, the two graphs are ε-neighborhood graphs with different threshold values, ε = 0.3 (left) and ε = 0.5 (right); the two graphs describe the same data and only the connectivity differs. A graph with a smaller ε contains more edges and higher degrees than one with a larger ε, which explains the difference between the two degree histograms. A centrality is a measure of the importance of a node or an edge based on its position in the graph. A centralization is measured for the entire graph or network as the variation in the centrality scores among the nodes or edges. We distinguish four types of centrality: i) Degree centrality, a measure that tells how close the graph is to being maximally connected, i.e. a clique.
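The ε-neighborhood construction and the degree-histogram comparison described above can be sketched as follows. This is a minimal illustration, not the paper's Java/R implementation; the function names and the synthetic points are ours, and we threshold a Gaussian similarity so that, as in the text, a smaller ε keeps more edges.

```python
import numpy as np

def epsilon_graph(points, eps, sigma=1.0):
    """Build an eps-neighborhood graph: connect i and j whenever their
    Gaussian similarity exceeds the threshold eps."""
    diff = points[:, None, :] - points[None, :, :]
    sim = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))
    adj = (sim > eps).astype(int)
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj

def degree_histogram(adj):
    """Count how many nodes have each degree (used to compare two graphs)."""
    return np.bincount(adj.sum(axis=1))

rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 2))
dense = epsilon_graph(pts, eps=0.3)   # smaller threshold -> more edges
sparse = epsilon_graph(pts, eps=0.5)  # larger threshold -> fewer edges
hist_dense, hist_sparse = degree_histogram(dense), degree_histogram(sparse)
```

Plotting `hist_dense` against `hist_sparse` reproduces the kind of comparison shown in Fig. 2: same data, different connectivity.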
ii) Group degree centrality is close to the first type of centrality, but we consider a set of nodes as a single node and measure how close that subset of nodes is to a clique. iii) Closeness centrality is the sum of the shortest paths from all other nodes to a single node; a low closeness centrality means that the node has short distances to the other nodes, which makes it a key player in the graph, receiving information sooner and influencing other nodes directly or indirectly. iv) Betweenness centrality is the ratio of pairwise shortest paths that flow through a node i to all the shortest paths in the graph; a high betweenness centrality means that node i acts as a bridge through which information between other nodes must pass. A community or a cluster is a dense subgraph in which the nodes are more connected to each other than to nodes outside the subgraph. From this definition, we can model community detection in a graph as a multi-objective optimization problem. Let G be a graph and c a connected subgraph of G. For each cluster c, we compute its intra-cluster density σ_int and its inter-cluster density σ_ext, where:

σ_int(c) = (number of edges internal to c) / (n_c (n_c − 1) / 2)
σ_ext(c) = (number of edges between c and the rest of G) / (n_c (n − n_c))

with n the number of nodes in G and n_c the number of nodes in the cluster c. A good community detection (also called graph clustering) optimizes for a minimal value of σ_ext and a maximal value of σ_int. Louvain community detection [26] and normalized spectral clustering [27, 28] are the most used community detection algorithms for graphs and complex networks. Our proposed approach consists of an SC-based community detection whose objective is an unsupervised grouping of countries having similar behaviors of Covid-19 spreading. The approach is meant to be applied dynamically to track the coronavirus behavior in the world. The approach is divided into four steps (see Fig. 3). "It's not who has the best algorithm that wins. It's who has the most data." Andrew Ng.
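The two densities above can be computed directly from an adjacency matrix. The sketch below uses the standard definitions for an undirected, unweighted graph (which we assume match the paper's lost equations); the example graph, two triangles joined by one bridge edge, is ours.

```python
import numpy as np

def cluster_densities(adj, members):
    """Return (sigma_int, sigma_ext) for the cluster given by `members`
    in an undirected, unweighted adjacency matrix."""
    n = adj.shape[0]
    inside = np.zeros(n, dtype=bool)
    inside[np.asarray(members)] = True
    n_c = int(inside.sum())
    internal = adj[np.ix_(inside, inside)].sum() / 2   # each edge counted twice
    external = adj[np.ix_(inside, ~inside)].sum()
    sigma_int = internal / (n_c * (n_c - 1) / 2)
    sigma_ext = external / (n_c * (n - n_c))
    return sigma_int, sigma_ext

# Two triangles joined by a single bridge edge: a natural 2-community graph.
adj = np.zeros((6, 6), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1
s_int, s_ext = cluster_densities(adj, [0, 1, 2])
```

For the triangle cluster {0, 1, 2}, σ_int = 1 (it is a clique) while σ_ext = 1/9, the profile of a good community.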
Data processing is a key process in every machine learning model; even with an efficient model, bad data collection, formatting or scaling can lead to very bad results. Data sources are no longer consistent: data are available and easily accessible from different devices, but each source provides a different format (text, CSV, multimedia, HTML, etc.), which produces heterogeneous collected data. With the rapid spread of the novel coronavirus, data related to this disease keep growing, which causes two major problems: the selection of trusted sources and data formatting. For this application, data were collected from the EU Open Data Portal, a trusted source where data can be collected and reused free of charge and without any copyright restrictions. This portal offers daily updated data in XLSX format, which can easily be converted to CSV (Comma-Separated Values) for further processing. The portal presents Covid-19 data for statistical studies, where the most important feature is time; presenting data according to time produces high redundancy for a lot of static data, such as the country name, its geographical code, the continent, the population value, etc. To remedy the redundancy problem, we proposed a preprocessing step of data formatting in which we applied object-oriented programming concepts: each country was converted to an object with a set of attributes and methods. For each country object, we are interested in collecting a set of features (Table 1); those features are then used to calculate another set of features (Table 2) in order to have more data to manipulate (see Table 1 and Table 2).
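The country-as-object formatting step might look like the following sketch. The paper's implementation is in Java; this Python dataclass only illustrates the idea, with field names modeled on the feature tables and all other details (the `add_day` helper, the sample values) being ours.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Country:
    """One object per country: static attributes are stored once, and the
    daily CSV rows are folded into vectors, removing the redundancy of the
    time-indexed portal data."""
    name: str
    geo_code: str
    continent: str
    population: int
    v_tests: List[int] = field(default_factory=list)
    v_cases: List[int] = field(default_factory=list)
    v_deaths: List[int] = field(default_factory=list)
    v_recovers: List[int] = field(default_factory=list)

    def add_day(self, tests: int, cases: int, deaths: int, recovers: int):
        """Append one day of statistics to the four daily vectors."""
        self.v_tests.append(tests)
        self.v_cases.append(cases)
        self.v_deaths.append(deaths)
        self.v_recovers.append(recovers)

    # Derived features in the spirit of Table 2.
    @property
    def sum_cases(self) -> int:
        return sum(self.v_cases)

    @property
    def sum_deaths(self) -> int:
        return sum(self.v_deaths)

    @property
    def contamination_days(self) -> int:
        return len(self.v_cases)

# Illustrative values only.
c = Country("Morocco", "MA", "Africa", 36_000_000)
c.add_day(1000, 50, 2, 30)
c.add_day(1200, 70, 1, 45)
```

The derived features (`sum_cases`, `contamination_days`, ...) are recomputed from the vectors on demand, so the object stays consistent as new days arrive.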
The vectors vTests, vCases, vDeaths and vRecovers store the daily values of Covid-19 tests, cases, deaths and recoveries respectively for each country, from the first appearance of the coronavirus in the country to the day the model is executed. This produces vectors of different sizes for different countries, but for each single country the four vectors have the same size. The size of the vectors is stored as an extra feature called ContaminationDays. The sums of the values of each vector are stored in the features sumTests, sumCases, sumDeaths and sumRecovers. For each day we also store the value of the Infection Fatality Rate (IFR), which produces a vector with the same size as the four previous vectors; we named this vector vIFR. In machine learning models, the use of features with different scales can never lead to a meaningful result: features with a higher range of values (the Population feature, for example) will always carry more weight than features with smaller scales (the IFR, for example). To remedy this problem, a feature scaling process is necessary, especially when we plan to use a similarity kernel in the next step of the approach. The use of a mean normalization produces a second CSV dataset, which is used for the rest of the process. After the feature scaling, we apply the Gaussian similarity kernel to measure the similarity between each pair of countries c_i and c_j based on the vectorization of the features defined in the first step of the process. The Gaussian similarity is defined as follows:

s(c_i, c_j) = exp(−||c_i − c_j||² / (2σ²))

with ||c_i − c_j|| the Euclidean distance between the corresponding vectors of the pair of countries and σ a manually chosen positive scaling parameter that controls the size of the neighborhood of the produced graph.
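The scaling-then-kernel step can be sketched as follows. We assume mean normalization in its usual form, (x − mean) / (max − min) per feature, and the Gaussian kernel as stated above; the toy feature matrix (one population-scale column, one IFR-scale column) is illustrative.

```python
import numpy as np

def mean_normalize(X):
    """Mean normalization per feature column: (x - mean) / (max - min)."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def gaussian_similarity(X, sigma=1.0):
    """Pairwise Gaussian kernel: s(ci, cj) = exp(-||ci - cj||^2 / (2 sigma^2))."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))

# Two features on wildly different scales, as in Population vs. IFR.
X = np.array([[1000.0, 0.01],
              [2000.0, 0.03],
              [1500.0, 0.02]])
Xs = mean_normalize(X)              # both columns now on comparable scales
S = gaussian_similarity(Xs, sigma=0.5)
```

Without the normalization step, the first column would dominate the Euclidean distances and the kernel would effectively ignore the IFR feature.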
The result of the similarity measurement is a square similarity matrix S ∈ R^(n×n), where n is the number of countries in the dataset and each element of the matrix is S_ij = s(c_i, c_j). This matrix is used to build our graph, where the set of vertices is the countries of our dataset and the edges are weighted with the similarities of S. For the spectral clustering process, we prefer the Absolute SC algorithm with an unsupervised choice of the number of clusters k. This version of SC is based on the use of the Absolute Laplacian matrix, defined as follows:

L_abs = D^(−1/2) W D^(−1/2)

where D is the diagonal degree matrix and W the weight matrix, both square matrices of size n × n. The Absolute SC can be defined by the following steps:
• Compute L_abs;
• Extract the k eigenvectors associated to the k largest eigenvalues of L_abs in absolute value;
• Store the extracted eigenvectors as the columns of a matrix U ∈ R^(n×k);
• Using k-means clustering, cluster the rows of U into k clusters.
The output is a set of clusters where each cluster groups countries with high similarities, meaning that those countries have experienced a similar spreading of the coronavirus. As the ContaminationDays feature gets higher values, our model produces better results in comparison with the first days of the Covid-19 disease; this is because the vectors gain more values day after day, giving the model more data for the similarity measurement. "A picture is worth a thousand words." Data visualization is a supplementary task, as the main objective of the model is to compute the clusters; but with data visualization, a visual graph can describe the output of our model better than the mathematical results or any textual description. A graph can be built from different sources and using different tools and programming languages.
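The four Absolute SC steps can be sketched in a few lines. This is our own minimal reading of the algorithm, assuming the absolute Laplacian is the normalized adjacency D^(−1/2) W D^(−1/2) and selecting eigenvalues by absolute value as the steps describe; the tiny k-means (with deterministic farthest-point initialization) and the toy weight matrix of two weakly linked blocks are ours.

```python
import numpy as np

def absolute_spectral_clustering(W, k, iters=50):
    """Sketch of Absolute SC: build L_abs = D^{-1/2} W D^{-1/2}, keep the k
    eigenvectors with eigenvalues largest in absolute value, then run a
    small k-means on the rows of U."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_abs = d_inv_sqrt @ W @ d_inv_sqrt
    vals, vecs = np.linalg.eigh(L_abs)
    order = np.argsort(-np.abs(vals))        # largest |eigenvalue| first
    U = vecs[:, order[:k]]
    # Minimal k-means with farthest-point initialization (deterministic).
    centers = U[[0]]
    for _ in range(k - 1):
        dist = ((U[:, None] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, U[np.argmax(dist)]])
    for _ in range(iters):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Two dense triangles joined by one weak edge: two clear communities.
W = np.array([
    [0.00, 1.00, 1.00, 0.05, 0.00, 0.00],
    [1.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [1.00, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00, 0.00, 1.00, 1.00],
    [0.00, 0.00, 0.00, 1.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 1.00, 1.00, 0.00],
])
labels = absolute_spectral_clustering(W, k=2)
```

On this weight matrix the two triangles land in different clusters, which is the behavior expected of the algorithm on well-separated communities.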
In R, a graph can be built from an adjacency matrix of type matrix or dgCMatrix (the latter from the package Matrix); graph functions are available in the package igraph and plotting functions in the package gplots, and Pajek [29] network files are also an available option to create graphs in R. In Gephi [30], a graph can be constructed from a pair of CSV files, the first for the nodes and the second for the edges, or from a GEXF file, which is an XML-based document describing a graph. Graph databases are also an important tool for data visualization. In addition to visualization, graph-based systems provide querying languages to interact with graphs. Neo4j [31] is a widely used NoSQL system that manipulates data modeled by graphs and offers the Cypher language to query the stored graphs. Neo4j supports CSV files and manual creation using the Cypher querying language, which makes it a very powerful tool for graph creation, manipulation and visualization. In this paper, we proposed a graph-based approach for clustering Covid-19 data using spectral clustering. Starting with data processing, we highlighted the importance of data collection and feature scaling in increasing the efficiency of any machine learning model. Then, we described the transitional phase between the heterogeneous data collected from different sources and the graph data structure, relying on the Gaussian similarity kernel to build our graph. Next, spectral analysis was applied to the built graph; we selected absolute spectral clustering, which is based on the absolute Laplacian matrix, to cluster the vertices of the graph based on the similarities between their properties. The last process was the visualization of the output data: we presented three different tools and programming languages, and recommended the graph-based system Neo4j as it supports a querying language called Cypher to interact with the graph.
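The pair of CSV files expected by Gephi (and also importable into Neo4j via LOAD CSV) can be produced with a few lines of code. The sketch below is illustrative: the function name, the similarity threshold and the sample country list are ours; the `Id`/`Label` and `Source`/`Target`/`Weight` column names are the ones Gephi's spreadsheet import recognizes.

```python
import csv

def export_graph_csv(countries, similarities, threshold, nodes_path, edges_path):
    """Write one CSV of nodes and one CSV of weighted, undirected edges,
    keeping only pairs whose similarity exceeds the threshold."""
    with open(nodes_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        for i, name in enumerate(countries):
            writer.writerow([i, name])
    with open(edges_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Weight"])
        for i in range(len(countries)):
            for j in range(i + 1, len(countries)):
                if similarities[i][j] > threshold:
                    writer.writerow([i, j, similarities[i][j]])

# Illustrative similarity values for three countries.
countries = ["Morocco", "Spain", "Italy"]
S = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.9],
     [0.2, 0.9, 1.0]]
export_graph_csv(countries, S, threshold=0.5, nodes_path="nodes.csv",
                 edges_path="edges.csv")
```

With a threshold of 0.5, only the (Morocco, Spain) and (Spain, Italy) edges are written, so the visualized graph already reflects the neighborhood structure chosen at the similarity step.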
Ongoing work intends to link the different processes of the model, developed in two different programming languages (Java and R), to build a model able to cluster heterogeneous data based on graph analytics and spectral clustering for community detection. Furthermore, the data processing needs more improvement to collect more data and expand the set of features, as the model still gives better clustering for countries with more data, which are the countries that discovered the coronavirus earlier, than for freshly contaminated countries, which gather less data and generally appear as common nodes between all the clusters, due to the fact that the majority of countries had a very similar spreading of the coronavirus in the first days of contamination.

References
[1] Prevalence of comorbidities and its effects in coronavirus disease 2019 patients: a systematic review and meta-analysis
[2] World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19)
[3] 2019 novel coronavirus (2019-nCoV) pneumonia
[4] A novel coronavirus outbreak of global health concern
[5] Clinical characteristics of COVID-19 patients with digestive symptoms in Hubei, China: a descriptive, cross-sectional, multicenter study
[6] The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
[7] Preliminary risk analysis of 2019 novel coronavirus spread within and beyond China
[8] The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak
[9] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
[10] Spectral clustering for detecting protein complexes in protein-protein interaction (PPI) networks
[11] Community detection in protein-protein interaction networks using spectral and graph approaches
[12] Spectral clustering for medical imaging
[13] Oriented grouping-constrained spectral clustering for medical imaging segmentation
[14] COVID-19: challenges to GIS with big data
[15] GIS-based spatial modeling of COVID-19 incidence rate in the continental United States
[16] Using X-ray Images and Deep Learning for Automated Detection of Coronavirus Disease
[17] Covidx-net: A framework of deep learning classifiers to diagnose covid-19 in x-ray images
[18] Neural Network aided quarantine control model estimation of global Covid-19 spread
[19] Towards for Using Spectral Clustering in Graph Mining
[20] An application of spectral clustering approach to detect communities in data modeled by graphs
[21] Polycystic ovarian syndrome novel proteins and significant pathways identified using graph clustering approach
[22] Kernel cuts: Kernel and spectral clustering meet regularization
[23] Graph laplacian for spectral clustering and seeded image segmentation
[24] Deep spectral clustering using dual autoencoder network
[25] An Algorithm of Conversion Between Relational Data and Graph Schema
[26] Generalized louvain method for community detection in large networks
[27] On spectral clustering: Analysis and an algorithm
[28] An automated spectral clustering for multi-scale data
[29] Pajek: Program for analysis and visualization of large networks
[30] Gephi: an open source software for exploring and manipulating networks
[31] A programmatic introduction to neo4j