key: cord-0774025-p7kvqchy
authors: Kumar Das, Jayanta; Tradigo, Giuseppe; Veltri, Pierangelo; H Guzzi, Pietro; Roy, Swarup
title: Data science in unveiling COVID-19 pathogenesis and diagnosis: evolutionary origin to drug repurposing
date: 2021-02-17
journal: Brief Bioinform
DOI: 10.1093/bib/bbaa420
sha: c984f59e65cbd0f51f4744c71efc11f21996a376
doc_id: 774025
cord_uid: p7kvqchy

MOTIVATION: The outbreak of novel severe acute respiratory syndrome coronavirus (SARS-CoV-2, also known as COVID-19) in Wuhan has attracted worldwide attention. SARS-CoV-2 causes severe inflammation, which can be fatal. Consequently, there has been a massive and rapid growth in research aimed at throwing light on the mechanisms of infection and the progression of the disease. With regard to this data science is playing a pivotal role in in silico analysis to gain insights into SARS-CoV-2 and the outbreak of COVID-19 in order to forecast, diagnose and come up with a drug to tackle the virus. The availability of large multiomics, radiological, bio-molecular and medical datasets requires the development of novel exploratory and predictive models, or the customisation of existing ones in order to fit the current problem. The high number of approaches generates the need for surveys to guide data scientists and medical practitioners in selecting the right tools to manage their clinical data. RESULTS: Focusing on data science methodologies, we conduct a detailed study on the state-of-the-art of works tackling the current pandemic scenario. We consider various current COVID-19 data analytic domains such as phylogenetic analysis, SARS-CoV-2 genome identification, protein structure prediction, host–viral protein interactomics, clinical imaging, epidemiological research and drug discovery. We highlight data types and instances, their generation pipelines and the data science models currently in use. The current study should give a detailed sketch of the road map towards handling COVID-19 like situations by leveraging data science experts in choosing the right tools. We also summarise our review focusing on prime challenges and possible future research directions. CONTACT: hguzzi@unicz.it, sroy01@cus.ac.in

The massive outbreak of SARS-CoV-2 viral infections in the world has led to a life-threatening pathogenic disease, which has been named COVID-19 (COronaVIrus Disease 2019) by the World Health Organization (WHO) [1] . The surprisingly rapid human-to-human transmission has created an alert due to the exponential increase in the number of cases in relatively short time [2] (https://www.who.int/emergencies/diseases/no vel-coronavirus-2019/situation-reports). Since December 2019, Figure 1 . The trends of COVID-19 related research publications from two sources: Dimensions [4] and Google Scholar [5] as of 28 September 2020. We searched by using the following keywords: COVID-19, COVID-19 and Machine Learning and COVID-19 and Data Science. The search filter includes published articles, preprints, edited books, monographs, proceedings and chapters.

(https://coronavirus.jhu.edu/map.html) has become one of the primary data sources for disease monitoring from an epidemiological perspective. The rapid accumulation of data and the need to support wet-lab investigations has implied an increasing effort in exploiting computational-based approaches (e.g. deep learning, artificial intelligence, network medicine) on COVID-19 related datasets [6] . These methods help to understand the pathogenesis of the disease and hopefully will lead to the development of a vaccine or new drugs. This, however, has resulted in an accumulation of data, algorithms, software and tools that need to be categorised and organised. A complete and exhaustive categorisation of all the approaches is undoubtedly a tough task due to the high publication rate.

We aim therefore to present the main characteristics of the current landscape, as depicted in Figure 2 : (i) data sources, (ii) repositories, (iii) data science models, (iv) decision-making and (v) interpretation. Figure 2 also illustrates data analysis processes. Some processes return data and/or models that can be used as input (feedback) for the data analysis workflow. The integration and analysis step uses data science models to infer knowledge from the results of the previous steps (see Decision row in Figure 2 ). For instance, the drug-disease association needs a network biology approach to determine the relationship between drug molecules and their impact on patients. Many laboratories are producing a massive amount of heterogeneous data, considering both format and content. Viral sequences are represented as strings. Usually, raw clinical data (i.e. clinical records, biological analyses) are highly unstructured and heterogeneous, while medical images are more standardised data. Such data are accumulated into public databases or websites, which can be integrated with other existing databases (e.g. virus-host interaction databases, clinical and epidemiological databases) to enrich knowledge or correlate information. Such an approach can be useful in the drug-disease association, screening possible candidate vaccines or supporting healthcare decisions (e.g. management of resources such as intensive care units [ICUs] ).

Contributions of the current paper can be summarised as follows:

• we present a comprehensive review of the in silico approaches adopted so far to handle COVID-19 in genomics, proteomics, interactomics, epidemics, clinical imaging and drug design;

• we present data analytics tasks related to COVID-19;

• we present an overall landscape suggesting strategies to integrate heterogeneous COVID-19 data sources;

• we report data sources, models and tools, which can be used to study SARS-CoV-2 and COVID-19;

• we take a look at computational biology and bioinformatics approaches available in the literature.

In conclusion, we aim to offer to data scientists, medical doctors, healthcare advisers and drug/vaccine designers a landscape on data and tools that are useful for their activities on COVID-19.

The recent pandemic has promoted the collaboration across different research communities (e.g. virologists, computational biologists, medicine specialists, data scientists, etc.) to shed light on the pathogenesis and contrast strategies for the COVID-19 disease. A large number of recent research efforts has generated bulk experimental data. In order to unveil the basic molecular mechanism of the disease, there is the need to use data sciencerelated technologies to improve knowledge on virus. Thus, it is crucial to have the right tools and data for understanding the biology behind SARS-CoV-2. Here, we try to briefly introduce the SARS-CoV-2 virus genetically related to COVID-19 and how omics, bioimages and epidemiological data related to COVID-19 are generated and stored in publicly available data repositories. We introduce the concept of data science as a useful set of tools and methodology to gain new insights from the available COVID-19 related datasets.

COVID-19, a highly infectious viral disease caused by SARS-CoV-2, was discovered at the end of 2019. The symptomatic COVID-19 patients usually experience mild to moderate respiratory problems together with fever, dry cough and tiredness. A few non-severe patients also experience aches, pains, sore throat, diarrhoea, skin rashes, conjunctivitis, headaches, discolouration of fingers or toes and most significantly loss of taste or smell. Infection is transmitted through close proximity of an infected person, droplets generated by infected persons through coughs, sneezes or exhaling or touching contaminated surface. It enters through eyes, nose or mouth. Trend shows [7] that patients on and above 65 years of age with comorbidities are more vulnerable and may need ICU admission.

Viruses are small microorganisms that use living cells to replicate. Viruses cause many infectious diseases responsible for millions of deaths every year [8] . They exist in the form of small independent particles named virions. Each virion consists of two main components: (i) the genetic information, encoded as DNA or RNA, and (ii) a protein coat, named capsid, which wraps the genetic material. Sometimes the capsid is surrounded by an envelope of lipids. Virions have different shapes that are used in their classification [9] . As viruses are not able to replicate by themselves, they need to use the cell metabolism of a host organism. The virus replication cycle may be summarised in the following six steps [10] : Figure 2 . A data science landscape for SARS-CoV-2 and COVID-19 studies. Many different technologies produce a large quantity of data related to patients at different scales (e.g. molecular data, medical images and clinical data and epidemiological data). The accumulation of this data is the pre-requisite for a substantial rise of data science approaches (e.g. deep-learning and classical data mining) that often integrate existing data stored in databases or a priori knowledge (e.g. domain experts or ontologies). Such approaches produce new information about molecular interactions, phylogenetic analysis, in silico design of drugs or healthcare management decisions. The output may guide the execution of novel experiments closing the loop of the whole process. 5. Assembly. Virus particles may self-assembly with host proteins, causing the modification of some proteins. 6. Release. Viruses can be released from the host cell by lysis, a process that kills the cell.

Viruses are classified into major classes using phenotypic characteristics, such as morphology, nucleic acid type (e.g. RNA or DNA), etc. The International Committee on Taxonomy of Viruses (ICTV) is in charge of updating the viral taxonomy. The Baltimore classification system is also used where viruses are grouped into seven groups based on mRNA synthesis.

SARS-CoV-2 belongs to β-coronaviruses that are a subgroup of the coronavirus family. These are giant enveloped positivestranded RNA viruses that are usually able to infect a wide variety of mammals and avian species. All the members of the family cause respiratory or digestive and enteric diseases [11] . The infection mechanism is based on the action of surface spikes constituted by glycoproteins (named S or spike proteins), responsible for binding to host cell receptors. The literature reports seven β-coronaviruses that are responsible for causing diseases in humans. Four strains cause mild respiratory apparatus infection, which can be usually treated without lethal consequences (HCoV-229E, HKU1, NL63 and OC43). More recently, three strains of betacoronaviruses have severe and potentially fatal consequences: SARS-CoV, MERS-CoV and SARS-CoV-2 [12] . SARS-CoV caused an outbreak in China in 2002 characterised by a severe respiratory syndrome. MERS-CoV caused an outbreak in the Middle East in 2012. Both viruses caused similar disease symptoms, which led to pneumonia. MERS-CoV infected patients also presented gastrointestinal complications and kidney failure. The third member of the family, SARS-CoV-2, appeared in December 2019 in Wuhan, Hubei Province, China [13] . From the initial steps, it presented a surprisingly rapid diffusion rate. Until now, COVID-19 has killed more people than SARS and MERS combined, despite a lower fatality rate [14] . By the end of April 2020, the COVID-19 virus had already caused over 1500 000 confirmed cases around the world, of which around 350 000 hospitalised, and over 94 000 deaths. In China there has been more than 80 000 confirmed cases, with more than 3000 deaths.

The sequence and structural analysis revealed a marked similarity between SARS-CoV and SARS-CoV-2, as confirmed by the evidence that the new coronavirus binds with the ACE2 receptor. Unfortunately, it presents a closer affinity than the previous virus [14] . Moreover, the expression pattern of ACE2 in human respiratory epithelia and oral mucosa may represent the cause of human-to-human transmission. Clinical manifestations of COVID-19 may be severe since they seem to impact all of the tissues and organs that express the ACE2 receptor. Some of the clinical conditions are severe pneumonia, kidney failure, anaemia, neurological problems, cardiovascular complications and also a severe inflammation known as cytokine storm, occurring in the most serious cases [15] [16] [17] .

COVID-19 pandemic has contributed to massive, unprecedented and rapid growth in data generation and research publications across the world. Figure 1 illustrates publication trend considering keywords related to COVID-19, whereas Figure 3 depicts a snapshot of the distribution of publications considering biological related issues. As reported in the academic search engine named Dimension (https://app.dimensions.ai/discover/publica tion), a total of 730 datasets related to COVID-19 are publicly available. Published data can be primary (e.g. sequences, clinical images, medical reports) or secondary data (e.g. protein structures, interactomes, epidemiological). We consider data to be primary when it is directly generated from the virus or the patient. From the primary data, more refined, summarised and inferred outputs are elaborated and then stored as secondary data. During COVID-19 data analysis and inference play a significant role as a reference set to extract or infer new knowledge. Next, we briefly discuss primary and secondary data types, the ways they are generated and published in available repositories for COVID-19-related data mining processes for further information. 

High-throughput omics technologies (http://omics.org) use biochemical assays (i.e. analytical procedure to detect and quantify cellular processes) to discover molecules in the biological samples. Omics data fall (but are not limited to) into the following classes: genomics, transcriptomics, proteomics and metabolomics. Both omics technologies and data are used for insights into new unknown viruses. Thus, a preliminary activity to study COVID-19 disease has been the sequencing of the genome of the SARS-CoV-2 to elucidate how the virus grows, mutates and replicates [18] . Blood or throat swab specimens are collected both from population and patients showing compatible symptoms or suspected to have been infected. The extracted RNA material is then further sequenced, for instance, by using next generation sequencing (i.e. NGS). The output of NGS is the SARS-CoV-2 genome, which is usually stored in public repositories (see available SARS-CoV-2 nucleotide and protein sequence repositories in Table 1 ). NGS enables the retrieval of complete RNA information, including transcription and expression levels, functions, locations, trafficking and degradation. In a recent study, the architecture of SARS-CoV-2 transcriptome [19, 20] is reported. Transcriptomic data are highly effective in furthering understanding of the processes of cellular differentiation, carcinogenesis, transcription regulation and SARS-CoV-2, followed by the discovery of important biomarkers. The two main sources of SARS-CoV-2 RNA expression can be found in NCBI (http://www.ncbi.nlm.nih.gov/geo/) and OmicsDI (https://www.omicsdi.org/). Finally, metabolomic data analysis may be used to help in identifying potential chemical biomarkers for COVID-19. Metabolomics is used for the analysis of phenotypes and to gain insights into the metabolic state of biological systems. However, there are very few works in the literature providing metabolic data from COVID-19 samples. For instance, Shen et al. [21] reported proteomic metabolomic profiling of sera from 46 COVID-19 and 53 control patients data extracted from ProteomeXchange Consortium dataset (https:// www.iprox.org/).

Interactomics is the study of biochemical interactions among biological molecules (e.g. proteins, transcription factors, small molecules). Since these interactions are the elementary building blocks of almost all cellular processes, their elucidation appears as an essential step in describing SARS-CoV-2 mechanisms of infection and replication [22] . There are many experimental platforms for deriving physical interactions among proteins [23] , such as affinity purification mass-spectrometry (AP-MS) and yeast-two-hybrid (Y2H). Gordon et al. [24] have expressed 26 out of 29 SARS-CoV-2 proteins and used an AP-MS to identify 332 human proteins to which the viral proteins bind. There are several bioinformatics tools enabling the prediction of interactions using biological information coupled with network science (see for instance [25] for a more detailed comparison). Different proteomic technologies can be used to study the complete set of interactions for several viruses [26] [27] [28] . For instance, research projects have elucidated quite a large map of interactions for SARS-CoV-2. Such interactions are usually modelled by using graphs and stored in a growing number of databases, such as Virus Mint [29] , String Viruses [30] , HpiDB [31] , Virus Mentha [32] and VirHostNet [33] .

Despite the existence of such platforms, the rapid diffusion of the SARS-CoV-2 virus makes the extraction of reliable information (e.g. correlations, interactions) from these databases particularly difficult. Consequently, the first studies mainly used interaction predictions performed by using bioinformatic tools. For instance, in [34] , the homology among SARS-CoV-2 and other coronaviruses has been used to infer putative interactions among viral proteins and host-viral proteins. Differently, a wet-lab approach [24] has been used where SARS-CoV-2 proteome is cloned and AP-MS is used to identify 332 protein interactions between SARS-CoV-2 and human cells. Finally, in [35] , a preliminary tool consisting of a curated SARS-CoV-2 interactions dataset extracted by the IMEx consortium is presented.

Chest images data X-ray and computer tomography (CT) technologies can be used to detect lungs and respiratory tracks infected by COVID-19. Ground-glass (GGO) pattern is the most common finding in a chest CT image of a COVID-19 infected patient. Patterns are usually multifocal, peripheral and bilateral. However, during COVID-19, GGO may appear as a unifocal lesion, most commonly located in the inferior lobe of the right lung [36] . Chest X-ray images have been observed to be insensitive in the early phase of the disease. However, they become useful in tracking the progression of the disease.

The urgent need for an automatic diagnostic tool for the rapid detection of COVID-19 patients encourages the data science community to develop novel machine learning-based diagnostic frameworks. TrainingData.io (https://www.trainingdata.io/) is a platform offering a free collaborative tool that allows data scientists and radiologists to share training data annotations that can be used for developing machine learning models. We report a non-exhaustive list of repositories containing annotated COVID-19 infected chest images in Table 1 .

Epidemiological data are a collection of non-experimental observations obtained by gathering any health-related data source by domain experts, where such data include environmental, clinical and laboratory data, geographic spread and so on. Thus these data can be associated with the geographic spread as well as the risk associated with co-morbidities. Consequently, many independent groups have started to collect epidemiological data produced and made available by healthcare providers. Dong et al. [37] have designed and developed the first dashboard hosted at the Johns Hopkins University, providing free access to health data collected from almost all nations. Data are related to COVID-19 reported cases, i.e. infected, dead and recovered patients. For instance, the Italian government has provided a similar dashboard together with raw data related to COVID-19 [38] . Similarly, Xu et al. [39] have realised an open-access database for storing patient information produced in laboratories. Stored data are related to movements (for retrieving travel history), symptoms and demographics. All of these projects share some common characteristics: (i) the use of simple formats (e.g. tabular formats), (ii) the possibility of export in standard data sharing format (e.g. comma-separated values), (iii) simple query interfaces, (iv) the integration of geographic data and (v) demographic information [39] , as reported in Table 1 .

A drug is a small organic molecule [40] that can activate or inhibit the function of a therapeutically relevant protein during the onset of a disease. The discovery of a new novel drug and the subsequent approval by the Food and Drug Administration (FDA) is a complex, expensive and time-consuming process. Drug discovery involves two main steps: (i) drug-target identification and (ii) development of small molecules able to interact with the target [41] . The approved drug molecules and targets are often stored in publicly available databases, usually in the simplified molecular input line entry system (SMILE) format, for drug repurposing or commercial development. We report some drug databases that are useful for drug-discovery process. For example, DrugBank is a repository containing both drug and drug target information. The latest release contains 13 596 drug entries, including 2640 approved small molecule drugs, 1389 approved biological molecules (i.e. proteins, peptides, vaccines and allergenic), 131 nutraceuticals and over 6377 experimental drugs still in discoveryphase. Additionally, 5225 non-redundant protein sequences (i.e. drug target/enzyme/transporter/carrier) are linked to these drug entries. PubChem is a collection of freely accessible chemical information that stores chemical and physical properties, biological activities, safety and toxicity information, patents and literature citations. It contains 111 million compounds, 287 million substance descriptions and details, 273 million of bioactivities conducted on compounds and over 32 million of pieces information relating to drugs with published papers and 25 million patent descriptions. Excelra is an opensource COVID-19 Drug Repurposing Database that stores a list of approved small molecules being at an early stage of experimentation. Pharmaceutical Technology (https://www.pharmaceuti cal-technology.com/coronavirus-drug-trials-studies/) is a coronavirus drug tracker that lists drugs at all stages of pre-clinical and clinical development (from discovery through to preregistration) for COVID-19. This list is updated dynamically, based on the Global Data Pharma Intelligence Center Drugs database (https://www.globaldata.com/). CAS (https://www.cas.org/) contains a connection of nearly 50 000 chemical substances, stored in the SD file (.sdf) format, along with related metadata such as CAS Registry Number and physical properties for each element. Other relevant non-exhaustive drug database instances are listed in Table 1 .

Data science is a novel interdisciplinary research field that leverages methods, processes and algorithms supporting the extraction of relevant knowledge from (big)-data. The term data . Major phases of data science pipeline towards decision making and analysis. Data initially collected and integrated from many sources. Then they need to be pre-processed to filter uninformative or possibly misleading values (e.g. outliers or noise). Then existing models are used to explain data or extract relevant patterns describing data or predicting associations. Finally, results need to be interpreted and explained by domain experts. Each step of analysis may generate corrections or refinements that are applied to precedent steps. science was introduced for the first time in 2008 by DJ Patil and Jeff Hammerbacher [43] . Data science pipelines are made of four major phases: (i) raw data collection, (ii) preprocessing, (iii) descriptive or predictive modelling and (iv) interpretation. An illustrative representation of a typical data science workflow for COVID-19 management is reported in Figure 4 .

The integration of heterogeneous data is a crucial step in data science. The success of any data science model depends on the quality of data. Due to flaws, noise or errors in data generation, the outcome of a data science workflow can lead to incorrect results and/or interpretations. It is crucial therefore to apply different scrubbing and cleaning processeses on the data. Data standardisation and transformation are required if data have been generated by multiple and varying sources. Collectively, all of the above steps are called preprocessing [44] . Data exploration, which consists in feeding data to the computing model, can then be performed. Any data science process should include the statistical description (e.g. type, distribution, significant features, relationship among the data variables) of the input dataset, which leads to a better understanding of the data itself and of the preprocessing and analysis phases. Feature selection helps in identifying relevant attributes in the dataset. Visualisation aims to explore the possible relationship between features in the dataset or among different datasets. Dimensionality and data reduction help to make the data science workflow more efficient and resistant to noise.

Data can be analysed through the use of both descriptive (unsupervised) and predictive (supervised) models that often may be merged together by ensembling [45] . It is worth mentioning a few deep learning frameworks that could be helpful for COVID-19 predictive data analysis. Convolutional neural networks (CNNs) [46] are extensively used to analyse radiological chest images of COVID-19 patients. A different kind of CNN model, specifically designed for graph or network data, is a graph convolution network (GCN) [47] . Due to the lack of adequate training samples during COVID-19 to train deep models, the synthetic data play a significant role [48, 49] . A recent breakthrough in deep model architecture is given by the generative adversarial networks (GANs) [50] for generating synthetic data akin to real data. A GAN is made of two simultaneously trained neural networks: generator and discriminator. A discriminator recognises training samples, whereas a generator creates fake instances to challenge the discriminator, which enhances both modules in learning crucial discriminating features in the original dataset. Regression analysis is a supervised model that can also be used to predict pandemic trends [51] . Clustering [52] is a well-known unsupervised learning model that describes and summarises the hidden pattern inside data based on certain proximity metrics. Results of the previous methods have to be validated by domain experts. Expert feedbacks can also be iteratively suggested to refine the phases of the data science pipeline (see Figure 2 ).

The main objective of the omics study is to discover the proximal origin of SARS-CoV-2, its mutational variants, and develop a predictive model for identifying SARS-CoV-2 from an isolated strain or sequence or effective chemical biomarker identification through transcriptomics and metabolomics studies. In addition to that, nucleotide sequences are used to determine the SARS-CoV-2 viral genome and 3D protein structure prediction as depicted in Figure 5 .

Discovering the evolutionary origin of SARS-CoV-2 is currently the most urgent research topic. It aims to generate an evolutionary tree by using its nearest species [53] . Clustering is one of the most popular techniques to create a phylogeny among the coronavirus family-like SARS-COV and MERS-COV [54] [55] [56] . The evolution of viruses is mapped onto in silico algorithms using phylogenetic analysis. In general, the creation of a phylogenetic tree primarily relies on the analysis of sequence through alignment. Several algorithms have been developed to produce high-quality sequence alignments for both global alignments (focusing on the whole sequence) and local alignments (focusing on local regions of the sequence). Few common established methods and software have been used for SARS-CoV-2 genome alignment, such as DNAMAN (https://www.lynnon.com/dnama n.html), ClustalW [57] , MUSCLE [58] , Jalview [59] and MAFFT [60] . From alignment data, the evolutionary history is inferred by using the neighbour-joining (or maximum likelihood) method [57, 61] . Alignment approaches are also frequently used for the identification of single nucleotide polymorphisms (SNPs) of rapidly evolving viruses like SARS-CoV-2 [12, 12, 62] . On the other hand, alignment-free methods use feature-based approaches and compare the sequences by using the derived features. Due to the low mutation rate and a high degree of similarity among SARS-CoV-2 genomes, very few studies have been performed using alignment-free methods [63, 64] . Studies reveal that SARS-CoV-2 genome is similar to SARS-related coronaviruses (https:// www.ecohealthalliance.org/2020/01/) found in the Pangolin [65] or Bat [66, 67] . Scientists are studying the new strains variants of coronavirus to understand its mutant variants. Despite the close similarity of the different SARS-CoV-2 genomes [68, 69] , significant variations are also reported in the literature [70] . These studies, important for an understanding of the spread of the disease, can be useful for antiviral drug design [71] .

For the elucidation of the infection mechanism it is important to individuate the complete sequence of virus genome and its circulating variants by means of machine learning models coupled with comparative genomics. Both the alignment and alignmentfree methods are applied to generate features. In order to train machine learning models, some of the most used features are k-mers (i.e. subsequences of length k) and N-grams (i.e. a contiguous sequence of N items from a given sample), amino acid chemical properties and mutation information extracted by alignment methods.

In [63] an integrated approach is used to identify key genomic features that differentiate SARS-CoV-2 from SARS-COV and MERS-COV [72] coupled with decision trees that are also applied for sequence classification based on over 5000 unique viral genomic sequences, totaling 61.8 million bp (base pairs) that include 29 COVID-19 virus sequences. Recent works have sought to combine five well-known classification models in selecting features derived from a set of genomes belonging to a large set of coronavirus families and genomes of SARS-CoV-2 for detecting novel SARS-CoV-2. For instance, Fang et al. use a bi-path CNN (BiPathCNN) [73] .

Mutations of the genome may alter the encoded amino acid sequence, and the so-called non-synonymous mutations can influence the structure and function of the resulting protein [71] . Understanding the protein structures is required for identifying functional motifs and elucidating the possible binding mechanisms with the host proteins and for discovering antiviral drugs [74, 75] . The elucidation of protein structures by wet lab experiments requires a considerable amount of time. Structure prediction therefore is performed by using computational methods. Recently, SARS-CoV-2 proteins have been predicted (https://deepmind.com/) by the AlphaFold and the structures are deposited in the Protein Data Bank (https://www.rcsb.org/). AlphaFold [76] is a deep two-dimensional dilated convolutional residual network that predicts the inter-residue distances between pairs of amino acids and the angles between chemical bonds that connect those amino acids. trRosetta [77] is also used to predict SARS-CoV-2 protein structures. In addition to this, other existing protein structure and homology, modelling tools like COMPOSER [78] , SWISS-MODEL [79] , PyMOL c [80] and I-Tasser [81] are used for rapid prediction and comparison of Spike (S) protein [82, 83] , Envelope (E) protein [2] and ab initio homology modelling [81] .

Alongside sequence data and structural analysis, several researchers have focused on transcriptomics and metabolomics data analysis to design better therapeutic strategies for COVID-19. The aims of the transcriptomic data analysis are to investigate the activity of the set of genes in different organs and functional pathways and their possible role in causing infections and the regulation of various immunological factors inside the cell of SARS-CoV-2 patients during COVID-19 disease.

A study of transcriptomic data analysis of COVID-19 lungs and bronchoalveolar lavage fluid samples revealing predominant B-cell activation responses to infection is presented in [84] . The authors have used Metascape [85] for functional enrichment analysis of experimental data to determine the transcriptomic signature of lung tissues from COVID-19 patients. Further, xCell software [86] is used for computational deconvolution analysis to evaluate the relative proportions of immune cell subsets inCOVID-19 and healthy control samples.

For a better understanding of the pathophysiology of COVID-19, Gardinassi et al. [87] analysed public transcriptome datasets. They have considered the transcriptional signature of COVID-19 infected with SARS-CoV-1 and Influenza A (IAV) viruses. A core transcriptional signature induced by the respiratory viruses in peripheral leukocytes has been identified and the absence of significant type I interferon/antiviral responses has also been noted for SARS-CoV-2 infected. They also have identified the higher expression of genes involved in metabolic pathways including heme biosynthesis, oxidative phosphorylation and tryptophan metabolism.

Based on the publicly available high throughput gene expression data of several respiratory infection viruses, including SARS-CoV-2, a host transcriptome-based drug repurposing strategy has also been proposed [88] . The two main areas are interaction network construction using functional enrichment analysis and drug repurposing. STRING data repository is used for interaction network construction and functional enrichment analysis. For drug repurposing, the authors have used DrugBank and Connectivity Map (CMap) provided by the CLUE (https://clue.io/cmap) web tool. Finally, they suggested six approved PTGS2 inhibitor drugs for the treatment of COVID-19. Similar studies [89] were conducted, where analysis of major histocompatibility complexes and innate immune system gene expression from SARS-CoV-2 infected RNA-seq data of human cell line and virus transcriptome data is utilised to predict T-and B-cell epitopes.

To the best of our knowledge, few studies involve metabolomic data on COVID-19 patients. A metabolomic data analysis coupled with proteomics data is conducted for metabolomic characterization of SARS-CoV-2 patient sera in [21] . A random forest based learning model is applied on proteomic and metabolomic data derived from 18 non-severe and 13 severe patients. Finally, the authors find 29 important variables including 22 proteins and 7 metabolites. In another attempt, metabolomic and lipidomic data [90] are used revealing that metabolite and lipid alterations are correlated during COVID-19 disease.

The majority of the COVID-19 related research relies on genomics and proteomics data of SARS-CoV-2 and other coronaviruses more than transcriptomic and metabolomic data. Existing tools have been extensively used to analyse SARS-CoV-2 omics data, and very few innovative approaches have been developed. Of course, the omics data generation and free distribution have made the major contribution to omics research. With the availability of high throughput and high resolution omics data it is now possible to perform a microlevel investigation of COVID-19 pathogenesis. Multi-omics data integration [91] together with effective data science and machine learning models [92, 93] is one possible way to improve understanding of the pathogenesis of COVID-19 or other viral diseases.

Interactomics research related to SARS-CoV-2 has two main goals: (i) development of possible therapies for helping affected people, and (ii) introduction of a novel vaccine for blocking the spread. Despite the existence of many different laboratories that have sequenced the whole genome and the availability of such data, the above-described issue may be successfully addressed only by looking at a molecular scale through the elucidation of interactions of viral and host proteins. As aforementioned, during the replication step, the virus proteins use the host environment, interact with each other and the host proteins, causing loss of function or even the death of the cells. The complete elucidation of the whole set of such interactions is therefore a crucial step for combatting the viruses. Understanding the interplay between host and virus proteins is crucial to identify virusrelated diseases and potential targets for therapeutic strategies.

Such information may clarify the viral molecular machinery during the viral infection, survival within the host and replication. This knowledge can also help to discern the protein interactions that are crucial for transmission and replication [34] .

The literature contains many examples of the use of mass spectrometry for determining SARS-CoV-2 protein interactions [24, 94] . For instance, AP-MS based SARS-CoV-2 -host interactomes reported for 26 SARS-CoV-2 proteins with 332 host proteins [24] . The study aimed to identify possible drug targets. Therefore they isolate 66 possible drug targets in human proteins suggesting potential 69 compounds (of which, 29 drugs are approved by the US Food and Drug Administration, 12 are in clinical trials and 28 are pre-clinical compounds). As it is a time-consuming and expensive task to elucidate experimentally validated complete host protein interactions with viral protein.

The in silico prediction is the only viable alternative. Some studies presented the investigation of virus-host interactomes using tools and methodologies from graph theory [27, 95] , demonstrating the importance of studying virus-host interplay at network level [96] [97] [98] [99] [100] . Data related to the interactions (or functional associations) among biologically relevant macromolecules (e.g. proteins, genes, etc.) are usually modelled by means of graph theory and its related formalism [101, 102] . Consequently, biological entities are represented as nodes, while edges model their associations [103] . Such networks may contain a single kind of molecule, such as protein-protein Interactions (PPI), or genegene interactions [23, 104] . More recently, it has been shown that biological processes are formed by the synergistic interplay of different molecules (i.e. genes, non-coding RNA, proteins, mi-RNA, etc.) [105] . Consequently, novel models that integrate such diverse aspects and describe the interplay of the heterogeneous actor have been introduced. The use of more complex network models comprising different nodes and the various interactions is growing [106, 107] . The SARS-CoV-2 scenario, as we describe in the following, also contains such models (e.g. [10] ). In Figure 6 we report a summarised view of in silico interactome graph inference workflow, involving different interactome and omics data sources.

From a bioinformatics perspective, a few key questions need to be addressed [9] :

• Are the infected proteins central or peripheral? • Do all of the viruses attach to similar proteins from a network point of view?

• What happens in an infected host interactome?

These considerations guided the first attempts to produce an interactome-wide map of SARS-CoV-2 proteins and their interactions with human proteins enabling scientists to answer the above questions. Thus, the interactions may be categorised in two main classes: (i) intra-viral interactions, i.e. interactions that occur among viral proteins that are in general limited and easy to determine; (ii) host-virus interactions, i.e. interactions that occur among viral and host proteins, which may potentially be numerous. One of the main challenges in this area is represented by the different speeds between the spread of SARS-CoV-2 and the time needed for wet-lab experiments. Therefore, all of the approaches we discuss below integrate both in silico and wet-lab experiments.

One of the first approaches of building a SARS-CoV-2 interactome is described in [34] . The authors have derived the first map of both intra-viral and host-viral proteins using a bioinformatics approach based on the homology between SARS-CoV-2 and the previous 2002 SARS-CoV virus. The hypothesis underlying Multiple interactome analysis is another method used to integrate data obtained from heterogeneous protein or gene networks. In a similar attempt the authors in [108] proposed the integration of PPIs and gene expression data that are both obtained from available databases. Authors started with data related to three existing viruses (SARS-CoV, MERS-CoV, HCoV-229E) to infer the interactome of SARS-CoV-2. They also integrated an additional PPI database in order to reconstruct the action of SARS-CoV-2 at the proteome level, obtaining a network consisting in 13 020 nodes and 71 496 interactions. In parallel, the authors inferred a gene co-expression network using random walk with restart (RWR) algorithm and S-glycoproteins of SARS-CoV, MERS-CoV and HCoV-229E as seeds. Similarly, the HCoV-host interactome network was built by assembling known networks (e.g. SARS-CoV, MERS-CoV, HCoV-229E, HCoV-NL63, mouse MHV, avian IBV) obtaining a SARS-CoV-2 phylogenetically close interactome. As a novel attempt [109] , the codon pattern is used to infer possible interactions between 26 SARS-CoV-2 proteins and selective host proteins involved in 17 major cell signalling pathways.

For the described reasons above, and those connected with the previous studies into the SARS-CoV-2 virus, the number of known protein interactions is constantly growing. Discovering these interactions constitutes the first step in determining targeted therapies. Despite the large number of investigations in the literature, virus-host interactomes are far from exhaustive and the impact of mutations in both virus and humans has not yet been completed unravelled. Therefore, research in this field benefits from any increase in the discovery of novel protein interactions. Nonetheless the development of targeted therapies suffer from certain limitations. Finally, it should be noted that the integration of data collected from different omics sources [110] and medical images may contribute to understanding the evolution of the disease.

Image analysis transforms digital images into measurements providing meaningful information of the images themselves. Automated chest image analysis may help in the early diagnosis of COVID-19 thereby assisting physicians in case of emergency. Two kinds of chest radiography images, obtained through Xray and CT scanners, are recognised as useful for diagnosing pneumonia in COVID-19 patients. Data manipulation techniques can be used for the automatic (or semi-automatic) analysis of large amounts of images requiring substantial quantitative assessment and computation. Chest images can be used in many COVID-19-related scenarios, e.g. predicting the need for ICU resources, predicting survival rates, studying the patient's trajectories during treatment [1] . Imaging is a fast, non-invasive, relatively cheap and already a widely adopted clinical practice that can be used to monitor the evolution of the disease. The ultimate goals are improving patient healthcare, biomarker design for the COVID-19 and, most importantly, early COVID-19 detection in patients.

Machine learning methods or CNN-based methods have also been used for this aim. For instance, COVID-Net [111] is the first open source CNN-based framework designed using deep learning techniques for COVID-19 detection. It has been used to analyse X-ray chest images; authors developed COVIDx, an open-access benchmark dataset composed of 13 975 CXR images from 13 870 patients. Model performance has been evaluated with other deep neural network architectures for comparative purposes. The model predicts three possible outcomes for each input image: (a) healthy, (b) non-COVID-19 (e.g. viral, bacterial, etc.) infection and (c) viral COVID-19 infection.

A transfer learning-based CNN has also been applied in [1] for detecting various anomalies in small medical image datasets. The authors collected 1427 X-ray images (224 COVID-19, 700 common pneumonia and 504 normal cases) from several sources such as Cohen (https://github.com/ieee8023/covid-chestxray-da taset), Radiological Society of North America (RSNA), Radiopaedia and the Italian Society of Medical and Interventional Radiology (SIRM) (https://www.kaggle.com/andrewmvd/convid19-xra ys).

In [112] , a deep-learning based method has been applied to pulmonary CT images to distinguish patients affected pneumonia related COVID-19 from healthy cases. At first, candidate infection regions have been isolated from the pulmonary CT image set by using Residual CNN (ResNet-23) , a pre-trained neural network to identify image features. A combined CNN-long short term memory (LSTM) architecture is also used to detect infected patients X-ray images in [49] . CNN is used for extracting features from images, while LSTM is used for analysing these features. Similarly, in Inf-Net [113] , a CNN has been used to perform the segmentation of lung CT images of COVID-19 patients. Moreover, in the absence of training images, synthetic Chest X-Ray (CXR) images can be generated by using the GAN model proposed in [48] or by using the statistical techniques described in [114, 115] . A binary classifier, using manta-ray foraging optimization (MRFO) for features extraction and KNN for classification, has been used to classify COVID-19 affected chest X-ray images in [116] .

Deep learning is one of the most commonly adopted choices by the data science community. Due to the availability of deep models and easy-to-use frameworks, researchers are able to use them to set up and develop methods for helping in COVID-19 diagnosis. The integration of both X-ray and CT scans may improve the quality of detection. Due to the similar lung damage and symptoms between COVID-19 and common pneumonia or influenza, the chances of false positives are quite high. One option for more accurate COVID-19 diagnosis, which is yet to be fully explored, is to integrate information extracted from chest images with transcriptomic and metabolomic data [110] . Despite the increasing demand of rapid COVID-19 diagnostic systems, most of the data science approaches described above suffer from low data availability. The larger the sample data volume, the more reliable the diagnostic system should be. In order to compensate for the data scarcity, both GAN and statistical models can be successfully used the hand the issue.

From the beginning of the outbreak, a significant source of information has come from observation of novel COVID-19 cases and has been used to predict the evolution of the disease diffusion. Such data have been used with both existing and ad hoc mathematical models [39, 117] . The main aims of these approaches are (i) controlling the diffusion of COVID-19; (ii) supporting healthcare providers in allocating resources (e.g. planning ICUs); and (iii) evaluating the impact of containment measures.

From a data science perspective view, almost all of these efforts use real data to build and fit both predictive and observatory models. Most of them use deterministic models based on classical epidemiological studies. Consequently, real data are used to calculate model parameters based on ordinary differential equations (ODE) [118] [119] [120] . The diffusion of such works has been very rapid; for instance, simple queries on Google Scholar or on preprint servers returned more than two thousands papers on average.

In [121] authors integrated information of existing data sources provided by the Johns Hopkins University, WHO, Chinese Center for Disease Control and Prevention, National Health Commission and Dingxiangyuan (DXY, a Chinese epidemiological database). The proposed tool allows scientists to perform exploratory data analyses, using visualisation techniques to highlight differences in the reported cases (e.g. infected, dead and recovered people), in different countries.

Moreover, more sophisticated models have tried to integrate epidemiological data with other data to study the impact of other information (e.g. environmental, geographical, clinical) [122, 123] . A COVID-19 outbreak forecasting model has been developed using LSTM networks [124] . The John Hopkins University and Canadian health datasets have been used to extract significant features able to predict trends and possible duration of the current global COVID-19 pandemic.

The COVID-19 outbreak has been characterised by a notable accumulation of epidemiological data publicly available on the web. The data are mainly related to infected patients as well as to the number of deaths. However, it should be noted that such datasets present some main drawbacks: (i) there is in general little attention paid to their reliability and (ii) there has been little effort in data format standardisation and semantics. For instance, there is no standardised way to determine the exact number of infected; moreover, variations in diagnosis the causes of death at country level may skew mortality figures at different countries. The vast majority of the described approaches integrating epidemiological and environmental data shows the above-mentioned limitations. From a computer science point of view, as far as principled data integration is concerned, the use of technologies and methodologies from the data-warehouse and on-line analytical processing (OLAP) communities may constitute a valid and stable choice, since most of the data are available in text format and could be stored on a relational or NoSQL database.

Drug discovery aims to identify small molecules that potentially modulate the functions of target proteins. The development of a new drug molecule for COVID-19 is a time-consuming and costly task. In the COVID-19 era, the long process for the determination of a novel drug is not feasible, due to the rapid spread of the virus. It is of utmost importance to identify candidate anti-viral drugs more rapidly; these may control the adverse effects of COVID-19, thereby hopefully reducing the mortality rate. The best alternative is to look for already FDA-approved drugs that may bind with the therapeutic target (viral or host) proteins.

Data analysis for discovering possible candidates from the existing drugs is a well-known process referred to as 'drug repurposing'. It involves the identification of new uses for approved (or experimental) drugs as a possible cure for novel pathologies. The process, as depicted in Figure 7 , is based on the integration of molecular data (e.g. interactomes, co-expression networks), concerning the existing drug-disease association.

The availability of high-resolution proteomics, interactomics and drug-target association data makes it now feasible to quickly find a suitably small molecule in silico with the help of advanced (deep) neural network models. A good number of deep learning-based drug-target associations and repurposing tools are available for other viral diseases and thus can also be used for COVID-19 data analysis (see Table 2 ).

Recent trends have adopted network-embedding techniques [125] and DNN to produce lists of possible candidate drugs that will be confirmed through wet-lab experiments and clinical trials. It should be noted that, after the in silico identification, the drug repurposing process requires time and funds for the subsequent clinical trials, but the overall time required is shorter than developing a new molecule from scratch [126] .

The authors in [24] have used an experimentally validated host-viral network to test 69 existing drug compounds constituting potential drug targets to treat COVID-19. Multiple network-based strategies coupled with GCNs have been explored [133] , kGCN [134] , DeepChem [135] , D3Targets-2019-nCoV [136] , CoVex [137] to rank drug repurposing candidates. At first, COVID-19 interactome modules have been identified, considering 56 different human tissues. Existing drug molecules have then been ranked by means of a proximity measure based on their ability to interact with their protein targets. In [125] network proximity analyses have been performed on drug targets and HCoV-host interactions and 16 potential anti-HCoV repurposable drugs have been selected. They have used host proteins from four known HCoVs (SARS-CoV, MERS-CoV, HCoV-229E and HCoV-NL63) based on phylogeny analysis and performed functional enrichment followed by drug association analysis. Another network-based approach for deriving possible drug targets has been attempted in [10] , where both protein interaction and gene co-expression networks have been used to identify master regulators [127] involved during SARS-CoV-2 infection. Physical interactions of proteins were extracted from [34] . The co-expression network has been generated by using SARS-CoV-2-human interactome proteins, derived from [128] and the largest human lung RNA-Seq dataset available from the GTEX consortium (www.gtexportal.org). The authors identified a number of key proteins involved during an infection such as ACE2, TMPRSS and MOCK. They hypothesised that these proteins may be potential therapeutics targets, evidencing that COVID-19 is characterised by a large inflammation process, not limited to the respiratory apparatus.

As discussed in Section 6, drug repurposing is a crucial methodology for COVID-19-related therapies, since the adoption of the classical and very time-consuming drug-discovery approach [129] is unfeasible. Data science and computational intelligence are the two most useful building blocks of all the other drug repurposing approaches. Unfortunately, drug repurposing process needs, to the best of our knowledge, the introduction of up-to-date medical guidelines to become widely adopted. Despite the fact that data science will clearly help in accelerating drug repurposing, some challenges remain. For instance, in silico drug repurposing, being based on simplified models for both humans and viruses, may not reproduce all of the possible side effects. Moreover, the vast majority of drug repurposing approaches do not consider the impact of dosage or the responses on different tissues (the original drugs could be optimised for other scenarios). Therefore, clinical tests, which are mandatory for candidate molecules, may well lead to a slowdown in the process, since short term trials may not have sufficient statistical power (e.g. due to the small number of patients). Moreover, existing drug repurposing approaches do not often consider, among other factors: (i) heterogeneous populations with different genetic backgrounds, (ii) the existence of different phenotypes (e.g. patients with a different level of illness), (iii) as well as genetic differences on the SARS-CoV-2 circulating variants. Nevertheless, some examples have produced positive results such as [130, 131] .

As evidenced before, the potential applications for data science, deep-learning and artificial intelligence are countless in this field. However, due to the speed of the spread of the virus and the number of novel approaches proposed worldwide, it could seem that data science may fail to slow down the pandemic, hence the urgent need for a comprehensive vademecum for practitioners, industry experts, as well as researchers. In this work, we provided an in-depth overview of the data sources and methods that are currently used to elucidate the primary mechanism of pathogenesis and development of COVID-19. We included data types from the molecular scale to patient (medical imaging) and population-scale (epidemiological data) and discussed the main approaches for modelling COVID-19 infection, drug repurposing, population surveillance, disease and treatment. Table 2 provides an overall summary of data science models, types of tasks and data and various software tools. We also discussed some relevant challenges for data science applications in healthcare, including the need to introduce more standards and the need for more straightforward data integration. Finally, we firmly believe that data science can be valuable support in fighting COVID-19.

• Ontology-based federation of data: The current scenario is characterised by many data formats that differ both in schemas and content; there is, therefore, the need to introduce a novel data federation mechanism that is able to integrate data both horizontally and vertically.

• Development of graph-based models: The integration of data into a unified model (ideally including patients molecular and clinical information) could be a key feature in gaining more precise and effective modelling the diffusion of the virus [91, 110] and the definition of more 'models';

• Leveraging the use of efficient and high-throughput analysis workflows: The rapid spread of the virus and the unprecedented production of data require the introduction of novel efficient and high-throughput data analysis environments, possibly structured as virtual laboratories federated by means of cloud infrastructures [138] .

• Analysis of circulating variants and their impact: Due to the rapid mutation rate of the virus, a large number of mutations are constantly appearing for SARS-CoV-2 which may alter its protein structures. Structure-based drug development depends on the structural coherence between drug molecules and target proteins; hence the study of viral structure variations is essential for stable anti-viral drug development. By predicting strain variations with machine learning methods, domain experts will be able to design anti-viral drugs or reuse those known to be effective in similar contexts.

• Low data deep models for drug discovery: The accuracy of deep-learning models depends on the availability of wellsized training datasets. Unfortunately, these large datasets are often unavailable, or unbalanced (e.g. far more positive examples than negative ones). Therefore, there is a need for generating reliable and statistically sound synthetic datasets. For instance, synthetic sample data (X-ray) is generated by using generative adversarial networks during the training phase of the model. In a recent attempt, a one-shot LSTM framework [139] has been proposed [140] for repurposed drug discovery in cases of low data availability [141] . A similar method has yet to be designed and implemented for COVID-19.

• Explainable artificial intelligence for a more reliable diagnostic systems: Diagnosis and drug discovery are two of the most sensitive tasks requiring very high predictive accuracy. Due to the phenotype similarity between COVID-19 infection and pneumonia, differentiating early symptoms of COVID-19 chest infection can be a challenging task. Explainable artificial intelligence [142] is an innovative concept enabling both researchers and domain experts to trace back the results obtained from the AI model from output features to the input features, thus allowing for a clearer interpretation of data and information. Explainable AI models may be implemented in COVID-19 image-based clinical diagnostic systems for earlier and more reliable prediction.

Pietro H. Guzzi and Swarup Roy conceived and designed the study and equally shared work responsibility and guidance. Swarup Roy defined the manuscript structure. Jayanta Kumar Das and Giuseppe Tradigo equally contributed to the definition of the study and writing. Pierangelo Veltri supervised the study, and he is responsible in reviewing the final version.

• We present a comprehensive review of the in silico approaches adopted so far to handle COVID-19 in genomics, proteomics, interactomics, epidemics, clinical imaging and drug design; • We present data analytics tasks related to COVID-19;

• We present an overall landscape suggesting strategies to integrate heterogeneous COVID-19 data sources; • We report data sources, models and tools, which can be used to study SARS-CoV-2 and COVID-19; • We take a look at computational biology and bioinformatics approaches available in the literature.

Covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks

Sars-cov-2 envelope and membrane proteins: structural differences linked to virus characteristics?

Association of the covid-19 pandemic with internet search volumes: a Google trendstm analysis

Dimensions: building context for search and evaluation

Google scholar: the new generation of citation indexes

Network medicine: a network-based approach to human disease

Comorbidity and its impact on patients with covid-19

Human viruses in water, wastewater and soil

Data mining and life sciences applications on the grid

Master regulator analysis of the sars-cov-2/human interactome

Sars-cov-2 is an appropriate name for the new coronavirus

The proximal origin of sars-cov-2

The novel coronavirus originating in Wuhan, China: challenges for global health governance

Characteristics of and important lessons from the coronavirus disease 2019 (covid-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention

Neuroinfection may contribute to pathophysiology and clinical manifestations of covid-19

Covid-19 and the cardiovascular system

Covid-19 for the cardiologist: a current review of the virology, clinical epidemiology, cardiac and other clinical manifestations and potential therapeutic strategies

Sharing mass spectrometry data in a grid-based distributed proteomics laboratory

The architecture of sars-cov-2 transcriptome

Coresnp: parallel processing of microarray data

Proteomic and metabolomic characterization of covid-19 patient sera

Comparative analysis of nuclear estrogen receptor alpha and beta interactomes in breast cancer cells

Protein-to-protein interactions: technologies, databases, and algorithms

A sars-cov-2 protein interaction map reveals targets for drug repurposing

Biological Network Analysis: Trends, Approaches, Graph Theory, and Algorithms

Hepatitis c virus infection protein network

Epstein-Barr virus and virus human protein interaction maps

Virus-host interactomes and global models of virus-infected cells

Virusmint: a viral protein interaction database

Viruses. string: a virus-host protein-protein interaction database

Hpidb 2.0: a curated database for host-pathogen interactions

Virusmentha: a new resource for virus-host protein interactions

Virhostnet 2.0: surfing on the web of virus/host molecular interactions data

Structural genomics of sarscov-2 indicates evolutionary conserved functional regions of viral proteins

The imex coronavirus interactome: an evolving map of coronaviridae-host molecular interactions

Ct features of coronavirus disease 2019 (covid-19) pneumonia in 62 patients in Wuhan, China

An interactive web-based dashboard to track covid-19 in real time

Open access epidemiological data from the covid-19 outbreak

Drugs, their targets and the nature and number of drug targets

Machine learning applications in drug development

Epidemiological data from the covid-19 outbreak, real-time case information

Social media, open science, and data science are inextricably linked

Pre-processing: a data preparation step

Functional module extraction by ensembling the ensembles of selective module detectors

Deep learning in neural networks: an overview

Semi-supervised classification with graph convolutional networks

Covidgan: data augmentation using auxiliary classifier gan for improved covid-19 detection

A combined deep CNN-LSTM network for the detection of novel coronavirus (covid-19) using x-ray images

Generative adversarial nets

Linear regression analysis to predict the number of deaths in India due to sars-cov-2 at 6 weeks from day 0 (100 cases

An approach to find embedded clusters using density based techniques

Genomic variance of the 2019-ncov coronavirus

A comparative sequence analysis to revise the current taxonomy of the family coronaviridae

Cluster of Middle East respiratory syndrome coronavirus infections in Iran

Complete genome sequence of transmissible gastroenteritis coronavirus pur46-mad clone and evolution of the purdue virus cluster

Analysis of therapeutic targets for sars-cov-2 and discovery of potential drugs by computational methods

Probable pangolin origin of sars-cov-2 associated with the covid-19 outbreak

Jalview version 2-a multiple sequence alignment editor and analysis workbench

Mafft multiple sequence alignment software version 7: improvements in performance and usability

Distribution of covid-19 and phylogenetic tree construction of sars-cov-2 in Indonesia

Genotyping coronavirus sars-cov-2: methods and implications

Genomic determinants of pathogenicity in sars-cov-2 and other human coronaviruses

Comparative analysis of human coronaviruses focusing on nucleotide variability and synonymous codon usage pattern

Molecular evolution and phylogenetic analysis of sars-cov-2 and hosts ace2 protein suggest Malayan pangolin as intermediary host

Sars-cov-2/covid-19: viral genomics, epidemiology, vaccines, and therapeutic interventions

Bat origin of a new human coronavirus: there and back again

Sars-cov-2 molecular and phylogenetic analysis in covid-19 patients: a preliminary report from Iran

Phylogenetic network analysis of sars-cov-2 genomes

Genomewide analysis of sars-cov-2 virus strains circulating worldwide implicates heterogeneity

Investigating the genomic landscape of novel coronavirus (2019-ncov) to identify non-synonymous mutations for use in diagnosis and drug design

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study

Ppr-meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning

Structural and functional properties of sars-cov-2 spike protein: potential antivirus drug development for covid-19

Composition and divergence of coronavirus spike proteins and host ace2 receptors predict potential intermediate hosts of sars-cov-2

Improved protein structure prediction using potentials from deep learning

Improved protein structure prediction using predicted interresidue orientations

Knowledge based modelling of homologous proteins, part i: threedimensional frameworks derived from the simultaneous superposition of multiple structures

Swiss-model: homology modelling of protein structures and complexes

Pymod 2.0: improvements in protein sequence-structure analysis and homology modeling within pymol

The i-tasser suite: protein structure and function prediction

Structure, function, and antigenicity of the sars-cov-2 spike glycoprotein

Prediction of sars-cov2 spike protein epitopes reveals HLA-associated susceptibility

Transcriptomic analysis of covid-19 lungs and bronchoalveolar lavage fluid samples reveals predominant b cell activation responses to infection

Metascape provides a biologist-oriented resource for the analysis of systemslevel datasets

xcell: digitally portraying the tissue cellular heterogeneity landscape

Immune and metabolic signatures of covid-19 revealed by transcriptomics data reuse

Host transcriptome-guided drug repurposing for covid-19 treatment: a meta-analysis based approach

Sars-cov-2 transcriptome analysis and molecular cataloguing of immunodominant epitopes for multi-epitope based vaccine design

Plasma metabolomic and lipidomic alterations associated with COVID-19

Multi-omics data integration, interpretation, and its application

Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine

Integration of metabolomics, lipidomics and clinical data using a machine learning method

Humanized single domain antibodies neutralize sars-cov-2 by targeting the spike receptor binding domain

Herpesviral protein networks and their interaction with the human proteome

The landscape of human proteins interacting with viruses and other pathogens

Herpesviral protein networks and their interaction with the human proteome

Interactome networks and human disease

Preprocessing of mass spectrometry proteomics data on the grid

µ-cs: an extension of the tm4 platform to manage affymetrix binary data

Network propagation: a universal amplifier of genetic associations

Multiple network alignment via multimagna++

Cluenet: clustering a temporal network based on topological similarity rather than denseness

Integrated analysis of micrornas, transcription factors and target genes expression discloses a specific molecular architecture of hyperdiploid multiple myeloma

ProphTools: general prioritization tools for heterogeneous biological networks

Integrative methods for analyzing big data in precision medicine

Covid-19: viral-host interactome analyzed by network based-approach model to study pathogenesis of sars-cov-2 infection

Impact analysis of sars-cov2 on signaling pathways during covid19 pathogenesis using codon usage assisted host-viral protein interactions

Integrating imaging and omics data: a review

Hi-covidnet: deep learning approach to predict inbound covid-19 patients and case study in South Korea

Deep learning system to screen coronavirus disease 2019 pneumonia

Inf-net: automatic covid-19 lung infection segmentation from ct images

Efficient and effective training of covid-19 classification networks with self-supervised dualtrack learning to rank

Initial ct findings and temporal changes in patients with the novel coronavirus pneumonia (2019-ncov): a study of 63 patients in Wuhan, China

New machine learning method for image-based diagnosis of covid-19

Modelling the covid-19 epidemic and implementation of population-wide interventions in Italy

The reproductive number of covid-19 is higher compared to sars coronavirus

In PCOS patients the addition of low-dose spironolactone induces a more marked reduction of clinical and biochemical hyperandrogenism than metformin alone

Critical care utilization for the covid-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response

Analyzing the epidemiological outbreak of covid-19: a visual exploratory data analysis approach

Case-fatality rate and characteristics of patients dying in relation to covid-19 in Italy

Epidemiological characteristics of covid-19: a systemic review and metaanalysis

Time series forecasting of covid-19 transmission in Canada using LSTM networks

Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2

Coronavirus puts drug repurposing on the fast track

Master regulators used as breast cancer metastasis classifier

A humanized neutralizing antibody against mers-cov targeting the receptorbinding domain of the spike protein

Artificial intelligence in covid-19 drug repurposing

Repurpose open data to discover therapeutics for covid-19 using deep learning

Baricitinib as potential treatment for 2019-ncov acute respiratory disease

Covid-predictor: RNA sequence based prediction of coronavirus

deepdr: a network-based deep learning approach to in silico drug repositioning

kgcn: a graph-based deep learning framework for chemical structures

Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More

D3targets-2019-ncov: a webserver for predicting drug targets and for multi-target and multi-site based virtual screening against covid-19

Exploring the sars-cov-2 virus-host-drug interactome for drug repurposing

Intelligent healthcare informatics in big data era

Is one-shot learning a viable option in drug discovery?

Low data drug discovery with one-shot learning

Deep transferable compound representation across domains and tasks for low data drug discovery

Defense Advanced Research Projects Agency (DARPA), and Web