key: cord-0699342-276nonzo
authors: Caruso, Francesca P; Scala, Giovanni; Cerulo, Luigi; Ceccarelli, Michele
title: A review of COVID-19 biomarkers and drug targets: resources and tools
date: 2020-12-07
journal: Brief Bioinform
DOI: 10.1093/bib/bbaa328
sha: 63b526ff311c23f32f7d61b6363e070e854fe894
doc_id: 699342
cord_uid: 276nonzo

The stratification of patients at risk of progression of COVID-19 and their molecular characterization are of extreme importance to optimize treatment and to identify therapeutic options. The bioinformatics community has responded to the outbreak emergency with a set of tools and resources to identify biomarkers and drug targets, which we review here. Starting from a consolidated corpus of 27 570 papers, we adopt latent Dirichlet allocation to extract relevant topics and select those associated with computational methods for biomarker identification and drug repurposing. The selected topics span from machine learning and artificial intelligence for disease characterization to vaccine development and therapeutic target identification. Although the road to the ultimate defeat of the pandemic is still long, the amount of knowledge, data and tools generated so far constitutes an unprecedented example of global cooperation against this threat.

The crisis generated by the spread of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the corresponding COVID-19 disease was declared a pandemic by the World Health Organization on 11 March 2020. The origin of SARS-CoV-2 was traced to the Huanan Seafood Wholesale Market in the city of Wuhan, China. The causative pathogen was identified as a betacoronavirus with high sequence homology to bat coronaviruses (CoVs) that uses the angiotensin-converting enzyme 2 (ACE2) receptor as the dominant mechanism of cell entry [1]. Human-to-human transmission events were confirmed, with clinical presentations ranging from no symptoms to mild fever, cough and dyspnea to cytokine storm, respiratory failure and death. The scientific community responded to the crisis with an extraordinary effort involving thousands of scientists and hundreds of laboratories worldwide. This produced a vast amount of biological data, allowing the computational biology community to characterize the molecular bases of the disease and the spread and evolution of the virus, and to identify potential drugs.

The identification of biomarkers for stratification of patients at risk of progression of COVID-19 and their molecular characterization is of extreme importance to optimize treatment and to identify therapeutic options.

We refer to a biomarker as a measurable characteristic (e.g. the expression level of a group of genes) used as an indicator of normal biological processes, pathogenic processes or responses to an exposure or intervention [2, 3]. Depending on the context of use, a biomarker can be categorized as a susceptibility/risk, diagnostic, monitoring, prognostic, predictive, pharmacodynamic/response or safety biomarker. It is important to distinguish between prognostic biomarkers, which identify patients more likely to have a particular outcome independently of treatment, and predictive biomarkers, which involve a comparison of a treatment to a control in patients with and without the biomarker.

Several prognostic COVID-19 biomarkers predicting disease severity have already been validated in clinical settings [4]. Among the biomarkers that segregate severe from non-severe patients, obtained by retrospective analysis of large cohorts, of particular interest are those associated with dysregulation of the immune response. Infection-related biomarkers, such as inflammatory cytokines TNFα, interleukins IL-2R and IL-6 and other blood cell counts, are found at much higher levels in severe groups than in the non-severe group [5], whereas the platelet count tends to be significantly decreased in severe cases [6]. Genome-wide association studies have also identified a gene cluster on chromosome 3 as the major genetic risk factor for severe SARS-CoV-2 infection and hospitalization [7, 8]. This genomic segment of 50 kb is inherited from Neanderthals and is carried by about 50% of people in South Asia and about 16% of people in Europe today [9]. Other prognostic biomarkers of disease progression and mortality are related to the cardiovascular damage involved in COVID-19, such as cardiac troponin [10], or to the occurrence of chronic kidney disease, where an increase in creatinine levels is observed in severe patients [11]. Beyond these clinical biomarkers, there is already a vast literature on molecular biomarkers that characterize the disease associated with SARS-CoV-2 infection and that can be exploited to identify therapeutic targets.

In this paper, we focus on Bioinformatics resources, tools and approaches connected to molecular COVID-19 biomarkers. To this aim, we needed to address the vast amount of information produced by the recent explosion of COVID-19-related scientific literature.

The paper is organized following the induced set of biomarker-related topics as follows: the next section describes the methods adopted to mine COVID-19-related scientific literature and to extract relevant topics; in Section 5, we report machine learning tools developed to characterize COVID-19 disease, especially from the image scans; Section 6 describes the relevant molecular datasets available for the characterization of COVID-19 biomarkers from genomics and proteomics profiling; Section 7 focuses on immune repertoire sequencing and antibody isolation; Section 8 collects methods and tools related to vaccine development; and finally Section 9 reports approaches and tools for the discovery of therapeutic targets.

We adopted latent Dirichlet allocation to extract relevant topics from over 27 000 research papers that appeared in the past 10 months and were indexed in PubMed or uploaded to preprint servers such as bioRxiv and medRxiv [12]. The overall procedures, implemented in Python and R, including details about the adopted analysis, are reported at https://github.com/bioinformaticssannio/covidLiterature. We started with a set of 27 894 articles downloaded on 20 June 2020 from LitCovid, a curated open-resource literature hub of PubMed research papers related to COVID-19 [13], and from the COVID-19/SARS-CoV-2 collection of the medRxiv and bioRxiv preprint servers. The text content of each document, composed by joining the article's title and abstract, was tokenized, stemmed and filtered for stopwords. Duplicates, due to papers published in PubMed that were also available on preprint servers, were removed by comparing vectors of term frequencies with the cosine distance, obtaining a consolidated corpus of 27 570 papers. In this corpus, we discovered an optimal set of 36 topics showing the lowest perplexity (Supplemental Figure 1). Among them, we selected five topics: topic #0, strongly related to computational models, and four topics related to biomarker research (Figure 1). From the consolidated corpus, we selected papers with content associated with this set of topics. Specifically, we considered papers not distributed over many topics (i.e. with a Shannon entropy less than half of its maximum, $\frac{1}{2}\log_2 36 = -\frac{1}{2}\log_2\left(\frac{1}{36}\right)$) and having one of the biomarker-related topics shown in Figure 1 and Table S1 as the most probable. The final set of 3032 papers, which we made available as Table S2, was manually evaluated and the most relevant are discussed in this work.
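The two filtering steps described above (duplicate detection by cosine similarity of term-frequency vectors, and selection of papers whose topic distribution is concentrated) can be sketched as follows. This is a minimal illustration and not the code from the authors' repository: the function names, the example topic distributions and the use of plain token lists are assumptions made here, and the set of selected topic numbers is taken from the topic descriptions given later in the paper.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a topic distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def keep_paper(topic_probs, selected_topics):
    """Keep a paper if its entropy is below half of the maximum (log2 of the
    number of topics) and its most probable topic is one of the selected ones."""
    max_entropy = math.log2(len(topic_probs))
    top_topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    return (shannon_entropy(topic_probs) < 0.5 * max_entropy
            and top_topic in selected_topics)


N_TOPICS = 36
SELECTED = {0, 1, 20, 27, 33}  # topic numbers described later in the paper

# Hypothetical paper concentrated on topic 20.
focused = [0.2 / 35] * N_TOPICS
focused[20] = 0.8
# Hypothetical paper spread uniformly over all topics.
diffuse = [1 / N_TOPICS] * N_TOPICS

print(keep_paper(focused, SELECTED), keep_paper(diffuse, SELECTED))
```

In practice the topic distributions would come from the fitted topic model, and the similarity threshold used to call two records duplicates is a choice the text does not specify.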

COVID-19 literature can also be mined to extract valuable information about the molecular elements (e.g. genes, proteins) that received the most attention in this particular subset of scientific literature. Here we considered gene attention and report two different analyses: the first shows genes that received more attention in the selection of manuscripts covered in this review than in all manuscripts published in the same time frame, and the second shows genes that received more attention in the first half of 2020 (the pandemic time window) than in the whole of 2019. The concept of attention can be formulated in different ways; here we chose the number of manuscripts citing a gene as a proxy for the attention received by that gene.

The association between genes and citing manuscripts was obtained from the NCBI NIH gene2pubmed table [14], while the temporal information associated with each manuscript was obtained using the NCBI NIH PMC-ids table [15] and the RISmed R package [16]. For the analyses presented in this review, we only considered human, mouse, rat and SARS-CoV-2 genes by filtering the gene2pubmed table for accession IDs of genes annotated for these species and mapping the corresponding ENTREZ gene IDs to gene symbols. Given a set of manuscripts, we computed the attention score of a given gene in that set by counting the number of manuscripts in the set that cite it.

For the first analysis, we selected 15 652 gene/manuscript associations from the filtered gene/manuscript table, covering 2904 manuscripts published from 1 January 2020 to 17 June 2020. This set of articles was intersected with the selection of COVID-19 manuscripts provided here, generating a partition of 182 COVID-19 gene-citing manuscripts and 2722 non-COVID-19 gene-citing manuscripts. For each gene symbol, we then compared the number of times COVID-19 manuscripts cited the gene with the corresponding number of citations in non-COVID-19 manuscripts, computed its statistical enrichment using Fisher's exact test and finally corrected all P-values with the false discovery rate (FDR) procedure.
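The enrichment test can be illustrated with a small pure-Python sketch. Several points are assumptions made here rather than details given in the text: the 2x2 table layout (manuscripts citing or not citing the gene, in the COVID-19 and non-COVID-19 sets), the one-sided alternative, the use of the Benjamini-Hochberg procedure for the FDR correction, and the example counts.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of seeing at least `a` in the top-left cell under the
    hypergeometric null with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, k_max + 1))
    return tail / comb(n, col1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: a gene cited by 12 of 182 COVID-19 manuscripts
# and by 10 of 2722 non-COVID-19 manuscripts.
p_gene = fisher_enrichment_p(12, 182 - 12, 10, 2722 - 10)
print(p_gene)
```

With real data, one P-value would be computed per gene symbol and the whole vector passed to `benjamini_hochberg`.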

For the second analysis, we selected 76 658 gene/manuscript associations from the gene/manuscript table: 16 480 associations covering 3324 manuscripts published during 2020 and 60 178 associations covering 20 208 manuscripts published in 2019. We then selected the genes cited in at least five manuscripts during 2020 and ranked them based on their attention score in each considered year (2019 and 2020). We defined Δ-rank as the difference (positive or negative) in the rank of each gene between the 2 years, as a measure of the gain or loss of attention for each gene between 2019 and 2020. We assigned an empirical P-value to the Δ-rank of each gene using a bootstrap procedure (1000 iterations) in which the procedure described above was applied to a random selection of 16 480 gene/manuscript associations, with gene IDs shuffled with respect to manuscripts in each realization.
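A rough sketch of the rank-difference computation and of a resampling-based empirical P-value is given below. It is a simplified reading of the procedure, not a reproduction of it: the sign convention (positive means a gain of attention), the tie-breaking rule, the null model (pooled associations randomly reassigned to the two years) and all names are assumptions introduced here, and the filter on genes cited in at least five manuscripts is omitted.

```python
import random
from collections import Counter


def ranks(scores):
    """Rank genes by attention score (rank 1 = most cited); ties are broken
    by gene name so that the result is deterministic."""
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {gene: pos + 1 for pos, gene in enumerate(ordered)}


def delta_rank(scores_prev, scores_curr):
    """Rank in the earlier year minus rank in the later year; a positive
    value means the gene moved up the ranking (gained attention)."""
    genes = set(scores_prev) | set(scores_curr)
    prev = ranks({g: scores_prev.get(g, 0) for g in genes})
    curr = ranks({g: scores_curr.get(g, 0) for g in genes})
    return {g: prev[g] - curr[g] for g in genes}


def attention(pairs):
    """Attention score per gene from (gene, manuscript) pairs."""
    return Counter(gene for gene, _ in pairs)


def empirical_p(pairs_prev, pairs_curr, gene, n_iter=1000, seed=0):
    """Fraction of resampled realizations whose delta-rank for `gene` is at
    least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = delta_rank(attention(pairs_prev), attention(pairs_curr))[gene]
    pooled = list(pairs_prev) + list(pairs_curr)
    n_curr = len(pairs_curr)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        null = delta_rank(attention(pooled[n_curr:]), attention(pooled[:n_curr]))
        if null[gene] >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)


# Hypothetical attention scores for two years.
d = delta_rank({"ACE2": 1, "TP53": 10, "IL6": 5},
               {"ACE2": 20, "TP53": 8, "IL6": 9})
print(d["ACE2"], d["IL6"], d["TP53"])
```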

Genes that received significant attention in the subset of scientific literature considered, both in terms of enrichment of attention score and of significant Δ-rank, are shown in Figure 1A-B; they include important genes involved in SARS-CoV-2 pathogenesis (Figure 1C). The main mechanism of adhesion and viral entry into the cell involves the viral Spike (S) protein, which binds the human ACE2 receptor through its receptor-binding domain (RBD) with a binding affinity 10 times higher than that of the spike protein of the SARS virus. The very efficient cellular entry of SARS-CoV-2 is also due to the action of the Furin enzyme, which is expressed in significant concentrations in the lung and activates the spike protein [17, 18]. Recent evidence suggests that many other genes may contribute to virus entry, and these are being studied as potential therapeutic targets in the treatment of coronavirus infections. For example, the host cell protease TMPRSS2 primes the spike protein [19, 20]; the membrane protein DPP4 acts as a co-receptor of SARS-CoV-2 and is a key factor for hijacking and virulence in the respiratory tract [21]; and the AAK1 gene is a known regulator of clathrin-mediated endocytosis [22]. The uncontrolled and excessive release of proinflammatory cytokines and chemokines (such as IL-1β, IL-6, IL-12, CXCL8, CXCL9, CXCL10, IFNs and TNF) is the most damaging and potentially fatal effect of COVID-19 and is therefore the subject of several studies. The IL-6 gene is the main prognostic biomarker, since it plays a key role in the cytokine storm and high levels of this cytokine are associated with respiratory failure and mortality risk [23]. Unfortunately, the efficacy of cell-mediated immunity against SARS-CoV-2 is still unclear, and many studies aim to clarify the role of T cells in the resolution of COVID-19 [24]. Recent evidence has shown an increase in the expression of the CD8 T cell marker (CD8A) in COVID-19 patients, supporting the hyper-activation of cytotoxic T lymphocytes [25].

We adopted a semiautomatic approach, based on topic analysis, to select a manageable set of resources that can be manually reviewed from over 27 000 research papers that appeared from September 2019 onwards and were indexed by PubMed or uploaded to preprint servers such as bioRxiv and medRxiv. From the overall corpus of documents, we induced 36 relevant topics, 5 of which are associated with biomarkers and are depicted in Figure 2, whereas the breakdown of the papers per topic is summarized in Table S1.

Topic #0 refers to the use of artificial intelligence (AI), in particular deep learning approaches, for the analysis of biomedical images, such as computed tomography (CT) scans or lung ultrasonography (LUS) images, to diagnose and predict the prognosis of COVID-19 patients. Topic #1 is related to the study of neutralizing antibodies and the cellular immune response to SARS-CoV-2 and focuses on the design of serological tests to identify seroconversion prognostic biomarkers. Topic #20 is about drug discovery and is specific to the structural and functional analysis of SARS-CoV-2 to identify therapeutic targets. Topic #27 is related to the discovery of biomarkers that trigger an immune response and could be adopted for vaccine development. Topic #33 covers genome- and proteome-wide studies with publicly available datasets, a valuable source of information for biomarker discovery.

Imaging is the main tool for the identification of patients at higher risk of developing acute respiratory failure due to SARS-CoV-2 pneumonia [26]. Lesion characteristics such as number, size and density, together with bilateral and multi-lobar ground-glass opacities (mainly posteriorly and/or peripherally distributed), are indicators of lung damage and remaining lung reserve [27]. They are effectively used as biomarkers to train automatic diagnostic systems, to assist the accurate diagnosis of disease severity and to distinguish between normal and SARS-CoV-2 pneumonia. In [28], the authors collected a dataset of 532 506 CT scans from 3777 patients for the purpose of training a diagnostic system (Table 1) and showed that a convolutional neural network, adapted from 3D ResNet-18 and trained on lung-lesion maps obtained by different automatic segmentation algorithms, achieves 92.49% accuracy, 94.93% sensitivity, 91.13% specificity and an area under the curve (AUC) of 0.9797 [29]. The use of multiple features, such as texture, surface, volume histogram and intensity, has also been shown to improve the diagnostic accuracy of chest CT scans up to 93.9% [30]. As an alternative to CT scans, lung ultrasound (LUS) has been shown to be a more widely available, cost-effective, safe and real-time imaging technique [31].
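For readers less familiar with the performance figures quoted above, the following sketch shows how accuracy, sensitivity, specificity and AUC are derived from a classifier's output. The counts and scores are hypothetical and are not taken from the cited studies; the function names are introduced here for illustration.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate) and specificity
    (true negative rate) from the four cells of a confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative one
    (ties count as one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical test set: 100 diseased and 100 healthy scans.
metrics = diagnostic_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```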

The genomic sequence of SARS-CoV-2 has 29 903 nucleotides [1] and is available with accession number NC_045512.2. It has 89.1% similarity with a bat SARS-like coronavirus (CoV) isolate, bat SL-CoVZC45 (accession number MG772933), and is organized in replicase ORF1ab (21 291 nt), spike (3822 nt), ORF3a (828 nt), envelope (228 nt), membrane (669 nt) and nucleocapsid (1260 nt). As of 21 June 2020, a total of 49 239 sequences had been deposited in the GISAID EpiFlu Database (www.gisaid.org), which is the main source of genomic data associated with SARS-CoV-2 [32]. To gain insight into the complex pathogenesis caused by the novel coronavirus, single-cell sequencing (scRNA-seq), RNA sequencing (RNA-seq), adaptive immune receptor repertoire sequencing (AIRR-seq), image datasets and proteomic assays have been massively adopted to unveil the characteristics of the immune response triggered in patients affected by COVID-19. Single-cell sequencing is often combined with RNA or AIRR sequencing, since profiling the pulmonary microenvironment and the peripheral immune response helps to reveal potential mechanisms underlying the pathogenesis of COVID-19 and to identify diagnostic and therapeutic biomarkers. Most of the data are available in public databases, such as Gene Expression Omnibus (GEO) [33], Sequence Read Archive (SRA) [34], European Nucleotide Archive (ENA) [35], European Genome-phenome Archive (EGA) [36] and Genome Sequence Archive (GSA) [37]. Single-cell transcriptomic data can be interactively explored through the Single Cell Portal [38]. Table 1 provides a curated list of publicly available datasets: 28 transcriptomic studies, 2 image datasets and 6 proteomic studies. The list contains 14 scRNA-seq studies derived from peripheral blood mononuclear cells (PBMCs) (n = 8), nasopharyngeal swabs and bronchial branches (n = 1), bronchoalveolar lavage fluid (BALF) (n = 1) and lung tissue (n = 1) in COVID-19 patients. There are also scRNA-seq datasets of lung organoids (n = 2) and human cell lines infected with SARS-CoV-2 (n = 3). Similarly, there are 9 RNA-seq studies that include datasets of infected human cell lines (n = 4) and organoids (n = 3), nasopharyngeal swabs (n = 1), BALF and PBMCs (n = 1) and several tissues (lung, heart, liver, kidney, bowel, skin, fat, marrow) (n = 2) from COVID-19 patients. The AIRR-seq datasets, including data from BCR, TCR, IGH and antibody sequencing, were derived from PBMCs (n = 10) and BALF (n = 1) in COVID-19 patients.

Proteomics datasets are created in this context to characterize the set of SARS-CoV-2 encoded proteins and to investigate their interaction with the human proteome during the different phases of the infection. Gordon et al. [71] recently developed a protein interaction map by expressing all of the 29 SARS-CoV-2 proteins in human cells and then assessing their affinity with human proteins by means of affinity-purification mass spectrometry, obtaining a list of 332 SARS-CoV-2-human protein-protein interactions that is available as a supplementary file at [72]. Bojkova [79]. The machine learning-based approach has highlighted important changes in the serum of COVID-19 patients involving the deregulation of complement system processes, macrophage and platelet activity and metabolic suppression. All data are deposited in the ProteomeXchange Consortium (Table 1).

SARS-CoV-2 infection affects adaptive immunity, immune cell architecture and function [80]. Exposure to viral antigens stimulates the cellular immune response of T cells and the humoral immune response of B cells, which can be studied in detail through high-throughput sequencing of the immune repertoire. The analysis of the sequences of T and B cell repertoires for different cohorts of patients, from non-hospitalized infected patients to patients with severe symptoms, may reveal the nature of protective versus detrimental B and T cell responses and can be used as a prognostic biomarker. For example, highly clonal T cell repertoires in active COVID-19 patients, compared with patients who recovered from COVID-19 without medical intervention, have recently been reported [62]. The Adaptive Immune Receptor Repertoire Community (AIRR-C) has defined standards for sharing and interoperability of B-cell and T-cell receptor repertoires [81], and sequences are being deposited in multiple repositories such as [82], which (at the date of writing this paper) contains 178 190 149 sequences from 285 patients. T and B cell sequencing is important for the development of monoclonal antibodies against SARS-CoV-2 and also to determine the optimal T cell engagement strategy for vaccine development. SARS-CoV-2-reactive and neutralizing antibodies have now been isolated from COVID-19 survivors. Neutralizing antibodies could block viral entry by preventing the S protein from binding to host cell receptors, such as ACE2. Neutralizing antibodies could also mimic receptor binding and prematurely trigger fusogenic conformational changes in the S protein before it engages ACE2. The Coronavirus Antibody Database, CoV-AbDab [83], is a publicly available resource to query and download coronavirus-binding antibody sequences and structures. It currently contains 460 records. A recent study isolated 19 antibodies with high neutralizing power from patients infected with SARS-CoV-2 [84]. This collection includes antibodies directed towards the spike protein RBD, which compete strongly with the ACE2 protein and are promising candidates for vaccine development, and non-RBD antibodies, which are instead mainly directed towards the NTD. The sequences of these 19 antibodies are deposited in GenBank.

Although several research groups around the world are engaged in the development of a vaccine against SARS-CoV-2, currently there are no approved treatments for humans.

Reverse vaccinology is a methodology that uses bioinformatics tools and genomic data for the identification of pathogen antigens [85]. In silico vaccine development improves the potential for successful vaccine design, reducing the time and cost needed to identify effective epitopes that could trigger the immune response without causing disease [86]. Figure 3 shows a general workflow of in silico vaccine development, including the main resources used in COVID-19 vaccine discovery so far. Initially, the amino acid sequences of proteins that are potentially antigenic or essential for virus replication must be retrieved from sequence databases, such as GenBank [87]. The nucleocapsid (N) protein of SARS-CoV-2 is a suitable vaccine candidate because it is a crucial structural protein, highly conserved and with antigenic properties [88]. Other structural and non-structural proteins, such as the membrane (M) protein, the spike glycoprotein (S) and open reading frame 3a (ORF3a), are also putative antigenic targets in vaccine design [89]. The identification of antigenic proteins and the prediction of T-cell and B-cell epitopes are major steps in developing an in silico vaccine. Supplemental Table S3 provides a list of the main bioinformatics resources useful for the prediction of MHC Class-I and II epitopes. Prediction tools for continuous B cell epitopes and T cell epitopes are very similar and include algorithms based on (i) machine learning and artificial neural network (ANN) approaches (e.g. NetMHC, NetMHCII, NetCTL, nHLAPred, BepiPred, MHC2Pred, SVMHC); (ii) amino acid properties and secondary structure (e.g. VaxiJen, MHCPred, Bcepred, SEPPA); and (iii) position-specific scoring matrices (PSSMs) (e.g. RANKPEP). In contrast, discontinuous B cell epitope prediction employs resources based on the resolved 3D structure of the antigen (e.g. Discotope, ElliPro). Many other online tools are also available to analyze physicochemical properties and allergenicity and to predict the secondary and tertiary structure of vaccine candidates (Figure 3). EPV-CoV19, a candidate vaccine in the clinical trial phase, was entirely designed using the iVax Toolkit [90], a web-based work environment including several computational immunology tools to develop epitope-driven in silico vaccines.
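To give a concrete idea of how a position-specific scoring matrix is used to scan a protein for candidate epitopes, the sketch below slides a window over a sequence and sums per-position residue scores. It illustrates the general PSSM technique only and does not reproduce RANKPEP or any other tool named above: the matrix values, the 3-residue window (MHC class I binders are typically around nine residues long) and the sequence are all hypothetical.

```python
def score_peptide(peptide, pssm, default=-1.0):
    """Sum the position-specific scores of a peptide; residues that are
    missing from a column receive the `default` penalty."""
    return sum(col.get(res, default) for res, col in zip(peptide, pssm))


def scan_protein(sequence, pssm):
    """Score every window of len(pssm) residues and return
    (score, start, peptide) tuples sorted from best to worst."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        peptide = sequence[start:start + width]
        hits.append((score_peptide(peptide, pssm), start, peptide))
    return sorted(hits, reverse=True)


# Toy 3-column matrix with made-up scores (one dict per peptide position).
TOY_PSSM = [
    {"L": 2.0, "A": 0.5},
    {"K": 1.0},
    {"V": 2.0, "L": 1.5},
]

best = scan_protein("MALKVLKL", TOY_PSSM)[0]
print(best)
```

A real matrix would be derived from aligned peptides known to bind a given MHC allele, and the top-scoring windows would then be passed to the downstream antigenicity and allergenicity checks mentioned in the text.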

Biomarkers for drug repurposing (or drug targets) are molecular elements that are part of the pathophysiologic mechanism of action of a disease. In the context of viral infection, such elements are represented by (i) viral targets, proteins encoded by the viral genome that are essential to the infection process; (ii) viral/host interactors, host proteins that directly interact with viral proteins, acting as entry points for the infection process; and (iii) host response targets, host proteins that do not directly interact with the viral proteins but whose inhibition/activation is able to block the signaling pathways that are essential for the infection process to succeed. Table 2 shows a list of bioinformatics tools developed for therapeutic target identification that have been applied in the context of COVID-19. Most of them were developed in different contexts (e.g. cancer) and can, in principle, be applied to the target categories described above. Each of the proposed approaches/tools is based on different input structures, which can be classified in the following categories: (i) protein-protein networks along with a selection of subsets of proteins of interest (e.g. COVID-19 direct interactors and drug targets); (ii) transcriptomic networks inferred from infected samples; and (iii) protein/ligand structure and composition.

In the case of viral infections, at least three pieces of information should be modeled within the network structure: (i) virus-host protein interactions, (ii) host protein-protein interactions and (iii) drug-protein interactions. Pure (unimodal) protein-protein network-based approaches consider only proteins as nodes of the network and protein-protein interactions as edges. In multi-modal networks, nodes can be proteins, drugs and diseases, while edges represent interactions among them (protein-protein, drug-protein, drug-disease, drug-drug, disease-protein, disease-disease). The basic idea is that the closer the drug targets are to disease-related components (such as viral-host interactors), the higher the odds that the drug affects the adverse phenotype. A commonly used distance measure between nodes over a graph is the length of the shortest path connecting them. By extending this notion to sets of nodes (e.g. candidate biomarker nodes and COVID-19 nodes) [109], the length of the minimum connecting shortest path (MSP) becomes a proxy for the biomarker or target relevance [110]. The MSP approach has been shown to be an effective metric for ranking and repurposing drugs against COVID-19 infection [92, 93]. This approach was originally developed to repurpose drugs in cancer-derived networks with the GPSnet tool [91] and can be easily adapted to COVID-19 networks, as shown in [92] and [93], where the authors integrated virus/host interactome data from [72] (Table 1) with a human protein-protein interaction network to rank repurposable drugs based on the distance of their molecular targets from COVID-19 nodes.
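The shortest-path proximity idea can be sketched with a breadth-first search over an unweighted interaction network. The code below is an illustration under stated assumptions and is not the scoring used by GPSnet or by the cited studies: the toy network, the node and drug names and the choice of the plain minimum over all target/disease-node pairs are introduced here for the example.

```python
from collections import deque


def build_graph(edges):
    """Undirected graph as {node: set(neighbours)} from a list of edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_distances(graph, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def min_shortest_path(graph, set_a, set_b):
    """Length of the shortest path connecting any node of `set_a` to any
    node of `set_b` (infinity if the two sets are disconnected)."""
    best = float("inf")
    for a in set_a:
        dist = bfs_distances(graph, a)
        for b in set_b:
            best = min(best, dist.get(b, float("inf")))
    return best


def rank_drugs(graph, drug_targets, disease_nodes):
    """Drugs ordered from the closest to the farthest from the disease nodes."""
    scored = [(min_shortest_path(graph, targets, disease_nodes), drug)
              for drug, targets in drug_targets.items()]
    return [drug for _, drug in sorted(scored)]


# Hypothetical host network: a chain of five proteins H1..H5.
PPI = build_graph([("H1", "H2"), ("H2", "H3"), ("H3", "H4"), ("H4", "H5")])
VIRUS_INTERACTORS = {"H1"}
DRUG_TARGETS = {"drugA": {"H2"}, "drugB": {"H4", "H5"}}

print(rank_drugs(PPI, DRUG_TARGETS, VIRUS_INTERACTORS))
```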

Shortest path methods ultimately base their estimate on the length of a single path (i.e. the minimal one); other methods, such as TrustRank [94], try to define the relevance of a node set (the candidate biomarkers) with respect to another (COVID-19 targets) based on global characteristics such as the level of connectivity between the two. TrustRank is a variant of Google's PageRank algorithm and is implemented in the CoVex tool [95]. This method can be used to rank a set of protein nodes based on how well they are connected to a set of trusted seed proteins (e.g. SARS-CoV-2 target proteins from [72], Table 1). In particular, the algorithm propagates trust from the seed nodes to the non-seed nodes and, based on these propagated values, ranks all other protein nodes by their connectivity with the seeds.
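The propagation step can be illustrated with a personalised PageRank computed by power iteration, which is the general idea behind TrustRank. This sketch is not the CoVex implementation: the damping factor, the number of iterations, the handling of isolated nodes and the toy network are assumptions made here.

```python
def trust_rank(graph, seeds, damping=0.85, n_iter=100):
    """Personalised PageRank on an undirected graph {node: set(neighbours)}.
    At each step a walker follows a random edge with probability `damping`
    and otherwise jumps back to a uniformly chosen seed node."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(n_iter):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if graph[n]:
                share = damping * score[n] / len(graph[n])
                for m in graph[n]:
                    nxt[m] += share
            else:
                # Isolated node: return its probability mass to the seeds.
                for s in seeds:
                    nxt[s] += damping * score[n] / len(seeds)
        score = nxt
    return score


# Hypothetical chain A - B - C - D with A as the only trusted seed.
CHAIN = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = trust_rank(CHAIN, {"A"})
ranking = sorted((n for n in CHAIN if n != "A"), key=scores.get, reverse=True)
print(ranking)
```

In this toy chain the non-seed nodes are ranked by how strongly trust flows to them from the seed, which here coincides with their distance from it; on a real interactome, highly connected proteins far from the seeds can outrank poorly connected ones that are closer.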

The Steiner tree problem aims at finding the minimum-cost subgraph connecting a given set of seed nodes. In the case of COVID-19-derived networks, it can be mapped to the problem of finding the minimal subgraph connecting a selection of COVID-19 interactors (acting as seeds), in order to obtain a representation of the mechanism of action related to such interactors and consequently identify potential drug targets and drug candidates. The Steiner tree problem belongs to the class of NP-hard problems, but several efficient approximation algorithms exist for it. An implementation based on finding and merging multiple 2-approximate solutions to the Steiner tree over a protein-protein network and seed nodes selected from [72] (Table 1) is presented in the CoVex tool [95].
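One classical way to obtain a 2-approximate solution is to build a minimum spanning tree over the shortest-path distances between seeds and then expand each tree edge back into its underlying path. The sketch below follows that idea on an unweighted graph. It computes a single solution, whereas CoVex is described as merging several, and it skips the final pruning step of the full algorithm, so the returned edge set is a connected subgraph within the same cost bound rather than a guaranteed tree. The toy graph and names are hypothetical, and the seeds are assumed to be mutually reachable.

```python
from collections import deque


def bfs_parents(graph, source):
    """Parent pointers of a breadth-first search tree rooted at `source`."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def path_to(parent, target):
    """Shortest path from the BFS root to `target` as a list of nodes."""
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]


def steiner_subgraph(graph, seeds):
    """Edges of an approximate Steiner subgraph connecting `seeds`."""
    seeds = sorted(seeds)
    parents = {s: bfs_parents(graph, s) for s in seeds}
    paths = {(s, t): path_to(parents[s], t)
             for s in seeds for t in seeds if s != t}
    # Prim's algorithm on the complete graph of seeds, where the weight of
    # an edge is the length of the shortest path between its endpoints.
    in_tree = {seeds[0]}
    edges = set()
    while len(in_tree) < len(seeds):
        s, t = min(((s, t) for s in in_tree for t in seeds if t not in in_tree),
                   key=lambda st: (len(paths[st]), st))
        in_tree.add(t)
        path = paths[(s, t)]
        edges.update(frozenset(e) for e in zip(path, path[1:]))
    return edges


# Hypothetical network: three seeds joined by a hub C, plus a longer detour.
NET = {
    "C": {"S1", "S2", "S3"},
    "S1": {"C", "X"}, "S2": {"C"}, "S3": {"C", "Y"},
    "X": {"S1", "Y"}, "Y": {"X", "S3"},
}
solution = steiner_subgraph(NET, {"S1", "S2", "S3"})
print(sorted(tuple(sorted(e)) for e in solution))
```

Here the hub C is not a seed but is pulled into the solution because it offers the cheapest way to connect the seeds, which is how non-seed proteins emerge as candidate targets.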

Diffusion-based methods can be used to rank candidate drug-related biomarkers based on a graph diffusion state similarity measure. A diffusion state can be obtained for a node x by computing, for all the other nodes y, the expected number of random walks originating in x and passing through y. This approach has been employed by [93] to score a set of drugs based on the similarity of diffusion states between each drug target node set and the COVID-19 target nodes. This method can be easily implemented using the diffusion state distance (DSD) tool available in the MONET toolbox [96].

Another interesting approach to drug-related biomarker definition is to numerically encode all of the semantics contained in the network under study in a low-dimensional space and look for similarities between encoded entities in this new space using vector-based distance measures. Graph embedding methods are based on neural networks implementing an encoder-decoder architecture, which is able to translate network entities into numeric vectors. It is possible to represent the knowledge network containing interactions between proteins, drugs and diseases in a low-dimensional space where each node of the graph is represented as a numeric vector and distances between points in the encoded feature space are representative of (i) the association between drugs and diseases, (ii) the similarity between diseases and (iii) the similarity between drugs' mechanisms of action. Gysi et al. [93] report an example of drug repositioning based on the embedding of a multi-modal graph containing information on three distinct types of biomedical entities (i.e. drugs, proteins, diseases) and edges representing four types of relationships between the entities (i.e. protein-protein interactions, drug-target associations, disease-protein associations and drug-disease treatments). This approach can be implemented by using an adaptation of the Decagon tool [97], which implements a graph convolutional neural network model for detecting polypharmacy side effects.
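Returning to the diffusion state measure introduced at the start of this section, the sketch below computes, for a start node, the expected number of visits to every node by a random walk of fixed length, and compares two nodes by the L1 distance between these vectors. It is a minimal illustration of the idea and not the MONET DSD tool: the walk length, the toy network and the function names are assumptions made here.

```python
def diffusion_state(graph, source, n_steps=5):
    """Expected number of visits to each node by a random walk that starts
    at `source` and takes `n_steps` steps (the start counts as one visit).
    Nodes are reported in sorted order."""
    nodes = sorted(graph)
    prob = {n: 0.0 for n in nodes}
    prob[source] = 1.0
    visits = dict(prob)
    for _ in range(n_steps):
        nxt = {n: 0.0 for n in nodes}
        for n, p in prob.items():
            if p and graph[n]:
                share = p / len(graph[n])
                for m in graph[n]:
                    nxt[m] += share
        prob = nxt
        for n in nodes:
            visits[n] += prob[n]
    return [visits[n] for n in nodes]


def dsd(graph, a, b, n_steps=5):
    """Diffusion state distance: L1 distance between two diffusion states."""
    state_a = diffusion_state(graph, a, n_steps)
    state_b = diffusion_state(graph, b, n_steps)
    return sum(abs(x - y) for x, y in zip(state_a, state_b))


# Hypothetical network: a short chain of three proteins.
TOY = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(dsd(TOY, "A", "C"))
```

To score a drug against a disease, such distances would be aggregated over the drug's target set and the disease node set; the aggregation rule is a modelling choice not covered by this sketch.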

While the previous strategies can be more suited to target viral/host interactors, functional annotation-based approaches can be used to identify biomarkers related to the host response to the infection. These approaches can exploit omics data generated from infected samples to infer activated protein modules and/or biochemical pathways that in turn can be used to produce biomarker-targets for drug repositioning.

Li et al. [111] followed this kind of approach using transcriptomic data of infected NHBE, A549_ACE2 and Calu3 human lung epithelial cells from [47] (Table 1) and their normal counterparts to identify differentially expressed genes and the dysfunctional KEGG signaling pathways they activate. Drug bank data were then exploited to find drugs potentially inhibiting one or more of the discovered pathways.

Master regulator analysis (MRA) exploits network models derived from omics assays [112]. In the context of viral infection, a master regulator (MR) can be identified as a regulatory protein whose activity is sufficient to determine the success of the infection process. In this setting, the concept of tumor checkpoint [112] (i.e. a hyperconnected and autoregulated module built around MR proteins) can also be translated into the concept of infection checkpoint and thus regarded as a biomarker. In particular, it is possible to extrapolate a set of crucial biomarkers of the infection process, constituted by modules (subnetworks) linked to an MR, i.e. a key responsive transcriptional regulator along with its direct targets. The VIPER tool [98] can be used to identify transcription factors controlling the infection process given a regulatory network built over infection transcriptome data. This approach has been implemented in Laise et al. [99], where the authors used transcriptome data from Calu-3 lung adenocarcinoma cells infected with SARS-CoV to identify master regulator proteins related to the SARS-CoV infection process.

These models can be further enhanced by integrating omics-derived regulators with functional networks (e.g. known protein-protein networks, pathway-based networks, etc.), thus obtaining functional modules linked to MRs. Such an approach has been successfully adopted in [101], where the authors used their corto algorithm [100] to identify disease sub-modules related to a co-expression network derived from SARS-CoV infection data.

The availability of omics data from infected samples makes it possible to derive biomarkers based on omics signatures (i.e. omics profiles that are characteristic of the particular infection). In this case, the biomarker is represented by the set of molecular features (e.g. genes, proteins, miRNAs) differentiating COVID-19 infected tissues from their normal counterparts. A complex biomarker such as a gene signature can also be used to discover potential drugs that can inhibit the activity of its components. In particular, it can be compared to known drug signatures (e.g. drug gene expression signatures from the Connectivity Map dataset [113]) by using a Gene Set Enrichment Analysis-based comparison against the transcriptional signatures associated with known drugs. This approach is implemented in the MANTRA tool [102] and has been applied to COVID-19 by Napolitano et al. [103], who exploited the transcriptomics data from primary human bronchial epithelial cells (NHBE) [47] (Table 1).
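The signature-matching idea can be sketched with an unweighted, Kolmogorov-Smirnov-like enrichment score in the spirit of Connectivity Map queries. This is an illustrative simplification and not the MANTRA implementation; the gene names, the drug rankings and the scoring convention (negative values indicate reversal of the disease signature) are assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation of an unweighted running-sum statistic."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def connectivity(drug_ranking, disease_up, disease_down):
    """Score in [-1, 1]; negative means the drug tends to reverse the disease signature.

    drug_ranking lists genes from most induced to most repressed by the drug.
    """
    es_up = enrichment_score(drug_ranking, disease_up)
    es_down = enrichment_score(drug_ranking, disease_down)
    if es_up * es_down > 0:  # both sets enriched at the same end: no coherent relationship
        return 0.0
    return (es_up - es_down) / 2.0

# Hypothetical disease signature and drug-induced rankings.
DISEASE_UP = {"g3", "g4"}
DISEASE_DOWN = {"g1", "g2"}
DRUG_RANKINGS = {
    "drug_reversing": ["g1", "g2", "g5", "g3", "g4"],
    "drug_mimicking": ["g4", "g3", "g5", "g2", "g1"],
}

# Most negative scores first: these are the repositioning candidates.
SCORES = sorted(
    ((drug, connectivity(ranking, DISEASE_UP, DISEASE_DOWN))
     for drug, ranking in DRUG_RANKINGS.items()),
    key=lambda item: item[1],
)
```

A drug that represses the genes induced by infection and induces the repressed ones obtains a score close to -1 and is prioritized.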

Structure-based approaches rely on the study of the structural affinity between proteins and drug molecules, which in turn drives the interaction potential between the two. This approach is particularly suited to targeting viral proteins and, in particular, to the discovery of drugs with potential inhibitory effects against them. Structure-based approaches for targeting viral proteins go through three main steps: (i) identification of target viral proteins; (ii) modelling of the 3D structures of the target proteins; (iii) search for potentially interacting ligands. Several data sources reporting the protein sequences of all known SARS-CoV-2 proteins along with models of their 3D structures are listed in Section 6 and Table 1.

The search for drug repurposing biomarkers in this case is reduced to the identification of (viral and/or host) proteins involved in the infection process and showing structural affinity for known compounds.

The task of determining the structural affinity can be addressed following a rule-based approach, using molecular docking screens, or by indirect approaches, which infer possible protein/drug interacting pairs from their molecular features using a statistical model trained on known and validated molecule/protein interactions. Docking simulations work by generating different poses between a ligand and a protein given their 3D structures, obtained by testing different orientations and conformations, and by scoring all these poses to determine the binding affinity between the two structures. This approach can be implemented by using protein models from [76] (Table 1) and different tools such as AutoDock Vina [104], applied by Yu et al. [105] to SARS-CoV-2 structural and non-structural proteins, and the Deep Docking tool [106], applied by Ton et al. [107] to the SARS-CoV-2 main protease.
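The pose-generation and scoring loop described above can be illustrated with a toy rigid-body search in two dimensions. This is only a conceptual sketch under strong assumptions: the coordinates are hypothetical, the ligand is rigid, poses are sampled at random, and the Lennard-Jones-like score is a stand-in that bears no relation to the scoring functions of AutoDock Vina or Deep Docking.

```python
import math
import random

def transform(ligand, angle, dx, dy):
    """Rotate the ligand about the origin by `angle`, then translate by (dx, dy)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in ligand]

def pose_score(protein, ligand):
    """Lower is better: sum of 12-6 pair terms with a minimum of -1 at distance 1."""
    total = 0.0
    for px, py in protein:
        for lx, ly in ligand:
            r = max(math.hypot(px - lx, py - ly), 0.3)  # clamp to avoid huge clash terms
            total += (1.0 / r) ** 12 - 2.0 * (1.0 / r) ** 6
    return total

def dock(protein, ligand, n_poses=500, seed=0):
    """Sample random rigid poses and keep the best-scoring one."""
    rng = random.Random(seed)
    best_pose, best_score = ligand, pose_score(protein, ligand)
    for _ in range(n_poses):
        pose = transform(ligand,
                         rng.uniform(0.0, 2.0 * math.pi),
                         rng.uniform(-3.0, 3.0),
                         rng.uniform(-3.0, 3.0))
        score = pose_score(protein, pose)
        if score < best_score:
            best_pose, best_score = pose, score
    return best_pose, best_score

# Hypothetical 2D "atoms"; the ligand starts centred on the origin, clashing with the protein.
PROTEIN = [(0.0, 0.0), (1.5, 0.0), (0.0, 1.5)]
LIGAND = [(-0.5, 0.0), (0.5, 0.0)]

BEST_POSE, BEST_SCORE = dock(PROTEIN, LIGAND)
```

Real docking engines replace the random sampling with guided optimization over 3D orientations and ligand conformations, but the structure of the loop (propose a pose, score it, keep the best) is the same.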

Indirect approaches for structural affinity screening can be implemented by means of machine learning tools. These methods are capable of learning the high-dimensional structure of a molecule starting from its raw sequence and encoding (embedding) it in a low-dimensional space, where the relationships between interacting proteins and ligands can be learnt by means of (deep) neural networks or other machine learning approaches. This approach has been implemented in the MT-DTI tool [108], which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework [114] from natural language processing. It has been applied in [109] to SARS-CoV-2 protein sequences extracted from the SARS-CoV-2 genome [1] (accession NC_045512.2, see Section 6) to discover six coronavirus-related targets of FDA-approved antivirals.

The bioinformatics community responded to the SARS-CoV-2 emergency with an unprecedented amount of work and research outputs. We have shown that the vast scientific literature on computational approaches for the identification of biomarkers can be classified into five main categories. Some categories are more focused on data generation and sharing, such as transcriptomic profiling to identify the markers of viral infection in host tissues and to characterize the T cell repertoire. A vast amount of work has also been performed to develop AI-based automatic diagnostic tools that characterize the severity of the disease from image scans. However, the area where the computational biology community has exploited the whole arsenal of approaches also developed in other fields, such as cancer and neuroscience, is the identification of therapeutic targets of existing molecules.

We also want to mention some potential limitations and opportunities for improvement in some areas. In order to make significant inroads in terms of diagnostic development, it would be necessary for profiles of hundreds, if not thousands, of patients to be available, and it seems that, 9 months into this pandemic, we are still very far from the mark. For example, regarding AIRR-seq, while sequencing performed on bulk samples can be informative, it will at some point be necessary to determine repertoires among subpopulations separately. For TCR-seq, for instance, it would be quite important to consider separately the repertoires of T helper, effector, memory and regulatory populations.

Overall, we have briefly described the most advanced approaches, mainly based on the inhibition of the signaling cascades activated by viral infection using the knowledge encoded in gene regulatory networks and/or protein-protein interaction networks. Indeed, a plethora of algorithms developed in the area of systems biology has been successfully exploited to prioritize existing drugs and molecules; some of the predicted drugs are already in clinical trials. Finally, we have also reported the main bioinformatics tools needed in the process of vaccine development, which is the ultimate way to combat the emerging COVID-19 pandemic.

• A vast amount of literature about COVID-19 biomarkers has already been published; automatic text categorization methods are useful to identify key topics.

• The analysis of a corpus of 27 000 papers resulted in 36 topics, five of them related to biomarker discovery and drug target identification.

• Selected topics span from machine learning and AI for disease characterization to vaccine development and to systems biology for therapeutic target identification.

• We include an up-to-date catalog of public transcriptomics and proteomics datasets available to the computational biology community for the discovery of biomarkers and disease characterization.

Supplementary data are available online at Briefings in Bioinformatics. 

A new coronavirus associated with human respiratory disease in China

Biomarkers definitions working group. Biomarkers and surrogate endpoints

Biomarkers as drug development tools: discovery, validation, qualification and use

The role of biomarkers in diagnosis of covid-19-a systematic review

Dysregulation of immune response in patients with covid-19 in Wuhan, China

Thrombocytopenia is associated with severe coronavirus disease 2019 (covid-19) infections: a meta-analysis

Genomewide association study of severe covid-19 with respiratory failure

The covid-19 host genetics initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the sars-cov-2 virus pandemic

The major genetic risk factor for severe COVID-19 is inherited from Neanderthals

Clinical course and risk factors for mortality of adult inpatients with covid-19 in Wuhan, China: a retrospective cohort study

Kidney disease is associated with in-hospital death of patients with covid-19

Latent dirichlet allocation

Keep up with the latest coronavirus research

RISmed: Download Content from NCBI Databases

Cell entry mechanisms of sars-cov-2

Why does the coronavirus spread so easily between people?

Sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor

Tmprss2: a potential target for treatment of influenza virus and coronavirus infections

Dipeptidyl peptidase-4 (dpp4) inhibition in covid-19

Covid-19: combining antiviral and anti-inflammatory treatments

Interleukin-6 as prognosticator in patients with covid-19: Il-6 and covid-19

T cell responses in patients with covid-19

Increased expression of cd8 marker on t-cells in covid-19 patients

Imaging and clinical features of patients with 2019 novel coronavirus sars-cov-2

Coronavirus disease 2019 (covid-19): a systematic review of imaging findings in 919 patients

Clinically applicable ai system for accurate diagnosis, quantitative measurements, and prognosis of covid-19 pneumonia using computed tomography

Learning spatio-temporal features with 3d residual networks for action recognition

Diagnosis of coronavirus disease 2019 (covid-19) with structured latent multi-view representation learning

Deep learning for classification and localization of covid-19 markers in point-of-care lung ultrasound

Comparative genomics suggests limited variability and similar evolutionary patterns between major clades of sars-cov-2

Gene expression omnibus: NCBI gene expression and hybridization array data repository

The sequence read archive

The european nucleotide archive

The European genome-phenome archive of human data consented for biomedical research

GSA: genome sequence archive

A single-cell atlas of the peripheral immune response in patients with severe covid-19

Identification of candidate covid-19 therapeutics using hpsc-derived lung organoids

Single-cell longitudinal analysis of sars-cov-2 infection in human bronchial epithelial cells

Single-cell analysis of two severe COVID-19 patients reveals a monocyte-associated and tocilizumab-responding cytokine storm

Covid-19 severity correlates with airway epithelium-immune cell interactions identified by single-cell analysis

Disruption of the ccl5/rantes-ccr5 pathway restores immune homeostasis and reduces plasma viral load in critical covid-19

Viral Invasion and Type I Interferon Response Characterize the Immunophenotypes During Covid-19 Infection

Single-cell sequencing of peripheral blood mononuclear cells reveals distinct immune response landscapes of covid-19 and influenza patients

Imbalanced host response to sars-cov-2 drives development of covid-19

Transcriptomic characteristics of bronchoalveolar lavage fluid and peripheral blood mononuclear cells in covid-19 patients

In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age

Human iPSC-Derived Cardiomyocytes Are Susceptible to SARS-CoV-2 Infection

Generation of human bronchial organoids for sars-cov-2 research

Sars-cov-2 productively infects human gut enterocytes

B cell clonal expansion and convergent antibody responses to sars-cov-2

Immunologic perturbations in severe covid-19/sars-cov-2 infection

Longitudinal high-throughput TCR repertoire profiling reveals the dynamics of T cell memory formation after mild COVID-19 infection

Convergent antibody responses to sars-cov-2 in convalescent individuals

Sars-cov-2 epitopes are recognized by a public and diverse repertoire of human t-cell receptors

Deep sequencing of b cell receptor repertoires from covid-19 patients reveals strong convergent immune signatures

Next generation sequencing of t and b cell receptor repertoires from covid-19 patients showed signatures associated with severity of disease

Bulk and single-cell gene expression profiling of sars-cov-2 infected human cell lines identifies molecular targets for therapeutic intervention

A human pluripotent stem cell-based platform to study sars-cov-2 tropism and model virus infection in human cells and organoids

Blood single cell immune profiling reveals the interferon-mapk pathway mediated adaptive immune response for covid-19

Immune cell profiling of covid-19 patients in the recovery stage by single-cell sequencing

Potent neutralizing antibodies against sars-cov-2 identified by high-throughput single-cell sequencing of convalescent patients' b cells

Single-cell landscape of bronchoalveolar immune cells in patients with covid-19

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Proteomics of SARS-CoV-2-infected host cells reveals therapy targets

Virus-host interactome and proteomic survey of PMBCs from COVID-19 patients reveal potential virulence factors influencing SARS-CoV-2 pathogenesis

Ligand-centered assessment of sars-cov-2 drug target models in the protein data bank

Proteomic and metabolomic characterization of covid-19 patient sera

Immunology of covid-19: current state of the science

Adaptive immune receptor repertoire community recommendations for sharing immune-repertoire sequencing data

AbDab: the coronavirus antibody database

Potent neutralizing antibodies directed to multiple epitopes on sars-cov-2 spike

The impact of bioinformatics on vaccine design and development

Designing of a next generation multiepitope based vaccine (mev) against sars-cov-2: Immunoinformatics and in silico approaches

Biochemical characterization of sars-cov-2 nucleocapsid protein

Reverse vaccinology approach to design a novel multi-epitope vaccine candidate against covid-19: an in silico study

Ivax: an integrated toolkit for the selection and optimization of antigens and the design of epitope-driven vaccines

A genome-wide positioning systems network algorithm for in silico drug repurposing

Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2

Network medicine framework for identifying drug repurposing opportunities for COVID-19

Combating web spam with trustrank

Exploring the sars-cov-2 virus-host-drug interactome for drug repurposing

MONET: a toolbox integrating top-performing methods for network modularization

Modeling polypharmacy side effects with graph convolutional networks

Functional characterization of somatic mutations in cancer using network-based inference of protein activity

The host cell virocheckpoint: identification and pharmacologic targeting of novel mechanistic determinants of coronavirus-mediated hijacked cell states

Corto: a lightweight R package for gene network inference and master regulator analysis

Master regulator analysis of the sars-cov-2/human interactome

Discovery of drug mode of action and drug repositioning from transcriptional responses

Computational drug repositioning and elucidation of mechanism of action of compounds against sars-cov-2, arXiv 2020

Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading

Computational screening of antagonists against the sars-cov-2 (covid-19) coronavirus by molecular docking

Deep docking: a deep learning platform for augmentation of structure based drug discovery

Rapid identification of potential inhibitors of sars-cov-2 main protease by deep docking of 1.3 billion compounds

Self-attention based molecule representation for predicting drug-target interaction

Predicting commercially available antiviral drugs that may act on the novel coronavirus (sars-cov-2) through a drug-target interaction deep learning model

Network-based in silico drug efficacy screening

Repurposing drugs for covid-19 based on transcriptional response of host cells to sars-cov-2

The recurrent architecture of tumour initiation, progression and drug sensitivity

A next generation connectivity map: L1000 platform and the first 1,000,000 profiles

Bert: pre-training of deep bidirectional transformers for language understanding