key: cord-1041144-x1hcnox3
authors: de Siqueira Santos, Suzana; Torres, Mateo; Galeano, Diego; Sánchez, María del Mar; Cernuzzi, Luca; Paccanaro, Alberto
title: Machine Learning and Network Medicine approaches for Drug Repositioning for COVID-19
date: 2021-11-09
journal: Patterns (N Y)
DOI: 10.1016/j.patter.2021.100396
sha: 1849ab72edd65919aebaf27f0d0ef69fb83e7f1b
doc_id: 1041144
cord_uid: x1hcnox3

We present two machine learning approaches for drug repurposing. While we have developed them for COVID-19, they are disease-agnostic. The two methodologies are complementary, targeting SARS-CoV-2 and host factors, respectively. Our first approach consists of a matrix factorisation algorithm to rank broad-spectrum antivirals. Our second approach, based on network medicine, uses graph kernels to rank drugs according to the perturbation they induce on a subnetwork of the human interactome that is crucial for SARS-CoV-2 infection/replication. Our experiments show that our top predicted broad-spectrum antivirals include drugs indicated for compassionate use in COVID-19 patients; and that the ranking obtained by our kernel-based approach aligns with experimental data. Finally, we present the COVID-19 Repositioning Explorer (CoREx), an interactive online tool to explore the interplay between drugs and SARS-CoV-2 host proteins in the context of biological networks, protein function, drug clinical use, and Connectivity Map. CoREx is freely available at: https://paccanarolab.org/corex/.


Drug discovery and development present several challenges, including high attrition rates, long development times, and substantial costs 1 . Drug repositioning involves the use of compounds already de-risked in humans, which translates into lower costs and shorter development times 2 . Computational methods can assist drug repurposing research projects by providing rankings of drugs based on predicted therapeutic efficacy, as well as tools to help scientists reason about drug effectiveness by integrating diverse available biomedical knowledge.

Coronaviruses are notoriously difficult to manage, as there is no specific antiviral treatment that has been proven effective against the infections they induce 3 . Identifying commercially available drugs with therapeutic effects for COVID-19 could provide early treatment options until effective therapies become widely available. A growing corpus of literature identifies several categories of treatment: drugs whose mode of action targets the molecular structure of the virus (virally-targeted agents), drugs that target its cellular processes in the host (host-targeted agents), and combinatorial therapies 4, 5, 6, 7 .

In this paper, we present two different machine learning approaches, and a webtool, for drug repurposing for COVID-19. Our first machine learning approach focuses on virally-targeted agents and aims at ranking Broad Spectrum Antiviral (BSA) drugs. Given a small number of drugs associated with a virus, and their stage in the drug development process, our matrix decomposition algorithm assigns scores to a larger group of drugs with previously unknown associations with the virus. Our method predicts BSAs against SARS-CoV-2 by exploiting information about stages of drug development that are interpreted as probabilities of drug approval. To our knowledge, our matrix decomposition model is the first that integrates developmental stage information to predict the efficacy of drugs against viral diseases, and we show that this is crucial to obtain better predictions.

Our second machine learning approach focuses on host-targeted agents, and prioritises FDA-approved drugs based on ideas from network medicine 8 . In particular, it exploits the concept of disease module, which has been instrumental in the prediction of disease genes for hereditary diseases 9, 10, 11, 12 . For a virus, a disease module can be defined as the set of human proteins (hereafter, host proteins) that interact with viral proteins, allowing the infection and replication processes. Recently, Gysi et al. 13 have shown that, for SARS-CoV-2, most of the experimentally identified human host proteins 14 form a distinct COVID-19 disease module in the interactome. Our network medicine approach builds on the idea that the binding of drugs to their protein targets causes a perturbation that propagates through the interactome. By quantifying this perturbation, it is possible to calculate the extent of the effect that a drug induces on the COVID-19 disease module. Our method ranks FDA-approved drugs based on this effect, which is estimated using graph kernels. An important aspect of our method is that it offers a natural way to model the relative importance of host proteins for the disease, and we show that our network medicine approach benefits from this prioritisation of host proteins.

Finally, we present the COVID-19 Repositioning Explorer (CoREx), an online tool that enables scientists to analyse and reason about drug repurposing in a functional context on the interactome and thus allows the exploration of our results as well as the formulation of novel repurposing hypotheses. CoREx integrates several sources of information, connecting functional protein modules with drug targets and host proteins. CoREx also provides additional evidence for a drug of interest, such as whether the drug is in clinical trials for COVID-19, or whether the drug could reverse the gene-expression signature of SARS-CoV-2 infection based on the Connectivity Map (CMAP) 15, 16 .

such as the NMF with L2 regularisation by Bakal et al. 22 , the TriFactor NMF by Ceddia et al. 23 , and the Indicator Regularized non-negative Matrix Factorization (IRNMF) method by Tang et al. 24 , which was developed to repurpose drugs for COVID-19. Our aim is also to build a recommender system that recommends BSA drugs to viruses. The novelty of our approach lies in the realisation that the stages of drug development for drug-virus associations can be related to the probability of reaching the final stage of drug development (hereafter, probability of success). This observation is motivated by the empirical evidence (e.g., Dowden et al. 25 ) that the probability of success of a candidate drug increases as the candidate drug moves to the next developmental stage in the drug development process. This led us to develop a novel objective function that models the probabilities of success of drug-virus associations using their stage in the drug development process. In this paper, we show how the integration of this type of information greatly improves prediction performance.

In recommender systems based on matrix decomposition, the fundamental assumption is that users and movies can be represented as latent feature vectors in a low-dimensional space, and that a rating value for a specific user-movie pair can be approximated by the dot product of the corresponding vectors. Analogously, for each drug $i$ we learn a low-dimensional feature vector $p_i \in \mathbb{R}^k$ (the drug signature) and for each virus $j$ a low-dimensional feature vector $q_j \in \mathbb{R}^k$ (the virus signature) such that $y_{ij} \approx p_i^T q_j$. Therefore, our algorithm amounts to decomposing the $n \times m$ matrix $Y$ into the product of two matrices: $P \in \mathbb{R}^{n \times k}$, in which each row is a drug signature $p_i^T$, and $Q \in \mathbb{R}^{k \times m}$, in which each column is a virus signature $q_j$, with $k \ll \min(n, m)$. Indicating their product with $\hat{Y}$, we have $Y \approx PQ = \hat{Y}$.

Matrices $P$ and $Q$ are learned by minimising the following cost function (Equation (1)), whose terms are annotated as covering the in vitro, animal model and clinical trial stages and a zero-driven regularisation, subject to the non-negative constraints $P, Q \geq 0$,

where the first term in Equation (1) attempts to find a decomposition $PQ$ that reconstructs the associations in set $A$ exactly. The second term in Equation (1) has an equivalent role for the remaining known associations in $Y$, corresponding to earlier stages in the drug development process: sets $B$, $C$ and $D$ contain entries in clinical trials phases I, II and III, respectively, while set $E$ contains associations in in vitro and animal model stages. Here the corresponding matrices $M^s$ are used to apply the summations only to entries belonging to the corresponding sets ($M^s_{ij} = 1$ if the entry $y_{ij}$ belongs to set $s$). However, for these sets, the contributions to the loss are weighted differently using the parameters $\alpha_s \in [0, 1]$. These parameters have the key role of downweighting these terms in the minimisation, in a way that reflects their higher uncertainty of success due to their earlier stage of drug development, thus effectively coding probabilities of success for each subset. Similarly, the third term in Equation (1) is used to downweight the importance of the zero entries of $Y$ while also serving as a regularisation term 19 .
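As a rough illustration of the stage-dependent weighting described above, the following sketch evaluates a loss with the three kinds of terms. It is not a transcription of Equation (1): the function name `staged_loss`, the weight `beta` on the zero entries, the factors of 1/2 and the toy values are all assumptions made for illustration.

```python
import numpy as np

def staged_loss(Y, P, Q, masks, alphas, beta):
    """Hypothetical stage-weighted reconstruction loss (illustrative only).

    masks  : dict, set name -> binary matrix M^s (1 where y_ij belongs to set s)
    alphas : dict, set name -> weight in [0, 1] for the earlier-stage sets
    beta   : assumed weight applied to the zero entries of Y
    """
    Y_hat = P @ Q
    R = Y - Y_hat                                   # reconstruction residual
    loss = 0.5 * np.sum(masks["A"] * R**2)          # set A: fitted with full weight
    for s, alpha in alphas.items():                 # earlier-stage sets: down-weighted
        loss += 0.5 * alpha * np.sum(masks[s] * R**2)
    zero_mask = (Y == 0).astype(float)
    loss += 0.5 * beta * np.sum(zero_mask * Y_hat**2)   # zero-driven regularisation
    return loss

# Toy example: 2 drugs x 2 viruses, one latent dimension.
Y = np.array([[1.0, 0.0], [1.0, 1.0]])
P = np.array([[1.0], [1.0]])
Q = np.array([[1.0, 0.5]])
masks = {"A": np.array([[1.0, 0.0], [0.0, 0.0]]),
         "B": np.array([[0.0, 0.0], [1.0, 1.0]])}
toy_loss = staged_loss(Y, P, Q, masks, alphas={"B": 0.5}, beta=0.1)
```

In this toy case the set-A entry is reconstructed exactly, so only the down-weighted set-B residual and the zero entry contribute to the loss.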

Finally, we impose non-negative constraints on $P$ and $Q$ in order to favour the interpretability of the learned representations 20, 19 . Thus, our model is closely related to Non-negative Matrix Factorization (NMF) 20 . Both models seek to decompose a data matrix $Y$ into the product of two non-negative matrices $P$ and $Q$. However, the NMF model considers all the entries in $Y$ equally during learning, which works well when entries have the same meaning, e.g., pixels in an image 20 . Instead, in our approach, we assign different levels of importance to subsets of entries to reflect the drug stages of development, thus coding the probability of drug success, which is what we are trying to predict. This gives rise to a loss function in Equation (1)

Our starting point is the matrix $Y$ containing binary drug-virus associations.

We learn the matrices $P$ and $Q$ that minimise the loss function in Equation (1) (see Experimental Procedures). Having learned $P$ and $Q$ such that $Y \approx PQ$, we calculate the matrix $\hat{Y} = PQ$. Note that, while $Y$ contains binary entries, $\hat{Y}$ contains positive real numbers that are our predicted scores.

To perform an in silico evaluation of the performance of our model, we

We compared the performance of our algorithm with the other drug repurposing approaches that we mentioned earlier, namely the NMF with L2 regularisation 22 , the TriFactor NMF 23, 26 and the Indicator Regularized non-negative Matrix Factorization (IRNMF) 24 , which was also developed for COVID-19. Moreover, we also included standard NMF and Truncated Singular Value Decomposition (tSVD) 18 as baselines. The relations between previous NMF-based drug repositioning methods and our model are explained in Note S1.

Following other works that used LOOCV evaluations 27, 9, 28 , we evaluated

We then trained the model, and scores were predicted for all the drugs. Finally, we ranked drugs that had no known association with that virus and checked the percentage of cases in which the correct (effective) drug for the virus was found among the top-K predictions.

in the top-20. Overall, our method could recover 70% of the phase IV/approved BSA drugs for 28 distinct viruses in the top-20 predictions. We also observed that, in some cases, tSVD and TriFactor NMF perform slightly better than NMF. The comparison of our method's performance with IRNMF was performed on a smaller subset of the matrix $Y$ (see Figure S1 in Note S1).
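The leave-one-out procedure outlined above (hide one known association, retrain, rank the drugs with no known association to that virus, check the top-K) can be sketched as follows. This is a minimal sketch under stated assumptions: `loocv_top_k_hits` and `fit_predict` are hypothetical names, and the co-occurrence scorer used in the toy example is only a stand-in for a trained model.

```python
import numpy as np

def loocv_top_k_hits(Y, fit_predict, held_out, k=20):
    """Fraction of held-out (drug, virus) pairs recovered in the top-k.

    fit_predict is any callable mapping a training matrix to a score matrix
    of the same shape; it stands in for training the model and predicting.
    """
    hits = 0
    for i, j in held_out:
        Y_train = Y.copy()
        Y_train[i, j] = 0                                  # hide the known association
        scores = fit_predict(Y_train)
        candidates = np.flatnonzero(Y_train[:, j] == 0)    # drugs with no known link to virus j
        order = np.argsort(-scores[candidates, j], kind="stable")
        ranked = candidates[order]                         # best-scoring candidates first
        hits += int(i in ranked[:k])
    return hits / len(held_out)

# Toy data: 4 drugs x 3 viruses, scored by a simple co-occurrence heuristic.
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
cooccurrence = lambda M: M @ M.T @ M
hit_rate = loocv_top_k_hits(Y, cooccurrence, [(0, 1), (2, 2)], k=1)
```

Ties are broken here by drug index, which is an arbitrary choice; a real evaluation would need an explicit tie-handling rule.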

The good prediction performance of our model prompted us to ask how much

Repositioning FDA-approved drugs with Network Medicine

The majority of BSAs considered previously target viral proteins. In our work, we also explored approaches that consider drugs targeting human proteins.

Human proteins interact with each other, forming a protein-protein interaction (PPI) network. This and other biological networks have been explored in relation to disease, an area of research that is often called network medicine. It has been shown that proteins associated with specific hereditary diseases tend

The use of network medicine for assisting drug repositioning was originally applied to genetic diseases 29 . A drug induces its effects on a human PPI subnetwork by binding to its target proteins 31, 32 , and this causes a perturbation in the interactome that is then propagated. Thus, drug efficacy for a genetic disease can be associated with how likely the drug is to affect its disease module through the perturbations propagated in the human PPI network 29 . To implement this idea, Guney et al. 29 proposed a distance (hereafter, the Guney distance) based on the shortest path length between the disease module and the drug targets.

Recent studies suggest that an analogous approach can be useful for infectious diseases such as COVID-19 13, 33 . Viruses hijack host proteins to facilitate their replication, and hence the inhibition or knockdown of such host proteins can block viral replication 34 . Gysi et al. 13 have shown that, for SARS-CoV-2, most of the experimentally identified host proteins 14 group together in a large connected component, forming a COVID-19 disease module, as illustrated in Figure 4a with red nodes (host protein subnetwork). Therefore, the idea here is to find drugs that, by binding to their targets (blue nodes in Figure 4a), are likely to perturb this module.

We can think of the perturbation caused by a drug as a process in which the effect of the drug diffuses on the PPI network starting from its targets. Thus, our drug repurposing problem translates into the problem of the diffusion between drug targets and the set of host proteins. Gysi et al. 13 implemented this idea for COVID-19 by using the diffusion state distance (DSD) 35 .

Kernels on graphs are appealing for modelling a diffusion process on a network. They are theoretically well founded in statistical learning theory 36, 37 , and have shown good empirical results in many applications 38, 39, 35 . Graph kernels can be interpreted as measures of similarity between nodes in a network. There are different types of kernels. The p-step random walk kernel, for example, is directly associated with the number of times a random walker starting from a node $i$ visits a node $j$ after $p$ steps 36 . Another example is the diffusion kernel (or heat kernel), which can be thought of as a random walk with an infinite number of infinitesimally small steps. An alternative interpretation is that this kernel corresponds to the amount of heat that reaches a node $j$ after diffusing an initial heat from node $i$ 36 .

Importantly, kernels on graphs can be applied in a natural way to nodes with weights. This property can be particularly useful for our problem: we can assign weights to the host proteins to model the different roles that they have

In order to assist the repositioning of drugs for COVID-19, we used five different kernels on graphs and weighted the host proteins with differential gene expression data (absolute value of the log fold change between the gene expression levels of COVID-19 patients and controls; see Experimental Procedures for details on the RNAseq data). We used the interactome assembled by Gysi et al. 13 , and a set of 336 human proteins that were identified as hosts of SARS-CoV-2 (see Experimental Procedures). Every FDA-approved drug with known targets in this interactome was ranked by each of the kernels in our approach (see Experimental Procedures). The selected kernels are defined in terms of the graph Laplacian (see Experimental Procedures), as shown in Table 1. For each drug, we obtain the graph kernel-based similarities between each of its targets and each of the host proteins. The final score of a drug is the sum of these similarities weighted by the amount of change in the host protein expression levels after infection. Drug scores are then ordered, obtaining a drug ranking which is evaluated. We also calculated an aggregated ranking, which we called avgRank, where the ordinal position of each drug was obtained by simply averaging the ranking position that the drug had obtained in each of the kernels.
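The aggregation into avgRank described above can be sketched in a few lines. The helper names `rank_positions` and `avg_rank`, the toy scores and the handling of ties (broken by input order) are assumptions for illustration, not details taken from the paper.

```python
def rank_positions(scores):
    """Map each drug to its ordinal position (1 = highest score)."""
    ordered = sorted(scores, key=lambda drug: -scores[drug])
    return {drug: pos for pos, drug in enumerate(ordered, start=1)}

def avg_rank(score_tables):
    """Order drugs by the mean of their positions across several rankings."""
    positions = [rank_positions(table) for table in score_tables]
    drugs = list(positions[0])
    mean_pos = {d: sum(p[d] for p in positions) / len(positions) for d in drugs}
    return sorted(drugs, key=lambda d: mean_pos[d])

# Toy scores from three hypothetical kernels for three drugs.
kernel_scores = [
    {"a": 0.9, "b": 0.5, "c": 0.1},
    {"a": 0.2, "b": 0.8, "c": 0.1},
    {"a": 0.3, "b": 0.9, "c": 0.5},
]
aggregated = avg_rank(kernel_scores)
```

Averaging positions rather than raw scores makes the aggregation insensitive to the different scales of the individual kernels.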

The mathematical formulation of this approach turns out to be quite simple. Let $n_V$ be the number of proteins in the PPI network, $N$ the number of FDA-approved drugs, and $T$ an $N \times n_V$ matrix of drug target associations, where $T_{ij} = 1$ if $j$ is a target of drug $i$, and 0 otherwise (see Drug Targets box in Figure 4b). Let $K$ be a square matrix of dimensions $n_V \times n_V$ representing kernel-based similarities between proteins on the PPI network (see PPI Kernels box in Figure 4b). Let $h$ be an $n_V$-dimensional column vector containing weights related to the differential expression data of the host proteins and zeros for the remaining proteins (see Host Proteins box in Figure 4b). We obtain prediction scores simultaneously for all drugs with the matrix multiplication $S_d = TKh$ (also illustrated in Figure 4c), resulting in a vector of drug scores, $S_d$.

patients. Yet, they represent a proxy of effectiveness of drugs for COVID-19.
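The matrix product $S_d = TKh$ defined above can be illustrated on a toy interactome. All numbers below are made up for illustration; only the shapes and the order of the product follow the text.

```python
import numpy as np

# Toy interactome with 4 proteins and 2 drugs (all values are invented).
T = np.array([[1, 0, 0, 0],            # drug 0 targets protein 0
              [0, 0, 1, 1]])           # drug 1 targets proteins 2 and 3
K = np.array([[1.0, 0.5, 0.1, 0.0],
              [0.5, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.4],
              [0.0, 0.1, 0.4, 1.0]])   # symmetric kernel similarities between proteins
h = np.array([0.0, 2.0, 0.0, 0.5])     # |log fold change| for host proteins, 0 elsewhere

S_d = T @ K @ h                        # one score per drug
ranking = np.argsort(-S_d)             # drug indices ordered from highest to lowest score
```

Each entry of `S_d` sums, over the targets of a drug, the kernel similarity to every host protein weighted by that host protein's entry in `h`.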

In vitro experiments involving drugs with antiviral efficacy indicate their potential to be effective at reducing viral infection and replication in the host cell. Evaluating our models with this kind of evidence allows us to assess whether they prioritise drugs with molecular antiviral efficacy vs. other drugs.

Clinical trial studies are used to assess pharmacokinetics, dosage, therapeutic efficacy, and safety of drugs 43 . Each phase in clinical trials involves an increasing number of patients, thus achieving higher statistical significance while minimising the number of patients that risk developing side effects 44 . Indicating a drug in a clinical trial requires satisfying several conditions set by biologists and medics, and arguments of why it might be effective. This suggests the investigators believe that the drug is safe and a potential candidate to treat the disease. Evaluating our models with clinical trial evidence allows us to determine if they prioritise drugs that would be included in such trials.

We use the Connectivity Map (CMAP) 15, 16 to contrast changes in gene expression levels caused by a drug (drug expression profile) with changes induced by SARS-CoV-2 infection (disease expression profile). The hypothesis is that, if a drug expression profile is opposite to a disease expression profile, then it could potentially "revert" the disease signature and have therapeutic effects. This idea has been used before 15, 42 to predict new therapeutic indications for drugs and has also been applied to COVID-19 45, 46 . Therefore, evaluating our models with this source of evidence allows us to assess whether they prioritise drugs with potentially therapeutic effects.

For the matrix decomposition approach, the evaluation was carried out using

We used the types of evidence described above to create three datasets where drugs were classified as either effective or non-effective for COVID-19 (see Experimental Procedures). This allowed us to assess the performance of a prediction method by formulating a binary classification problem, where the task is to discriminate the two sets of drugs, and then calculating binary classification metrics based on the analysis of the confusion matrix.

However, we note that the lack of a set of drugs with proven therapeutic effect against COVID-19 (i.e., a gold standard) poses a challenge for this type of evaluation; this problem has also been described before, e.g., in Zhou et al. 48 and Gysi et al. 13 . We hypothesized that drugs with evidence against COVID-19 should behave differently from the remaining drugs. This hypothesis has an actionable consequence: a method can be evaluated by assessing whether it can discriminate between the two groups of drugs (effective and non-effective). If it can, this is an indication that we can possibly trust the predictions it makes. Therefore, together with traditional metrics for binary classification, we also assessed whether prediction methods provided scores that were statistically different for the two classes of drugs. Our results (Figures 5a, 5b, 5f, 5g, and 5h) show that the differences between the scores are significant for our matrix decomposition approach as well as our kernel methods across several evaluation settings. We observe that other network-based methods do not pass this test with such consistency (see Notes S5, S6, and S7). In the following, we present the results for each type of evidence separately.

A) In vitro evaluation. Of the 126 BSAs in the drug-virus dataset, 10 have shown in vitro efficacy against SARS-CoV-2 49, 13 . In our evaluation, these drugs were removed one at a time from the drug-virus matrix $Y$ (by setting the corresponding entry to zero). We then trained our matrix decomposition model, and scores were predicted for all the drugs. We used the Wilcoxon-Mann-Whitney p-value to assess the difference between the scores obtained for those 10 drugs and the rest of the drugs. Figure 5a shows that our matrix decomposition method assigns significantly higher scores to BSAs with in vitro efficacy (Wilcoxon-Mann-Whitney p-value 4.92e−7). Precision and recall are shown in Figure S2 (Note S1).
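To make the comparison of score distributions concrete, the sketch below computes a one-sided Wilcoxon-Mann-Whitney test between two groups of scores. It is an illustration only: it uses midranks and the normal approximation without tie or continuity correction, which may differ from the exact procedure used for the reported p-values, and the function name and toy scores are hypothetical.

```python
import math

def mann_whitney_greater(x, y):
    """One-sided Wilcoxon-Mann-Whitney test that values in x tend to exceed y.

    Returns the U statistic of x and an approximate p-value (normal
    approximation, no tie or continuity correction).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for t in range(i, j + 1):
            ranks[t] = (i + j) / 2 + 1          # midrank for tied values (1-based)
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2                 # U statistic of the first sample
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean) / sd
    return u1, 0.5 * math.erfc(z / math.sqrt(2))

# Toy scores: drugs with supporting evidence vs. the remaining drugs.
u_stat, p_value = mann_whitney_greater([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0])
```

For samples as small as this toy example the normal approximation is crude; an exact test would be preferable in practice.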

Scores predicted by the kernel-based methods are shown in Figure 5f. Of the 2197 FDA-approved drugs considered by our network medicine approach, 81 have shown in vitro efficacy against SARS-CoV-2 49, 13 . We observed that the scores of drugs with in vitro efficacy against SARS-CoV-2 are significantly higher than those of the remaining drugs for all kernels and the average ranking (avgRank).

In Figure 5c, we show that the kernel-based methods performed better than the competitors for the in vitro evaluation. The recall@150 of the average ranking is 49.71% higher than DSD, and 110.57% higher than the Guney distance. The precision@150 of the average ranking is 50.54% higher than DSD, and 108.96% higher than the Guney distance.

Figure S2 in Note S1).

Scores predicted by the kernel-based methods are shown in Figure 5g. Of the 2197 FDA-approved drugs considered by our network medicine approach, 170 are in clinical trials. We observed that the scores of drugs in clinical trials for COVID-19 are significantly higher than those of the remaining drugs for all kernels and the average ranking (avgRank).

In Figure 5d, we show that the kernel-based methods performed better than the competitors for the clinical trials evaluation. The recall@150 of the average ranking is 17.02% higher than DSD, and 117.11% higher than the Guney distance. The precision@150 of the average ranking is 16.88% higher than DSD, and 114.94% higher than the Guney distance.

Figure 5h shows that the scores of FDA-approved drugs with strongly negative CMAP correlation are significantly higher than those of the remaining drugs for all kernels and the average ranking (avgRank).

In Figure 5e, we compared the performance of the kernel-based methods and competitors for the CMAP evaluation. Our average ranking has the same performance as DSD and better performance than the Guney distance (recall@150 is 50.72% higher, and precision@150 is 47.43% higher). The regularised Laplacian kernel had the best performance, with recall@150 16.48% higher than DSD and 60.7% higher than the Guney distance, and precision@150 17.5% higher than DSD and 57.14% higher than the Guney distance.
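The precision@K and recall@K figures quoted in these comparisons can be computed as in the sketch below. It assumes that "X% higher" denotes a relative difference with respect to the baseline; the function names and the toy ranking are hypothetical.

```python
def precision_recall_at_k(ranked_drugs, positives, k):
    """Precision@k and recall@k for a ranked list and a set of positive drugs."""
    top = ranked_drugs[:k]
    true_positives = sum(1 for drug in top if drug in positives)
    return true_positives / k, true_positives / len(positives)

def relative_gain(ours, baseline):
    """Percentage by which `ours` exceeds `baseline` (assumed interpretation)."""
    return 100.0 * (ours - baseline) / baseline

# Toy ranking of five drugs, three of which carry supporting evidence.
ranked = ["a", "b", "c", "d", "e"]
positives = {"a", "c", "e"}
precision, recall = precision_recall_at_k(ranked, positives, k=2)
gain = relative_gain(0.3, 0.2)   # e.g. recall 0.3 against a baseline of 0.2
```

With a fixed K and a fixed set of positives, precision@K and recall@K differ only by a constant factor, which is why the relative gains reported for the two metrics are close to each other.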

On the importance of integrating transcriptomics data. An interesting question is whether weighting host proteins by differential expression improves our network medicine approach. To answer this, we compared results based on 

As a further way to evaluate drug repurposing against SARS-CoV-2, we developed CoREx, a web-based tool that enables scientists to study drug repur-

CoREx is available at https://paccanarolab.org/corex and supporting datasets are updated every 2 weeks. The project is also open-source, and the repository is publicly available at https://github.com/paccanarolab/corex.

The development of computational approaches that can assist in the rational and fast discovery of treatments is critical for emergent infectious diseases such as COVID-19 55, 1, 2, 3, 6 . Drug repositioning, the re-use of drugs already on the market, can help to speed up the development of such treatments by prioritising known safe-in-human drugs for clinical trials involving COVID-19 patients. In this paper, we proposed two machine learning approaches that can assist in the prioritisation of drugs, together with a human-in-the-loop website tool, CoREx, to assist current research efforts for finding drugs with therapeutic efficacy against SARS-CoV-2.

Li and De Clercq 4 indicated that finding potential repositioning candidates for COVID-19 should be focused on two main strategies: virally-targeted agents, and host-targeted agents. Our matrix decomposition approach is aimed at the first repositioning strategy, whereas our network medicine approach, together with CoREx, is aimed at the second one. Our first approach ranks 126 Broad Spectrum Antivirals by their predicted efficacy against SARS-CoV-2, and our second approach ranks 2197 therapeutically diverse FDA-approved drugs by their predicted ability to perturb the COVID-19 disease module.

The model in Equation (1) is inspired by our recent work to predict the frequencies of drug side effects 56 . The main feature of this new model is that it can account for varying levels of uncertainty in the data. We realised that different levels of drug developmental evidence can be thought of as indicating different levels of confidence in drug-virus associations and can be interpreted as probabilities. Our new model exploits the richness of this information and its outputs can be interpreted as probabilities of drug approval. Experiments in which we randomized or removed information about drug developmental stages show that such information is key to achieving good performance (see Note S3). The implementation of our algorithm is freely available: https://github.com/paccanarolab/DrugRepoCOVID.

Our network medicine approach aims at prioritising FDA-approved candidates based on their network-modulated effects on the COVID-19 disease protein module. In contrast to our first approach, our network medicine approach does not explicitly model the clinical efficacy of drugs, but rather their mechanistic effects on the protein interaction network. This means that a high score points to a high probability for the drug to perturb the disease module. Note, however, that our kernel methods, like most network-based approaches 13 , can quantify the perturbation on the interactome, but cannot predict in which way the host will ultimately be affected by such perturbations (see Note S11.2).

An important advantage of our kernel approaches is that they offer a natural way to integrate gene expression data and thus allow us to focus the models on particular proteins that play a key role in the infection. Our experiments show that the integration of transcriptomics data improves the results (see Note S8). Furthermore, we have shown that our kernels have similar performance across multiple interactomes (see Note S7).

We have shown that our predictions from both approaches are aligned to

presented in Note S11. A comparison with the set of drugs predicted by Gordon et al. 14 is also provided in Note S9. Finally, while the datasets that we used in our two approaches are different, a few drugs could be predicted by both methodologies; these are analysed in Note S12.

Our computational approaches leverage available data to produce the predictions. As more reliable data becomes available, we expect the performance of our models to increase accordingly. Recently, COVID-19 atlases have been published, including single-cell transcriptomics data 65, 66 that could be exploited with our approaches.

We also point out that while we have developed and tested our two approaches for COVID-19, both of them are disease-agnostic. The general principles underlying our matrix decomposition and network medicine approaches will remain valid for any other viral disease, and therefore our methods could be applied for drug repurposing in these scenarios, as long as the data is available.

Limitations of the Study. Our matrix decomposition approach is applicable to any drug for which the developmental stage associating it to a viral disease is known. The drug may or may not be virally-targeted, and the model itself will not impose such a restriction. The main limitation of the method is that it relies on drug-virus associations annotated with their stage of development, and publicly available data of this type is currently scarce: we only found this type of information in the manually curated dataset by Andersen et al. 17 that we used in our study. The main limitation of our network medicine approach is that it can only be applied to drugs with known targets on the host interactome.

Resource availability

Lead contact. The lead contact for this work is Alberto Paccanaro at alberto.paccanaro@rhul.ac.uk.

Data and code availability. Table S7 (Note S13) describes where input datasets, as well as the prediction from our matrix decomposition and network medicine Table S7 ).

• In vitro data. We built a binary dataset, assigning positive labels to drugs that were reported to show efficacy against SARS-CoV-2 infection in vitro, and negative labels to all other drugs. Data for drug efficacy in vitro was built as the union of experiments reported by Riva et al. 49 and Gysi et al. 13 . In total, 81 FDA-approved drugs show in vitro effects (see Table S7).

• Clinical trials data. We built a binary dataset and assigned positive labels to drugs that are involved in clinical trial studies, and negative labels to all other drugs. Information for clinical trials studies was downloaded from ClinicalTrials.gov on December 1st, 2020 74 . Drugs were mapped to the DrugBank database 47 by matching their names (see Table S7).

• CMAP data. For the CMAP query, we used a COVID-19 signature by Ghandikota et al. 75 . This gave us a list of 106 genes upregulated and 41 genes downregulated in three different models of SARS-CoV-2 infection from transcriptomics data. Two are models in vitro (Calu-3 and Vero E6 Table S7).

The multiplicative learning algorithm for the matrix decomposition model

To minimise Equation (1) subject to non-negative constraints, we developed an efficient multiplicative learning algorithm inspired by the diagonally rescaled principle of non-negative matrix factorization 21 . The algorithm consists of iteratively applying the following multiplicative update rules:

Following the guidelines to implement NMF 76 , a small number $\epsilon = 10^{-8}$ was added to the denominators in Eq. 2 to prevent division by zero, and we initialised $P$ and $Q$ as random dense matrices uniformly distributed in the range $[0, 0.1]$. Furthermore, to avoid the well-known degeneracy 20 associated with the invariance of $PQ$ under the transformation $P \to P\Lambda$ and $Q \to \Lambda^{-1}Q$, for a diagonal matrix $\Lambda$, we normalised $P$ at each iteration as follows:

where $q_a$ denotes the $a$th row vector of $Q$.

The stopping criterion of our algorithm was based on the maximum tolerance of the relative change in the elements of $P$ and $Q$. The default value was tolX < $10^{-3}$, which was typically reached in about 1000 iterations for $k = 5$.
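For readers who want a feel for this kind of procedure, the sketch below implements the standard multiplicative updates for a generically weighted NMF objective. It is not a transcription of Equation (2): the per-iteration normalisation of $P$ is omitted, and the function name `weighted_nmf` and the single weight matrix `W` are assumptions. Only the initialisation range, the small constant in the denominators and the relative-change stopping rule follow the description above.

```python
import numpy as np

def weighted_nmf(Y, W, k=5, eps=1e-8, tol=1e-3, max_iter=5000, seed=0):
    """Weighted NMF via multiplicative updates (generic sketch, not Eq. 2).

    Minimises 0.5 * sum(W * (Y - P @ Q)**2) subject to P, Q >= 0, where W
    holds a non-negative confidence weight for every entry of Y.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    P = rng.uniform(0.0, 0.1, size=(n, k))     # random dense initialisation in [0, 0.1]
    Q = rng.uniform(0.0, 0.1, size=(k, m))
    WY = W * Y
    for _ in range(max_iter):
        P_old, Q_old = P.copy(), Q.copy()
        P = P * (WY @ Q.T) / ((W * (P @ Q)) @ Q.T + eps)
        Q = Q * (P.T @ WY) / (P.T @ (W * (P @ Q)) + eps)
        # Stop when the largest relative change in P and Q falls below tol.
        change = max(np.max(np.abs(P - P_old) / (P_old + eps)),
                     np.max(np.abs(Q - Q_old) / (Q_old + eps)))
        if change < tol:
            break
    return P, Q

# Toy check on an exactly rank-1 non-negative matrix with uniform weights.
Y = np.outer([1.0, 2.0, 3.0], [1.0, 0.5])
P, Q = weighted_nmf(Y, np.ones_like(Y), k=1)
```

Under the assumptions of this sketch, `W` could hold 1 for the most reliable associations, smaller values for earlier developmental stages, and a small weight for the zero entries, which mirrors the downweighting idea of the model without reproducing its exact loss.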

Using a similar procedure to Galeano et al. 56 , it can be easily shown that our algorithm in Equation (2) 

Having set all these hyperparameters, we performed a LOOCV on the test set corresponding to drug-virus associations that have been approved or are in phase IV of clinical trials. The model selection for the competitors was performed on the same validation sets (see details in Note S1).

The trained model that we used in the Evaluation section was obtained by training the model 1,000 times using all the available data with optimal hyperparameters. We then selected the solution that gave the lowest value in the loss function.

A PPI network is represented by a graph $G = (V, E)$, in which $V = \{1, 2, \ldots, n_V\}$ is the set of nodes (proteins), and $E$ is the set of links connecting the nodes (protein interactions). If the graph is weighted, then with each edge $e \in E$ we associate a non-negative real value $w(e)$. Let $H \subseteq V$ denote the set of host proteins. Our goal is then to perturb the sub-network induced by $H$, i.e., the host protein sub-network.

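Here we rely on different graph kernels described in the literature 36, 77, 35 . A graph kernel is a symmetric function $k : V \times V \to \mathbb{R}$ that is positive semidefinite; that is, for any $i, j \in V$ and any $c_i, c_j \in \mathbb{R}$, we have that

$$\sum_{i \in V} \sum_{j \in V} c_i c_j \, k(i, j) \geq 0.$$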

We can use it to define distances or similarities on a latent feature space.

More specifically, there exists a feature mapping $\varphi : V \to F$ such that $k(i, j) = \langle \varphi(i), \varphi(j) \rangle$ for all $i, j \in V$. A graph kernel can be represented by an $n_V \times n_V$ matrix $K$ whose elements are $K_{i,j} = k(i, j)$ for every $i, j \in V$. It is usually defined in terms of the Normalised Laplacian, which we explain below.
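As a worked illustration of how a kernel yields distances and similarities in the feature space, the following standard identities follow directly from the inner-product form $k(i, j) = \langle \varphi(i), \varphi(j) \rangle$. They are general properties of kernels and are not necessarily the specific quantities used by the authors.

```latex
% Squared distance between nodes i and j in the feature space F
d(i, j)^2 = \lVert \varphi(i) - \varphi(j) \rVert^2
          = K_{i,i} + K_{j,j} - 2 K_{i,j}

% Cosine similarity between the feature vectors of i and j
\cos(i, j) = \frac{K_{i,j}}{\sqrt{K_{i,i} \, K_{j,j}}}
```

Both quantities can be computed from the kernel matrix alone, without ever constructing the feature mapping explicitly.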

Let $W$ be an $n_V \times n_V$ matrix denoting the weighted adjacency matrix of $G$, that is,

$$W_{i,j} = w(e)$$

if there is an edge $e$ connecting $i$ and $j$, and $W_{i,j} = 0$ otherwise. If $G$ is unweighted, we assume that $w(e) = 1$ for every edge $e \in E$. Let $D$ denote an $n_V \times n_V$ diagonal matrix in which each diagonal element corresponds to the node degree, that is,

$$D_{i,i} = \sum_{j \in V} W_{i,j}.$$

The Laplacian is defined as $L = D - W$, and its pseudoinverse (Moore-Penrose inverse) is denoted by $L^{+}$. The Normalised Laplacian is defined as

$$\tilde{L} = I - D^{-1/2} W D^{-1/2},$$

where $I$ denotes the identity matrix.
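The matrices defined above can be computed in a few lines. The sketch below assumes a small dense, symmetric adjacency matrix; the function name `graph_laplacians` is hypothetical, and the convention for isolated nodes (their row and column in $D^{-1/2} W D^{-1/2}$ are set to zero) is an assumption, since the text does not discuss them.

```python
import numpy as np


def graph_laplacians(W):
    """Return (L, L_pinv, L_norm) for a symmetric weighted adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    deg = W.sum(axis=1)                      # weighted node degrees (diagonal of D)
    L = np.diag(deg) - W                     # Laplacian, D - W
    L_pinv = np.linalg.pinv(L)               # Moore-Penrose pseudoinverse
    inv_sqrt = np.zeros(n)
    connected = deg > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])  # assumed: 0 for isolated nodes
    # Normalised Laplacian, I - D^{-1/2} W D^{-1/2}
    L_norm = np.eye(n) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return L, L_pinv, L_norm


# Unweighted path graph on three nodes: 0 - 1 - 2.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L, L_pinv, L_norm = graph_laplacians(W)
print(L_norm)
```

For real interactomes with many thousands of proteins, a sparse representation would be preferable; the dense version is kept here only for readability.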

There are different ways to define K and we focus on five graph kernels 36, 77, 35 :

Regularised Laplacian, Diffusion Process, and p-Step Random Walk in terms of the normalised Laplacian 36 (see Table 1 ).
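Since Table 1 is not reproduced in this section, the sketch below uses the forms of these three kernels that are standard in the graph-kernel literature: $(I + \sigma^2 \tilde{L})^{-1}$ for the Regularised Laplacian, $\exp(-\sigma^2 \tilde{L} / 2)$ for the Diffusion Process, and $(aI - \tilde{L})^p$ for the p-Step Random Walk. The parametrisation in Table 1 may differ, and the function names and the parameter name `sigma2` are hypothetical.

```python
import numpy as np


def normalised_laplacian(W):
    """I - D^{-1/2} W D^{-1/2}; assumes every node has at least one edge."""
    W = np.asarray(W, dtype=float)
    d = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d[:, None] * W * d[None, :]


def regularised_laplacian_kernel(L_norm, sigma2=1.0):
    """K = (I + sigma^2 * L_norm)^{-1}."""
    n = L_norm.shape[0]
    return np.linalg.inv(np.eye(n) + sigma2 * L_norm)


def diffusion_kernel(L_norm, sigma2=1.0):
    """K = exp(-sigma^2 / 2 * L_norm), computed via the eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(L_norm)
    return (eigvecs * np.exp(-0.5 * sigma2 * eigvals)) @ eigvecs.T


def p_step_random_walk_kernel(L_norm, p=2, a=2.0):
    """K = (a * I - L_norm)^p, with p >= 1 and a >= 2."""
    n = L_norm.shape[0]
    return np.linalg.matrix_power(a * np.eye(n) - L_norm, p)


# Unweighted path graph on four nodes: 0 - 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
L_norm = normalised_laplacian(W)
kernels = {
    "regularised_laplacian": regularised_laplacian_kernel(L_norm),
    "diffusion": diffusion_kernel(L_norm),
    "p_step_random_walk": p_step_random_walk_kernel(L_norm, p=3),
}
for name, K in kernels.items():
    print(name, np.round(K[0], 3))
```

The eigenvalues of the normalised Laplacian lie in [0, 2], which is why $a \geq 2$ keeps $aI - \tilde{L}$, and hence every power of it, positive semidefinite.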

In the p-Step Random Walk, p ≥ 1 and a ≥ 2 are given parameters 36 .

The element $K_{i,j}$ measures how likely it is to go from node $i$ to node $j$ after $p$ steps in a random walk. If we generalise it to continuous time (infinitesimally small steps) and take an infinite number of steps, we obtain the Diffusion Process kernel.

The development timeline for treatments against emergent viral diseases can be significantly reduced by re-using drugs already available on the market, a concept known as drug repositioning.

We present two complementary machine learning approaches for drug repositioning that target SARS-CoV-2 and host factors, respectively. Our matrix decomposition approach exploits drug developmental information to predict the effectiveness of broad-spectrum antiviral drugs. Our graph kernel-based approach, rooted in ideas from network medicine, predicts which FDA-approved drugs are more likely to perturb the human subnetwork that is crucial for SARS-CoV-2 infection/replication. We also introduce CoREx, a freely available online tool that enables scientists to reason and formulate hypotheses about drug repurposing in the context of biological networks and pharmacological information.

While we have developed these methodologies for COVID-19, our approaches can be applied to any viral disease.

A matrix decomposition model for repurposing broad-spectrum antivirals

A graph kernel approach to model perturbations induced by drugs on the interactome

Graph kernels can integrate transcriptomics data to improve drug repurposing

CoREx: a free online tool to formulate hypotheses for drug repurposing for COVID-19

eTOC: We present two complementary machine learning approaches for drug repositioning against COVID-19 that target SARS-CoV-2 and its cellular processes in the host, respectively. Our matrix decomposition approach exploits drug developmental information to predict broad-spectrum antivirals; our graph kernel-based approach, rooted in ideas from network medicine, predicts which FDA-approved drugs are more likely to perturb the human subnetwork that is crucial for SARS-CoV-2 infection/replication. We also introduce CoREx, a freely available online tool to reason and formulate hypotheses about drug repurposing in the context of biological networks and pharmacological information.

We presented two approaches for drug repurposing. We suggest level 3 (Development/preproduction) for the matrix decomposition approach, and level 2 (Proof of concept) for the network medicine approach.

Drug repositioning: Identifying and developing new uses for existing drugs

Drug repurposing: progress, challenges and recommendations

Coronaviruses -drug discovery and therapeutic options

Therapeutic options for the 2019 novel coronavirus (2019-ncov)

Pharmacologic Treatments for Coronavirus Disease 2019 (COVID-19): A Review

Current Strategies of Antiviral Drug Discovery for COVID-19

The race for antiviral drugs to beat COVID -and the next pandemic

Network Medicine: A Networkbased Approach to Human Disease

Disease gene prediction for molecularly uncharacterized diseases

Molecular networks in Network Medicine: Development and applications

A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma

Network-Based Disease Module Discovery by a Novel Seed Connector Algorithm with Pathobiological Implications

Network medicine framework for identifying drug-repurposing opportunities for COVID-19

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease

A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles

Discovery and development of safe-in-man broad-spectrum antiviral agents

Performance of recommender algorithms on top-n recommendation tasks

Predicting the frequencies of drug side effects

Learning the parts of objects by non-negative matrix factorization

Algorithms for non-negative matrix factorization

Non-negative matrix factorization for drug repositioning: experiments with the repodb dataset

Matrix factorization-based technique for drug repurposing predictions

Indicator regularized nonnegative matrix factorization method-based drug repurposing for covid-19

Trends in clinical success rates and therapeutic focus

The relationships among various nonnegative matrix factorization methods for clustering

Associating genes and protein complexes with disease via network propagation

Prodige: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples

Network-based in silico drug efficacy screening

Network-based approach to prediction and population-based validation of in silico drug repurposing

Drug-target network

Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2

Medicinal chemistry strategies toward host targeting antiviral agents

Going the Distance for Protein Function Prediction: A New Distance Metric for Protein Interaction Networks

Learning Theory and Kernel Machines

Graph Kernels

Cancer module genes ranking using kernelized score functions

A Fast Ranking Algorithm for Predicting Gene Functions in Biomolecular Networks

Structural basis for the recognition of sars-cov-2 by full-length human ace2

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19

Discovery and preclinical validation of drug indications using compendia of public gene expression data

Food and Drug Administration

Food and Drug Administration, Drug development process

Underlying Mechanisms and Candidate Drugs for COVID-19 Based on the Connectivity Map Database

L1000 connectivity map interrogation identifies candidate drugs for repurposing as SARS-CoV-2 antiviral therapies

DrugBank 5.0: a major update to the DrugBank database for 2018

Artificial intelligence in COVID-19 drug repurposing

Discovery of SARS-CoV-2 antiviral drugs through large-scale compound repurposing

A reference map of the human binary protein interactome

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets

Detecting overlapping protein complexes in protein-protein interaction networks

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update

Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing

Artificial intelligence in covid-19 drug repurposing

Predicting the frequency of drug side effects

The efficacy and safety of favipiravir in treatment of covid-19: A systematic review and meta-analysis of clinical trials

Repurposed antiviral drugs for covid-19-interim who solidarity trial results

Treating covid-19-off-label drug use, compassionate use, and randomized clinical trials during pandemics

Compassionate use of remdesivir for patients with severe covid-19

A High-Content Screen for Mucin-1-Reducing Compounds Identifies Fostamatinib as a Candidate for Rapid Repurposing for Acute Lung Injury

Safety and Impact of Nasal Lavages During Viral Infections Such as SARS-CoV-2

In search of preventive strategies: novel high-cbd cannabis sativa extracts modulate ace2 expression in covid-19 gateway tissues

Prevention and therapy of COVID-19 via exogenous estrogen treatment for both male and female patients

A molecular single-cell lung atlas of lethal COVID-19

COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets

A pneumonia outbreak associated with a new coronavirus of probable bat origin

SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor

Cathepsin L plays a key role in SARS-CoV-2 infection in humans and humanized mice and is a promising target for new drug development

In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

NCBI GEO: archive for functional genomics data sets-update

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Secondary analysis of transcriptomes of SARS-CoV-2 infection models to characterize COVID-19

Algorithms and applications for approximate nonnegative matrix factorization

Diffusion Kernels

Kernel Methods in Computational Biology