key: cord-0909199-y9g8qru2 authors: Zhang, Yuchen; Lei, Xiujuan; Pan, Yi; Wu, Fang-Xiang title: Drug Repositioning with GraphSAGE and Clustering Constraints Based on Drug and Disease Networks date: 2022-05-10 journal: Front Pharmacol DOI: 10.3389/fphar.2022.872785 sha: 1d662396f02709891f8899aba63be355df737d6d doc_id: 909199 cord_uid: y9g8qru2

Understanding therapeutic properties is important in drug repositioning and drug discovery. However, characterizing the therapeutic properties of drugs through chemical or clinical trials is expensive and inefficient. Recently, artificial intelligence (AI)-assisted algorithms have received extensive attention for discovering the potential therapeutic properties of drugs and speeding up drug development. In this study, we propose a new method based on GraphSAGE and clustering constraints (DRGCC) to investigate the potential therapeutic properties of drugs for drug repositioning. First, the drug structure features and disease symptom features are extracted. Second, the drug–drug interaction network and disease similarity network are constructed according to the drug–gene and disease–gene relationships. Matrix factorization is adopted to extract the clustering features of the networks. Then, all the features are fed to GraphSAGE to predict new associations between existing drugs and diseases. Benchmark comparisons on two different datasets show that our method has reliable predictive performance and outperforms six competing methods. We have also conducted case studies on existing drugs and diseases and predicted drugs that may be effective against the novel coronavirus disease 2019 (COVID-19). Among the predicted anti-COVID-19 drug candidates, some are being clinically studied by pharmacologists, and their binding sites on COVID-19-related protein receptors have been found via molecular docking.

Traditional drug discovery is often based on a specific disease.
It generally has a number of stages, including target discovery, target validation, lead compound identification, lead optimization, preclinical drug development, advancing to clinical trials, and clinical trials. Typically, the development of an effective drug takes an average of 15 years and costs 800 million to 1.5 billion US dollars (Dudley et al., 2011; Yu et al., 2015). However, the success rate is often low because other indications that a drug could treat are not systematically evaluated, and because of the influence of lifestyle, disease development, and market factors. These difficulties have made pharmaceutical companies cautious about developing new drugs, and development is slow (Booth and Zemmel, 2004). It is well acknowledged in cheminformatics and the life sciences (Bader et al., 2008) that one drug may act on multiple target proteins and one target protein may be related to multiple diseases, which is the basis of drug repositioning. Drug repositioning brings significant benefits to drug research and to pharmaceutical companies. For example, minoxidil (Varothai and Bergfeld, 2014), a drug originally used to relieve hypertension and excessive tension, was later found to effectively treat symptoms such as hair loss. The antifungal and antitumor drug itraconazole (ITZ) can act as a broad-spectrum enterovirus inhibitor (Strating et al., 2015). However, this kind of drug repositioning mostly relies on accidental clinical discoveries and the experience of pharmacists, and it is difficult to scale to large investigations.

With the development of interdisciplinary technology, more and more researchers use computational methods to predict new indications for existing drugs. These methods mainly include network propagation, low-rank matrix approximation, and graph neural networks. Based on biological networks, similarity measures and a bi-random walk were proposed for drug repositioning (Luo et al., 2016). Yu et al.
combined miRNAs and group specificity to predict potential therapeutic drugs for breast cancer. A genome-wide positioning systems network algorithm was developed for drug repurposing. Fiscon et al. presented a new network-based algorithm, SAveRUNNER, and applied it to COVID-19 (Fiscon et al., 2021). However, because interactions between organisms are complex and noisy, the prediction accuracy of these existing methods remains insufficient.

Some methods were developed based on low-rank matrix approximation. Luo et al. proposed a drug repositioning recommendation system (DRRS) to predict novel drug indications based on low-rank matrix approximation and randomized algorithms (Luo et al., 2018). Wang et al. proposed a projection onto convex sets to relocate the functions of drugs. Weighted graph regularized matrix factorization was also used in drug response prediction (Guan et al., 2019). Wu et al. used meta paths and singular value decomposition to predict drug-disease associations (Wu et al., 2019). Yang et al. used a bounded nuclear norm regularization (BNNR) method to complete the drug-disease matrix (Yang et al., 2019). An improved drug repositioning approach using Bayesian inductive matrix completion was also proposed. Meng et al. used similarity-constrained probabilistic matrix factorization for drug repositioning and applied it to COVID-19. However, these matrix-based methods did not take the biochemical properties of drugs and diseases into consideration.

With the widespread application of artificial intelligence, more and more machine learning and deep learning methods are being applied to drug development and other fields of bioinformatics. A regularized kernel classifier was proposed to predict new drug-disease associations (Lu and Yu, 2018). Madhukar et al. used a Bayesian machine learning approach to identify drug targets with diverse data types (Madhukar et al., 2019). Huang et al.
proposed a network embedding-based method, CMFMTL, for predicting drug-disease associations. CMFMTL treats the problem as multi-task learning in which each task predicts one type of association, and two tasks complement and improve each other by capturing the relatedness between them. Zhu et al. constructed a drug knowledge graph for drug repurposing and transformed the information in the graph into inputs that allow machine learning models to predict drug repurposing candidates (Zhu et al., 2020). Zeng et al. developed a network-based deep learning approach, termed deepDR (Zeng et al., 2019), for in silico drug repurposing. Li et al. used molecular structures and clinical symptoms with a deep convolutional neural network to identify drug-disease associations (Li Z et al., 2019). A network embedding method called NEDD was proposed to predict novel associations between drugs and diseases using meta paths of different lengths. Graph convolutional network (GCN) methods have also been applied in medicine. A layer attention graph convolutional network (LAGCN) (Yu et al., 2020) fuses heterogeneous information into the GCN. A layer attention mechanism was introduced to combine embeddings from multiple graph convolution layers and further improve the prediction performance (Cai et al., 2021). Wang et al. also proposed a global graph feature learning method to predict associations (Wang et al., 2022). Meta path-based methods such as metapath2vec and metastructure have also been developed (Lei et al., 2021). Algorithms based on graph neural networks (GNNs) or graph embeddings consider both biochemical characteristics and network interactions, but they often have high time complexity and do not consider the characteristics of drug clusters or combination drugs.
At the same time, when extracting features of drug-disease associations, many methods simply concatenate drug features and disease features without considering the influence of different features. The feature representation of associations therefore needs to be improved.

Because existing methods cannot accurately predict potential drug-disease associations, and the network often cannot be changed after model training, we proposed DRGCC, a drug repositioning method based on network clustering constraints and GraphSAGE. First, we extracted the molecular structure features of drugs and the symptom features of diseases as biological attribute features. Second, we used the associations between drugs and genes, as well as the relationships between diseases and genes, to reconstruct a drug-drug interaction network and establish a disease similarity network. Third, we used a clustering algorithm to divide each of the two networks into clusters. The network clustering features of drugs and diseases were then obtained by matrix factorization, with the resulting cluster sets as constraints. Finally, we built two GraphSAGE models based on the drug and disease networks and fed the attribute and clustering features of drugs and diseases to the two models, respectively, to obtain the potential treatment probability of existing drugs for the diseases. The method was applied to the prediction of anti-COVID-19 drugs, and some case studies were conducted. The framework of DRGCC is shown in Figure 1. The main contributions of this work are summarized in the following two points:
1) DRGCC integrates the clustering features of networks, which can effectively improve the prediction accuracy of drug-disease associations.
2) DRGCC can embed new nodes in the existing network and predict their associations.
In addition, DRGCC is complementary to existing experimental methods and enables rapid and accurate discovery of drug candidates against COVID-19 and other emerging viral infectious diseases.

In this section, we introduce the databases used in the study and how they were processed. The known drug-disease associations were obtained, the drug-drug interaction network was reconstructed, and the disease similarity network was calculated. Attribute features and network features were also extracted. The purpose of our study is to predict potential associations from known drug-disease associations, which can be formulated as a classification problem. Therefore, we developed a GNN model based on GraphSAGE, which takes the drug and disease attribute features and clustering features as input and outputs the probability of a potential relationship between them.

Known drug and disease relationship data can be obtained from the Comparative Toxicogenomics Database (CTD) (Davis et al., 2021). CTD is a publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides manually curated information about drug compound-gene/protein interactions, drug compound-disease relationships, and gene-disease relationships. We first screened 36,392 drug-disease associations marked as therapeutic in CTD (version 2021.2.26), corresponding to 6,699 drugs and 2,472 diseases. To keep the dataset focused and make the method easier to verify, we retained drugs with therapeutic effects on more than 10 diseases and diseases treated by more than 10 drugs. We retrieved the corresponding PubChem Compound ID (CID) and PubChem Substance ID (SID) (Kim et al., 2021) for each drug compound. In the end, we extracted 780 drugs, 717 diseases, and 17,594 therapeutic associations. The known drug-disease association matrix is denoted by $Y$: if drug $i$ has a therapeutic effect on disease $j$, then $Y_{ij} = 1$; otherwise, $Y_{ij} = 0$.
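The filtering step and the construction of the association matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_association_matrix` and its input format are hypothetical, and the threshold is applied in a single pass over the unfiltered pairs (the text does not say whether the filter was iterated until stable).

```python
import numpy as np

def build_association_matrix(pairs, min_count=10):
    """Build a binary drug-disease matrix Y from (drug, disease) pairs.

    Keeps drugs linked to more than `min_count` diseases and diseases linked
    to more than `min_count` drugs. Degrees are counted once on the
    unfiltered pairs (an assumption; the paper does not specify this).
    """
    pairs = set(pairs)  # drop duplicate associations
    drug_deg, dis_deg = {}, {}
    for drug, dis in pairs:
        drug_deg[drug] = drug_deg.get(drug, 0) + 1
        dis_deg[dis] = dis_deg.get(dis, 0) + 1
    drugs = sorted(d for d, c in drug_deg.items() if c > min_count)
    diseases = sorted(d for d, c in dis_deg.items() if c > min_count)
    drug_idx = {d: i for i, d in enumerate(drugs)}
    dis_idx = {d: j for j, d in enumerate(diseases)}
    Y = np.zeros((len(drugs), len(diseases)), dtype=np.int8)
    for drug, dis in pairs:
        if drug in drug_idx and dis in dis_idx:
            Y[drug_idx[drug], dis_idx[dis]] = 1  # Y_ij = 1 for known therapy
    return Y, drugs, diseases
```

The returned `drugs` and `diseases` lists give the row and column order of `Y`, which is needed later to align the matrix with the feature matrices.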
In addition, we also considered HDVD, a relational database of viruses and drugs, which includes 34 viruses, 219 drugs, and 455 human drug-virus interactions. HDVD includes SARS-CoV-2, which has recently attracted much attention. The statistics of the two datasets are shown in Table 1.

It has long been known that drugs interact with one another. Some drug combinations can promote the cure of diseases. The interactions between drugs can also provide a basis for feature extraction and fusion of drugs. DrugBank (Wishart et al., 2018) provides a large number of drug-drug interactions (DDIs). We found 2,669,764 interactions in the database. We denote the drug-drug interaction matrix by $M_{DDI}$. Because of mismatched IDs, only 489 of the 780 drugs were mapped to DrugBank, with 56,439 interactions among them. Therefore, we aimed to use other biological properties of drugs to infer possible associations between drugs. The clinical relevance of drug-drug interactions also depends on the patient's genetic profile. Drug-drug-gene and drug-gene-gene interactions affect the therapeutic properties of drugs (Hahn and Roll, 2021). A method for calculating drug similarity using drug-gene associations was proposed by Groza et al. (2021). Inspired by these studies, we used the drug-gene relationships to complement the existing drug interactions. The CTD also provides the relationships between drug compounds and genes. We obtained 383,525 drug-gene relationships from it, covering 768 drugs and 34,184 genes. We denote the drug-gene association matrix by $M^{drug-gene}$; if drug $i$ is associated with gene $j$, then $M^{drug-gene}_{ij} = 1$; otherwise, $M^{drug-gene}_{ij} = 0$.
The reconstructed drug-drug interaction (RDDI) matrix $M_{RDDI}$ is calculated as follows: These associated genes often encode target proteins, and thus, we considered the relationship between drugs and target proteins, making the drug interaction network more complete.

There are also similarities between diseases, and many methods for calculating disease similarity have been developed in the literature. In studying the relationship between miRNAs and diseases, Cui et al. successively developed two versions of the method (Wang et al., 2010), both of which applied disease semantic similarity. All disease names follow the MeSH (Yu, 2018) database (https://www.nlm.nih.gov/mesh/meshhome.html). We obtained the semantic similarity matrix $M_{DS}$ of diseases according to the method of Wang et al. (2010). Unlike the Disease Ontology (Schriml et al., 2019), which builds only one overall semantic tree, MeSH divides diseases into 17 subcategories or sub-trees, so the calculated similarity is null for some pairs of diseases in different subcategories. Previous work has elucidated disease-gene associations (Li et al., 2021). Similar to reconstructing $M_{RDDI}$, we used disease-gene relationships to reconstruct the disease similarity network. The CTD contains 13,775,363 disease-gene relationships, which cover 715 diseases and 50,827 genes. The disease-gene association matrix is denoted by $M^{dis-gene}$. If disease $i$ is related to gene $j$, then $M^{dis-gene}_{ij} = 1$; otherwise, $M^{dis-gene}_{ij} = 0$. The reconstructed disease similarity matrix $M_{RDS}$ is calculated as follows: where $N_{gene}$ is the number of all genes.

The attribute features of drugs can be described by their structures. The PubChem system generates a binary substructure fingerprint for chemical structures. These fingerprints are used by PubChem for similarity neighboring and similarity searching (Kim et al., 2021).
The structure of a drug can be described by 881 substructures, where a substructure is a fragment of a chemical structure. The fingerprint is an ordered list of binary bits (0/1), and each bit is a Boolean value indicating whether a given substructure is present. Binary data are stored in one-byte increments. Therefore, the length of the fingerprint is 111 bytes (888 bits), including 7 padding bits at the end to complete the last byte. A four-byte prefix that stores the fingerprint bit length (881 bits) increases the size of the stored PubChem fingerprint to 115 bytes (920 bits).

To learn embeddings of drugs, we also used latent semantic analysis (Deerwester et al., 1990). Let $N_{sub}$ denote the number of substructures generated from all drugs. We employ a matrix $M^{drug-sub} \in \mathbb{R}^{N_{drug} \times N_{sub}}$, defined as follows:

$$M^{drug-sub}_{ij} = \mathrm{tf}(i, j) \cdot \mathrm{idf}(N_{drug}, j),$$

where $\mathrm{tf}(i, j)$ stands for the strength of the $i$-th drug having the $j$-th substructure. If substructure $j$ appears in drug $i$, then $\mathrm{tf}(i, j) = 1$; otherwise, $\mathrm{tf}(i, j) = 0$. The inverse document frequency $\mathrm{idf}(N_{drug}, j)$ results in lower weights for more common substructures and higher weights for less common substructures. This is consistent with the observation in information theory that rarer events generally carry more information. Then, the matrix $M^{drug-sub}$ was decomposed by singular value decomposition (SVD) into three matrices $R$, $\Sigma$, and $Q$, such that $M^{drug-sub} = R \Sigma Q$. $\Sigma \in \mathbb{R}^{N_{drug} \times N_{sub}}$ is a diagonal matrix holding the singular values of $M^{drug-sub}$, and $R$ is an $N_{drug} \times N_{drug}$ matrix in which each column $R_{\cdot j}$ is the left singular vector of $M^{drug-sub}$ corresponding to the singular value $\Sigma_{jj}$. Afterward, in order to embed the features into the low-dimensional space $\mathbb{R}^{d_{drug}}$, we extracted the vectors corresponding to the $d_{drug}$ largest singular values to form a new drug attribute feature matrix $F_{drug}$.

Disease attribute features are extracted in a similar way. Diseases are often accompanied by a large number of symptoms when they occur.
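The TF-IDF weighting and SVD embedding used for the drug attribute features (and, analogously, for the disease features described next) can be sketched as follows. This is a minimal sketch under assumptions: the function name `tfidf_svd_features` is hypothetical, the idf term uses the common form $\log(N / n_j)$ with $n_j$ the number of items containing term $j$ (the exact form is not given in the text), and the unscaled left singular vectors are returned as features.

```python
import numpy as np

def tfidf_svd_features(binary, dim):
    """Weight a binary item-by-term matrix with TF-IDF, then embed with SVD.

    binary: (n_items, n_terms) 0/1 array, e.g. drugs x substructures.
    dim:    number of leading singular directions to keep.
    """
    binary = np.asarray(binary, dtype=float)
    n_items = binary.shape[0]
    doc_freq = binary.sum(axis=0)                # items containing each term
    idf = np.zeros_like(doc_freq)
    present = doc_freq > 0
    # Assumed idf form: log(N / n_j); terms present in every item get weight 0.
    idf[present] = np.log(n_items / doc_freq[present])
    weighted = binary * idf                      # tf(i, j) * idf(N, j)
    R, sigma, Qt = np.linalg.svd(weighted, full_matrices=False)
    # numpy returns singular values in descending order, so the first `dim`
    # columns of R correspond to the largest singular values.
    return R[:, :dim]
```

With the dimensions reported later in the text, this would be called with `dim=300` for drugs and `dim=100` for diseases. Scaling the columns by the singular values is a common alternative and would only change the relative weight of each feature dimension.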
Zhou et al. established a disease-symptom network when studying the commonalities between diseases (Zhou et al., 2014). They gave 322 common symptoms for each disease, established a disease-symptom relationship matrix, and also weighted it with the term frequency-inverse document frequency method. We then applied SVD in the same way: the vectors corresponding to the $d_{dis}$ largest singular values form the disease attribute feature matrix $F_{dis}$ in the space $\mathbb{R}^{d_{dis}}$.

The previous section described the attribute features of drugs and diseases. However, the network features between drugs and diseases were not involved. Numerous studies have confirmed the modularity that exists between biomolecules (Groza et al., 2021). Matrix factorization, a commonly used low-rank matrix approximation method, can capture such structure when suitable constraints are added. Therefore, we used matrix factorization to measure the features of the relationships between drugs and diseases while accounting for the modularity of drugs and diseases. Two constraints were added to the matrix factorization: sparsity and clustering. For sparsity, it is desirable to obtain a basis matrix with fewer parameters that can still restore the original associations. It can be written as follows: where $U \in \mathbb{R}^{N_{drug} \times k}$ and $V \in \mathbb{R}^{k \times N_{dis}}$ are the feature matrices of drugs and diseases, $k$ is the embedded feature dimension, and $P$ is the observation matrix. In $P$, the elements corresponding to positive and negative samples are 1, and the other elements are 0. $\odot$ is the Hadamard product. For the clustering constraint, we first need to cluster the nodes in the drug network and the disease network. MCODE (Bader and Hogue, 2003) is a mature network clustering method that has been widely used in a variety of network analyses.
We used MCODE to cluster the reconstructed drug-drug interaction network and the disease similarity network. When features are extracted for the drug and disease networks, the embedded features should make drugs or diseases from different clusters more distinguishable. Using the Euclidean distance as the similarity measure between features, the matrix factorization subject to clustering constraints can be written as formula (6), where $c_{drug}$ and $c_{disease}$ are the numbers of drug and disease clusters, respectively; $\bar{U}_{all}$ ($\bar{V}_{all}$) is the average vector of all drug (disease) feature vectors; and $\alpha$ and $\beta$ are control parameters. Let $s_i$ ($s'_i$) be the number of nodes in the $i$-th drug (disease) cluster, so that $N_{drug} = s_1 + s_2 + \dots + s_{c_{drug}}$ and $N_{dis} = s'_1 + s'_2 + \dots + s'_{c_{disease}}$. The average of the feature vectors in the $i$-th cluster can then be calculated as

$$\bar{U}_i = \frac{1}{s_i} \sum_{x=1}^{s_i} U_i^{x}, \qquad \bar{V}_i = \frac{1}{s'_i} \sum_{x=1}^{s'_i} V_i^{x},$$

where $U_i^{x}$ ($V_i^{x}$) is the $x$-th feature vector of the $i$-th drug (disease) cluster. The matrix formed by the average vectors of all clusters, as well as $\bar{U}_{all}$ ($\bar{V}_{all}$), can be written in matrix form using the matrices $B_{drug}$ and $B_{disease}$ defined in formula (10). The clustering constraint term can therefore be expressed by formula (12), and the constrained matrix factorization in formula (6) is transformed accordingly into an objective $J(U, V)$. The partial derivatives of $J(U, V)$ with respect to $U$ and $V$ are then calculated. After $U$ and $V$ are randomly initialized, they are updated by iterative rules until the stopping condition is met, which yields the drug network clustering feature $U$ and the disease network clustering feature $V$.

GraphSAGE (SAmple and aggreGatE) (Hamilton et al., 2017) is a graph neural network model that makes two improvements to the original graph convolutional network (GCN) (Defferrard et al., 2016).
On the one hand, it uses neighbor sampling to turn GCN training from a full-graph procedure into node-centric mini-batch training, which makes distributed training on large-scale data possible. On the other hand, it extends the neighbor aggregation operation. In this study, we applied GraphSAGE to the drug-drug interaction network and the disease similarity network separately to obtain low-dimensional embedding vectors and made predictions through a simple neural network. The feature of each node $v$ in these networks is denoted by $x_v$, $v \in B$, where $B$ is a batch sample set. In each iteration, only the nodes in the batch set are trained. Assuming that the model has $L$ layers, a top-down sampling method is adopted for the nodes in the batch set, collecting $n_k$ nodes at each layer. The neighborhood sampling function $H_l$ of the $l$-th layer selects the $n_k$ most similar neighbors of each source node in $B$; $H_l(v)$ is the sampled neighbor set of node $v$ at the $l$-th layer. Sampling proceeds from $B^L$ to $B^0$, as shown in the sampling section of Algorithm 1. We then take the feature $h^0_u$ of each node $u$ in $B^0$ as the training feature. First, each node $v$ aggregates the representations of the nodes in its sampled neighborhood, $\{h^{l-1}_u, u \in H_l(v)\}$, into a single vector $h^{l}_{H_l(v)}$. After aggregating the neighboring feature vectors, GraphSAGE concatenates the node's current representation, $h^{l-1}_v$, with the aggregated neighborhood vector, $h^{l}_{H_l(v)}$. The concatenated vector is fed to a fully connected layer with a nonlinear activation function $\sigma$, which produces the representation $h^l_v$ used at the next step of the algorithm. The embedding generation for a given drug node is shown in the embedding section of Algorithm 1.
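The sample, aggregate, concatenate, and transform steps above can be sketched for a single layer as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names, the use of ReLU as $\sigma$, and the choice of the mean aggregator are assumptions, and the sketch takes the previous-layer representations of all nodes as given rather than building the nested batch sets $B^L, \dots, B^0$.

```python
import numpy as np

def sample_neighbors(sim, node, n_k):
    """Return the indices of the n_k most similar neighbors of `node`."""
    scores = sim[node].astype(float).copy()
    scores[node] = -np.inf                       # never sample the node itself
    return np.argsort(-scores)[:n_k]

def sage_mean_layer(h, sim, batch, W, b, n_k):
    """One GraphSAGE step with a mean aggregator for the nodes in `batch`.

    h:   (n_nodes, d_in) representations from the previous layer.
    sim: (n_nodes, n_nodes) similarity or interaction weights.
    W:   (2 * d_in, d_out) weight matrix; b: (d_out,) bias.
    """
    out = []
    for v in batch:
        neigh = sample_neighbors(sim, v, n_k)
        agg = h[neigh].mean(axis=0)              # aggregate sampled neighbors
        z = np.concatenate([h[v], agg]) @ W + b  # concatenate, then dense layer
        out.append(np.maximum(z, 0.0))           # ReLU (assumed nonlinearity)
    return np.vstack(out)
```

Stacking two such layers, with the output of the first used as `h` for the second, corresponds to the two-layer models used in the experiments. Because the weights depend only on feature dimensions and not on the number of nodes, the same trained layer can embed nodes added to the network later.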
Different aggregator functions can be used in the aggregation step, namely the Mean, MeanPool, MaxPool, GCN, and LSTM aggregators; in each, $W^l$ and $b^l$ are the parameter matrix and bias of the $l$-th layer, respectively. The final model outputs a low-dimensional embedding vector $z_v$ of node $v$. Since the aggregator in formula (19) is a linear approximation of local spectral convolution, it is called a GCN aggregator. Note that the LSTM is not inherently symmetric because it processes inputs sequentially. GraphSAGE adapts the LSTM to an unordered set by simply applying it to a random permutation of the neighbors. Unlike GCN, GraphSAGE can perform batch sampling and save the required neighbor features before the node feature aggregation operation. After training, GraphSAGE can embed newly added network nodes. In this way, the model effectively operates on a subnetwork formed by the sampled nodes, which speeds up learning and suits larger sample sets. In this study, the relationship prediction involves two types of nodes, and the number of samples, $N_{drug} \times N_{dis}$, is very large, so GraphSAGE is well suited to the task. The GraphSAGE minibatch forward propagation is described in Algorithm 1.

Algorithm 1. GraphSAGE minibatch forward propagation in drug-drug interaction or disease similarity network.

Specifically, we feed the drug attribute feature $F_{drug}$ and the drug network clustering feature $U$ to GraphSAGE to get the embedded features $z^{F}_{drug}$ and $z^{U}_{drug}$, and feed the disease attribute feature $F_{dis}$ and the disease network clustering feature $V$ to GraphSAGE to get the embedded features $z^{F}_{dis}$ and $z^{V}_{dis}$. We then concatenate the drug embedding features with the disease embedding features to obtain the association features, from which the model learns low-dimensional representations and predicts the relationships.
For example, to predict the association between drug $i$ and disease $j$, we form $\mathrm{concat}(z^{F}_{drug_i}, z^{U}_{drug_i}, z^{F}_{dis_j}, z^{V}_{dis_j})$, input it into a three-layer fully connected network, and finally use the SoftMax function to obtain the probability $P_{ij}$. GraphSAGE can also be trained in an unsupervised manner, but its unsupervised objective function is based entirely on the topological properties of the network and ignores the original features of the nodes. If it were applied in this study, each training run would need a different network. It reflects the relationships between node features well, but it does not predict associations well. Therefore, we used the cross-entropy function as the objective function. To prevent over-fitting, an L2 regularization term is also adopted. In the objective, $P_{ij}$ represents the predicted association probability of drug $i$ and disease $j$, $Y_{ij} \in \{0, 1\}$ is the known association, and $N_{drug}$ ($N_{dis}$) is the drug (disease) sample size.

Since no negative samples are given in the two databases, extracting reliable negative samples is also an important part of the experiment. The usual approach is to randomly select as many negative samples as positive samples from the unknown samples. However, this can interfere with model learning, because some unknown pairs are true associations, so we used the network double random walk (Xie et al., 2012) method to determine the negative samples. After the random walk, the unknown samples with the smallest scores, equal in number to the positive samples, are regarded as negative samples.

Based on previous works, we validate our method by answering the following questions:
• Are the features we extracted valid, and can network clustering features improve the performance of the method?
• Can DRGCC predict drug-disease associations with higher accuracy?
• Can we verify that the predicted repositioning drugs are effective, especially for COVID-19?
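Before turning to these questions, the training objective described in the method section (cross-entropy on the SoftMax output plus an L2 penalty) can be sketched as follows. This is a hedged sketch: the exact formula is not reproduced in this text, so the averaging over sampled pairs, the penalty weight `lam`, and the function names are assumptions rather than the authors' definitions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with a max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def regularized_cross_entropy(logits, labels, weights, lam=1e-4):
    """Mean binary cross-entropy over sampled pairs plus an L2 penalty.

    logits:  (n_pairs, 2) outputs of the last fully connected layer.
    labels:  (n_pairs,) known associations Y_ij in {0, 1}.
    weights: list of parameter arrays to regularize.
    lam:     penalty weight (hypothetical name and default value).
    """
    labels = np.asarray(labels, dtype=float)
    p = softmax(np.asarray(logits, dtype=float))[:, 1]  # P_ij for class 1
    p = np.clip(p, 1e-12, 1 - 1e-12)                    # avoid log(0)
    ce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return ce + lam * l2
```

Only pairs that are labeled positive or selected as negatives enter `logits` and `labels`; unknown pairs that were not sampled do not contribute to the loss.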
In our study, we used 5-fold cross-validation (5-fold CV) to evaluate the prediction performance of DRGCC and the competing methods. All samples were randomly divided into five equal-sized parts; four parts were used as training data, and the remaining part was used as test data. This process was repeated 5 times, with each part tested once, and the average over the 5 folds was taken as the result of one cross-validation. The samples were then randomly re-divided and the 5-fold CV was repeated, 5 times in total, and the results were averaged. We mainly used seven metrics to comprehensively evaluate the performance of the method: area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (PRAUC), F1_SCORE, ACCURACY, SPECIFICITY, PRECISION, and RECALL (Yu et al., 2020). We took the prediction threshold that maximizes the F1_SCORE and built two-layer GraphSAGE models for drugs and diseases separately.

After further statistical analysis of the drug and disease features, we set some default parameters. The attribute feature dimension $d_{drug}$ of drugs was set to 300, while the attribute feature dimension $d_{dis}$ of diseases was set to 100. In the constrained matrix factorization, the control parameters $\alpha$ and $\beta$ have an important influence on the extraction of network clustering features. We tested combinations of $\alpha$ and $\beta$, as shown in Figure 2A, and found that the method achieves the best AUC value on the CTD dataset with $\alpha = 0.2$ and $\beta = 0.1$. At the same time, since DRGCC is sampled and trained in batches, the batch size is particularly important. If the batch is too small, training is difficult to converge. If the batch is too large, it demands a large amount of computation. We tested the effect of different batch_size values on the method, as shown in Figure 2B. The method performs best when batch_size is equal to 128.
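The evaluation protocol described above, repeated 5-fold cross-validation with an F1-maximizing decision threshold, can be sketched as follows. This is a minimal sketch with hypothetical function names; how the authors split the samples and searched for the threshold is not specified, so the random permutation split and the search over observed score values are assumptions.

```python
import numpy as np

def repeated_kfold_indices(n_samples, k=5, repeats=5, seed=0):
    """Yield (train_idx, test_idx) for `repeats` random k-fold partitions."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(n_samples), k)
        for i in range(k):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train, folds[i]

def best_f1_threshold(y_true, scores):
    """Return (threshold, F1) maximizing F1 over the observed score values."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t                       # predict positive at or above t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1
```

With the defaults `k=5` and `repeats=5`, the generator yields 25 train/test splits, which matches the 25 AUC values per setting summarized in the parameter analysis.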
GraphSAGE offers five different aggregation methods, which we compared on the CTD and HDVD datasets. As shown in Figure 3, the mean, MeanPool, and MaxPool aggregators perform similarly and markedly better than the LSTM and GCN aggregators. This may indicate that structural features of drugs and symptom features of diseases can be fused using linear methods. We therefore used the mean aggregator in DRGCC. We also evaluated the number of sampled network neighbors. Similar to Cui et al. (2021), we tested 4 cases, $n_k \in \{3, 5, 10, 15\}$, and determined that it is better to take the nearest 5 neighbor nodes as aggregation nodes. Figure 4 shows the distribution of AUC values over the 25 runs of the five repeated 5-fold cross-validations. This test was run on the HDVD network because it is sparser; the test on the CTD dataset shows a similar effect.

Frontiers in Pharmacology | www.frontiersin.org May 2022 | Volume 13 | Article 872785 8

To answer the first question of the experiment, we conducted ablation experiments that use only attribute features (DRGCC_Attribute) or only network clustering features (DRGCC_Cluster) for prediction. Table 2 shows that the model with only clustering features scores slightly higher than the model with only attribute features, and that fusing the two features has a prominent effect on the CTD database. Figure 5 depicts the ROC curves of a 5-fold cross-validation together with their average. Applying both features in DRGCC clearly performs better than using either one alone, with an AUC as high as 0.9809. Of the two, the network clustering feature contributes more to improving the performance of the method.
To answer the second question of the experiment, we compared DRGCC with six state-of-the-art drug repositioning methods on the CTD and HDVD datasets, namely MBiRW (Luo et al., 2016), DRRS (Luo et al., 2018), BNNR, SCPMFDR, NIMCGCN, and LAGCN (Yu et al., 2020). These methods fall into three categories: methods based on network propagation, methods based on low-rank matrix approximation, and methods based on the GNN.
• MBiRW (Luo et al., 2016) integrates drug or disease feature information with known drug-disease associations and develops comprehensive similarity measures for drugs and diseases. The similarities are incorporated into a heterogeneous network with known drug-disease interactions. On this drug-disease heterogeneous network, the bi-random walk (BiRW) algorithm is used to identify potential novel indications for a given drug.
• DRRS (Luo et al., 2018) is a matrix completion-based recommendation system on a drug-disease heterogeneous network to predict drug-disease associations.
• BNNR is a bounded nuclear norm regularization method to complete a drug-disease heterogeneous network.
• SCPMFDR is implemented on an adjacency matrix of a heterogeneous drug-virus network, which integrates the known drug-virus interactions, drug chemical structures, and virus genomic sequences. SCPMF projects the drug-virus interaction matrix into two latent feature matrices for the drugs and viruses, which reconstruct the drug-virus interaction matrix when multiplied together, and then introduces similarity-constrained probabilistic matrix factorization to predict associations.
• NIMCGCN uses GCNs to learn latent feature representations of miRNAs and diseases from the similarity networks and then feeds the learned features into a neural inductive matrix completion model to obtain a reconstructed association matrix.
NIMCGCN is a GCN-based method proposed for miRNA-disease association prediction, and we adopt it as a baseline for drug-disease association prediction.
• LAGCN (Yu et al., 2020) integrates the known drug-disease associations, drug-drug similarities, and disease-disease similarities into a heterogeneous network and applies the graph convolution operation to the network to learn the embeddings of drugs and diseases. It combines the embeddings from multiple graph convolution layers using an attention mechanism.
Table 3 shows that our method outperforms the other methods on all 7 metrics for the CTD database. In large networks, methods that consider only the relationships between nodes and ignore the biochemical properties of the nodes themselves predict poorly. For drug-virus prediction, HDVD provides disease similarity networks based on amino acid sequences and drug similarity networks based on structure. Since the original dataset contains no virus features, we used DRGCC_Cluster for this prediction problem. As shown in Table 4, the results are slightly lower than those on the CTD because of the large number of unknown relationships and the inclusion of the new virus SARS-CoV-2. Except for recall, DRGCC achieves the highest value on every evaluation metric. The AUC reaches 0.9222, and the PRAUC reaches 0.9458. These results indicate that DRGCC performs well on both datasets.
To answer the third question of the experiment, we analyzed the predicted repositioned drugs. The top 10 predicted drug-disease relationships are listed in Table 5. For 6 of these 10 predictions, we found corroboration or explanations in other studies. Early evidence in rats suggested that acetazolamide may inhibit sodium and water transport in the ileum in addition to inhibiting bicarbonate secretion (Sladen, 1973), so it may have an influence on duodenal ulcer treatment.
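The extraction of top-ranked predictions mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the model's output is available as a mapping from (drug, disease) pairs to scores, that already known associations are excluded before ranking, and all identifiers and scores in the toy data are hypothetical placeholders rather than results from the paper.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def top_k_novel_pairs(scores: Dict[Pair, float],
                      known: Set[Pair],
                      k: int = 10) -> List[Tuple[str, str, float]]:
    """Return the k highest-scoring (drug, disease) pairs not already known.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    candidates = [(score, drug, disease)
                  for (drug, disease), score in scores.items()
                  if (drug, disease) not in known]
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(drug, disease, score) for score, drug, disease in candidates[:k]]


# Hypothetical scores for two drugs and two diseases.
scores = {
    ("drug_A", "disease_X"): 0.95,
    ("drug_A", "disease_Y"): 0.40,
    ("drug_B", "disease_X"): 0.80,
    ("drug_B", "disease_Y"): 0.99,
}
# Known associations are removed so that only new candidates are ranked.
known = {("drug_B", "disease_Y")}

top = top_k_novel_pairs(scores, known, k=2)
for drug, disease, score in top:
    print(f"{drug} -> {disease}: {score:.2f}")
```

Excluding known pairs first matters here: the highest raw score in the toy data belongs to an association that is already recorded, so it would otherwise crowd out a genuinely new candidate.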
Rimonabant was shown to be safe and effective in treating the combined cardiovascular risk factors of smoking and obesity (Cleland et al., 2004). Hypoosmolar hyponatremia occurs in conditions of plasma volume depletion such as cirrhosis and heart failure and in syndromes of inappropriate antidiuretic hormone secretion. Conventional treatments proposed for euvolemic and hypervolemic hyponatremia include lithium carbonate (Gross, 2008). Peyrani and Ramirez suggested that therapeutics beyond antibiotics (e.g., heparin or aspirin) may be indicated during and after hospitalization for patients with community-acquired pneumonia (Peyrani and Ramirez, 2013). Newer antiemetics with prokinetic properties (cisapride) have also been introduced in the management of
The novel coronavirus disease 2019 (COVID-19) pandemic has triggered a massive health crisis and upended economies across the globe. However, developing new medicines for the new coronavirus through the traditional pipeline is very expensive in terms of time, manpower, and funds. Drug repurposing emerged as a promising therapeutic strategy during the COVID-19 crisis. We also predicted the top 10 candidate anti-COVID-19 drugs, as shown in Table 6. Seven of them have been reported by medical researchers. For example, triazavirin is a guanine nucleotide analog antiviral that has shown efficacy against influenza A and B, including the H5N1 strain. Given the similarities between SARS-CoV-2 and H5N1, health scientists are investigating triazavirin as an option to combat COVID-19 (Shahab and Sheikhi, 2021; Valiulin et al., 2021). Diseases caused by Aspergillus range from allergic syndromes to chronic lung disease and invasive infections and are frequently observed following COVID-19 infection. Posaconazole has better efficacy with less toxicity for extensive infection and severe immunosuppression (Cadena et al., 2021). In the reports on possible drugs for COVID-19, Uddin et al.
mentioned that mefloquine may be one of the options (Uddin et al., 2021). Solanich et al. considered that methylprednisolone and tacrolimus might benefit COVID-19 patients progressing to severe pulmonary failure and systemic hyperinflammatory syndrome (Solanich et al., 2021). Molnupiravir (EIDD-2801) was originally designed for the treatment of alphavirus infections. Painter et al. described its evolution into a potential drug for the prevention and treatment of COVID-19 (Painter et al., 2021). Umifenovir was deemed one of the most promising antiviral agents for improving the health of COVID-19 patients (Trivedi et al., 2020). Lai et al. showed that the use of mycophenolic acid might be a strategy to reduce viral replication (Lai et al., 2020).
In addition, we analyzed the docking of the unverified drugs to a receptor. Angiotensin-converting enzyme 2 (ACE2) is considered an important functional receptor for SARS and other coronaviruses (Li et al., 2003). Like SARS-CoV, SARS-CoV-2 infects human respiratory epithelial cells through invasion mediated by the viral spike (S) protein and the ACE2 receptor on the human cell surface. Blocking the binding of the virus to ACE2 has become one of the effective means of preventing respiratory infection by the coronavirus. Molecular docking allows us to identify the binding sites and estimate the binding strengths between molecules (Meng et al., 2011). We examined the binding of 4 drug compounds, triazavirin, posaconazole, lopinavir, and cenicriviroc, to the receptor protein ACE2. As shown in Figure 6, triazavirin forms 4 hydrogen bonds with the ACE2 amino acid residues ILE and ASP. Lopinavir forms 2 hydrogen bonds with the residue ARG in ACE2. Posaconazole and cenicriviroc also have binding sites on ACE2. Thus, only one of the 3 unreported drugs remains uncorroborated.
These drugs may therefore offer some help in the treatment of COVID-19.
In this article, we have proposed a drug repositioning method, DRGCC, to predict potential relationships between existing drugs and new diseases. The method first reconstructs the drug-drug interaction network and establishes the disease semantic similarity network, then extracts the structural features of drugs and the symptom features of diseases as attribute features, and obtains network clustering features through matrix factorization. Finally, all features are fed to the GraphSAGE model to predict drug-disease associations. Experiments on two datasets show that our method performs better than the competing methods. The experiments also demonstrate the importance of network clustering features for accurate prediction. In addition, DRGCC is suitable for training and prediction on large-scale samples and can add new nodes, such as the SARS-CoV-2 virus, to the network after training. After analyzing the predicted repositioned drugs, we gave several possible drug treatment combinations and recommended several anti-COVID-19 drugs. These predictions have been supported or discussed by other studies, which suggests that DRGCC is reasonably reliable for drug repositioning studies.
Publicly available datasets were analyzed in this study. These data can be found here: http://ctdbase.org/; https://github.com/luckymengmeng/HDVD; https://go.drugbank.com/; https://pubchem.ncbi.nlm.nih.gov/.
FIGURE 6 | Ligand-protein binding mode between the predicted drugs and the protein receptor ACE2. The purple part is the protein ACE2, the blue part is the drug compound, the yellow part is the amino acid residue, and the orange dotted line is the connecting hydrogen bond. The numbers represent atomic distances.
An Automated Method for Finding Molecular Complexes in Large Protein Interaction Networks
Interaction Networks for Systems Biology
Prospects for Productivity
Aspergillosis: Epidemiology, Diagnosis, and Treatment
Drug Repositioning Based on the Heterogeneous Information Fusion Graph Convolutional Network
A Genome-wide Positioning Systems Network Algorithm for In Silico Drug Repurposing
Clinical Trials Update and Cumulative Meta-Analyses from the American College of Cardiology: WATCH, SCD-HeFT, DINAMIT, CASINO, INSPIRE, STRATUS-US, RIO-Lipids and Cardiac Resynchronisation Therapy in Heart Failure
Drug Repurposing against Breast Cancer by Integrating Drug-Exposure Expression Profiles and Drug-Drug Links Based on Graph Neural Network
Comparative Toxicogenomics Database (CTD): Update 2021
Indexing by Latent Semantic Analysis
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Exploiting Drug-Disease Relationships for Computational Drug Repositioning
SAveRUNNER: A Network-Based Algorithm for Drug Repurposing and its Application to COVID-19
Treatment of Hyponatremia
Drug Repurposing Using Modularity Clustering in Drug-Drug Similarity Networks Based on Drug-Gene Interactions
Anticancer Drug Response Prediction in Cell Lines Using Weighted Graph Regularized Matrix Factorization
The Influence of Pharmacogenetics on the Clinical Relevance of Pharmacokinetic Drug-Drug Interactions: Drug-Gene, Drug-Gene-Gene and Drug-Drug-Gene Interactions
Inductive Representation Learning on Large Graphs
Predicting Drug-Disease Associations via Multi-Task Learning Based on Collective Matrix Factorization
PubChem in 2021: New Data Content and Improved Web Interfaces
SARS-CoV2 and Immunosuppression: A Double-Edged Sword
Clinical Pharmacokinetics of Drugs Used in the Treatment of Gastrointestinal Diseases (Part I)
Predicting CircRNA-Disease Associations Based on Improved Weighted Biased Meta-Structure
Neural Inductive Matrix Completion with Graph Convolutional Networks for miRNA-Disease Association Prediction
MISIM v2.0: a Web Server for Inferring microRNA Functional Similarity Based on microRNA-Disease Associations
Angiotensin-converting Enzyme 2 Is a Functional Receptor for the SARS Coronavirus
Evaluating Disease Similarity Based on Gene Network Reconstruction and Representation
Identification of Drug-Disease Associations Using Information of Molecular Structures and Clinical Symptoms via Deep Convolutional Neural Network
DR2DI: a Powerful Computational Tool for Predicting Novel Drug-Disease Associations
Computational Drug Repositioning Using Low-Rank Matrix Approximation and Randomized Algorithms
Drug Repositioning Based on Comprehensive Similarity Measures and Bi-random Walk Algorithm
A Bayesian Machine Learning Approach for Drug Target Identification Using Diverse Data Types
Molecular Docking: a Powerful Approach for Structure-Based Drug Discovery
Drug Repositioning Based on Similarity Constrained Probabilistic Matrix Factorization: COVID-19 as a Case Study
Constructing Disease Similarity Networks Based on Disease Module Theory
Developing a Direct Acting, Orally Available Antiviral Agent in a Pandemic: the Evolution of Molnupiravir as a Potential Treatment for COVID-19
What Is the Association of Cardiovascular Events with Clinical Failure in Patients with Community-Acquired Pneumonia?
Human Disease Ontology 2018 Update: Classification, Content and Workflow Expansion
Triazavirin - Potential Inhibitor for 2019-nCoV Coronavirus M Protease: A DFT Study
The Pathogenesis of Cholera and Some Wider Implications
Pragmatic, Open-Label, single-center, Randomized, Phase II Clinical Trial to Evaluate the Efficacy and Safety of Methylprednisolone Pulses and Tacrolimus in Patients with Severe Pneumonia Secondary to COVID-19: The TACROVID Trial Protocol
Itraconazole Inhibits Enterovirus Replication by Targeting the Oxysterol-Binding Protein
SBU Systematic Review Summaries
Possible Treatment and Strategies for COVID-19: Review and Assessment
Potential Drugs for the Treatment of COVID-19: Synthesis, Brief History and Application
Aerosol Inhalation Delivery of Triazavirin in Mice: Outlooks for Advanced Therapy against Novel Viral Infections
Androgenetic Alopecia: An Evidence-Based Treatment Update
Inferring the Human microRNA Functional Similarity and Functional Network Based on microRNA-Associated Diseases
Predicting Microbe-Disease Association Based on Heterogeneous Network and Global Graph Feature Learning
DrPOCS: Drug Repositioning Based on Projection onto Convex Sets
DrugBank 5.0: a Major Update to the DrugBank Database
Prediction of Drug-Disease Associations Based on Ensemble Meta Paths and Singular Value Decomposition
Prioritizing Disease Genes by Bi-random Walk
Inductive Representation Learning on Temporal Graphs
Drug Repositioning Based on Bounded Nuclear Norm Regularization
Using Meshes for MeSH Term Enrichment and Semantic Analyses
Inferring Drug-Disease Associations Based on Known Protein Complexes
Predicting Potential Drugs for Breast Cancer Based on miRNA and Tissue Specificity
Predicting Drug-Disease Associations through Layer Attention Graph Convolutional Network
deepDR: a Network-Based Deep Learning Approach to In Silico Drug Repositioning
DRIMC: an Improved Drug Repositioning Approach Using Bayesian Inductive Matrix Completion
CircRNA-disease Associations Prediction Based on Metapath2vec++ and Matrix Factorization
NEDD: a Network Embedding Based Method for Predicting Drug-Disease Associations
Human Symptoms-Disease Network
Knowledge-driven Drug Repurposing Using a Comprehensive Drug Knowledge Graph
YZ and XL proposed the concept and idea; YZ implemented the algorithm and wrote the draft manuscript; F-XW provided the method improvement strategy; XL, F-XW, and YP evaluated the results and revised the manuscript; and YP and F-XW supervised the whole study. All authors read and approved the final manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.