key: cord-0691816-xzh6ihjk authors: Chen, Hailin; Zhang, Zuping; Zhang, Jingpu title: In silico drug repositioning based on the integration of chemical, genomic and pharmacological spaces date: 2021-02-08 journal: BMC Bioinformatics DOI: 10.1186/s12859-021-03988-x sha: 8aee9abb45ce5b46759744d714045dea03bc9976 doc_id: 691816 cord_uid: xzh6ihjk BACKGROUND: Drug repositioning refers to the identification of new indications for existing drugs. Drug-based inference methods for drug repositioning apply some unique features of drugs for new indication prediction. Complementary information is provided by these different features. It is therefore necessary to integrate these features for more accurate in silico drug repositioning. RESULTS: In this study, we collect 3 different types of drug features (i.e., chemical, genomic and pharmacological spaces) from public databases. Similarities between drugs are separately calculated based on each of the features. We further develop a fusion method to combine the 3 similarity measurements. We test the inference abilities of the 4 similarity datasets in drug repositioning under the guilt-by-association principle. Leave-one-out cross-validations show the integrated similarity measurement IntegratedSim receives the best prediction performance, with the highest AUC value of 0.8451 and the highest AUPR value of 0.2201. Case studies demonstrate IntegratedSim produces the largest numbers of confirmed predictions in most cases. Moreover, we compare our integration method with 3 other similarity-fusion methods using the datasets in our study. Cross-validation results suggest our method improves the prediction accuracy in terms of AUC and AUPR values. CONCLUSIONS: Our study suggests that the 3 drug features used in our manuscript are valuable information for drug repositioning. The comparative results indicate that integration of the 3 drug features would improve drug-disease association prediction. 
Our study provides a strategy for the fusion of different drug features for in silico drug repositioning.
or adverse side-effects could further prevent testing drugs from entering clinical trials. Therefore, improving research and development (R&D) productivity has become the most important priority for the global pharmaceutical industry [2]. Drug repositioning [3], which aims to find new indications for approved or investigational drugs, has emerged as an important alternative to traditional drug discovery. Because it uses de-risked compounds, drug repositioning has the potential to reduce development time and increase the success rate compared with developing an entirely new drug for disease treatment [4]. Some successful examples of drug repositioning have been reported. A well-known instance is sildenafil, which was repurposed from an antihypertensive drug to a treatment for erectile dysfunction. Existing antivirals, such as baloxavir, azvudine and darunavir, are being repurposed to fight the current COVID-19 pandemic [5].
With the accumulation of biomedical data, computational approaches that exploit multi-source information for drug repositioning have been continuously proposed. These methods can be roughly categorized as drug-based and disease-based (see Review [28] for more details). Drug-based approaches are preferred when rich chemical or pharmacological data for drugs are available. For example, under the principle that chemically similar drugs are likely to share biological activity, Keiser et al. [7] applied a similarity ensemble approach (SEA) to evaluate the 2D structural similarity of drugs and identify new drug-target interactions for drug repositioning. Based on the hypothesis that two drugs share the same mechanism of action (MoA) if they induce the same side effects, Yang and Agarwal [8] used clinical side-effects of drugs as features to build Naive Bayes models that predict drug indications for diseases.
Because protease is a common target in SARS-CoV-2, HIV-1 and hepatitis C virus (HCV) strains, FDA-approved HIV-1 protease inhibitors and HCV protease inhibitors have been screened as potentially effective drugs against COVID-19 [27]. Considering that a drug usually acts on multiple targets, Rutherford et al. [14] extracted drug-disease associations for drug repositioning using the interactions between disease-related genes and drug targets.
These methods apply different drug features to address the drug repositioning problem from different angles. Generally, drug-based approaches compare some unique signature of one drug against that of another. The signature of a drug is mainly derived from three categories of data: chemical structures, genomic data and adverse event profiles. Collection bias and noise may exist in these data, and some of them are incomplete. Meanwhile, the different types of data carry complementary information. It is therefore necessary to combine them for a comprehensive understanding of a drug's MoA. However, how to integrate these different kinds of data to improve in silico drug repositioning remains an open question.
In this paper, we first collect 3 types of drug data (i.e., drug substructures, drug targets and drug side-effects) from public databases. Drug-drug similarities are then calculated separately from each of the three types of features. We propose a propagation-based method to integrate the three similarity measurements. Under the guilt-by-association principle, we finally test the ability of each similarity measurement to infer drug-disease associations for drug repositioning. Experimental results based on cross-validations and case studies show that the integrated similarity measurement outperforms each of the 3 individual similarity measurements. We also compare our fusion method with 3 state-of-the-art similarity-integration methods, and our method shows superior prediction performance in drug repositioning.
To evaluate the prediction performance of the 4 similarity measurements, we implement leave-one-out cross-validations (LOOCVs) on the 548 drugs. Each drug in turn is treated as a new drug and held out as the testing data: we remove all of its associated diseases from our dataset, and the remaining 547 drugs, with their indication information and similarity measurement, are taken as the training data. For each testing drug, we rank all candidate diseases according to the scores derived from Eq. (8) (see "Methods"). When the score of a predicted association exceeds a given threshold, we consider it a positive prediction; otherwise, a negative prediction. True positive rate (TPR), false positive rate (FPR), precision (P) and recall (R) are calculated at varying thresholds to plot ROC and PR curves. The area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) are computed for performance comparison. Furthermore, we conduct comprehensive drug-disease association predictions using all known information as the training set, and we analyse the top-ranked results for the 548 drugs by searching for evidence in public databases.
We report in Table 1 the average AUC and AUPR values obtained by LOOCVs on the 548 drugs for the 4 similarity measurements. As shown in Table 1, IntegratedSim receives the highest average AUC and AUPR values and performs best among the 4 similarity measurements. The average AUC value for IntegratedSim is 0.0659, 0.0310 and 0.0536 higher than those of the other 3 measurements, respectively. Meanwhile, the average AUPR value for IntegratedSim is 0.1474, 0.0586 and 0.1289 higher than those of the other 3 measurements, respectively. The overall results of LOOCVs for all 4 similarity measurements are illustrated by ROC curves and PR curves in Figs. 1 and 2, respectively.
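The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `roc_pr_areas` and `loocv` and the `score_fn` callback are hypothetical, the AUPR is accumulated step-wise (one of several common conventions), and drugs whose held-out labels are all 0 or all 1 are skipped because the ROC curve is undefined for them.

```python
def roc_pr_areas(scores, labels):
    """Return (AUC, AUPR) for one drug's ranked candidate diseases.

    scores: predicted score per disease; labels: 1 for a held-out true
    indication, 0 otherwise. Thresholds are swept over the distinct scores.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    tp = fp = 0
    auc = aupr = 0.0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Everything tied at this score crosses the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg      # TPR is the same as recall
        precision = tp / (tp + fp)
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2   # trapezoid under ROC
        aupr += (tpr - prev_tpr) * precision             # step-wise area under PR
        prev_tpr, prev_fpr = tpr, fpr
    return auc, aupr


def loocv(assoc, score_fn):
    """Leave-one-out over drugs; returns (mean AUC, mean AUPR).

    assoc: list of 0/1 rows, one row per drug, one column per disease.
    score_fn(i, train): hypothetical scorer returning one score per disease
    for drug i, given `train`, a copy of assoc with row i blanked out.
    """
    aucs, auprs = [], []
    for i, truth in enumerate(assoc):
        if sum(truth) in (0, len(truth)):
            continue  # no positives or no negatives: curves undefined
        train = [row[:] for row in assoc]
        train[i] = [0] * len(truth)            # hide the test drug's indications
        auc, aupr = roc_pr_areas(score_fn(i, train), truth)
        aucs.append(auc)
        auprs.append(aupr)
    return sum(aucs) / len(aucs), sum(auprs) / len(auprs)
```

In practice `score_fn` would wrap whichever similarity measurement is being tested, so the same loop can be reused for all 4 measurements.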
We conduct paired t-tests to assess whether the AUC and AUPR values obtained by IntegratedSim across the 548 drugs are significantly higher than those obtained with the other 3 datasets. The calculated p-values are listed in Table 2. The statistical results show that IntegratedSim achieves significantly better performance than all the other 3 measurements at the significance level of 0.05.
We show the precision and recall values across the 548 drugs for the 4 similarity datasets within the top k (k = 5, 10, 15 and 20) candidates in Figs. 3 and 4, respectively. Higher precision and recall within the top k predictions indicate that more real drug indications are successfully inferred, and the two figures show that IntegratedSim consistently outperforms the other 3 measurements at the different k cutoffs.
Our similarity fusion method has two parameters, k and t. Here k is the number of neighbours (not the top-k cutoff used above) and t is the number of iterations. We vary both values over the range [1, 30] and list the average AUC and AUPR values in Tables 3 and 4, respectively. The 2 tables show that the best inference performance is achieved when both parameters are set to 5.
After this extensive comparison, we choose the best-performing similarity measurement, IntegratedSim, to conduct comprehensive drug-disease association predictions. In this inference process, all known information, including associations and the similarity measurement, is used as the training set. We rank the unknown pairs according to their scores derived from Eq. (8). The list of the top 20 predicted results can be found in Additional file 1. We check the top 20 predicted results against the public database CTD [29], a knowledgebase that contains information on chemicals, genes, phenotypes, diseases and exposures to advance our understanding of human health.
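The top-k precision and recall used above can be computed as in the sketch below. The function name `precision_recall_at_k` is hypothetical, and the sketch assumes precision is divided by the number of candidates actually returned, which equals k whenever at least k candidate diseases exist.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall among the k highest-scoring candidate diseases.

    scores: predicted score per disease for one drug.
    labels: 1 if the disease is a known (held-out) indication, else 0.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k best-scoring diseases (ties keep their input order).
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    hits = sum(labels[j] for j in top)
    n_pos = sum(labels)
    precision = hits / len(top) if top else 0.0
    recall = hits / n_pos if n_pos else 0.0
    return precision, recall
```

Averaging these two values over all drugs at k = 5, 10, 15 and 20 gives the kind of curves summarised in the precision and recall figures.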
Literature-based drug-disease associations are downloaded from this database to validate our predictions. For the results predicted with IntegratedSim, we find that 158, 612, 1006 and 1575 predictions from the top 1, top 5, top 10 and top 20 results for the 548 drugs are verified in CTD, respectively. We also predict new drug-disease associations using the other 3 similarity measurements. The numbers of confirmed associations in the top k (k = 1, 5, 10 and 20) predictions are compared in Fig. 5. IntegratedSim yields the largest numbers of confirmed predictions in most cases. It should be noted that top predictions not yet supported in CTD may still be genuine associations.
We compare our integration method with 3 recent similarity fusion methods, which we refer to as Napolitano's method [30], Oerton's method [31] and Li's method [32]. To make a fair comparison, we apply the 3 fusion methods to our datasets for drug-disease association prediction, and we again use leave-one-out cross-validations to test their prediction abilities. The average AUC and AUPR values of these methods are listed in Table 5. Our method performs best among the 4 fusion methods.
Drug-based inference methods for drug repositioning make use of some unique drug features for matching. However, such information may be incomplete or contain noise, and incomplete or noisy data produce biased results for drug repositioning. We develop a method to combine 3 different drug features. Our integration method ensures that a drug remains more similar to itself than to other drugs throughout the iterations, which results in more reliable drug-disease association predictions. Note that the information on target proteins used in our manuscript is not complete. Meanwhile, according to a review [33], non-coding RNAs (ncRNAs) may be another new class of drug targets, as they play significant roles in gene expression regulation and in disease progression.
Integrating these ncRNAs with target proteins would give a better understanding of a drug's MoA. We therefore expect that the performance of our method will improve when more experimentally supported drug targets are integrated. In addition, our method can be easily extended when more drug features are available. This is useful because diverse categories of biomedical data are becoming available with recent advances in technologies, and these data offer new potential for drug repositioning [34] [35] [36] [37].
It should be noted that the performance of our similarity integration method depends on suitable parameter settings. Choosing proper parameters for our method under different conditions is a problem that still needs to be addressed. Meanwhile, we only study the effects of drug features on drug repositioning. Recent repurposing approaches [38] [39] [40] [41] make use of both drug and disease data. Our previous study [42] demonstrated that the topology of the drug-disease bipartite network is also a vital factor in predicting new indications for drugs. In the future, we plan to integrate more information to improve the prediction ability.
Table footnote: The bold value indicates the highest one in each row.
In this paper, we comprehensively study the effects of 3 drug features from chemical, genomic and pharmacological spaces on drug repositioning. Cross-validations and case studies suggest that the 3 drug features are all predictive factors for drug repositioning. We further develop a fusion method to integrate these features for better in silico drug repositioning. Compared with 3 recent state-of-the-art methods, our fusion method shows improvements in prediction accuracy. We expect that our study will provide guidance on data integration for in silico drug repositioning.
In our manuscript, we collect and integrate 3 types of drug signatures for drug repositioning.
The datasets used for performance evaluation and new drug indication prediction are downloaded from two references [43, 44]. In reference [43], Zhang et al. collected chemical structures of 1103 drugs from PubChem [45]. They used 881-dimensional binary fingerprint profiles to encode the presence or absence of substructures. Target proteins of 1007 drugs were obtained from DrugBank [46], and each drug was represented by a 775-dimensional binary target profile. Side-effects of 888 drugs were obtained from SIDER [47], and 1385-dimensional binary profiles were used to encode the presence or absence of each side-effect keyword. In reference [44], Li and Lu extracted therapeutic uses for 799 drugs from NDF-RT (http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/) and provided 3250 drug-disease relationships between the 799 drugs and 719 diseases. Finally, we retain the 548 drugs for which chemical structures, target proteins, side-effects and indications are all available.
As there are three types of drug features (chemical structures, target proteins and side-effects) in our study and these features are represented by binary profiles, we separately calculate the similarity between drugs in each feature set according to the Jaccard score. This strategy of similarity calculation is also applied in reference [48], in which the similarity score between two drugs based on chemical structures is computed as the size of the intersection over the size of the union when each chemical structure is viewed as a set of elements. We refer to the 3 similarity datasets as chemSim, genoSim and pharSim.
Inspired by the successful work of reference [49] in shape/image retrieval and reference [50] in cancer subtype identification, we apply a diffusion method to combine the 3 calculated similarity measurements. We refer to this integrated similarity as IntegratedSim.
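The Jaccard calculation on binary profiles can be illustrated with the short sketch below. The names `jaccard` and `similarity_matrix` are hypothetical, and returning 0.0 for two all-zero profiles is an assumption, since the text does not say how that case is handled.

```python
def jaccard(profile_a, profile_b):
    """Jaccard score of two equal-length binary profiles (lists of 0/1)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    intersection = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    union = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    # Assumption: two empty profiles are given similarity 0.0.
    return intersection / union if union else 0.0


def similarity_matrix(profiles):
    """Pairwise Jaccard similarities for a list of binary drug profiles."""
    n = len(profiles)
    return [[jaccard(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
```

Applying `similarity_matrix` once to the substructure profiles, once to the target profiles and once to the side-effect profiles would give matrices playing the roles of chemSim, genoSim and pharSim.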
For generality, we use an $n \times n$ similarity matrix $W$, with $W(i, j)$ indicating the similarity between drug $x_i$ and drug $x_j$. We define a full kernel and a sparse kernel on the similarity matrix $W$. The full kernel is normalized as:

$$
P(i,j) =
\begin{cases}
\dfrac{W(i,j)}{2\sum_{k \neq i} W(i,k)}, & j \neq i \\[6pt]
\dfrac{1}{2}, & j = i
\end{cases}
\tag{1}
$$

Let $N_i$ represent the set of drug $x_i$'s neighbours. We use K nearest neighbours (KNN) to measure local affinity, which gives the sparse kernel $S$ of Eq. (2).
Suppose there are 2 similarity datasets for fusion. We compute $P^{(1)}$ and $P^{(2)}$ according to Eq. (1) for the two similarity matrices; the matrices $S^{(1)}$ and $S^{(2)}$ are then calculated as in Eq. (2). Let $P^{(1)}_{t=0} = P^{(1)}$ and $P^{(2)}_{t=0} = P^{(2)}$ denote the two initial status matrices when $t = 0$. We propagate the similarity information through the common neighbourhood and update the two status matrices iteratively according to Eq. (3). After $t$ steps, the final integrated similarity matrix is computed from the two status matrices. For the 3 similarity measurements in our study, we adjust Eq. (3) to three status matrices, and the final fused similarity matrix is calculated from them.
Based on the guilt-by-association principle, we assume that if a drug is prescribed to treat a disease, similar drugs might also be able to treat that disease (see Fig. 6). The same idea for association analysis has been used in some other bioinformatics fields [51] [52] [53]. For an unknown drug-disease association $(r_i, d_j)$, we calculate its inference score with Eq. (8), where $r_i$ and $d_j$ denote drug $i$ and disease $j$, $Sim(r_i, r_l)$ is the similarity value between drugs $i$ and $l$, and $a_{lj} = 1$ if there exists an association between drug $l$ and disease $j$, otherwise $a_{lj} = 0$. The higher the score received from Eq. (8), the higher the confidence that the association exists.
The online version contains supplementary material available at https://doi.org/10.1186/s12859-021-03988-x.
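The fusion and scoring steps of the Methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The full kernel follows Eq. (1). The sparse kernel and the cross-diffusion update are borrowed from the similarity network fusion formulation of reference [50]: each view's status matrix is replaced by S × (average of the other views' status matrices) × Sᵀ, and the views are averaged at the end. The sketch assumes neighbours exclude the drug itself, omits any per-iteration re-normalisation the authors may apply, and uses an unnormalised similarity-weighted sum for the inference score. All function names are hypothetical.

```python
def full_kernel(W):
    """Eq. (1): off-diagonal entries of each row sum to 1/2; diagonal is 1/2."""
    n = len(W)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diag = sum(W[i][k] for k in range(n) if k != i)
        for j in range(n):
            if j == i:
                P[i][j] = 0.5
            elif off_diag > 0:
                P[i][j] = W[i][j] / (2 * off_diag)
    return P


def sparse_kernel(W, k):
    """KNN local affinity (assumed form): keep each drug's k most similar
    other drugs and normalise the kept similarities to sum to 1."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: W[i][j], reverse=True)[:k]
        total = sum(W[i][j] for j in neighbours)
        for j in neighbours:
            if total > 0:
                S[i][j] = W[i][j] / total
    return S


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in A]


def fuse(similarities, k=5, t=5):
    """Cross-diffuse two or more n x n similarity matrices for t iterations."""
    m, n = len(similarities), len(similarities[0])
    if m < 2:
        raise ValueError("need at least two similarity matrices")
    P = [full_kernel(W) for W in similarities]
    S = [sparse_kernel(W, k) for W in similarities]
    for _ in range(t):
        updated = []
        for v in range(m):
            # Average status matrix of all views except v.
            avg = [[sum(P[u][i][j] for u in range(m) if u != v) / (m - 1)
                    for j in range(n)] for i in range(n)]
            S_t = [list(col) for col in zip(*S[v])]
            updated.append(matmul(matmul(S[v], avg), S_t))
        P = updated
    # Final fused similarity: element-wise mean over the views.
    return [[sum(P[v][i][j] for v in range(m)) / m for j in range(n)]
            for i in range(n)]


def gba_scores(sim, assoc, i):
    """Guilt-by-association scores of drug i for every disease (assumed form):
    similarity-weighted sum of the other drugs' known associations."""
    return [sum(sim[i][l] * assoc[l][j] for l in range(len(assoc)) if l != i)
            for j in range(len(assoc[0]))]
```

Dividing each score by the sum of drug i's similarities to the other drugs would not change how diseases are ranked for that drug, so the choice between the normalised and unnormalised form matters only when scores are compared across drugs.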
1. Estimated research and development investment needed to bring a new medicine to market
2. How to improve R&D productivity: the pharmaceutical industry's grand challenge
3. Drug repositioning: identifying and developing new uses for existing drugs
4. Drug repurposing: progress, challenges and recommendations
5. Coronavirus puts drug repurposing on the fast track
6. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease
7. Predicting new molecular targets for known drugs
8. Systematic drug repositioning based on clinical side-effects
9. Structural analysis of oncogenic mutation of isocitrate dehydrogenase 1
10. Identify potential drugs for cardiovascular diseases caused by stress-induced genes in vascular smooth muscle cells
11. Pathological role of a point mutation (T315I) in BCR-ABL1 protein - A computational insight
12. Discovery of potent SARS-CoV-2 inhibitors from approved antiviral drugs via docking and virtual screening
13. Evaluation of acridinedione analogs as potential SARS-CoV-2 main protease inhibitors and their comparison with repurposed anti-viral drugs
14. A systems-level analysis of drug-target-disease associations for drug repositioning
15. Discovery and in silico evaluation of aminoarylbenzosuberene molecules as novel checkpoint kinase 1 inhibitor determinants
16. Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm
17. Natural analogues inhibiting selective cyclin-dependent kinase protein isoforms: a computational perspective
18. Repurpose terbutaline sulfate for amyotrophic lateral sclerosis using electronic medical records
19. A new insight into protein-protein interactions and the effect of conformational alterations in PCNA
20. Identification of novel and selective agonists for ABA receptor PYL3
21. Targeting the protein-protein interface pocket of Aurora-A-TPX2 complex: rational drug design and validation
22. Rational drug repositioning by medical genetics
23. A semi-supervised method for drug-target interaction prediction with consistency in networks
24. Structural based study to identify new potential inhibitors for dual specificity tyrosine-phosphorylation-regulated kinase
25. Predicting new indications for approved drugs using a proteochemometric method
26. Identification of bioactive molecules from tea plant as SARS-CoV-2 main protease inhibitors
27. The antiviral and antimalarial drug repurposing in quest of chemotherapeutics to combat COVID-19 utilizing structure-based molecular docking
28. Exploiting drug-disease relationships for computational drug repositioning
29. Comparative toxicogenomics database (CTD): update 2021
30. Drug repositioning: a machine-learning approach through data integration
31. Understanding and predicting disease relationships through similarity fusion
32. Inferring lncRNA functional similarity based on integrating heterogeneous network data
33. MicroRNAs and other non-coding RNAs as targets for anticancer drug development
34. Inferring drug-disease associations based on known protein complexes
35. A miRNA-driven inference model to construct potential drug-disease associations for drug repositioning
36. Extraction and analysis of signatures from the Gene Expression Omnibus by the crowd
37. deepDR: a network-based deep learning approach to in silico drug repositioning
38. Computational drug repositioning using low-rank matrix approximation and randomized algorithms
39. Predicting drug-disease associations by using similarity constrained matrix factorization
40. Drug repositioning based on bounded nuclear norm regularization
41. Integration of drug repositioning and drug-target prediction via cross-network embedding
42. Network-based inference methods for drug repositioning
43. Exploring the relationship between drug side-effects and therapeutic indications. In: American medical informatics association annual symposium proceedings
44. A new method for computational drug repositioning using drug pairwise similarity
45. PubChem 2019 update: improved access to chemical data
46. DrugBank 5.0: a major update to the DrugBank database
47. The SIDER database of drugs and side effects
48. PREDICT: a method for inferring novel drug indications with application to personalized medicine
49. Unsupervised metric fusion by cross diffusion
50. Similarity network fusion for aggregating data types on a genomic scale
51. Gut microbiome structure and metabolic activity in inflammatory bowel disease
52. Comparative analysis of similarity measurements in miRNAs with applications to miRNA-disease association predictions
53. Guilt-by-association - functional insights gained from studying the LRRK2 interactome
We are grateful to Dr. Mengyun Yang at Central South University for useful discussions.
HC and ZZ performed data preparation. HC, ZZ and JZ conceived and designed the experiments. HC performed all computational experiments. HC and JZ analyzed the results. HC and ZZ wrote the paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Fig. 6 The guilt-by-association principle behind our in silico drug repositioning (panel labels: drugs, diseases, drug-drug similarity space). If a drug with an unknown indication profile shares a similar property with another drug whose indication profile is known, the former may share the same indication profile with the latter.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.