key: cord-0438608-m2o0cvue authors: Ghorbanali, Zahra; Zare-Mirakabad, Fatemeh; Mohammadpour, Bahram title: DRP-VEM: Drug repositioning prediction using voting ensemble date: 2021-10-01 journal: nan DOI: nan sha: 8c4867ee36505a7d01a6ccc01b0a65181ae47261 doc_id: 438608 cord_uid: m2o0cvue

Traditional drug discovery methods are costly and time-consuming. Drug repositioning (DR) is a common strategy to overcome these issues. Recently, machine learning methods have been used extensively for the DR problem. The performance of these methods depends on the features, their representations and the training dataset. In this problem, feature sets include many redundant features, which degrade the performance of the methods. Moreover, selecting an appropriate training set strongly influences the accuracy of machine learning methods. However, two obstacles stand in the way of finding a proper training set. First, most methods employ known and unknown drug-disease pairs as positive and negative sets, respectively. Because known pairs are far fewer than unknown ones, the classifier is biased toward the majority group and its performance suffers. Second, the absence of a drug-disease association only means that the association has not been approved experimentally, and this status may change. In this paper, the DRP-VEM framework is proposed to overcome these challenges. We assess DRP-VEM with respect to different parameters: disease and drug feature representations, classification methods, and voting ensemble training approaches. DRP-VEM is evaluated using heterogeneous evaluation criteria. Moreover, we compare DRP-VEM, using the best combination of parameters, with DisDrugPred.

Despite the growth of technology and its role in diagnosing diseases, these successes are not being translated into medical treatments fast enough [1]. Traditional drug discovery is a time-consuming and costly method.
According to recent studies, the process of discovering a new druggable component, passing the test phases, and bringing it to market takes more than ten years and costs between $314 million and $2.8 billion [2]. Drug repositioning, or drug repurposing, uses an approved drug for a new indication outside its first treatment purpose. A historical example of drug repositioning is Sildenafil. Researchers developed Sildenafil for treating hypertension, but today it is used to treat erectile dysfunction and is known as Viagra [1]. This method can also be used for treating new diseases. For example, researchers repurposed existing antiviral drugs such as Baloxavir, Azvudine, and Darunavir to treat coronavirus disease during the Covid-19 pandemic [3].

At first, physicians discovered drug repositioning opportunities by chance. Although retrospective clinical experience is useful, physicians have to check a wide range of drugs. Besides the added cost and risk of failure, finding an alternative drug for a specific disease in this way is time-consuming. Because these methods do not follow a systematic approach, and because computational methods and the available data on drugs and diseases have both grown, researchers now prefer computational methods for solving the drug repositioning problem. We divide these computational methods into three main groups: drug-based, disease-based, and hybrid.

Owing to the availability of drug information, more researchers focus on drug-based techniques. Ozsoy et al. combined three main drug features: chemical structure, protein interaction, and side effects. The Pareto dominance technique was applied to find the neighbors of a drug. Then, they used a collaborative filtering recommendation system to find the probability of association for drug-disease pairs [4]. Zeng et al. proposed the DeepDR method, which performed random walks to represent drug feature networks and then combined them using an autoencoder. Finally, a variational autoencoder was applied to estimate the drug-disease association probabilities [5]. Chen et al. collected three features of the drug: chemical structure, targets, and side effects. After calculating the similarities of drugs, a fusion method was developed for merging the similarities to predict the probabilities of drug-disease associations [6].

Despite the importance of disease-related data in the drug repositioning problem, researchers have not widely studied disease-based methods due to a lack of information. Therefore, these methods have focused on a specific disease or therapeutic domain [7]. Chiang and Butte calculated the similarities of diseases by counting shared therapies. Then, a "guilt by association" approach was applied to these disease similarities to find new drug-disease association pairs [8].

In hybrid methods, researchers combine both drug and disease data to estimate the chance of a drug-disease association. Moridi et al. presented a pipeline that efficiently represents drug and disease features using deep learning. They proposed a non-linear approach to find drug-disease candidates [9]. Xuan et al. presented DisDrugPred, which integrates drug similarities, disease similarities and known drug-disease associations using a non-negative matrix factorization technique to calculate the association probability of drugs and diseases [10]. Luo et al. proposed an approach named RWHND that constructs a heterogeneous network by combining data on drugs, drug targets, diseases and disease genes. Then, a random walk model was developed to propose candidate pharmaceutical treatments for a disease [11].

Although researchers have studied the drug repositioning (DR) problem extensively, several challenges still need to be addressed.
In the following, we review these challenges and our ideas to overcome them:

• Previous studies focused mostly on drug-based methods and less on hybrid ones. In addition, these studies used different drug features and tried to combine all of them. However, they did not examine which feature plays a significant role in detecting drug-disease association pairs, or which representation is appropriate. In this paper, we aim to ascertain whether using all features in solving the DR problem is necessary or causes redundancy. Moreover, we identify the significant feature that improves the accuracy of machine learning methods on the DR problem. In addition, we assess which data representations perform better than others.

• Different machine learning techniques for the DR problem can be found in the literature. This article shows that if the representation and the combination of features are chosen appropriately, the choice of classification method changes the predicted results for the DR problem only slightly.

• Most approaches consider known drug-disease pairs as the positive set and all unknown pairs as the negative set. Nevertheless, finding a proper training set faces two challenges. First, because the number of known pairs is much smaller than the number of unknown ones, the machine learning method is biased toward the majority group, so its performance is flawed [12]. Second, a missing drug-disease association, used as a negative sample, has not been assessed clinically. This study introduces a new algorithm for building the training set, called the voting ensemble training approach, to overcome these issues.

To show that the selected drug feature, the chosen feature representations of drugs and diseases, the selected machine learning method and the voting ensemble training approach are suitable, we compare our framework with DisDrugPred [10].
The rest of this article is organized as follows: the "Methods" section describes the DR problem, the data and our framework, called DRP-VEM. The "Results and Discussion" section includes the assessment of DRP-VEM and the comparison with DisDrugPred. Finally, the "Conclusion" section outlines future directions for the DR problem. This article aims to:

• find which feature representations are appropriate to depict the drugs and diseases in the DR problem,
• analyze which feature has more impact on solving the DR problem,
• assess whether all drug features are necessary or cause redundancy when predicting drug-disease associations,
• select which classification method is more accurate for the DR problem,
• propose a voting ensemble training approach to overcome the challenges of using unknown drug-disease pairs as the negative set and of unbalanced data in the DR problem.

In the following section, we first define the DR problem. We then introduce our datasets. Next, we describe the data representations and the training approach. Finally, we propose and explain our method.

Let $P = \{p_1, \dots, p_n\}$ and $R = \{r_1, \dots, r_m\}$ denote the sets of diseases and drugs, respectively, where $n$ and $m$ are the numbers of diseases and drugs. In the mathematical definition of the DR problem, our primary goal is to determine whether a therapeutic association exists between disease $p_i \in P$ and drug $r_j \in R$. If the model predicts that $(p_i, r_j)$ has a therapeutic association, the output is one, and otherwise zero. In the DR problem, the following data is given as input:

• The features of the disease $p_i$,
• The features of the drug $r_j$,
• The set of known drug-disease association pairs, $A = \{(p_i, r_j) \mid \text{there is a known association between disease } p_i \text{ and drug } r_j\}$.

It is necessary to collect known drug-disease associations and to select features for drugs and diseases. Therefore, we use four drug features: target, domain, side effect, and chemical structure. Also, we apply semantic similarity as a disease feature.
In the following, the databases used to extract data are introduced:

• Drug-disease association: We choose the "repoDB" database [13] to collect known drug-disease association pairs.
• Drug features: We retrieve drug names, identifiers, and targets from "DrugBank" [14]. The target domains of drugs are extracted from "Uniprot" [15]. We derive the information on side effects from "SIDER4.1" [16]. Finally, the chemical structures of drugs are collected from "Pubchem" [17].
• Disease feature: We extract the disease similarity from "DincRNA" [18] based on Wang's method [19].

The list of diseases is restricted to the 158 diseases available in the DincRNA database. In addition, we select the 413 drugs for which all features are available. Table 1 shows the amount of data extracted from the databases. There are 1506 target components for drugs. Also, there are 1070 domain, 5734 side effect and 881 chemical structure components.

In this subsection, we introduce how each data type is represented before it is fed into the framework.

✓ The cosine similarity vector $C^F_R$, of length $m$, is defined for drug $R$ based on feature $F$. Its entries are the cosine similarities, computed on feature $F$, between drug $R$ and each of the $m$ drugs.

Thus, we define eleven different drug feature representations.

Most methods use the known drug-disease pairs as the positive set and all unknown pairs as the negative set. However, this approach is not appropriate because of two main challenges:

✓ The number of known pairs is much smaller than the number of unknown ones ($|A| \ll |B|$). This biases a binary classifier toward the majority group, so the method's performance is flawed [12].
✓ The lack of a drug-disease association means that the association of this pair has not been assessed clinically yet, not that the pair will never be associated.

To overcome the first challenge, we apply an under-sampling approach: we randomly select unknown association pairs, k times as many as the known association pairs. We call this way of building the training set the one-to-k distribution.
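The one-to-k distribution can be illustrated with a minimal sketch. The function name `one_to_k_training_set`, the pair encoding as `(disease, drug)` tuples and the fixed random seed are assumptions made for illustration; they are not taken from the DRP-VEM implementation.

```python
import random


def one_to_k_training_set(known_pairs, unknown_pairs, k, seed=0):
    """Build a training set with k randomly drawn unknown pairs per known pair.

    Known pairs are labelled 1 (positive); sampled unknown pairs are labelled 0.
    Hypothetical helper: the name and signature are illustrative only.
    """
    rng = random.Random(seed)
    # Never ask for more negatives than there are unknown pairs.
    n_neg = min(k * len(known_pairs), len(unknown_pairs))
    negatives = rng.sample(unknown_pairs, n_neg)  # sampling without replacement
    samples = list(known_pairs) + negatives
    labels = [1] * len(known_pairs) + [0] * n_neg
    return samples, labels


# Toy data: 3 diseases x 4 drugs, two known associations.
known = [("d1", "r1"), ("d2", "r3")]
unknown = [
    (f"d{i}", f"r{j}")
    for i in range(1, 4)
    for j in range(1, 5)
    if (f"d{i}", f"r{j}") not in known
]
samples, labels = one_to_k_training_set(known, unknown, k=3)
```

With two known pairs and k = 3, the sketch draws six of the ten unknown pairs as negatives, so the positive-to-negative ratio of the training set is one to three.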
To address the second challenge, we partition the unknown pairs, following the one-to-k distribution approach, into negative training sets whose pairwise intersections are empty and whose union equals the whole set of unknown pairs. Assume that the number of these sets is $p_k$. The model is then trained on $p_k$ negative datasets. For each test sample, we vote on the responses of the trained models to predict the association. The training and test sets for the voting ensemble method are built as follows: each negative set, together with the positive set, is fed to a classifier as the training set. Therefore, the classifier is trained $p_k$ times. For each sample $x$ from the test set, each trained model predicts the association between disease and drug. Finally, we vote on the responses of the trained models to obtain the prediction for $x$.

Here, we introduce the evaluation criteria, analyze our results, and compare our method with DisDrugPred [10]. In the following, we assess our framework, DRP-VEM, based on the selected characteristics of each parameter.

• Assessment of disease feature representation

We defined the disease feature representations W and O based on the Wang vector (5) and the one-hot vector (6). The performance of each representation for every evaluation criterion is measured according to (12).

The evaluation scores of every drug feature representation are shown in Table 4. The target cosine similarity vector ($C^T$) and the domain cosine similarity vector ($C^D$), whose scores differ only slightly (0.8%), perform better than the other drug feature representations. We can infer that the target of a drug has a significant effect on predicting the association of a drug-disease pair. Meanwhile, the domain of a target is a region of the protein's polypeptide chain that is self-stabilizing and folds independently from the rest [22]. As their scores are very close, applying either one of these features is required.

Table 7 depicts the performance of each combination.
As we expected, the combination of the Wang vector (W) and the target cosine similarity vector ($C^T$) gives better results than the other combinations. The binary vector combined with the one-hot vector (B.O) and the cosine similarity vector combined with the Wang vector (C.W) perform better in most cases. This seems to be because B and O are both discrete representations, while C and W are both continuous ones. Therefore, their combinations work more accurately.

Table 7: Evaluation criteria for combinations of disease and drug feature representations.

Here, every classifier uses a voting ensemble training approach for learning. We analyze all combinations of classifiers and training approaches to determine the best model according to (16). The results are shown in Table 8. The ACC criterion rises as the number of negatives grows. So, for all three classifiers, the ACC on $T_5$ is better than on the other training sets. On the other hand, as the number of negative samples increases, the AUC-PR decreases. We define the WAS score to make a better trade-off among ACC, AUC and AUC-PR. According to the corresponding results, our model with a decision tree as the classifier and $T_1$ as the training approach performs better than the others.

• Comparing DRP-VEM with DisDrugPred

We implement DisDrugPred [10] and analyze its performance on our dataset. As DisDrugPred is a regression algorithm and not a classifier, we calculate its mean square error (MSE) instead of the ACC. As in the DisDrugPred article, we perform 5-fold cross-validation. According to our assessment, the most accurate combination of parameters belongs to the model $(W, C^T, DT, T_1)$, which we named BestOverAll. In BestOverAll, the Wang vector representation for diseases (W) and the target cosine similarity vector for drugs ($C^T$) are combined and fed to a DT classifier, which is trained with the voting ensemble training approach $T_1$. This model is generally preferred.
However, the best scores among all 264 tested models belong to the model that we named BestAmongAll, whose feature representation is fed to a DT classifier trained on the one-to-three distribution training set $T_3$. As the ACC score is not available for DisDrugPred, and similarly the MSE is not available for BestOverAll and BestAmongAll, we calculate the WAS score as the average of AUC and AUC-PR. Table 9 illustrates the corresponding results. BestOverAll is slightly better than DisDrugPred, while BestAmongAll achieves remarkable evaluation scores.

This article proposed a new framework, named DRP-VEM, to solve the DR problem using a voting ensemble method. We examined different parameters to find the proper combination of drug and disease feature representations, classification method and training approach. We chose ACC, AUC, AUC-PR and WAS to evaluate the framework. According to the results, the best overall model (BestOverAll) uses the target cosine similarity vector as the drug feature representation, the Wang similarity vector as the disease representation, a decision tree as the classification method, and the one-to-one distribution as the voting ensemble training approach. The ACC, AUC, AUC-PR and WAS scores for the BestOverAll model are 81.9%, 81.8%, 76.6% and 79.7%, respectively. DRP-VEM is compared with DisDrugPred [10] as a state-of-the-art drug repositioning method. DisDrugPred obtained AUC = 82.4%, AUC-PR = 63.9% and WAS = 73.1%. The data and implementation of DRP-VEM are available at http://bioinformatics.aut.ac.ir/DRP-VEM.

In conclusion, using target or domain as a drug feature is necessary, while concatenating all features reduces the model's accuracy and causes redundancy. The cosine similarity vector and the Wang vector are the more accurate drug and disease feature representations. Moreover, the decision tree classifier distinguishes the dataset better than the others. In addition, applying the voting ensemble approach to build the training and test sets solves the classification method's bias problem.
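The voting ensemble training approach used by DRP-VEM (disjoint negative sets, one model per set, a vote over the models) could be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, a nearest-centroid rule stands in for the DT, RF and CNB classifiers of the paper so that the example needs only the standard library, the last negative set may be smaller than the others, and ties are resolved as "no association" because the text does not state a tie rule.

```python
import random


def partition_negatives(unknown, n_pos, k, seed=0):
    """Shuffle the unknown samples and cut them into disjoint sets of size k * n_pos.

    The sets do not overlap and together cover all unknown samples; the last
    set may be smaller (an assumption of this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(unknown)
    rng.shuffle(shuffled)
    size = k * n_pos
    return [shuffled[i:i + size] for i in range(0, len(shuffled), size)]


def fit_nearest_centroid(X, y):
    """Stand-in classifier (not the paper's DT): predict the class of the closer centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos = centroid([x for x, label in zip(X, y) if label == 1])
    c_neg = centroid([x for x, label in zip(X, y) if label == 0])

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return 1 if d_pos < d_neg else 0

    return predict


def train_voting_ensemble(positives, unknown, k, fit, seed=0):
    """Train one model per negative set; every model sees all positive samples."""
    models = []
    for negatives in partition_negatives(unknown, len(positives), k, seed):
        X = list(positives) + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        models.append(fit(X, y))
    return models


def predict_by_vote(models, x):
    """Strict majority vote; a tie yields 0 (assumed tie rule)."""
    positive_votes = sum(model(x) for model in models)
    return 1 if 2 * positive_votes > len(models) else 0


# Toy feature vectors: positives lie near (1, 1), unknown samples near (0, 0).
positives = [(1.0, 1.0), (0.9, 1.1)]
unknown = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.0, 0.0), (0.1, 0.2), (0.2, 0.2)]
models = train_voting_ensemble(positives, unknown, k=1, fit=fit_nearest_centroid)
near_positive = predict_by_vote(models, (0.95, 1.0))
near_unknown = predict_by_vote(models, (0.05, 0.05))
```

In this toy run, the one-to-one distribution yields three disjoint negative sets and therefore three trained models. Any classifier that exposes the same `fit(X, y) -> predict` shape could replace the stand-in.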
In this article, we focused more on the assessment of drug feature representations and less on disease representations. In the future, we aim to analyze disease feature representations in more depth. This study used fingerprints as the drug chemical structure representation. Since the SMILES representation appears more informative, we also want to develop a representation format for SMILES.

[1] Drug repositioning: Identifying and developing new uses for existing drugs
[2] Estimated Research and Development Investment Needed to Bring a New Medicine to Market
[3] Coronavirus puts drug repurposing on the fast track
[4] Realizing drug repositioning by adapting a recommendation system to handle the process
[5] DeepDR: A network-based deep learning approach to in silico drug repositioning
[6] In silico drug repositioning based on the integration of chemical, genomic and pharmacological spaces
[7] Exploiting drug-disease relationships for computational drug repositioning
[8] Systematic evaluation of drug-disease relationships to identify leads for novel drug uses
[9] The assessment of efficient representation of drug features using deep learning for drug repositioning
[10] Drug repositioning through integration of prior knowledge and projections of drugs and diseases
[11] Computational Drug Repositioning with Random Walk on a Heterogeneous Network
[12] Learning from imbalanced data: open challenges and future directions
[13] A standard database for drug repositioning
[14] DrugBank: a comprehensive resource for in silico drug discovery and exploration
[15] UniProt: The universal protein knowledgebase in 2021
[16] The SIDER database of drugs and side effects
[17] PubChem 2019 update: Improved access to chemical data
[18] DincRNA: A comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function
[19] Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases
[20] Receiver operating characteristic (ROC) curve for medical researchers
[21] The area under the precision-recall curve as a performance metric for rare binary events
[22] Favorable domain size in proteins

We would like to express our great appreciation to Miss Mina Shaygan, a member of the CBRC lab at Amirkabir University of Technology, for her patience in designing the CBRC lab webpage. Her willingness to give her time so generously is very much appreciated.

Authors' contributions
FZ and ZG contributed to the design and implementation of the framework, the analysis of the results, and the writing of the manuscript. BM contributed to implementing the DisDrugPred approach.

No funding information to declare.

The data that support the findings of this study, including disease similarity based on Wang's method, drug names, identifiers and targets, drug domains, drug side effects, drug chemical structures, and drug-disease associations, are openly available. They were downloaded from DincRNA [18], DrugBank [14], Uniprot [15], SIDER4.1 [16], PubChem [17], and repoDB [13], respectively.

Ethics approval and consent to participate
Not applicable. All authors give consent to publish. No competing interest to declare.

Author Details
1,2,3 Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran.

According to Table 5, DT distinguishes our dataset significantly better than the two other models, and RF performs more reliably than CNB.

• Assessment of voting ensemble training approach

We defined four voting ensemble training sets. The model is trained on each training set, and the prediction for test data is then made according to Figure 1. Table 6 illustrates every evaluation criterion for each training set according to (14). The most accurate performance belongs to the one-to-one distribution, $T_1$, where the numbers of positives and negatives are equal.

Table 6: Evaluation criteria for the training approaches

We fed the different combinations of the drug and disease representations to the model. We examine the possible combinations of them according to (15).