key: cord-0029726-1spk12x1 authors: Wu, Zixin; Chen, Lei title: Similarity-Based Method with Multiple-Feature Sampling for Predicting Drug Side Effects date: 2022-04-01 journal: Comput Math Methods Med DOI: 10.1155/2022/9547317 sha: 44b5a1fd928543f01729c1a8cae3f751ab1362a2 doc_id: 29726 cord_uid: 1spk12x1 Drugs can treat different diseases but also bring side effects. Undetected and unaccepted side effects for approved drugs can greatly harm the human body and bring huge risks for pharmaceutical companies. Traditional experimental methods used to determine the side effects have several drawbacks, such as low efficiency and high cost. One alternative to achieve this purpose is to design computational methods. Previous studies modeled a binary classification problem by pairing drugs and side effects; however, their classifiers can only extract one feature from each type of drug association. The present work proposed a novel multiple-feature sampling scheme that can extract several features from one type of drug association. Thirteen classification algorithms were employed to construct classifiers with features yielded by such scheme. Their performance was greatly improved compared with that of the classifiers that use the features yielded by the original scheme. Best performance was observed for the classifier based on random forest with MCC of 0.8661, AUROC of 0.969, and AUPR of 0.977. Finally, one key parameter in the multiple-feature sampling scheme was analyzed. Drugs are important in treating various diseases; however, their therapeutic effects are accompanied by negative effects called side effects. In the pharmaceutical field, drug side effect is classified as an adverse drug reaction (ADR), the harmful or accidental reactions of qualified drugs that are irrelevant to the purpose of their use under normal usage and dosage. 
Some market-approved drugs may produce unacceptable side effects that can be harmful to the human body and bring high risks to pharmaceutical companies. For example, fluconazole and atorvastatin have potential hepatotoxicity and nephrotoxicity that can increase transaminase when used in specific patients such as those with liver disease. Side effects are one of the major obstacles to launching new drugs and a major cause of delays in their development. Thus, determining all the side effects of a given drug is an important topic in drug development. Although solid clinical trials are effective in identifying side effects, they are time consuming and expensive and thus cannot meet the demand of large-scale tests. Rapid and cheap methods for the identification of drug side effects must therefore be developed. Many advanced computational algorithms have been proposed [1] [2] [3] [4] [5] to provide strong technical support for various medical problems. Several computational methods have been developed for the identification of drug side effects. Most of them are machine learning-based techniques that mine current information on drug side effects and learn patterns that can be used to predict side effects for a given new drug. Some early methods consisted of an individual binary classifier for each side effect [6] [7] [8] [9] [10]; hence, they always contain several binary classifiers that must all be executed to determine all side effects for a given drug. In view of this situation, some other techniques were directly built with multilabel classifiers [11] [12] [13] [14] [15] [16] that treat side effects as labels and drugs as samples. Recommender systems were also proposed to predict drug side effects [17] [18] [19]. Recent works paired drugs and side effects as samples to convert the original problem into binary classification [20] [21] [22].
A key step in developing such binary classifiers is to extract essential properties from each drug-side effect pair. Some researchers used a similarity-based scheme to extract features [21, 22]; for convenience, they extracted only one feature from one type of drug association, a process called single-feature sampling scheme. However, some essential information may be omitted. To continue this line of research, a novel feature extraction scheme that can retain essential information for each drug-side effect pair must be developed. In this study, an efficient binary classifier was proposed for the identification of drug side effects. As in previous studies [20] [21] [22], drugs and side effects were paired as samples. The single-feature sampling scheme [21, 22] was generalized to extract essential features from each pair. Named the multiple-feature sampling scheme, this newly proposed strategy can generate multiple features from each type of drug association. A classic machine learning algorithm, random forest (RF) [23], was adopted as the prediction engine. According to the 10-fold cross-validation results, the performance of this classifier was better than that of the previous classifier that uses the original single sampling scheme for feature extraction. Further tests suggested that classifiers with other classification algorithms and features yielded by the multiple sampling scheme were all superior to those with the same classification algorithm and features generated by the original scheme. This finding indicated the power of the features generated by the proposed feature extraction scheme.

2.1. Benchmark Dataset. Data on 841 drugs and their 824 side effects [20] [21] [22] were extracted from SIDER (http://sideeffects.embl.de/) [24], a public database collecting information on marketed drugs and their ADRs. The original data contained 888 drugs and 1385 side effects. The side effects that were annotated to no more than five drugs were excluded.
Furthermore, drugs without the properties mentioned in Section 2.2 were discarded. From the remaining 841 drugs and 824 side effects, 57,058 drug-side effect pairs were obtained. Each pair indicated that the drug in the pair has the side effect in the same pair. Given that these pairs indicate a known relationship between one drug and one side effect, they were termed positive samples and comprised the positive dataset (PDS). In addition to the PDS, a negative dataset (NDS) was necessary for building an efficient binary classifier. A total of 57,058 drug-side effect pairs were produced by randomly pairing one drug and one side effect [20, 21], with the restriction that none of these pairs was a positive sample. These pairs constituted one NDS. Different NDSs may influence the performance of the classifier. Therefore, four other NDSs were also generated. Finally, five datasets, each containing the PDS and one NDS, were produced and denoted by DS1, DS2, DS3, DS4, and DS5.

Properties. Two drugs with strong associations tend to share similar functions [25] [26] [27] [28] [29]. Side effects can be deemed one type of drug function. Thus, classifiers can be constructed by adopting features derived from drug associations. From different aspects of drugs, several types of drug associations can be measured and quantified. For easy comparison, the drug associations used in a previous study [21] were adopted here, and their brief descriptions are as follows.

2.2.1. Drug Fingerprint Association. The simplified molecular input line entry specification (SMILES) string [30] is a widely used scheme for drug representation. Fingerprints can be extracted from this string using existing software, such as RDKit [31]. The association of two drugs can be evaluated by comparing their fingerprints. Here, ECFP_4 fingerprints and the Tanimoto coefficient were used to measure this association between any two drugs.
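As a minimal illustration of the Tanimoto coefficient, the sketch below treats each fingerprint as the set of its on-bit indices and returns the ratio of shared bits to all bits set in either fingerprint. The bit sets and the names `tanimoto`, `fp_d1`, and `fp_d2` are hypothetical stand-ins; real ECFP_4 fingerprints would be generated by cheminformatics software such as RDKit, which is not done here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


# Hypothetical on-bit indices standing in for the ECFP_4 fingerprints of two drugs.
fp_d1 = {3, 17, 42, 128, 511}
fp_d2 = {3, 42, 99, 511, 900, 1023}

# 3 shared bits out of 8 distinct bits overall.
similarity = tanimoto(fp_d1, fp_d2)
print(similarity)
```

The value lies between 0 (no shared bits) and 1 (identical fingerprints), so it can be used directly as an association score.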
For formulation, this association for drugs $d_1$ and $d_2$ was denoted by $G_f(d_1, d_2)$.

Association. In addition to the SMILES string, another popular drug representation scheme is the graph-based method. Here, each drug is represented by a graph with nodes depicting atoms and edges indicating bonds. The association of two drugs can be assessed by considering the similarity of the two corresponding graphs. "SIMCOMP" (https://www.genome.jp/tools/simcomp/), reported in KEGG [32, 33], was set up based on this idea. This tool can output the associations of a given drug with other drugs as scores between 0 and 1. This association for drugs $d_1$ and $d_2$ was denoted by $G_s(d_1, d_2)$.

Association. The anatomical therapeutic chemical (ATC) system is widely accepted and used in drug classification. Each drug in this system is assigned five-level ATC codes that indicate its essential properties. For two drugs, their association can be measured according to their ATC codes. This study used the same method as in [21] to evaluate drug association based on ATC codes. For convenience, the association of drugs $d_1$ and $d_2$ was denoted by $G_a(d_1, d_2)$.

Given the extensive literature on drugs, the association of two drugs can be measured from their co-occurrence in the literature by natural language processing methods. The well-known public database STITCH (version 4.0, http://stitch4.embl.de/) [34] provides such associations, which were directly employed in this study. The "Textmining" score was extracted from the downloaded file "chemical_chemical.links.detailed.v4.0.tsv." For drugs $d_1$ and $d_2$, their literature association was denoted by $G_{tm}(d_1, d_2)$.

Target proteins are a basic property of drugs. Hence, the association of two drugs can be estimated by comparing their target proteins. In this study, the target proteins of drugs were retrieved from DrugBank (https://go.drugbank.com/) [35]. Each drug was encoded into a binary vector by applying a one-hot scheme to its target proteins.
The cosine of the angle between the two vectors was defined as this association of two drugs. For formulation, this association between drugs $d_1$ and $d_2$ was denoted by $G_t(d_1, d_2)$.

2.3. Feature Engineering. Section 2.2 introduced five types of drug associations that have previously been used to extract features representing drug-side effect pairs [21, 22]. These features indicate the linkage between the drug and the side effect in a drug-side effect pair. However, the previous studies extracted only one feature from each type of drug association and thus could not fully capture the essential linkage between the drug and the side effect. This study proposed a novel feature extraction scheme, called the multiple-feature sampling scheme, which can extract multiple features from one type of drug association. For a clear description, some notation is necessary. For one drug-side effect pair $p = (d, s)$, where $d$ and $s$ indicate one drug and one side effect, respectively, let $S$ be the set of drugs in the training dataset that have side effect $s$. If $d$ itself has side effect $s$, it is not included in $S$. For one type of drug association, the association values between $d$ and all drugs in $S$ are collected. These values, sorted in decreasing order, form a candidate feature list for $p$, denoted by $\Psi_k(p)$, where $k \in \{f, s, a, tm, t\}$ represents the type of drug association used to construct the list. Previous studies chose only the top value in this list as the exclusive feature [21, 22]. Selecting several values from this list can carry more information about the linkage of drug $d$ and side effect $s$. On the basis of different selection models, two strategies were proposed, namely, the discrete and continuous strategies. Their procedures are shown in Figure 1. In the discrete strategy, several values from the list $\Psi_k(p)$ are selected to indicate the distribution of values in the list.
In this way, the selected values can indicate the linkage between drug $d$ and side effect $s$ more completely. This is achieved by selecting values at some discrete positions in the list. For example, the value at the first place or that at the top q% place can be selected. These values comprise a set of features from one type of drug association.

Strategy. The continuous strategy differs from the discrete one. Given that the linkage of drug $d$ and side effect $s$ is most strongly indicated by the top values in the list, these values should all be retained because together they may contain the essential information. For an integer q between 1 and 100, the top q% values in the list $\Psi_k(p)$ were selected as features.

A proper classification algorithm is important in building an efficient classifier. In this study, RF [23] was adopted to construct the classifier. RF is one of the most classic classification algorithms and has been used to set up many classifiers in bioinformatics [36] [37] [38] [39] [40] [41]. RF is an ensemble classification algorithm containing several decision trees, each of which is constructed with two random selection procedures. The first procedure selects samples: given a dataset with n samples, n samples are randomly drawn from it with replacement. The second procedure selects the features used to split each node; the number of selected features should be much smaller than the total number of features. After the predefined number of decision trees has been constructed, RF integrates them by majority voting. For a query sample, each decision tree gives its prediction, and the majority prediction is the predicted result of RF. Although a decision tree is a relatively weak classification algorithm, RF is powerful and is frequently a strong candidate for building classifiers. In this study, "RandomForest" in Weka [42] was directly used to implement the abovementioned RF. Default parameters were adopted, and the number of decision trees was set to 100.
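The two random selection procedures and the majority vote described above can be sketched in a few lines. This is only an illustration of the idea, not Weka's "RandomForest" implementation; the function names, the toy data, and the feature counts (25 features, 5 per split) are assumptions made for the example, and no decision trees are actually trained.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """First random procedure: draw len(samples) items with replacement."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]


def random_feature_subset(n_features, n_selected, rng):
    """Second random procedure: pick a small subset of feature indices for a split."""
    return sorted(rng.sample(range(n_features), n_selected))


def majority_vote(tree_predictions):
    """Combine per-tree class predictions by taking the most frequent class."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
data = list(range(10))                      # hypothetical sample identifiers
boot = bootstrap_sample(data, rng)          # training set for one tree
feats = random_feature_subset(25, 5, rng)   # candidate features at one node
vote = majority_vote([1, 0, 1, 1, 0])       # hypothetical outputs of five trees
print(boot, feats, vote)
```

Because each tree sees a different bootstrap sample and different candidate features at each node, the trees disagree in different places, which is what makes the vote more reliable than any single tree.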
In addition to RF, the following classification algorithms were used to build corresponding classifiers: support vector machine (SVM) (polynomial kernel, RBF kernel) [43] , Adaboost M1 [44] , Bagging [45] , Bayesian network [46] , Naive Bayes [47] , K-nearest neighbor (KNN) [48] , decision tree (C4.5) [49] , PART [50] , logistic regression [51] , multilayer perceptron (MLP) [52] , and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [53] . The goal is to confirm that the features yielded by the multiple sampling scheme are more effective than those yielded by the single sampling scheme. For convenience, corresponding tools in Weka were used to implement the above classification algorithms under default parameters. These classification algorithms adopt different principles and procedures for classification. Therefore, their usage can fully test the utility of the proposed feature sampling scheme. If the classifier with features yielded by the multiple sampling scheme is superior to that with previous features for any of these classification algorithms, then, the robustness of the novel features obtained by the multiple sampling scheme is confirmed. 2.5. Accuracy Measurement. Ten-fold cross-validation [54] [55] [56] [57] [58] [59] was adopted to evaluate the performance of all constructed classifiers. Such method randomly divides the original dataset into ten parts. Each part is singled out one by one as the test set, and the remaining parts constitute the training set. Samples in the test set are predicted by the classifier based on the training set. Thus, each sample is tested exactly once. For a binary classification problem, four entries can be counted by comparing the predicted and true classes of each sample, that is, true positive (TP), false positive (FP), true negative (TN), and false negative (FN). 
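The fold assignment and the four counts can be sketched as follows. This is a minimal sketch under stated assumptions: labels are coded as 1 (positive) and 0 (negative), the helper names `k_fold_indices` and `confusion_counts` are hypothetical, and the classifier itself is left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


folds = k_fold_indices(25, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for fold in folds for i in fold if i not in held_out]
    # A classifier would be trained on train_idx and applied to test_fold here.

# Hypothetical true and predicted labels for five samples.
counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(counts)
```

Every index appears in exactly one fold, so each sample is predicted exactly once across the ten rounds.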
The following measurements were based on these four entries: sensitivity (SN) (also called recall), specificity (SP), prediction accuracy (ACC), Matthews correlation coefficient (MCC) [20, 21, 37, [60] [61] [62] [63], precision, and F1-measure. Their definitions are as follows:

\[ \mathrm{SN} = \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \]

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{F1\text{-}measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \]

ACC and MCC use all four entries, and F1-measure combines precision and recall; these three are therefore regarded as more important than the other three measurements. The receiver operating characteristic (ROC) curve [64] and the precision-recall (PR) curve were further employed to assess the performance of the constructed classifiers. These curves indicate the performance of classifiers under different thresholds. The ROC curve takes 1-SP as the x-axis and SN as the y-axis, and the PR curve takes recall as the x-axis and precision as the y-axis. The areas under these two curves (AUROC and AUPR) are important measurements of classifier performance. Among the abovementioned measurements, MCC was selected as the main one.

A novel feature extraction method was proposed to extract essential features from drug-side effect pairs. On the basis of these features, efficient classifiers to predict drug side effects were established. All procedures are illustrated in Figure 2.

The discrete strategy picks values at discrete positions in the candidate feature list. Given that the top value in this list is the most important and has previously been selected as the exclusive feature [21, 65], this top value is always picked as one feature. As mentioned in Section 2.3, the value located at the top q% place in the list was also selected. In this study, q was set to 5, 10, 15, and 20. Values with high ranks in the candidate feature list are more important than those with low ranks; that is, the top value is the most important, followed by the values at 5%, 10%, 15%, and 20%. Incremental feature selection was adopted to generate four feature subsets as listed in column 1 of Table 1.
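The candidate list of Section 2.3 and the discrete picks at q = 5, 10, 15, and 20 can be sketched as below. Several details are assumptions rather than the authors' implementation: the position of the "top q% place" is taken as the ceiling of q% of the list length, an empty list falls back to zeros, and the names `candidate_list`, `discrete_features`, and the toy association table are hypothetical.

```python
import math


def candidate_list(drug, side_effect_drugs, assoc):
    """Association values between `drug` and the other drugs having the side effect,
    sorted in decreasing order (the drug itself is excluded)."""
    others = [d for d in side_effect_drugs if d != drug]
    return sorted((assoc(drug, d) for d in others), reverse=True)


def discrete_features(values, q_list=(5, 10, 15, 20)):
    """Top value plus the value at each top-q% position of a decreasing list."""
    if not values:
        return [0.0] * (1 + len(q_list))  # assumed fallback for an empty list
    feats = [values[0]]
    for q in q_list:
        rank = max(1, math.ceil(len(values) * q / 100))  # 1-based rank, assumed rounding
        feats.append(values[rank - 1])
    return feats


# Toy association table for one association type (hypothetical values).
table = {("a", "b"): 0.2, ("a", "c"): 0.7}
toy_list = candidate_list("a", ["a", "b", "c"], lambda x, y: table[(x, y)])

# A hypothetical candidate list of 40 decreasing values: 1.00, 0.98, ..., 0.22.
values = [(100 - 2 * i) / 100 for i in range(40)]
feats = discrete_features(values)
print(toy_list, feats)
```

Taking prefixes of the returned vector (top value only, then adding the 5% value, and so on) gives nested feature subsets in the spirit of the incremental selection described above; repeating this for each of the five association types yields the full feature vector of a pair.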
With each feature subset derived from the five types of drug associations, an RF classifier was built on each of the five datasets and evaluated by 10-fold cross-validation. The average performance is listed in Table 1. MCC followed an increasing trend when the values at top 5%, 10%, 15%, and 20% were added. The other five measurements also generally followed this trend. The RF classifiers with all selected features (top values and those at 5%, 10%, 15%, and 20%) generated the highest MCC of 0.7172. This finding indicated that the features yielded by the multiple-feature sampling scheme were quite efficient for the identification of drug side effects. The ROC and PR curves of these four RF classifiers were investigated, and the results are shown in Figure 3. All AUROCs and AUPRs were higher than 0.900 and 0.910, respectively, further suggesting the good performance of RF classifiers with discrete strategy.

Different from the discrete strategy, the continuous strategy selects values from the candidate feature list in a continuous way. As mentioned in Section 2.3, the top q% values in the candidate feature list can be chosen as features. Here, four q values (10, 20, 30, and 40), and thus four feature subsets, were tested. An RF classifier was also built on each of the five datasets by using the feature subsets derived from the five types of drug associations. Each classifier was assessed by 10-fold cross-validation, and the average performance is listed in Table 2 [...] and precision of 0.9747. Compared with the RF classifiers with discrete strategy, the best RF classifier with continuous strategy had higher measurements, particularly for MCC (by 15%), ACC (by 7%), and F1-measure (by 7%). These results indicated that the features obtained by continuous strategy were more powerful in identifying drug side effects than those yielded by discrete strategy.
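For comparison with the discrete picks, the continuous selection of the top q% values can be sketched as follows. The rounding rule (ceiling of q% of the list length) and the name `continuous_features` are assumptions. The number of selected values depends on the length of the candidate list; how lists of different lengths are turned into feature vectors of equal size is not stated in the text and is not addressed in this sketch.

```python
import math


def continuous_features(values, q):
    """Return all values within the top q% of a decreasing candidate list."""
    if not 1 <= q <= 100:
        raise ValueError("q must be an integer between 1 and 100")
    if not values:
        return []
    n_keep = max(1, math.ceil(len(values) * q / 100))  # assumed rounding rule
    return values[:n_keep]


# A hypothetical candidate list of 40 decreasing values: 1.00, 0.98, ..., 0.22.
values = [(100 - 2 * i) / 100 for i in range(40)]

# One feature subset per tested q value.
subsets = {q: continuous_features(values, q) for q in (10, 20, 30, 40)}
print({q: len(v) for q, v in subsets.items()})
```

Unlike the discrete strategy, which keeps one value per chosen position, this selection keeps every value above the cut-off, so the subsets for larger q contain those for smaller q.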
The ROC and PR curves of RF classifiers with continuous strategy were plotted as shown in Figure 4. All ROC curves were close to the point (0, 1), and all PR curves were close to the point (1, 1). The AUROCs and AUPRs were all quite high. Compared with the AUROCs and AUPRs for discrete strategy, those for continuous strategy were generally higher. This finding further confirmed that the features yielded by continuous strategy were more powerful than those yielded by discrete strategy.

Figure 2: Entire procedures of the method for identification of drug side effects. Positive dataset (reported drug-side effect pairs) is retrieved from SIDER, and five negative datasets are randomly generated. From the four public databases or tools, five drug properties are employed and used to extract features with multiple-feature sampling scheme. Random forest is adopted to build the model and is further evaluated by 10-fold cross-validation.

A multiple-feature sampling scheme was proposed to extract essential features from each drug-side effect pair. Previous studies [21, 22] only picked the top value as the feature, and this technique was called the single sampling scheme. This section compares the RF classifiers with these two feature sampling schemes. The average performances of RF classifiers with single-feature sampling scheme are listed in Table 3. The MCCs for the two strategies were 0.7172 and 0.8661, which were higher than that for the RF classifier with single-feature sampling scheme. The same conclusion holds for the other five measurements. The ROC and PR curves of the RF classifier with single-feature sampling scheme were also plotted (Figure 3) and were found to be always under those of RF classifiers with discrete strategy. The AUROC and AUPR of the RF classifier with single-feature sampling scheme were 0.870 and 0.878, respectively, which were also lower than those of the RF classifier with discrete strategy.
For the RF classifier with continuous strategy, its AUROCs and AUPRs (Figure 4) were even better than those of the RF classifier with discrete strategy and were also higher than those of the RF classifier with single-feature sampling scheme. All these results implied that the features yielded by the multiple sampling scheme contained more essential information about drug-side effect pairs than those obtained by the single sampling scheme. These features provide RF with improved performance.

Sampling Scheme. The RF classifiers with features yielded by multiple sampling (discrete strategy) were superior to those with features yielded by single sampling, and the RF classifiers with continuous strategy were better than those with discrete strategy. However, whether this result depends on the choice of classification algorithm must be explored. In this section, the 12 other classification algorithms mentioned in Section 2.4 were tested. The classifiers with different algorithms and all feature subsets used for RF were constructed and evaluated by 10-fold cross-validation. The predicted results are listed in Tables S1-S24. The performances of classifiers with single sampling and the best performance of classifiers with multiple sampling are listed in Table 4. The classifiers with multiple sampling (discrete strategy) were generally better than those with single sampling, and those with continuous strategy were superior to those with discrete strategy and single sampling. For a visual confirmation, a radar graph was plotted for each of ACC, MCC, and F1-measure, as illustrated in Figure 5. For each measurement, the area enclosed by the curve of classifiers with multiple sampling (continuous strategy) was the largest, followed by that of classifiers with multiple sampling (discrete strategy); the area enclosed by the curve of classifiers with single sampling was the smallest.
On the basis of these results, the multiple sampling scheme captures the essential properties of drug-side effect pairs more efficiently than the single sampling scheme, and the continuous strategy is better than the discrete strategy.

[...] (Table 2). For the other classifiers with different classification algorithms, q = 20 usually yields the best performance, as shown in Figure 6. Among the 13 classifiers with different classification algorithms, 10 (76.92%) provided the best performance when q = 20. Meanwhile, two yielded the best performance when q = 30. This phenomenon is reasonable. When q is extremely small, some essential information about drug-side effect pairs cannot be included. When q is large, noise may be introduced. The current investigation suggests that q can be chosen between 20 and 30.

This study presents a novel investigation of drug side effects. Its contributions are twofold. One is the multiple-feature sampling scheme that can extract essential features from drug-side effect pairs, and the other is the novel computational methods for the identification of drug side effects based on the features yielded by the multiple sampling scheme. Classifiers were built on the basis of different classification algorithms. By comparison, the classifiers using features yielded by the multiple sampling scheme performed better than those using features yielded by the single sampling scheme. The proposed classifiers can be useful tools to identify drug side effects, and the novel feature extraction scheme can be applied to other similar biological or medical problems.

The original data used to support the findings of this study are available at SIDER and in supplementary information files.

The authors declare that there is no conflict of interest regarding the publication of this paper.

This work was supported by the Natural Science Foundation of Shanghai (17ZR1412500).
Table S1: performance of SVM (polynomial kernel) classifier with discrete strategy.
Table S2: performance of SVM (polynomial kernel) classifier with continuous strategy.
Table S3: performance of SVM (RBF kernel) classifier with discrete strategy.
Table S4: performance of SVM (RBF kernel) classifier with continuous strategy.
Table S5: performance of Adaboost M1 classifier with discrete strategy.
Table S6: performance of Adaboost M1 classifier with continuous strategy.
Table S7: performance of Bagging classifier with discrete strategy.
Table S8: performance of Bagging classifier with continuous strategy.
Table S9: performance of Bayesian network classifier with discrete strategy.
Table S10: performance of Bayesian network classifier with continuous strategy.
Table S11: performance of Naive Bayes classifier with discrete strategy.
Table S12: performance of Naive Bayes classifier with continuous strategy.
Table S13: performance of KNN classifier with discrete strategy.
Table S14: performance of KNN classifier with continuous strategy.
Table S15: performance of decision tree classifier with discrete strategy.
Table S16: performance of decision tree classifier with continuous strategy.
Table S17: performance of PART classifier with discrete strategy.
Table S18: performance of PART classifier with continuous strategy.
Table S19: performance of logistic regression classifier with discrete strategy.
Table S20: performance of logistic regression classifier with continuous strategy.
Table S21: performance of multilayer perceptron classifier with discrete strategy.
Table S22: performance of multilayer perceptron classifier with continuous strategy.
Table S23: performance of RIPPER classifier with discrete strategy.
[1] A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification
[2] Ensemble of keyword extraction methods and classifiers in text classification
[3] Exploring performance of instance selection methods in text sentiment classification
[4] A hybrid ensemble pruning approach based on consensus clustering and multiobjective evolutionary algorithm for sentiment classification
[5] A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification
[6] Predicting drug side-effect profiles: a chemical fragment-based approach
[7] Predicting neurological adverse drug reactions based on biological, chemical and phenotypic properties of drugs using machine learning models
[8] Inverse similarity and reliable negative samples for drug side-effect prediction
[9] Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs
[10] Predicting adverse drug reactions through interpretable deep learning framework
[11] Predicting drugs side effects based on chemical-chemical interactions and protein-chemical interactions
[12] Predicting drug side effects by multi-label learning and ensemble learning
[13] An algorithmic framework for predicting side effects of drugs
[14] Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models
[15] Drug side effect prediction through linear neighborhoods and multiple data source integration
[16] Using drug similarities for discovery of possible adverse reactions
[17] Identification of drug-side effect association via multiple information integration with centered kernel alignment
[18] A novel triple matrix factorization method for detecting drug-side effect association based on kernel target alignment
[19] Identification of drug-side effect association via semi-supervised model and multiple kernel learning
[20] Predicting drug side effects with compact integration of heterogeneous networks
[21] A similarity-based method for prediction of drug side effects with heterogeneous information
[22] Prediction of drug side effects with a refined negative sample selection strategy
[23] Random forests
[24] A side effect resource to capture phenotypic effects of drugs
[25] Predicting biological functions of compounds based on chemical-chemical interactions
[26] Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities
[27] A hybrid method for prediction and repositioning of drug anatomical therapeutic chemical classes
[28] Inferring anatomical therapeutic chemical (ATC) class of drugs using shortest path and random walk with restart algorithms
[29] Recognizing novel chemicals/drugs for anatomical therapeutic chemical classes with a heat diffusion algorithm
[30] SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
[31] RDKit: open-source cheminformatics
[32] KEGG: new perspectives on genomes, pathways, diseases and drugs
[33] KEGG: Kyoto encyclopedia of genes and genomes
[34] STITCH 4: integration of protein-chemical interactions with user data
[35] DrugBank 5.0: a major update to the DrugBank database for 2018
[36] Predicting non-deposition sediment transport in sewer pipes using random forest
[37] Similarity-based machine learning model for predicting the metabolic pathways of compounds
[38] Prediction of antimalarial drug-decorated nanoparticle delivery systems with random forest models
[39] RF-PseU: a random forest predictor for RNA pseudouridine sites
[40] A deep learning architecture for metabolic pathway prediction
[41] Identification of drug-disease associations by using multiple drug and disease networks
[42] Data Mining: Practical Machine Learning Tools and Techniques
[43] Support-vector networks
[44] Experiments with a new boosting algorithm
[45] Bagging predictors
[46] BAYESNET: Bayesian Classification Network Based on Biased Random Competition Using Gaussian Kernels
[47] An empirical study of the naive Bayes classifier
[48] Nearest neighbor pattern classification
[49] C4.5: Programs for Machine Learning
[50] Generating accurate rule sets without global optimization
[51] Speeding up logistic model tree induction
[52] Multilayer perceptron, fuzzy sets, classification
[53] Fast effective rule induction
[54] A study of cross-validation and bootstrap for accuracy estimation and model selection
[55] Detecting the multiomics signatures of factor-specific inflammatory effects on airway smooth muscles
[56] Identifying transcriptomic signatures and rules for SARS-CoV-2 infection
[57] Identification of protein subcellular localization with network and functional embeddings
[58] iMPTCE-Hnetwork: a multi-label classifier for identifying metabolic pathway types of chemicals and enzymes with a heterogeneous network
[59] iATC-NRAKEL: an efficient multi-label classifier for recognizing anatomical therapeutic chemical classes of drugs
[60] Comparison of the predicted and observed secondary structure of T4 phage lysozyme
[61] Determining protein-protein functional associations by functional rules based on gene ontology and KEGG pathway
[62] Identification of drug-drug interactions using chemical interactions
[63] Identify key sequence features to improve CRISPR sgRNA efficacy
[64] Signal Detection Theory and ROC Analysis
[65] Similarity-based prediction for anatomical therapeutic chemical classification of drugs by integrating multiple data sources