Title: Insights into performance evaluation of compound-protein interaction prediction methods
Authors: Adiba Yaseen, Imran Amin, Naeem Akhter, Asa Ben-Hur, Fayyaz Minhas
Affiliations: Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan; National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan; Department of Computer Science, Colorado State University, Fort Collins, USA; Tissue Image Analytics Centre, University of Warwick, Coventry, UK
Date: 2022-01-28

Motivation: Machine learning based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies and can improve the efficiency and cost-effectiveness of wet lab assays. Despite the publication of many research papers reporting CPI predictors in recent years, we have observed a number of fundamental issues in experiment design that lead to over-optimistic estimates of model performance.

Results: In this paper, we analyze the impact of several important factors affecting the generalization performance of CPI predictors that are overlooked in existing work: (1) similarity between training and test examples in cross-validation; (2) the strategy for generating negative examples, in the absence of experimentally verified negative examples; and (3) the choice of evaluation protocols and performance metrics and their alignment with the real-world use of CPI predictors in screening large compound libraries. Using both an existing state-of-the-art method (CPI-NN) and a proposed kernel based approach, we have found that assessment of the predictive performance of CPI predictors requires careful control over the similarity between training and test examples.
We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization performance in comparison to the more sophisticated strategies used in existing studies. Furthermore, we have found that our kernel based approach, despite its simple design, exceeds the prediction performance of CPI-NN. We have used the proposed model for compound screening of several proteins including the SARS-CoV-2 Spike and Human ACE2 proteins and found strong evidence in support of its top hits.

Availability: Code and raw experimental results available at https://github.com/adibayaseen/HKRCPI
Contact: Fayyaz.minhas@warwick.ac.uk

Introduction

Compound-protein interaction (CPI) prediction is an important task in target-compound screening for identifying protein targets of compounds, drug design, and drug repurposing studies (Schirle and Jenkins 2016). Affinity chromatography (Broach and Thorner 1996) and protein microarrays (Lee and Lee 2016; Zhao et al. 2021) are among the most frequently used experimental methods for the identification of CPIs. However, such wet-lab approaches can be expensive and time-consuming (W. Zhang, Pei, and Lai 2017; Paul et al. 2010). The emergence of pandemics such as Ebola and COVID-19 and the global challenge of antimicrobial resistance have highlighted the need for improving efficiency and throughput in drug design (Thafar et al. 2019). Consequently, CPI prediction using computational methods has become an attractive area of research (X. Chen et al. 2016), as such approaches can improve the cost, time, and efficiency of drug discovery in contrast to experimental methods (Mazandu et al. 2018). Conventionally, structure-based and ligand-based virtual screening are the most well-researched areas of drug discovery (Lim et al. 2021). However, these methods require the three-dimensional (3D) structure of the protein of interest.
As a consequence, machine learning (ML) based methods that use sequence characteristics of proteins and chemical structural representations of compounds for interaction prediction have been developed (Bredel and Jacoby 2004; Bleakley and Yamanishi 2009; Gönen 2012; Y. Wang and Zeng 2013). Based on the representation of proteins and compounds used in them, these computational methods can be categorized into three main classes: feature representation-based methods, similarity-based methods, and end-to-end learning methods. Similarity-based methods rest on the assumption that similar drugs tend to target similar proteins and vice versa (R. Chen et al. 2018). In feature representation-based approaches (Ding et al. 2014), features from compounds and proteins are extracted and fed to a machine learning model such as the nearest neighbor predictor, bipartite local models (Bleakley and Yamanishi 2009), Bayesian matrix factorization-based kernels (Gönen 2012), Gaussian interaction profile kernels (van Laarhoven, Nabuurs, and Marchiori 2011), the pairwise kernel method (PKM) (Jacob and Vert 2008), etc. Comparative analysis by Ding et al. has shown that PKM outperforms the other approaches (Ding et al. 2014). In recent years, researchers have developed multiple deep learning models for CPI prediction. DeepDTA (Öztürk, Özgür, and Ozkirimli 2018) extracts real-valued sparse feature representations of both proteins and compounds using convolutional neural networks (CNNs) and combines these features through a final fully connected layer. WideDTA (Öztürk, Ozkirimli, and Özgür 2019) and Conv-DTI (S. ) used an analogous idea with additional features, ligand structural similarity, and information about protein domains and motifs to enhance model accuracy. For representing compound structures, GraphDTA (Nguyen et al. 2021) and CPI-GNN (Tsubaki, Tomii, and Sese 2019) used graph neural networks (GNNs) (X.-M. Zhang et al. 2021) as an alternative to CNNs.
CPI-NN was shown to outperform other embedding-based methods. Despite the increased sophistication of CPI models through deep learning, the generalization performance of existing approaches on independent or real-world datasets is still not perfect (Riley 2019). One of the fundamental issues behind this is biased and overly optimistic performance assessment arising from the use of unsuitable datasets, poor non-redundancy control in train-test splitting during cross-validation, improper procedures for the generation of negative examples, a lack of independent test sets, and the choice of performance metrics. Here, we discuss each of these issues in further detail. A number of ML-based CPI prediction models have used the MUV (Rohrer and Baumann 2009), DUD-E (Mysinger et al. 2012) and Human-CPI datasets (Tsubaki, Tomii, and Sese 2019; Liu et al. 2015) for model training and performance evaluation. However, these datasets do not contain true or experimentally verified negative examples and may have a large degree of redundancy between proteins and compounds, which can lead to biased machine learning models (Lieyang Chen et al. 2019; Sieg, Flachsenberg, and Rarey 2019; Lifan Chen et al. 2020). Another issue associated with the performance assessment of ML CPI models is the protocol used for generating negative examples. As there is no standardized dataset of negative examples for compound-protein interaction prediction, researchers in this domain resort to one of two approaches for the generation of "synthetic" negative examples for training and performance assessment of machine learning models: random pairing and inter-class similarity-controlled negative example generation. In random pairing, proteins and compounds in the positive set are simply randomly paired to generate synthetic negative examples after exclusion of known positive pairs, as in the dataset used in CPI-NN (Tsubaki, Tomii, and Sese 2019).
However, researchers have argued that random pairing can produce examples that are highly similar to positive examples, which can add labeling noise in training (Ding et al. 2014). As a consequence, they have proposed that negative examples should be generated with controls over inter-class similarity. This process first creates a candidate negative set through random pairing of compounds and proteins. A similarity function is then used to calculate the degree of similarity between a candidate negative example and the given set of positive examples. Only those candidate negative examples whose similarity score with the positive examples is lower than a pre-specified threshold are added to the final negative set, resulting in negative examples that are sufficiently dissimilar to known positive examples (Ding et al. 2014). However, as in the case of protein-protein interaction prediction models (Ben-Hur and Noble 2006), the use of similarity-controlled negative example generation in model evaluation can result in overly optimistic performance results with a high likelihood of generalization failure on real-world test sets. A number of existing approaches also use an equal number of positive and negative examples, even though the number of compounds that can be expected to bind a given protein is significantly smaller than the size of the universe of possible compounds. This results in the generation of a large number of false positives in real-world applications. Furthermore, the cross-validation protocols employed in most existing ML CPI models do not consider protein sequence and compound similarity in generating training and test folds, resulting in overly optimistic performance estimates as the training set can contain examples that are very similar to test examples. Ideally, the examples in the test folds should be sufficiently different from training examples to reflect real-world use cases.
Lastly, existing methods report areas under the Receiver Operating Characteristic or Precision-Recall curves (AUC-ROC and AUC-PR) as performance metrics. However, given that such approaches are typically used for screening interactions from a large number of candidate compound-protein pairs for wet-lab validation, these metrics may not provide a directly interpretable estimate of how good a method is at ranking the interacting compounds of a protein. In this work, we highlight the issues discussed above with a number of experiments using an existing CPI prediction model (CPI-NN) (Tsubaki, Tomii, and Sese 2019) as well as a novel heterogeneous kernel-based approach. We suggest improvements in the evaluation protocol used for performance assessment of such models in terms of negative example generation as well as performance metrics. We report the prediction results of the proposed approach for screening candidate compounds for a number of test proteins not included in the datasets used in model construction, including the SARS-CoV-2 Spike protein and Human ACE2.

Methods

In this section, we discuss details of our datasets, experiments and machine learning method design for compound-protein interaction prediction. We use the human compound-protein interaction (HCPI) dataset originally proposed by Liu et al. (2015) and employed in a number of existing methods such as CPI-NN (Tsubaki, Tomii, and Sese 2019). In this dataset, positive examples consisting of protein-compound pairs were collected from two experimentally verified databases: DrugBank 4.1 (Wishart et al. 2008) and Matador (Günther et al. 2008). The dataset has 3,364 positive examples of interacting protein-compound pairs comprising 852 unique proteins and 1,179 unique compounds. It also contains an equal number of negative examples, obtained by randomly pairing proteins and compounds in the positive set, provided as part of the CPI-NN code repository (Tsubaki, Tomii, and Sese 2019).
We found that the aforementioned HCPI dataset by Liu et al. used in CPI-NN (Tsubaki, Tomii, and Sese 2019) contains duplicated examples which can lead to an overestimation of prediction performance. We removed these duplicates from the HCPI positive set, resulting in 2,633 unique positive examples that constitute our NR-HCPI dataset, together with negative examples obtained by randomly pairing proteins and compounds from the positive set while excluding any pairs already included in the positive set. We generated different ratios of positive to negative examples (P:N = 1:1, 1:3, 1:5, and 1:7) for the evaluation of predictive performance under more realistic evaluation scenarios with high class imbalance. In conjunction with this dataset, we also utilized a non-redundant cross-validation (NRCV) protocol which is detailed in the performance evaluation section. As discussed in the Introduction, one of the fundamental issues with protein-compound interaction datasets is the lack of experimentally verified negative examples. For drug repurposing analysis, we use the SuperDRUG2 (version 2) database (Siramshetty et al. 2018) of approved and commercially available drugs with a total of 3,633 unique small molecules. We have also used the SuperDRUG2 molecules for screening potential binding compounds for the SARS-CoV-2 Spike protein and the human ACE2 protein. For performance analysis, we have used the existing CPI-NN method, which gives state-of-the-art results over the same datasets (Tsubaki, Tomii, and Sese 2019). CPI-NN has been validated for human and C. elegans proteins with high AUC-ROC (0.95 and above) and under different class ratio settings. We used the publicly available code of CPI-NN and conducted experiments with various cross-validation and assessment strategies ourselves after verifying the reproducibility of the results under the same experimental settings as reported in the original CPI-NN paper. We have also developed a simple kernel-based approach for CPI prediction (see Fig. 1).
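As a minimal sketch, the random-pairing procedure for generating the NR-HCPI negative sets can be implemented as below. The function name and data layout are illustrative, not taken from the paper's code; it assumes positive examples are given as (protein, compound) identifier pairs.

```python
import random

def generate_random_negatives(positives, ratio=1, seed=0):
    """Randomly pair proteins and compounds from the positive set to create
    synthetic negatives, excluding any known positive pair.

    positives: set of (protein_id, compound_id) tuples.
    ratio: number of negatives per positive (e.g., 7 for a 1:7 P:N ratio).
    """
    rng = random.Random(seed)
    proteins = sorted({p for p, _ in positives})
    compounds = sorted({c for _, c in positives})
    negatives = set()
    target = ratio * len(positives)
    while len(negatives) < target:
        pair = (rng.choice(proteins), rng.choice(compounds))
        if pair not in positives:  # exclude known interactions
            negatives.add(pair)
    return negatives
```

Sampling with rejection keeps the negative set disjoint from the positive set, mirroring the exclusion step described above.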
For this purpose, we model compound-protein interaction (CPI) prediction as a classification problem in which every example z ≡ (p, c) consists of a protein p and a compound c with corresponding feature representations φ_P(p) and φ_C(c), respectively. Each example in the training dataset D = {((p_i, c_i), y_i) | i = 1 … N} is associated with a binary label y_i ∈ {−1, +1} indicating whether the corresponding protein and compound interact (+1) or not (−1). Features are extracted from the protein sequence as well as from the SMILES representation of the compound for heterogeneous modeling of the CPI problem, as discussed below. In order to capture amino-acid-specific binding characteristics of proteins with their target compounds, we use the amino acid composition (AAC) of the protein (denoted by φ_A(p)), a 20-dimensional vector representation of a protein sequence containing the frequency of occurrence of each amino acid (K. Huang et al. 2020). In order to model physicochemical similarity across amino acids, we also use the grouped k-mer composition of proteins as a feature vector. In this approach, each amino acid in a protein is assigned to one of seven predetermined groups based on its physicochemical characteristics (Hashemifar et al. 2018) and the counts of all possible group-level k-mers are used as a feature vector. For k = 2 and k = 3, this results in 7^2 = 49- and 7^3 = 343-dimensional protein features, denoted by φ_G2(p) and φ_G3(p), respectively. For modeling compound characteristics, we extract features from the SMILES of the compound in a compound-protein pair using its Extended-Connectivity Fingerprint (ECFP, also known as the Morgan fingerprint) (M. Veselinovic et al. 2015) computed with RDKit (Cao et al. 2013). This fingerprint is a topological feature representation of a chemical compound and captures its structural conformation within a given radius. The feature dimension of this representation is 1024 for a radius of 3 atoms.
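The protein features above can be sketched as follows. The seven-group scheme shown is one common physicochemical grouping and is an assumption here; the paper cites Hashemifar et al. (2018) for its exact grouping, which may differ.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# A common seven-group physicochemical scheme (illustrative assumption).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA2GROUP = {aa: str(g) for g, members in enumerate(GROUPS) for aa in members}

def aac(seq):
    """20-dimensional amino acid composition phi_A(p): per-residue frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[aa] / n for aa in AMINO_ACIDS]

def grouped_kmer(seq, k=2):
    """7^k-dimensional grouped k-mer composition (phi_G2 for k=2, phi_G3 for k=3)."""
    grouped = "".join(AA2GROUP[aa] for aa in seq if aa in AA2GROUP)
    counts = Counter(grouped[i:i + k] for i in range(len(grouped) - k + 1))
    keys = ["".join(t) for t in product("0123456", repeat=k)]
    return [counts[key] for key in keys]
```

The compound-side ECFP features would come from RDKit's Morgan fingerprint generator and are omitted here to keep the sketch dependency-free.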
We have developed a heterogeneous feature space kernel representation for compound-protein interaction prediction. As each classification example in this problem comprises a protein and a compound, we first construct non-linear kernel representations of proteins and compounds separately, which are then merged to form a heterogeneous feature space kernel for classification, as shown in Figure 1. We use the compound feature representation φ_C(c) to construct a radial basis function (RBF) similarity kernel between pairs of compounds:

k_C(c, c') = exp(−γ_C ‖φ_C(c) − φ_C(c')‖²),

where γ_C is a kernel width hyperparameter. An analogous RBF kernel k_P(p, p') is constructed over the protein feature representations, and the two are combined into a joint kernel over compound-protein pairs:

K((p, c), (p', c')) = (k_P(p, p') + k_C(c, c'))².

Note that the joint kernel is a quadratic sum of the protein and compound kernels which gives rise to an abstract and nonlinear joint feature space φ(p, c) for compound-protein pairs, with the kernel being an implicit generalized dot product between φ(p, c) and φ(p', c'). The product k_P(p, p') k_C(c, c') in the expansion of the above formulation implicitly corresponds to the tensor product of the protein and compound feature spaces. It is also important to note that two examples will have a high kernel score if the corresponding proteins and compounds in the two examples are similar. The joint kernel over the training dataset D = {((p_i, c_i), y_i) | i = 1 … N} is used for training a kernelized Support Vector Machine (SVM) (Vapnik and Izmailov 2017), which is then used to infer the prediction score f(p, c) for a given test example (p, c). This approach is in line with the work of Jacob and Vert (2008), with major differences in the choice of constituent kernels and the construction of the joint kernel (see supplementary material for comparative results). We have designed multiple experiments to identify issues in the performance evaluation and generalization of CPI predictors, which are described in this section. For direct comparison with previous methods, we have used stratified five-fold cross-validation, which is typically used for reporting CPI prediction results.
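A minimal sketch of this joint kernel, assuming the quadratic-sum form (k_P + k_C)² described in the text; the function names and default hyperparameters are illustrative, not from the paper's code:

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

def joint_kernel(prot_x, comp_x, prot_y, comp_y, gamma_p=1.0, gamma_c=1.0):
    """Heterogeneous joint kernel: the squared sum (k_P + k_C)^2.

    Its expansion k_P^2 + 2*k_P*k_C + k_C^2 contains the cross-term
    k_P * k_C, which corresponds to a tensor product of the protein
    and compound feature spaces.
    """
    kp = rbf(prot_x, prot_y, gamma_p)
    kc = rbf(comp_x, comp_y, gamma_c)
    return (kp + kc) ** 2
```

In practice, the Gram matrix of this kernel over all training pairs could be passed to an SVM that accepts precomputed kernels (e.g., scikit-learn's `SVC(kernel="precomputed")`).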
In five-fold stratified cross-validation, the dataset is divided into five equal folds such that the ratio of examples of every class in each fold remains the same as the overall class proportion. Each cross-validation experiment is repeated ten times to obtain the average and standard deviation of performance metrics such as AUC-ROC and AUC-PR. One limitation of five-fold cross-validation is that very similar proteins or compounds may end up in different folds, resulting in an overly optimistic assessment of prediction performance. To estimate generalization performance in a real-world setting, where test proteins may not share very high sequence similarity with proteins in the training set, we have also performed a more stringent non-redundant cross-validation analysis which has not been performed in previous studies. For this purpose, proteins in the NR-HCPI dataset are first clustered at a given sequence identity threshold using CD-HIT (Y. Huang et al. 2010). These clusters are then divided into five folds such that no two folds have examples from the same cluster, while ensuring that the number of examples in every fold remains approximately the same. This guarantees that proteins in the examples of a test fold always share less than the specified sequence similarity threshold with proteins in the training set. We used two different sequence similarity thresholds (40% and 90%) in our analysis. In addition to classical five-fold cross-validation and non-redundant cross-validation, we have also analyzed the prediction quality of our proposed method as well as CPI-NN on an external set containing experimentally verified negative examples from Binding-DB, as described in the dataset section. In this experiment, the ML model is trained on four folds of non-redundant cross-validation as described above. However, the original negative examples in the test fold are replaced with experimentally verified negative examples from Binding-DB.
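The cluster-to-fold assignment can be sketched with a simple greedy heuristic: assign each cluster (largest first) to the currently lightest fold, so no cluster is ever split across folds and fold sizes stay roughly balanced. This is an illustrative strategy, not necessarily the exact balancing rule used in the paper.

```python
def clusters_to_folds(cluster_sizes, n_folds=5):
    """Assign whole CD-HIT clusters to folds so that no cluster is split.

    cluster_sizes: dict mapping cluster_id -> number of examples in it.
    Returns a list of n_folds lists of cluster ids.
    """
    folds = [[] for _ in range(n_folds)]
    loads = [0] * n_folds
    # Largest clusters first; each goes to the currently lightest fold.
    for cid, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))
        folds[i].append(cid)
        loads[i] += size
    return folds
```

Because every protein cluster lives in exactly one fold, any test protein shares less than the CD-HIT clustering threshold of sequence identity with all training proteins.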
This process is repeated by alternating across the different folds and over multiple runs to generate mean and standard deviation values of the performance metrics. As discussed in the Introduction, two strategies are used in the literature for generating negative examples: random pairing and similarity-controlled negative example generation. In this work, we systematically compare these strategies for training and performance assessment of the proposed model. For this purpose, we have developed the algorithm shown in Table 2.

In a practical setting, compound-protein interaction prediction approaches are used for screening a large number of compounds for potential binding with a target protein of interest. Ideally, interacting compounds of a given protein should rank close to the top in comparison to non-interacting compounds in the screening library based on the prediction scores of all test examples. However, the cross-validation experiments used in previous works do not model this "screening" use case as they are restricted to a fixed evaluation dataset and do not analyze how a predictor would rank known interacting partners in a setting in which all compounds are paired with all proteins. In this work, we have performed in silico screening of all unique compounds against all proteins in a given test set. This all-vs-all pairwise screening is useful for drug discovery and repurposing studies and is carried out by computing the prediction score of all possible pairs of proteins and compounds in a test set using a prediction model and calculating how often the predictor ranks a known interacting pair among its top predictions.
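A plausible sketch of the similarity-controlled filtering step, based on the description in the Introduction (the exact algorithm is given in Table 2 of the paper; the function name and similarity interface here are illustrative):

```python
def similarity_controlled_negatives(candidates, positives, sim, alpha):
    """Keep only candidate negatives that are sufficiently dissimilar to
    every positive example.

    candidates: iterable of randomly paired (protein, compound) examples.
    positives: list of known interacting (protein, compound) examples.
    sim: function scoring the similarity of two examples in [0, 1].
    alpha: inter-class similarity threshold.
    """
    kept = []
    for cand in candidates:
        if max(sim(cand, pos) for pos in positives) < alpha:
            kept.append(cand)
    return kept
```

Lowering alpha makes the retained negatives more dissimilar to the positive set, which (as the experiments below show) inflates cross-validation scores while hurting generalization to experimentally verified negatives.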
We have performed three different screening experiments to compare CPI-NN and the proposed model. In the first experiment, we train a model using the training folds of the NRCV dataset and then compute prediction scores of all-vs-all compound-protein pairs in the test fold using the trained model (see the supplementary information file for an illustration of the experimental setup). This process is repeated for all five folds of the dataset to compute a rank-based performance metric (RFPP) described in the next subsection. For drug-repurposing analysis with the proposed model, we used the SuperDRUG2 dataset containing 3,633 FDA-approved drugs. In this experiment, a CPI model is first trained on all examples in the training folds of the NRCV dataset and then used to generate prediction scores for all proteins in the test fold paired with all compounds in the SuperDRUG2 database (see supplementary material on GitHub for an illustration of the experimental setup). These scores are used to rank the known interacting compounds of each protein in the test set relative to the compounds in SuperDRUG2 in order to compare the predictive performance of CPI-NN and the proposed model and to identify putative compounds in SuperDRUG2 that can bind test proteins in the NRCV dataset. We have also used the proposed model trained over the NR-HCPI dataset to generate predictions for interactions of the SARS-CoV-2 Spike protein and the Human ACE2 protein across all compounds in SuperDRUG2 to identify putative interactions (Goulter et al. 2004; Xia and Lazartigues 2008; Zou et al. 2020). We then performed a literature search for any experimental evidence of interaction of the top-scoring compounds with these proteins or their association with SARS-CoV-2 treatment effects. For this purpose, the proposed model was trained over positive examples in the HCPI dataset after removing lightweight molecules (with molecular weight less than 100) as well as these two proteins, using a P:N ratio of 1:7.
For quantifying the prediction quality of CPI predictors in screening experiments, we have developed an interpretable performance metric called Rank of First Positive Prediction (RFPP), inspired by our previous work on protein-protein interactions (Minhas, Geiss, and Ben-Hur 2014). For a given protein in the test set, RFPP is obtained by first pairing all possible test compounds with the protein and computing the prediction scores of all such examples using the CPI model under evaluation. The rank of the highest-scoring compound that is a known interacting partner of the test protein is then used as the RFPP value of that protein (see supplementary material for an illustration of this experimental setup). Note that for an ideal predictor, the RFPP for all test proteins should be one, i.e., the top-ranked compound of each test protein should be an interaction partner of that protein. In order to quantify the predictive quality of a CPI model across all test proteins, we first compute the RFPP for all test proteins and then calculate percentiles of the RFPP values across all proteins. These percentile values can be used to compare the predictive performance of screening models based on their ability to rank putative compound-protein interactions. The p-th percentile of the RFPP of a predictor will be q (denoted as RFPP(p) = q) if p% of test proteins have at least one known interacting compound in the top q predictions from the predictor. It essentially shows the expected number of compounds that will need to be screened to identify a true interacting partner in wet lab experiments. For an ideal predictor, the RFPP value for all proteins in the test set should be one, i.e., RFPP(100) = 1. We have generated the RFPP percentile plots of the proposed method as well as CPI-NN. As a baseline, we have also plotted the RFPP percentiles of a random predictor which generates random prediction scores given a protein and a compound.
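The RFPP computation above can be sketched as follows; the function names and data layout are illustrative, and the percentile convention is one reasonable reading of RFPP(p) = q (the smallest rank q such that p% of proteins have RFPP ≤ q).

```python
import math

def rfpp_per_protein(scores, known_pairs):
    """Rank of First Positive Prediction for each test protein.

    scores: dict protein -> list of (compound, score) over all test compounds.
    known_pairs: set of (protein, compound) known interactions.
    Returns dict protein -> 1-based rank of its highest-scoring true partner.
    """
    rfpp = {}
    for prot, comp_scores in scores.items():
        ranked = sorted(comp_scores, key=lambda cs: -cs[1])  # best score first
        for rank, (comp, _) in enumerate(ranked, start=1):
            if (prot, comp) in known_pairs:
                rfpp[prot] = rank
                break
    return rfpp

def rfpp_percentile(rfpp_values, p):
    """RFPP(p) = q: p% of proteins have at least one true hit in the top q."""
    vals = sorted(rfpp_values)
    idx = max(0, math.ceil(p / 100 * len(vals)) - 1)
    return vals[idx]
```

For an ideal predictor every value in `rfpp_per_protein` would be 1, so `rfpp_percentile(..., 100)` would return 1.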
These values provide more directly interpretable estimates of prediction quality for such screening experiments.

Results

Previous approaches have used five-fold cross-validation (CV) for performance evaluation over the Human CPI dataset. In order to provide a direct comparison with previously published methods, we have performed five-fold cross-validation with the proposed approach at different class ratios (see Table 3). The proposed model gives an AUC-ROC and AUC-PR of 99%, which is better than CPI-NN (94%). However, as discussed in the Introduction, this high predictive performance of both CPI-NN and the proposed approach can be attributed to the duplication of examples in the dataset as well as similarity between examples across cross-validation folds. If the duplicated examples are removed, we observe a minor drop in the prediction performance of both approaches. In order to get a more realistic estimate of the generalization performance of these methods, we have performed the non-redundant cross-validation analysis discussed in the previous section. Table 3 presents detailed results of this analysis for different positive-to-negative (P:N) example ratios and classifiers at a 90% sequence identity threshold. As expected, the predictive performance of the predictors decreases significantly with the removal of redundancy between training and test sets through NRCV. These experiments clearly show that it is very important to analyze prediction performance under non-redundant cross-validation. Results at the 40% threshold are reported in the supplementary material (on GitHub) and follow a similar trend. As outlined in section 2.3.4, we have used a set of experimentally verified negative examples from the Binding-DB dataset to analyze the prediction performance of the proposed model as well as CPI-NN. For this purpose, both models were first trained on the NR-HCPI dataset with a balanced class ratio.
The results of this analysis are given in Table 4, which shows that upon using true negative examples from Binding-DB in testing, the prediction performance of the proposed model is superior to that of CPI-NN (AUC-ROC of 89.9% vs. 76.8%), which supports the findings from the non-redundant cross-validation above. As expected, increasing the ratio of negative examples in training further improves the prediction performance of the proposed method over the Binding-DB test set. We have analyzed the impact of the different ways of generating synthetic negative examples (random pairing vs. inter-class similarity-controlled negative example generation) on the estimation of the prediction quality of a CPI model through cross-validation and on its generalization performance on an external dataset containing experimentally verified negative examples from Binding-DB. For this purpose, we have used a procedure that allows us to generate synthetic negative examples by controlling their degree of similarity with a given positive set through an inter-class similarity threshold α (see Section 2.4.5 for complete details of the experimental setup). The AUC-PR values of both CPI-NN and the proposed model for cross-validation and the external test set at different values of α are plotted in Fig. 2. It shows that, as expected, if the similarity between the generated synthetic negative examples and the positive set is increased, the AUC-PR values of both CPI-NN and the proposed approach obtained from cross-validation decrease. This is in line with the findings of Ding et al. (2014) and supports similarity-controlled generation of negative examples. However, if models trained over negative examples that are significantly different from the given positive set are tested on an external set containing experimentally verified negative examples, their generalization performance is low.
This is indicated in Fig. 2 by the increase in the AUC-PR values of both CPI-NN and the proposed approach over the external test set as the value of α is increased. This experiment clearly shows that random pairing of proteins and compounds is a superior strategy for generating synthetic negative examples, as it not only gives more realistic estimates of predictive quality but can also improve the performance of CPI models over unseen test sets. Fig. 3 shows the RFPP percentiles across all proteins resulting from the all-vs-all screening experiment over the NR-HCPI dataset with non-redundant cross-validation, as detailed in section 2.4.6. In this experiment, a CPI model is first trained over examples in the training folds of the non-redundant cross-validation dataset and then used to rank all possible pairs of proteins and compounds in the test set to assess, through the RFPP metric, how good the method is at ranking known interacting compounds for all proteins. The total number of all such possible combinations in this dataset is ~292,500. The figure shows that for 85% of test proteins, the proposed approach is able to find at least one known interacting compound among its top 10 hits (i.e., RFPP(85) = 10), whereas for CPI-NN, only 50% of proteins have at least one known hit in the top 10 predictions (RFPP(50) = 10). In contrast, a random predictor can be expected to have at least one interacting compound in its top 10 predictions for only 5% of the proteins in this test set. This clearly shows the efficacy of the proposed approach as well as the ease of interpreting the results of model evaluation through the use of RFPP in screening experiments. As expected, we also see that adding more randomly paired negative examples to training improves RFPP further.
In order to evaluate the prediction performance of the proposed model and CPI-NN for possible drug-repurposing studies, we have conducted a virtual screening experiment using the FDA-approved drugs in the SuperDRUG2 dataset. For this purpose, we score all possible (~908,250) pairs of proteins from the NR-HCPI dataset with compounds from SuperDRUG2. All these predictions from the proposed model are made available to the community as supplementary results. As an additional step, we have also calculated the RFPP percentiles across all proteins for the proposed model as well as CPI-NN in this screening experiment, which are given in the supplementary file. For a random predictor, we can expect to find at least one true interacting compound in the top 10 hits for only 3% of the proteins in this analysis. However, CPI-NN and the proposed model are able to identify at least one interacting compound for 50% and 75% of proteins, respectively. The results of in silico screening of the compounds in the SuperDRUG2 dataset against the Human ACE2 (UniProt ID: Q9BYF1) and SARS-CoV-2 Spike (UniProt ID: P59594) proteins using the proposed method are given in the supplementary file (on GitHub), which shows the top 100 predictions of our model for the ACE2 and Spike proteins along with evidence from the literature supporting the predicted interactions. We have found that the proposed model is able to identify a number of compounds as potential interaction partners of these proteins even though they were not included in its training. Specifically, we have identified Trandolapril, Dimethyl Sulfoxide (DMSO), Remdesivir, Ramipril, N-Acetylglucosamine, Perindopril, Sunitinib and Glutathione among our top hits for ACE2 binding, with strong support from experiments and in silico studies in the literature. Similarly, N-Acetylglucosamine, DMSO, Remdesivir, Sunitinib, Nilotinib, Dasatinib and Sorafenib show binding potential with the Spike protein of SARS-CoV-2, with strong support in recent literature.
In this work, we have identified a number of shortcomings in experiment design for CPI prediction. We hope that this work will enable the community to address these issues so that future CPI models are more effective at predicting protein-compound interactions for novel cases.
This work is supported by HEC NRPU 6085. FM is also supported by the PathLAKE consortium grant at the University of Warwick.