key: cord-0296142-ac1lhor5 authors: Tsukiyama, Sho; Hasan, Md Mehedi; Fujii, Satoshi; Kurata, Hiroyuki title: LSTM-PHV: Prediction of human-virus protein-protein interactions by LSTM with word2vec date: 2021-02-27 journal: bioRxiv DOI: 10.1101/2021.02.26.432975 sha: 564b557e09e172430ca6d50936a2cef9783f3fbd doc_id: 296142 cord_uid: ac1lhor5 Viral infection involves a large number of protein-protein interactions (PPIs) between human and virus, ranging from the initial binding of viral coat proteins to host membrane receptors to the hijacking of the host transcription machinery. However, few interspecies PPIs have been identified, because experimental methods, including mass spectrometry, are time-consuming and expensive, and molecular dynamics simulation is limited to proteins whose 3D structures have been solved. Sequence-based machine learning methods are expected to overcome these problems. We developed the first LSTM model with word2vec, named LSTM-PHV, to predict PPIs between human and virus using amino acid sequences alone. The LSTM-PHV effectively learnt training data with a highly imbalanced ratio of positive to negative samples and achieved an AUC of 0.976 with an accuracy of 98.4% under 5-fold cross-validation. Using an independent test dataset, we compared the LSTM-PHV with existing state-of-the-art PPI predictors, including DeepViral. In predicting PPIs between human and unknown or new viruses, the LSTM-PHV presented higher performance than the existing predictors when they were trained on datasets including multiple host proteins. The LSTM-PHV learnt multiple host protein sequence contexts more efficiently than DeepViral. Interestingly, learning of sequence contexts as words alone presented remarkably high performance. Uniform manifold approximation and projection (UMAP) demonstrated that the LSTM-PHV clearly distinguished the positive PPI samples from the negative ones.
We present the LSTM-PHV online web server, freely available at http://kurata35.bio.kyutech.ac.jp/. Viral infections are one of the major threats to human health, as the current global pandemic caused by SARS-CoV-2 shows. As of February 2021, more than 110 million infections and nearly 2.4 million deaths from the COVID-19 disease have been reported worldwide [1]. Viruses achieve their life cycle and proliferate by hijacking and utilizing the functions of their hosts. In this process, viruses interact with host proteins to control cell cycles and apoptosis and to transport their own genetic material into the host nucleus [2, 3]. Therefore, identifying human-virus protein-protein interactions (HV-PPIs) is important for understanding the mechanisms of viral infection and host immune responses and for finding new drug targets. However, compared to intraspecies PPIs, few interspecies PPIs have been identified. Experimental methods such as yeast two-hybrid and mass spectrometry have been widely used to identify interactions, but they are time-consuming and laborious, making it difficult to apply them to all protein pairs. Computational approaches therefore serve as a preliminary screen prior to experiments. The use of amino acid sequence information is promising for PPI prediction because both experimental PPI data and protein sequence information are abundant. Machine learning (ML)-based approaches are very attractive [4]; they use amino acid binary profiles [5, 6], evolutionary properties [7, 8], physicochemical properties [9, 10], and structural information [11]. Zhou et al. integrated different encoding methods, such as the relative frequency of amino acid triplets, the frequency difference of amino acid triplets, and amino acid composition, to construct an SVM-based PPI predictor [12].
Recently, promising encoding schemes have been proposed to capture the sequence patterns of proteins, including the conjoint triad [13, 14], auto covariance [15], and autocorrelation [16]. Human-virus interactions involve not only various properties of amino acid sequences but also the arrangement of the 20 amino acid residues in the semantic context of whole protein sequences. While many predictors have focused on the former features, the latter context-based information has been suggested to be effective in predicting HV-PPIs [17]. To capture as much context information of amino acid residue sequences as possible, word/document embedding techniques have recently been proposed. Yang et al. combined the doc2vec encoding scheme with a random forest method to predict PPIs [17]. DeepViral (Liu-Wei et al., 2020) combined doc2vec/word2vec embedding methods with a convolutional neural network (CNN); it also encoded host phenotype associations from PathoPhenoDB [18] and protein functions from the Gene Ontology (GO) database (The Gene Ontology Consortium, 2017) to predict PPIs. In addition, several ML models were designed for certain individual virus species, limiting their generalizability to other human host-virus systems [19] [20] [21]. To utilize the amino acid sequence context as words effectively, we propose the long short-term memory (LSTM) model [22] combined with the word2vec embedding method, named LSTM-PHV, to predict PPIs between human and virus. To the best of our knowledge, this is the first application of the LSTM with word2vec to sequence-based PPI prediction. Interestingly, use of the sequence context as words presented remarkably accurate prediction of interactions between human and unknown virus proteins. The PPI data were downloaded from the Host-Pathogen Interaction Database 3.0 (HPIDB 3.0) [23]. The retrieved HV-PPIs were further filtered by the following process.
First, to ensure interactions with a certain level of confidence, PPIs with an MI score below 0.3 were removed. The MI score is the confidence score assigned to each PPI by IntAct [24] and VirHostNet [25]. Second, redundant PPIs were excluded by using CD-HIT with an identity threshold of 0.95 [26]. Third, only the PPIs whose proteins consist of standard amino acids and have a length of more than 30 and less than 1000 residues were selected. Finally, 22,383 PPIs between 5,882 human and 996 virus proteins were considered as positive samples. To the best of our knowledge, there is no gold standard for generating negative samples. Many previous studies used a random sampling method: pairs of human and virus proteins that do not appear in the positive PPI dataset are randomly sampled as negative data. However, random sampling may incorrectly assign many positive samples to the negative set [5, 20]. To address this problem, the dissimilarity negative sampling method was developed [5], which uses sequence similarity to find protein pairs that are unlikely to interact. We employed the dissimilarity-based negative sampling method as follows. We calculated the sequence similarities of all pairs of virus proteins in the positive samples with the Needleman-Wunsch algorithm using BLOSUM30 and defined a similarity vector for each virus protein. Subsequently, we excluded as outliers the virus proteins whose similarity scores were lower than the threshold T_i for more than half of the total virus proteins. T_i was calculated by:

T_i = Q1_i - 1.5 * IQR_i

where Q1_i and IQR_i are the first quartile and interquartile range of the similarity scores for the i-th virus protein, respectively. By setting the maximum and minimum values of the similarity scores to 0 and 1, respectively, the similarity score was normalized and converted into a distance.
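The quartile-based outlier rule above can be sketched as follows. This is a minimal illustration using only Python's standard library; the function names and toy scores are assumptions for illustration, not the authors' code.

```python
import statistics

def outlier_threshold(similarity_scores):
    """Tukey-style lower fence: T = Q1 - 1.5 * IQR.

    similarity_scores: similarity scores of one virus protein against
    all other virus proteins in the positive samples.
    """
    q1, _, q3 = statistics.quantiles(similarity_scores, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr

def is_outlier(protein_scores, thresholds):
    """A virus protein is excluded when its similarity score falls below
    the per-protein threshold for more than half of the virus proteins."""
    below = sum(1 for s, t in zip(protein_scores, thresholds) if s < t)
    return below > len(protein_scores) / 2
```

With `statistics.quantiles(n=4)`, the cut points are the quartiles, so the fence follows directly from the definition in the text.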
The human proteins that consist of the standard amino acids and whose residue length is longer than 30 and shorter than 1000 were retrieved from the SwissProt database [27]. In the field of natural language processing, embedding methods such as word2vec [28] and doc2vec [29] were developed to obtain distributed representations of words and documents, respectively. In word2vec, the weights of a neural network learn the context of words to provide a distributed representation that encodes different linguistic regularities and patterns [30]. There are two methods for learning word context: the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-Gram Model (Skip-Gram). CBOW predicts the current word from its context, while Skip-Gram predicts the context from the current word. Skip-Gram is more efficient with less training data, while CBOW trains faster and handles frequent words better. These methods have recently been applied in computational biology [31, 32]. The amino acid sequences of human and virus proteins registered as positive and negative samples were encoded as matrices using the word2vec method. Each k-mer (k consecutive amino acids) in an amino acid sequence was regarded as a single word (unit), and each amino acid sequence was represented by multiple k-mers. For example, given the amino acid sequence MAEDDPYL, the 4-mer units are MAED, AEDD, EDDP, DDPY, and DPYL (Fig. 1). We trained a CBOW-based word2vec model to learn the appearance patterns of k-mers, choosing CBOW for its computational speed, using the gensim Python package [33]. Here, k-mers and protein sequences correspond to words and sentences in natural language. Human and virus proteins in the positive samples and non-redundant proteins in the SwissProt database [27] were used to train the word2vec model. The non-redundant proteins were collected by applying CD-HIT to all proteins with an identity threshold of 0.9.
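The k-mer "wording" of a sequence can be sketched as below. The tokenizer is plain Python; the commented-out gensim call illustrates how such token lists could feed a CBOW word2vec model (the call and its parameter values beyond those stated in the text are assumptions, not the authors' exact settings).

```python
def kmerize(sequence, k=4):
    """Split an amino acid sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example from the text: MAEDDPYL -> MAED, AEDD, EDDP, DDPY, DPYL
sentences = [kmerize("MAEDDPYL", 4)]

# With gensim installed, a CBOW word2vec model could then be trained roughly as:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, vector_size=128, window=3, sg=0, epochs=1000)
# (vector_size=128 and window=3 follow the text; other settings are assumptions)
```

Each protein becomes one "sentence" of overlapping 4-mers, so neighboring words share 3 residues, which is what lets word2vec pick up local sequence context.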
The k-mers within three positions of a specific k-mer were considered as its peripheral (context) k-mers, and training was iterated 1000 times. The trained word2vec model produced a 128-dimensional embedding vector for each k-mer, and these vectors were concatenated to produce the embedding matrices of proteins. Since 4-mers provided the largest AUC by 5-fold cross-validation in a previous study [17], we set k to 4. Neural networks, in particular the convolutional neural network (CNN) and recurrent neural network (RNN), are very powerful and have been applied to difficult problems such as speech recognition and visual object recognition [34]. The RNN learns time or step dependencies in sequence data and enables training on variable-length data. The LSTM solves the gradient explosion and vanishing gradient problems of RNNs, enabling learning of long-term dependencies. The LSTM-PHV is composed of three sub-networks (Fig. 2): two upstream networks with the LSTM that transform the human and virus protein matrices into fixed-length vectors, and a final network that concatenates them to predict the interaction probability. The model was trained with a class-balanced cross-entropy loss,

L = - Σ_i w(y_i) [y_i log p_i + (1 - y_i) log(1 - p_i)], with w(y) = (1 - β) / (1 - β^{n_y}),

where y is the correct label, p is the model-predicted probability of interaction, n_y is the number of data whose label is y in the mini-batch, and β is a hyperparameter. β was set to 0.99. To prevent over-learning, the training process was terminated when the maximum accuracy on the validation data was not updated for 20 consecutive epochs. To prevent the weight of the loss function from becoming 0, we set an approximately equal ratio of labels for all mini-batches. To evaluate the prediction performance, 7 statistical measures were used: sensitivity, specificity, accuracy, precision, MCC, AUC, and AUPRC, where TP, FP, TN, and FN are the numbers of the correctly predicted positive samples, incorrectly predicted positive samples, correctly predicted negative samples, and incorrectly predicted negative samples, respectively. The threshold for determining whether protein pairs interact was set to a predicted probability of 0.5. AUC and AUPRC are the areas beneath the ROC curve and PR curve, respectively. These measures were calculated with the scikit-learn Python package [37].
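The class-balancing weight described above can be sketched as follows. This assumes the effective-number weighting w(y) = (1 - β) / (1 - β^{n_y}), which matches the quantities named in the text (label count n_y per mini-batch, hyperparameter β = 0.99); the authors' exact formula may differ.

```python
import math

def class_weight(n_y, beta=0.99):
    """Effective-number class weight: w = (1 - beta) / (1 - beta ** n_y).

    n_y : number of samples with label y in the mini-batch (must be >= 1).
    beta: hyperparameter (0.99 in the text); the weight shrinks as n_y
          grows, down-weighting the majority class.
    """
    return (1.0 - beta) / (1.0 - beta ** n_y)

def weighted_bce(y_true, p_pred, n_pos, n_neg, beta=0.99):
    """Class-balanced binary cross-entropy for a single sample."""
    w = class_weight(n_pos if y_true == 1 else n_neg, beta)
    return -w * (y_true * math.log(p_pred)
                 + (1 - y_true) * math.log(1 - p_pred))
```

A label with only one sample in the batch gets weight 1.0, while a label with thousands of samples is scaled down toward (1 - β), which is how the loss counters the positive/negative imbalance.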
To visualize the concatenated vectors, we reduced their dimensionality from 256 to 2 using uniform manifold approximation and projection (UMAP). UMAP is a nonlinear dimensionality reduction approach [38] that can preserve not only local but also global patterns in the low-dimensional space. In this study, the number of neighbors in the k-neighbor graph was set to 50, and the minimum distance between points in the low-dimensional space was set to 0. The distances between points were calculated as Euclidean distances. The optimization was run for up to 500 epochs with a learning rate of 1.0. We used the LSTM-PHV to predict HV-PPIs from amino acid sequences alone. Prediction performances were evaluated via 5-fold cross-validation on the training dataset. Of the five models, the one with the highest AUC was used to predict the independent test dataset. The accuracies on the training and independent datasets were 0.984 and 0.985, respectively (Fig. 3 and Fig. 4). The AUCs were 0.976 and 0.973 on the training and independent datasets, respectively (Tables S1 and S2). To characterize the performance of the LSTM-PHV, we compared it with Yang's model (an RF model with doc2vec) [17] on our independent test data, as shown in Fig. 4; the LSTM-PHV achieved higher MCC and AUPRC, indicating that it learnt the imbalanced data better than Yang's model. The high MCC is a great advantage, because learning from imbalanced data is essential: at present, the number of known PPIs is very small compared to the total number of protein pairs, and it is inevitable that many more negative samples than positive ones are produced in the absence of a gold standard for generating negative samples. To assess whether the LSTM-PHV is applicable to unknown virus species, it was evaluated on the four datasets provided by Zhou et al. [12] (Fig. S2), which were also employed by DeepViral [39]. We compared the LSTM-PHV with Zhou's model and DeepViral.
In training the LSTM-PHV on Zhou's datasets, we set the batch size to 256 and used the normal binary cross-entropy loss function, because Zhou's datasets were much smaller than ours and were balanced. The comparison results are shown in Fig. 5. The two upstream neural networks with the LSTM generated the fixed-length vectors. To examine how these neural networks extract PPI-related information in transforming the embedding matrices of proteins into the two fixed-length vectors, we drew UMAP maps of their concatenated vectors on the independent test data (Fig. 6 and Fig. S1-S4). In all the UMAP maps, multiple clusters were generated, and positive samples were clearly separated from the negative ones. The LSTM-based model with word2vec (LSTM-PHV) efficiently learnt highly imbalanced training data to accurately predict PPIs between human and virus. Learning of amino acid sequence contexts as words is dominantly effective in predicting PPIs. The UMAP visualization shows that positive samples are clearly distinguished from negative samples. Amino acid sequences were represented by 4-mers and embedded as matrices by training the word2vec model. The matrices were generated by concatenating the vectors of 4-mers in a row. The blue and red lines show the two upstream neural networks with the LSTM that transform the human and virus protein matrices into the two fixed-length vectors, respectively. The purple line shows the final neural network that concatenates the fixed-length vectors of human and virus proteins to predict their PPIs. We employed the four datasets (a, b, c, d) that combine the four training datasets with two test datasets according to Zhou's study. The two datasets containing human-virus interactions (TR1-TS1 and TR2-TS2) were applied to the LSTM-PHV, Zhou's model, the DeepViral variant that used amino acid sequences alone, and the DeepViral variant integrating three features. The other two datasets containing host-virus interactions (TR3-TS1 and TR4-TS2) were applied to the LSTM-PHV, Zhou's model, and DeepViral.
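The two-encoder architecture described in the caption above can be sketched in PyTorch as below; the hidden sizes and the classifier head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LSTMPHVSketch(nn.Module):
    """Two upstream LSTM encoders (human, virus) plus a final network
    that concatenates their fixed-length vectors to predict interaction."""

    def __init__(self, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.human_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.virus_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Hypothetical head: 2*hidden -> 64 -> 1 sigmoid probability.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),
        )

    def forward(self, human_emb, virus_emb):
        # Each input: (batch, seq_len, emb_dim) word2vec embedding matrix.
        _, (h_h, _) = self.human_lstm(human_emb)   # final hidden state
        _, (h_v, _) = self.virus_lstm(virus_emb)
        pair = torch.cat([h_h[-1], h_v[-1]], dim=1)  # (batch, 2*hidden)
        return self.classifier(pair).squeeze(1)      # interaction probability
```

Because the LSTM reduces each variable-length embedding matrix to its final hidden state, human and virus proteins of different lengths map to fixed-length vectors before concatenation.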
ACC and AUC correspond to accuracy and the area under the ROC curve, respectively. The performances were obtained from Table 1.

References
World Health Organization et al. Coronavirus disease (COVID-19) situation dashboard.
Understanding Human-Virus Protein-Protein Interactions Using a Human Protein Complex-Based Analysis Framework. mSystems.
The landscape of human proteins interacting with viruses and other pathogens.
Evolution of Sequence-based Bioinformatics Tools for Protein-protein Interaction Prediction.
DeNovo: virus-host sequence-based protein-protein interaction prediction.
Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding.
Evolutionary profiles improve protein-protein interaction prediction from sequence.
PPIevo: protein-protein interaction prediction from PSSM based evolutionary information.
PPI-Detect: A support vector machine model for sequence-based prediction of protein-protein interactions.
LocFuse: human protein-protein interaction prediction via classifier fusion using protein localization information.
ProMate: a structure based prediction program to identify the location of protein-protein binding sites.
A generalized approach to predicting protein-protein interactions between virus and host.
Sequence-based prediction of protein-protein interaction using a deep-learning algorithm.
Protein-Protein Interactions Prediction Using a Novel Local Conjoint Triad Descriptor of Amino Acid Sequences.
Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences.
SIPMA: A Systematic Identification of Protein-Protein Interactions in Zea mays Using Autocorrelation Features in a Machine-Learning Framework.
Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method.
Linking human pathogens to their phenotypes in support of infectious disease research.
Supervised learning and prediction of physical interactions between human and HIV proteins.
Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins.
Prediction and analysis of human-herpes simplex virus type 1 protein-protein interactions by integrating multiple methods.
Long short-term memory.
HPIDB 2.0: a curated database for host-pathogen interactions.
The IntAct molecular interaction database in 2012.
Distributed Representations of Words and Phrases and their Compositionality.
Identifying antimicrobial peptides using word embedding with deep recurrent neural networks.
PTPD: predicting therapeutic peptides by deep learning and word2vec.
Software Framework for Topic Modelling with Large Corpora.
Sequence to Sequence Learning with Neural Networks.
Automatic Differentiation in PyTorch.
On the Variance of the Adaptive Learning Rate and Beyond.
Scikit-learn: Machine Learning in Python.
Uniform Manifold Approximation and Projection for Dimension Reduction.