key: cord-0944150-pwkl8rwy
authors: Yang, Xiaodi; Yang, Shiping; Lian, Xianyi; Wuchty, Stefan; Zhang, Ziding
title: Transfer learning via multi-scale convolutional neural layers for human-virus protein-protein interaction prediction
date: 2021-02-16
journal: bioRxiv
DOI: 10.1101/2021.02.16.431420
sha: 6fec7f80d32e3a45f25cab9402ef61155eed529d
doc_id: 944150
cord_uid: pwkl8rwy

To predict interactions between human and viral proteins, we combine evolutionary sequence profile features with a Siamese convolutional neural network (CNN) architecture and a multi-layer perceptron (MLP). Our architecture outperforms various feature encoding-based machine learning methods and state-of-the-art prediction methods. As our main contribution, we introduce two types of transfer learning methods (i.e., ‘frozen’ type and ‘fine-tuning’ type) that reliably predict interactions in a target human-virus domain based on training in a source human-virus domain, by retraining CNN layers. Our transfer learning strategies effectively transfer prior knowledge from a large source dataset/task to a small target dataset/task to improve prediction performance. Finally, we utilize the ‘frozen’ type of transfer learning to predict human-SARS-CoV-2 PPIs and find that our predictions are topologically and functionally similar to experimentally known interactions. Source code and datasets are available at https://github.com/XiaodiYangCAU/TransPPI/.

Viruses often employ a complex network of protein-protein interactions (PPIs) to co-opt their own cellular biological processes, strongly implying that the detection of virus-host PPIs is essential for our understanding of the mechanisms that allow the virus to control cellular functions of the human host.

Here, we focus on the application of transfer deep learning approaches to the prediction of interactions between proteins of viruses and the human host, an important issue amidst the world-wide COVID-19 pandemic. In particular, we design a deep learning approach to predict interactions between proteins of various viruses and the human host by representing interacting protein sequences with a pre-acquired protein sequence profile module. Specifically, we utilize Position Specific Scoring Matrix (PSSM) features to encode the sequence characteristics with a Siamese-based CNN whose output is fed to a multi-layer perceptron (MLP) (Figure 1) (see Materials and methods for details).

We propose two types of transfer learning methods that freeze or fine-tune the parameters of the CNN layers, which are trained with a source human-virus system and retrained with a target human-virus system, to improve prediction performance. Note that we trained the MLP layers with the target human-virus system from randomly initialized parameters. Notably, we found that transferring prior knowledge learned from a large-scale human-virus PPI dataset to predict interactions in smaller data sets such as Dengue virus, Zika virus and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) improved prediction performance, provided better model generalization and reduced model training time. Finally, we employ the 'frozen' transfer learning models to predict human-SARS-CoV-2 PPIs based on models pre-trained with all known human-viral protein interactions. Analysis of the obtained interaction network indicates high functional similarity to experimentally observed host-virus interactions, suggesting that our approach indeed provides meaningful predictions and can guide efforts to understand viral infection mechanisms and host immune responses post SARS-CoV-2 infection.
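The difference between the 'frozen' and 'fine-tuning' strategies can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the `Layer` class, `random_layer`, `build_target_model` and `sgd_step` are hypothetical names, and layers are reduced to flat weight lists so that only the parameter-transfer logic is shown. In both modes the CNN weights are copied from the source model and the MLP layers are randomly initialized; the modes differ only in whether the copied weights keep being updated on the target data.

```python
import random
from dataclasses import dataclass


@dataclass
class Layer:
    """Toy stand-in for a network layer: a name, flat weights, a trainable flag."""
    name: str
    weights: list
    trainable: bool = True


def random_layer(name, size, rng):
    # Random initialization, as described for the MLP layers of the target model.
    return Layer(name, [rng.uniform(-0.1, 0.1) for _ in range(size)])


def build_target_model(source_cnn, mlp_sizes, mode, seed=0):
    """Copy the source CNN layers and attach freshly initialized MLP layers.

    mode="frozen":      transferred CNN weights stay fixed during target training.
    mode="fine-tuning": transferred CNN weights continue to be updated.
    """
    if mode not in ("frozen", "fine-tuning"):
        raise ValueError(f"unknown transfer mode: {mode!r}")
    rng = random.Random(seed)
    cnn = [
        Layer(layer.name, list(layer.weights), trainable=(mode == "fine-tuning"))
        for layer in source_cnn
    ]
    mlp = [random_layer(f"mlp{i}", size, rng) for i, size in enumerate(mlp_sizes)]
    return cnn + mlp


def sgd_step(model, grads, lr=0.01):
    """One gradient step that skips every layer marked as non-trainable."""
    for layer, grad in zip(model, grads):
        if layer.trainable:
            layer.weights = [w - lr * g for w, g in zip(layer.weights, grad)]
```

In a real deep learning framework the same idea is usually expressed by loading the source weights into the convolutional layers and switching their trainable flag off ('frozen') or leaving it on ('fine-tuning').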

The deep learning framework for human-virus PPI prediction

Representing interactions between human and viral proteins through their sequences, we introduce an end-to-end deep neural network framework, called a Siamese-based CNN, that consists of a pre-acquired protein sequence profile module, a Siamese CNN module and a prediction module

dimensional array. As noted previously, we considered a fixed sequence length of n = 2,000 and zero-padded shorter sequences. Training our CNN+MLP approach with word2vec+CT one-hot encodings of the corresponding protein sequences, we observed that the representation of sequences through PSSM in our approach provided better prediction performance, especially in relatively small datasets such as Dengue, Zika and SARS-CoV-2 (Table 3).
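The fixed-length encoding mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `pad_pssm` is hypothetical, a PSSM is taken to be a list of rows with 20 scores per residue, and truncating sequences longer than 2,000 residues is an assumption, since the text only states that shorter sequences are zero-padded.

```python
def pad_pssm(pssm, length=2000, width=20):
    """Return a length x width matrix: PSSM rows followed by all-zero rows.

    pssm   : list of rows, one per residue, each holding `width` scores
    length : fixed sequence length (2,000 in the text)
    width  : scores per residue (20 amino acids, an assumption of this sketch)
    """
    for row in pssm:
        if len(row) != width:
            raise ValueError("every PSSM row must have exactly `width` scores")
    # Assumption: sequences longer than `length` are cut at the C-terminal end.
    rows = [list(row) for row in pssm[:length]]
    # Zero-pad shorter sequences up to the fixed length.
    rows.extend([0.0] * width for _ in range(length - len(rows)))
    return rows
```

Every protein then yields an array of identical shape, which is what allows both branches of a Siamese network to share the same convolutional weights.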

To further assess the performance of our proposed method, we compared our method with three existing human-virus PPI prediction approaches. Recently, we proposed a sequence embedding-based random forest (RF) method to predict human-virus PPIs with promising performance. In particular, we applied an unsupervised sequence embedding technique (i.e., doc2vec) to represent interacting protein sequences as low-dimensional vectors with rich features that are subjected to RF to predict the

Alguwaizani et al.'s SVM approach with these data sets as well. Notably, Table 4 suggests that our deep learning and our previously published RF-based method clearly outperformed other approaches.

To explore potential factors that affect prediction performance in a cross-viral setting, we trained our deep learning model on one human-virus PPI data set and predicted protein interactions in a different human-virus system. As expected, 5-fold cross-validation showed that the performance of such naïve cross-viral tests dropped considerably compared to training and testing in the same human-virus system (Figure 2a). To allow reliable cross-viral predictions of PPIs, we introduce two transfer learning methods (i.e., 'frozen' and 'fine-tuning') where we first train the parameters of the CNN layers on a source human-virus PPI dataset. Subsequently, we transfer all parameters of the CNN layers to initialize a new model ('frozen' or 'fine-tuning') with randomly initialized MLP layers, which is then trained on a given target human-virus PPI domain. To comprehensively test our transfer learning approaches, we considered each combination of human-viral PPI sets as source and target domains.

training on all source human-virus PPI datasets showed best performance compared to separately training with virus-specific source PPI datasets (data not shown). Therefore, we employed the five 'frozen' models of the 5-fold cross-validation based on the human-all virus source dataset to predict human-SARS-CoV-2 PPIs and averaged the scores of the five models as the prediction result. Controlling the false positive rate at 0.01, we identified 946 high-confidence protein interactions between 21 SARS-CoV-2 proteins and 551 human proteins (Supplementary Table S1). As for topological network analysis, we found a power-law distribution when we counted the number of human proteins that were targeted by a certain number of viral proteins (Figure 3a).

(Figure 4). As a result, we found that both predicted and known viral targets were significantly enriched with essential genes (Figure 4a), while they rarely were transcription factors and kinases (Figure 4b, c). In turn, we observed that ubiquitinated, methylated and acetylated proteins increasingly appeared in sets of viral targets while phosphorylated targets did not appear enriched (Figure 4d-g). In particular, several studies have revealed that SARS-CoV-2 targets human proteins that receive post-translational modifications (PTMs) related to pathways that modulate host antiviral immune responses (Pruimboom, 2020; Shin et al., 2020). In turn, phosphorylation appears ubiquitous and not specific for viral targets.
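The ensemble scoring described earlier in this section (averaging the scores of the five 'frozen' fold models and controlling the false positive rate at 0.01) can be sketched as below. The function names are hypothetical, and the way the cutoff is calibrated is an assumption: here it is derived from scores of known negative pairs, which the text does not specify.

```python
import math


def average_scores(per_model_scores):
    """Average the scores of several models, pair by pair.

    per_model_scores: one score list per model, all of the same length.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def threshold_at_fpr(negative_scores, max_fpr=0.01):
    """Return a cutoff t such that calling `score > t` positive gives FPR <= max_fpr.

    negative_scores: scores of pairs assumed to be non-interacting.
    """
    n = len(negative_scores)
    allowed = math.floor(max_fpr * n + 1e-9)  # number of tolerated false positives
    if allowed >= n:
        return float("-inf")
    ranked = sorted(negative_scores, reverse=True)
    # At most `allowed` negatives score strictly above the (allowed + 1)-th value.
    return ranked[allowed]


def high_confidence(pairs, scores, cutoff):
    """Keep the pairs whose averaged score exceeds the cutoff."""
    return [pair for pair, score in zip(pairs, scores) if score > cutoff]
```

Using a strict comparison against the cutoff keeps the empirical false positive rate at or below the target even when several negatives share the same score.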

Comparing our predicted and experimentally obtained sets of interactions between proteins of SARS-CoV-2 and the human host, we found considerable overlaps. In particular, 298 out of 946 predicted PPIs were identified through previous experimental efforts, amounting to 52.5% of known interactions in SARS-CoV-2, while 648 were specifically identified through our deep learning approach (Figure 5a, Supplementary Table S1), indicating the reliability and specificity of our model for the identification of novel interactions. To further examine functional similarities and differences between known and predicted targets, we performed functional and pathway enrichment for experimentally known and predicted viral targets, respectively. Considering hypergeometric tests (Bonferroni-corrected P-value threshold: 0.01), we observed a relatively large number of shared GO enrichment terms/KEGG enrichment pathways of experimentally confirmed targets and predicted targets, indicating functional similarity between them, which further demonstrates the reliability and quality of our predictions (Figure 5b, Supplementary Tables S2 and S3). In more detail, enriched GO BP terms in human host proteins that were found in the experimental PPIs and predictions are displayed in Figure 5c. Supporting the use of our predicted targets to discover functions that SARS-CoV-2 interferes with, we found that functional enrichments of experimental and predicted viral targets both point to the involvement of viral targets in protein transport, protein import and mRNA export from the nucleus. Notably, our predictions augment such functions, indicating that the virus may also interfere with nuclear pore organization and assembly as well as protein export from the nucleus.
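The enrichment analysis above relies on hypergeometric tests with Bonferroni correction. A minimal sketch using only the Python standard library is given below; the function names are hypothetical, and the one-sided (over-representation) form of the test is an assumption, as the text does not state which tail was used.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) of the hypergeometric distribution.

    N : number of background proteins
    K : background proteins annotated with the term or pathway
    n : number of viral targets tested
    k : viral targets annotated with the term or pathway
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def enriched(term_pvalues, alpha=0.01):
    """Return the terms whose Bonferroni-adjusted P-value is below alpha."""
    terms = list(term_pvalues)
    adjusted = bonferroni([term_pvalues[t] for t in terms])
    return [t for t, p in zip(terms, adjusted) if p < alpha]
```

Bonferroni correction is conservative: with many GO terms tested, only strongly over-represented terms pass the 0.01 threshold.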

To further explore potential functional modules that can reveal SARS-CoV-2 biology, we combined our predicted 946 human-SARS-CoV-2 PPIs with known human-specific PPIs from the HIPPIE database (Alanis-Lobato et al., 2017) (Figure 6a). Specifically, we identified 9 topological modules based on connectivity among human proteins (Figure 6a, b), utilizing the MCODE algorithm (Bader and Hogue, 2003). Investigating the enrichment of GO BP terms and KEGG pathways through hypergeometric tests (Bonferroni-adjusted P-value threshold: 0.05), we observed that these modules largely revolve around ribosome biogenesis, retrograde protein transport, elastic fiber assembly, mitochondrial translation, protein processing in endoplasmic reticulum, stress granule regulation, protein folding in endoplasmic reticulum, centrosome and gene splicing (Supplementary Table S4).

we found a conserved binding motif (Figure 6c), corroborating our assumption that SARS-CoV-2 nsp13 protein may also interfere with the regulation processes of IFN that support antiviral innate immune response.

In a different module (Figure 6d)

orchestrate the assembly of viral replication complexes. In particular, coronaviruses commonly manipulate stress granules and related RNA biology processes, while stress granule formation/assembly is considered a primary antiviral response (Nakagawa et al., 2018; Quispe-Tintaya, 2019). To corroborate such findings, we found a conserved amino-acid motif ('FGXF') in the nsp3 proteins of these viruses. In particular, mutations in the 'FGDF' motif of SFV nsp3 protein indicated that residues F, G, F are essential for G3BP-binding (Panas et al., 2015).
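Scanning a protein sequence for the 'FGXF' pattern is straightforward with a regular expression. The sketch below is illustrative only: `find_fgxf` is a hypothetical helper, 'X' is assumed to stand for any single residue, and the sequences in the usage note are made-up strings rather than real viral proteins.

```python
import re

# Lookahead so that overlapping occurrences (e.g. in "FGAFGDF") are all reported.
FGXF_PATTERN = re.compile(r"(?=(FG[A-Z]F))")


def find_fgxf(sequence):
    """Return (1-based position, matched motif) for every FGXF occurrence."""
    return [
        (match.start() + 1, match.group(1))
        for match in FGXF_PATTERN.finditer(sequence.upper())
    ]
```

For example, `find_fgxf("MAFGDFKK")` reports a single 'FGDF' hit starting at residue 3, while a sequence without the pattern yields an empty list.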

Notably, we found that the SARS-CoV-2 nsp12 protein and the SARS-CoV nsp3 protein both contained the complete 'FGXF' motif, indicating the reliability of our prediction and the detailed interaction pattern.

Discussion and conclusions

We designed a Siamese-based multi-scale CNN architecture by using PSSM to represent the sequences of interacting proteins, allowing us to predict interactions between human and viral proteins with an MLP approach. In comparison, we observed that our model outperformed previous state-of-the-art human-virus PPI prediction methods. Furthermore, we confirmed that the performance of the combination of our deep learning framework and the representation of the protein features as PSSM was mostly superior to combinations of other machine learning and pre-trained feature embeddings.

While we found that our model, when trained on a given source human-viral interaction data set, performed poorly in naïvely predicting protein interactions in a target human-virus domain, we introduced two transfer learning methods (i.e., 'frozen' type and 'fine-tuning' type). Such methods allowed us to train on a source human-virus domain and retrain the CNN layers with data of a target domain. Notably, our methods increased the cross-viral prediction performance dramatically, compared to the naïve baseline model. In particular, for small target datasets, fine-tuning pre-trained parameters that were obtained from larger source sets increased prediction performance.

Implementation details. As for pre-acquired sequence profile construction, we consider a fixed sequence length of 2,000. As for the construction of our learning approach, we employ four convolutional modules, with input sizes 20, 64, 128 and 256. The convolution kernel size is set to 3 while the size of the pooling window is set to 2, with 3 max-pooling layers and a global max-pooling layer.

We conduct 5-fold cross-validation to evaluate the performance of predictive models. Under 5-fold cross-validation, each PPI dataset is divided equally into five non-overlapping subsets, and each subset is used once as the test set while the remaining subsets train the model, which provides an unbiased evaluation. We use the following metrics [i.e., accuracy, precision, sensitivity, specificity, F1-score, and area under the precision recall curve (AUPRC)] to evaluate the performance of the proposed method. In particular, we define accuracy as

word2vec+CT one-hot) and our deep learning algorithm (CNN+MLP).
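The evaluation protocol described above can be sketched in a few lines. The code assumes the standard confusion-matrix definitions of the listed metrics (with TP, FP, TN and FN denoting true/false positives/negatives); the function names are hypothetical, and AUPRC is omitted because it requires ranked scores rather than hard labels.

```python
import random


def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = interacting, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Standard definitions; a ratio with a zero denominator is reported as 0."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

In each round one fold serves as the test set and the other four as the training set; the metrics from the five rounds are then aggregated.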

Proc Neural Inf Process Syst

Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations

Deep neural network based predictions of protein interactions using primary sequences

Virus-host interactome and proteomic survey reveal potential virulence factors influencing SARS-CoV-2 pathogenesis

Prediction and analysis of human-herpes simplex virus type 1 protein-protein interactions by integrating multiple methods

Database of Essential Genes that includes built-in analysis tools

in the regulation of IRF3 signaling

Neural article pair modeling for wikipedia sub-article matching

Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding

Inhibition of stress granule formation by Middle East respiratory syndrome coronavirus 4a accessory protein facilitates viral translation, leading to efficient virus replication

Viral and cellular proteins containing FGDF motifs bind G3BP to block stress granule formation

'Artiphysiology' reveals V4-like shape tuning in a deep network trained for image classification

Ebola virus protein VP35 impairs the function of interferon regulatory factor-activating kinases IKKε and TBK-1

Methylation pathways and SARS-CoV-2 lung infiltration and cell membrane-virus fusion are both subject to epigenetics

Stress granules and processing bodies in translational control

On the convergence of Adam and Beyond

Deep convolutional neural networks for large-scale speech tasks

Comparative flavivirus-host protein interaction mapping reveals mechanisms of dengue and Zika virus pathogenesis

Cytoscape: a software environment for integrated models of biomolecular interaction networks

Transfer learning for visual categorization: a survey

Predicting protein-protein interactions based only on sequences information

Papain-like protease regulates SARS-CoV-2 viral spread and innate immunity

Sequence-based prediction of protein protein interaction using a deep-learning algorithm

UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

MultiPLIER: a transfer learning framework for transcriptomics reveals systemic features of rare disease

The Gene Ontology Resource: 20 years and still GOing strong

Viral targeting of the interferon-β-inducing Traf family member-associated NF-κB activator (TANK)-binding kinase-1

Control of TANK-binding kinase of herpes simplex virus 1

Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

Viral organization of human proteins

Prediction of protein-protein interactions from protein sequence using local descriptors

HVIDB: a comprehensive database for human-virus protein-protein interactions. Brief Bioinform published online

Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method

GTRD: a database on gene transcription regulation - 2019 update

Prediction of protein-protein interactions from

A deep learning framework for modeling structural features of RNA-binding protein targets

In order to build the integrated interaction network for topological analysis, we first collected known protein interactions between the human proteins predicted to interact with SARS-CoV-2 from the HIPPIE database (Alanis-Lobato et al., 2017). In the next step, we combined our predicted human-SARS-CoV-2 PPIs with the known human PPIs into the final interaction network. Moreover, we clustered human proteins within the network based solely on their topological connectivity. We applied the MCODE plugin (Bader and Hogue, 2003) to find clusters of densely interconnected human proteins denoting potential functional modules or parts of pathways (include loops: no; degree cutoff: 2; haircut: no; fluff: no; node score: 0.4; kcore: 2; max depth: 100). Visualizations of the modules (i.e., subnetworks) were carried out with Cytoscape (Shannon et al., 2003). Enrichment analysis for each cluster was performed by using hypergeometric tests, where corresponding P-values were Bonferroni corrected, and only the five most enriched GO BP terms and KEGG pathways were considered (adjusted P-value threshold: 0.05) in Figure 6. All

pathways through a hypergeometric test (Bonferroni-corrected P-value threshold: 0.01). We found a relatively large number of shared GO enrichment terms and KEGG enrichment pathways in groups of host proteins that appeared in the experimentally known PPIs and predictions. c In more detail, we observed that enriched GO BP terms in host proteins that were found in the experimental PPIs and predictions were functionally similar.

Figure 6 a Combining predictions that we obtained with the transfer learning approach and known human PPIs, we determined connectivity-based modules that were subjected to functional interpretation. b Human-SARS-CoV-2 PPI network with enriched GO BP terms and KEGG pathways for each topological module. c SARS-CoV-2 targets a module that involves the centrosome, cell cycle and interferon pathway. SARS-CoV-2 interacts with the interferon pathway and shares a conserved region with multiple viral pathogens. A conserved binding motif that nsp13 of SARS-CoV-2 and proteins of various other viral pathogens share suggests that SARS-CoV-2 nsp13 protein may interfere with the regulation processes of IFN that support the antiviral innate immune response. d SARS-CoV-2 targets a module that involves stress granule regulation, RNA processing and protein export, interacts with stress granule proteins and shows potential interaction patterns through a conserved amino-acid motif in the nsp12 of SARS-CoV-2 and nsp3 proteins of other viruses.