key: cord-0704494-r8by7dtj
authors: Dong, Thi Ngan; Brogden, Graham; Gerold, Gisa; Khosla, Megha
title: A multitask transfer learning framework for the prediction of virus-human protein–protein interactions
date: 2021-11-27
journal: BMC Bioinformatics
DOI: 10.1186/s12859-021-04484-y
sha: b6fab315571c9de68f9b49b99eda04e06b36689d
doc_id: 704494
cord_uid: r8by7dtj

BACKGROUND: Viral infections cause significant morbidity and mortality worldwide. Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanism of viral infection and pathogenesis. This could further help in the prevention and treatment of virus-related diseases. However, the task of predicting protein–protein interactions between a new virus and human cells is extremely challenging due to scarce data on virus-human interactions and the fast mutation rates of most viruses. RESULTS: We developed a multitask transfer learning approach that exploits the information of around 24 million protein sequences and the interaction patterns from the human interactome to counter the problem of small training datasets. Instead of using hand-crafted protein features, we utilize statistically rich protein representations learned by a deep language modeling approach from a massive source of protein sequences. In addition, we employ a second objective that aims to maximize the probability of observing human protein–protein interactions. This additional task objective acts as a regularizer and also allows us to incorporate domain knowledge to inform the virus-human protein–protein interaction prediction model. CONCLUSIONS: Our approach achieved competitive results on 13 benchmark datasets and the case study for the SARS-CoV-2 virus receptor. Experimental results show that our proposed model works effectively for both virus-human and bacteria-human protein–protein interaction prediction tasks. We share our code for reproducibility and future research at https://git.l3s.uni-hannover.de/dong/multitask-transfer.

between the virus and its host. These interactions include the initial attachment of virus coat or envelope proteins to host membrane receptors, hijacking of the host translation and intracellular transport machineries resulting in replication, assembly and subsequent release of virus particles [2] [3] [4] . Besides providing mechanistic insights into the biology of infection, knowledge of virus-host interactions can point to essential events needed for virus entry, replication, or spread, which can be potential targets for the prevention, or treatment of virus-induced diseases [5] .

In vitro experiments based on yeast two-hybrid (Y2H), ligand-based capture MS, proximity labeling MS, and protein arrays have identified tens of thousands of virus-human protein interactions [6] [7] [8] [9] [10] [11] [12] [13] [14]. These interaction data are deposited in publicly available databases including IntAct [15], VirusMentha [16], VirusMINT [17], HPIDB [18], and others. However, experimental approaches to unravel PPIs are limited by several factors, including the cost and time required, the generation, cultivation and purification of appropriate virus strains, the availability of recombinantly expressed proteins, the generation of knock-in or overexpression cell lines, and the availability of antibodies and cellular model systems. Computational approaches can assist in vitro experimentation by providing a list of the most probable interactions, which actual biological experimentation techniques can falsify or verify.

In this work, we cast the problem of predicting virus-human protein interactions as a binary classification problem and focus specifically on emerging viruses that have limited experimentally verified interaction data.

Limited interaction data. One of the main challenges in tackling the current task as a learning problem is the limited training data. Towards predicting virus-host PPI, some known interactions of other human viruses collected from wet-lab experiments are employed as training data. The number of known PPIs is usually too small and thus, not representative enough to ensure the generalizability of trained models. In effect, the trained models might overfit the training data and would give inaccurate predictions for any given new virus.

Difference to other pathogens. A natural strategy to overcome the limitation posed by scarce virus protein interaction data is to employ transfer learning from available intra-species PPI or PPI data for other types of pathogens. This may, in its simplest fashion, not be a viable strategy as virus proteins can differ substantially from human or bacterial proteins. Typically, they are highly structurally and functionally dynamic. Virus proteins often have multiple independent functions so that they cannot be easily detected by common sequence-structure comparison [19] [20] [21] . Besides, virus protein sequences of different species are highly diverse [22] . Consequently, models trained for intra-species human PPI [23] [24] [25] [26] [27] or for other pathogen-human PPI [28] [29] [30] [31] [32] [33] cannot be directly used to predict virus-human protein interactions.

Limited information on structure and function of virus proteins. While for human proteins, researchers can retrieve information from many publicly available databases to extract features related to their function, semantic annotation, domains, structure, pathway association, and intercellular localization, such information is not readily available for most virus proteins. Protein crystal structures are available for some virus proteins. However, for many, predictive structures based on the amino acid sequence must be used. Thus, for the majority of virus proteins, currently, the only reliable source of virus protein information is its amino acid sequence. Learning effective representations of the virus proteins, therefore, is an important step towards building prediction models. Heuristics such as K-mer amino acid composition are bound to fail as it is known that virus proteins with completely different sequences might show similar interaction patterns.

In this work, we develop a machine learning model which overcomes the above limitations in two main steps, which are described below.

Transfer Learning via protein sequence representations. Though the training data on interactions as well as the input information on protein features are limited, a large number of unannotated protein sequences are available in public databases like UniProt. Inspired by advancements in Natural Language Processing, Alley et al. [34] trained a deep learning model on more than 24 million protein sequences to extract statistically meaningful representations. These representations have been shown to advance the state-of-the-art in protein structure and function prediction tasks. Rather than using hand-crafted protein sequence features, we use the pre-trained model by [34] (referred to as UniRep) to extract protein representations. The idea here is to exploit transfer learning from several million sequences to our scant training data.

Incorporating domain information. We further fine-tune UniRep's globally trained protein representations using a simple neural network whose parameters are learned using a multitask objective. In particular, besides the main task, our model is additionally regularized by another objective, namely predicting interactions among human proteins. The additional objective allows us to encode (human) protein similarities dictated by their interaction patterns. The rationale behind encoding such knowledge in the learnt representation is that the human proteins sharing similar biological properties and functions would also exhibit similar interacting patterns with viral proteins. Using a simpler model and an additional side task helps us overcome overfitting, which is usually associated with models trained with small amounts of training data.

We refer to our model as MultiTask Transfer (MTT); it is further illustrated in the "Method" section. To sum up, we make the following contributions.

• We propose a new model that employs a transfer learning-based approach to first obtain statistically rich protein representations and then further refine them using a multitask objective.
• We evaluated our approach on several benchmark datasets of different types for virus-human and bacteria-human protein interaction prediction. Our experimental results (cf. "Result analysis" section) show that MTT outperforms several baselines even on datasets with rich feature information.
• Experimental results on the SARS-CoV-2 virus receptor show that our model can help researchers to reduce the search space for yet unknown virus receptors effectively.
• We release our code for reproducibility and further development at https://git.l3s.uni-hannover.de/dong/multitask-transfer.

Existing work mainly casts the PPI prediction task as a supervised machine learning problem. However, information about non-interacting protein pairs is usually not available in public databases. Researchers can therefore either adapt models to learn from positive samples only or employ a negative sampling strategy to generate negative examples for the training data. Since the quality and quantity of the generated negative samples significantly affect the outcome of the learned models, the authors of [31, 35, 36] proposed models that learn only from the available known positive interactions. Nourani et al. [36] and Li et al. [31] treated the virus-human PPI problem as a matrix completion problem in which the goal was to predict the missing entries in the interaction matrix. Nouretdinov et al. [35] used a conformal method to calculate p-values (confidence levels) for the hypothesis that two proteins interact, based on similarity measures between proteins. Another line of work, which casts the problem as a binary classification task, focused on proposing new negative sampling techniques. For instance, Eid et al. [22] proposed Denovo, a negative sampling technique based on virus sequence dissimilarity. Mei et al. [37] proposed a negative sampling technique based on one-class SVM. Basit et al. [33] offered a modification to the Denovo technique by assigning sample weights to negative examples inversely proportional to their similarity to known positive examples during training.

Dick et al. [30] utilize the interaction patterns from intra-species PPI networks to predict the inter-species PPIs between the HIV-1 virus and human. Though the results are promising, this approach cannot be directly applied to completely new viruses for which information about closely related species is not available, or to viruses whose intra-species PPI information is not available.

The works presented in [38] [39] [40] [41] [42] [43] [44] employed different feature extraction strategies to represent a virus-human protein pair as a fixed-length vector of features extracted from their protein sequences. Instead of hard-coding sequence features, Yang et al. [45] and Lanchantin et al. [46] proposed embedding models to learn the virus and human proteins' feature representations from their sequences. However, their training data was limited to around 500,000 protein sequences. Though not very common, other types of information were also used in some proposed models besides sequence-based features. Those include protein functional information (or GO annotation) as in [47], protein domain-domain association information as in [48], protein structure information as in [32, 49], and the disease phenotype of clinical symptoms as in [47]. One limitation of these approaches is that they cannot be generalized to novel viruses for which such information is not available.

Among the network-based approaches, Liu et al. and Wang et al. [50, 51] constructed heterogeneous networks to compute virus and human proteins features. Nodes of the same type were connected by either weighted edges based on their sequence similarity or a combination of sequence similarity and Gaussian Interaction Profile kernel similarity. Deng et al. [43] proposed a deep-learning-based model with a complex architecture of convolutional and LSTM layers to learn the hidden representation of virus and human proteins from their input sequence features along with the classification problem. Despite the promising performance, those studies still have the limitation posed by hand-crafted protein features.

We first provide a formal problem statement.

Problem statement. We are given protein sequences corresponding to infectious viruses and their known interactions with human proteins. Given a completely new (novel) virus, its set of protein(s) V along with its (their) sequence(s), we are interested in predicting potential interactions between V and the human proteins.

We cast the above problem as that of binary classification. The positive samples consist of pairs of virus and human proteins whose interaction has been verified experimentally. All other pairs are considered to be non-interacting and constitute the negative samples. In "Data description and experimental set up" section, we add details on positive and negative samples corresponding to each dataset.

Summary of the approach. The schematic diagram of our proposed model is presented in Fig. 1. As shown in the diagram, the input to the model is the raw human and virus protein sequences, which are passed through the UniRep model to extract low-dimensional vector representations of the corresponding proteins. The extracted embeddings are then passed as initialization values for the embedding layers. These representations are further fine-tuned using the Multilayer Perceptron (MLP) modules (shown in blue). The fine-tuning is performed while learning to predict an interaction between two human proteins (between proteins A and B in the figure) as well as the interaction between human and virus proteins (between proteins B and C). In the following, we describe in detail the main components of our approach.

Fig. 1 Our proposed MTT model for the virus-human PPI prediction problem. The UniRep embeddings are used to initialize our embedding layers, which will be further fine-tuned by the two PPI prediction tasks. Sharing the representation for human proteins further enables us to transfer the knowledge learned from the human PPI network to inform our virus-host PPI prediction task.

Significance of using protein sequence as input. We note that the protein sequence determines the protein's structural conformation (fold), which further determines its function and its interaction pattern with other proteins. However, the underlying mechanism of the sequence-to-structure matching process is very complex and cannot be easily specified by hand-crafted rules. Therefore, rather than using handcrafted features extracted from amino acid sequences, we employ the pre-trained UniRep model [34] to generate latent representations or protein embeddings. The protein representations extracted from UniRep model are empirically shown to preserve fundamental properties of the proteins and are hypothesized to be statistically more robust and generalizable than hand-crafted sequence features.

UniRep for extracting sequence representations. In particular, UniRep consists of an embedding layer that serves as a lookup table for each amino acid representation. Each amino acid is represented as an embedding vector of 10 dimensions. Each input protein sequence of length N is thus represented as a two-dimensional matrix of size N × 10. That matrix is then fed as input to a Multiplicative Long Short-Term Memory (mLSTM) network of 1900 units. The 1900 dimension was selected experimentally from a pool of architectures that require different numbers of parameters as described in [52], namely, a 1900-dimensional single-layer multiplicative LSTM (∼18.2 million parameters), a 4-layer stacked mLSTM of 256 dimensions per layer (∼1.8 million parameters), and a 4-layer stacked mLSTM with 64 dimensions per layer (∼0.15 million parameters). The output from the mLSTM is a 1900-dimensional embedding vector that serves as the pre-trained protein embedding for the input protein sequence. We use the calculated pre-trained virus and human protein embeddings to initialize our embedding layers. The two supervised PPI prediction tasks further fine-tune those embeddings during training.

We further fine-tune these representations by training two simple neural networks (single-layer MLPs with ReLU activation) using the objective of predicting human PPI in addition to the main task. More precisely, the UniRep representations are passed through one-hidden-layer MLPs with ReLU activations to extract the latent representations. Let X denote the embedding lookup matrix, whose ith row corresponds to the embedding vector of node i. The final output from the MLP layers for an input v is then given by hid(v) = MLP(X(v)). To predict the likelihood of interaction between a pair (v1, v2), we first perform an element-wise product of the corresponding hidden vectors (the outputs of the MLPs) and pass it through a linear layer followed by a sigmoid activation. In the following, we provide a detailed description of our multitask objective.
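The scoring step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the released implementation: the helper names (`init_linear`, `hid`, `interaction_score`), the toy dimensions, the random initialization, and the bias-free output layer are all assumptions made for the example.

```python
import math
import random


def relu(x):
    return [max(0.0, v) for v in x]


def linear(x, weight, bias):
    # weight is a list of rows with shape (out_dim, in_dim)
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def init_linear(in_dim, out_dim, rng):
    # Hypothetical initialization; the paper does not specify one here.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def hid(embedding, mlp):
    # Single hidden layer with ReLU: hid(v) = MLP(X(v))
    weight, bias = mlp
    return relu(linear(embedding, weight, bias))


def interaction_score(emb_a, emb_b, mlp_a, mlp_b, w_out):
    # Element-wise product of the hidden vectors, then linear layer + sigmoid.
    product = [a * b for a, b in zip(hid(emb_a, mlp_a), hid(emb_b, mlp_b))]
    logit = sum(w * p for w, p in zip(w_out, product))
    return sigmoid(logit)


rng = random.Random(0)
EMB_DIM, HID_DIM = 8, 4  # toy sizes; the UniRep embeddings are 1900-dimensional
virus_mlp = init_linear(EMB_DIM, HID_DIM, rng)
human_mlp = init_linear(EMB_DIM, HID_DIM, rng)
w1 = [rng.uniform(-1.0, 1.0) for _ in range(HID_DIM)]
virus_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
human_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
score = interaction_score(virus_emb, human_emb, virus_mlp, human_mlp, w1)
```

Because the two hidden vectors are combined with an element-wise product, the score does not depend on which protein is passed first when both sides share the same MLP, which is the case for the human-human task.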

Let $\Theta$ and $\Phi$ denote the sets of learnable parameters corresponding to the fine-tuning components (shown in Fig. 1 in green and blue boxes), i.e., the Multilayer Perceptrons (MLPs) corresponding to the virus and human proteins, respectively. Let $W_1$ and $W_2$ denote the two learnable weight matrices (parameters) of the linear layers (depicted in gray boxes in the figure). We use $VH$ and $HH$ to denote the training sets of virus-human and human-human PPIs, respectively. We use the binary cross-entropy loss for the virus-human PPI predictions, as given below:

$$\mathcal{L}_{vh} = -\sum_{(v,h) \in VH} \left[ z_{vh} \log y_{vh} + (1 - z_{vh}) \log (1 - y_{vh}) \right] \tag{1}$$

where $z_{vh}$ is the binary target variable and $y_{vh}$ is the predicted likelihood of observing a virus-human protein interaction, i.e.,

$$y_{vh} = \sigma\left( \left( hid_{\Theta}(v) \odot hid_{\Phi}(h) \right) W_1 \right) \tag{2}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid activation and $\odot$ denotes the element-wise product.

For human PPI, we predict the confidence score of observing an interaction between two human proteins. More specifically, we directly predict $z_{hh'}$, the normalized confidence score for the interaction between two human proteins as collected from the STRING [53] database. Predicting the normalized confidence scores helps us overcome the issues with defining negative interactions. We use the mean squared error to compute the loss for the human PPI prediction task:

$$\mathcal{L}_{hh} = \frac{1}{N} \sum_{(h,h') \in HH} \left( y_{hh'} - z_{hh'} \right)^2 \tag{3}$$

where $y_{hh'}$ is computed similarly to (2) for human proteins and $N$ is the number of $(h, h')$ pairs.

We use a linear combination of the two loss functions to train our model:

$$\mathcal{L} = \mathcal{L}_{vh} + \alpha \, \mathcal{L}_{hh} \tag{4}$$

where $\alpha$ is the human PPI weight factor.
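To make the combined objective concrete, the following sketch computes a binary cross-entropy term for the virus-human pairs, a mean squared error term for the human-human confidence scores, and their linear combination weighted by α. The function names, the clipping constant `eps`, and the choice to sum (rather than average) the cross-entropy term are assumptions made for illustration; the toy targets and predictions are made up.

```python
import math


def bce_loss(targets, preds, eps=1e-12):
    # Binary cross-entropy summed over virus-human pairs (summing is an assumption).
    total = 0.0
    for z, y in zip(targets, preds):
        y = min(max(y, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= z * math.log(y) + (1 - z) * math.log(1.0 - y)
    return total


def mse_loss(targets, preds):
    # Mean squared error over the N human-human pairs.
    return sum((y - z) ** 2 for z, y in zip(targets, preds)) / len(targets)


def multitask_loss(vh_targets, vh_preds, hh_targets, hh_preds, alpha):
    # Linear combination: virus-human loss plus alpha times the human PPI loss.
    return bce_loss(vh_targets, vh_preds) + alpha * mse_loss(hh_targets, hh_preds)


# Made-up toy values, for illustration only.
vh_targets, vh_preds = [1, 0], [0.8, 0.3]
hh_targets, hh_preds = [0.9, 0.2], [0.7, 0.4]
loss = multitask_loss(vh_targets, vh_preds, hh_targets, hh_preds, alpha=0.5)
```

Setting `alpha=0` recovers a single-task objective on virus-human pairs only, while larger values give the human PPI side task more influence on the shared human protein representation.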

We commence by describing the 13 datasets used in this work to evaluate our approach.

The Novel H1N1 and Novel Ebola datasets. We retrieve the curated or experimentally verified PPIs between virus and human from four databases: APID [54], IntAct [15], VirusMentha [16], and UniProt [55] using the PSICQUIC web service [56]. In total, there are 11,491 known PPIs between 246 viruses and humans. From this source of data, we generate new training and testing data for two viruses: the human H1N1 Influenza virus and the Ebola virus. We name the two datasets Novel H1N1 and Novel Ebola according to the virus present in the testing set. The positive training data for the Novel H1N1 dataset includes PPIs between human and all viruses except H1N1. Similarly, the positive training data for the Novel Ebola dataset includes PPIs between human and all viruses except Ebola. The positive testing data for the human-H1N1 dataset contains PPIs between human and 11 H1N1 virus proteins. Likewise, the positive testing data for the human-Ebola dataset contains PPIs between human and three of the eight Ebola virus proteins (VP24, VP35, and VP40).

Negative sampling techniques such as the dissimilarity-based method [22] and the exclusive co-localization method [57, 58] are usually biased as they restrict the number of tested human proteins. They are also unrealistic for a new virus because information about such a restricted human protein set, generated from filtering criteria based on the positive instances, is typically unavailable. For those reasons, we argue that random negative sampling is the most appropriate, unbiased approach to generate negative training and testing samples. Since the exact positive:negative ratio is unknown, we conducted experiments with different negative sample rates. In our new virus-human PPI experiments, we try four negative sample rates: [1, 2, 5, 10]. In addition, to reduce the bias of negative samples, the negative sampling in the training and testing set is repeated ten times. In the end, for each dataset, we test each method with 4 × 4 × 10 = 160 different combinations of negative training and negative testing sets (with fixed positive training and test samples). The statistics for our new testing datasets are given in Table 1.
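As an illustration of random negative sampling and of how the 160 settings arise, consider the sketch below. The function `sample_negatives`, the toy protein identifiers, and the reading of 4 × 4 × 10 as (training rate) × (testing rate) × (repetition) are assumptions made for the example.

```python
import random
from itertools import product


def sample_negatives(virus_proteins, human_proteins, positives, rate, rng):
    """Draw rate * len(positives) random pairs that are not known positives."""
    positive_set = set(positives)
    candidates = [
        pair for pair in product(virus_proteins, human_proteins)
        if pair not in positive_set
    ]
    n_negatives = min(rate * len(positive_set), len(candidates))
    return rng.sample(candidates, n_negatives)


# Hypothetical toy identifiers, not taken from any dataset in the paper.
virus_proteins = ["v1", "v2"]
human_proteins = ["h1", "h2", "h3", "h4", "h5"]
positives = [("v1", "h1"), ("v2", "h3")]

negatives = sample_negatives(
    virus_proteins, human_proteins, positives, rate=2, rng=random.Random(0)
)

NEGATIVE_RATES = [1, 2, 5, 10]
N_REPEATS = 10
# One setting per (training rate, testing rate, repetition) triple.
settings = [
    (train_rate, test_rate, repeat)
    for train_rate in NEGATIVE_RATES
    for test_rate in NEGATIVE_RATES
    for repeat in range(N_REPEATS)
]
```

Using a separate seeded random generator per repetition keeps each of the negative sets reproducible while the positive pairs stay fixed.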

The DeepViral [47] Leave-One-Species-Out (LOSO) benchmark datasets. The data was retrieved from the HPIDB database [18] to include all Pathogen-Host interactions that have confidence scores available and are associated with an existing virus family in the NCBI taxonomy [59] . After filtering, the dataset includes 24,678 positive interactions and 1,066 virus proteins from 14 virus families. We follow the same procedure as mentioned in [47] to generate the training and testing data corresponding to four virus species with taxon IDs: 644788 (Influenza A), 333761 (HPV 18), 2697049 (SARS-CoV-2), 2043570 (Zika virus). From now on, we will use the NCBI taxon ID of the virus species in the testing set as the dataset name. For each dataset, the positive testing data consists of all known interactions between the test virus and the human proteins. The negative testing data consists of all possible combinations of virus and 16,627 human proteins in Uniprot (with a length limit of 1000 amino acids) that do not appear in the positive testing set. Similarly, the positive training data consists of all known interactions between human protein and any virus protein, except for the one which is in the testing set. The negative training data is generated randomly with the positive:negative rate of 1:10 from the pool of all possible combinations of virus and 16,627 human proteins that do not appear in the positive training set. Statistics of the datasets are presented in Table 1 . Though performing a search on the set of 16,627 human proteins might not be a fruitful realistic strategy, we still keep the same training and testing data as released in the DeepViral study in our experiments to have a direct and fair comparison with the DeepViral method.
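The leave-one-species-out construction described above can be sketched as follows. The record layout `(taxon_id, virus_protein, human_protein)`, the helper names, and the toy protein identifiers are hypothetical; only the taxon IDs are taken from the text. As a simplification, the exhaustive negative set is built only for test virus proteins that have at least one known interaction.

```python
def leave_one_species_out(interactions, test_taxon):
    """Split positive pairs so that one virus species is held out for testing."""
    train = [(v, h) for taxon, v, h in interactions if taxon != test_taxon]
    test = [(v, h) for taxon, v, h in interactions if taxon == test_taxon]
    return train, test


def exhaustive_negatives(test_positives, human_proteins):
    """All (virus protein, human protein) pairs that are not known positives."""
    positive_set = set(test_positives)
    test_virus_proteins = sorted({v for v, _ in test_positives})
    return [
        (v, h)
        for v in test_virus_proteins
        for h in human_proteins
        if (v, h) not in positive_set
    ]


# Hypothetical toy records: (NCBI taxon ID, virus protein, human protein).
interactions = [
    ("644788", "flu_p1", "h1"),
    ("644788", "flu_p2", "h2"),
    ("333761", "hpv_p1", "h1"),
    ("2043570", "zika_p1", "h3"),
]
human_proteins = ["h1", "h2", "h3"]

train_pos, test_pos = leave_one_species_out(interactions, "644788")
test_neg = exhaustive_negatives(test_pos, human_proteins)
```

With the full pool of 16,627 human proteins, the exhaustive negative test set becomes very large relative to the positives, which is why the resulting test sets are highly imbalanced.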

The two datasets released by Zhou et al. [41] are widely used by recent papers to evaluate state-of-the-art models on new virus-human PPI prediction tasks. We refer to them as Zhou's H1N1 and Zhou's Ebola where each dataset was named after the viruses in the testing sets. Zhou's H1N1 and Zhou's Ebola share similar positive training and testing samples with the Novel H1N1 and Novel Ebola datasets. However, they differ in the negative training and testing samples sets. While the negative samples in Novel H1N1 and Novel Ebola were generated randomly from the pool of all possible pairs, the negative training/testing samples in Zhou's H1N1 and Zhou's Ebola were generated based on the protein sequence dissimilarity score. Therefore, Zhou's H1N1 and Zhou's Ebola have the limitations as mentioned in "The realistic host cell-virus testing datasets" section and are not ideal for evaluating the new virus-human PPI prediction task. The data statistics for these two datasets are shown in Table 1 .

The dataset with protein motif information (Denovo SLiM [22]). For the Denovo SLiM dataset, virus-human PPIs were collected from the VirusMentha database [16]. The presence of Short Linear Motifs (SLiMs) in virus sequences was used as a criterion for data filtering. SLiMs are short, recurring patterns of protein sequences that are believed to mediate protein-protein interaction [60, 61]. Therefore, sequence motifs can be a rich feature set for virus-human PPI prediction tasks. The test set [22] contained 425 positive and 425 negative PPIs (Supplementary file S12 used in DeNovo's study ST6). The training data consisted of the remaining PPI records and comprised 1590 positive and 1515 negative records for which the virus SLiM sequence is known, and 3430 positive and 3219 negative records without virus SLiM sequence information. The Denovo_slim negative samples were also generated using the Denovo negative sampling strategy (based on sequence dissimilarity).

The Barman's dataset [48] with protein domain information. The dataset was retrieved from the VirusMINT database [17]. Interacting protein pairs that did not have any "InterPro" domain hit were removed. In the end, the dataset contained 1035 positive and 1035 negative interactions between 160 virus proteins of 65 types and 667 human proteins. 5-fold cross-validation was then employed to test each method's performance.
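For readers unfamiliar with the protocol, a minimal sketch of 5-fold cross-validation index splitting is given below. The generator name `k_fold_indices`, the shuffling seed, and the unstratified split are assumptions; the sample count simply adds the 1035 positive and 1035 negative pairs mentioned above.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


N_SAMPLES = 1035 + 1035  # positive plus negative pairs
splits = list(k_fold_indices(N_SAMPLES, k=5))
```

Each sample appears in exactly one test fold, so averaging a metric over the five folds uses every pair once for evaluation.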

We evaluate our method on three datasets for three human pathogenic bacteria: Bacillus anthracis (B1), Yersinia pestis (B2), and Francisella tularensis (B3), which were shared by Eid et al. [22].

The data was first collected from HPIDB [18] . B1 belongs to a bacterial phylum different from that of B2 and B3, while B2 and B3 share the same class but differ in their taxonomic order. B1 has 3057 PPIs, B2 has 4020, and B3 has 1346 known PPIs. A sequence-dissimilarity-based negative sampling method was employed to generate negative samples. For each bacteria protein, ten negative samples were generated randomly. Each of the bacteria was then set aside for testing, while the interactions from the other two bacteria were used for training. For simplicity, we use the name of the bacteria in the testing set as the name of the dataset. The statistics for those three datasets are presented in Table 2 .

We compare our method with the following seven baseline methods and two simpler variants of our model.

• Generalized [41]: It is a generalized SVM model trained on hand-crafted features extracted from protein sequences for the novel virus-human PPI task. Each virus-human pair is represented as a vector of 1175 dimensions extracted from the two protein sequences.
• Hybrid [43]: It is a complex deep model with convolutional and LSTM layers for extracting latent representations of virus and human proteins from their input sequence features and is trained using L1-regularized logistic regression.
• doc2vec [45]: It employs the doc2vec [62] approach to generate protein embeddings from the corpus of protein sequences. A random forest model is then trained for the PPI prediction.
• MotifTransformer [46]: It is a transformer-based deep neural network that pretrains protein sequence representations using unsupervised language modeling tasks and supervised protein structure and function prediction tasks. These representations are used as input to an order-independent classifier for the PPI prediction task.
• DeNovo [22]: This model trained an SVM classifier on a hand-crafted feature set extracted from the K-mer amino acid composition information using a novel negative sampling strategy. Each protein pair is represented as a vector of 686 dimensions.
• DeepViral [47]: It is a deep learning-based method that combines information from various sources, namely, the disease phenotypes, virus taxonomic tree, protein GO annotation, and protein sequences for intra- and inter-species PPI prediction.
• Barman [48]: It used an SVM model trained on a feature set consisting of the protein domain-domain association and the methionine, serine, and valine amino acid composition of viral proteins.
• Two simpler variants of MTT: As an ablation study, we evaluate two simpler variants: (i) SingleTask Transfer (STT), which is trained on the single objective of predicting pathogen-human PPI (STT is essentially MTT without the human PPI prediction side task), and (ii) Naive Baseline, which is a logistic regression model using concatenated human and pathogen protein UniRep representations as input.

We use PyTorch [63] for our implementation. For the Doc2vec model, we use the released code shared by the authors with the given parameters. For the Generalized and Denovo models, we re-implement the methods in Python using all the parameters and feature sets as described in the original papers. For Barman and DeepViral, the results are taken from the original papers or calculated from the given model prediction scores.

For all benchmark datasets except the case study, we report five metrics: the area under the Receiver Operating Characteristic curve (AUC), the area under the precision-recall curve (AP), and the Precision, Recall, and F1 scores.
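For reference, these five metrics can be computed from scratch as in the sketch below. The 0.5 decision threshold, the function names, and the toy labels and scores are assumptions; the AUC is computed as the fraction of correctly ranked positive-negative pairs, and the AP routine does not apply any special handling of tied scores.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    # Probability that a random positive is scored above a random negative.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    # Mean of the precision values at the rank of each positive sample.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Made-up labels and scores, for illustration only.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
```

AUC and AP are threshold-free and depend only on the ranking of the scores, whereas Precision, Recall, and F1 change with the chosen decision threshold.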

For the case study, we report the topK score with K from 1 to 10. TopK is equal to 1 if the human receptor for SARS-CoV-2 virus appears in the top K proteins that have the highest scores predicted by the model and 0 otherwise.
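The topK score defined above amounts to a simple ranking check, sketched here. The function name `top_k_score` and the dictionary layout are assumptions, and the scores are made-up toy values rather than model outputs.

```python
def top_k_score(scores, receptor, k):
    """Return 1 if `receptor` is among the k highest-scoring proteins, else 0."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1 if receptor in ranked[:k] else 0


# Made-up scores for hypothetical candidate proteins, for illustration only.
toy_scores = {"protein_a": 0.91, "ACE2": 0.74, "protein_b": 0.35, "protein_c": 0.12}

top1 = top_k_score(toy_scores, "ACE2", k=1)
top2 = top_k_score(toy_scores, "ACE2", k=2)
```

By construction the score is non-decreasing in K: once the receptor enters the top K list, it stays there for every larger K.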

In the following four subsections, we provide a detailed comparison of MTT with (i) methods employing hand-crafted input features, (ii) sequence embedding-based methods, (iii) an approach that uses protein domain information, (iv) simpler variants of MTT as ablation studies respectively. All statistical test results present in this section are those from the pair-wise t-test [64] on the F1 scores attained from multiple runs on the same dataset.
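As a reminder of what the paired test measures, the sketch below computes the paired t statistic from two lists of F1 scores obtained on the same runs. The function name and the toy numbers are assumptions; turning the statistic into a p-value additionally requires the Student t distribution (for example via `scipy.stats.ttest_rel`, which performs the whole test).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # stdev uses the sample (n - 1) estimator; it must be non-zero here.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Made-up scores of two methods over three matched runs, for illustration only.
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

The test works on per-run differences, so it is only meaningful when both methods are evaluated on the same negative samples in each run.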

Generalized [41] and Denovo [22] are the two traditional methods relying on hand-crafted features extracted from the protein sequences. The numbers of hand-crafted features employed by Denovo and Generalized are 686 and 1175, respectively. They both employ SVM for the classification task. Since SVM training scales quadratically with the number of data points, Denovo and Generalized are not scalable to larger datasets. Figure 2 presents their comparison with MTT on small testing datasets. Detailed scores are given in Table 6 in the Appendix. Results from the two-tailed t-test [65, 66] support that MTT significantly outperforms Denovo on all benchmarked datasets with a confidence level of at least 95%. Compared with Generalized, MTT has higher performance on six out of seven datasets (all except Denovo_slim). The difference is most significant on the Barman, Zhou's H1N1, and Zhou's Ebola datasets. On the Denovo_slim dataset, MTT's F1 score is lower than that of Generalized and only 2% higher than that of Denovo. This is expected since Denovo_slim is a specialized dataset favoring methods using local sequence motif features, which are exploited by Denovo and Generalized.

Hybrid is a recently proposed deep learning-based method. Nevertheless, its input features are still manually extracted from the protein sequence. Since the code is not publicly available, we only have the AUC score corresponding to the Zhou's H1N1 dataset, which is taken from the original paper and listed in Table 6. Compared with Hybrid, MTT has a higher AUC score. Though a comparison of the AUC on one dataset does not bring much insight, we include this method here for completeness.

Doc2vec and MotifTransformer are state-of-the-art methods based on sequence embeddings or representations. Doc2vec utilizes the embeddings learned from the extracted k-mer features, while MTT and MotifTransformer employ the embeddings directly learned from the amino acid sequences. In addition, MTT is a multitask-based approach that incorporates additional information on human protein-protein interaction into the learning process. Figure 3 shows a comparison of the F1 scores of MTT and Doc2vec over all benchmarked datasets. Detailed scores are presented in Table 7 in the Appendix. Since the code for the MotifTransformer model is not publicly available, we only have the corresponding results for the Zhou's H1N1 and Zhou's Ebola datasets, which are taken from the original paper. '-' denotes that the score is not available. Compared with MotifTransformer, MTT has a slightly worse F1 score on the Zhou's H1N1 dataset and a significantly better F1 score on the Zhou's Ebola dataset.

Comparison with Doc2vec. MTT outperforms Doc2vec on 5 out of 9 benchmark datasets, and the performance gap is statistically significant with a p-value smaller than 0.05. MTT is significantly better than Doc2vec on the Novel Ebola dataset, while the reverse holds on the Novel H1N1 dataset. Doc2vec outperforms MTT on three testing datasets whose negative samples were drawn with a sequence dissimilarity method. We note that these datasets might be biased: in a realistic testing scenario, the set of human proteins that interact with the virus is unknown, so such dissimilarity-based negative sampling is infeasible.

The Barman feature set is constructed from domain-domain associations and hand-crafted features extracted from the protein sequences. Since protein domain information is not available for all viral proteins, the Barman method has restricted applicability. A comparison between Barman and MTT is presented in Table 3. Due to data and code availability, we only have results for the Barman model on one dataset. From the reported results, we can clearly see that MTT outperforms its competitor by a large margin in all available metrics.

DeepViral exploits disease phenotypes, viral taxonomies, and proteins' GO annotations to enrich its protein embeddings. Table 4 presents a comparison between MTT and DeepViral on the four datasets released by DeepViral's authors. The reported results on each dataset are averages over five experimental runs for DeepViral and ten experimental runs for MTT. We observe that MTT and STT significantly outperform DeepViral in terms of the averaged F1 score. The gain is larger on the smaller datasets (644788 and 333761).

We compare our method with two of its simpler variants: the STT and the Naive baseline models. STT is the MTT model without the human PPI prediction task. Naive baseline concatenates the learned embeddings of the virus and human proteins to form the input to a Logistic Regression model. Figure 4 presents a comparison between the F1 scores of MTT and its variants on our benchmarked datasets. Table 8 shows all reported scores over all datasets. MTT is significantly better than STT on five out of nine benchmarked datasets and on the four DeepViral datasets, with a p-value smaller than 0.05, while on the remaining four datasets the difference is not statistically significant. This confirms that the patterns learned from the human PPI network bring additional benefits to the virus-human PPI prediction task. Compared with Naive baseline, MTT wins on eight out of nine benchmarked datasets and on the four DeepViral datasets. On the remaining dataset (Novel H1N1), the difference is not statistically significant. STT significantly outperforms Naive baseline on eight out of nine datasets. This supports the effectiveness of our chosen architecture.

Table 3 Comparison between MTT and Barman, a method that relies on protein domain information. Due to data and code availability issues, we only have results for the Barman method on Barman's dataset, which are taken from the original paper. '−' indicates that the result is not available.

Fig. 4 Ablation study on benchmarked datasets. Compared with STT, MTT is statistically better on five datasets, while on the remaining four (Novel H1N1, Denovo_slim, Yersina, and Franci), the difference is not statistically significant. MTT is statistically better than Naive baseline on eight out of nine datasets, while on the remaining dataset (Novel Ebola), the difference is not statistically significant.
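The Naive baseline described above (concatenated embeddings fed to a logistic regression) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional "embeddings", the labeling rule, and the plain gradient-descent training loop are all assumptions chosen so that the example runs with the Python standard library only.

```python
import math
import random

def concat_pair(virus_emb, human_emb):
    """Concatenate a virus and a human protein embedding into one feature vector."""
    return list(virus_emb) + list(human_emb)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    """Predicted interaction probability for one concatenated feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logreg(features, labels, lr=0.5, epochs=300):
    """Fit logistic regression by full-batch gradient descent on the log loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: random 2-d "embeddings" and an artificial,
# linearly separable labeling rule (not real interaction data).
rng = random.Random(0)
pairs, labels = [], []
for _ in range(60):
    v = [rng.uniform(-1, 1) for _ in range(2)]
    h = [rng.uniform(-1, 1) for _ in range(2)]
    pairs.append(concat_pair(v, h))
    labels.append(1 if v[0] + h[0] > 0 else 0)

w, b = train_logreg(pairs, labels)
accuracy = sum(
    (predict_proba(w, b, x) > 0.5) == bool(y) for x, y in zip(pairs, labels)
) / len(pairs)
```

Because the classifier is linear in the concatenated vector, it cannot model multiplicative dependencies between the two proteins, which is one plausible reason such a baseline is weaker than a learned interaction architecture.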

The virus binding to cells or the interaction between viral attachment proteins and host cell receptors is the first and decisive step in the virus replication cycle. Identifying the host receptor(s) for a particular virus is often fundamental in unveiling the virus pathogenesis and its species tropism.

Here we present a case study on detecting the human protein binding partners of SARS-CoV-2. Our virus-human PPI dataset is retrieved from the IntAct Molecular Interaction database [15] (latest update: 07.05.2021). We retrieve the protein sequences from UniProt [55]. In the next section, we describe the construction of the training and testing datasets used to predict SARS-CoV-2 binding partners.

The statistics for our SARS-CoV-2 binding prediction dataset are presented in Table 5. We construct the corresponding datasets as follows.

Training set. As positive interaction samples, we include in the training data only direct interactions between human proteins and any virus except SARS-CoV and SARS-CoV-2. A direct interaction requires two proteins to bind directly to each other, i.e. without an additional bridging protein. Moreover, the interacting human protein should be on the cell surface. Without loss of generality, we perform our search for the binding receptor on the set of all human proteins that have a known direct interaction with any virus and localize to the cell surface. Our surface human protein list consists of all reviewed UniProt proteins that meet at least one of the following criteria: (i) the protein appears in the human surfaceome [67] list, or (ii) it has at least one of the following GO annotations [68, 69]: {CC-plasma membrane, CC-cell junction}.
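The surface-protein criteria above amount to a simple set-membership filter. The sketch below is illustrative only: the record layout (`accession`, `reviewed`, `go_cc`) and the placeholder accessions are hypothetical and do not correspond to an actual UniProt export format.

```python
# GO cellular-component (CC) terms named in criterion (ii).
SURFACE_GO_TERMS = {"plasma membrane", "cell junction"}

def is_surface_protein(protein, surfaceome_ids):
    """Return True for a reviewed protein meeting criterion (i) or (ii).

    `protein` is a dict with hypothetical keys:
      accession : protein identifier
      reviewed  : True if the entry is a reviewed UniProt entry
      go_cc     : iterable of GO cellular-component term names
    `surfaceome_ids` is the set of accessions in the surfaceome list.
    """
    if not protein["reviewed"]:
        return False
    in_surfaceome = protein["accession"] in surfaceome_ids           # criterion (i)
    has_surface_go = bool(SURFACE_GO_TERMS & set(protein["go_cc"]))  # criterion (ii)
    return in_surfaceome or has_surface_go

# Hypothetical toy records (placeholder accessions, not real entries).
surfaceome = {"HUMAN_A"}
proteins = [
    {"accession": "HUMAN_A", "reviewed": True, "go_cc": ["cytosol"]},
    {"accession": "HUMAN_B", "reviewed": True, "go_cc": ["plasma membrane"]},
    {"accession": "HUMAN_C", "reviewed": True, "go_cc": ["nucleus"]},
    {"accession": "HUMAN_D", "reviewed": False, "go_cc": ["cell junction"]},
]
surface_list = [p["accession"] for p in proteins if is_surface_protein(p, surfaceome)]
```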

The negative samples for the training data are indirect interactions (interactions that are not marked as direct in the database) between human proteins and any virus except SARS-CoV and SARS-CoV-2. An indirect interaction can be a physical association (two proteins are detected in the same protein complex at the same point in time) or an association (two proteins may participate in the formation of one or more physical complexes, without additional evidence that they bind directly to specific members of such a complex).
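The positive/negative labeling rule described above can be sketched as follows. The record fields and the helper name are hypothetical. The interaction-type strings follow the terms used in the text, and applying the surface filter only to positive samples is an assumption, since the text does not state whether it also applies to negatives.

```python
EXCLUDED_VIRUSES = {"SARS-CoV", "SARS-CoV-2"}  # held out for validation/testing

def build_training_samples(interactions, surface_proteins):
    """Label virus-human interaction records for training.

    Each record is a dict with hypothetical keys:
      virus, virus_protein, human_protein, interaction_type
    Direct interactions with a surface human protein become positives (1);
    every record not marked as direct becomes a negative (0).
    """
    samples = []
    for rec in interactions:
        if rec["virus"] in EXCLUDED_VIRUSES:
            continue  # never train on the held-out viruses
        if rec["interaction_type"] == "direct interaction":
            # Assumption: the surface requirement is applied to positives only.
            if rec["human_protein"] in surface_proteins:
                samples.append((rec["virus_protein"], rec["human_protein"], 1))
        else:
            samples.append((rec["virus_protein"], rec["human_protein"], 0))
    return samples

# Hypothetical toy records.
records = [
    {"virus": "Virus X", "virus_protein": "VX_1", "human_protein": "HUMAN_A",
     "interaction_type": "direct interaction"},
    {"virus": "Virus X", "virus_protein": "VX_1", "human_protein": "HUMAN_C",
     "interaction_type": "physical association"},
    {"virus": "Virus Y", "virus_protein": "VY_2", "human_protein": "HUMAN_E",
     "interaction_type": "direct interaction"},  # not a surface protein: dropped
    {"virus": "SARS-CoV-2", "virus_protein": "Spike", "human_protein": "HUMAN_A",
     "interaction_type": "direct interaction"},  # held-out virus: dropped
]
train = build_training_samples(records, surface_proteins={"HUMAN_A", "HUMAN_B"})
```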

Validation and test sets. As established in studies [70, 71, 72], angiotensin-converting enzyme 2 (ACE2) is the human receptor for both the SARS-CoV [73] and SARS-CoV-2 [72] viruses. The positive validation and testing sets consist of the interactions between the known human receptor (ACE2) and the corresponding spike proteins of SARS-CoV and SARS-CoV-2, respectively. Our negative validation and testing sets consist of all 22:572 possible combinations of the two viral spike proteins and 52 human proteins that meet our filtering criteria.

Since we are interested only in direct interactions between virus and human proteins, we also customize our intra-human PPI training set. Our intra-human PPI dataset is also retrieved from the IntAct [15] database (latest update: 07.05.2021). We retain only interactions between two human proteins that appear in the virus-human PPI dataset constructed above. The confidence scores are normalized into the [0, 1] range. All confidence scores corresponding to "indirect" interactions are set to 0. In the end, our intra-human PPI training set consists of 96,458 interactions between 5563 human proteins.
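A minimal sketch of this preprocessing step is given below. The text does not say how the confidence scores are normalized, so min-max scaling is an assumption here, as are the record layout and function name.

```python
def prepare_human_ppi(records, kept_proteins):
    """Filter and rescale intra-human PPI records.

    `records` are tuples (protein_a, protein_b, score, is_direct).
    Only pairs whose two proteins both occur in `kept_proteins` (the human
    proteins of the virus-human dataset) are retained.
    Scores are min-max scaled to [0, 1] (assumed normalization), and
    indirect interactions receive a score of 0.
    """
    kept = [r for r in records if r[0] in kept_proteins and r[1] in kept_proteins]
    if not kept:
        return []
    lo = min(r[2] for r in kept)
    hi = max(r[2] for r in kept)
    span = hi - lo
    out = []
    for a, b, score, is_direct in kept:
        scaled = (score - lo) / span if span > 0 else 1.0
        out.append((a, b, scaled if is_direct else 0.0))
    return out

# Hypothetical toy records.
raw = [
    ("H1", "H2", 0.40, True),
    ("H1", "H3", 0.90, True),
    ("H2", "H3", 0.65, True),
    ("H2", "H4", 0.80, False),  # indirect: score forced to 0
    ("H1", "H9", 0.99, True),   # H9 not in the virus-human dataset: dropped
]
human_ppi = prepare_human_ppi(raw, kept_proteins={"H1", "H2", "H3", "H4"})
```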

Finally, we evaluate how effectively the prediction methods rank human protein candidates for binding to the envelope protein of an emerging virus. Figure 5 presents the methods' performance on the case study dataset. TopK is equal to 1 if the true human receptor appears among the K proteins with the highest scores predicted by the model, and is equal to 0 otherwise. The scores plotted in Fig. 5 are averages over ten experimental runs with random initialization.
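The TopK measure defined above can be computed in a few lines. The sketch below uses hypothetical candidate names and made-up scores purely to show the mechanics; ties are broken by the sort order, which is an implementation choice not specified in the text.

```python
def top_k_hit(scores, true_receptor, k):
    """Return 1 if `true_receptor` is among the k highest-scored candidates, else 0.

    `scores` maps each candidate human protein to its predicted score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1 if true_receptor in ranked[:k] else 0

def mean_top_k(runs, true_receptor, k):
    """Average the TopK indicator over several runs (one score dict per run)."""
    return sum(top_k_hit(s, true_receptor, k) for s in runs) / len(runs)

# Hypothetical scores from two runs with different random initializations.
runs = [
    {"candidate_A": 0.8, "candidate_B": 0.6, "candidate_C": 0.3},
    {"candidate_A": 0.5, "candidate_B": 0.7, "candidate_C": 0.2},
]
top1 = mean_top_k(runs, "candidate_A", k=1)  # receptor ranked first in one of two runs
top2 = mean_top_k(runs, "candidate_A", k=2)  # receptor within the top two in both runs
```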

Using this method, we find that ACE2, the only SARS-CoV-2 receptor proven in in vivo and in vitro studies [72, 74, 75], consistently appears as the highest-ranked prediction of MTT in each of the ten experimental runs. We observe a significant difference between the highest-ranked performance of MTT and that of its competitors. The performance gain of MTT over STT is substantial after ten runs and supports the superiority of our multitask framework. The next nine highest hits in both models have not been shown to interact with SARS-CoV-2 in in vitro studies. Interestingly, dipeptidyl peptidase 4 (DPP4), a receptor for another betacoronavirus, MERS-CoV [76], also scored highly in the MTT method. However, although in silico analysis has speculated a possible interaction [77], it is yet to be shown experimentally. Similarly, the serine protease TMPRSS2, which is required for SARS-CoV-2 S protein priming during entry [72], appeared in position 7 using the Doc2vec model. Finally, aminopeptidase N (ANPEP), the receptor for the common cold coronavirus 229E [78], appeared as the first hit in the Doc2vec model.

Fig. 5 Case study results for benchmarked methods. TopK = 1 if the SARS-CoV-2 virus receptor appears among the K proteins with the highest scores predicted by the model, and TopK = 0 otherwise. The reported results are the averages after 10 runs.

In Figures 6 and 7, we plot the average confidence scores (i.e. the predicted interaction probabilities) of the top 10 predictions of the MTT and Doc2vec models. Specifically, the proteins are ranked by their confidence scores averaged over 10 runs, as predicted by the two models. For MTT, the receptor ACE2 always occurs at the top of the list with an average confidence score of more than 0.70 (more than 11% higher than the confidence score assigned to the second hit). Doc2vec, in contrast, assigns ACE2 a score of less than 0.44 and ranks it 2nd based on average scores. Moreover, for Doc2vec there is a negligible difference between the prediction scores of ACE2 and the first predicted hit, ANPEP.

These results indicate that MTT can provide high-quality prediction results and can help biologists to restrict the search space for the virus interaction partner effectively. This case study showcases the effectiveness of our method in solving virus-human PPI prediction problem and aims to convince biologists of the potential application of our prediction framework.

We presented a thorough overview of state-of-the-art models and their limitations for the task of virus-human PPI prediction. Our proposed approach exploits powerful statistical protein representations derived from a corpus of around 24 million protein sequences in a multitask framework. Noting that virus proteins tend to mimic human proteins in order to interact with host proteins, we use the prediction of human PPIs as a side task to regularize our model and improve generalization. The comparison of our method with a variety of state-of-the-art models on several datasets showcases the superiority of our approach. Ablation study results suggest that the human PPI prediction side task brings additional benefits and helps boost model performance. A case study on the interaction between the SARS-CoV-2 spike protein and its human receptor indicates that our model can be used as an effective tool to reduce the search space when evaluating host protein candidates as interacting partners for emerging viruses. In future work, we will enhance our multitask approach by incorporating more domain information, including structural protein prediction tools [79], and by exploiting more complex multitask model architectures.

The following subsections provide detailed experimental results. For Hybrid and MotifTransformer, the authors' code is not available and the results are taken from the original papers. '-' indicates that the score is not available. For the other methods, the reported results are averages over 10 experimental runs. We perform pairwise t-tests for statistical significance testing. Our presented results are statistically significant with a p-value less than 0.05.

Table 6 provides a comparison between MTT and the baselines that employ hand-crafted features. MTT outperforms Denovo on all benchmarked datasets and outperforms Generalized on six out of the seven datasets. The performance gains are statistically significant at a significance level of 0.05.

Table 7 provides a comparison between MTT and embedding-based methods on the small testing datasets. MTT outperforms Doc2vec on 4 datasets. The performance gains are statistically significant at a significance level of 0.05. MTT is outperformed by Doc2vec on three datasets: Zhou's Ebola, Zhou's H1N1, and Denovo_slim. We point out that these datasets are quite specialized, in that the negative training and testing samples were drawn with a sequence dissimilarity negative sampling technique. In particular, the protein sequences of the negative test set were chosen based on their dissimilarity to those in the positive set. This is not a realistic setting when the positive test set is itself unknown. Nevertheless, the performance of MTT is comparable to the state of the art on these specially curated datasets, while it outperforms all methods on more general datasets.

In Table 8 we compare MTT with its simpler variants on 11 datasets. The reported results are averages over 10 runs. Results from the pairwise t-test show that (i) MTT is significantly better than Naive baseline on all datasets with a p-value smaller than 0.05, and (ii) MTT is significantly better than STT on 4 out of 7 datasets with a p-value smaller than 0.05, while on the remaining datasets the difference is not statistically significant.
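The significance testing used for these tables can be illustrated with a small sketch. The exact test variant is not spelled out here; since reference [65] concerns samples with unequal population variances, the sketch assumes a two-tailed Welch t-test applied to the per-run scores of two methods. The scores below are invented for illustration, and the p-value is obtained by numerically integrating the Student t density so that only the standard library is needed.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # probability mass inside (-|t|, |t|)
    return max(0.0, 1.0 - central)

def welch_t_test(a, b):
    """Two-tailed Welch t-test; returns (t statistic, degrees of freedom, p-value)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df, two_tailed_p(t, df)

# Hypothetical F1 scores of two methods over five runs each.
method_a = [0.90, 0.91, 0.92, 0.89, 0.93]
method_b = [0.80, 0.81, 0.79, 0.82, 0.78]
t_stat, dof, p_value = welch_t_test(method_a, method_b)
significant = p_value < 0.05
```

With per-run scores available, calling `welch_t_test` for each dataset and comparing the p-value against 0.05 mirrors the kind of statement made above ("significantly better with a p-value smaller than 0.05").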

Comparing SARS-COV-2 with SARS-COV and influenza pandemics

Break ins and break outs: viral interactions with the cytoskeleton of mammalian cells

Exploring and exploiting proteome organization during viral infection

Protein interactions during the flavivirus and hepacivirus life cycle

Exploring the SARS-COV-2 virus-host-drug interactome for drug repurposing

Elucidation of host-virus surfaceome interactions using spatial proteotyping

Proximity labeling approaches to study protein complexes during virus infection

Glycomics and proteomics approaches to investigate early adenovirus-host cell interactions

Decoding protein networks during virus entry by quantitative proteomics

Proteomic approaches to uncovering virus-host protein interactions during the progression of viral infection

Proteomics tracing the footsteps of infectious disease

Exploring and exploiting proteome organization during viral infection

Connecting viral with cellular interactomes

New world arenavirus clade c, but not clade a and b viruses, utilizes α-dystroglycan as its major receptor

The intact molecular interaction database in 2012

Virusmentha: a new resource for virus-host protein interactions

Virusmint: a viral protein interaction database

Hpidb 2.0: a curated database for host-pathogen interactions

Viruses with different genome types adopt a similar strategy to pack nucleic acids based on positively charged protein domains

Virus-host interactome: putting the accent on how it changes

Rapid evolution of virus sequences in intrinsically disordered protein regions

Denovo: virus-host sequence-based protein-protein interaction prediction. Bioinformatics

Predicting protein-protein interactions using sprint. In: Protein-protein interaction networks

Sequence-based prediction of protein protein interaction using a deep-learning algorithm

Computational methods for predicting protein-protein interactions and binding sites

Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme

Machine-learning techniques for the prediction of protein-protein interactions

Computational biology and machine learning approaches to study mechanistic microbiome-host interactions

In silico unravelling pathogen-host signaling cross-talks via pathogen mimicry and human protein-protein interaction networks

Pipe4: fast ppi predictor for comprehensive inter- and cross-species interactomes

Pathogen host interaction prediction via matrix factorization

Interface-based structural prediction of novel host-pathogen interactions. In: Computational methods in protein evolution

Training host-pathogen protein-protein interaction predictors

Unified rational protein engineering with sequence-based deep representation learning

Determining confidence of predicted interactions between HIV-1 and human proteins using conformal method

Computational prediction of virus-human protein-protein interactions using embedding kernelized heterogeneous data

A novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks

Prediction of protein-protein interactions between viruses and human by an SVM model

An improved method for predicting interactions between virus and human proteins

Predhpi: an integrated web server platform for the detection and visualization of host-pathogen interactions using sequence-based methods

A generalized approach to predicting protein-protein interactions between virus and host

Seq-bel: sequence-based ensemble learning for predicting virus-human protein-protein interaction

Predict the protein-protein interaction between virus and host through hybrid deep neural network

Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-COV-2 and human proteins

Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method

Transfer learning for predicting virus-host protein interactions for novel virus sequences

Deepviral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes

Prediction of interactions between viral and host proteins using supervised machine learning methods

A structure-informed atlas of human-virus interactions

Predicting virus-host association by kernelized logistic matrix factorization and similarity network fusion

A network-based integrated framework for predicting virus-prokaryote interactions

Principles of machine learning-guided protein engineering

String v10: protein-protein interaction networks, integrated over the tree of life

Apid interactomes: providing proteome-based interactomes with controlled quality for multiple species and derived networks

Uniprot: a hub for protein information

Psicquic and psiscore: accessing and scoring molecular interactions

Predicting protein-protein interactions using signature products

Probability weighted ensemble transfer learning for predicting interactions between HIV-1 and human proteins

The NCBI taxonomy database

Understanding eukaryotic linear motifs and their role in cell signaling and regulation

Peptides mediating interaction networks: new leads at last

Distributed representations of sentences and documents

Pytorch: an imperative style, high-performance deep learning library

The generalization of 'student's' problem when several different population variances are involved. Biometrika

On comparing classifiers: pitfalls to avoid and a recommended approach

Handbook of parametric and nonparametric statistical procedures

A mass spectrometric-derived cell surface protein atlas

Gene ontology: tool for the unification of biology

The gene ontology resource: enriching a gold mine

Cell entry mechanisms of SARS-COV-2

Molecular mechanism of interaction between SARS-COV-2 and host cells and interventional therapy

SARS-COV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor

Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus

The pathogenicity of SARS-COV-2 in HACE2 transgenic mice

SARS-COV-2 infection of human ACE2-transgenic mice causes severe lung inflammation and impaired function

Structure of MERS-COV spike receptor-binding domain complexed with human receptor DPP4

Emerging covid-19 coronavirus: glycan shield and structure prediction of spike glycoprotein and its interaction with human cd26

Human aminopeptidase n is a receptor for human coronavirus 229e

Highly accurate protein structure prediction with alphafold

A multitask transfer learning framework for novel virus-human protein interactions


A preliminary version of this work [80] was presented at the ICLR Workshop on AI for Public Health 2021.

Authors' contributions

ND designed the study, collected the data, implemented the models and analyzed the results. GB and GG qualitatively validated the design and results of the case study. MK designed and supervised the study as well as analyzed the results. All authors wrote the manuscript. All authors read and approved the final manuscript.

Open Access funding enabled and organized by Projekt DEAL. N.D is funded by VolkswagenStiftung's initiative "Niedersächsisches Vorab" (Grant No. 11-76251-99-3/19 (ZN3434)). G.B and G.G are supported by the Ministry of Lower Saxony (MWK, Project 76251-99 awarded to G.G.). M.K is supported by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor (Grant No. 01DD20003). The funding bodies did not play any role in the design of the study, collection, analysis, interpretation of data, and in writing the manuscript.