key: cord-0293673-v2qdjh86 authors: Roy, Trinita; Sharma, Khushal; Dhall, Anjali; Patiyal, Sumeet; Raghava, Gajendra P. S. title: In-silico method for predicting infectious strains of Influenza A virus from its genome and protein sequences date: 2022-03-21 journal: bioRxiv DOI: 10.1101/2022.03.20.485066 sha: 79cb608a81e459d1b3eab10ca23671fd7b96d354 doc_id: 293673 cord_uid: v2qdjh86 Influenza A is a contagious viral disease responsible for four pandemics in the past and a major public health concern. Being zoonotic in nature, the virus can cross the species barrier and transmit from wild aquatic bird reservoirs to humans via intermediate hosts. Virus gradually undergoes host adaptive mutations in their genome and proteins, resulting in different strain s/vari ants which might spread virus from avians/mammals to humans. In this study, we have developed an in-silico models to identify infectious strains of Influenza A virus, which has the potential of getting transmitted to humans, from its whole genome/proteins. Firstly, machine learning based models were developed for predicting infectious strains using composition of 15 proteins of virus. Random Forest based model of protein Hemagglutinin, achieved maximum AUC 0.98 on validation data using dipeptide composition. Secondly, we obtained maximum AUC of 0.99 on validation dataset using one-hot-encoding features of each protein of virus. Thirdly, models build on DNA composition of whole genome of Influenza A, achieved maximum AUC 0.98 on validation dataset. Finally, a web-based service, named “FluSPred”(https://webs.iiitd.edu.in/raghava/fluspred/) has been developed which incorporate best 16 models (15 proteins and one based on genome) for prediction of infectious strains of virus. In addition, we provided standalone software for the prediction and scanning of infectious strains at large-scale (e.g., metagenomics) from genomic/proteomic data. We anticipate this tool will help researchers in prioritize high-risk viral strains of novel influenza virus possesses the capability to spread human to human, thereby being useful for pandemic preparedness and disease surveillance. Key Points Influenza A is a contagious viral disease responsible for four pandemics. Virus can cross species barrier and infect human beings. In silico models developed for predicting human infectious strains of virus. Models developed were build using 15 proteins and whole genome datasets. Webserver and standalone package for predicting and scanning of high-risk viral strains. Influenza is a contagious viral disease, having a zoonotic origin, where a sudden rise in body temperature, headache, myalgia, lethargy, and dry cough [1] are some of the symptoms of the infection. Severe cases ranging from influenza-induced pneumonia [2] , encephalitis, myocarditis [3] , to death within a few hours, can have fatal outcomes. Patients with chronic heart or lung disease [4] , immune disorders, diabetes, have an increased risk of influenza infection [5] [6] [7] . Influenza A virus occurs naturally among wild aquatic birds like geese, swans, waterfowl and it is responsible for causing avian influenza in birds, including domestic poultry [8] . However, virues jump species barriers to affect animals and humans sporadically [9] . As shown in Figure 1 , different virus strains can affect humans and cause epidemics as the strains may have the potential to overcome species barriers and infect humans. Influenza A virus may further get adapted in such a way that it could spread human-to-human, leading to a potential pandemic, which is a cause of great concern [10, 11] . The pandemic influenza strains arise from reassortment of gene segments, where there is major antigenic change occurs, i.e., a phenomenon known as antigenic shift [12] . The non-human viruses undergo changes so that they become capable of infecting humans and spreading efficiently and sustainably [13] . In other words, mutations in amino acid residues, especially haemagglutinin, can alter the receptor-binding site and specificity, altering receptor preference and spread at large scale in various countries [14] . In past, Influenza A was responsible for the outbreak of four pandemics. One of the most devastating pandemics, with a high mortality rate among children below 5 years of age and 20-40 age group, was caused by the H1N1 subtype in 1918. The virus had an avian origin and resulted in 50 million estimated deaths worldwide [15] . The H2N2 subtype emerged in East Asia from an avian source in 1957 causing approximately 1.1 million deaths worldwide [16] . In 1968, the H3N2 subtype was first noted in the USA and had over 1 million deaths globally. This virus later started circulating as the seasonal flu [17] . In 2009 (H1N1) pdm09 was first noted in the USA and leads to 151,700-575,400 deaths worldwide in the first year of infection, then it started circulating as seasonal flu [18] . Host adaptive mutations in the genome and proteins are the major reasons behind the transmissibility of the viral strains from the avian and mammalian reservoirs to human hosts [19] . In the recent past, several methods have been developed for the identification of host tropism of Influenza A virus [20] [21] [22] [23] using protein and genome sequences. These studies show that protein and genome sequences are very efficient for determining the host tropism of zoonotic pathogens and changes take place at the molecular level for the virus being capable of crossing the species barrier. In this study, we have made a systematic attempt to develop computational models for the prediction of infectious strains of the influenza A virus. The aim of this study is to monitor infectious strains in non-human reservoirs to control the spread of strain from non-human to human hosts. We compiled 15 influenza A virus proteins and genome sequences from IRD and VIDHOP resources respectively. Models using different machine learning techniques were developed where features of proteins and genome were computed using Pfeature and Nfeature. In order to serve the scientific community, we provide a freely accessible web-server, and a standalone package named "FluSPred", for the prediction of infectious influenza A strains with the help of either of the proteins or the genome sequence using the best prediction models. The complete architecture of the study is illustrated in Figure 3 . The description of each step is given below. We obtained a total of 985,720 protein sequences from the Influenza Research Database (IRD) database (www.fludb.org) corresponding to 15 proteins of the Influenza A virus. These proteins were used to create 15 datasets (one dataset for one type of protein), where each dataset has sequences of the viral proteins from human host and non-human host (avian/mammalian host). In order to avoid biases, we removed the redundant and incomplete sequences (See Table 1 ). In this study, human infectious strains and associated sequences were considered positive datasets, and non-human (avian and other mammalian) was taken as a negative dataset. After preliminary data cleaning, we get 258790 unique protein sequences out of which, 86653 were positive (i.e. viral protein sequences derived from human hosts) and 172137 were negative sequences (i.e., viral Influenza A genomic sequences were obtained from the VIDHOP method [24] . A total of 312617 genome sequences were available in the dataset in which 159526 humans and 153091 nonhuman genomes were selected for this study. In order to maintain the homogeneity in protein and genome datasets, here we considered only avian, mammalian, and human. Viral genome sequences derived from human hosts were taken as positive and viral genome sequences derived from non-human hosts were taken as the negative dataset. Apart from this, the duplicate sequences were removed, and finally, a total of 308632 sequences were considered, with 159526 positive sequences and 149106 negative sequences. In order to train and evaluate our models, we perform both internal and external validation. Thus, the final dataset was divided into training and validation data, where 80% data was used for internal and 20% data was used for external validation [25] [26] [27] [28] . In order to compute the amino acid composition (AAC) and dipeptide composition (DPC) features of protein sequences, we have used the Pfeature standalone package [29] . In case of AAC, a feature vector of the length of 20, and for the DPC, a vector length of 400 features was computed. Furthermore, to generate the nucleotide-based features for the genome sequence data, we have used Nfeature standalone package [30] . Here, we computed composition-based features such as the Composition of DNA sequence for all K-Mers (CDK) and Reverse complement of DNA for all K-mers (RDK). CDK is a feature extraction method that calculates the composition of all 4 nucleotides. Here K could be 1, 2, or 3, generating a feature vector of length 4, 16, and 64 respectively. Likewise, RDK is a feature extraction method that calculates reverse complement K-Mer composition of nucleotides where we used K as 1,2 and 3 having feature vector sizes of 2,10, and 32 respectively. In addition, we have used one hot encoding approach for feature extraction, where the length of the sequences needed to be equal. In order to generate equal length vectors, we fix the length of all the sequences to 0.95 quantiles, where the sequences having shorter than that were extended by normal repeats, and the longer sequences were truncated until it reaches 0.95 quantiles of all sequences mark [24] . Once the lengths of the sequences are equalized, the sequences are encoded numerically by the one-hot encoding method. Here, each of the amino acid residues of the entire sequence is converted to a vector size of length of the sequence x 20, where 20 is the canonical protein residues, where the respective residue is given the value 1 in its position and the rest is given the value 0. This generates a sparse matrix, for instance, Alanine (A) can be represented by a 20-dimensional vector 10000000000000000000. Similarly, Cysteine(C) can be represented as a 20-dimensional vector 001000000000000000000. In this study, several machine learning algorithms have been implemented for developing binary classification models that include Random Forest ( GNB, and RF-based models were generated for genome sequences using CDK and RDK features. We fine-tuned the model on the training data by taking different thresholds for probabilities as 0.8,0.5, 0.4, 0.2, 0.1, and we found that 0.5 works best for all the models. These classification techniques were implemented using python-library sci-kit learn [31] . Identifying the critical set of features for the machine learning models is one of the significant challenges in this study. For one hot encoding feature generation technique, each residue of the protein sequences was represented as a 20-dimensional vector (sequence length x 20 dimensions) where the resultant matrix became sparse. Therefore, we needed dimensionality reduction where we could reduce and choose the final number of components which aids to the faster execution of the program as well as reduces the noise in the data. We have used the Truncated Singular Value Decomposition (tSVD) method for one-hot encoded feature selection to improve computational efficiency. The highly dimensional matrix was then reduced to 100 principal components by calculating the explained variance [32] . The dataset was split into an 80:20 ratio, where 80 percent data consisted of training data and the rest 20 percent was validation data. The training data was further divided into training and testing datasets during the 5-fold cross-validation process and the mean of the results for each fold of the cross-fold validation was noted down. In the 5-fold cross-validation process, the entire training data gets divided into five equivalent folds and then four folds were used for training and the 5th fold was used for testing. The whole procedure gets iterated five times where every fold gets a chance for being used as testing data. This is a standard procedure used in many studies [33] [34] [35] [36] [37] . In the current study, we have used the standard performance evaluation parameters such as accuracy, sensitivity, specificity, precision, recall, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC) were computed. Whereas, accuracy, sensitivity, and specificity are threshold-dependent parameters and AUC is threshold independent parameter that is plotted by sensitivity vs 1-specificity. The parameters are computed by: Where, T P , T N , F P and F N stand for true positive, true negative, false positive and false negative, respectively. The In order to perceive the different motifs occurring exclusively in either of the positive or negative datasets/the positive datasets of each of the proteins and the genome, we used MERCI software [40] . It is a tool that can identify and return top K motifs that are most frequent in the positive sequences and absent in the negative sequences. The top 10 motifs of each of the proteins which are exclusively present in the positive and negative datasets are enclosed in Supplementary Table S1 . The complete results of other datasets are shown in Table 3 . Table 4 . The comprehensive results are provided in Table 4 and the complete results are given in Supplementary Table S4 . The other proteins also act as human infection-causing strains. There is the scope of further research on sequence analysis or molecular studies on these proteins to identify their roles. In addition, we found that proteins like NS1, NS2 or M1, M2 which get encoded from the same segment, respectively, give different prediction results. This shows that even though they are encoded by the same segments, they are functionally different. This may be attributed to both proteins responding individually to structural constraints or selective pressures since a mutation in the respective segment can affect either protein separately [41] . In order to develop the genome-based classification models, we have used various machine On the other side, RDK based features achieve maximum accuracy of 97.6% and AUC of 0.976 on the training data and validation data using RF-based classifier. Similarly, KNN based models also show comparable results on both training and validation dataset (Table 6 ). However, XGB performs quite less with an accuracy of 92.5% and an AUC of 0.925 on validation datasets. Results of SVM, DT, and NB classifiers could not perform well on the training as well as validation dataset, as given in Table 6 . The complete results are provided in Supplementary Table S6 . To serve the scientific community, we have developed a web-server (https://webs.iiitd.edu.in/raghava/fluspred/) and standalone package (https://github.com/raghavagps/FluSPred) named "FluSPred" using the best model for each protein and the genome datasets. For the frontend part of the web-server, the structure was developed using HTML5, the styling was done by CSS3 and the logic was made using VanillaJS. The backend part was made using Python and PHP. The web-server is responsive as it is compatible with almost all devices including desktop, laptop, mobile phone, tablets, and iMac. Based on the results, the random forest model for one hot encoding feature showed comparable results but even after dimensionality reduction, the models were computer expensive and time consuming, which is why we did not select this for the webserver. On the other hand, execution of the AAC and the DPC models were efficient and so we selected the DPC models for most of the proteins for incorporating them into the web-server since it showed high results in the evaluation parameters compared to AAC and was faster in feature generation compared to the one-hot encoding of the sequences. For NS2 and PB1F2, we chose the AAC models for random forest as they gave the highest results compared to their DPC results. For the genome sequence dataset, CDK feature for K=3 at threshold = 0.5 was incorporated in the web server for the prediction of human and non-human infectious strains using genomic sequences. VIDHOP uses deep neural network approach for the classification and get maximum AUC of 0.94 on the influenza A dataset [24] . Whereas, the methods are based on the zoonotic host prediction of influenza A virus uses limited amount of protein sequence data and do not provide any web-server facility for the community. In this study, we have made a methodical attempt to develop computational tool using the 15 proteins and genome sequences for the prediction of human infectious strains of the Influenza A virus. The sequence datasets were obtained from IRD and VIDHOP, on which, we calculate compositional-based descriptors/features and one hot encoding-based features using Pfeature and Nfeature tools. Our models included a wide range of data, with 308632 genome sequences pertaining to 34 hosts as well as achieved a much higher accuracy in comparison with existing methods. The relevant features were selected on which the model was trained and validated. Our compositional analysis indicates which residues (A, I, K, E, H, G, P, Q) composition is higher in the positive datasets of three major zoonotic proteins such as HA, NA and PB1-F2. This shows that simple composition-based techniques can identify the important features and help in the prediction. Motif analysis on the sequence data highlighted the important set of motifs that were present on the human sequences and not on the non-human sequences. The results of this analysis have been provided for use in further research. Here, we used AAC and DPC features for the infectious strain prediction between human (positive) and mammalian /avian(negative) datasets. The AUC varies from 0.88 to 0.99 on validation datasets for 15 different influenza A proteins, with minimum performer PA-N182 and maximum for HA protein. Our results show that Random forest models gave the best prediction, where HA protein achieved the highest AUC of 0.99 in training and 0.98 in validation datasets. Additionally, both HA and NA showed high results on the computational prediction models, which clearly indicate their importance in zoonosis. In case of the genome sequences, random forest showed the best results using CDK (K = 3) features, with an accuracy of 98% and AUC of 0.98 on validation dataset. Of Note, we have used our best models for 15 proteins and genome to implement the web-server named "FluSPred" (https://webs.iiitd.edu.in/raghava/fluspred/ ). To the best of our knowledge, this is the first attempt to develop a web-server for the prediction of the zoonotic host, whether the particular strain would be capable of crossing the species barrier to infect humans or not, using 15 Influenza A proteins and genome sequence data. The models will be able to predict whether the sequence pertaining to the virus is infectious or not, from avian/mammalian to humans. The current work has not received any specific grant from any funding agencies. Influenza: Diagnosis and Treatment The role of pneumonia and secondary bacterial infection in fatal and serious outcomes of pandemic influenza a(H1N1)pdm09 The hidden burden of influenza: A review of the extra-pulmonary complications of influenza infection The contribution of influenza to combined acute respiratory infections, hospital admissions, and deaths in winter Influenza virus-related critical illness: prevention, diagnosis, treatment Influenza A outbreak among adolescents in a ski hostel Fatal influenza A virus infection in a child vaccinated against influenza Influenza: an emerging disease The importance of animal influenza for human disease Influenza: the once and future pandemic The evolution and future of influenza pandemic preparedness A chimeric hemagglutinin-based universal influenza virus vaccine approach induces broad and long-lasting immunity in a randomized, placebo-controlled phase I trial Complications of seasonal and pandemic influenza Haemagglutinin mutations responsible for the binding of H5N1 influenza A viruses to human-type receptors Reviewing the History of Pandemic Influenza: Understanding Patterns of Emergence and Transmission Global Mortality Impact of the Influenza Pandemic Impact of pandemic A/H1N1/2009 influenza on children and their families: comparison with seasonal A/H1N1 and A/H3N2 influenza viruses Impact of the fall 2009 influenza A(H1N1)pdm09 pandemic on US hospitals Adaptive Mutations in Influenza Enhance Polymerase Activity and Infectious Virion Production Predicting transmission of avian influenza A viruses from avian to human by using informative physicochemical properties Predicting host tropism of influenza A virus proteins using random forest Predicting the host of influenza viruses based on the word vector Predicting Influenza A Tropism with End-to-End Learning of Deep Networks Computer-aided prediction of antigen presenting cell modulators for designing peptide-based vaccine adjuvants VIRsiRNApred: a web server for predicting inhibition efficacy of siRNAs targeting human viruses An approach for identifying Nacetylglucosamine interacting residues of a protein from its primary sequence Computing wide range of protein/peptide features from their sequence and structure Nfeature: A platform for computing features of nucleotide sequences Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone In silico approaches for designing highly effective cell penetrating peptides Computing Skin Cutaneous Melanoma Outcome From the HLA-Alleles and Clinical Characteristics Computer-aided prediction and design of IL-6 IL-6 plays a crucial role in COVID-19 AlgPred 2.0: an improved method for predicting allergenic proteins and mapping of IgE epitopes Computer-aided prediction of inhibitors against STAT3 for managing COVID-19 associated cytokine storm Influenza A virus recycling revisited Single gene reassortants identify a critical role for PB1, HA, and NA in the high virulence of the 1918 pandemic influenza virus Identifying discriminative classification-based motifs in biological sequences Evolutionary analysis of the influenza A virus M gene with comparison of the M1 and M2 proteins Risk factors for human disease emergence Emerging Infectious Diseases Global trends in emerging infectious diseases Zoonotic Diseases: Etiology, Impact, and Authors are thankful to the Department of Bio-Technology (DBT) and Department of Science and Technology (DST-INSPIRE) for fellowships and the financial support and Department of Computational Biology, IIITD New Delhi for infrastructure and facilities. All the datasets generated in this study are available at "FluSPred" web server, https://webs.iiitd.edu.in/raghava/fluspred/download.php. The authors declare no competing financial and non-financial interests.