key: cord-0736838-u5tgqstr
authors: Yerukala Sathipati, Srinivasulu; Ho, Shinn-Ying
title: Identification and Characterization of Species-Specific Severe Acute Respiratory Syndrome Coronavirus 2 Physicochemical Properties
date: 2021-04-15
journal: J Proteome Res
DOI: 10.1021/acs.jproteome.1c00156
sha: bad99b8bc66443bc804158befdef832535f24c4b
doc_id: 736838
cord_uid: u5tgqstr

There is an urgent need to elucidate the underlying mechanisms of coronavirus disease (COVID-19) so that vaccines and treatments can be devised. Severe acute respiratory syndrome coronavirus 2 is genetically similar to bat and pangolin viruses, but a comprehensive understanding of the functions of its proteins at the amino acid sequence level is lacking. A total of 4320 sequences of human and nonhuman coronaviruses were retrieved from the Global Initiative on Sharing All Influenza Data and the National Center for Biotechnology Information. This work proposes an optimization method, COVID-Pred, with an efficient feature selection algorithm to classify the species-specific coronaviruses based on physicochemical properties (PCPs) of their sequences. COVID-Pred identified a set of 11 PCPs using a support vector machine and achieved 10-fold cross-validation and test accuracies of 99.53% and 97.80%, respectively. These findings could provide key insights into understanding the driving forces during the course of infection and assist in developing effective therapies.

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the COVID-19 pandemic that has spread around the globe since its first appearance in Wuhan, Hubei province of China, in early December 2019. 1 As of 21 February 2021, the World Health Organization has reported 110.38 million confirmed cases and 2,446,008 deaths globally, making COVID-19 a major health concern. As of 18 February 2021, at least seven different vaccines across three platforms have been rolled out in countries.
Coronaviruses are enveloped single-stranded positive-sense RNA viruses that constitute the subfamily Orthocoronavirinae in the family Coronaviridae. 2 The genome sequence of SARS-CoV-2 is closely related to those of severe acute respiratory syndrome coronavirus (SARS-CoV) and bat coronaviruses. It shares 79.6% sequence identity with SARS-CoV, and it is 96% identical to a bat coronavirus. 3 Human coronavirus (HCoV) genomes encode four major structural proteins: the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins. 4 Each protein plays a significant role in the structure of the virus and in other aspects of the replication process. The S glycoprotein of coronaviruses binds to appropriate receptors to facilitate viral entry into human host cells. SARS-CoV-2 uses the SARS-CoV receptor angiotensin-converting enzyme 2 (ACE2) to enter the host cell after priming by TMPRSS2. 5 The S glycoprotein of SARS-CoV-2 and its entry into the host cell through ACE2 are well characterized. 6, 7 The primary role of the N protein is to pack the viral genome into a nucleocapsid, 8 and it is considered to be a multifunctional protein in coronaviruses involved in the host cellular response to viral infection and replication. 9 The N protein is a key molecule in the egress and assembly of SARS-CoV, and transient expression of N is involved in the production of viruslike particles of coronaviruses. 10 The M protein plays an important role in virus assembly and in the production of viral particles. 11 Homotypic interactions of M proteins are involved in envelope formation, 12 and they contribute to the core stability of coronaviruses. 13 Elongated and compacted M proteins are associated with the flexibility and density of S proteins. 11 The interaction between the M and E proteins is involved in envelope formation and budding of coronavirus particles.
11 The coronavirus E protein is a minor component of the virus particles, but it plays an important role in virion assembly and virus host−cell interactions. 14 The absence of E protein in gastroenteritis coronaviruses blocks virus trafficking in the secretory pathway and prevents virus maturation. 15 Together, the structural and functional studies of the SARS-CoV-2 proteins can provide invaluable information about the binding potential of viruses to host cells, that is, information necessary for vaccine design. Each protein has distinct properties that allow it to perform its functions, and its interactions and dynamics depend on its physicochemical properties (PCPs). HCoV shares 89.1% nucleotide and 77.2% amino acid sequence similarity with some bat coronaviruses. 16 More importantly, the amino acid sequence comparison of the receptor-binding domain of the S proteins from HCoV and SARS-CoV showed that they shared only 73.8−74.9% sequence identities. 16 In addition, small changes in the amino acid sequence of the S protein are crucial for binding to its host. For instance, the bat SARS-like CoV strain cannot bind to human ACE2 17 due to minor amino acid differences from SARS-CoV. Therefore, knowing the HCoV protein PCPs and understanding the amino acid differences at the sequence level would be crucial for determining the mechanisms behind their species specificity and the functions of the HCoV proteins. Extensive efforts are being made to eradicate the COVID-19 pandemic; the number of COVID-19 tests is rapidly increasing and it produces a huge dataset, which makes it difficult to derive the key elements that are essential for treatment. Artificial intelligence and machine learning are playing a critical role in COVID-19, especially by decreasing the workload of medical experts using computed tomography scans to detect COVID-19. 
18 Machine learning techniques can broaden the screening process and identify potential antiviral agents based on their protein structures and DNA sequences to predict the drug binding sites of SARS-CoV-2. 19 Therefore, machine learning methods are ideal tools for analyzing large volumes of data and for identifying promising candidates for treating COVID-19.

In this study, we retrieved the protein sequences of 4320 coronaviruses from the Global Initiative on Sharing All Influenza Data (GISAID) and the National Center for Biotechnology Information (NCBI) databases. We constructed a dataset with 2225 human−host coronaviruses (HCoV) as positive samples and 2095 nonhuman−host coronaviruses (nHCoV) as negative samples. We used a support vector machine (SVM)-based optimization method called COVID-Pred to distinguish HCoV and nHCoV using their amino acid sequences. COVID-Pred uses an optimal feature selection algorithm called the inheritable bi-objective combinatorial genetic algorithm (IBCGA) 20 to select informative PCPs that differ between HCoV and nHCoV. COVID-Pred identified 11 PCPs that are able to distinguish HCoV and nHCoV proteins. The objective of this study was to explore the PCPs and amino acid compositions that are specific to HCoV, which may be helpful in understanding how HCoV proteins function and may provide a guide for vaccine design.

The protein sequences of 2225 HCoV were retrieved from the GISAID database (https://www.gisaid.org) on June 3, 2020, and 2095 nHCoV protein sequences were retrieved from the NCBI database. The initial dataset thus consisted of the protein sequences of 4320 coronaviruses. Since each amino acid sequence is crucial for the binding of coronaviruses to their hosts, we reduced the sequence identity to 90%. After removal of redundancy and sequence uncertainties, the final dataset consisted of 141 HCoV structural protein sequences as positive samples and 163 nHCoV S protein sequences as negative samples.
Furthermore, the dataset was divided into training and test sets in a ratio of 7:3. There were 213 sequences (HCoV and nHCoV) in the training set and 91 sequences (HCoV and nHCoV) in the test set. Additionally, we used seven S protein sequences of HCoV from the NCBI database after sequence identity reduction for an independent test. All of the dataset information is summarized in Tables S1 and S2 (Supplementary Data 2).

This study used 531 PCPs retrieved from the AAindex database developed by Kawashima and Kanehisa 21 as candidate features to construct COVID-Pred to distinguish species-specific coronavirus proteins. The original coronavirus amino acid sequences were converted into numerical indices according to the 531 PCP values. The feature representation of the 531 PCPs is described as follows:

a) Collect the HCoV and nHCoV protein sequences from the dataset.
b) Calculate the composition $f(a_i)$ of a protein for the $i$th amino acid $a_i$ of the 20 amino acids, so that a protein sequence of variable length is encoded into a feature vector of length 531.
c) Calculate the feature value of the $n$th physicochemical property, $\mathrm{PCP}(n)$, of a coronavirus protein, where $n = 1, 2, \ldots, 531$:

$$\mathrm{PCP}(n) = \sum_{i=1}^{20} f(a_i)\,\mathrm{PCP}_n(a_i)$$

where $\mathrm{PCP}_n(a_i)$ is the value of amino acid $a_i$ for the $n$th physicochemical property.

To investigate the properties of the coronavirus proteins, we proposed the COVID-Pred method, which was customized using the SVM incorporating the optimal feature selection algorithm IBCGA.

Inheritable Bi-objective Combinatorial Genetic Algorithm. To construct the COVID-Pred method, IBCGA was used for feature selection. IBCGA is a well-known feature selection algorithm that has been used for solving biological problems such as cancer survival predictions, 22−24 protein function predictions, 25 and modeling gene regulatory networks.
26, 27 IBCGA is an efficient global optimization technique with an intelligent evolutionary algorithm (IEA) to select a small set of informative features from a large pool of candidate features while optimizing the prediction performance.

COVID-Pred utilized the SVM classifier for distinguishing HCoV and nHCoV. In COVID-Pred, the SVM classifier was implemented in the LIBSVM package. 28 The radial basis function (RBF) kernel was used for the implementation of SVM in the LIBSVM package. The RBF kernel is computed in the feature space between two data points, $x_i$ and $x_j$, and is defined as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

In IBCGA, the commonly used genetic algorithm (GA) terms such as gene and chromosome, represented as GA-gene and GA-chromosome, were used. The GA-chromosome of IEA consists of $m = 531$ binary GA-genes for selecting informative PCPs and two 4-bit GA-genes for encoding the parameters $C$ and $\gamma$ of SVM. The high performance of COVID-Pred arises from the simultaneous optimization of feature selection and fine-tuning of SVM using IBCGA. In COVID-Pred, numerical protein sequences encoded as 531 PCPs in the training dataset were used as the input. The IBCGA can simultaneously provide a set of solutions, $X_r$, where $r = r_{end}, r_{end} + 1, \ldots, r_{start}$, in a single run. The feature selection algorithm IBCGA can be described as follows:

Step 1: (Initialization) Randomly generate an initial population of $N_{pop}$ individuals. In this work, $N_{pop} = 50$, $r_{start} = 50$, $r_{end} = 10$, and $r = r_{start}$.
Step 2: (Evaluation) Evaluate the fitness value of all individuals using the fitness function, that is, the prediction ACC in terms of 10-fold cross-validation.
Step 3: (Selection) Use a conventional method of tournament selection that selects the winner from two randomly selected individuals to generate a mating pool.
Step 4: (Crossover) Select two parents from the mating pool to perform an orthogonal array crossover operation of IEA.
Step 5: (Mutation) Apply a conventional bit mutation operator to parameter genes and a swap mutation to the binary genes for keeping $r$ selected features. The best individual was not mutated for the elite strategy.
Step 6: (Termination test) If the stopping condition for obtaining the solution $X_r$ is satisfied, output the best individual as the solution $X_r$. Otherwise, go to Step 2.
Step 7: (Inheritance) If $r > r_{end}$, randomly change one bit in the binary genes for each individual from 1 to 0; decrease the number $r$ by one and go to Step 2. Otherwise, stop the algorithm.
Step 8: (Output) Obtain a set of $m$ PCPs from the chromosome of the best solution $X_m$ among the solutions $X_r$, where $r = r_{end}, r_{end} + 1, \ldots, r_{start}$.

We used eight widely used machine learning methods in the Weka data mining software 29 to distinguish HCoV and nHCoV for performance comparison with COVID-Pred. They were Naive Bayes, multilayer perceptron (MLP), sequential minimal optimization (SMO), stochastic gradient descent (SGD), logistic model tree (LMT), J48, decision tree, and random forest. The classifier subset evaluator and the best first search were used for feature selection to design classifiers for distinguishing HCoV and nHCoV.

We evaluated the predictive performance of COVID-Pred using the following evaluation metrics: sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), accuracy (ACC), and area under the ROC curve (AUC).

Amino acid composition (AAC) was measured for the HCoV and nHCoV. For the 20 amino acids denoted as $A_1, \ldots, A_{20}$, the frequency of each amino acid ($Af_i$) was measured relative to the protein sequence length ($L$). AAC is represented as follows:

$$\mathrm{AAC}(i) = \frac{Af_i}{L}, \quad i = 1, 2, \ldots, 20$$

Dipeptide composition (DPC) is defined over pairs of amino acids denoted as dipeptides, $A_i A_j$ (i.e., AA, AC, ..., YY), and the frequency of occurrence of each dipeptide is defined as $df_{i,j}$. The DPC is computed as

$$\mathrm{DPC}(i, j) = \frac{df_{i,j}}{n}$$

where $n = df_{1,1} + df_{2,2} + \ldots + df_{20,20}$.
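The AAC and DPC definitions above can be illustrated with a minimal Python sketch. The function names `aac` and `dpc` are hypothetical, and the sketch assumes that dipeptides are counted as overlapping adjacent pairs and that $n$ is the total number of counted dipeptides; the source does not spell out either choice.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(seq):
    """Amino acid composition: count of each residue divided by the sequence length L."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {a: counts.get(a, 0) / len(seq) for a in AMINO_ACIDS}

def dpc(seq):
    """Dipeptide composition over overlapping adjacent pairs (an assumption).

    Pairs containing a non-standard symbol are skipped, and the counts are
    normalized by the total number of counted pairs.
    """
    seq = seq.upper()
    valid = {a + b for a, b in product(AMINO_ACIDS, repeat=2)}
    counts = Counter(seq[k:k + 2] for k in range(len(seq) - 1))
    total = sum(c for pair, c in counts.items() if pair in valid)
    if total == 0:
        raise ValueError("sequence has no valid dipeptides")
    return {pair: counts.get(pair, 0) / total for pair in sorted(valid)}

if __name__ == "__main__":
    toy = "MFVFLVLLPLVSSQ"  # short illustrative fragment, not a dataset entry
    print(aac(toy)["L"])
    print(dpc(toy)["LV"])
```

Returning a fixed set of 20 (or 400) keys keeps the vectors of different sequences directly comparable, which is what a per-residue comparison between two groups needs.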
The objective of this study was to identify and analyze the PCPs that are specific to different coronavirus species and to explore the crucial driving forces that are involved in HCoV protein functions. For this purpose, 4320 protein sequences from HCoV and other organisms (nHCoV) in FASTA format were extracted. After preprocessing the initial dataset, the final dataset consisting of 141 HCoV and 163 nHCoV protein sequences was obtained from the GISAID and NCBI databases. The COVID-Pred method was established using the SVM incorporating the optimal feature selection algorithm IBCGA to identify the PCPs that could distinguish between HCoV and nHCoV. COVID-Pred selected 11 PCPs and achieved 10-fold cross-validation (10-CV) ACC, SN, SP, MCC, AUC, test ACC, and test AUC of 99.53%, 1.00, 0.99, 0.99, 0.996, 97.80%, and 0.991, respectively. COVID-Pred obtained 100% (7/7) accuracy on an independent dataset consisting of seven HCoV S protein sequences. The COVID-Pred performance was evaluated using ROC curves as shown in Figure S1 (Supplementary Data 1).

Next, the prediction performance of COVID-Pred was compared with that of several machine learning methods in Weka using the full dataset (n = 304). We used the classifier subset evaluator and the best first search for the feature selection and selected 28 features to distinguish HCoV and nHCoV. Eight standard classifiers, namely Naive Bayes, MLP, SMO, SGD, LMT, J48, decision tree, and random forest, were used for the performance comparison. The Naive Bayes classifier achieved 10-CV ACC, MCC, SN, SP, and AUC of

We ranked the identified 11 PCPs based on their prediction performance using the main effect difference (MED). A larger MED score indicates a greater contribution toward prediction accuracy. The identified 11 PCPs and their corresponding ranks and MED scores are listed in Table 2.
The identified 11 properties, namely FAUJ880103, ONEK900101, PALJ810116, AURR980102, FAUJ880106, TANS770103, FASG760101, MONM990101, AURR980116, DAYM780201, and RICJ880117, were analyzed further to explore their roles in SARS-CoV-2 proteins.

Normalized van der Waals Volume. The top PCP based on the MED results was the normalized van der Waals volume (FAUJ880103), 30 with a MED score of 9.94. Fauchere et al. measured the side chain parameters of the 20 amino acids and assessed the relevance of these parameters for the hydrophobic, steric, and electric properties of the amino acid side chains; 30 the normalized van der Waals volume of the amino acid side chains was among the parameters measured. There are different mechanisms involved in protein molecule interactions, including electrostatic forces, solvation forces, and van der Waals forces. Van der Waals forces act during interactions of proteins with other molecules. 31 Recently, stronger van der Waals interactions were found between SARS-CoV-2 and ACE2 compared to those between SARS-CoV and ACE2. 32 A molecular docking study on SARS-CoV-2 reported that van der Waals interactions play a major role in the binding process. 33 Yan et al. found that subtle amino acid changes improve the van der Waals interactions between SARS-CoV-2 and ACE2 and might determine the stronger interaction. 34 More amino acids that formed hydrogen bonds and van der Waals interactions were found at the SARS-CoV-2 interaction sites when compared to those at the SARS-CoV interaction sites. Wang et al. identified that the SARS-CoV-2-CTD binding interface has more amino acid residues forming van der Waals interactions than SARS-RBD that directly interacts with ACE2. 35 Stronger electrostatic and van der Waals interactions were observed between SARS-CoV-2 and ACE2 compared to those between SARS-CoV and ACE2. 36 We thus measured the normalized van der Waals volumes for HCoV and nHCoV according to FAUJ880103.
30 We observed that the average normalized van der Waals volumes for HCoV were slightly higher than those for nHCoV. The mean normalized van der Waals volumes obtained for HCoV and nHCoV were 0.17 ± 0.10 and 0.16 ± 0.09, respectively. Among the 20 amino acids, larger van der Waals volume differences were observed between HCoV and nHCoV for L, K, N, R, and V. Additionally, we also observed slightly larger van der Waals volumes for the HCoV S proteins compared to the nHCoV S proteins. The amino acids R, Y, K, and P showed larger differences in van der Waals volumes between the HCoV and nHCoV proteins, as shown in Figure S2A ( Supplementary Data 1) . Delta G Values for the Peptides Extrapolated to 0 M Urea. The conformational preferences of the amino acids influence the secondary and tertiary structures of proteins. Aydin et al. reported a 50% α-helical content in a designed recombinant SARS-CoV S2 domain fusion protein. 37 Subsequent conformational changes at the helices are critical to the fusion of viral and host membranes and the release of the viral genome into the host cells. 38 Karyn et al. measured the free energy difference (ΔΔG 0 ) values of amino acids by substituting them in the guest sites of alpha helices. 39 We further calculated the ΔΔG 0 values for HCoV and nHCoV according to ONEK900101. 39 The mean ΔΔG 0 values did not show much difference between HCoV and nHCoV, but among the 20 amino acids, L, N, K, and E showed the largest differences in ΔΔG 0 values between HCoV and nHCoV. A slight difference in the ΔΔG 0 value was observed between the HCoV and nHCoV S proteins in which amino acids R, Y, V, and K showed a larger difference in ΔΔG 0 value compared to the others. Normalized Frequency of Turn in α/β Class. The property of PALJ810116 is described as "Normalized frequency of turn in α/β class." 40 Palau et al. calculated the conformational propensities of each amino acid for secondary structural alignments. 
The utilization of amino acids depends on the amount and topology of different secondary structures, and there are distinct preferences for α/β protein amino acids, such as I and V being the preferred amino acids in the α/β structures. 40 A circular dichroism spectroscopy study reported that a SARS-CoV-2 fusion peptide has an α-helical content. 41 Therefore, we measured the normalized propensities of α/β in HCoV and nHCoV. We observed a slight difference in the mean normalized α/β turns between HCoV and nHCoV with a mean normalized frequency of α/β turns of 0.048 ± 0.02 and 0.050 ± 0.03, respectively. Larger differences in the amino acids for this property between HCoV and nHCoV were observed for N, K, G, S, and Y. There was no mean difference observed for this property between the HCoV and nHCoV S proteins, but amino acids Y, R, P, G, and S showed a difference in the normalized frequency of turns in α/β between the HCoV and nHCoV S proteins. This analysis indicates that amino acid propensities at α/β structures of SARS-CoV-2 might play an important role in the ACE2 binding process. Normalized Positional Residue Frequency at the Helix Termini N″. The property of AURR980102 describes the normalized positional residue frequency at the helix termini N″. Aurora and Rose examined the role of helix capping in the secondary structures of proteins and identified seven distinct capping motifs at the helices C-terminus and N-terminus, where each motif exhibits a pattern of hydrogen bonds with hydrophobic interactions. 42 Various experiments demonstrated that the capping stabilizes the α-helices in proteins 43−45 and mutations of interacting residues in the capping motifs affect protein stability. 46 According to a previous study, 42 the normalized frequency of Pro is higher in N-terminal motifs, and also Pro functions as a hydrophobic residue. The CoV S glycoprotein is characterized by a complex of heptad-repeated regions (HR1 and HR2) . 
The amide groups at the N-terminus of HR2 are capped by Asn, which interacts with the amide group via ordered water molecules, which may be one of the influential factors that stabilize the S glycoprotein. 47 To examine the helix capping preferences, we measured the normalized frequencies of amino acids at helix capping in HCoV and nHCoV. We observed a slight difference in the mean normalized positional residue frequency at the helix termini N″ between HCoV and nHCoV. Although there was no large mean difference between HCoV and nHCoV for this property, we observed a larger difference in the normalized positional residue frequency at the helix termini N″ for the amino acids N, L, K, E, and G between HCoV and nHCoV as well as for R, P, K, S, and E between the S proteins of HCoV and nHCoV.

Relative Mutability. Dayhoff et al. calculated the relative mutability of amino acids, which indicates the probability of amino acid changes in a given small evolutionary interval. 48 The genome analysis of SARS-CoV-2 revealed that nearly 80% of the recurrent mutations produced nonsynonymous changes at the protein level, and these mutations are possible candidates for continuing adaptation of SARS-CoV-2 to its novel human host. 49 The genetic analysis of SARS-CoV-2 discovered various mutations and deletions in coding and noncoding regions. 50 The rapid mutations of SARS-CoV-2 play important roles in the virus spread. Hence, we measured the relative mutability of HCoV and nHCoV according to DAYM780201. 48 There was a slight difference in the mean relative mutability between HCoV and nHCoV, 3.77 ± 2.24 and 3.9 ± 2.85, respectively. A larger difference in the relative mutability of amino acids between HCoV and nHCoV was observed for N, E, S, L, and T, and differences in the relative mutability of amino acids between the S proteins of HCoV and nHCoV were observed for R, S, V, N, and P.
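The property values compared in these subsections are composition-weighted: each residue's frequency in a sequence is multiplied by that residue's value on the property scale. A minimal Python sketch of this encoding is given below. The function names are hypothetical, and `TOY_SCALE` is a made-up placeholder rather than an AAindex entry; real values for a property such as DAYM780201 would have to be taken from the AAindex database.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard residues in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {a: counts.get(a, 0) / len(seq) for a in AMINO_ACIDS}

def pcp_contributions(seq, scale):
    """Per-residue terms f(a_i) * PCP_n(a_i) for one property scale."""
    comp = composition(seq)
    return {a: comp[a] * scale[a] for a in AMINO_ACIDS}

def pcp_feature(seq, scale):
    """Feature value of one property: the sum of the per-residue terms."""
    return sum(pcp_contributions(seq, scale).values())

# Placeholder scale (residue index as value); NOT real AAindex data.
TOY_SCALE = {a: float(k) for k, a in enumerate(AMINO_ACIDS)}

if __name__ == "__main__":
    print(pcp_feature("ACCD", TOY_SCALE))
```

Keeping the per-residue terms separate from their sum makes it possible to ask which amino acids drive a difference between two groups, which is the kind of comparison reported for each property here.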
Furthermore, the mutations in the S proteins were compared with the reference strain SARS-like bat virus, which falls under nHCoV (SLCoVZXC21/2015). We observed 203 mutations in the S protein (PDBID:6ACJ) compared with the reference sequence (Table S3, Supplementary Data 3).

Furthermore, we performed a statistical analysis using the t-test to identify the amino acids that differ significantly for the six PCPs across HCoV and nHCoV. A value of p < 0.05 was considered statistically significant. A significant difference (p < 0.005) in van der Waals volume between HCoV and nHCoV was observed for the amino acids K, L, N, R, and V. The amino acids L, N, K, and E showed a significant difference in ΔΔG 0 values between HCoV and nHCoV. A significant difference in the amino acids for the normalized frequency of turn in α/β class between HCoV and nHCoV was observed for N, K, G, S, and Y. A significant difference in the normalized positional residue frequency at the helix termini N″ between HCoV and nHCoV was observed for the amino acids N, L, K, E, and G. A significant difference in the relative mutability of amino acids between HCoV and nHCoV was observed for the amino acids N, E, S, L, and T.

Additionally, the other six properties identified were the STERIMOL maximum width of the side chain (FAUJ880106), normalized frequency of the extended structure (TANS770103), molecular weight (FASG760101), turn propensity scale for transmembrane helices (MONM990101), normalized positional residue frequency at the helix termini Cc (AURR980116), and the relative preference value at C″ (RICJ880117). Their differences between HCoV and nHCoV are shown in Figure 1. The amino acid compositional preferences for the 11 PCPs between the S proteins of HCoV and nHCoV are shown in Figure S2 (Supplementary Data 1). Graphical representations of the five analyzed properties, FAUJ880103, ONEK900101, PALJ810116, AURR980102, and DAYM780201, are shown in Figure 2.
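The per-residue significance testing described above can be sketched as a two-sample t-test applied to the values of one amino acid in the two groups. The text does not say which t-test variant was used, so the Python sketch below assumes Welch's unequal-variance form; the function name and the sample values are hypothetical. Only the test statistic and degrees of freedom are computed, since converting them to a p-value needs the t-distribution CDF, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    x and y are the per-sequence values of one quantity (for example, one
    residue's property-weighted frequency) in the two groups.
    """
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each group needs at least two values")
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    se2 = vx / nx + vy / ny            # squared standard error of the mean difference
    if se2 == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

if __name__ == "__main__":
    group_a = [0.081, 0.079, 0.085, 0.090]  # made-up values
    group_b = [0.070, 0.072, 0.068, 0.075]  # made-up values
    print(welch_t(group_a, group_b))
```

Because 20 amino acids are tested for each property, a multiple-comparison correction may be appropriate when many such tests are run together; the text does not state whether one was applied.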
The comparison of the mutations between the S protein and the reference strain SARS-like bat virus is shown in Figure 3 . Furthermore, we ranked the amino acids based on their compositional preference differences between HCoV and nHCoV for the 11 PCPs. The amino acid rank is proportional to the compositional preference difference, meaning that the rank one amino acid has the greatest difference between HCoV and nHCoV. The amino acids that show compositional preference differences for the 11 PCPs between HCoV and nHCoV are shown in Figure 4 . The amino acids that show compositional preference differences for the 11 PCPs between the S proteins of HCoV and nHCoV are shown in Figure S3 (Supplementary Data 1). The identified 11 PCPs and their amino acid compositional preferences are reported in Table S4 (Supplementary Data ). Amino acid differences in different proteins could shed light on how SARS-CoV-2 is functionally and structurally different from humans and other organisms. Hence, the AAC differences were measured between HCoV and nHCoV. The maximum amino acid compositional differences between HCoV and nHCoV were obtained for L, K, and N with a ±2% composition difference and for E, H, R, M, Y, Q, G, V, S, and T with a ±1% composition difference, while the other amino acids did not show any differences, as shown in Figure 5A . The AAC differences for all of the amino acids are listed in Table 3 . When we compared the S proteins of HCoV and nHCoV, the maximum AAC difference was observed for the amino acid R with a 2% composition difference and for P, K, G, V, N, and V with a ±1% difference, as shown in Figure S4 (Supplementary Data 1). Dipeptides play an important role in folding and peptide binding. Therefore, DPCs were measured for the HCoV, nHCoV, and HCoV S proteins. The top five DPCs obtained for HCoV were LL, FL, LV, VL, and TL, while for nHCoV, they were LL, VN, NG, SV, and SL; for the HCoV S proteins, they were RR, VL, IA, SN, and SV. 
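The group-level comparison described above (mean composition per group, the difference between groups, and a ranking by the size of that difference) can be sketched in Python as follows. The function names and toy sequences are hypothetical, and ranking by absolute difference of group means is an assumption about how the ordering was produced.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mean_composition(seqs):
    """Average per-residue fraction over a group of sequences."""
    if not seqs:
        raise ValueError("empty group")
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    for s in seqs:
        counts = Counter(s.upper())
        for a in AMINO_ACIDS:
            totals[a] += counts.get(a, 0) / len(s)
    return {a: totals[a] / len(seqs) for a in AMINO_ACIDS}

def rank_differences(group_a, group_b):
    """Residues sorted by absolute difference in mean composition (largest first).

    Each item is (residue, mean_a - mean_b), so the sign shows which group
    is enriched for that residue.
    """
    ma, mb = mean_composition(group_a), mean_composition(group_b)
    diffs = {a: ma[a] - mb[a] for a in AMINO_ACIDS}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)

if __name__ == "__main__":
    # Toy sequences for illustration only; not entries from the dataset.
    ranked = rank_differences(["LLKNAL", "LKNNVL"], ["GSSVTA", "GASVTT"])
    for residue, diff in ranked[:5]:
        print(residue, round(100 * diff, 1), "%")
```

The same ranking logic applies to dipeptides if the 20 residue keys are replaced by the 400 dipeptide keys.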
A heatmap showing the differences in the DPC of HCoV and nHCoV is shown in Figure 5B . These AAC and DPC differences may be important factors for functional and pathogenic divergence of SARS-CoV-2. Heatmaps showing the DPCs of HCoV and nHCoV are shown in Figure S5A,5B (Supplementary Data 1) . Currently, substantial efforts are being made to develop therapeutic strategies 51−53 to eradicate the COVID-19 health crisis. Identifying the informative PCPs of COVID proteins could assist in vaccine design and COVID prevention. Due to the potential role of machine learning in solving many biological issues, it is considered a suitable tool for COVID-19 research. Hence, machine learning-based prediction models for COVID-19 are necessary to identify and analyze the important biomarkers for vaccine design. Here, to explore the PCPs of HCoV, we developed COVID-Pred for identification of valuable information of COVID proteins that could help in understanding their functions. A dataset consisting of protein sequences from 4320 HCoV and nHCoV was retrieved from the GISAID and NCBI databases. COVID-Pred was developed for the identification of informative PCPs and for the prediction of species-specific coronavirus proteins. COVID-Pred selected 11 PCPs and achieved 10-CV ACC, AUC, test ACC, and test AUC of 99.53%, 0.996, 97.80%, and 0.991, respectively, and obtained 100% (7/7) accuracy on an independent data set consisting of seven HCoV S protein sequences. Further analysis of five informative PCPs revealed that van der Waals forces, α-helices, frequencies of amino acids at α/β turns, helix capping, and mutability played some significant roles in differences between HCoV and nHCoV proteins. First, the characterization analysis of these informative PCPs revealed that there was a slight difference observed in the van der Waals volume between HCoV and nHCoV. 
Second, a difference in the ΔΔG 0 value was observed between the S proteins of HCoV and nHCoV in which the amino acids R, Y, V, and K showed a larger difference in ΔΔG 0 value compared to the other amino acids. Third, a larger difference in the amino acids for PALJ810116 was observed between HCoV and nHCoV for N, K, G, S, and Y. Fourth, we observed a larger difference in the normalized positional residue frequency at the helix termini N″ for the amino acids N, L, K, E, and G between HCoV and nHCoV as well as for R, P, K, S, and E between the S proteins of HCoV and nHCoV. Fifth, a larger difference in the relative mutability of amino acids between HCoV and nHCoV was observed for N, E, S, L, and T, whereas the relative mutability of amino acids between the S proteins of HCoV and nHCoV was observed for R, S, V, N, and P. The mutational analysis showed the mutations in the S proteins compared with the reference strain SARS-like bat virus, which falls under nHCoV (SLCoVZXC21/2015). Furthermore, we observed a difference in the AACs and DPCs between HCoV and nHCoV. The amino acid and dipeptide compositional differences for specific amino acids and dipeptides were also observed between HCoV and nHCoV. We believe that these findings could be helpful in understanding the functions of COVID proteins, which will be invaluable in designing vaccines. The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00156. 
Prediction performance of COVID-Pred evaluated using the ROC curve (Figure S1), comparison of PCPs between the S proteins of HCoV and nHCoV (Figure S2), normalized amino acid compositional preferences showing difference in the 11 PCPs between the S proteins of HCoV and nHCoV (Figure S3), amino acid compositional differences between the S proteins of HCoV and nHCoV (Figure S4), dipeptide compositional differences between HCoV and nHCoV (Figure S5), and 11 PCP values for the amino acids (AA index)

The authors declare no competing financial interest.

All the data used in this analysis can be found at the GISAID (https://www.gisaid.org) and the NCBI databases.

Virus Taxonomy; King
The Molecular Biology of Coronaviruses
SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
Neutralization of SARS-CoV-2 spike pseudotyped virus by recombinant ACE2-Ig
Molecular Interactions in the Assembly of Coronaviruses
The coronavirus nucleocapsid is a multifunctional protein
The M, E, and N Structural Proteins of the Severe Acute Respiratory Syndrome Coronavirus Are Required for Efficient Assembly, Trafficking, and Release of Virus-Like Particles
A structural analysis of M protein in coronavirus assembly and morphology
Assembly of the coronavirus envelope: homotypic interactions between the M proteins
The Membrane M Protein Carboxy Terminus Binds to Transmissible Gastroenteritis Coronavirus Core and Contributes to Core Stability
Coronavirus envelope protein: A small membrane protein with multiple functions
Absence of E protein arrests transmissible gastroenteritis coronavirus maturation in the secretory pathway
A new coronavirus associated with human respiratory disease in China
Difference in Receptor Usage between Severe Acute Respiratory Syndrome (SARS) Coronavirus and SARS-Like Coronavirus of Bat Origin
Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases
Potential inhibitors against 2019-nCoV coronavirus M protease from clinically approved medicines
Inheritable genetic algorithm for biobjective 0/1 combinatorial optimization problems and its applications
AAindex: amino acid index database, progress report
Identifying the miRNA signature associated with survival time in patients with lung adenocarcinoma using miRNA expression profiles
Identification and characterization of the lncRNA signature associated with overall survival in patients with neuroblastoma
Novel miRNA signature for predicting the stage of hepatocellular carcinoma
Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes
GREMA: Modelling of emulated gene regulatory networks with confidence levels based on evolutionary intelligence to cope with the underdetermined problem
GeNOSA: inferring and experimentally supporting quantitative gene regulatory networks in prokaryotes
A library for support vector machines
The WEKA data mining software: an update
Amino acid side chain parameters for correlation studies in biology and pharmacology
Van der Waals interactions involving proteins
Comparing the Binding Interactions in the Receptor Binding Domains of SARS-CoV-2 and SARS-CoV
An investigation into the identification of potential inhibitors of SARS-CoV-2 main protease using molecular docking study
Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2
Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
Influence of hydrophobic and electrostatic residues on SARS-coronavirus S2 protein stability: Insights into mechanisms of general viral fusion and inhibitor design
Tectonic conformational changes of a coronavirus spike glycoprotein promote membrane fusion
A thermodynamic scale for the helix-forming tendencies of the commonly occurring amino acids
Protein secondary structure. Studies on the limits of prediction accuracy
Characterization of a highly conserved domain within the severe acute respiratory syndrome coronavirus spike protein S2 domain with characteristics of a viral fusion peptide
Helix capping
Dissection of helix capping in T4 lysozyme by structural and thermodynamic analysis of six amino acid substitutions at Thr 59
Helix capping propensities in peptides parallel those in proteins
Influence of N-cap mutations on the structure and stability of Escherichia coli HPr
The folding of an enzyme. II. Substructure of barnase and the contribution of different interactions to protein stability
Central ions and lateral asparagine/glutamine zippers stabilize the post-fusion hairpin conformation of the SARS coronavirus spike glycoprotein
22 a model of evolutionary change in proteins. Atl. protein seq. struct
Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect
Genetic diversity and evolution of SARS-CoV-2. Infect. gene. evol
Immunological and inflammatory profiles in mild and severe cases of COVID-19
Structural plasticity of SARS-CoV-2 3CL Mpro active site cavity revealed by room temperature X-ray crystallography