key: cord-0908827-d8onhd04
authors: Mei, Hu; Sun, Lili; Zhou, Yuan; Xiong, Qing; Li, Zhiliang
title: Identification of encoding proteins related to SARS-CoV
date: 2004
journal: Chin Sci Bull
DOI: 10.1360/03wb0198
sha: 76265ef68bfb73401e97c8fe5d3a883e78a43401
doc_id: 908827
cord_uid: d8onhd04

By sampling 100 encoding proteins from SARS-coronavirus (SARS-CoV, NC_004718) and six other coronaviruses, and selecting 23 variables from 172 by stepwise multiple regression (SMR), a multiple linear regression (MLR) model was established with a modelling correlation coefficient R^2 = 0.645 and a cross-validation correlation coefficient R^2_CV = 0.375. After removing 4 outliers, the modelling and cross-validation correlation coefficients were R^2 = 0.743 and R^2_CV = 0.543, respectively.

The coronaviruses (order Nidovirales, family Coronaviridae, genus Coronavirus) are a diverse group of large, enveloped, positive-stranded RNA viruses. At ~30000 nucleotides, their genome is the largest found in any of the RNA viruses. The viruses can cause severe disease in many animals, and several of them, including infectious bronchitis virus, feline infectious peritonitis virus and transmissible gastroenteritis virus, are significant veterinary pathogens [1]. Human coronaviruses are found in both group 1 (HCoV-229E) and group 2 (HCoV-OC43) and are responsible for 30% of mild upper respiratory tract illnesses. The coronavirus associated with severe acute respiratory syndrome (SARS-CoV) has proved to belong to a new group and to be the causative agent of SARS [2, 3]. At present, research in molecular biology related to SARS-CoV is mainly focused on genome organization, virus replication, transcription, pathogenesis and protein structure prediction.
In this paper, stepwise multiple regression (SMR) is used to select 23 variables from 172 in order to establish a multiple linear regression (MLR) model. The 172 variables mainly describe the amino acid composition, physicochemical properties and 3-D structural properties of 100 encoding proteins from SARS-CoV and 6 other coronaviruses. From the established model, we can derive some overall characteristics that distinguish SARS-CoV from other coronaviruses; these can provide valuable information for protein recognition and genome approaches, and promote further research.

The characteristics that distinguish the encoding proteins of SARS-CoV from those of other coronaviruses are related not only to the amino acid sequence but also to amino acid composition, physicochemical properties and 3-D structural properties. These characteristics ultimately give rise to the functional differences between them. We therefore make a systematic study of the 172 variables. By stepwise multiple regression, a model is established from which we can deduce the variables that contribute most significantly to the functional difference between the encoding proteins of SARS-CoV and those of the other 6 coronaviruses.

A total of 100 encoding proteins, i.e. 30 from SARS-CoV (NC_004718) and 70 from the other 6 coronaviruses, are used as calibration samples (please refer to the supplemental materials for details). The latter 70 samples are randomly selected from 140 encoding proteins of the 6 coronaviruses. The 6 coronaviruses comprise Group 1: human coronavirus 229E (NC_002645), porcine epidemic diarrhea virus (NC_003436) and transmissible gastroenteritis virus (NC_002306); Group 2: bovine coronavirus (NC_003045) and murine hepatitis virus (NC_001846); and Group 3: avian infectious bronchitis virus (NC_001451) 1).
The original 172 variables mainly describe the amino acid frequency, molecular weight, ultraviolet absorption, hydrophobicity, bulk, electronic properties and 3-D structural properties of the encoding proteins. By stepwise multiple regression, 23 variables (available as supplemental materials) are selected to establish the calibration model (Table 1). The stepwise multiple regression and the leave-one-out cross-validation are performed with the SPSS 10.0 package and an in-house program, respectively.

For the two classes of encoding proteins under consideration, we use a categorical variable Y, which is set to 1 for the encoding proteins of SARS-CoV and to 0 for the others. The categorical variable, together with the original 172 variables, is then modeled by stepwise multiple regression. The results are shown as model 1 (Table 2). From the results of model 1, we can see that the model has both modeling robustness and predictive capability.

1) http://www.ncbi.nlm.nih.gov//entrez/query.fcgi?db=Genome

Table 1 The 23 independent variables used in the multiple linear regression modeling processes
V10: atomic weight ratio of hetero elements in end group to C in side chain [5] b)
V11: molar fraction (%) of 2001 buried residues [6] b)
V12: conformational parameter for beta-sheet [7] b)
V13: recognition factors [8]
a) Predicted by the Vector NTI Suite 8.0 package; b) the values of the variables are frequency-weighted means; c) natural logarithm of the frequency of amino acids (V14-V23).

Table 2 The summary results of multiple linear regression and cross validation procedures

When plotting the residuals against observation ID (Fig. 1), we find that the residuals of observations 19, 53, 74 and 79 are relatively large: the absolute standardized residuals of these 4 samples are larger than 2, i.e. the residuals exceed two standard deviations. We therefore identify these 4 observations as outliers.
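The calibration, leave-one-out and residual-screening steps described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the SPSS 10.0 or in-house code used for the paper. The data are synthetic, the three descriptors are hypothetical stand-ins for the 23 selected variables, and R^2_CV is computed here as 1 - PRESS/SS_tot, which is only an assumption about how the in-house program defines it.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least-squares fit with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients (intercept first) to a 2-D descriptor array."""
    return np.column_stack([np.ones(X.shape[0]), X]) @ beta

def r_squared(y, y_hat):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_r_squared(X, y):
    """Leave-one-out: refit without sample i, then predict sample i."""
    n = len(y)
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta = fit_ols(X[keep], y[keep])
        y_cv[i] = predict(beta, X[i:i + 1])[0]
    return r_squared(y, y_cv)

def standardized_residuals(y, y_hat, n_params):
    """Residuals divided by the standard error of the estimate."""
    res = y - y_hat
    s = np.sqrt(np.sum(res ** 2) / (len(y) - n_params))
    return res / s

# Synthetic stand-in data: 30 samples with Y = 1 and 70 with Y = 0,
# mirroring the class sizes in the text. The descriptors are invented.
rng = np.random.default_rng(0)
y = np.array([1.0] * 30 + [0.0] * 70)
X = rng.normal(size=(100, 3)) + 0.8 * y[:, None]

beta = fit_ols(X, y)
y_hat = predict(beta, X)
r2 = r_squared(y, y_hat)
r2_cv = loo_r_squared(X, y)

# Flag samples whose absolute standardized residual exceeds 2.
std_res = standardized_residuals(y, y_hat, n_params=X.shape[1] + 1)
outliers = np.flatnonzero(np.abs(std_res) > 2.0)

# Treat a calculated value of 0.5 or more as a SARS-CoV prediction.
accuracy = np.mean((y_hat >= 0.5) == (y == 1.0))

print(f"R^2 = {r2:.3f}, R^2_CV = {r2_cv:.3f}, accuracy = {accuracy:.2f}")
print("flagged observations (0-based):", outliers)
```

With this definition the cross-validated coefficient can never exceed the fitted one, because each left-out prediction error is at least as large in magnitude as the corresponding fitted residual. This matches the pattern of the reported values (0.645 against 0.375, and 0.743 against 0.543).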
To measure the influence of each point on the regression fit, we plot Cook's distance against the centered leverage values (Fig. 2), from which we can see that observation 28 has both high leverage and high influence. Its high leverage gives it extra weight in the computation of the regression line, and its high influence indicates that it did affect the slope of the regression line. We therefore down-weight this influential point by using a weighting variable (0.5). After removing the 4 outliers and giving observation 28 less weight, we rebuild the multiple linear regression model; the results are shown as model 2 (Table 2). As anticipated, the robustness and predictive capability of model 2 are superior to those of model 1. The percentage of correct prediction for SARS-CoV and other 6 coronaviruses [...]

From the partial correlation coefficients, we can obtain some useful information, i.e. the direction and strength of the relationships between the X and Y variables. In model 2, there are 4 independent variables whose partial correlation coefficients are larger than 0.6: the atomic weight ratio of hetero elements in end group to C in side chain (V10), the conformational parameter for beta-sheet (V12), and the natural logarithms of the frequencies of leucine (V18) and proline (V20). The high partial correlation coefficients indicate that these 4 variables have a stronger influence on distinguishing the encoding proteins of SARS-CoV from those of the other 6 coronaviruses. The positive correlation indicates that higher values of these 4 variables are associated with encoding proteins of SARS-CoV. Notably, the absolute value of the partial correlation coefficient of V20 (the natural logarithm of the frequency of proline) is the largest among the 23 variables. As we know, proline is a very particular amino acid.
When it forms an amide linkage (peptide bond) with other amino acids, it can readily adopt the cis conformation, which can increase the interactions among groups and at the same time influence the conformation of the backbone. In protein denaturation and renaturation, as well as in the folding of the peptide chain, the conversion from the trans to the cis conformation is a rate-limiting step.

What about glycine? Glycine is the only amino acid with 2 hydrogen atoms attached to its alpha-carbon atom, so it has fewer interactions among groups and less steric hindrance. Interestingly, the natural logarithm of the frequency of glycine is also included in model 2, with a partial correlation coefficient of -0.45. Further investigation is required to explain this.

Note: calculated values equal to or larger than 0.5 are regarded as predicted correctly; otherwise they are regarded as predicted incorrectly.

There are 3 variables in model 2 whose partial correlation coefficients are less than -0.50: the weight percentage of charged amino acids (V3), the weight percentage of hydrophobic amino acids (V6), and the recognition factors (V13). The negative correlation indicates that higher values of these 3 variables make a protein less likely to be an encoding protein of SARS-CoV. The partial correlation coefficient of V3 indicates that a low weight percentage of charged amino acids, i.e. R, K, H, Y, C, D and E, is associated with encoding proteins of SARS-CoV. However, the partial correlation coefficients of V4, V5 and V17 indicate that a high weight percentage of acidic amino acids (D, E), basic amino acids (K, R) and H is associated with encoding proteins of SARS-CoV.
The only reasonable explanation is that a high weight percentage of Y and C has a strongly negative impact on the tendency to be an encoding protein of SARS-CoV, which masks the influence of D, E, K, R and H. Cysteine (C) is another particular amino acid: it can influence the conformation of a protein by forming disulfide bonds. Furthermore, cysteine has relatively high reactivity, which can affect the biological properties of proteins.

From model 2, we can obtain the overall characteristics that differentiate the encoding proteins of SARS-CoV from those of the other 6 coronaviruses. However, further investigation is required for more detailed information.

In addition, we compare our results with those of gene recognition for SARS-CoV BJ01 [9]. In the genome of SARS-CoV BJ01 there are 35 open reading frames (ORFs), of which 14 are identified; the other 21 are not confirmed and may be new genes. Table 3 lists the results of 3 gene recognition approaches, i.e. Heuristic models, Gene Identification and ORF Finder, together with those of the established model 2.

Characterization of a novel coronavirus associated with severe acute respiratory syndrome
Identification of a novel coronavirus in patients with severe acute respiratory syndrome
A novel coronavirus associated with severe acute respiratory syndrome
Peptide quantitative structure-activity relationships, a multivariate approach
Amino acid difference formula to help explain protein evolution
Surface and inside volumes in globular proteins
Conformational preferences of amino acids in globular proteins
Prediction of the secondary structure and functional sites of major histocompatibility complex molecules
SARS--the Severe Acute Respiratory Syndrome