key: cord-1054090-pmtcor3r
authors: Alkady, Walaa; ElBahnasy, Khaled; Leiva, Víctor; Gad, Walaa
title: Classifying COVID-19 based on amino acids encoding with machine learning algorithms
date: 2022-03-15
journal: Chemometr Intell Lab Syst
DOI: 10.1016/j.chemolab.2022.104535
sha: 8285072640c8dcea9fb89caf5858f5462421b1ca
doc_id: 1054090
cord_uid: pmtcor3r

COVID-19 disease causes serious respiratory illnesses. Therefore, accurate identification of the viral infection cycle plays a key role in designing appropriate vaccines. The risk of this disease depends on proteins that interact with human receptors. In this paper, we formulate a novel model for COVID-19 named “amino acid encoding based prediction” (AAPred). This model is accurate, classifies the various coronavirus types, and distinguishes SARS-CoV-2 from other coronaviruses. With the AAPred model, we reduce the number of features to enhance its performance by selecting the most important ones employing statistical criteria. The protein sequence of SARS-CoV-2 is analyzed for understanding the viral infection cycle. Six machine learning classifiers related to decision trees, k-nearest neighbors, random forest, support vector machine, bagging ensemble, and gradient boosting are used to evaluate the model in terms of accuracy, precision, sensitivity, and specificity. We implement the obtained results computationally and apply them to real data from the National Genomics Data Center. The experimental results report that the AAPred model reduces the number of features to seven. The average accuracy of the 10-fold cross-validation is 98.69%, precision is 98.72%, sensitivity is 96.81%, and specificity is 97.72%. The features are selected utilizing information gain and classified with random forest. The proposed model predicts the type of coronavirus and reduces the number of extracted features. We identify that SARS-CoV-2 has physicochemical characteristics similar to those of SARS-CoV in some regions.
Also, we report that SARS-CoV-2 has infection cycles and sequences similar to those of SARS-CoV in some regions, indicating the effect of vaccines on SARS-CoV-2. A comparison with deep learning showed results similar to those of our method.

Coronavirus crosses the species barrier and infects humans with very dangerous diseases [12, 13]. Public health is challenged due to the novelty of the antigen for the human host. Due to the danger of COVID-19, many methods have been introduced to develop models as well as to diagnose and differentiate it from other viruses. In [14], eighteen radiological semantic features and seventeen clinical features were selected from 67 features (41 radiological and 26 clinical features). Also, univariate analysis methods and the least absolute shrinkage and selection operator (LASSO) were used for feature selection, and the classification performance was evaluated based on accuracy, sensitivity, and specificity measures. Using clinical features, accuracy, sensitivity, and specificity reached 88.8%, 89.2%, and 88.4%, respectively. Employing radiological features, accuracy was 92.9%, sensitivity was 99%, and specificity was 85.1%. Utilizing both types of features, accuracy, sensitivity, and specificity were 95.9%, 96.1%, and 95.7%, respectively.

In [15], the infection risk of non-human-origin coronavirus is classified considering the spike protein

In [17, 18], a diagnostic model based on X-ray images was proposed. In this model, related features to

Note that the selection of human monoclonal antibodies may identify immunodominant antigenic sites associated with neutralization and provide reagents for stabilizing and solving the structure of viral surface proteins [19]. Understanding the structural basis of SARS-CoV-2 can guide the selection of vaccine targets.
Machine learning methods and artificial intelligence [20, 21, 45] can be used to analyze, predict, and diagnose infection rates. In addition, machine learning helps in the drug discovery process for a vaccine. To the best of our knowledge, there are no studies that accurately differentiate between SARS-CoV-2 and other coronavirus types with a small number of features. Therefore, machine learning models are needed to make this differentiation.

As mentioned in [44], differentiating between coronavirus types is helpful for designing an effective vaccine. Moreover, the mutation pattern of COVID-19 can be partially predicted after further analysis of the differences between the protein sequences of different strains.

In this section, all abbreviations and symbols employed in this paper are defined in Table 1 in alphabetical order. One of the used abbreviations is the standard one-letter abbreviation of the twenty amino acids found in [22].

and specificity.
• To reduce the number of features of the novel model to enhance its performance.
• To analyze the protein sequence of SARS-CoV-2 for understanding the viral infection cycle.
• To implement the obtained results computationally.
• To apply the results to real-world data, in our case using the NGDC database.

• Analysis of variance (ANOVA) test [26], and
• Chi-square (χ²) statistic test [27].

The number of extracted features is reduced by removing irrelevant features. Specifically, protein sequence analysis requires feature extraction to convert sequence characters to a numerical form using amino acid en-

In summary, as shown in Figure 1, the AAPred model consists of three phases:
• Feature extraction,

In the feature extraction phase, the amino acid encoding method is used to obtain features from the viral protein sequences. The amino acid encoding method employs two physicochemical properties of the amino acids: volume and dipole [30].
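As a rough illustration of the feature extraction phase, the following Python sketch maps each residue of a protein sequence to a class and returns the relative frequency of each class. The seven-class grouping used here is an assumption: it is the grouping commonly reported for volume and dipole based encodings, and it may differ from the grouping in Table 2, which is not reproduced in this text. The names `AA_CLASSES` and `class_frequencies` are hypothetical.

```python
from collections import Counter

# ASSUMPTION: a seven-class grouping often used with volume/dipole encodings.
# The paper's own Table 2 is not available here, so this mapping is illustrative.
AA_CLASSES = {
    1: "AGV",
    2: "ILFP",
    3: "YMTS",
    4: "HNQW",
    5: "RK",
    6: "DE",
    7: "C",
}

# Invert the table into a residue -> class lookup.
RESIDUE_TO_CLASS = {
    aa: cls for cls, residues in AA_CLASSES.items() for aa in residues
}


def class_frequencies(sequence):
    """Return the relative frequency of each class in a protein sequence.

    Characters outside the twenty standard amino acids (for example 'X')
    are skipped. An empty or fully unknown sequence yields all zeros.
    """
    counts = Counter(
        RESIDUE_TO_CLASS[aa] for aa in sequence.upper() if aa in RESIDUE_TO_CLASS
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AA_CLASSES)
    return [counts.get(cls, 0) / total for cls in sorted(AA_CLASSES)]


if __name__ == "__main__":
    # Hypothetical short peptide, not taken from the NGDC data.
    print(class_frequencies("MFVFLVLLPLVSSQC"))
```

Each sequence then becomes a fixed-length numeric vector regardless of its length, which is what allows sequences of 21 to several thousand residues to share one feature space.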
The volume and dipole values are calculated utilizing molecular modeling and density-functional methods [31]. The calculated volumes and dipole values divide the twenty amino acids into seven classes as shown in Table 2. According to Table 2, the dipole scale varies between 1.0 and 3.0 as follows:
• "+′+′+′": dipole value is greater than 3.0 with opposite orientation.

Note that the volume scale "-" means that the volume is less than 50; "+" means that the volume is greater than 50; and the cysteine (C) amino acid is moved from Class 3 to Class 7 because it can form disulfide bonds.

where ( ) given in (4)

where MI(f1, f2) stated in (5) is the mutual information for f1 and f2, IG(d, f1) is the information gain for f1, and IG(f1|f2) is the conditional information gain for f1 given f2. Therefore, we have that

The ANOVA is a statistical test that studies the variance of the mean sum of squares in different groups. The ANOVA is constructed from the expressions stated as

where MMS within defined in (8) is the mean sum of squares within groups, with ̅ and G being defined as in (5), ̅ is the mean of all features, S is the total number of samples, and (S - G) corresponds to the degrees of freedom within groups. Therefore, the statistic used to compare groups is given by

$$\mathrm{ANOVA} = \mathrm{MMS}_{\text{between}} / \mathrm{MMS}_{\text{within}} \qquad (9)$$

Note that the ANOVA establishes the variation between the frequency of each amino acid class with respect to the variation of all virus sequences.

The χ² test is applied to the groups of features to evaluate the association between them using the corre-

$$\chi^2 = \sum_{i=1}^{G} \frac{(O_i - E_i)^2}{E_i} \qquad (10)$$

where O_i and E_i are the observed and predicted (expected) frequencies in class i, respectively, with G, as mentioned in (7), being the number of groups.
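The two test statistics described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the textbook one-way ANOVA with G - 1 degrees of freedom between groups and S - G within groups, and the textbook Pearson χ² sum over classes. The function names `anova_f` and `chi_square` and the sample values are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA statistic: MMS_between / MMS_within.

    `groups` is a list of lists; each inner list holds the values of one
    feature (e.g. the frequency of one amino acid class) for one group.
    """
    G = len(groups)                      # number of groups
    S = sum(len(g) for g in groups)      # total number of samples
    grand_mean = sum(sum(g) for g in groups) / S
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    mms_between = ss_between / (G - 1)   # assumed degrees of freedom: G - 1
    mms_within = ss_within / (S - G)     # assumed degrees of freedom: S - G
    return mms_between / mms_within


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum over classes of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    # Hypothetical toy values, not from the paper's dataset.
    print(anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
    print(chi_square([10, 20, 30], [20, 20, 20]))
```

In a feature selection setting, a larger statistic for a feature suggests a stronger association with the class label, so features can be ranked by these values and the top ones retained.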
All these feature selection methods, that is, IG, ANOVA, and the χ² test, are efficient in selecting features on

In the classification phase, the significant features are used to classify the coronaviruses into COVID-19 and non-COVID-19 employing the BE, DT, GB, KNN, RF, and SVM algorithms. Prediction of coronavirus types is considered a binary classification problem.

The BE algorithm [42] utilizes a base classifier for a given number of iterations. The BE classifier applies weights to the training set to help the next iteration of the classifier achieve better performance.

The DT classifier splits the dataset based on the entropy by means of the expression stated as

where Ent(d) given in (11)

The GB algorithm [43] aims to maximize the results of the classifier correlated with the negative gradient of the loss function.

The KNN classifier is applied using the Manhattan distance (MD), which is calculated as the sum of the absolute values of the differences between the samples and is defined as

$$\mathrm{MD}(x, y) = \sum_{i=1}^{k} |x_i - y_i| \qquad (13)$$

where k stated in (13) is the number of features (dimensions), and x, y are two samples (vectors).

The RF algorithm is an ensemble classifier that combines several classifiers. The RF method consists of multiple DTs, each of which works as a classifier to predict the class label, and then the majority vote of these trees' outputs is utilized to predict the class label.

The SVM algorithm is applied using different kernel functions: linear, hyperbolic tangent or sigmoid, polynomial, and radial basis function (RBF) [34]. The simplest kernel is the linear function, which is computed by the inner product of the feature vector and the class label vector, expressed as

$$\mathrm{LK}(f, y) = (f \cdot y) + g \qquad (14)$$

where LK(f, y) is the linear kernel, f is the feature vector, y is the class label vector, "." is the inner product, and g is a free general parameter (constant).
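The KNN step with the Manhattan distance can be sketched as follows. This is a minimal illustration and not the authors' code; the function names, the toy feature vectors, and the choice of three neighbors are assumptions. The parameter is named `n_neighbors` to avoid confusion with k, which the text uses for the number of features.

```python
from collections import Counter


def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train_X, train_y, query, n_neighbors=3):
    """Predict the label of `query` by majority vote among its nearest samples."""
    # Rank training samples by Manhattan distance to the query.
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: manhattan_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-feature samples, not taken from the NGDC data.
    X = [[0, 0], [0, 1], [5, 5], [6, 5]]
    y = ["non-COVID-19", "non-COVID-19", "COVID-19", "COVID-19"]
    print(knn_predict(X, y, [5, 4]))
```

With only seven features per sequence, a brute-force scan like this is easy to follow, although a library implementation would be the practical choice for a dataset of this size.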
The polynomial kernel function is calculated as

where PK(f, y) given in (15) is the polynomial kernel, f, y, g are defined as in (14), and q is the polynomial

where RK(f, y) given in (17) is the radial basis kernel, f, y, g are defined as in (14)

As mentioned, the proposed model is evaluated using the NGDC dataset that is described by two types of files:

Center for Biotechnology Information (NCBI) coronavirus dataset [35].

For COVID-19, the minimum protein sequence length is 21 amino acids, and the maximum protein sequence length is 7,097 amino acids. For non-COVID-19, the minimum protein sequence length is 26 amino acids, and the maximum protein sequence length is 7,247 amino acids. The CSV file has data about the viruses' protein sequences, such as accession numbers, collection date of protein sequences, species, genus, family, protein sequence length, isolation source, host, and geographical location, as shown in Table 4. The FASTA file contains the protein sequences with the accession number and virus type as a header for each protein sequence, as shown in Figure 2.

where TP is given in (18) and FP in (19) is the total number of false positives. The summation of TP and FP represents the total number of protein sequences predicted as COVID-19 (predicted as presence of the virus).

where TN is given in (18) and FP stated in (21)

In the dataset, protein sequences are categorized into two classes: COVID-19 and non-COVID-19. The 106,776 protein sequences are divided into 80% for training and 20% for testing. Moreover, all protein sequences were evaluated using 10-fold cross-validation. Figure 3 shows the frequencies of the eight amino acid classes for the samples in the dataset. In Figure 3, the x-axis represents the eight classes of amino acids, and the y-axis is the frequency of each amino acid class in each sample.

The accuracy of the SVM classifier utilizing χ² is 94.15%.
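The four evaluation measures reported in this section can be computed from confusion-matrix counts as sketched below. The formulas are the standard textbook definitions and are assumed to match the expressions (18) to (21), which are not reproduced in this text. The function name, the helper `_ratio`, and the counts in the example are hypothetical.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, sensitivity, and specificity.

    tp: COVID-19 sequences predicted as COVID-19
    fp: non-COVID-19 sequences predicted as COVID-19
    tn: non-COVID-19 sequences predicted as non-COVID-19
    fn: COVID-19 sequences predicted as non-COVID-19
    """
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in classification_metrics(tp=90, fp=10, tn=80, fn=20).items():
        print(f"{name}: {value:.4f}")
```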
In addition, the DT classifier reaches an accuracy of 94.39%, a precision of 90.48%, a sensitivity of 96.31%, and a specificity of 90.48% using ANOVA. The accuracy of the RF classifier with IG is 98.69%, and this is the maximum accuracy reached.

The proposed model performance using 10-fold cross-validation based on IG, ANOVA, and the χ² test is shown in Figures 4, 5, and 6

tively. Figure 7 shows the performance of the RF classifier based on the spike protein dataset using the three selection methods: IG, ANOVA, and χ² test.

The proposed model is evaluated using six different classifiers of machine learning algorithms [20, 28, 29]

To verify the proposed AAPred model performance, we compare this model with the performance of two different models: LASSO [14] and AAC [15]

3.58 seconds compared to Qiang et al. [15] that has twenty features as shown in Table 8.

The results report that SARS-CoV-2 has infection cycles and sequences similar to those of the SARS-CoV proteins in some regions, indicating the effect of vaccines on SARS-CoV-2.

In addition, the 10-fold cross-validation method was used. Moreover, after studying the polarity and dipole values of the protein sequence of the coronavirus, we conclude that the COVID-19 protein sequence has a high level of amino acids of the second class, which contains nonpolar amino acids, compared to the non-COVID-19 protein sequence.

We choose machine learning rather than deep learning methods because machine learning techniques can reach excellent results on our dataset. Indeed, it has only seven features and is not as complicated as image classification, for example. In any case, we applied a generative adversarial network, which resulted in an accuracy of around 98%, similar to that of the random forest classifier.
A limitation of the proposed model is that it relies only on protein sequence analysis and does not consider other aspects such as the protein structure of the virus or its genome sequence. As further research directions, we aim to apply the same model using different feature extraction methods according to the sequence and the structure of the proteins to obtain more detailed biological information about the virus behavior and its infection cycle. Other methods will also be explored in future studies, such as principal components analysis and its new derivations, including supervised and unsupervised approaches, as well as functional data analysis, partial least squares structures, and other recent methodologies [36] [37] [38] [39] [40] [41] [46] [47] [48] [49].

Author statement: All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript.

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A statistical analysis for the epidemiological surveillance of COVID-19 in Chile
World Health Organization. WHO announces COVID-19 outbreak a pandemic
Structure and expression of large (+)RNA genomes of viruses of higher eukaryotes
Committee on Taxonomy of Viruses
Structure, function, and evolution of coronavirus spike proteins
Middle East respiratory syndrome coronavirus: another zoonotic betacoronavirus causing SARS-like disease
Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China
Structure, function, and evolution of coronavirus spike proteins
Ratification vote on taxonomic proposals to the International Committee on Taxonomy of Viruses
A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence
A diagnostic model for coronavirus disease based on radiological semantic and clinical features: a multi-center study
Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
The 2019 novel coronavirus resource
IKONOS: An intelligent tool to support diagnosis of COVID-19
Leiva. Machine learning techniques as an efficient alternative diagnostic tool for COVID-19 cases
Coronavirus biology and replication: implications for SARS-CoV-2
Fundamentals of Pattern Recognition and Machine Learning
tion student retention based on data mining: Machine learning algorithms and case study in Chile
The DDBJ/ENA/GenBank Feature Table Definition. International Nucleotide Sequence Database Collaboration
Application of machine learning approaches for protein-protein interactions prediction
A comparative study of feature selection approaches
Gabor feature selection based on information gain
Case study using analysis of variance to determine groups' variations. MATEC Web of Conferences
Seven proofs of the Pearson chi-squared independence test and its graphical interpretation
Structural, Syntactic, and Statistical Pattern Recog-
Swarm intelligence optimization for feature selection of biomolecules
Prediction of protein-protein interaction by metasample-based sparse representation
Density functional theory in the solid-state
Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms
Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms
A systematic literature review on support vector machines applied to Clas- Engineering International Research Conference (EIRCON) 2020
ponent analysis by particle swarm optimization with an environmental application for data science. Res Risk Assess
Wilcoxon and Mann-Whitney tests for functional data: An approach based on random projections. Mathematics 2021
Estimating the covariance matrix of the coefficient estimator in multivariate partial least squares regression with chemical applications. Chemometrics and Intelligent Laboratory Systems 2021
Cross-predicting essential genes between two model eukaryotic species using machine learning
COVIDomic: A multi-modal cloud-based platform for identification of risk factors associated with COVID-19 severity
Anti-COVID-19 activity of some benzofused 1,2,3-triazolesulfonamide hybrids using in silico and in vitro analyses
Bagging and boosting ensemble classifiers for classification of multispectral, hyperspectral and PolSAR data: A comparative evaluation
Gradient boosting machines: A tutorial
Features, evaluation, and treatment of Coronavirus (COVID-19). StatPearls
Overview of explainable artificial intelligence for prognostic
Abnormality Detection and Failure Prediction Using Explainable Bayesian Deep Learning: Methodology and Case Study with Industrial Data. Mathematics 2022
On a partial least squares regression model for asymmetric data with a chemical application in mining
A new clustering algorithm based on a radar scanning strategy with applications to machine learning data
A new approach to predicting cryptocurrency returns based

*Correspondence author 1: victor.leiva@pucv.cl or victorleivasanchez@gmail