key: cord-0901105-qqsoodgz
authors: Arslan, Hilal
title: COVID-19 prediction based on genome similarity of human SARS-CoV-2 and bat SARS-CoV-like coronavirus
date: 2021-09-08
journal: Comput Ind Eng
DOI: 10.1016/j.cie.2021.107666
sha: f6f096db98b3ed60c26b5fa7d8990ea9c84a6372
doc_id: 901105
cord_uid: qqsoodgz

This paper proposes an efficient and accurate method to predict coronavirus disease 2019 (COVID-19) based on the genome similarity of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and a bat SARS-CoV-like coronavirus. We introduce similarity features to distinguish COVID-19 from other human coronaviruses by comparing human coronaviruses with a bat SARS-CoV-like coronavirus. In the proposed method, each human coronavirus sequence is assigned three similarity scores that account for nucleotide similarities and for mutations that lead to the strong absence of cytosine and guanine nucleotides. Next, the proposed features are integrated with CpG island features of the genome sequences to improve COVID-19 prediction. Thus, each genome sequence is represented by five real numbers. We demonstrate the effectiveness of the proposed features using six machine learning classifiers on a dataset including the genome sequences of human coronaviruses similar to SARS-CoV-2. The performances of the machine learning classifiers are close to each other, and the k-nearest neighbor classifier with similarity features achieves the best results with an accuracy of 99.2%. Moreover, the k-nearest neighbor classifier with the integration of CpG-based and similarity features achieves an accuracy of 99.8%. Experimental results demonstrate that similarity features remarkably decrease the number of false negatives and significantly improve the overall performance. The superiority of the proposed method is also highlighted by comparison with state-of-the-art studies detecting COVID-19 from genome sequences.

(RT-PCR) (Anika et al., 2020).
Some studies have criticized the sensitivity and accuracy of the RT-PCR test, reporting that it suffers from a large number of false negatives and false positives (Udugama et al., 2020; Holshue et al., 2020; Silva et al., 2020; F. Jiang et al., 2020) due to the mutation of SARS-CoV-2 (Ai et al., 2020). Many machine learning approaches have recently been published to detect COVID-19 (Zoabi, Deri-Rozov, & Shomron, 2021; Muhammad et al., 2021; Ardabili et al., 2020; Tayarani N., 2021; Kushwaha et al., 2020; De Felice & Polimeni, 2020). These methods identify COVID-19 mostly by analyzing computed tomography (CT) scans, chest X-ray images, and whole genome sequences (Annarumma et al., 2019; Udugama et al., 2020; Arslan, 2021). Studies detecting COVID-19 from CT scans or X-ray images have some limitations. For instance, medical images may not distinguish COVID-19 from other viral pneumonia, since they share several common features (Udugama et al., 2020; Zargari Khuzani, Heidari, & Shariati, 2021). The absence of abnormalities on a chest X-ray or CT scan does not guarantee the absence of SARS-CoV-2. Furthermore, there is a lack of varied annotated images that can be used in experiments in image-based analysis (Mohamadou, Halidou, & Kapen, 2020).

Molecular techniques have attracted attention recently, as they can easily track specific pathogens and coronavirus genes. Hence, they have produced more satisfactory results compared to CT scans. Machine learning algorithms perform remarkably well in analyzing genome sequences. In these techniques, it is critical to identify features that distinguish COVID-19 from other coronaviruses. Many evolutionary and population-based methods may be used for choosing the most relevant features discriminating COVID-19 (L. Abualigah, Yousri, et al., 2021; L. Abualigah & Diabat, 2021; L.
Abualigah, Diabat, Mirjalili, Abd Elaziz, & Gandomi, 2021; L. M. Q. Abualigah, 2019; Too & Mirjalili, 2021) and for modelling the potential effect of the coronavirus (Salgotra, Gandomi, & Gandomi, 2020).

Regarding the origin of the SARS-CoV-2 virus, some studies point out that bats might be the original host of SARS-CoV-2 (Lu et al., 2020; Wu et al., 2020; X. . Zhou et al. (2020) showed that SARS-CoV-2 shares overall genome sequence similarity with a bat coronavirus, to which it is 96% identical. Taking advantage of this similarity, in this study we propose an accurate and efficient COVID-19 detection method based on the genome similarity between a bat SARS-like coronavirus and SARS-CoV-2. We define three similarity features by comparing hu-

The remaining parts of the study are organized as follows. In Section 2, we summarize studies that have diagnosed COVID-19. In Section 3, we explain the proposed method and summarize the machine learning techniques used in this study. In Section 4, we evaluate experimental results and compare them with state-of-the-art methods. Finally, Section 5 presents the main conclusions and points out future directions.

In this section, we briefly explain the studies diagnosing COVID-19 in three categories: studies using laboratory findings and common symptoms, studies using CT scans and chest X-ray images, and studies using whole genome sequences. Jiang et al. (2020)

These studies pointed out that diagnosis of COVID-19 using clinical data or symptoms achieves a constrained diagnosis efficiency.

Mohamadou et al. (2020) and Shi et al. (2021) reviewed artificial intelligence methods for predicting and managing COVID-19. In these techniques, X-ray and CT scans have mostly been used. Akram et al. (2021) cases from other types of pneumonia.
They used 420 images and created an optimum set of X-ray image features by applying a dimensionality reduction method. In the classification step, they used a multilayer neural network and achieved satisfactory results.

On the other hand, DNA sequencing techniques for identifying SARS-CoV-2 are highly useful for monitoring coronavirus genes that change frequently as the disease passes from one person to another, as well as for understanding the behaviour of the virus (Nawaz, Fournier-Viger, Shojaee, & Fujita, 2021; Annarumma et al., 2019; Udugama et al., 2020; Arslan, 2021). The DNA sequencing methods fall into two categories: alignment-based and alignment-free methods. Alignment-based methods are preferred when the dataset is small, and they become expensive when the dataset is large. Machine learning algorithms are effectively used for analyzing large numbers of genome sequences.

database. In their method, they first converted the sequences into discrete numeric values using Chaos Game Representation (CGR) (Jeffrey, 1990), which was based on two-dimensional k-mers. In their study, the value of k was selected as 7. After they performed discrete

After the feature extraction step, any machine learning classifier may be used to determine SARS-CoV-2 sequences. In the classification step, we recommend us-

7: ratioCG ← compute ratio of CG nucleotide in S
8: CpG_1 = ratioC + ratioG
9: CpG_2 = ratioCG / (ratioC × ratioG)
10: end for
Parameter Tuning:
11: Determine hyperparameters to be tuned for the classifier and perform grid search
Classification Step:
12: Extract CpG_1, CpG_2, sim1, sim2, and sim3 features for testSeq
13: Apply the machine learning classifier (KNN is suggested) with its optimum hyperparameters
14: Determine whether testSeq is SARS-CoV-2 or not

Bennett, & Wills, 2016; Das, Mishra, & Saraswathy Gopalan, 2020), and we employ the grid search to determine the optimum value of the hyperparameters.
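The two CpG-based features in the feature extraction steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the author's implementation: the function name `cpg_features` is hypothetical, and it assumes that every ratio is a count divided by the sequence length and that the CG ratio counts adjacent "CG" dinucleotides, neither of which is spelled out in the surviving text.

```python
from typing import Tuple


def cpg_features(seq: str) -> Tuple[float, float]:
    """Return (CpG_1, CpG_2) for one genome sequence.

    Assumption: each ratio is a raw count divided by the sequence length.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("sequence must not be empty")

    ratio_c = s.count("C") / n    # fraction of cytosine
    ratio_g = s.count("G") / n    # fraction of guanine
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    ratio_cg = s.count("CG") / n  # fraction of CG dinucleotides (assumed definition)

    cpg1 = ratio_c + ratio_g      # CpG_1 = ratioC + ratioG
    if ratio_c == 0 or ratio_g == 0:
        cpg2 = 0.0                # guard (an assumption): avoid division by zero
    else:
        cpg2 = ratio_cg / (ratio_c * ratio_g)  # CpG_2 = ratioCG / (ratioC x ratioG)
    return cpg1, cpg2


if __name__ == "__main__":
    print(cpg_features("ACGTCGAA"))
```

Together with the three similarity scores, these two values would form the five-number representation of a sequence that is passed to the classifier.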
In the following subsection, we briefly explain these classifiers and the hyperparameters that are tuned.

We apply several types of classifiers to predict COVID-19, and in this section we briefly summarize them.

SVM (Burges, 1998; Vapnik, 1995) separates the data samples by finding a hyperplane maximizing the margin, which is the smallest distance between two classes. The hyperplane can be determined as a solution to the following problem:

The RBF is the most commonly used kernel function to classify multi-dimensional data. Moreover, the linear kernel is a special case of the RBF (Keerthi & Lin, 2003). Furthermore, the RBF requires fewer parameters to set than the polynomial kernel, and its performance is similar to that of the other kernel functions (Lin, Ying, Chen, & Lee, 2008). For these reasons, in this study we used the RBF kernel function given in Equation 2. Grid search is used to determine the optimum parameters, and we follow a strategy similar to that of Duarte & Wainer (2017) to determine the possible values $C \in \{2^{-5}, 2^{-1}, 2^{3}, 2^{9}, 2^{15}\}$ and $\gamma \in \{2^{-15}, 2^{-9}, 2^{-5}, 2^{-1}, 2^{3}\}$.

They concluded that $L_1$-type metrics such as the Manhattan and Chebyshev metrics have the highest accuracy. In this study, we perform the grid search to determine the optimum k value and the distance measure. The possible metric types are set to Manhattan and Chebyshev (Arslan & Arslan, 2021), and the k parameter is varied from 1 to 30 in the grid search.

The basic idea behind DT (Safavian & Landgrebe, 1991; Aha, Kibler, & Albert, 1991; Bishop, 2006) is to divide

(Ho, 1998; Breiman, 1996). We note that the hyperparameter tuning is the same as for the DT method.

AdaBoost (

For this reason, the sequence of the bat coronavirus is labelled as neither COV19+ nor COV19-.

Precision, recall, F1-score, and accuracy (Goutte & Gaussier, 2005) are used to evaluate the performance of the machine learning methods.
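For reference, these four measures can be computed directly from the entries of a binary confusion matrix. The sketch below uses the standard textbook definitions, with COV19+ assumed to be the positive class; the function name `classification_metrics` and the zero-division guards are illustrative choices, not taken from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Guards (an assumption): report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total

    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```

Because recall is TP / (TP + FN), a reduction in false negatives shows up directly as higher recall.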
In addition to these metrics, we also used the Matthews correlation coefficient (MCC) (Chicco, Tötsch, & Jurman, 2021) Table 2 where true positive (TP) is the number of accurately

In this section, we discuss and compare the efficiency of (Sulistiana & Muslim, 2020; Lin et al., 2008; Merghadi et al., 2020; Aggarwal, 2015). Thus, the dataset is separated into ten equal-size random groups for training and testing. Nine groups are used for training and the remaining group is used for testing. This process is repeated ten times, until each group has been tested. Average measures are computed to assess the effectiveness of the model. Table 3 presents the optimum hyperparameters obtained by the grid search for each machine learning classifier when the CpG-based features, the similarity features, and their combination are used separately. In the following subsections, we evaluate the performance of the MLP, AdaBoost, RF, KNN, DT, and SVM methods for the CpG-based features, the similarity features, and their integration, separately. All results are obtained using these hyperparameters.

In this part, we discuss the performance of the machine learning methods when only the CpG-based features are used. Table 4

In this part, we discuss the performance of the machine learning methods when the similarity features defined in this study are used. Table 5

In this part, we discuss the results of the machine learning methods using the integration of the CpG based

Table 3 The best performing hyperparameters of the machine learning methods with CpG-based features, similarity features, and their integration, separately. (Columns: Hyperparameter; CpG-based features; Similarity features; Integration of them.)

The confusion matrices for each machine learning classifier are presented in Figure 4. As can be seen in this

In this part, we compare the proposed method with existing studies detecting COVID-19 from genome sequences. Table 7 gives comparative results of COVID-19 detection methods. Naeem et al.
(2020) used genomic signal processing to detect COVID-19. They converted the nucleotide bases in the genome sequences into numbers to extract features (Ghosh & Barman, 2013

Table 7 Overview of the state-of-the-art COVID-19 detection methods

softmax layer containing five units. The convolution layer identified the subsequences whose lengths were 21 base pairs (bps). They extracted 3827 features (i.e., 21-bps sequences) using 553 human coronavirus sequences. Next, they applied a state-of-the-art feature selection algorithm to reduce the sequences required to distinguish different virus strains to the bare minimum. Finally, they listed 12 SARS-CoV-2-specific 21-bps subsequences to describe SARS-CoV-2 sequences. They performed 10-fold cross validation, and their method could distinguish SARS-CoV-2 from any other virus with >99% accuracy. Furthermore, they showed that their method separated the genome sequences obtained from different coronavirus families with 98.73% accuracy. Compared with the proposed method, their feature extraction is expensive and the learning time of their method is long. We recall that the proposed method uses only five effective features. Furthermore, their dataset is imbalanced and includes few SARS-CoV-2 sequences. Therefore, there is relatively little information about the class of SARS-CoV-2 sequences, and their method may not separate rare SARS-CoV-2 sequences from the majority (Sun, Wong, & Kamel, 2009; Japkowicz & Stephen, 2002

In the future, various types of viruses causing zoonotic diseases may appear, and it may be possible to apply the same strategy to detect the resulting diseases. We will also investigate whether CpG motifs with similarity features may be used to develop vaccines in future studies.
hensive survey
The arithmetic optimization algorithm
Aquila optimizer: A novel meta
Feature Selection and Enhanced Krill Herd Algorithm for Text Document Clustering
Data mining: The textbook
Instance-based learning algorithms. Machine Learning
Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases
January). A novel framework for rapid diagnosis of COVID-19 on computed tomography scans. Pattern Analysis and Applications
Duration of infectiousness and correlation with RT-PCR cycle threshold values in cases of COVID-19
Automated triaging of adult chest radiographs with deep artificial neural networks
COVID-19 Outbreak Prediction with Machine Learning. Algorithms
Pattern recognition and machine learning
Bagging predictors. Machine Learning
Random forests. Machine Learning
An experimental comparison of classification algorithms for imbalanced credit scoring data sets
A tutorial on support vector machines for pattern recognition
The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation
Early diagnosis of COVID-19-affected patients based on X-ray and computed tomography images using deep learning algorithm. Soft Computing
Predicting COVID-19 community mortality risk using machine learning and development of an online prognostic tool
Coronavirus Disease (COVID-19): A Machine Learning Bibliometric Analysis
Efficient kNN classification algorithm for big data
pii/S0167865517300077 doi
A decision-theoretic generalization of on-line learning and an application to boosting
Evaluating Overfit and Underfit in Models of Network Community Structure
Decision tree-based diagnosis of coronary artery disease: CART model
An online coronavirus analysis platform from the National Genomics Data Center
A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation
Evaluation of k-nearest neighbor classifier performance for direct marketing
Optimization of machine learning algorithms hyper-parameters for improving the prediction of patients infected with COVID-19
The random subspace method for constructing decision forests
First case of 2019 novel coronavirus in the United States
Multilayer feedforward networks are universal approximators
The class imbalance problem: A systematic study
Chaos game representation of gene structure
Review of the clinical characteristics of coronavirus disease 2019 (COVID-19)
Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity
Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel
Significant applications of machine learning for COVID-19 pandemic
Genetic evolution analysis of 2019 novel coronavirus and coronavirus from other species
Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2
A neural network model with bounded-weights for pattern classification. Computers & Operations Research
pii/S0957417407003752 doi
Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data
Machine learning methods for landslide susceptibility studies: A comparative overview of algorithm performance
A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19
Retrieved 2021-06-30
Supervised Machine Learning Models for Prediction of COVID-19 Infection using
A diagnostic genomic signal processing (GSP)-based system for automatic feature analysis and detection of COVID-19. Briefings in Bioinformatics
Using artificial intelligence techniques for COVID-19 genome analysis
Coronavirus Infections more Than Just the Common Cold
Scikit-learn: Machine learning in Python
Coronaviruses post-SARS: update on replication and pathogenesis
January)
lular components. Travel Medicine and Infectious Disease, 39, 101911
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
A survey of decision tree classifier methodology
Modified K-NN algorithm for classification problems with improved accuracy
Evolutionary modelling of the COVID-19 pandemic in fifteen most affected countries
Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19
COVID-19 detection in CT images with deep learning: A voting-based scheme and cross-datasets analysis. Informatics in Medicine Unlocked
The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR
Support vector machine (SVM) optimization using grid search and unigram to improve e-commerce review accuracy
SVM Parameter Optimization using Grid Search and Genetic Algorithm to Improve Classification Performance
A hyper learning binary dragonfly algorithm for feature selection: A COVID-19 case study. Knowledge-Based Systems, 212, 106553
Automated detection of COVID-19 disease using deep fused features from chest radiography images
Diagnosing COVID-19: The Disease and Tools for Detection
The Nature of Statistical Learning Theory
Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan
Human SARS-CoV-2 has evolved to reduce CG dinucleotide in its open reading frames
A new coronavirus associated with human respiratory disease in China
Transmission dynamics and evolutionary history of 2019-nCoV
COVID-Classifier: an automated machine learning model to assist in the diagnosis of COVID-19 infection in chest X-ray images
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Machine learning-based prediction of COVID-19 diagnosis based on symptoms