key: cord-0958704-f2uaxzy2
authors: Arslan, Hilal; Arslan, Hasan
title: A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier
date: 2021-01-09
journal: nan
DOI: 10.1016/j.jestch.2020.12.026
sha: d60abcccafce93825bbe2f80ffd36fd601927943
doc_id: 958704
cord_uid: f2uaxzy2

Various viral epidemics have been detected in the last two decades, such as those caused by the severe acute respiratory syndrome coronavirus and the Middle East respiratory syndrome coronavirus. The coronavirus disease 2019 (COVID-19) is a pandemic caused by a novel betacoronavirus called severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). After the rapid spread of COVID-19, many researchers quickly began to investigate diagnosis and treatment for this terrifying disease. Distinguishing SARS-CoV-2 from the other types of coronaviruses is a difficult problem due to their genetic similarity. In this study, we propose a new efficient COVID-19 detection method based on the K-nearest neighbors (KNN) classifier using the complete genome sequences of human coronaviruses in the dataset recorded in the 2019 Novel Coronavirus Resource. We also describe two features based on CpG islands that efficiently detect COVID-19 cases. Thus, genome sequences including approximately 30,000 nucleotides can be represented by only two real numbers. The KNN method is a simple and effective non-parametric technique for solving classification problems. However, the performance of the KNN depends on the distance measure used. We evaluate 19 distance metrics, grouped into five categories, to improve the performance of the KNN algorithm. Precision, recall, F-measure, and accuracy are computed to evaluate the proposed method. The proposed method achieves 98.4% precision, 99.2% recall, 98.8% F-measure, and 98.4% accuracy in a few seconds when any L1 type metric is used as the distance measure in the KNN.

Coronaviruses, which are positive-sense, single-stranded RNA viruses, are known to have the largest genomes among all RNA viruses [1]. The family of coronaviruses has been classified into four genera [2]: alphacoronavirus (AlphaCoV), betacoronavirus (BetaCoV), gammacoronavirus (GammaCoV), and deltacoronavirus (DeltaCoV). AlphaCoV and BetaCoV infect mammalian hosts, whereas GammaCoV and DeltaCoV mainly infect bird species [3].

Various coronaviruses belonging to AlphaCoV and BetaCoV have been recorded. Human coronavirus 229E (229E-CoV) and human coronavirus NL63 (NL63-CoV) are types of AlphaCoV, whereas human coronavirus HKU1 (HKU1-CoV), Severe Acute Respiratory Syndrome coronavirus (SARS-CoV), and Middle East Respiratory Syndrome coronavirus (MERS-CoV) are types of BetaCoV recorded recently. While 229E-CoV, NL63-CoV and HKU1-CoV cause mild respiratory tract infections, SARS-CoV and MERS-CoV cause dangerous and potentially fatal respiratory infections.

SARS-CoV was first identified in the Guangdong province of southern China on 16 November 2002 [4], and MERS-CoV emerged in June 2012 in Jeddah, Saudi Arabia [5]. In late December 2019, a novel coronavirus, SARS-CoV-2, emerged in China and spread rapidly throughout the world. The SARS-CoV-2 virus, which is genetically similar to SARS-CoV, caused a severe illness known as COVID-19 and a large number of deaths worldwide. Since several early infected people had visited a local seafood market in Wuhan, China, in December 2019, the virus is thought to be a pathogen that jumped from an animal to a human. The similarity rate between SARS-CoV-2 and a bat coronavirus is 96% [6]. The COVID-19 outbreak was declared a global pandemic on 11 March 2020 by the World Health Organization (WHO). As of 8 October 2020, the disease had spread to 188 countries and territories, had infected over 36.2 million people, and had caused more than 1.05 million deaths; more than 25.2 million people had recovered from the illness.

The most common symptoms of COVID-19 are cough, fever, gastrointestinal and musculoskeletal symptoms, and loss of taste or smell. Shortness of breath and chest pressure or pain are among the less common symptoms [7]. Because these symptoms are similar to those of the common flu, early diagnosis of SARS-CoV-2 infection is difficult. It is essential to identify positive cases quickly, since the virus spreads rapidly and poses a threat to the public health system. Furthermore, so far there is no specific antiviral drug recommended for the treatment of the novel coronavirus disease other than supportive care. In pre-clinical research, some medications already prescribed for other diseases have shown positive effects against SARS-CoV-2. Due to the absence of an effective COVID-19 treatment, early detection of this life-threatening infectious disease plays a very important role in medical therapy and in preventing the spread of the disease. Recently, many data scientists have been working extensively on the distinguishing features of the virus. Artificial intelligence applications and machine learning methods have also been used successfully to speed up the diagnosis of COVID-19 cases [8] [9] [10]. Learning-based classification models can therefore not only reduce the burden on healthcare professionals but also facilitate early diagnosis of the disease.

In this paper, we introduce an effective classification method to distinguish SARS-CoV-2 from common human coronaviruses, which are AlphaCoV, BetaCoV-1, MERS-CoV, HKU1-CoV, NL63-CoV, and 229E-CoV using CpG island features and KNN classification method. Main contributions of this research can be summarized as follows:

The choice of discriminative features based on the characteristics of SARS-CoV-2 is a critical step in improving classification performance. In this study, we propose CpG island features to differentiate SARS-CoV-2 from the other human coronaviruses. We propose a robust prediction method that uses the KNN classifier with any L1 type distance metric, selected from among 19 distance metrics investigated in five categories: L1 type, L2 type, vicissitude, inner product metrics, and the other types of metrics. We construct a larger dataset containing almost all types of human coronaviruses that are genetically similar to SARS-CoV-2, namely AlphaCoV, BetaCoV-1, MERS-CoV, NL63-CoV, HKU1-CoV, and 229E-CoV. The proposed method efficiently detects COVID-19 cases in a few seconds on this relatively large dataset.

The rest of the paper is organized as follows. In Section 2, a literature survey on COVID-19 is presented. In Section 3, the K-nearest neighbors (KNN) method and the distance measures used in the KNN method are summarized. The proposed classification strategy to detect COVID-19 cases is introduced in Section 4. Section 5 reports the experimental results and evaluates the experimental observations obtained with various distance metrics. In Section 6, we compare the proposed method with the other detection methods. Finally, Section 7 presents concluding remarks and future directions.

Deep learning and machine learning techniques have been widely used in a variety of research areas such as big data analysis [11] [12] [13], image classification [14], face detection [15, 16], and disease prediction [17] [18] [19] [20] [21]. Computer-based diagnostic systems developed with the help of artificial intelligence techniques will speed up early diagnosis of COVID-19 and thus decrease the workload of healthcare workers. Chen et al. [22] published a comprehensive survey of studies in the literature that use artificial intelligence in the fight against COVID-19. They observed that machine learning, deep learning, and artificial neural network technologies have been successfully applied at almost every stage of combating COVID-19, in comparison with other coronaviruses such as SARS-CoV and MERS-CoV.

Randhawa et al. [23] suggested a combination of supervised machine learning with digital signal processing methods (ML-DSP) for an accurate and scalable taxonomic classification of genomic sequences. They mapped each genomic sequence to discrete values corresponding to its genomic signal. They employed six supervised machine learning classifiers (linear discriminant, linear support vector machine, quadratic support vector machine, fine KNN, subspace discriminant, and subspace KNN) to detect SARS-CoV-2. They used 29 SARS-CoV-2 genome sequences and 20 genome sequences for each of alphacoronavirus, betacoronavirus, and deltacoronavirus. They observed that the linear discriminant method achieved 100% accuracy. Naeem et al. [24] proposed a method to distinguish COVID-19, SARS-CoV, and MERS-CoV viruses by using the K-nearest neighbors and the trainable cascade-forward back-propagation neural network methods. They extracted genomic signal processing features using a dataset that contains 76 genome sequences for each type of coronavirus from the National Center for Biotechnology Information (NCBI). Their results showed that the KNN algorithm outperformed the cascade-forward back-propagation neural network in all of the COVID-19/SARS-CoV, COVID-19/MERS-CoV, and COVID-19/SARS-CoV/MERS-CoV classification tasks, and achieved an accuracy of 100%. Batista et al. [25] developed a method to predict whether patients in the emergency care unit are COVID-19 positive by using five supervised machine learning algorithms: neural networks, random forests, logistic regression, support vector machines, and gradient boosting regression trees. They collected data from 235 patients and used 15 different types of features such as age, gender, and hemoglobin. Their experimental results showed that the support vector machine classifier achieved the highest performance among the classifiers, with an AUC value of 84.7%.

Unal and Dudak [26] studied the diagnosis of COVID-19. They applied four supervised machine learning algorithms (Naive Bayes, KNN, support vector machines, and decision trees) to the COVID-19 Mexico Patient Health Dataset. The dataset consists of 95,839 cases recorded by the Mexican government and 19 different types of features, such as the sex and age of the patient, the state of pneumonia and intubation, and the state of many other diseases. Their experimental results showed that the support vector machine achieved the best predictive performance, with a classification accuracy of 100%.

Although there are only a few studies detecting COVID-19 cases from genome sequences, a number of papers have recently been published on detecting COVID-19 cases from X-ray or computed tomography (CT) images [27] [28] [29] [30] [31] [32] [33] [34]. Barstugan et al. [27] developed a COVID-19 classification method by using the support vector machine classifier on a dataset of 150 abdominal CT images of different types. Their experimental results showed that the proposed method achieved an accuracy of 99.6%. Ozturk et al. [28] applied the DarkCovidNet deep learning model to raw chest X-ray images to perform a binary classification (COVID-19, no findings). They also used DarkCovidNet on the same dataset to perform a three-class classification (COVID-19, no findings, pneumonia). They reported that the highest accuracy of the classifier was 98.08% for the binary case and 87.02% for the three-class case. Sekeroglu et al. [29] proposed an alternative COVID-19 detection method by using deep learning and machine learning classifiers on a publicly available dataset that contains 1583 healthy, 4292 pneumonia, and 225 confirmed COVID-19 chest X-ray images. They stated that a convolutional neural network (CNN) without pre-processing and with minimized layers achieved an accuracy of 98.50%. Jain et al. [30] developed an approach to detect COVID-19 cases among chest X-ray images of healthy, bacterial pneumonia, viral pneumonia, and COVID-19 subjects by using the ResNet18, ResNet101, DenseNet121, and VGG-16 deep learning models. Their experimental results demonstrated that ResNet101 achieved the highest accuracy, 98.93%. Apostolopoulos and Mpesiana [31] used the VGG19, MobileNet, Inception, Xception, and Inception ResNet v2 deep transfer learning classifiers to detect COVID-19 positive cases among 224 COVID-19, 700 bacterial pneumonia, and 504 normal chest X-ray images. Their results showed that VGG19 achieved the best binary classification accuracy, 98.75%, among the CNN methods.

Ahuja et al. [32] used deep learning methods to detect COVID-19 positive cases from chest X-ray images. Asnaoui and Chawki [33] constructed a method for distinguishing COVID-19 from pneumonia in chest X-ray and tomography images. They evaluated recent deep learning models: VGG16, VGG19, DenseNet201, Inception_ResNet_V2, Inception_V3, Resnet50, and MobileNet_V2. They reported that Inception_ResNet_V2 achieved the best accuracy, 92.18%. Basu et al. [34] proposed an alternative COVID-19 screening method called Domain Extension Transfer Learning. They extracted discriminative features from a chest X-ray dataset and then classified the images as normal, pneumonia, other diseases, and COVID-19 with an accuracy of 95.3%.

We propose a new COVID-19 detection method based on CpG island features and the KNN classifier. The KNN is extremely useful for large data classification, and its performance depends on the distance metric used [35] [36] [37] [38]. Several studies have examined which metrics are most suitable for the KNN algorithm [39, 40]. In this section, we first provide a brief description of the KNN method and some recent studies based on it. We then investigate the metrics used in the KNN classifier under five categories to improve the performance of the KNN classifier in our model.

K-nearest neighbors (KNN), a supervised machine learning algorithm, can be used efficiently to solve classification problems. The KNN was introduced in 1951 [41] and reformulated in 1967 [42]. The KNN is a non-parametric classifier known as one of the simplest lazy learning algorithms; that is, no model needs to be trained before classification. Despite its lazy structure, the KNN has been named one of the top 10 algorithms in data mining [43]. In the prediction process, the class of a new observation is determined by finding the K training samples nearest to it under the chosen distance measure and assigning the class that is most common among them.

Several recent studies aim to improve the performance of the KNN. To decrease the sensitivity to the neighborhood size k and to improve the voting strategy within the neighborhood, Gou et al. [44] proposed two k-nearest neighbor rules: the weighted representation-based k-nearest neighbor rule and the weighted local mean representation-based k-nearest neighbor rule. Their experimental results demonstrate that the proposed methods are less sensitive to k. Gou et al. [45] also addressed the selection of the neighborhood size k by introducing the generalized mean distance-based KNN classifier. They stated that the proposed method is less sensitive to k than the other KNN-based classifiers.

To improve the performance of the KNN classifier, we investigate the metrics used in the KNN algorithm under five categories, following a categorization similar to that in [39, 40]. The five categories are L1 type, L2 type, vicissitude, inner product metrics, and the other types of metrics.
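To make the role of the distance measure concrete, the following is a minimal sketch of three widely used L1-family distances (Manhattan, Canberra, and Sørensen). The function names are illustrative assumptions; the exact definitions of the 19 metrics used in this study follow [39, 40] and are not reproduced here. The Sørensen form assumes non-negative feature values.

```python
from typing import Sequence


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def sorensen(x: Sequence[float], y: Sequence[float]) -> float:
    """Manhattan distance divided by the sum of all coordinates (non-negative inputs)."""
    denom = sum(a + b for a, b in zip(x, y))
    return manhattan(x, y) / denom if denom else 0.0


# Two hypothetical two-dimensional feature vectors, invented for illustration.
u, v = (1.0, 2.0), (3.0, 2.0)
print(manhattan(u, v), canberra(u, v), sorensen(u, v))
```

All three distances are built from absolute differences, which is why they tend to rank neighbors similarly; they differ only in how each difference is normalised.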

In this section, we present a new COVID-19 detection method based on the KNN classifier and CpG island features. The main steps of the proposed method are described in Algorithm 1. The first step of the algorithm is feature extraction. Extracting robust and discriminative features from human coronavirus genome sequences is the most critical step in improving the diagnosis of SARS-CoV-2. In this step, we propose to use CpG based features, since CpG dinucleotides in the open reading frames of SARS-CoV-2 have an extremely low frequency [46]. The main reason for the lower CpG dinucleotide density is the mutation of C into A and G into T [47, 48]. The proposed features are extracted by using Eq. 1 and Eq. 2.

$$\mathrm{CGp} = \mathrm{ratio}(C) + \mathrm{ratio}(G) \tag{1}$$

$$\mathrm{CpGo} = \frac{\mathrm{ratio}(CG)}{\mathrm{ratio}(C)\,\mathrm{ratio}(G)} \tag{2}$$

where ratio(C), ratio(G), and ratio(CG) are computed by dividing the number of occurrences of C, G, and CG in the sequence, respectively, by the sequence length. Thus, each sequence containing approximately 30 000 nucleotides is represented by only two features. Fig. 1 provides an example of how the features are calculated from a part of a sequence. After the feature extraction step, we apply the KNN method to classify SARS-CoV-2 sequences. The performance of the KNN mainly depends on the metric that is used to compute the distances between data samples. To improve the performance of the KNN algorithm, we evaluate 19 distance metrics in five categories, which are L1 type, L2 type, vicissitude, inner product metrics, and the other types of metrics. In this step, we propose to use an L1 type metric as the distance measure in the KNN.
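The feature extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cpg_features` and the synthetic 60-nucleotide test sequence are assumptions, constructed only so that the counts match the worked example in the Fig. 1 caption (13 C, 20 G, and 3 CG).

```python
from typing import Tuple


def cpg_features(sequence: str) -> Tuple[float, float]:
    """Return (CGp, CpGo) for a nucleotide sequence.

    CGp  = ratio(C) + ratio(G)
    CpGo = ratio(CG) / (ratio(C) * ratio(G))
    Each ratio is an occurrence count divided by the sequence length.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    ratio_c = seq.count("C") / n
    ratio_g = seq.count("G") / n
    ratio_cg = seq.count("CG") / n
    cgp = ratio_c + ratio_g
    # Guard against sequences with no C or no G (an assumption of this sketch).
    cpgo = ratio_cg / (ratio_c * ratio_g) if ratio_c and ratio_g else 0.0
    return cgp, cpgo


# Synthetic sequence (not real viral data): 13 C, 20 G, 3 CG, length 60.
example = "CGA" * 3 + "G" * 17 + "C" * 10 + "AT" * 12
cgp, cpgo = cpg_features(example)
print(round(cgp, 2), round(cpgo, 2))  # 0.55 0.69
```

Applied to a complete genome, the same function reduces roughly 30 000 characters to the two numbers that are passed on to the classifier.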

Require: S: training genome sequences; L: class labels (COVID-19 positive, COVID-19 negative); seq: a test sequence; k: the neighborhood size
Ensure: the class label of the test sequence seq

Step 1: Feature extraction
for all genome sequences do
    feature1 = ratio(C) + ratio(G)
    feature2 = ratio(CG) / (ratio(C) ratio(G))
end for

Step 2: Apply the KNN with an L1 type metric
Compute the distance between seq and every sample in S using any L1 type metric
Choose the k samples in S that are nearest to seq
Assign seq to the majority class among these k samples
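The classification step can be sketched as follows, assuming the Manhattan distance as the representative L1 type metric. The function names and the four training vectors are hypothetical values invented for illustration; they are not taken from the dataset.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """L1 (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 1) -> str:
    """Label `query` by majority vote among its k nearest training samples."""
    neighbours = sorted(train, key=lambda s: manhattan(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical (CGp, CpGo) feature vectors, for illustration only.
train = [
    ((0.38, 0.40), "positive"),
    ((0.38, 0.42), "positive"),
    ((0.41, 0.60), "negative"),
    ((0.37, 0.75), "negative"),
]
print(knn_predict(train, (0.38, 0.43), k=1))  # positive
```

Because the KNN is a lazy learner, there is no training phase here: the whole cost lies in the distance computations at prediction time, which stay cheap when each sequence is reduced to two features.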

In this section, we evaluate the performance of the proposed method. Before discussing the results, we first describe our dataset, then the experimental setup, and then the performance measures. Finally, we give the results of the experiments conducted with different distance metrics.

The 2019 novel coronavirus resource (2019nCoVR) [49] at the China National Center for Bioinformation is one of the most important sources of data on various types of coronaviruses. It integrates several important databases, including GISAID, NCBI, NMDC, and CNCB/NGDC. In this study, we used complete genomic sequences of human coronaviruses obtained from 2019nCoVR. The genome sequence of each coronavirus has a length of approximately 30 000 nucleotides. The properties of the human coronavirus sequences are presented in Table 6. The dataset includes various types of human coronaviruses, namely AlphaCoV, BetaCoV-1, MERS-CoV, NL63-CoV, HKU1-CoV, and 229E-CoV, as well as SARS-CoV-2. We refer to SARS-CoV-2 sequences as COVID-19 positive and to the sequences of all other coronaviruses as COVID-19 negative. In addition to 1000 SARS-CoV-2 sequences, we used 592 genome sequences of other human coronaviruses in our experiments. We note that all available genome sequences of human coronaviruses other than SARS-CoV-2 were downloaded.

The experiments were performed on a Core i7 2.7 GHz processor with 16 GB RAM under the Linux operating system. Feature extraction and the classification process were implemented in the Python programming language. Precision, recall, F-measure, and accuracy in predicting COVID-19 positive cases are used as the major performance measures. Next, we define these performance measures.

Hassanat metric: $\sum_{i=1}^{n} D(x_i, y_i)$

Fig. 1. CpG based features. The numbers of C, G, and CG are 13, 20, and 3, respectively. Thus, CGp = ratio(C) + ratio(G) = 0.55, and CpGo = ratio(CG)/(ratio(C) ratio(G)) = 0.69.

Table 6. The properties of complete genome sequences of human coronaviruses.

COVID-19 prediction as a binary classification problem has four prediction outcomes. True Positive (TP) is the number of genome sequences that are correctly classified as COVID-19 positive, True Negative (TN) is the number of genome sequences that are correctly classified as COVID-19 negative, False Positive (FP) is the number of genome sequences that are incorrectly classified as COVID-19 positive, and False Negative (FN) is the number of genome sequences that are incorrectly classified as COVID-19 negative. The performance evaluation metrics, namely precision, recall, F-measure, and accuracy, are defined by using these outcomes, which are arranged in a confusion matrix of actual and predicted labels in Fig. 2. Next, we briefly explain the performance measures used in this study.
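As a small illustration of these four outcomes, the sketch below counts them from paired lists of actual and predicted labels. The function name, the label strings, and the five-element example are assumptions made for this example only.

```python
from typing import Sequence, Tuple


def confusion_counts(
    actual: Sequence[str], predicted: Sequence[str], positive: str = "positive"
) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels, for illustration only.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```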

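Precision measures how reliable the positive predictions of the classifier are in the presence of false positives. It is computed as the ratio of the number of correctly classified positive samples to the number of samples labeled as positive by the system, as shown in Eq. 3.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

Recall is the fraction of all positive samples in the dataset that are predicted as positive. It is computed as in Eq. 4.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

F-measure is the harmonic mean of precision and recall, and is computed using Eq. 5.

$$\text{F-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$

Accuracy is computed by dividing the number of correctly classified cases by the number of all cases, as indicated in Eq. 6.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$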
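The four measures can be computed directly from the confusion-matrix counts using their standard definitions. The sketch below is illustrative: the function name and the counts in the usage example are hypothetical and are not results from this study.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F-measure, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to exercise the formulas.
print(classification_metrics(tp=8, tn=9, fp=2, fn=1))
```

With these made-up counts, precision is 8/10, recall is 8/9, and accuracy is 17/20; the F-measure lies between precision and recall, closer to the smaller of the two.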

In this section, we evaluate the efficiency of the proposed features and the performance of the KNN classifier with respect to the metrics by using five-fold cross-validation on the dataset. In each fold, the dataset is divided into a training set and a testing set: 80% of the human coronavirus genome sequences are used for training and the remaining 20% are used for validation. The experiments are repeated five times, as presented in Fig. 3. In our experiments, we vary the neighborhood size k between 1 and 20 and observe that the classification performance of the proposed method remains the same or decreases slightly as k increases. For this reason, k is set to 1 in all experiments.
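A five-fold split of this kind can be sketched with the standard library alone. The helper below is an assumption-laden illustration: its name, the fixed seed, and the plain (non-stratified) shuffling are choices made for this example, and the source does not state whether the original folds were stratified by class.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, n_folds: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each fold of a shuffled split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal, disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1592 = 1000 SARS-CoV-2 sequences + 592 other human coronavirus sequences.
for train_idx, test_idx in k_fold_indices(1592):
    print(len(train_idx), len(test_idx))
```

Each sequence appears in exactly one test fold, so every sample is used for validation once and for training four times, which gives the 80%/20% proportion in each run.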

Table 7 presents the precision, recall, and F-measure values of the KNN classifier with respect to the five metric groups, which are L1 type, L2 type, vicissitude, inner product metrics, and the other types of metrics. The results of the metrics in the same group are close to each other; thus, we present the average of the results obtained with the metrics in each group. First, we analyze the results of the KNN equipped with L1 type metrics, which are KM, ChebM, ManM, SM, CanM, and MCM. The KNN with L1 type metrics gives the best result, achieving a precision of 98.4%, a recall of 99.2%, and an F-measure of 98.8%. This means that almost all sequences are classified correctly. Next, we look at the results of the KNN with L2 type metrics, which are ClaM, EM, SquM, NCSM, DivM, and SCSM. The KNN classifier with this group of metrics achieves a precision of 96.0%, a recall of 98.2%, and an F-measure of 97.1% on average. The KNN with vicissitude metrics performs better than the KNN with L2 type metrics, achieving a precision of 98.4%, a recall of 99.0%, and an F-measure of 98.7%. Next, we investigate the results of the KNN with the inner product metrics, ChoM and DicM. This group gives the worst results among all groups, with a precision of 94.4%, a recall of 92.8%, and an F-measure of 93.4%. Finally, we investigate the other types of metrics, which are HasM and MotM. The results of the KNN with HasM are close to those of the KNN with inner product metrics: a precision of 94.4%, a recall of 92.9%, and an F-measure of 93.4%. On the other hand, the results of the KNN with MotM are close to those of the L2 type metrics and better than those of HasM: a precision of 96.4%, a recall of 98.2%, and an F-measure of 97.3%.

In addition to the precision, recall, and F-measure values, we present accuracy results. We take the average of the accuracies of the KNN classifier obtained with the metrics in the same group, and show the results for each metric group separately in Fig. 4. The KNN classifier with inner product metrics and with HasM gives the lowest accuracies, 91.7% and 91.8%, respectively. When L2 type metrics are used, the method achieves an accuracy of 96.2%. The KNN with the MotM metric achieves an accuracy of 96.5%. The accuracy values of the KNN with L1 type and vicissitude metrics are close to each other, and our method achieves its best accuracy, 98.4%, when L1 type metrics are used.

Engineering Science and Technology, an International Journal xxx (xxxx) xxx

In this part, we compare the results of the proposed method with state-of-the-art methods. Table 8 provides a simple comparison of the results in terms of method, dataset, class, and accuracy. We investigate these studies under two categories: studies using genome sequence datasets and studies using other datasets.

First, we compare our method with the studies that detect COVID-19 cases from genome sequences. Randhawa et al. [23] combined supervised machine learning methods with digital signal processing. They used 29 sequences of SARS-CoV-2 and 20 sequences for each of alphacoronavirus, betacoronavirus, and deltacoronavirus. They achieved 100% accuracy when the linear discriminant method was used. The main limitation of their method is the small number of sequences in their dataset. Furthermore, Randhawa et al. used deltacoronavirus genomes, although deltacoronaviruses mainly infect bird species rather than humans. If the number of sequences in their dataset were increased and human coronavirus sequences genetically similar to SARS-CoV-2 were added, the overall accuracy of their method might decrease. Another study predicting COVID-19 from genome sequences was introduced by Naeem et al. [24]. They used the classical KNN method to distinguish SARS-CoV-2 sequences among SARS-CoV-2, SARS-CoV, and MERS-CoV genome sequences. They used 76 sequences for each of SARS-CoV-2, SARS-CoV, and MERS-CoV, and achieved an accuracy of 100%. They extracted features by using the discrete Fourier transform, the discrete cosine transform, and the seven moment invariants methods. Compared with our method, the main limitation of their study [24] is that, like Randhawa et al. [23], they work with a smaller dataset.

Next, we discuss the advantages and drawbacks of the methods using other datasets. Unal and Dudak [26] used the Mexico Patient Health Dataset to detect COVID-19 cases. Although their dataset is large, it only represents the features of a specific region. Furthermore, they used 19 different features, including the sex and age of the patient, the state of pneumonia and intubation, and the state of many other diseases. One advantage of our model over the method of Unal and Dudak [26] is that it uses only two powerful and effective features derived from the complete genome sequences of human coronaviruses. Thus, our proposed method detects COVID-19 positive cases within a few seconds, whereas Unal and Dudak [26] did not report the detection time. Another difference is that they use the classical KNN classifier with default parameters, whereas we use the KNN classifier with the best-performing distance metric in our proposed model.

A number of existing studies in the literature detect COVID-19 cases from image datasets. In Table 8, we also list six image based studies that achieve remarkable accuracy values. These image based methods use different deep learning or machine learning methods. Although some image based studies report higher accuracy, given the high mutation rate of SARS-CoV-2, genomic sequences are extremely useful for tracking coronavirus genes that change frequently as the disease spreads from one person to another. Moreover, the radiation generated by X-ray or CT scanning machines may cause permanent damage to people. For this reason, X-ray or CT scans may not be obtainable for some people, which can be considered a disadvantage of image based studies.

COVID-19 is an ongoing epidemic that continues to set new records in cumulative and daily numbers of infections worldwide. The pandemic is an unprecedented situation for healthcare systems, and to overcome it, it is essential to detect COVID-19 cases accurately by analyzing patient data in a minimum amount of time. In this study, we propose an accurate and fast method to detect COVID-19 positive cases from genome sequences of human coronaviruses. In the proposed method, first, the features that significantly differentiate COVID-19 cases are extracted from the complete genome sequences of human coronaviruses. In this step, we propose to use CpG island features. Each human coronavirus genome sequence, which includes about 30,000 nucleotides, is represented by only two real numbers. The feature extraction step takes just a few seconds. Second, the KNN method is used to distinguish COVID-19 positive cases from the other types of human coronaviruses: AlphaCoV, BetaCoV-1, MERS-CoV, NL63-CoV, HKU1-CoV, and 229E-CoV. The KNN classifier is one of the simplest methods and is highly flexible in solving complex classification problems. The accuracy of the KNN is higher than that of state-of-the-art classifiers in certain cases, and it often performs efficiently. However, the performance of the KNN greatly depends on the metric used. To identify the most appropriate metric, we review five groups of metrics used in the KNN classifier. The selection of different distance metrics for the KNN can result in varying accuracy for the same dataset. Experimental results reveal that the proposed method achieves its highest accuracy, 98.4% on average, in a few seconds when L1 type metrics are used as the distance measure in the KNN.
In future studies, we will compare human SARS-CoV-2 sequences to other types of coronavirus sequences such as bat SARS-CoV-like coronaviruses 2, and propose a similarity based feature to increase overall accuracy. In addition, the factors affecting the recovery status of patients suffering from COVID-19 may be investigated in future studies by combining machine learning and parallel computing methods with effective features.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. 

Coronavirus envelope protein: current knowledge

Origin and evolution of pathogenic coronaviruses

A chronicle on the sars epidemic

Middle east respiratory syndrome

Transmission dynamics and evolutionary history of 2019-ncov

Clinical characteristics of coronavirus disease 2019 (covid-19) in china: A systematic review and meta-analysis

Artificial intelligence (ai) applications for covid-19 pandemic

Review on machine and deep learning models for the detection and prediction of coronavirus

Machine learning techniques for sequence-based prediction of viral-host interactions between sars-cov-2 and human proteins

Scalable machine-learning algorithms for big data analytics: a comprehensive review

Deep learning applications and challenges in big data analytics

Big data analysis and deep learning applications: proceedings of the first international conference on big data analysis and deep learning

Machine learning algorithms for image classification of hand digits and face recognition dataset

Facial detection using deep learning

Robust real-time face detection

Metagenomic nanopore sequencing of influenza virus direct from clinical respiratory samples

Assessment of metagenomic nanopore and illumina sequencing for recovering whole genome sequences of chikungunya and dengue viruses directly from clinical samples

Machine learning based approaches for detecting covid-19 using clinical text data

Using artificial intelligence to detect covid-19 and community-acquired pneumonia based on pulmonary ct: evaluation of the diagnostic accuracy

Artificial intelligence and machine learning to fight covid-19

A Survey on Applications of Artificial Intelligence in Fighting Against COVID-19

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study

A diagnostic genomic signal processing (GSP)-based system for automatic feature analysis and detection of COVID-19

Covid-19 diagnosis prediction in emergency care patients: a machine learning approach

Classification of covid-19 dataset with some machine learning methods

Coronavirus (covid-19) classification using ct images by machine learning methods

Automated detection of covid-19 cases using deep neural networks with X-ray images

Detection of covid-19 from chest x-ray images using convolutional neural networks

A deep learning approach to detect covid-19 coronavirus with X-ray images

Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks

Deep transfer learning-based automated detection of covid-19 from lung ct scan slices

Using X-ray images and deep learning for automated detection of coronavirus disease

Deep learning for screening covid-19 using chest X-ray images

A mapreduce-based k-nearest neighbor approach for big data classification

Efficient knn classification algorithm for big data

knn-is: an iterative spark-based design of the k-nearest neighbors classifier for big data, Knowl.-Based Syst

Efficient tree classifiers for large scale datasets

Comprehensive survey on distance/similarity measures between probability density functions

Effects of distance measure choice on k-nearest neighbor classifier performance: a review

Discriminatory analysis. nonparametric discrimination: consistency properties

Nearest neighbor pattern classification

Top 10 algorithms in data mining

Locality constrained representation-based k-nearest neighbor classification, Knowl.-Based Syst

A generalized mean distance-based k-nearest neighbor classifier

Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense

Human sars-cov-2 has evolved to reduce cg dinucleotide in its open reading frames

Unfolding sars-cov-2 viral genome to understand its gene expression regulation

The 2019 novel coronavirus resource