key: cord-1016699-ruy4qk5r authors: Mansour, Nehal A.; Saleh, Ahmed I.; Badawy, Mahmoud; Ali, Hesham A. title: Accurate detection of Covid-19 patients based on Feature Correlated Naïve Bayes (FCNB) classification strategy date: 2021-01-15 journal: J Ambient Intell Humaniz Comput DOI: 10.1007/s12652-020-02883-2 sha: 1e54ef52fec31158ded3b4cc2338e4d879437a2a doc_id: 1016699 cord_uid: ruy4qk5r The outbreak of Coronavirus (COVID-19) has spread between people around the world at a rapid rate so that the number of infected people and deaths is increasing quickly every day. Accordingly, it is a vital process to detect positive cases at an early stage for treatment and controlling the disease from spreading. Several medical tests had been applied for COVID-19 detection in certain injuries, but with limited efficiency. In this study, a new COVID-19 diagnosis strategy called Feature Correlated Naïve Bayes (FCNB) has been introduced. The FCNB consists of four phases, which are; Feature Selection Phase (FSP), Feature Clustering Phase (FCP), Master Feature Weighting Phase (MFWP), and Feature Correlated Naïve Bayes Phase (FCNBP). The FSP selects only the most effective features among the extracted features from laboratory tests for both COVID-19 patients and non-COVID-19 people by using the Genetic Algorithm as a wrapper method. The FCP constructs many clusters of features based on the selected features from FSP by using a novel clustering technique. These clusters of features are called Master Features (MFs) in which each MF contains a set of dependent features. The MFWP assigns a weight value to each MF by using a new weight calculation method. The FCNBP is used to classify patients based on the weighted Naïve Bayes algorithm with many modifications as the correlation between features. The proposed FCNB strategy has been compared to recent competitive techniques. 
Experimental results have proven the effectiveness of the FCNB strategy in which it outperforms recent competitive techniques because it achieves the maximum (99%) detection accuracy. Coronavirus is highly threatening for both animal and human life. Many types of coronavirus can transfer from animals to the human population (Shaban et al. 2020 ; Barstugan et al. 2020 ). Humans have not previously identified COVID-19 because it is a new species that appeared in 2019. COVID-19 is a global epidemic problem that can spread rapidly among people (Shaban et al. 2020; Li et al. 2020a ). On the 7th of January 2020, COVID-19 has been identified by the World Health Organization (WHO) and the Chinese government as a global pandemic (Kang et al. 2020) . COVID-19 has typical symptoms that involve shortness of breath, fever, headache, cough, fatigue, sore throat, and muscle pain (Huang et al. 2020) . Physical contact is the main reason of the spread of COVID-19 disease among people. The infections are transferred from the infected COVID-19 person to the healthy person through hand contact, mucous contact, or breathe contact. Because of the rapid spread of COVID-19 around the world, it causes a destructive impact on issues like public health, the global economy, and daily activities. Moreover, COVID-19 infection takes less than 4 weeks to quash the medical system once it begins to spread (Shaban et al. 2020) . To this end, early detection of COVID-19 especially with the lacking of specific cures or vaccines, is an essential process for treating and controlling the disease from spreading. A real-time Reverse Transcription-Polymerase Chain Reaction (RT-PCR) is the most preferable test that is currently used for detecting COVID-19 patients (Zu et al. 2020) . Although RT-PCR tests are sensitive, fairly quick, and reliable, these tests suffer from the risk of eliciting false-negative and false-positive results. 
Consequently, the spread of COVID-19 infection has been increased because RT-PCR tests cannot immediately distinguish the infected people (Zu et al. 2020) . Chest radiological imaging such as Computed Tomography (CT) images and X-rays play an important role in the early detection and treatment of COVID-19 patients. Despite the advantages of CT images for detecting COVID-19 patients, misclassification may occur between the imaging features of COVID-19 and other types of diseases (Shaban et al. 2020; Li et al. 2020a, b) . With increasing demand toward providing accurate tests, the dependency on CT images or RT-PCR tests as accurate tools for the detecting of COVID-19 patients is decreased dramatically. To this end, fast and accurate detection of COVID-19 patients is very important to prevent the sources of infection. Recently, machine learning is an adjunct tool for clinicians. Machine learning can automatically support medical diagnosis as a helping tool for identifying and detecting the novel coronavirus. Machine learning is an application of Artificial Intelligence (AI) that is used for the concept of software that automatically learns how to execute a task or solve a problem (Rabie et al. 2015; Rabie et al. 2019a) . Machine learning techniques become more and more accurate over time, and they work on the same principle. Firstly, they receive some input training data. Then, build the mathematical model depending on this training input data. Finally, the mathematical model is used to solve the problem at hand. Many methods have been provided for COVID-19 detection based on machine learning techniques (Zhong et al. 2020; Rustam et al. 2020; Alazab et al. 2020) . Despite the efficiency of these methods, they suffer from many limitations such as low diagnosis accuracy, high complexity, and long prediction time. Naïve Bayes (NB) is a simple, popular classifier, and powerful machine learning technique. 
It has been verified as the highly professional probabilistic classifier that has solid mathematical fundamentals (Kumar et al. 2019; Rabie et al. 2015 Rabie et al. , 2019a . NB has worked very well in several complex real-world applications such as; medical diagnosis, real-time prediction, spam filtering, and weather forecasting despite its oversimplified assumptions and its Naïve design (Dada et al. 2019; Ali and Ali 2020; Hewage et al. 2020; Lei et al. 2020) . Thus, NB can be considered as one of the best classifiers that can be applied for COVID-19 detection. This is due to many reasons, which can be summarized as follows; (1) NB can provide fast predictions rather than other classification algorithms because the training time has an order O(N) with the dataset, (2) it can be easily trained with small amount of input training dataset and it can be used also for large datasets as well, (3) the simplicity and easy implementation with the ability of real-time training for new items, (4) the implementation of this classifier has no required adjusting parameters or domain knowledge, (5) It handles both continuous and discrete data, (6) NB is less sensitive to missing data, (7) NB has high capability to handle the noise in the dataset, (8) NB is an Incremental learning approach because its functions work from an approximation of low-order probabilities which are extracted from the training data. Hence, these can be quickly updated as new training data are obtained, (9) If the Naive Bayes conditional independence assumption holds, then it will converge quicker than discriminative models like logistic regression, (10) NB can be used for both binary and multiclass classification problems and (11) NB is sufficient for real-time applications such as diseases diagnoses because it relies on a set of pre-computed probabilities that make the classification done in a very short time (Khotimah et al. 
2020; Kaur and Oberoi 2020). Although NB has proven efficient in real-time applications, its performance sometimes degrades because of the unrealistic assumption that all features have the same degree of importance and are independent of each other given the class value. Hence, this unrealistic assumption should be mitigated to overcome such hurdles. Recently, there has been extensive research on solutions to this issue, such as feature selection and feature weighting. However, the desired performance of NB has not been reached yet. More effort is needed to enhance the performance of NB to match real-world conditions. The contributions of the proposed work are listed as follows.
• A novel Feature Correlated Naïve Bayes (FCNB) classification strategy for accurate detection of COVID-19 patients has been proposed.
• The FCNB consists of two stages, namely; (1) Pre-Processing Stage (P 2 S) and (2) Classification Stage (CS).
• The P 2 S contains the first three phases of the FCNB strategy, called Feature Selection Phase (FSP), Feature Clustering Phase (FCP), and Master Feature Weighting Phase (MFWP). Moreover, the CS contains the Feature Correlated Naïve Bayes Phase (FCNBP), which represents the last phase of the FCNB strategy.
• In the P 2 S, the collected historical data on both COVID-19 patients and non-COVID-19 people are represented in a suitable form after performing many essential processes, to enable the diagnostic model in the next stage to accurately diagnose COVID-19 patients.
• During P 2 S, the most significant features will be selected by using the Genetic Algorithm (GA) in FSP, and then these selected features will be put into groups or clusters in the FCP by using a new clustering technique in which each group is called a Master Feature (MF) that includes a set of dependent or related features.
Then, the MFWP will assign a weight value to each MF by using a new weight calculation method based on the number of features in the MF, the correlation between features, and the summation of weights for each feature in the MF.
• During the second stage (CS), the FCNBP tries to provide a fast and accurate diagnosis for COVID-19 patients based on the data received from the P 2 S by using a new classification model.
• The main objective of the proposed classification model is to overcome the problems of traditional weighted NB and improve its performance by (1) taking into consideration the correlation between features, and (2) reducing the classification time, because it considers the weights of the used MFs rather than the weights of many individual features.
The paper is organized as follows: In Sect. 2, the main problem of this study is formulated. In Sect. 3, the diagnosing methodologies of COVID-19 are presented. In Sect. 4, related work is reviewed. In Sect. 5, the weighted Naïve Bayes is explained. In Sect. 6, the proposed Feature Correlated Naïve Bayes (FCNB) classification strategy is elaborated. An illustrative example is introduced in Sect. 7. The experiments are presented and the results are analyzed in Sect. 8. In Sect. 9, the paper is concluded and the future work is presented.

Due to the unavailability of a specific vaccine against COVID-19 infections, and with no drug having proven high clinical efficacy, the early detection of COVID-19 disease is essential for disease cure and control. Undoubtedly, the management of COVID-19 will place considerable pressure on health-care systems (Li et al. 2020a, c). Moreover, the low availability of appropriate personal protective equipment for front-line health-care staff causes these key staff to be disproportionately affected by COVID-19.
Nowadays, fast detection and isolation of infected people is an effective way to protect the healthcare system from becoming overwhelmed because it flattens the epidemic curve as depicted in Fig. 1. Otherwise, with no protective measures, the capacity of the health-care systems will be exceeded. Disruption or complete breakdown of health-care systems would result in high mortality since the care of all illnesses will be degraded. Because diagnosis systems are not available everywhere, the detection of COVID-19 is currently a tedious task, which will cause panic. Because of the limited availability of COVID-19 testing kits especially in developed countries, there is a critical need to rely on other diagnosis strategies (Li et al. 2020a, c). Rapid and accurate detection of COVID-19 is an increasingly vital issue since infected people may otherwise not be recognized and may not receive suitable treatment in time. Infected people will spread the virus to healthy people due to the communicable nature of COVID-19. Although several COVID-19 diagnosis strategies based on data mining and artificial intelligence have been recently introduced, the diagnostic accuracy needed to flatten the COVID-19 epidemic curve has not been reached yet (Li et al. 2020a, b; Jamshidi et al. 2020). The main objective of this paper is to introduce an accurate, fast, and reliable COVID-19 diagnosis strategy, called FCNB, which inherits the advantages of NB with several modifications.

Generally, the diagnosis of COVID-19 can be achieved using three different methodologies as depicted in Fig. 2. These three methodologies are (1) Real-Time reverse transcriptase-Polymerase Chain Reaction (RT-PCR) (Tahamtan and Ardebili 2020; Waller et al. 2020; Li et al. 2020a, d), (2) chest CT imaging scan (Mishra et al. 2020; Li et al. 2020a, e; Kovács et al. 2020), and (3) numerical laboratory tests (Brinati et al. 2020; Kukar et al. 2020; Cabitza et al. 2020; Qiu et al. 2020).
RT-PCR tests are fairly quick, sensitive, and reliable. The sample is collected from a person's throat or nose; chemicals are then added to remove any proteins, fats, and other molecules, leaving behind only the existing Ribonucleic Acid (RNA) (Huang et al. 2020). The separated RNA is a mixture of the person's RNA and the coronavirus's RNA, if it is present. Despite its popularity, the RT-PCR test suffers from the risk of false-negative and false-positive results (Chen et al. 2020a, b; Kasteren et al. 2020). Although several studies have observed that the sensitivity of chest CT in the diagnosis of COVID-19 is higher than that of RT-PCR, the American College of Radiology (ACR) has issued guidance that CTs and X-rays are not accurate tools for diagnosing COVID-19 (Gietema et al. 2020). There are three significant reasons for ACR's recommendation, which are; (1) Both chest CT and X-ray cannot accurately distinguish between COVID-19 and other respiratory infections. They can only point to signs of an infection, which could be due to other causes such as seasonal flu. (2) A large number of patients infected with COVID-19 have normal chest CTs, which wrongly convinces them that they are healthy. Such patients can easily spread the virus to others. (3) The usage of the imaging equipment on COVID-19 patients is a critical hazard for doctors and other patients. CT scanners are large, complex pieces of machinery (Gietema et al. 2020). They need to be carefully cleaned after each potential COVID-19 patient. However, even with precise cleaning, there is a high risk that the virus could remain on surfaces in the CT scanner room. Moreover, the movement of COVID-19 patients to and from a CT scanner room increases the risk of spreading the virus inside the healthcare system. On the other hand, the use of accurate Numerical Laboratory Tests (NLTs) can be considered as the most accurate method for diagnosing COVID-19.
Recently, the use of NLTs is the only method that the Centers for Disease Control (CDC) currently endorse. Hence; it makes perfect sense that the use of NLTs will provide more accurate diagnosis with less waiting time. The work in this paper is concentrated on providing a new COVID-19 diagnosis system based on NLTs, which have proven to be the most effective methodology for COVID-19 diagnosis. A new diagnosis strategy called FCNB will be introduced, which is based on the weighted Naïve Bayes algorithm with several modifications. Recently, there has been extensive research on COVID-19 patients detection. A Textual Clinical Reports Classification (TCRC) model was provided by Khanday et al. (2020) for detecting COVID-19, Severe Acute Respiratory Syndrome (SARS), Acute Respiratory Distress Syndrome (ARDS), and both (COVID-19, ARDS) by using different classical and ensemble machine learning methods. The experimental results showed that the logistic regression and multinomial Naïve Bayes provided the best results compared to other machine learning algorithms. Ozturk et al. (2020a) developed a Deep Learning (DL) model to detect COVID-19. The proposed model was implemented on the dataset that consists of three classes called; COVID-19, pneumonia, and normal X-ray imagery. This study passed through two main steps, which are; preprocessing step and the classification step. In the pre-processing step, the fuzzy coloring method was used to restructure the data classes and the structured images were stacked. In the classification step, deep learning models (MobileNetV2, SqueezeNet) were trained and then the social mimic optimization technique was used to obtain a set of efficient features. These efficient features were combined to provide the classification by using Support Vector Machines (SVM) as a classifier. The experimental results proved that the proposed classification model could efficiently detect the COVID-19 disease. Maghdid et al. 
(2020) introduced a Convolution Neural Network (CNN) model to detect COVID-19 cases based on chest X-ray and CT images dataset. The proposed CNN model contained two main algorithms called CNN architecture and AlexNet as a transfer-learning algorithm. Although the simplicity of this proposed model, its accuracy is not enough for the diagnosing of COVID-19 patients. The experimental results illustrated that the maximum accuracy of the utilized models was provided by using a pre-trained network, but the minimum accuracy was provided by using the modified CNN. Chen et al. (2020a, b) introduced a COVID-19 Diagnostic Model (CDM) based on radiological semantic and clinical features without the need for the nucleic acid test. The experimental results demonstrated the effectiveness of the proposed CDM technique for the diagnosing of COVID-19 cases in which CDM provided better diagnostic performance and more considerable net benefits. Waheed et al. (2020) proposed an Auxiliary Classifier Generative Adversarial Network (ACGAN) based GAN called CovidGAN to produce synthetic Chest X-Ray (CXR) images. The synthetic images generated from CovidGAN were utilized to enlarge the dataset and to enhance the performance of Convolutional Neural Network (CNN) for COVID-19 detection. The experimental results proved that the accuracy of the usage of CNN based on the synthetic images generated from CovidGAN was better than the accuracy of using CNN alone. Although the proposed detection model provided the best accuracy, it depended on a small dataset. Additionally, the quality of the synthetic samples needed to be improved by adding more labeled data, which increased the learning process of GAN. An Automatic COVID-19 Detection Model (ACDM) based on using the DarkNet model as a classifier was provided by Ozturk et al. (2020b) . The proposed ACDM method was used as a new detection method based on using chest X-ray images. 
This model represented the development of deep learning techniques to be able to perform both binary and multi-class classification. The experimental results demonstrated that ACDM was more effective on binary tasks than on multi-class tasks, as its binary accuracy was higher than its multi-class accuracy. Sun et al. (2020) presented an Adaptive Feature Selection guided Deep Forest (AFS-DF) model that classifies COVID-19 patients based on chest CT images. For learning a high-level representation of features, the AFS-DF method used a deep forest model. Based on the trained forest, an adaptive feature selection operation was used to decrease the redundancy of the features and improve the performance of the classification process. The experimental results showed that the AFS-DF model outperformed several existing methods, as it could efficiently classify COVID-19 cases based on CT images. Table 1 illustrates a comparative study of the previous efforts on COVID-19 patient detection methods.

No doubt, Naïve Bayes (NB) is a popular classifier that has been applied in several domains such as weather forecasting, bioinformatics, image and pattern recognition, and medical diagnosis. NB allows each feature to contribute towards the classification decision both equally and independently of other features. Although such simplicity increases computational efficiency, it sometimes makes NB insufficient for real-world conditions. Consider $F = \{f_1, f_2, f_3, \ldots, f_n\}$ to be the feature vector of a new item IC to be classified and $C = \{c_1, c_2, c_3, \ldots, c_m\}$ to be the set of target classes. The probability of a new item being in class $c_j$ using NB is given by (1) (Berrar 2018; Taha et al.
2013):

$$P(c_j \mid F) = \frac{P(F \mid c_j)\, P(c_j)}{P(F)} \tag{1}$$

where $P(c_j \mid F)$ is the conditional probability of class $c_j$ given the feature vector $F$ (also called the posterior probability), $P(F \mid c_j)$ is the conditional probability of the feature vector $F$ given the class $c_j$ (also called the likelihood), and $P(c_j)$ is the prior probability of class $c_j$. Since features are assumed to be independent given the class, this yields:

$$P(F \mid c_j) = \prod_{i=1}^{n} P(f_i \mid c_j)$$

Table 1 (continued) Comparative study of COVID-19 patient detection methods
CovidGAN (Waheed et al. 2020). Disadvantages: The dataset is small, and the quality of the synthetic samples needs to be improved by adding more labeled data, which increases the learning process of GAN.
Automatic COVID-19 Detection Model (ACDM) (Ozturk et al. 2020b). Description: The ACDM method was used as a new detection method based on chest X-ray images. This model represented a development of deep learning techniques to be able to perform both binary and multi-class classification. Advantages: Chest X-ray images are classified without using feature extraction techniques. An expert radiologist evaluates the heatmaps generated by the model, which focus on localizing effective areas on chest X-ray images. Disadvantages: The public COVID-19 image data are limited.
Adaptive Feature Selection guided Deep Forest (AFS-DF) model. Description: AFS-DF, based on chest CT images, was introduced to classify COVID-19 patients. For learning a high-level representation of features, the AFS-DF method used a deep forest model. Based on the trained forest, an adaptive feature selection operation was used to decrease the redundancy of features and improve the performance of the classification process. Advantages: The size of the dataset is large. Disadvantages: The features are extracted depending on prior knowledge in the current work. To enhance the performance, this can be done by using a deep learning model.

Substituting in (1) yields (2) (Jabeen et al. 2019):

$$P(c_j \mid F) = \frac{P(c_j) \prod_{i=1}^{n} P(f_i \mid c_j)}{P(F)} \tag{2}$$

Since the denominator in (2) remains constant for a given input for all target classes, it can be removed as illustrated in (3) (Zhang et al.
2021; Subramanian and Prabha 2020; Abellán and Castellano 2017):

$$c^{*} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{n} P(f_i \mid c_j) \tag{3}$$

However, the performance of NB is sometimes low due to the unrealistic assumption that all features are independent and equally important given the class value. The performance of NB can be increased by mitigating this assumption. Many improvements have been proposed to resolve this problem, including feature selection and feature weighting. Generally, feature selection can be applied to enhance the performance of the traditional Naïve Bayes classifier. Hence, the target class can be identified by (4) (Lee et al. 2011):

$$c^{*} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{n} P(f_i \mid c_j)^{w_i}, \qquad w_i \in \{0, 1\} \tag{4}$$

However, assigning an equal weight to all considered features does not reflect the nature of real-world applications. Accordingly, different weights can be assigned to each feature as a generalization of feature selection, as illustrated in (5) (Jiang et al. 2019):

$$c^{*} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{n} P(f_i \mid c_j)^{w_i}, \qquad w_i \in \mathbb{R}^{+} \tag{5}$$

As depicted in (5), unlike traditional NB, each feature $f_i$ has its own weight $w_i$, which can be a positive number representing the significance of the feature. However, both traditional and Weighted Naïve Bayes (WNB) classifiers are based mainly on probabilities, namely; the conditional probabilities of the input features given the considered target classes as well as the class prior probabilities. From another point of view, the performance of the WNB classifier can be promoted by complementing it with another heuristic besides the conditional and prior probabilities.

Given the rapid growth of COVID-19, the detection of this virus is an important process for healthcare organizations. Fast and accurate COVID-19 detection will help to decrease the alarming effect of this pandemic and will support designing good strategies and taking productive decisions (Shinde et al. 2020). As illustrated in Fig. 3, the FCNB strategy is composed of two stages, which are; (1) Pre-Processing Stage (P 2 S) and (2) Classification Stage (CS).
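To make the weighted decision rule in (5) concrete, the following minimal Python sketch scores each class in log space as log P(c_j) plus the weighted sum of log P(f_i | c_j) over nominal features. It is an illustration under stated assumptions, not the authors' implementation: the function names `train_nb` and `predict_wnb`, the toy records, and the Laplace smoothing term `alpha` are introduced here and do not come from the paper. Setting every weight to 1 gives the traditional rule in (3), and restricting the weights to 0 or 1 gives the feature-selection case in (4).

```python
import math
from collections import Counter, defaultdict


def train_nb(records, labels):
    """Count class frequencies and per-feature value frequencies (nominal data)."""
    class_counts = Counter(labels)
    cond_counts = Counter()      # (feature index, value, class) -> count
    values = defaultdict(set)    # feature index -> distinct values seen
    for rec, cls in zip(records, labels):
        for i, v in enumerate(rec):
            cond_counts[(i, v, cls)] += 1
            values[i].add(v)
    return class_counts, cond_counts, values


def predict_wnb(model, record, weights, alpha=1.0):
    """Return the class maximizing log P(c) + sum_i w_i * log P(f_i | c).

    alpha is a Laplace smoothing constant (an assumption of this sketch)
    that keeps unseen feature values from producing log(0).
    """
    class_counts, cond_counts, values = model
    total = sum(class_counts.values())
    best_cls, best_score = None, -math.inf
    for cls, n_cls in class_counts.items():
        score = math.log(n_cls / total)
        for i, v in enumerate(record):
            p = (cond_counts[(i, v, cls)] + alpha) / (n_cls + alpha * len(values[i]))
            score += weights[i] * math.log(p)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Hypothetical toy data: two nominal features per patient, True = COVID-19.
toy_records = [("low", "high"), ("low", "high"), ("high", "low"), ("high", "low")]
toy_labels = [True, True, False, False]
toy_model = train_nb(toy_records, toy_labels)
```

With this toy model, a record such as `("low", "low")` is classified differently depending on which feature the weights emphasize, which is the effect that feature weighting is meant to capture.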
During P 2 S, three main processes are performed on the collected data by applying data mining techniques to provide a meaningful pattern of data. These three processes are called Feature Selection Phase (FSP), Feature Clustering Phase (FCP), and Master Feature Weighting Phase (MFWP). Thus, P 2 S gives only the most informative data that enables the next stage which is called CS to detect early and accurately COVID-19 cases. On the other hand, during CS, fast and accurate COVID-19 diagnosis is provided by using Feature Correlation Naïve Bayes Phase (FCNBP) that uses a new weighted NB with many modifications. Finally, the FCNB strategy consists of four phases called FSP, FCP, MFWP, and FCNBP in which the first three phases are included in P 2 S while the last phase is presented in CS. In the next sections, there will be a detailed description of the P 2 S, CS stages, and a related discussion of the key algorithms. Data pre-processing plays an important role in providing fast, useful, and accurate decisions for detecting COVID-19 cases. Accordingly, the clinical features of this pandemic must be known and well understood. In P 2 S, three main phases called FSP, FCP, and MFWP are performed on the collected data to provide the most informative data that helps the detection method to quickly and accurately detect COVID-19 patients. The FSP as the first phase in P 2 S aims to select the most effective features on COVID-19 diagnosis. The FCP as the second phase in P 2 S aims to put the selected features into groups. Finally, the MFWP aims to assign a weight value to each master feature for the next COVID-19 classification stage. Usually, records of patients contain many features used to support the medical diagnosis. However, for the early COVID-19 detection task, not all of these features have the same importance. The performance of the diagnostic operation may rely on the selected features in all phases of the FCNB. 
Hence, the main objective of FSP is to eliminate the irrelevant features and select the best features before using the diagnostic model. Selecting the best features will improve the performance of the machine learning algorithm, decrease the processing time, increase the computational efficiency, minimize the storage requirement, and speed up the convergence of learning (Wosiak and Zakrzewska 2018; Saleh et al. 2016). In this paper, the methodology used to select the most effective subset of features for COVID-19 is Feature Selection based on Genetic Algorithm (FSGA). FSGA is a wrapper method used to select the most important features depending on specific evaluation metrics. Unlike classical selection methods, which search from a single point and can deal poorly with large search spaces, FSGA depends on GA, which can discover the global optimal solution and avoid being trapped in a local optimal solution (Sivanandam and Deepa 2008). To implement FSGA, consider that the feature set (F) of 'n' features can be expressed by $F = \{f_1, f_2, f_3, f_4, \ldots, f_n\}$, where the input training dataset of 'k' patients can be expressed by $I = \{I_1, I_2, I_3, I_4, \ldots, I_k\}$. Additionally, the testing dataset of 'q' patients can be expressed by $G = \{G_1, G_2, G_3, G_4, \ldots, G_q\}$. Accordingly, each training patient $I_j$ and testing patient $G_i$ can be expressed in an 'n' dimensional space of features. For the considered COVID-19 detection problem, it is important to use FSGA as a suitable feature selection methodology to reduce or eliminate the irrelevant features and thereby enhance the performance of the classifier. After extracting the features from laboratory tests for both COVID-19 patients and non-COVID-19 people, the collected dataset should be passed to FSGA for selecting the most effective features for COVID-19 cases. The FSGA depends on applying GA, as it is an optimization technique and adaptive search heuristic that follows the process of natural evolution.
The GA starts with a population of potential solutions, and then it employs the concept of survival of the fittest to generate solutions that are close to optimal according to the fitness function of an optimization problem (Saleh et al. 2016; Oluleye et al. 2014). Hence, FSGA begins with an initial population, which is a group of candidate solutions, or chromosomes, in which every chromosome is composed of a series of genes. The value '1' of a gene denotes that the feature is selected in the particular subset. Otherwise, the value '0' of a gene denotes that the feature is eliminated from the particular subset (Saleh et al. 2016; Kaviani and Dhotre 2017). Consider that a single chromosome has 'n' genes (i.e., the same number as the features in the dataset), hence; $F = \{f_1, f_2, f_3, f_4, \ldots, f_n\}$. Assume that n = 15 features; thus, a single chromosome can be represented as shown in Table 2. The three operators of FSGA (selection, crossover, and mutation) are applied to these chromosomes to produce a new generation of the population. These three operators are repeated until a termination condition has been satisfied. The accuracy of NB is used as the fitness function in FSGA for choosing the best chromosome, which includes the most effective features for COVID-19. The main objective of selecting the best subset of the features is to achieve the highest accuracy of the used COVID-19 detection model. Finally, the steps to implement FSGA are presented in Fig. 4. At first, the initial population (p) of FSGA is represented by many candidate solutions, which are called chromosomes. Each chromosome consists of genes; each gene represents a feature in the COVID-19 dataset. The presence or absence of a feature is determined by the value of the gene, where '1' means the feature is present, and '0' means the feature is eliminated.
Secondly, NB's accuracy, as the fitness function, is calculated for each chromosome (candidate solution) in p to provide a fitness value that indicates the goodness of the solution. The optimal solution is the solution that maximizes the fitness function. Based on the fitness values, the selection of parent members (chromosomes) for reproduction is done according to the probability of selection (p sel ). After that, the crossover between the parent members is done to produce the offspring according to the probability of crossover (p cross ). According to the probability of mutation (p mut ), the mutation is performed for each offspring. These steps are repeated, starting from selection, until the size of the next population equals the size of the initial population. If the termination condition is not satisfied, the previous steps will be repeated starting from the fitness evaluation. In the end, when the termination condition is satisfied, the chromosomes in the population will be evaluated as the final results by using only the fitness function. Then, the chromosome that provides the highest fitness value contains the best subset of features, denoted by the value '1'. The steps to implement FSGA are illustrated in Algorithm 1.

Generally, records of patients can be used to represent the data in supervised learning. Each record is described by a set of features. These features take one of two types, which are; "nominal" or "numeric" values. Nominal values represent members of an unordered set, while numeric values represent real numbers. The number of features considered for COVID-19 patients is 15 (n = 15), as described in Table 3. As an illustrative example, a nominal dataset of 25 patients, together with the features that describe them, is represented in Table 4. For simplicity, each patient in Table 4 has been described by a subset of the features presented in Table 3.
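The FSGA loop described above (fitness evaluation, selection, crossover, mutation, repeated until termination) can be sketched in Python as follows. This is a minimal sketch under several assumptions rather than the paper's Algorithm 1: selection is implemented as fitness-proportional (roulette-wheel) sampling, crossover is single-point, termination is a fixed number of generations, and the names `fsga`, `toy_fitness`, and `TARGET_MASK` are hypothetical. The default population size, crossover probability, and mutation probability follow the values the paper lists in Table 7. In the paper the fitness function is the accuracy of NB on the selected features; here a placeholder fitness stands in for it so that the block stays self-contained.

```python
import random


def fsga(fitness, n_features, pop_size=4, p_cross=0.93, p_mut=0.15,
         generations=20, seed=0):
    """Return the best 0/1 chromosome found; fitness must be non-negative."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection weights are proportional to fitness; the small constant
        # keeps the weights valid if every chromosome has zero fitness.
        weights = [fitness(c) + 1e-9 for c in pop]
        next_pop = []
        while len(next_pop) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)
            a, b = a[:], b[:]
            # Single-point crossover with probability p_cross.
            if n_features > 1 and rng.random() < p_cross:
                cut = rng.randint(1, n_features - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # Bit-flip mutation: each gene flips with probability p_mut.
            for child in (a, b):
                for g in range(n_features):
                    if rng.random() < p_mut:
                        child[g] = 1 - child[g]
                next_pop.append(child)
        pop = next_pop[:pop_size]
        candidate = max(pop, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


# Hypothetical placeholder fitness: fraction of genes agreeing with an
# arbitrary mask. In the paper this role is played by NB accuracy.
TARGET_MASK = [1, 0, 1, 1, 0, 1]


def toy_fitness(chromosome):
    hits = sum(1 for g, t in zip(chromosome, TARGET_MASK) if g == t)
    return hits / len(TARGET_MASK)
```

Calling `fsga(toy_fitness, n_features=6)` returns a list of six bits in which a '1' marks a selected feature. Replacing `toy_fitness` with a function that trains and scores a classifier on the selected columns turns the sketch into a wrapper method in the sense used in the text.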
This subset contains six features: Platelet Count (PC), White Blood Cell (WBC), Monocytes Count (MC), Aspartate aminotransferase (AST), Basophils Count (BC), and Lactate Dehydrogenase (LDH). Hence, the features in Table 4 are described in Table 5. Each patient has a class label that indicates one of the two target classes, "True" or "False": True indicates a COVID-19 patient and False indicates a non-COVID-19 person. The first 15 records in Table 4 represent the training dataset (k = 15), I = {I_1, I_2, I_3, I_4, …, I_15}, and the last 10 records represent the testing dataset (q = 10), G = {G_1, G_2, G_3, G_4, …, G_10}. To implement the NB classifier, it is essential to create the frequency distribution tables (also called "contingency tables") that capture the relationships between the features and the class categories (Huang et al. 2020; Saleh et al. 2016). Tables 6a-f are the frequency distribution tables for the considered "COVID-19" dataset; they are used to calculate the probabilities needed to apply the NB equation. FSGA is the feature selection method applied to the "COVID-19" dataset in Table 4 to choose the features most effective for detecting COVID-19 patients. The accuracy of the NB classifier, which can be calculated from the confusion matrix, is used as the fitness function to evaluate each chromosome in the FSGA population (Saleh et al. 2016; Visa et al. 2011). The assumptions made to implement FSGA on the considered "COVID-19" dataset are presented in Table 7.

Table 7 (rows recovered from the extracted text)
Population size (no. of chromosomes): 4
Probability of selection (P_sel): a different value for each selected chromosome
Probability of crossover (P_cross): 0.93
Probability of mutation (P_mut): 0.15
Chromosome size (C): 6 (no. of features, n)

Based on the previous assumptions, after employing FSGA, the steps followed in the first and the second iterations are illustrated in Figs. 5 and 6 respectively. Finally, the best subset of features according to the …

FCP is the second phase in the P2S and is used to cluster the features selected by the FSP into groups, where each group contains similar features. Clustering is a central analytical technique in data mining: it partitions the data into homogeneous groups based on similarity. Thus, data items in the same cluster are similar to each other and as different as possible from the items in other clusters (Bano and Khan 2018; Ayed et al. 2015). Clustering methods can be categorized into partitioning-based, model-based, density-based, hierarchical-based, and grid-based algorithms (Benabdellah et al. 2019). Clustering techniques are used in many applications such as pattern recognition, image processing, and disease detection. The similarity between data items can be measured by using distance metrics. Many distance functions are used to define the distance between items, such as Cosine similarity (Shirkhorshidi et al. 2015), Jaccard distance (Fletcher and Slam 2018), Manhattan distance (Pandit and Gupta 2011), and Euclidean distance (Dokmanic et al. 2015). Clusters are constructed so that the distance between any two data items in the same cluster is minimized and the distance between any two data items in different clusters is maximized (Zhu et al. 2019).
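For concreteness, the four measures named above can be written out directly. The sketch below uses their standard textbook definitions on plain Python tuples and sets; the function names are chosen here for illustration and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def jaccard_distance(a, b):
    """One minus the ratio of shared items to all items of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Note that cosine similarity grows with closeness while the other three shrink, so it has to be inverted or negated before it can be used where a distance is expected.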
Despite their simplicity, clustering techniques face many challenges, such as determining the number of clusters, selecting the centroid of each cluster, and choosing the similarity measure. Thus, it is essential to introduce a new clustering method that overcomes these challenges. In the FCP, there are three main steps to cluster the features: (1) construct actual clusters, (2) isolated feature assignment, and (3) feature weighting.

Figure text recovered for the steps of FCP (cf. Fig. 7):
Step 1: Select each feature to be a centroid. Then, determine its neighbours and construct the cluster according to n_th (the theoretical number of features in each cluster). Calculate ZHD for each cluster.
Step 2: Sort the features in ascending order of ZHD, then select the first feature in F_ordered to be a centroid and determine its neighbours. Then, delete its neighbours from the F_ordered set and update ZHD.
Step 3: Clusters are constructed, and f_1 and f_2 are isolated features. These isolated features need to be assigned to the nearest centroid or to a new dummy cluster.
Step 4: Determine the nearest centroid to f_1 and calculate the distance between f_1 and that centroid. Then, compare this distance with ZHD_max. Repeat this step for f_2.
Step 5 (feature assignment stage): f_10 is the nearest centroid to f_1 and the distance between them is less than ZHD_max; hence, f_1 belongs to C_1. f_6 is the nearest centroid to f_2 and the distance between them is less than ZHD_max; hence, f_2 belongs to C_3. Then, the master features are constructed.
Step 6: Assign a weight to each feature in the constructed Master Feature (MF).

Fig. 9 The steps of assigning each isolated feature to its nearest cluster or to a new dummy cluster

The largest radius of all clusters is called the maximum ZHD (ZHD_max). The theoretical number of features in each cluster is denoted by n_th, while the actual number of features in each cluster is denoted by n_act.
The set of features sorted in ascending order of ZHD is denoted by F_ordered. To assign the isolated features, the distance between each feature f and the nearest centroid c_i among all clusters is calculated by using the Euclidean distance D(f, c_i) (Dokmanic et al. 2015). The steps of constructing the clusters of similar features are illustrated in Algorithm 2. Accordingly, the main steps of FCP to construct the clusters of similar features are presented in Fig. 7. In Fig. 7, each cluster is represented as a big circle, while each feature is represented as a small circle or star. In the first step, called construct actual clusters, each feature in the feature set is considered as the centroid of a cluster, as illustrated in Fig. 8. Then, the distance between each centroid and the other features is calculated by using the Euclidean distance to determine its nearest neighbours (e.g., with 2 nearest neighbours, n_th = 3). For simplicity, assume that the Euclidean distance d(c, f) between the centroid c and one of its neighbouring features f is computed in two dimensions, with c(x_1, y_1) and f(x_2, y_2), by using (6) (Dokmanic et al. 2015; Liu et al. 2020).

\[ d(c, f) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{6} \]

where d(c, f) is the distance between the centroid c and one of its neighbours f, and x and y are the coordinates of the centroid and the feature. The distance is calculated between the centroid and all of the other features. The features at the smallest distances are the centroid's nearest neighbours (e.g., no. of nearest neighbours = 2). The ZHD of the cluster is then the largest distance between the centroid and its neighbours, as illustrated in the third step of Fig. 7. Next, actual clusters are constructed by placing the features in F_ordered in ascending order of their ZHD. After that, the feature that has the smallest ZHD is selected to be the centroid of the first actual cluster.
The centroid's neighbours are determined and assigned to its cluster, and then these neighbours are removed from F_ordered, as illustrated in the fourth step of Fig. 7. The same steps are repeated on the current F_ordered until all actual clusters have been constructed. Although actual clusters of similar features have been constructed, some isolated features, such as f_1 and f_2, do not belong to any actual cluster, as illustrated in the fifth step of Fig. 7. Thus, these isolated features need to be assigned to the nearest cluster, or a new dummy cluster needs to be constructed to include them. The next subsection describes how to assign the isolated features.

Fig. 10 The steps of implementing the proposed feature correlation methodology (figure text: "Advantages: it is the best methodology because it considers the correlation between the centroid and the features. Additionally, it considers the correlation between the features with each other.")

After creating all actual clusters, some isolated features remain. An isolated feature is a feature that does not belong to any actual cluster. Such a feature needs to be assigned to the nearest cluster, or a new dummy cluster needs to be constructed to include it (Arunadevi et al. 2019). The steps to solve this problem are shown in Fig. 9, which illustrates the assignment of the isolated features to the nearest cluster or to a new dummy cluster. At first, the largest radius of all actual clusters (ZHD_max) is determined. Then, the distance between each isolated feature and its nearest centroid is calculated by using the Euclidean distance and compared to ZHD_max. If the distance is greater than ZHD_max, the isolated feature constructs a new dummy cluster. Otherwise, the isolated feature belongs to the cluster of the nearest centroid (C_i).
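The assignment rule just described is small enough to state as code. The sketch below assumes that each feature is a point in a numeric space and that each actual cluster is given as a pair (centroid, members); the names `zhd` and `assign_isolated` and the sample coordinates are hypothetical and serve only to make the comparison against ZHD_max concrete.

```python
import math

def zhd(centroid, members):
    """Radius of a cluster: largest centroid-to-member Euclidean distance."""
    return max(math.dist(centroid, m) for m in members)

def assign_isolated(feature, clusters):
    """Return the index of the cluster that absorbs `feature`,
    or None when the feature must start a new dummy cluster.

    clusters: list of (centroid, members) pairs for the actual clusters.
    """
    zhd_max = max(zhd(c, members) for c, members in clusters)
    distances = [math.dist(feature, c) for c, _ in clusters]
    # min() keeps the first index on ties, i.e. one of the nearest centroids.
    nearest = min(range(len(clusters)), key=lambda i: distances[i])
    if distances[nearest] > zhd_max:
        return None
    return nearest

# Toy layout (made-up coordinates): two actual clusters with radii 2 and 1.
clusters = [((0, 0), [(1, 0), (0, 2)]),
            ((10, 10), [(11, 10), (10, 11)])]
```

With this layout ZHD_max is 2, so a feature at (1, 1) joins the first cluster, while a feature at (5, 5) is farther than ZHD_max from every centroid and forms a dummy cluster.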
If more than one centroid is at the same distance from the isolated feature, the isolated feature is assigned to any one of these nearest centroids. Finally, the Master Features (MFs) are constructed to represent the final clusters after the isolated features have been assigned to their corresponding clusters.

After the actual clusters have been constructed and the isolated features have been assigned to their corresponding clusters, the features should be weighted. Weighting features is a significant process in P2S because it can decrease complexity, increase the performance of the machine-learning algorithm, and increase the resource efficiency of the used classifier. Usually, many classification algorithms suppose that all features have the same importance (same weights) or neglect the consistency of the weights assigned to features. To solve this problem, it is important to calculate a weight for each feature such that the largest weights are assigned to the features most effective for detecting COVID-19. Hence, different features can have different levels of importance in class prediction (Arunadevi et al. 2019). The last step in the FCP is to calculate the weight of each feature in the constructed Master Feature (MF_i). The weight of each feature f_h can be calculated by using (7).

\[ W(f_h) = \text{Accuracy of classifier}(+f_h) - \text{Accuracy of classifier}(-f_h) \tag{7} \]

where W(f_h) is the weight value of feature f_h, Accuracy of classifier(+f_h) is the accuracy of the used classifier when the feature f_h is included in the feature set, and Accuracy of classifier(−f_h) is the accuracy of the used classifier when f_h is eliminated.

MFWP is the third and final phase in the P2S stage and is used to assign a weight to each MF. The correlation between features must be known before weights are assigned to them. Hence, it is essential to determine the correlation between features by using a suitable correlation method.
Correlation analysis is a well-known and widely used technique that identifies (1) the relationship between the features and the predicted class and (2) the relationship between the features themselves. Mathematically, the relationship between features can be expressed by a decimal value called the correlation coefficient. A positive coefficient indicates that the two features are positively correlated, a negative coefficient indicates negative correlation, and a value of '0' means no correlation (Li et al. 2016). In this paper, a feature correlation method based on distance measurement is proposed to calculate the relationship between the features of MF_i. The steps of implementing the proposed feature correlation methodology are illustrated in Fig. 10. Figure 10 shows three candidate feature correlation methods to measure the correlation between the features of MF_i. The first method measures the correlation as the multiplicative inverse of ZHD, but it does not take into consideration the correlation between features. For example, as shown in Fig. 10, if the ZHD of MF_i is 10 and the ZHD of MF_j is 5, then the correlation values of MF_i and MF_j are 0.1 and 0.2 respectively. This suggests that MF_j is more correlated than MF_i, but this is not correct. In the second method, the correlation is measured by the total distance between the centroid and all features of the master feature. The limitation of this method is that it does not take into consideration the correlation between the features themselves. In the third and final method, the correlation is measured by the total distance between the features of the master feature. This method considers the correlation between the features and the centroid as well as the correlation between the features themselves. Accordingly, the third method is the best of the three for measuring the correlation between features.
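The three candidate measures differ only in which distances they aggregate, which a short sketch makes explicit. It assumes that features are points and that a master feature is given as a centroid plus its member features; the function names are hypothetical. The text does not state how a total distance is turned into a final correlation value, so the second and third functions return the raw totals only.

```python
import math
from itertools import combinations

def inverse_zhd(centroid, members):
    """Method 1: multiplicative inverse of the cluster radius (ZHD)."""
    return 1.0 / max(math.dist(centroid, m) for m in members)

def centroid_total_distance(centroid, members):
    """Method 2: total distance from the centroid to every member feature."""
    return sum(math.dist(centroid, m) for m in members)

def pairwise_total_distance(centroid, members):
    """Method 3: total distance over every pair of features in the master
    feature, counting the centroid as one of the features."""
    points = [centroid, *members]
    return sum(math.dist(a, b) for a, b in combinations(points, 2))
```

For a master feature with centroid (0, 0) and members (3, 0) and (0, 4), the second measure sums 3 + 4, while the third also adds the member-to-member distance of 5, which is exactly the information the first two methods ignore.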
After implementing the third correlation method to calculate the relationship between the features of the MF, MFWP can be implemented to calculate the weights of the MFs. The NB classifier is a common classifier used in machine learning and data mining. NB is widely used for solving data classification problems because it is simple to train, easy to implement, and can provide fast and accurate predictions. However, it assumes that all features are conditionally independent and equally important, which often harms classification performance: in real-world applications, features do not have the same importance. To improve the performance of NB, many modified methods based on NB have been proposed. One of these modifications is to assign a weight value to each feature. In this paper, the weight of each master feature (MF_i) depends on three parameters: the number of features in this master feature, the correlation value between the features in MF_i, and the summation of the weight values of the features in MF_i. The weight of master feature MF_i can be calculated by using (8), where W(MF_i) is the weight of master feature MF_i, N_i is the number of features in MF_i, corr(MF_i) is the correlation value between the features in MF_i, and w_j is the weight value of each feature f_j that belongs to MF_i. After calculating the weights of all MFs, the weighted MFs are used in the next stage (CS) to implement the weighted NB in FCNBP. In the next section, FCNBP is explained in detail to classify COVID-19 patients by implementing the weighted NB classifier on the weighted MFs.

NB is known to be an effective, robust, and efficient classification algorithm.
NB is a promising solution because it requires only a small amount of training data to estimate the parameters required for classification and is able to accommodate new incoming training data both efficiently and incrementally. Although NB has received extensive attention due to its excellent classification performance and simplicity, its performance is sometimes degraded by the naïve assumption that features are independent and equally weighted. To improve on the performance of the traditional NB, a new classifier called Feature Correlated Naïve Bayes (FCNB) is proposed in this phase. The proposed FCNB enhances the performance of the traditional NB by clustering the selected features into groups called master features (MFs), in which each MF includes a set of dependent or related features. Moreover, each MF is weighted based on the importance of the features it includes as well as the correlation among them. FCNB operates just like weighted NB; however, it replaces the employed features with a set of constructed MFs, and it considers the weights of the used MFs rather than the weights of the individual features. This has a positive effect in (1) promoting the performance of the traditional weighted NB, as it considers the correlation among features, and (2) minimizing the classification time, as it considers a smaller number of MFs rather than many individual features. To explain how FCNB operates, consider a diagnosis database that includes Ca cases, of which A cases are infected with COVID-19 and B cases are not; hence, Ca = A + B. Consider s selected features labeled f_1, f_2, …, f_s, which are clustered into mm master features labeled MF_1, MF_2, MF_3, …, MF_mm, where s > mm. Like any supervised learning-based classifier, FCNB operates in two sequential phases, namely training and testing.
The training of the proposed FCNB is accomplished by constructing a Conditional Probability Table (CPT) for each master feature MF_i, as illustrated in Table 8, based on the input diagnosis database. As depicted in Table 8, for simplicity, it is assumed that MF_i includes three dependent features, namely f_x, f_y, and f_z, in which each feature takes the value 'L' or 'H', corresponding to "Low" or "High" respectively. Accordingly, MF_i has 8 distinct values labeled X_ij ∀ j ∈ {1, 2, 3, …, 8}. However, each MF can include more dependent features, and each feature can take one of many values rather than 'L' and 'H' only. For illustration, a feature may take a value V ∈ {VL, L, M, H, VH}, which indicates "Very Low", "Low", "Medium", "High", or "Very High" respectively. Table 8 illustrates the CPT of MF_i, in which the conditional probability of each value X_ij ∀ j ∈ {1, 2, 3, …, 8} of MF_i for each target class (T or F) is calculated from the input diagnosis database. It is assumed that the weight of MF_i is W_i, while the prior probabilities of the considered target classes are P(T) = A/Ca and P(F) = B/Ca.

On the other hand, the task during the testing phase of FCNB is to diagnose the input case, i.e., to indicate whether the case is infected with COVID-19 or not. Assume an input case IC that has the feature vector 〈f_1, f_2, f_3, …, f_{s-1}, f_s〉 with the corresponding values 〈L, L, H, …, H, L〉. Initially, the input features are clustered into the corresponding master features (MF_1, MF_2, …, MF_mm) with the corresponding values. Considering the CPT of each employed MF, it is easy to find the conditional probability for each value of the employed master features. Hence, the new case can be diagnosed by estimating the posterior probability that the case belongs to each class (T or F) as shown in (9).
\[ P(c_i \mid IC) \propto P(c_i) \prod_{j=1}^{mm} P(MF_j \mid c_i)^{W_j} \tag{9} \]

where P(c_i|IC) is the posterior probability that the case IC belongs to class c_i, P(c_i) is the prior probability of class c_i, P(MF_j|c_i) is the conditional probability of the master feature MF_j given the target class c_i, and W_j is the weight of MF_j. Considering two target classes (T and F), this yields (10) and (11) (Lee et al. 2011).

\[ P(T \mid IC) \propto P(T) \prod_{j=1}^{mm} P(MF_j \mid T)^{W_j} \tag{10} \]

\[ P(F \mid IC) \propto P(F) \prod_{j=1}^{mm} P(MF_j \mid F)^{W_j} \tag{11} \]

Finally, the target class for the input case IC can be calculated by using (12) (Ji et al. 2019).

\[ \text{Target}(IC) = \arg\max_{c \in \{T, F\}} P(c \mid IC) \tag{12} \]

This section gives an illustrative example showing how the diagnosis decision is taken in the Classification Stage (CS) of the proposed Feature Correlated Naïve Bayes (FCNB) classification strategy. As illustrated in Table 9, consider a COVID-19 diagnosis database for 100 persons, of whom 40 are infected with COVID-19 while the other 60 are not. For simplicity, consider 8 selected features labeled f_1, f_2, …, f_8, which are clustered into three master features labeled MF_1, MF_2, and MF_3, as well as two target classes, namely "True" and "False" diagnosis. The symbols 'L', 'M', and 'H' represent "low", "medium", and "high" respectively, while 'T' and 'F' represent a "true" or "false" diagnosis of the COVID-19 virus. The weight of each master feature is reported in the last row of Table 9. The conditional probability of each feature value given the different classes, as well as the prior probability of each class, are given in Tables 10, 11, and 12. Now, it is required to diagnose a new case IC that has the feature vector 〈f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8〉 with the corresponding values 〈L, L, H, H, L, L, H, L〉. Initially, the input features are clustered into the corresponding master features (MF_1, MF_2, MF_3) with the corresponding values. Considering the input values of the selected features, it is found that MF_1 = X_{1,2}, MF_2 = X_{2,5}, and MF_3 = X_{3,3}.
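The decision for this illustrative case can be checked numerically. The sketch below assumes the weighted product form in which each conditional probability is raised to the weight of its master feature and multiplied by the class prior; the numbers are the conditional probabilities, weights, and priors reported for this example in Tables 9 to 12, and the helper name `weighted_nb_score` is hypothetical.

```python
import math

def weighted_nb_score(prior, likelihoods, weights):
    """Unnormalized class score: prior * prod(P(MF_j | class) ** W_j)."""
    return prior * math.prod(p ** w for p, w in zip(likelihoods, weights))

# Weights of MF_1, MF_2, MF_3 as reported for the example.
weights = [0.42, 0.38, 0.59]

# Class T: prior 40/100 with P(X_{1,2}|T), P(X_{2,5}|T), P(X_{3,3}|T).
q_true = weighted_nb_score(0.4, [0.145, 0.077, 0.212], weights)
# Class F: prior 60/100 with P(X_{1,2}|F), P(X_{2,5}|F), P(X_{3,3}|F).
q_false = weighted_nb_score(0.6, [0.098, 0.143, 0.194], weights)

# The class with the larger score is the diagnosis.
label = "T" if q_true > q_false else "F"
```

Because the scores are compared only against each other, no normalization is needed. For cases with many master features, summing `w * math.log(p)` instead of multiplying powers avoids numerical underflow.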
From Tables 10, 11, and 12, it is easy to find the conditional probability for each value of the employed master features: P(X_{1,2}|T) = 0.145, P(X_{1,2}|F) = 0.098, P(X_{2,5}|T) = 0.077, P(X_{2,5}|F) = 0.143, P(X_{3,3}|T) = 0.212, and P(X_{3,3}|F) = 0.194. The weights of the employed master features are given at the bottom of Table 9: 0.42, 0.38, and 0.59 for MF_1, MF_2, and MF_3 respectively. Since the employed database has 40 persons infected with COVID-19 out of 100, the prior probabilities of the target classes T and F are 0.4 and 0.6 respectively. Hence, the new case can be diagnosed by estimating the posterior probability that the case belongs to each class (T or F) as shown below.

\[ QT = 0.4 \times (0.145)^{0.42} \times (0.077)^{0.38} \times (0.212)^{0.59} \approx 0.02687 \]

\[ QF = 0.6 \times (0.098)^{0.42} \times (0.143)^{0.38} \times (0.194)^{0.59} = 0.04104 \]

where QT indicates the degree of confidence that IC is infected with COVID-19 and QF indicates the degree of confidence that IC is not infected with COVID-19. Hence, since QT < QF, the input case IC is not infected with COVID-19.

In this section, the proposed FCNB classification strategy is evaluated. As mentioned in Sect. 6, FCNB consists of two main stages, P2S and CS. The P2S stage is composed of the first three phases of the FCNB strategy, namely FSP, FCP, and MFWP, while the CS stage contains FCNBP, which is the last phase of the FCNB strategy. Accordingly, the experiments follow an ordered sequence of steps. Firstly, the historically collected data on both COVID-19 patients and non-COVID-19 people are sent to FSP, which selects the meaningful features by using FSGA. Secondly, in FCP, the selected features are grouped into clusters according to their correlation. Then, MFWP assigns a weight value to each MF, which includes a set of dependent or related features, by using a new weight calculation method. Finally, the output of P2S is passed to FCNBP in CS to provide a fast and accurate diagnosis of COVID-19 patients by using the weighted NB classifier. Two main scenarios are followed to implement the proposed FCNB classification strategy. In the first scenario, FSGA is applied to select informative features from the COVID-19 dataset and is compared to other recent state-of-the-art feature selection methods. The main aim of the first scenario is to illustrate the effectiveness of FSGA against other methods. In the second scenario, the whole FCNB classification strategy is implemented to accurately diagnose COVID-19 patients. Our implementation is based on the COVID-19 dataset (Brinati et al. 2020). The dataset is divided into two sets, training and testing. The model is learned from the training data, and its performance is then measured on the testing data. Many tunable parameters are used in FSGA and FCP; these parameters and their implemented values are described in Table 13.

The COVID-19 dataset is a real dataset that is used to detect COVID-19 patients. It contains the results of routine blood tests collected from cases admitted to San Raffaele Hospital (Milan, Italy) (Brinati et al. 2020). Additionally, it contains personal information about the cases, such as age and gender (male or female). The total number of cases in this dataset is 207. The dataset is divided into training and testing sets, with 140 cases in the training data and 67 cases in the testing data. Two class categories are considered, COVID patients and Un-COVID people, as shown in Table 14. The distribution of the cases in the collected dataset according to "Age" and "Gender" is shown in Figs. 11, 12, 13.
During the next experiments, evaluation parameters such as accuracy, error, recall, and precision are calculated. Then, the F-measure and the micro-average and macro-average of precision and recall are measured. The confusion matrix, presented in Table 15, is used to calculate the values of these parameters. The formulas that summarize the confusion matrix are given in Table 16. Finally, the speed of the COVID-19 detection algorithms is measured in seconds.

The effectiveness of the proposed feature selection method, FSGA, is evaluated and compared with other existing approaches, namely FSJaya, MGOA (Sehgal et al. 2020), SDS (Shanthi and Rajkumar 2020), and ACO (Sowmiya and Sumitra 2020), by using the considered COVID-19 dataset. These feature selection approaches are described in Table 17. To prove the effectiveness of the feature selection method, the NB classifier is applied as a standard classifier (Rabie et al. 2019a, b, 2020; Ayyad et al. 2019). In Fig. 23, the run-time of FSGA is 10 s, which represents the highest speed, while SDS introduces the lowest speed with a run-time of 20 s. In the end, FSGA outperforms the other recent methods, namely FSJaya, MGOA, SDS, and ACO, because it can accurately select the most informative features with high speed.

In this section, the proposed FCNB strategy, which includes four phases, namely feature selection, feature clustering, master feature weighting, and classification, is evaluated. To ensure the effectiveness of the FCNB strategy, it is compared to some of the recently used COVID-19 classification strategies presented in Table 1: TCRC, DL, CNN, CDM, COVIDGAN (Waheed et al. 2020), ACDM (Ozturk et al. 2020b), and AFS-DF. The proposed FCNB classification strategy depends on several essential techniques that enable the classification model to provide fast and accurate classifications.
These essential techniques are FSGA, which is employed for selecting the best subset of features in FSP; the proposed clustering method in FCP, which groups the selected features into clusters; the proposed weighting method in MFWP, which weights each master feature; and the weighted NB classifier, which is applied to the weighted master features in FCNBP to accurately detect COVID-19 patients. Results are shown in Figs. 24, 25, 26, 27, 28, 29, 30, 31, 32, 33.

Table 16 (rows recovered from the extracted text)
Macro-average: (∑_{i=1}^{n} P_i)/n for precision; (∑_{i=1}^{n} R_i)/n for recall. The average of the precision and recall of the system over the n classes. It is used to know how the system performs overall across the sets of data.
Micro-average: (TP1 + TP2)/(TP1 + TP2 + FP1 + FP2) for precision; (TP1 + TP2)/(TP1 + TP2 + FN1 + FN2) for recall. Sums the individual true positives, false positives, and false negatives of the system for the various sets and then applies them to get the statistics. It can be a useful measure when the size of the dataset varies.
F-measure: 2 × (P × R)/(P + R). A metric that merges recall and precision into a single score that takes both properties into account.

Table 17 (rows recovered from the extracted text)
SDS (Shanthi and Rajkumar 2020): Shanthi and Rajkumar (2020) introduced a new wrapper feature selection method based on the SDS algorithm. The GLCM, together with Gabor filter features, was used to extract the radiomic features. Then, the most significant features were selected to differentiate between the classes efficiently and accurately. To accomplish the classification task, three classifiers were used: neural network, Naïve Bayes, and decision tree. An evaluation of the proposed method showed that the model reaches effective results and achieves better levels of performance than the other methods.
Ant Colony Optimization (ACO) algorithm (Sowmiya and Sumitra 2020): Sowmiya and Sumitra (2020) introduced an enhanced hybrid approach with a new feature selection method for providing accurate predictions. In the first step, the Cleveland dataset is pre-processed. Then, the ant colony algorithm is used to choose the most necessary features in the dataset to improve the prediction performance. The hybrid KNN (HKNN) uses the selected features for the classification task.

Figs. 24, 25, 26, 27 show that FCNB is better than the other recent methods, namely TCRC, DL, CNN, CDM, COVIDGAN, ACDM, and AFS-DF, because FCNB achieves the maximum accuracy and the minimum error. The results in Figs. 28, 29, 30, 31, 32 show that the highest macro-average precision is provided by FCNB, reaching 0.78 at a training size of 140 patients. On the other hand, the lowest macro-average precision is introduced by TCRC, with a value of 0.61 at the same training size. Additionally, the macro-average recall of FCNB is about 0.77, which is the highest value among the compared techniques, while the lowest is that of TCRC, with a value of 0.60 at a training size of 140 patients. FCNB gives the highest micro-average precision, 0.78 at the same training size, while DL gives 0.59, which is the lowest micro-average precision. FCNB provides a micro-average recall of 0.78, while TCRC, DL, CNN, CDM, COVIDGAN, ACDM, and AFS-DF provide 0.59, 0.61, 0.60, 0.64, 0.66, 0.67, and 0.71 respectively. The highest F-measure is introduced by FCNB, with a value of 0.76, while the lowest is introduced by TCRC, with a value of 0.59, at a training size of 140 patients. In Fig. 33, the run time of FCNB is 11 s, which represents the highest speed, while DL introduces the lowest speed with a run-time of 20 s. Finally, FCNB is better than the other recent techniques, namely TCRC, DL, CNN, CDM, COVIDGAN, ACDM, and AFS-DF.

It is very important to detect COVID-19 positive cases as early as possible to prevent the further spread of this pandemic and to quickly treat affected patients. In this paper, the FCNB classification strategy has been provided as a new COVID-19 diagnosis strategy to accurately diagnose COVID-19 patients with high speed. The FCNB strategy is built upon two essential stages, P2S and CS. P2S includes three essential phases: FSP, FCP, and MFWP. In FSP, the features most effective for detecting COVID-19 are selected by using the FSGA method. In FCP, the selected features are grouped into clusters called Master Features (MFs), in which each MF includes a set of related features. In MFWP, each MF is weighted based on the importance of the features it includes as well as the correlation among them. In CS, the weighted NB is implemented on the weights of MFs rather than the weights of individual features to provide a fast and accurate diagnosis. Experimental results have shown that the proposed FCNB strategy increases the performance of the traditional weighted NB as it considers the correlation among features. Additionally, FCNB minimizes classification time as it considers a small number of MFs rather than many individual features. In the future, we plan to apply the proposed FCNB strategy in fog on the COVID-19 dataset collected in the fog's cache server to provide fast diagnosis and to directly rehabilitate the infected people.
In fact, this would greatly reduce the burden on medical systems (e.g., hospitals), because fog relies on Internet of Things (IoT) sensors that can automatically measure body temperature and other symptoms, which helps maintain social distancing and prevent the infection from spreading.

Fig. 33 Run time of the different classification techniques

Improving the naive bayes classifier via a quick variable selection method using maximum of entropy
COVID-19 prediction and detection using deep learning
QoS provisioning framework for service-oriented internet of things (IoT)
Application of feature weighting for the intensification of data classification
Survey on clustering methods: towards fuzzy clustering for big data
Gene expression cancer classification using modified K-Nearest Neighbors technique
A survey of data clustering methods
Coronavirus (COVID-19) classification using CT images by machine learning methods
A survey of clustering algorithms for an industrial context
Bayes' theorem and naive bayes classifier
Detection of COVID-19 infection from routine blood exams with machine learning: a feasibility study
Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests
Clinical characteristics and intrauterine vertical transmission potential of COVID-19 infection in nine pregnant women: a retrospective review of medical records
A diagnostic model for coronavirus disease 2019 (COVID-19) based on radiological semantic and clinical features: a multi-center study
Machine learning for email spam filtering: review, approaches and open research problems
A Jaya algorithm based wrapper method for optimal feature selection in supervised classification
Euclidean distance matrices: essential theory, algorithms, and applications
Routine blood tests as a potential diagnostic tool for COVID-19
Comparing sets of patterns with the Jaccard index
CT in relation to RT-PCR in diagnosing COVID-19 in The Netherlands: a prospective study
Deep learning-based effective fine-grained weather forecasting model
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
An IoT based efficient hybrid recommender system for cardiovascular disease
Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment
A new weighted naive Bayes method based on information diffusion for software defect prediction
Class-specific attribute weighted naive Bayes
Diagnosis of coronavirus disease 2019 (COVID-19) with structured latent multi-view representation learning
Comparison of seven commercial RT-PCR diagnostic kits for COVID-19
Data management, analytics and innovation. Advances in intelligent systems and computing (1042)
Short survey on naive bayes algorithm
Machine learning based approaches for detecting COVID-19 using clinical text data
Optimization of feature selection using genetic algorithm in Naïve Bayes classification for incomplete data
The sensitivity and specificity of chest CT in the diagnosis of COVID-19
COVID-19 diagnosis by routine blood tests using machine learning
Machine learning algorithms for wireless sensor networks: a survey
Calculating feature weights in naive bayes with Kullback-Leibler measure
Applications of machine learning to machine fault diagnosis: a review and roadmap
Feature selection based on multiple correlation measures for medical examination dataset
Stability issues of RT-PCR testing of SARS-CoV-2 for hospitalized patients clinically diagnosed with COVID-19
Laboratory diagnosis of coronavirus disease-2019 (COVID-19)
Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy
Trend and forecasting of the COVID-19 outbreak in China
Chest CT imaging characteristics of COVID-19 pneumonia in preschool children: a retrospective study
Niching particle swarm optimization based on Euclidean distance and hierarchical clustering for multimodal optimization
Diagnosing COVID-19 pneumonia from X-ray and CT images using deep learning and transfer learning algorithms
Identifying COVID19 from chest CT images: a deep convolutional neural networks based approach
A genetic algorithm-based feature selection
COVID-19 detection using deep learning models to exploit Social Mimic Optimization and structured chest X-ray images using fuzzy color and stacking approaches
Automated detection of COVID-19 cases using deep neural networks with X-ray images
A comparative study on distance measuring approaches for clustering
Clinical characteristics, laboratory outcome characteristics, comorbidities, and complications of related COVID-19 deceased: a systematic review and meta-analysis
A new strategy of load forecasting technique for smart grids
A fog based load forecasting strategy for smart grids using big electrical data
A new outlier rejection methodology for supporting load forecasting in smart grids based on big data
A fog based load forecasting strategy based on multi-ensemble classification for smart grids
COVID-19 future forecasting using supervised machine learning models
A data mining based load forecasting strategy for smart electrical grids
Optimized grass hopper algorithm for diagnosis of Parkinson's disease
A new COVID-19 Patients Detection Strategy (CPDS) based on hybrid feature selection and enhanced KNN classifier
Lung cancer prediction using stochastic diffusion search (SDS) based feature selection and machine learning methods
Forecasting models for coronavirus disease (COVID-19): a survey of the state-of-the-art
Comparison study on similarity and dissimilarity measures in clustering continuous data
Introduction to genetic algorithms
A hybrid approach for mortality prediction for heart patients using ACO-HKNN
Customer behavior analysis using Naive Bayes with bagging homogeneous feature selection approach
Adaptive feature selection guided deep forest for COVID-19 classification with chest CT
Naive Bayes-guided bat algorithm for feature selection
Real-time RT-PCR in COVID-19 detection: issues affecting the results
Attribute weighted Naive Bayes classifier using a local optimization
Confusion matrix-based feature selection
CovidGAN: data augmentation using auxiliary classifier GAN for improved Covid-19 detection
Diagnostic tools for coronavirus disease (COVID-19): comparing CT and RT-PCR viral nucleic acid testing
A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis
Integrating correlation-based feature selection and clustering for improved cardiovascular disease diagnosis
Toward naive Bayes with attribute value weighting
Attribute and instance weighted naive Bayes
Early prediction of the 2019 novel coronavirus outbreak in the Mainland China based on simple mathematical model
A new unsupervised feature selection algorithm using similarity-based feature clustering
Coronavirus disease 2019 (COVID-19): a perspective from China
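As a supplementary note on the macro- and micro-averaged precision and recall reported in the results, the following is a minimal sketch of how these averages are conventionally computed for single-label classification. It is not the authors' evaluation code: the function name `averaged_precision_recall` and the toy labels are assumptions made for illustration only, and a non-empty label list is assumed.

```python
from collections import Counter


def averaged_precision_recall(y_true, y_pred):
    """Macro- and micro-averaged precision/recall for single-label data.

    Hypothetical helper for illustration; assumes non-empty inputs.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def ratio(num, den):
        # Treat an undefined ratio (no predictions or no samples) as 0.0.
        return num / den if den else 0.0

    # Macro: compute the metric per class, then take the unweighted mean.
    macro_p = sum(ratio(tp[c], tp[c] + fp[c]) for c in labels) / len(labels)
    macro_r = sum(ratio(tp[c], tp[c] + fn[c]) for c in labels) / len(labels)

    # Micro: pool the counts over all classes, then compute the metric once.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return {
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        "micro_precision": ratio(tp_all, tp_all + fp_all),
        "micro_recall": ratio(tp_all, tp_all + fn_all),
    }


# Toy labels invented for illustration (not data from the paper).
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos"]
scores = averaged_precision_recall(y_true, y_pred)
```

Macro averaging gives every class equal weight regardless of its size, whereas micro averaging pools all decisions. When each sample has exactly one true and one predicted label, every misclassification counts as one false positive and one false negative, so micro-averaged precision and recall coincide with accuracy.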