key: cord-0832733-xz04qre6 authors: Islam, Md. Robiul; Nahiduzzaman, Md. title: Complex features extraction with deep learning model for the detection of COVID19 from CT scan images using ensemble based machine learning approach date: 2022-02-04 journal: Expert Syst Appl DOI: 10.1016/j.eswa.2022.116554 sha: 9c540ced4853647fe43e6af2e070a15e749436c4 doc_id: 832733 cord_uid: xz04qre6

The novel Coronavirus disease (COVID19) is a highly infectious disease that has had a devastating effect on public health in more than 200 countries. Since the detection of COVID19 using reverse transcription-polymerase chain reaction (RT-PCR) is time-consuming and error-prone, Computed Tomography (CT) images offer an alternative means of detection. In this paper, Contrast Limited Adaptive Histogram Equalization (CLAHE) was applied to CT images as a preprocessing step to enhance image quality. We then developed a novel Convolutional Neural Network (CNN) model that extracted 100 prominent features from a total of 2482 CT scan images. These extracted features were then fed to various machine learning algorithms: Gaussian Naive Bayes (GNB), Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), and Random Forest (RF). Finally, we proposed an ensemble model for COVID19 CT image classification. We also compared its performance with state-of-the-art methods. Our proposed model outperforms the state-of-the-art models and achieved accuracy, precision, and recall scores of 99.73%, 99.46%, and 100%, respectively.

At the end of December 2019, the catastrophic Coronavirus Disease (COVID19), a respiratory disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first observed in Wuhan, China Ai et al. (2020); Jaiswal et al. (2020). To date, 220 million people have been infected by COVID19 worldwide, and among them, 97,757 people are in critical condition.
Almost 2 million people have died in the last two years around the world Worldometer (2021). The initial symptoms of COVID19 are dry cough, loss of taste, fever, headache, diarrhoea, shortness of breath, sore throat, tiredness, and mild to moderate respiratory illness Singhal (2020). Initially, medical experts use these symptoms to judge whether a patient may be infected with COVID19. When a person has some of these symptoms, medical experts examine them and perform tests such as a CT scan or chest X-ray. Finally, doctors or medical experts diagnose COVID19 from these tests Mohamud et al. (2020). The disease can spread when an infected patient sneezes or coughs: the virus travels through the air and is transmitted to another person through the nose or mouth. After infection, it may take 5 to 6 days for symptoms to appear Tang et al. (2020). Recovery may be easy when the disease is detected early. Still, people who have chronic respiratory disease, heart disease, diabetes, etc., may have difficulty recovering Ahuja et al. (2021). The disease may be more life-threatening for older people than for younger people. Since the virus is transmitted from an infected patient to a healthy person, the only way to stop it is to quarantine the infected person.

COVID19 can be detected from respiratory samples by using real-time RT-PCR Wang et al. (2020a). However, detection using RT-PCR is very time-consuming, i.e. it takes around 4 to 6 hours to process the samples Pathak et al. (2020). It also gives error-prone results, i.e. high false-negative rates Shah et al. (2020); Zu et al. (2020). Because of these drawbacks, RT-PCR-based COVID19 detection makes it challenging to prevent the spread of infection.
An alternative solution is to detect SARS-CoV-2 using radiological imaging methods such as CT scans or chest X-ray images Xie et al. (2020); Singh et al. (2020). Using these techniques, COVID19 patients can be detected quickly and quarantined in time, which helps to overcome this critical situation. However, chest X-ray images cannot resolve soft tissues Tingting et al. (2019). Chest CT scans handle this problem because they can discriminate soft tissues accurately Jaiswal et al. (2020). A radiology expert is required to detect COVID19 from chest CT scan images, but this takes a lot of time and may be error-prone. Hence it is necessary to design a decision support tool based on Krizhevsky et al. (2017). For this reason, the main focus of our work is the automatic detection of COVID19 patients from chest CT scan images using a CNN. Overall, the paper makes the following contributions:
• Enhanced the quality of the CT scan images using CLAHE.
• Built a novel CNN for extracting the most relevant features from the CT scan images.
• Proposed a soft voting ensemble learning model that improves classification performance over previous works in terms of accuracy, precision, recall, and AUC.

The next section describes previous works in this field. Section 3 presents the proposed architecture of our model. The performance analysis is presented in section 4. Finally, section 5 draws the main conclusions of this paper.

Several research works and studies have been performed to detect COVID19 patients from chest CT scan images in the last two years. Zhang et al. (2021) proposed a model that combined DenseNet with the optimization of transfer learning setting (OTLS) strategy. They achieved an accuracy of 96.30 ± 0.56 and specificity of 96.25 ± 1. Wang et al.
(2021) proposed a framework that achieved better performance. Firstly, pre-trained models (PTMs) were utilized to learn features, and a unique (L, 2) transfer feature learning approach was suggested to extract them. Secondly, they introduced a pre-trained network selection approach for fusion to choose the best two models defined by PTM and NLR. Thirdly, discriminant correlation analysis was developed to help fuse the two features from the two models via deep chest CT (CCT) fusion. They achieved the best sensitivity, [...]

Deep learning has been used extensively in medical imaging in recent decades. In this work, we collected CT scan images of COVID19 patients. The images were not clear, so we preprocessed the data using various methods. Then we developed a novel deep CNN for extracting the most discriminant features from the images. After extracting the features, we preprocessed them and then applied several well-known machine learning algorithms: GNB, SVM, DT, LR, and RF. In addition to these algorithms, a voting ensemble-based approach was used to make the final prediction. The main idea of this voting approach is that the errors of an individual algorithm can be reduced by merging the individual decisions through a majority voting scheme Maclin (2016); Polikar & Polikar (2006). Finally, we enhanced the overall performance of these algorithms using this ensemble method. Figure 1 shows our proposed model for detecting COVID19 from CT scan images.

Image preprocessing is an important step towards better results. Various methods have been developed so far for enhancing medical images. We utilized CLAHE for image enhancement. CLAHE was originally developed for enhancing low-contrast medical images Pisano et al. (1998). Clipping the histogram at a user-defined value, called the clip limit, restricts the amplification in CLAHE.
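To make the clipping step concrete, here is a minimal sketch of clipping a single tile's histogram and redistributing the excess counts evenly, which is the core of the contrast limiting in CLAHE. The function name `clip_histogram` and the toy histogram are hypothetical, and the clip limit is treated as an absolute pixel count for simplicity, whereas image libraries typically express it relative to the tile size. This is an illustration of the idea, not the authors' implementation.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the clipped excess over all bins."""
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # Every bin receives an equal share; the first `remainder` bins get one extra,
    # so the total number of pixels in the tile is preserved.
    return [c + share + (1 if i < remainder else 0) for i, c in enumerate(clipped)]


# Toy 8-bin histogram of one tile (40 pixels) with a single dominant bin.
hist = [0, 2, 30, 4, 0, 0, 2, 2]
clipped = clip_histogram(hist, clip_limit=8)
print(clipped)  # [3, 5, 11, 7, 3, 3, 4, 4]
```

The dominant bin drops from 30 to 11, so the cumulative histogram used for equalization rises less steeply there, which is what limits contrast (and noise) amplification. In this single-pass version a bin can end slightly above the clip limit after redistribution.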
The clipping level controls how much noise in the histogram should be smoothed and, as a result, how much contrast should be increased. We used a colour version of CLAHE with a clip limit of 2.0 and a tile grid size of (8 x 8):
• First, we converted our RGB image into a LAB image.
• After that, we applied the CLAHE method to the L channel.
• Then we merged the enhanced L channel with the A and B channels to get an enhanced LAB image.
• Finally, the enhanced LAB image was converted back into an enhanced RGB image.

After that, all the images were resized to (224 x 224 x 3), as the images in the dataset come in various resolutions. Finally, we normalized each image. Figure 2 shows some original CT scan images and the corresponding images enhanced with the CLAHE method.

Feature engineering is the most critical part of classification. Feature extraction using image processing techniques is error-prone and tedious. As the features of COVID19 in CT scan images are complex, we used a deep convolutional neural network to extract 100 prominent features for COVID19 identification. Figure 3 shows our deep CNN model for extracting the features. We used four convolution layers, followed by batch normalization and max-pooling layers. After about 100 epochs, our model had learned the parameters for COVID19 classification. During training, the learning rate was 0.001, 'Adam' was used as the optimizer, and dropout with a probability of 0.50 was applied to obtain more generalized results. After completing the learning process, we passed our dataset through the trained CNN model. We extracted the 100 prominent features [...] neurons. Table 1 shows the summary of our CNN model.

Feature scaling brings the data's independent features into a normalized range.
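As an illustration of feature scaling, the sketch below rescales each feature column to the range [0, 1] (min-max scaling). The text does not state which scaler was applied to the 100 extracted features, so min-max scaling, the helper names `min_max_scale` and `scale_features`, and the toy values are all assumptions.

```python
def min_max_scale(column):
    """Rescale one feature column linearly so its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def scale_features(rows):
    """Scale every column of a row-major feature matrix independently."""
    columns = list(zip(*rows))
    scaled_columns = [min_max_scale(col) for col in columns]
    return [list(row) for row in zip(*scaled_columns)]


# Two toy features with very different magnitudes.
features = [[0.5, 120.0], [1.5, 80.0], [1.0, 100.0]]
scaled = scale_features(features)
print(scaled)  # [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
```

In practice the minimum and maximum would be computed on the training features only and reused for the test features, so that no information leaks from the test set.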
It is done during data pre-processing to deal with highly varying magnitudes.

Naive Bayes classifier is a probabilistic machine learning model used for binary (two-class) and multi-class classification problems. The classifier is based on Bayes' theorem:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

Here $y$ is the class variable and $X = (x_1, x_2, \ldots, x_n)$ represents the parameters/features. The Naive Bayes classifier assumes that the attributes are independent of each other, so

$$P(y \mid X) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(x_1)\,P(x_2) \cdots P(x_n)}$$

In the case of Gaussian Naive Bayes, the conditional probability comes from a normal distribution:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ within class $y$.

SVM is a supervised machine learning algorithm used for classification and regression problems. It performs classification by finding the hyperplane that best separates the classes, which it does by maximizing the margin. With the kernel trick, a kernel function transforms a low-dimensional input space into a higher-dimensional space, i.e. it converts a non-separable problem into a separable one. This is primarily helpful for non-linearly separable problems. We used the sigmoid kernel function.

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The root node is at the very top of the tree. The tree learns to partition the data based on attribute values, and it does so recursively, a process called recursive partitioning. This flowchart-like form assists decision making, and its visualization closely mirrors human reasoning. As a result, decision trees are easy to understand and interpret.

Logistic Regression is a widely used mathematical method for predicting binary outcomes (y = 0 or 1).
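Logistic regression turns a linear score into a probability through the logistic (sigmoid) function and then thresholds that probability to obtain a binary label. The following is a minimal sketch of that idea; the helper names, the weights, and the 0.5 threshold are illustrative assumptions rather than values from this work.

```python
import math


def logistic(z):
    """Standard logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability of class y = 1 for the feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)


# Hypothetical weights chosen so that the linear score is exactly zero.
p = predict_proba(weights=[2.0, -1.0], bias=0.0, x=[1.0, 2.0])
label = 1 if p >= 0.5 else 0
print(p, label)  # 0.5 1
```

A score of zero lies exactly on the decision boundary, so the probability is 0.5; large positive scores approach 1 and large negative scores approach 0.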
Linear regression is helpful for forecasting continuous-valued outcomes, whereas logistic regression is suitable for categorical outcomes (binomial/multinomial values of y). It relies on the standard logistic function, an S-shaped curve given by the equation:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Random forest is a supervised learning algorithm. It creates a "forest" out of a set of decision trees, which are typically trained using the "bagging" technique. The general premise of bagging is that combining several learning models improves the final outcome. A random forest builds multiple decision trees during training and outputs either the mode of the predicted classes or the average of the individual trees' predictions Ho (1995). Random forests can mitigate the overfitting of decision trees to the training data Hastie et al.

The ensemble model is created by strategically combining base models to create a robust model. It employs a mixture of learning algorithms to solve a classification/regression problem that cannot be solved easily by any of the individual models. Ensemble learning can achieve better performance than any single model Wolpert (1992). Here we used soft voting ensemble learning. First, we trained the base models (GNB, SVM, DT, LR, and RF) using the training data. After training, we tested the models' performance using the test data, where each model gave an individual prediction. These predictions act as additional input to our ensemble, which acts as a combined model that makes the final prediction. Figure 4 shows our proposed ensemble learning model.

We collected the SARS-CoV-2 CT scan dataset from Kaggle PlamenEduardo.

The main idea behind a machine learning algorithm is that we first need to train it using some CT scan images, called training data.
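As a rough illustration of the training/testing idea, the sketch below performs a shuffled hold-out split using only the standard library. The function name `hold_out_split`, the seed, and rounding the test size up are assumptions, not details taken from the authors' code. With the 15% test fraction that this paper reports for its 2482 images, rounding up gives 373 test and 2109 training images, matching the counts the paper states.

```python
import math
import random


def hold_out_split(items, test_fraction, seed=0):
    """Shuffle the items and hold out a fraction of them as the test set."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = math.ceil(test_fraction * len(items))
    return items[n_test:], items[:n_test]


# Image identifiers stand in for the 2482 CT scans.
image_ids = range(2482)
train_ids, test_ids = hold_out_split(image_ids, test_fraction=0.15)
print(len(train_ids), len(test_ids))  # 2109 373
```

Because the test images never appear in the training set, scores measured on them estimate how the model behaves on unseen scans.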
After that, to measure the performance of our model, we need some new CT scan images, called test data, that have not been used for training. Using this test data, we can evaluate the effectiveness of our model. We divided the 2482 CT scan images into training and testing sets, using 15% of the images for testing and 85% for training. Table 2 shows our data splitting.

To perform a quantitative analysis of the machine learning algorithms, we used accuracy, precision, recall, F1-score, and AUC. With $P$ denoting precision and $R$ denoting recall, the F1-score is

$$F_1\text{-Score} = \frac{2 \cdot P \cdot R}{P + R} \tag{8}$$

The experiments were run in the Pycharm Community Edition19 (2020.2.3x64) software. All the machine learning models were implemented using Keras with TensorFlow as a backend. The training and testing phases were performed on a 64-bit Windows 10 Pro operating system with 32GB RAM, an NVIDIA GeForce GTX 1650 SUPER 4 GB GPU, and an Intel(R) Core(TM) i7-6700 CPU @3.40GHz. The code is available in the GitHub repository: https://github.com/robiulRUET/COVID19Detection2.

We trained our machine learning algorithms (GNB, SVM, DT, LR, and RF) using 2109 CT scan images of COVID19 and non-COVID19 infected patients. Then we tested all these models using 373 CT scan images, of which 185 were from COVID19 and 188 from non-COVID19 infected patients. Finally, we applied a soft voting ensemble-based approach to make the final detection. We used a confusion matrix for each model to evaluate its robustness by determining the accuracy, precision, recall, F1-score, and AUC. In the medical sector, recall should be maximized because a patient who has been infected by COVID19 must be correctly detected as COVID19. Figure 6 shows the confusion matrix of each model. The average accuracies of the GNB, SVM, DT, RF, LR, and ensemble models are 99.73%, 99.73%, 98.43%, 99.73%, 99.73% and 99.73%, respectively. Table 3 shows the classification performance measures of all models.

This section shows how well our ensemble model performs compared with previous works in this field.
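The evaluation measures used in this paper all follow from the four counts of a binary confusion matrix. The sketch below computes them; the counts are not read from Figure 6 but are inferred so that they are consistent with the reported test set (185 COVID19 and 188 non-COVID19 images) and the reported ensemble scores, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Inferred counts: all 185 COVID19 scans detected, 1 of 188 non-COVID19 scans misclassified.
accuracy, precision, recall, f1 = classification_metrics(tp=185, fp=1, fn=0, tn=187)
print(round(accuracy * 100, 2), round(precision * 100, 2), round(recall * 100, 2))
# 99.73 99.46 100.0
```

With these counts, a single false positive out of 373 test images accounts for the gap between 100% recall and 99.46% precision, and the F1-score evaluates to about 99.73%.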
We have already described the details of the previous works in section 2. Jaiswal et al. (2020) 99.79%, 100%, 99.59%, and 99.80% respectively.

In this work, we used binary classification to identify COVID19 and non-COVID19 [...]

Deep transfer learning-based automated detection of COVID-19 from lung CT scan slices
Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
Explainable COVID-19 detection using chest CT scans and deep learning
Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: Classification and segmentation
SARS-CoV-2 CT-scan dataset: A large dataset of real patients CT scans for SARS-CoV-2 identification. medRxiv
The elements of statistical learning. Springer Series in Statistics
Deep transfer learning based classification model for COVID-19 disease
Contrast limited adaptive histogram equalization image processing to improve the detection of simulated spiculations in dense mammograms
SARS-CoV-2 CT-scan dataset
Ensemble based systems in decision making
Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation
COVID-DenseNet: A deep learning architecture to detect COVID-19 from chest radiology images
Diagnosis of COVID-19 using CT scan images and deep learning techniques. medRxiv
Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks
A review of coronavirus disease-2019 (COVID-19). The Indian Journal of Pediatrics
Deep neural networks for object detection
Laboratory diagnosis of COVID-19: current issues and challenges
Three-stage network for age estimation
COVID-19 classification by CCSHNet with deep fusion using transfer learning and discriminant correlation analysis. Information Fusion
Detection of SARS-CoV-2 in different types of clinical specimens
Contrastive cross-site learning with redesigned net for COVID-19 CT classification
Stacked generalization
Chest CT for typical coronavirus disease 2019 (COVID-19) pneumonia: relationship to negative RT-PCR testing
a CT image dataset about COVID-19
COVID CT-Net: Predicting COVID-19 from chest CT images using attentional convolutional network
COVID-19 diagnosis via DenseNet and optimization of transfer learning setting
Coronavirus disease 2019 (COVID-19): a perspective from China

Md Robiul Islam and Md Nahiduzzaman contributed equally to this work.