key: cord-0956832-59ghorzf
authors: Kumar, R.; Arora, R.; Bansal, V.; Sahayasheela, V. J.; Buckchash, H.; Imran, J.; Narayanan, N.; Pandian, G. N.; Raman, B.
title: Accurate Prediction of COVID-19 using Chest X-Ray Images through Deep Feature Learning model with SMOTE and Machine Learning Classifiers
date: 2020-04-17
journal: nan
DOI: 10.1101/2020.04.13.20063461
sha: 5da3f835dd8a4e112a76d21db78a85920453478a
doc_id: 956832
cord_uid: 59ghorzf

According to the World Health Organization (WHO), the coronavirus (COVID-19) pandemic is putting even the best healthcare systems across the world under tremendous pressure. Early detection of this virus will help relieve the pressure on healthcare systems. Chest X-rays have played a crucial role in the diagnosis of diseases like Pneumonia. As COVID-19 is a respiratory infection that can progress to pneumonia, it may also be diagnosed using this imaging technique. With rapid development in Machine Learning (ML) and Deep Learning, intelligent systems have been built to classify Pneumonia and Normal patients. This paper proposes machine learning-based classification of deep features extracted using ResNet152 from chest X-ray images of COVID-19 and Pneumonia patients. SMOTE is used to balance the imbalanced data points of COVID-19 and Normal patients. This non-invasive and early prediction of novel coronavirus (COVID-19) by analyzing chest X-rays can further be used to predict the spread of the virus in asymptomatic patients. The model achieves an accuracy of 0.973 with Random Forest and 0.977 with XGBoost predictive classifiers. Establishing such an approach will be useful for predicting the outbreak early, which in turn can help control it effectively.

, commonly referred to as coronavirus disease-19, has emerged as a fatal SARS infection [1] over the past few months.
With its place of origin identified as Wuhan, China, the disease has now been declared a 'pandemic' by the World Health Organization (WHO) [2-4]. The symptoms of the disease resemble those of a typical viral respiratory infection, with an incubation time of 2-14 days. However, as the disease progresses, affected patients experience shortness of breath and nausea, culminating in pneumonia and multiple organ failure [3, 5]. Given how deadly and debilitating the condition is, the entire world is now on high alert, with thousands of cases being identified each day. According to WHO statistics, as of now there have been about 1.84 million identified cases and over 113,329 deaths [6]. Country-wise COVID-19 case statistics (top 10, India) as of April 12, 2020 are presented in Table 1 [7]. According to WHO [8], the geographical distribution of the confirmed COVID-19 cases registered as of April 12, 2020 is depicted in Figure 1. International travel history and close contact with the infected have been identified as the causes of the worldwide spread of the disease. An enormous amount of effort is being put into developing vaccines and drugs to help treat the infection [9, 10]. India currently stands at 9,205 confirmed cases and 331 deaths. A 21-day complete lockdown has been imposed in India as a measure to prevent further transmission, but whether India is coping with the disease remains a big question. While these small numbers of confirmed cases and deaths may bring a sigh of relief, they also call into question the state of medical testing and diagnosis in India [11]. Whether these numbers are genuinely low or are low because of insufficient testing remains unanswered [12]. In these times of emergency, researchers and scientists have made considerable progress in helping society by applying machine learning (ML) and deep learning (DL) algorithms to identify the virus in humans.
Recently, several classical image processing and machine or deep learning methods have been used to automatically classify diseases from digitized chest X-ray images [13, 14]. Classifying cases into coronavirus disease 2019 (COVID-19) and non-COVID viral infection from X-ray scans is regarded as one of the critical problems in diagnosing this highly infectious disease [15-17]. For example, chest X-ray images of a COVID-19, a Normal, and a Pneumonia patient are shown in Figure 4. Fast COVID-19 recognition can help control disease transmission and monitor the progression of the infection. According to Tao et al. [18], chest CT appears to be more sensitive for COVID-19 diagnosis than the initial reverse-transcription polymerase chain reaction (RT-PCR) test performed on swab samples from patients; they reported an accuracy of 97.3% in classifying this highly infectious viral disease. The convolutional neural network (CNN) is one of the most popular algorithms and has shown high precision in COVID-19 classification from medical images such as X-rays or CT scans. Wang et al. [19] proposed a COVID-19 classification technique using CNNs based on Inception Net on 453 computed tomography (CT) images of pathogen-confirmed COVID-19 cases and reported an accuracy of 82.9%. Song et al. [20] implemented multi-class classification to recognise the diseases (COVID-19 viral infection, non-COVID, and bacterial pneumonia) from CT images (88 patients infected with COVID-19, 86 non-COVID patients, and 100 patients with bacterial pneumonia) using a modified version of the pre-trained ResNet-50 network, named DRE-Net, and reported an accuracy of 86% for bacterial pneumonia versus viral pneumonia (COVID-19) classification. In another study, Sethy et al.
[21] used chest X-ray images to recognize COVID-19 viral infection: they first extracted deep features using a CNN pre-trained on ImageNet and then used an SVM at the last layer to classify them. In addition, Wang et al. [16] presented a deep convolutional neural network architecture for multi-class classification, named COVID-Net, applied to 16,756 chest radiography images from 13,645 patients to distinguish COVID-19 from non-COVID infection, as well as healthy and bacterially infected patients.

In this section, the results of the proposed methodology are discussed. Table 2 presents the final classification metrics produced by the proposed methodology. The table lists the classifiers that were evaluated on the classification task. The Random Forest (RF) and XGBoost (XGB) classifiers outperformed the rest, suggesting that they make better use of the image features provided as input. Their performance in terms of Accuracy, Sensitivity, Specificity, F1-Score, and AUC is shown in Table 2. XGBoost performed best among all the classifiers with an accuracy of 97.7%, while Logistic Regression (LR), k-Nearest Neighbour (kNN),

All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Model Performance: Once the model has been trained, the features can readily be classified into 3 classes, namely COVID-19, Pneumonia, and Normal.
In practice, for example, if a patient comes in for screening, a chest X-ray can be taken and passed to the proposed model to determine whether the patient has COVID-19, is suffering from Pneumonia, or is healthy. To evaluate the efficiency of the proposed framework, the confusion matrix is estimated, which gives a detailed understanding of the classification process (Figure 2). The usefulness of the classification model was measured using the traditional metrics of accuracy, precision, and recall. Precision is the proportion of the model's positive predictions for a class that are correct. The corresponding Receiver Operating Characteristic (ROC) curves are depicted in Figure 3.

The copyright holder for this preprint this version posted April 17, 2020.

We applied our proposed methodology to two publicly available datasets: Chest X-Ray Images (Pneumonia) (footnote 1) and a second dataset (footnote 2); see Figure 4. The numbers of images used in the experiment from both datasets are given in Table 3 and Table 4.

1 https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
2 https://towardsdatascience.com/covid19-public-dataset-on-gcp-nlp-knowledge-graph-193e628fa5cb

Pre-processing of the Dataset
The acquired X-ray images are of variable shapes and sizes, which makes effective classification more difficult. Image pre-processing is therefore performed before classification.
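The pre-processing idea can be illustrated with a minimal sketch. It assumes a centre crop to a square followed by nearest-neighbour resizing to 224 × 224, the input size stated for the proposed method. The crop strategy, the interpolation method, and the function names (`center_crop_square`, `resize_nearest`, `preprocess`) are illustrative assumptions; the paper does not specify them, and a grayscale image is represented here simply as a list of pixel rows.

```python
def center_crop_square(image):
    """Crop a 2-D grayscale image (list of rows) to its central square."""
    height, width = len(image), len(image[0])
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    return [row[left:left + side] for row in image[top:top + side]]


def resize_nearest(image, out_height=224, out_width=224):
    """Resize a 2-D image by nearest-neighbour sampling (assumed method)."""
    in_height, in_width = len(image), len(image[0])
    resized = []
    for y in range(out_height):
        # Map each output row/column back to the closest source index.
        src_y = min(in_height - 1, y * in_height // out_height)
        row = [
            image[src_y][min(in_width - 1, x * in_width // out_width)]
            for x in range(out_width)
        ]
        resized.append(row)
    return resized


def preprocess(image, size=224):
    """Centre-crop to a square, then resize to size x size."""
    return resize_nearest(center_crop_square(image), size, size)
```

Calling `preprocess(image)` on an image of any height and width returns a 224 × 224 grid, so every X-ray reaches the network with the same shape. In a real pipeline an image library would normally handle decoding and higher-quality interpolation.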
Many automatic and semi-automatic techniques exist for detecting abnormalities in a patient's body, but the absence of reliable and precise techniques can cause ambiguities in the classification process. Keeping these challenges in mind, ML-based predictive classifiers are used for analyzing the chest X-ray images, as discussed in the following sections. X-ray images are taken at a low resolution and may have a variable height-to-width ratio. To facilitate training and testing of the deep networks, necessary pre-processing steps such as image cropping and resizing are performed. In the proposed method, all input images are first converted to a standard size of 224 × 224 so that both developed models process them in the same way.

The proposed framework for accurate prediction of COVID-19 from chest X-ray images through a deep feature learning model with SMOTE and machine learning classifiers consists of a ResNet152 architecture for training, whose features are then used to classify the chest X-ray images with machine learning classifiers. The proposed framework is depicted in Figure 5. A detailed description of the architecture is given below.
• The X-ray images of COVID-19, Normal, and Pneumonia-infected patients were taken and resized to 224 × 224.
• A ResNet152 [22] model was trained for the classification of Pneumonia and Normal patients. ResNet is known to be a strong deep learning architecture, as it is relatively easy to optimize and can attain high accuracy. Because of the large number of layers in the network architecture, it has high time complexity; this complexity can be reduced by using a bottleneck design. Further, the vanishing gradient problem is addressed by the skip connections in the network.
Finally, the last fully connected (FC) layer of the network is followed by a logarithmic softmax layer, and the Adam optimizer is used to optimize the neural network.
• After completion of the training process, the last three layers of the ResNet152 were replaced with a ReLU, an FC layer, and an output layer. In the modified network, the last FC layer gives an output of size 1024. The input images are passed through this modified network to obtain 1024 features for each image in the dataset.
• In the dataset, the number of data points for the class corresponding to COVID-19 patients is much smaller than for the other classes. Therefore, to balance the data, we use the synthetic minority oversampling technique (SMOTE) [23, 24]. This algorithm creates an equal number of samples for each class by generating virtual data points between existing points of the minority class using linear interpolation. This ensures the smooth working of machine learning algorithms such as Logistic Regression and Decision Tree, which otherwise tend to be biased towards the majority class.
• After converting all the images into features and applying SMOTE to balance the classes, the next step is fitting different machine learning predictive classifiers to the dataset. For this purpose, we used Logistic Regression (LR) [25], k-Nearest Neighbour (kNN) [26], Decision Trees (DT) [27], Random Forest (RF) [28], Adaptive Boosting (AdaBoost) [29], Naive Bayes (NB) [30], and XGBoost (XGB) [31] to classify COVID-19, Normal, and Pneumonia patients (results shown in Table 2).

We implemented the proposed classification system for COVID-19 diagnosis in Python 3.8 on a machine with an Intel(R) Core i5-8300H CPU @ 2.30GHz × 8, 8 GB of RAM, and an NVIDIA GeForce GTX 1050 GPU with 4 GB of graphics memory, running Windows 10.
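The linear-interpolation step that SMOTE performs, as described above, can be illustrated with a simplified sketch. It is not the authors' implementation: the function name `smote_oversample`, its parameters, the random choice of base points, and the toy 2-D vectors (standing in for the 1024-dimensional deep features) are all assumptions made for illustration. In practice a library implementation, such as the one in imbalanced-learn, would normally be used.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    minority: list of feature vectors (lists of floats) of the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (i for i in range(len(minority)) if i != base_idx),
            key=lambda i: math.dist(base, minority[i]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage: four minority points, six synthetic points between them.
covid_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(covid_features, n_synthetic=6, k=2)
```

Every synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbours, so the new samples stay inside the region already occupied by the minority class rather than duplicating existing points.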
To evaluate the efficacy of the model, the confusion matrix and the Area Under the Curve (AUC) [32] are estimated, which give an understanding of the proposed methodology and its classification performance. The usefulness of the classification model was measured using the traditional metrics of accuracy, precision, and recall. Precision is the proportion of the model's positive predictions for a class that are correct. The performance in classifying COVID-19, Normal, and Pneumonia patients is measured by Accuracy, Sensitivity, Specificity, and F1-score, which are expressed in terms of the confusion matrix in Equations 1, 2, 3, and 4, respectively.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1} \]

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{2} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{3} \]

\[ \text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN} \tag{4} \]

where TP, TN, FP, and FN are True Positive, True Negative, False Positive, and False Negative, respectively.

In this work, we have presented the use of ResNet152 and machine learning classifiers for the effective classification of COVID-19. The proposed methodology is trained on two publicly available datasets and performs well across all the classes. We also incorporated the SMOTE algorithm to balance the classes in the datasets. Machine learning algorithms are then applied to the SMOTE-balanced features for final classification; the best results were obtained by Random Forest and XGBoost, with Accuracy, Sensitivity, Specificity, F1-score, and AUC of 0.973, 0.974, 0.986, 0.973, and 0.997 (Random Forest) and 0.977, 0.977, 0.988, 0.977, and 0.998 (XGBoost), respectively. Therefore, this approach of using X-ray images and computer-aided diagnosis can serve as a large-scale, faster, and cost-effective way of screening. It also reduces the time required for testing drastically.
To make a clinically effective prediction of COVID-19, training with larger datasets and testing in the field with a larger cohort can be immensely useful.

World Health Organization; Naming the coronavirus disease (COVID-19) and the virus that causes it
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health - the latest 2019 novel coronavirus outbreak in Wuhan, China
World Health Organization; Director-General's opening remarks at the media briefing on COVID-19
World Health Organization Accessed on
World Health Organization COVID-19 CORONAVIRUS PANDEMIC. Accessed on
World Health Organization; Coronavirus disease 2019 (COVID-19) situation report - 83. Accessed on
World Health Organization; Coronavirus disease (COVID-19) advice for the public
Novel Coronavirus (2019-nCoV)
World Health Organization Accessed on
Chest X-rays image classification in medical image analysis
Comparison of deep learning approaches for multi-label chest X-ray classification
COVID-ResNet: A deep learning framework for screening of COVID19 from radiographs
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)
Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images
Detection of coronavirus disease (COVID-19) based on deep features
Deep residual learning for image recognition
An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets
SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary
Applied logistic regression. John Wiley & Sons
An introduction to kernel and nearest-neighbor nonparametric regression
Induction of decision trees
Random forests
Multi-class AdaBoost
Naive (Bayes) at forty: The independence assumption in information retrieval
XGBoost: A scalable tree boosting system
Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine

We acknowledge the Consulate General of India, Osaka-Kobe, for constant support and encouragement for this Bilateral India-Japan Artificial Intelligence (INJA-AI) module between researchers of the Indian Institute of Technology Roorkee (IITR), India and Kyoto University, Japan to better manage COVID-19 through non-invasive prediction.