key: cord-0773383-9rm2v20r authors: Habib, Nahida; Rahman, Mohammad Motiur title: Diagnosis of corona diseases from associated genes and X-ray images using machine learning algorithms and deep CNN date: 2021-05-28 journal: Inform Med Unlocked DOI: 10.1016/j.imu.2021.100621 sha: 7c30961d009f8e9aea8db1472686b7c7aed6523a doc_id: 773383 cord_uid: 9rm2v20r

The novel coronavirus, with its highly transmissible nature, is spreading rapidly, endangering millions of human lives and the global economy. To break the chain of transmission and halt its destructive expansion, early and effective diagnosis of infected patients is immensely important. Unfortunately, many countries lack sufficient testing equipment relative to the number of infected patients. A swift diagnosis that identifies COVID-19 from disease genes or from CT or X-ray images would therefore be desirable. COVID-19 causes flu-like illness, cough, pneumonia, and lung infection in patients, wherein massive alveolar damage and progressive respiratory failure can lead to death. This paper proposes two different detection methods. The first is a gene-based screening method to detect Corona diseases (Middle East respiratory syndrome-related coronavirus, severe acute respiratory syndrome coronavirus 2, and human coronavirus HKU1) and differentiate them from Pneumonia. This novel approach to healthcare utilizes disease genes to build functional semantic similarity among genes. Different machine learning algorithms (eXtreme Gradient Boosting, Naïve Bayes, Regularized Random Forest, Random Forest Rule-Based Model, Random Ferns, C5.0 and Multi-Layer Perceptron) are trained and tested on the semantic similarities to classify Corona and Pneumonia diseases. The best performing models are then ensembled, yielding an accuracy of nearly 93%. The second diagnosis technique proposed herein is an automated COVID-19 diagnostic method that uses chest X-ray images to classify Normal versus COVID-19 and Pneumonia versus COVID-19 images using a deep CNN, achieving 99.87% and 99.48% test accuracy, respectively. Thus, this research can assist in providing better treatment against COVID-19.

COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a highly contagious disease. Coronaviruses were thought to infect only animals until the world witnessed a severe acute respiratory syndrome (SARS) outbreak caused by SARS-CoV in 2002 in Guangdong, China [1]. In 2005, the human coronavirus HKU1 was first discovered. Almost a decade later, another endemic coronavirus, known as Middle East respiratory syndrome coronavirus (MERS-CoV), appeared in Middle Eastern countries. We have now seen the onset of COVID-19. Emerging from Wuhan, China in December 2019, COVID-19 spread rapidly around the world, affecting the people of approximately 215 countries. On 12 February 2020, the WHO warned that millions would die from COVID-19 if it remained uncontrolled, and declared it a pandemic on 11 March 2020 [2]. According to the Worldometers data, over 21 million people have been infected, with over 0.76 million deaths [3]. The situation has grown grimmer as new cases have increased exponentially.

Social distancing and contact tracing are two effective techniques proposed by the World Health Organization (WHO) to control the spread of this viral infection [4]. Thus, to avoid fast transmission of the virus, most countries made lockdown compulsory, which disrupted daily life and socio-economic conditions. Still, the situation is not entirely controlled. Effective screening allows infected patients to be isolated and to receive immediate treatment and care, which mitigates the spread of the virus [5].
The reverse-transcription polymerase chain reaction (RT-PCR) test is the accepted standard diagnostic method for COVID-19 [6]. However, because the supply of RT-PCR testing kits, testing reagents, proper lab environments, PPE, and expertise is inadequate to meet demand, infection rates are rapidly increasing. Hence, researchers are trying to develop alternative detection techniques. Currently, machine learning and deep learning are used as successful AI techniques for effective diagnosis of diseases. X-ray radiography is easier and more cost-effective than CT scanning. Therefore, most researchers prefer to use X-ray images rather than CT images.

Almost all of the Corona diseases start with cold-like symptoms and then advance to Pneumonia. COVID-19 symptoms range from mild to severe, including fever, cough, and dyspnea, progressing to pneumonia, severe acute respiratory syndrome, septic shock, multi-organ failure and death in more serious cases [7]. According to [3], among the active cases, 2% of patients are critical and 98% are mild. Studies have found that the symptoms are changing gradually as the virus slightly changes its genetic makeup. In some current cases, corona-positive patients are found without any symptoms. For these reasons, a gene-based COVID-19 detection method can be a great alternative to the other methods. Knowledge of an individual's genetic make-up can be used to mitigate the risk of developing certain diseases and to detect these diseases at an earlier stage [8]. A set of genes known to be associated with a disease can be used to find further candidate genes for that disease [9] and also to detect and distinguish it from other diseases.
This study aims to mitigate the limitations of the traditional COVID-19 diagnostic method by demonstrating two fast and effective diagnostic techniques: a gene-based Corona disease detection method, and an automated computer-aided diagnosis (CAD) tool that differentiates COVID-19 from pneumonia and healthy cases using chest X-ray imagery. Genes and chest X-ray images associated with the diseases are first collected and pre-processed. Several techniques are applied to the disease genes to calculate the functional similarity measure matrices among them. The National Center for Biotechnology Information (NCBI) and Gene Ontology (GO) online databases are used for these purposes. Afterward, different machine learning algorithms are applied to the matrices for prediction. The X-ray images are classified using a pre-trained CheXNet deep convolutional neural network (CNN) with model weights [10] to distinguish COVID-19 from Pneumonia and Normal (healthy) images. The main contributions of this paper are:

• A new approach to diagnose Corona diseases from disease genes with excellent performance.
• Extensive experimentation to select the best performing machine learning (ML) model on the best gene functional similarity measures.
• An ensemble technique of ML models to increase classification accuracy.
• An automated computer-aided method of COVID-19 detection from chest X-ray imagery, using a transfer-learned deep CNN model.

The remainder of the paper is organized as follows: Related Works, presenting the literature review, can be found in section 2. Section 3 presents the proposed materials and methods of this study. Section 4 describes and discusses the results. Finally, the conclusion and future work of this research can be found in section 5.

Modern technology has made diagnosis and treatment easier and more convenient than ever before. The availability of large datasets and the success of deep learning have made the results of diagnostic tasks more accurate. This section highlights the studies and works done by other groups related to this research.

As of this writing, the coronavirus is still spreading, endangering millions of people. To control the spread of COVID-19, screening large numbers of suspected cases for appropriate quarantine and treatment measures is a priority. Yet the RT-PCR testing process is time-consuming and sometimes produces false-negative results, so researchers are trying to develop alternative detection techniques. Paper [11] uses gene functional similarity to identify disease genes. Jianpeng Zhang et al. developed a deep learning model to detect COVID-19 from chest X-ray images with high sensitivity for active cases [6]. In article [12], different online chest X-ray datasets are combined and rearranged, and transfer learning methods are then used for disease detection. Papers [7] and [13] also developed automated deep CNN models for detecting and distinguishing COVID-19 from Pneumonia using X-rays.

Pneumonia is a contagious lung disease that creates breathing difficulty and severe respiratory problems with inflammation in the lung alveoli. One of the major symptoms of COVID-19 is Pneumonia. Hence, it is very difficult to differentiate between Pneumonia and COVID-19. Correctly separating COVID-19 disease genes from Pneumonia disease genes in turn means distinguishing COVID-19 from Pneumonia. ML classifiers trained on gene semantic similarity scores can differentiate disease genes by inferring hidden semantic similarities among genes. As AI and ML tools perform well in diagnosing Pneumonia, they can also be applied to diagnose COVID-19. To suppress the rapid transmission of the coronavirus, it is necessary to screen all suspected cases, quarantine them and provide immediate treatment.
This study proposes a new diagnostic technique for Corona disease that uses disease genes and performs well in distinguishing Corona disease from Pneumonia. In addition, a fine-tuned CheXNet CNN model, pre-trained on the Pneumonia dataset [26], is proposed for the diagnosis of COVID-19 from X-ray images. It serves two classification tasks: COVID-19 vs Pneumonia and COVID-19 vs Normal images. The proposed methodology comprises several steps, from data collection to Corona detection, for both the gene-based method and the CAD method. Fig. 1 displays the schematic representation of the proposed methodology.

Data preprocessing is one of the vital steps to enhance the quality of data and transform the raw data into a more suitable and efficient format. Table 1 shows the number of genes before and after preprocessing and gene mining.

Gene functional similarity covers a wide area of biological and bioinformatics research. Molecular Function (MF), Biological Process (BP) and Cellular Component (CC) are the three orthogonal ontologies provided by GO. Five semantic similarity measures are considered: Resnik [33], Jiang [34], Lin [35], Schlicker [36] and Wang [37]. The first four are based on the information content (IC) of GO terms [11], which can be defined as IC(t) = -log(p(t)), where p(t) is the probability of GO term t being used in a given GO corpus. The Wang semantic similarity measure, in contrast, uses the hierarchical DAG structure of GO to estimate semantic similarity between genes.
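As a rough illustration of the information-content definition above, the following Python sketch computes p(t) and IC(t) on a toy ontology and derives two IC-based term similarities from them. The DAG, the term names and the annotation counts are hypothetical, and the Resnik and Lin functions follow their commonly published forms rather than the authors' R pipeline, which relies on GOSemSim.

```python
import math

# Toy GO-like DAG (hypothetical terms): each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}

# Hypothetical number of gene annotations attached directly to each term.
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 3, "C": 4, "D": 1}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_probabilities():
    """p(t): share of annotations on t or on any descendant of t."""
    total = sum(DIRECT_COUNTS.values())
    usage = {t: 0 for t in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            usage[anc] += count
    return {t: usage[t] / total for t in PARENTS}


def information_content(probabilities):
    """IC(t) = -log p(t); terms with p(t) = 0 are skipped."""
    return {t: -math.log(p) for t, p in probabilities.items() if p > 0}


def resnik(t1, t2, ic):
    """IC of the most informative common ancestor of t1 and t2."""
    return max(ic[t] for t in ancestors(t1) & ancestors(t2))


def lin(t1, t2, ic):
    """Resnik score normalised by the IC of both terms."""
    denom = ic[t1] + ic[t2]
    return 2 * resnik(t1, t2, ic) / denom if denom > 0 else 0.0


p = term_probabilities()
ic = information_content(p)
print({t: round(v, 3) for t, v in ic.items()})
print("Resnik(C, D):", round(resnik("C", "D", ic), 3))
print("Lin(C, D):", round(lin("C", "D", ic), 3))
```

In this toy example the root term is used by every annotation, so its IC is zero, while rarely used terms receive a high IC. This is why a shared specific ancestor contributes more to similarity than a shared general one.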
The above semantic similarity measures can be represented as follows [38], where MICA denotes the most informative common ancestor of the GO terms $t_1$ and $t_2$.

The Resnik method can be defined as:

$$sim_{Resnik}(t_1, t_2) = IC(MICA)$$

The Lin method can be defined as:

$$sim_{Lin}(t_1, t_2) = \frac{2\,IC(MICA)}{IC(t_1) + IC(t_2)}$$

The Relevance method, which was proposed by Schlicker, combines Resnik's and Lin's methods and can be defined as:

$$sim_{Rel}(t_1, t_2) = \frac{2\,IC(MICA)\,\bigl(1 - p(MICA)\bigr)}{IC(t_1) + IC(t_2)}$$

Jiang and Conrath's method can be defined as:

$$sim_{Jiang}(t_1, t_2) = 1 - \min\bigl(1,\; IC(t_1) + IC(t_2) - 2\,IC(MICA)\bigr)$$

In the Wang method, given two GO terms A and B, the semantic similarity between these two terms can be defined as:

$$S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \bigl(S_A(t) + S_B(t)\bigr)}{SV(A) + SV(B)}$$

where $T_A$ is the set of terms in the DAG of A (A and all of its ancestors), $S_A(t)$ is the S-value of term t with respect to A, and $SV(A) = \sum_{t \in T_A} S_A(t)$.

The BMA method also finds the pairwise semantic similarity values and computes the average of all maximum similarities on each row and column. For two genes $g_1$ and $g_2$ annotated with m and n GO terms, respectively, it is defined as:

$$sim_{BMA}(g_1, g_2) = \frac{\sum_{i=1}^{m} \max_{1 \le j \le n} sim(go_{1i}, go_{2j}) + \sum_{j=1}^{n} \max_{1 \le i \le m} sim(go_{1i}, go_{2j})}{m + n}$$

Supervised machine learning methods are capable of learning hidden gene relationships from a given dataset and then using that learned knowledge to discriminate disease genes from non-disease genes. The machine learning algorithms used here are eXtreme Gradient Boosting (xgbLinear), Naïve Bayes (NB), Regularized Random Forest (RRF), Random Forest Rule-Based Model (rfRules), Random Ferns (rFerns), C5.0 and Multi-Layer Perceptron (MLP).

xgbLinear is a method of eXtreme Gradient Boosting that can be used for both classification and regression using the xgboost library. To find the best model, it uses more accurate approximations within the gradient boosting framework.

NB is a supervised classification algorithm based on Bayes' theorem. It predicts the most probable class in the same way that Bayes' theorem selects the most probable hypothesis given prior knowledge.

RRF implements a regularized random forest algorithm that can be used for both classification and regression. It applies the tree regularization framework to RF and can select a compact subset of relevant and non-redundant features [39].

rfRules acts as both a classification and a regression model. It generates a series of "if-then" rules to classify the samples.

rFerns is a machine learning classification algorithm that extends the Naïve Bayes algorithm. It can be considered a constrained decision tree in which the same binary test is performed at each level of the tree.

C5.0 is a classification algorithm that is well known for producing decision trees. It can be used for both small and large datasets, and its decision trees are relatively easy to understand and deploy.

MLP is a supervised classification and regression algorithm that is widely used in image and speech recognition. It is a multilayer feedforward artificial neural network that generates a set of outputs from a set of inputs, and it is trained using backpropagation.

To obtain more accurate classification results, the machine learning models are ensembled.

Two binary classification tasks, COVID-19 vs Normal and COVID-19 vs Pneumonia, are performed in the CNN-based CAD method after preprocessing of the chest X-ray images. This research uses the fine-tuned, transfer-learned CheXNet model that was previously used by [10] for Pneumonia detection. The original CheXNet model, proposed by researchers from Stanford University [40], is a 121-layer DenseNet architecture. CheXNet was first pre-trained on the ImageNet dataset and then trained on the CXR dataset of [41]. In our previous Pneumonia detection research, the fine-tuned CheXNet model was trained on the dataset of [26]. Here, the transfer-learned CheXNet model is used with a Softmax activation function at the final layer for binary classification and the ReLU activation function for all other activation layers. Fig. 4(a) represents the CheXNet model architecture and Fig. 4(b) the proposed fine-tuned CheXNet model with pre-trained weights.

Fig. 4(a). Architectural Design of CheXNet Model.

Fig. 4(b). Proposed fine-tuned CheXNet Model with Pre-trained Weights.

This section provides the results and discusses the output of each step of the proposed methods and materials. The results are described in the following subsections.

The gene data collected from the NCBI Gene repositories are two summary-type text files, one for Corona disease and the other for Pneumonia. There are 24 common genes found between Corona and Pneumonia diseases. They are removed from both files, and 84 top-weighted genes from both classes are selected to maintain a balanced, unbiased dataset. After preprocessing and mining genes from the collected gene data files, a dataframe is created containing the ENTREZID and Class columns. The head of the dataframe is shown in Fig. 5.

Fig. 5. Head of Gene Input Dataframe.

All processes in the gene-based screening method are carried out in the R programming language in a Windows 10, 64-bit environment.

In the CNN-based CAD method, the collected X-ray images are preprocessed using the adaptive histogram equalization (AHE) contrast enhancement technique. The enhanced COVID-19, Normal and Pneumonia images are shown in Fig. 6.

Semantic similarity can be used to measure the functional closeness of Gene Ontology (GO) terms. The R packages org.Hs.eg.db [42] and GOSemSim [43] are used for the gene semantic similarity matrix estimation. As COVID-19 is a new term, some of the Corona disease genes and a very few of the Pneumonia genes lack GO information, which leads to null semantic similarity scores. A valid semantic similarity score lies between 0 and 1, so null values need to be removed. The training and testing datasets are then constructed using an 8:2 ratio. Among the semantic similarity measures (Resnik, Lin, Rel, Jiang and Wang), the Wang method achieves the best results with the max combining technique.

To determine the hidden functional similarities between Corona disease genes and Pneumonia genes, the xgbLinear, NB, RRF, rfRules, rFerns, C5.0 and MLP machine learning classifiers are trained and tested on Corona and Pneumonia gene functional similarities. The performance of any machine learning algorithm depends highly on the amount of data available: a large dataset can make the algorithm more accurate than a limited one. This is the main shortcoming of our study. Because a large amount of gene data for coronavirus and updated GO information were unavailable, the accuracy of the machine learning (ML) models was negatively affected.
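The combining and data preparation steps mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the term-by-term similarity matrix, the row layout (a dictionary with a `scores` list), and the helper names are all hypothetical, and the shuffle seed is arbitrary.

```python
import random


def combine_max(term_sims):
    """Gene-level score: the largest pairwise GO-term similarity."""
    return max(max(row) for row in term_sims)


def combine_bma(term_sims):
    """Best-match average: mean of all row maxima and column maxima."""
    row_best = [max(row) for row in term_sims]
    col_best = [max(col) for col in zip(*term_sims)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


def drop_null_rows(rows):
    """Keep only rows in which every similarity score is defined."""
    return [r for r in rows if all(v is not None for v in r["scores"])]


def split_8_2(rows, seed=0):
    """Shuffle the rows and split them into 80% training and 20% testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical similarities between the 3 GO terms of one gene (rows)
# and the 2 GO terms of another gene (columns).
sims = [
    [0.2, 0.9],
    [0.5, 0.4],
    [0.1, 0.3],
]
print("max:", combine_max(sims))
print("BMA:", round(combine_bma(sims), 2))

# Hypothetical feature rows; None marks a gene pair without GO information.
rows = [{"gene": f"g{i}", "scores": [0.1 * i, 0.5]} for i in range(10)]
rows[3]["scores"][0] = None
rows[7]["scores"][1] = None

clean = drop_null_rows(rows)
train, test = split_8_2(clean)
print(len(clean), len(train), len(test))
```

The max technique keeps only the single best-matching pair of terms, whereas BMA averages the best match of every term in both genes, so a single strong pair influences the max score more than the BMA score.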
The above seven ML models are trained on each of the five similarity matrices with two combining techniques, resulting in a total of 70 classifiers (7 × 5 × 2), and predictions are made using a 5-fold cross-validation technique. The average accuracy over the folds is then used as each model's accuracy. Sensitivity and specificity are also calculated for each of the ML models. Tables 3 and 4 show the performance of the various machine learning classifiers built over functional similarity scores using the max and BMA combining techniques, respectively. The model achieves 90.91% sensitivity and 94.12% specificity. The authors of [11] obtained 80% AUC values with a gene-based screening method to identify ASD disease candidate genes. As this technique has not yet been applied by other researchers to the Corona detection task, the proposed model could be an ideal supporting model for Corona disease and Pneumonia detection, with approximately 93% classification accuracy.

All tasks for the CNN-based CAD method, including training and testing, were completed in Python on a Mac operating system with a Google Colab GPU and the Keras framework.

This section compares the proposed CNN-based COVID-19 detection model with existing models. The transfer-learned, fine-tuned CheXNet model is used in this research as it shows better performance in classifying both COVID-19 vs Normal and COVID-19 vs Pneumonia images. Diagnosis of COVID-19 can be made either from CT scan images or from chest X-ray images. Comparison of the proposed model with other binary and multi-class classification models on different datasets also indicates that the proposed model outperforms other state-of-the-art models for the diagnosis of COVID-19. Table 7 summarizes the performance of different models on different datasets in comparison with the proposed model.

| Study | Image Type | Classification Type | Accuracy |
|---|---|---|---|
| Hemdan et al. [15] | X-ray | COVID-19 vs Normal | 90% |
| Zheng et al. [16] | CT | COVID-19 vs Normal | 90.1% |
| Ying et al. [20] | CT | COVID-19 vs Normal | 94% |
| Sarhan A.M et al. [17] | X-ray | COVID-19 vs Normal | 94.5% |
| Narin et al. [7] | X-ray | COVID-19 vs Normal | 98% |
| Ozturk et al. [18] | X-ray | COVID-19 vs Normal | 98.08% |
| Tawsifur R. [44] | X-ray | COVID-19 vs Normal | 99.7% |
| Wang et al. [19] | CT | COVID-19 vs Pneumonia | 82.9% |
| Ying et al. [20] | CT | COVID-19 vs Pneumonia | 86% |
| Sethy and Behera [21] | X-ray | COVID-19 vs Pneumonia | 95.38% |
| Xu et al. [22] | CT | COVID-19 vs IAVP vs Normal | 86.7% |
| Mangal et al. [23] | X-ray | COVID-19 | |

The above comparison indicates that the proposed binary COVID-19 diagnosis model outperforms the compared binary and multi-class models. Thus, it could become a great supporting tool for fighting the COVID-19 pandemic.

As Pneumonia is a major symptom of COVID-19, it is very difficult to differentiate COVID-19 or Corona diseases from Pneumonia. In this study, two cost-effective, rapid, and automatic Corona disease diagnostic methods were demonstrated. Gene Ontology (GO) is the resource most frequently used by researchers to calculate gene functional similarity. Genes with higher functional similarity may belong to the same hierarchical path of GO with higher semantic terms. The identification of disease from associated genes through GO-based gene similarity measures can open a new era in complex disease diagnosis. ML classifiers with a large gene dataset may help to obtain improved accuracy. In the gene-based detection method, ML classifiers are applied to identify and predict Corona disease from gene functional similarities calculated using different semantic similarity measures. Stacking ensembles of different machine learning models improves accuracy. Chest X-ray imagery is readily available, and these cost-effective images convey useful information to assist radiologists in diagnosing disease. The proposed CNN-based CAD method provides a simple model that demonstrates superior results in diagnosing COVID-19 from X-ray imagery. In the future, the authors will try to overcome the data shortage limitation and optimize the model to classify more diseases effectively.

COVID-19 infection: origin, transmission, and characteristics of human coronaviruses
A Deep Learning Framework for Screening of COVID19 from Radiographs
A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images
COVID-19 Screening on Chest X-ray Images Using Deep Learning based Anomaly Detection
Automatic Detection of Coronavirus Disease (COVID-19) Using X-Ray Images and Deep Convolutional Neural Networks
The prediction of disease risk in genomic medicine
Disease gene prediction for molecularly uncharacterized diseases
Fusion of Deep Convolutional Neural Network with PCA and Logistic Regression for Diagnosis of Pediatric Pneumonia on Chest X-rays
Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology
Automatic Detection from X-Ray Images Utilizing Transfer Learning with Convolutional Neural Networks
COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images
Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy
COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images
Deep learning-based detection for COVID-19 from chest CT using weak label. medRxiv
Detection of COVID-19 Cases in Chest X-ray Images Using Wavelets And Support Vector Machines
Automated detection of COVID-19 cases using deep neural networks with X-ray images
A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19)
Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images. medRxiv
Detection of Coronavirus Disease (COVID-19) Based on Deep Features
Deep Learning System to Screen Coronavirus Disease 2019 Pneumonia
COVID-19 Detection Using Chest X-Ray
Classification of COVID-19 from Chest X-ray images using Deep Convolutional Neural Networks. medRxiv
Accurate Prediction of COVID-19 using Chest X-Ray Images through Deep Feature Learning model with SMOTE and Machine Learning Classifiers. medRxiv
Identifying medical diagnoses and treatable diseases by image-based deep learning
The Human Genome Project. The New York Times
COVID-19 Radiography Dataset
COVID-19 image data collection
Chest X-Ray Images (Pneumonia)
NaviGO: interactive tool for visualization and functional similarity and coherence analysis with gene ontology
Semantic Similarity in a Taxonomy: An Information-Based Measure and Its Application to Problems of Ambiguity in Natural Language
Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy
An Information-Theoretic Definition of Similarity
A New Measure for Functional Similarity of Gene Products Based on Gene Ontology
A New Method to Measure the Semantic Similarity of GO Terms
School of Basic Medical Sciences
Gene Selection with Guided Regularized Random Forest
Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Genome wide annotation for Human. R Package version 312
Gene Ontology Semantic Similarity Analysis Using GOSemSim
Can AI help in screening Viral and COVID-19 pneumonia?

The authors are grateful to the participants who contributed to this research.

This research did not receive any specific grants from funding agencies in the public, commercial, or not-for-profit sectors.

The authors have no conflicts of interest to declare.