title: Coronavirus (COVID-19) Classification using CT Images by Machine Learning Methods
authors: Barstugan, Mucahid; Ozkaya, Umut; Ozturk, Saban
date: 2020-03-20

This study presents early-phase detection of Coronavirus (COVID-19), as named by the World Health Organization (WHO), using machine learning methods. The detection process was implemented on abdominal Computed Tomography (CT) images. Expert radiologists observed from CT images that COVID-19 behaves differently from other viral pneumonia. Clinical experts therefore state that the COVID-19 virus needs to be diagnosed in its early phase. To detect COVID-19, four different datasets were formed by taking patches of size 16x16, 32x32, 48x48, and 64x64 from 150 CT images. Feature extraction was applied to the patches to increase classification performance. Grey Level Co-occurrence Matrix (GLCM), Local Directional Pattern (LDP), Grey Level Run Length Matrix (GLRLM), Grey-Level Size Zone Matrix (GLSZM), and Discrete Wavelet Transform (DWT) algorithms were used as feature extraction methods. Support Vector Machines (SVM) classified the extracted features. 2-fold, 5-fold, and 10-fold cross-validation were used during classification. Sensitivity, specificity, accuracy, precision, and F-score were used to evaluate classification performance. The best classification accuracy, 99.68%, was obtained with 10-fold cross-validation and the GLSZM feature extraction method.

COVID-19 emerged at the end of 2019 in the Wuhan region of China. In its early phases, the disease presents with fever, cough, fatigue, and myalgia (1). The patients had abnormal findings in their chest CT images. Respiratory problems, heart damage, and secondary infections were observed as complications of the disease.
The findings showed that the COVID-19 virus spreads from person to person. An infected person needs to be treated in an intensive care unit. Infected people have serious respiratory problems. The CT images of infected people show that COVID-19 has its own characteristics. Therefore, clinical experts need lung CT images to diagnose COVID-19 in its early phase.

[...] COVID-19 patients and 300 new COVID-19 patients for validation in their study. They obtained a Dice similarity coefficient of 91.6%. The normal delineation procedure often takes 1 to 5 hours; their proposed system reduced the delineation time to four minutes.

This study used 150 CT images for COVID-19 classification. Before classification, four different datasets were created from the 150 CT images, and the samples were labelled as coronavirus / non-coronavirus (infected / non-infected). Feature extraction methods and an SVM were used to classify the coronavirus images. The findings showed that the proposed method could be used as an assistant system for diagnosing COVID-19.

This paper is organized as follows. Section 2 analyses the images statistically and visually. Section 3 briefly explains the feature extraction and classification techniques. Section 4 presents the classification results. Section 5 discusses the results and concludes the paper.

The data consist of 150 abdominal CT images, belonging to 53 infected cases, from the Societa Italiana di Radiologia Medica e Interventistica (6). Patch regions were selected on the 150 CT images, and patches were extracted from these regions. Four different patch subsets were created; they are presented in Table 1.

The images in the dataset were acquired with different CT tools. This makes classification difficult, because some grey levels that represent coronavirus-infected areas in one CT image represent non-infected areas in another.
Figure 1 shows the infected areas in images acquired with different CT tools. As seen in Figure 1, the grey levels differ between CT tools, which is a disadvantage for classification. Figure 2 shows the patch regions and patch samples from the four subsets.

This study performs coronavirus classification in two stages. In the first stage, classification was performed on the four subsets without feature extraction: the patches were flattened into vectors and classified by SVM. In the second stage, five feature extraction methods, namely Grey Level Co-occurrence Matrix (GLCM) (7-9), Local Directional Patterns (LDP) (10), Grey Level Run Length Matrix (GLRLM) (11), Grey Level Size Zone Matrix (GLSZM) (12), and Discrete Wavelet Transform (DWT) (13), were used to extract features, which were then classified by SVM (14). 2-fold, 5-fold, and 10-fold cross-validation were used during classification, and the mean classification results over the folds were reported. Figure 3 shows the two stages of the classification process.

The feature sets formed by GLCM, LDP, GLRLM, GLSZM, and DWT were used for coronavirus classification. The SVM was used to classify the extracted features because it is a strong binary classifier. The feature extraction methods used in this study are as follows.

GLCM is used to obtain second-order statistical features of an image. The GLCM captures the relationships between the pixels of an image at different angles. Let a co-occurrence matrix that is obtained from an image I be [...] features from all subsets (7-9). The GLCM method produces a 1x19 feature vector as classifier input.

The LDP method uses Kirsch compass kernels to combine the directional elements (30). Let i_c be the intensity of an image I at (x_c, y_c), and let i_n be the intensities of the pixels in the 3x3 neighbourhood of (x_c, y_c), excluding the centre pixel i_c.
The LDP value at (x_c, y_c) is computed as described in (10). The LDP method produces an output matrix of the same size as the input image; this matrix is flattened into a vector as classifier input.

GLRLM extracts higher-order texture features. Let L be the number of grey levels, R the longest run length, and P the number of pixels in the image. The GLRLM is an L×R matrix, and each element p(i, j | θ) gives the number of runs in direction θ with grey level i and run length j. GLRLM extracts the short-run emphasis, long-run emphasis, grey-level non-uniformity, run-length non-uniformity, run percentage, low grey-level run emphasis, and high grey-level run emphasis features from all subsets (11). The GLRLM method produces a 1x7 feature vector as classifier input.

GLSZM is a feature extraction method developed as an extension of the GLRLM algorithm. GLSZM extracts the small zone emphasis, large zone emphasis, grey-level non-uniformity, size zone non-uniformity, zone percentage, low grey-level zone emphasis, high grey-level zone emphasis, small zone low grey-level emphasis, small zone high grey-level emphasis, large zone low grey-level emphasis, large zone high grey-level emphasis, grey-level variance, and size zone variance features from all subsets (12). The GLSZM method produces a 1x13 feature vector as classifier input.

DWT separates the image into frequency sub-bands by using a low-pass filter h and a high-pass filter g. Approximation coefficients (LL), horizontal details (LH), vertical details (HL), and diagonal details (HH) represent the lowest frequencies, horizontal high frequencies, vertical high frequencies, and high frequencies in both directions, respectively (13). The feature set was formed from the LL coefficients, whose dimensions are half those of the input. The LL coefficients were obtained with the db1 wavelet, and the coefficient matrix was flattened into a feature vector.

SVM gives high classification accuracy in many applications. An SVM is based on two ideas.
The first idea is to map the feature vectors nonlinearly to a high-dimensional space and to use linear classifiers in this new space. The second idea is to separate the data with a large-margin hyperplane, which is the hyperplane that separates the data as well as possible (14). The cost (C) parameter of the SVM was set to 1, the default value, for all classification runs.

This study presents a coronavirus classification in two stages. Stage 1 classified the subsets without feature extraction.

Subset 1 has 5912 non-infected and 6940 infected patches. These patches were classified by Stage 1 and Stage 2. Table 2 presents the classification results. As seen in Table 2, the best classification result, 99.68%, was obtained in Stage 2 with 10-fold cross-validation and the GLSZM feature extraction method.

Subset 2 has 942 non-infected and 1122 infected patches. These patches were classified by Stage 1 and Stage 2. Table 3 presents the classification results. Table 3 shows that the best classification result, 99.37%, was obtained in Stage 2 with 10-fold cross-validation and the DWT feature extraction method.

Subset 3 has 255 non-infected and 306 infected patches. These patches were classified by Stage 1 and Stage 2. Table 4 presents the classification results. Table 4 shows that the best classification result, 99.64%, was obtained in Stage 2 with 10-fold cross-validation and the DWT feature extraction method.

Subset 4 has 76 non-infected and 107 infected patches. These patches were classified by Stage 1 and Stage 2. Table 5 presents the classification results. Table 5 shows that the best classification result, 97.28%, was obtained in Stage 2 with 10-fold cross-validation and the DWT feature extraction method.

Tables 2, 3, 4, and 5 show that the best performance was obtained by extracting features from the patches.
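The evaluation protocol described above (k-fold cross-validation, with sensitivity, specificity, accuracy, precision, and F-score averaged over the folds) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the folds are a plain shuffled split, and `midpoint_threshold` is a toy stand-in for the SVM with C = 1 so that the example runs with the Python standard library alone.

```python
import random
from statistics import mean


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with 1 = infected and 0 = non-infected."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a fold lacks one of the classes."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Compute the five metrics named in the text from confusion counts."""
    sensitivity = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "f_score": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, k, train_and_predict, seed=0):
    """Hold out each fold once and return the metrics averaged over folds."""
    per_fold = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        counts = confusion_counts([labels[i] for i in test_idx], predictions)
        per_fold.append(binary_metrics(*counts))
    return {name: mean(m[name] for m in per_fold) for name in per_fold[0]}


def midpoint_threshold(train_x, train_y, test_x):
    """Toy stand-in for the SVM: threshold halfway between the class means."""
    positives = [x for x, y in zip(train_x, train_y) if y == 1]
    negatives = [x for x, y in zip(train_x, train_y) if y == 0]
    threshold = (mean(positives) + mean(negatives)) / 2
    return [1 if x > threshold else 0 for x in test_x]


# Synthetic one-dimensional "features": two well-separated classes.
xs = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
ys = [0] * 10 + [1] * 10
result = cross_validate(xs, ys, 5, midpoint_threshold)
print(result["accuracy"])
```

In practice the `train_and_predict` callable would wrap a real SVM, and stratified folds would keep both classes present in every held-out fold, which matters for the smaller subsets.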
The GLCM, GLSZM, and DWT methods always achieved classification accuracy above 90% with 10-fold cross-validation. The best classification performance was achieved by the GLSZM method with 5-fold cross-validation. The scheme of the best method is presented in Figure 4. As seen in Figure 4, the CT image is divided into 32x32 patches. The GLSZM method extracts the features of the patches and forms a feature vector. The vector is classified by five different SVM structures obtained during the training phase, and the mean classification performance is computed from the SVM classifications.

In this study, the coronavirus image set contains different types of images, acquired with different CT tools. Therefore, five feature extraction methods were used to find the feature set that separates the infected patches with high accuracy. The dataset in this study was formed manually, and 99.68% classification accuracy was achieved on it. The proposed method should be tested on another coronavirus CT image dataset.

Most studies in the literature are medical studies; classification and segmentation studies on COVID-19 may increase. This study examined COVID-19 images from a classification perspective. More classification and segmentation studies on COVID-19 are needed, and for this purpose dataset diversity needs to be increased. Machine learning methods should be applied more widely to abdominal CT images, chest X-ray images, and blood test results as these data are shared with the research community.
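As an illustration of the DWT feature set described earlier (LL coefficients of the db1 wavelet, flattened into a vector), the following is a minimal sketch under stated assumptions. It assumes a single decomposition level, even patch sides, and the orthonormal Haar (db1) filters, for which each LL coefficient is the sum of a 2x2 block divided by 2. The function names `haar_ll` and `ll_feature_vector` are hypothetical, and a wavelet library would normally be used instead.

```python
def haar_ll(patch):
    """Return the level-1 db1 (Haar) approximation (LL) sub-band of a 2-D patch.

    With the orthonormal low-pass filter [1/sqrt(2), 1/sqrt(2)] applied along
    rows and then columns, each LL coefficient equals the sum of one
    non-overlapping 2x2 block divided by 2.
    """
    rows, cols = len(patch), len(patch[0])
    if rows % 2 or cols % 2:
        raise ValueError("patch sides must be even")
    return [
        [
            (patch[r][c] + patch[r][c + 1]
             + patch[r + 1][c] + patch[r + 1][c + 1]) / 2.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def ll_feature_vector(patch):
    """Flatten the LL sub-band row by row into a classifier input vector."""
    return [value for row in haar_ll(patch) for value in row]


# A synthetic 16x16 patch gives an 8x8 LL sub-band, i.e. 64 features.
patch = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
features = ll_feature_vector(patch)
print(len(features))  # 64
```

Under these assumptions the feature vector has a quarter as many entries as the patch has pixels, which matches the statement that the LL sub-band has half the input size along each side.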
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
Added value of computer-aided CT image features for early lung cancer diagnosis with small pulmonary nodules: a matched case-control study
Dermatologist-level classification of skin cancer with deep neural networks
Deep Learning System to Screen Coronavirus Disease
Lung Infection Quantification of COVID-19 in
An analysis of co-occurrence texture statistics as a function of grey level quantization
Textural features for image classification
Texture analysis of SAR sea ice imagery using gray level co-occurrence matrices
Loop descriptor: Local optimal-oriented pattern
Local relative GLRLM-based texture feature extraction for classifying ultrasound medical images
Advanced statistical matrices for texture characterization: application to cell classification
The discrete wavelet transform: wedding the a trous and Mallat algorithms
Statistical learning theory: a tutorial
Evaluation of the confusion matrix method in the validation of an automated system for measuring feeding behaviour of cattle