key: cord-0505437-w80gndka authors: Ozkaya, Umut; Ozturk, Saban; Barstugan, Mucahid title: Coronavirus (COVID-19) Classification using Deep Features Fusion and Ranking Technique date: 2020-04-07 journal: nan DOI: nan sha: beafcc134bb96b4c1898cbb63be513d951f82b95 doc_id: 505437 cord_uid: w80gndka

Coronavirus (COVID-19) emerged towards the end of 2019 and was identified by the World Health Organization (WHO) as a global epidemic. A consensus has formed that using Computed Tomography (CT) techniques for early diagnosis of the pandemic disease gives both fast and accurate results. Expert radiologists have stated that COVID-19 displays characteristic patterns in CT images. In this study, a novel method is proposed that fuses and ranks deep features to detect COVID-19 in its early phase. 16x16 (Subset-1) and 32x32 (Subset-2) patches were obtained from 150 CT images to generate sub-datasets. Within the scope of the proposed method, 3000 patch images per class were labelled as COVID-19 or No findings for use in the training and testing phases. A feature fusion and ranking method was applied to increase the performance of the proposed method, and the processed data were then classified with a Support Vector Machine (SVM). Compared with other pre-trained Convolutional Neural Network (CNN) models used in transfer learning, the proposed method shows high performance on Subset-2, with 98.27% accuracy, 98.93% sensitivity, 97.60% specificity, 97.63% precision, 98.28% F1-score and 96.54% Matthews Correlation Coefficient (MCC).

To prevent the rapid spread of coronavirus disease (COVID-19), it is essential to apply the necessary quarantine measures and to discover effective treatment methods. According to World Health Organization (WHO) data, COVID-19 has become a global epidemic similar to other pandemic diseases and has caused patient deaths in China [1] [2] [3] . Early application of treatment procedures for individuals with COVID-19 infection increases the patient's chances of survival. Fever, cough and shortness of breath are the most important symptoms in infected individuals for the diagnosis of COVID-19; at the same time, infected individuals may act as carriers and show none of these symptoms. Pathological tests performed in laboratories take considerable time, and their margin of error can be high. A fast and accurate diagnosis is necessary for an effective struggle against COVID-19. For this reason, experts have started to use radiological imaging methods, performed with computed tomography (CT) or X-ray imaging techniques. COVID-19 cases have similar features in CT images in the early and late stages: the infection shows a circular pattern that diffuses inward within the image [4] . Therefore, radiological imaging provides early detection of suspicious cases with an accuracy of about 90%.

When the studies in the literature are examined, Shan et al. proposed a neural network model called VB-Net to segment the COVID-19 regions in CT images. The method was tested on 300 new cases, and a recommendation system was used to make it easier for radiologists to mark infected areas within CT images [5] . Xu et al. analyzed CT images to distinguish healthy, COVID-19 and other viral cases; their dataset included 219 COVID-19, 224 viral disease and 175 healthy images, and their deep learning method achieved 87.6% overall classification accuracy [6] . Apostolopoulos et al. proposed a transfer learning method to classify COVID-19 and normal cases.
They obtained performance metrics of 96.78% accuracy, 98.66% sensitivity, and 96.46% specificity [7] . Shuai et al. were able to successfully diagnose COVID-19 using deep learning models that extract graphical features from CT images [8] .

In this study, 150 CT images were used to classify COVID-19 cases. Two different datasets were generated from these 150 CT images, consisting of 16×16 and 32×32 patch images respectively. Each dataset contains 3000 images per class, labeled COVID-19 or No findings. Deep features were obtained with pre-trained Convolutional Neural Network (CNN) models; these deep features were fused and ranked to train a Support Vector Machine (SVM). The results indicate that the proposed method can support the early diagnosis of COVID-19 cases.

This study consists of 5 sections. The properties of the obtained patch images are visualized in Section 2. In Section 3, the basics of deep learning methods and the feature fusion and ranking techniques are described. Comparative classification performances are given in Section 4. The discussion and conclusion are in Section 5.

CT images of 53 infected cases were accessed from the Societa Italiana di Radiologia Medica e Interventistica to generate the datasets [9] . Patch images were obtained from infected and non-infected regions of the CT images. Properties of the two patch types are given in Table 1, and the process of obtaining patches from the CT images is shown in Figure 1.

In 2006, Geoffrey Hinton showed that deep neural networks can be trained effectively by the greedy layer-wise pre-training method [10] . Other research groups used the same strategy to train many other deep networks. The term "deep learning" was popularized to draw attention to the theoretical importance of depth and to the design of better-performing, deeper neural networks. Deep learning, which has become quite popular recently, is used in many areas, such as e-mail filtering, search engine matching, smartphones, social media and e-commerce, with academic studies pioneering its use in these fields. Deep learning is also used for face recognition, object recognition, object detection, text classification and speech recognition. Deep learning is a type of artificial neural network with multiple layers; in general, adding more layers can yield greater accuracy. While deep convolutional networks are successfully used in image, video, speech and sound processing, recurrent neural networks are used for sequential data such as text and speech. Deep learning, which came into widespread use around 2010, applies multilayer machine learning computations to large datasets and can itself learn many of the parameters that would otherwise need to be defined by hand. Deep artificial neural networks are algorithms inspired by the functioning of the brain.

In machine learning, a Deep Belief Network (DBN) is a generative graphical model or, alternatively, a class of deep neural networks composed of multiple layers of hidden units. When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs; the layers then act as feature detectors. After this learning phase, a DBN can be further trained with supervision to perform classification.
DBNs can be seen as a combination of simple, unsupervised networks, such as restricted Boltzmann machines (RBMs) or autoencoders, in which the hidden layer of each subnetwork serves as the visible layer of the next.

Convolution is a mathematical operation, a special kind of linear operation. Convolutional neural networks (CNN) are neural networks with at least one convolutional layer. However, the convolution used in deep learning differs from the convolution of ordinary engineering mathematics. CNNs contain layers such as convolution, ReLU, pooling, normalization, fully connected and softmax layers; the classification process takes place in the fully connected and softmax layers.

Generally, convolution is an operation on two real-valued functions. To describe it, consider tracking the location of a space shuttle with a laser. The laser sensor produces an output x(t), the position of the shuttle at time t, where x and t are real-valued, so any t gives a different value obtained at a snapshot in time. The sensor is somewhat noisy, so to obtain a less noisy estimate we can average several measurements together. Naturally, more recent measurements are more relevant, so the average should give more weight to the most recent measurements. This can be done with a weighting function w(a), where a is the age of a measurement. Applying such a weighted average at every moment yields a new function s that provides a smoother estimate of the position:

s(t) = ∫ x(a) w(t − a) da    (1)

This operation is a convolution and is denoted with an asterisk:

s(t) = (x ∗ w)(t)    (2)

In CNN terminology, the first argument (the function x in Eq. 2) is called the input, and the second argument (the function w) is called the kernel. The output is called the feature map. In the example above, measurements are taken continuously, but this is not realistic: time is discretized when working on a computer. If one realistic measurement is taken per second, the time index t is an integer, and x and w are defined on integer arguments, giving the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a)    (3)

In machine learning applications, the input is a multidimensional array of data and the kernel is a multidimensional array of learnable parameters. Convolution is applied over multiple axes at a time, so if the input is a two-dimensional image I, the kernel K becomes a two-dimensional matrix:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)    (4)

This equation effectively flips the kernel relative to the input, which increases the invariance of the convolution [11] . But this property is not very important for machine learning libraries. Instead, many machine learning libraries apply the kernel without flipping it, an operation called cross-correlation, which is closely related to convolution; because it looks like convolution, the resulting network is still called a convolutional neural network:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)    (5)

Discrete convolution can be seen as a matrix product. Typical convolutional neural networks benefit from further specializations to deal effectively with large inputs. Figure 2 shows how this process occurs in a convolutional neural network. Convolution provides three important ideas to improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. Furthermore, the convolution process can work with variable-sized inputs.
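To make this distinction concrete, the following is a minimal NumPy sketch (illustrative only, not code from the paper) of the unflipped "valid" 2-D cross-correlation that deep learning libraries typically implement under the name convolution; the image and kernel values are arbitrary examples.

```python
import numpy as np

def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' (no padding) 2-D cross-correlation, as used in CNN layers.

    Unlike true convolution, the kernel is applied without flipping.
    """
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the kernel and the patch it covers
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Example: slide a 3x3 vertical-edge kernel over a 5x5 image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(cross_correlate2d(image, kernel))  # feature map of shape (3, 3)
```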
Traditional neural network layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit; that is, each output unit interacts with each input unit. CNNs, however, typically have sparse interactions (also called sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input. Since the number of pixels decreases after each convolution operation, if there are features at the edges that should not be overlooked, zeros are appended to the ends of the rows and columns so that the edge features are preserved. This process is called padding. For example, an input image may consist of thousands or millions of pixels, but small, meaningful features such as edges can be detected with kernels of only tens or hundreds of pixels. This means fewer parameters need to be stored, which both reduces the memory requirements of the CNN model and increases its efficiency; it also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large.

Parameter sharing refers to using the same parameter for more than one function in a model. In a conventional neural network, each element of the weight matrix is used exactly once to compute the output of a layer: it is multiplied by one element of the input and is never revisited. A network is said to have tied weights when, as in parameter sharing, the value of a weight applied to one input is tied to the value of a weight applied elsewhere. In a CNN, each member of the kernel is used at every position of the input. The parameter sharing used by the convolution operation means that instead of learning a separate set of parameters for each location, only one set is learned. Considering that images are three-dimensional with size H × W × D, if the kernel size is K × K, then with padding P and stride S the number of pixels of the convolution output is calculated as follows:

O = (H − K + 2P)/S + 1    (6)

Normalization roughly means rescaling the data. The scale of the data matters in artificial neural networks: as the values grow, the memory they occupy increases, which reduces both the efficiency of the network and its working speed. Compressing the values of the entire dataset into the range 0-1 makes the operations easier. A related technique is standardization, which rescales the features towards a standard normal distribution, where μ and σ denote the mean and the standard deviation respectively. The standard score of each sample is computed as follows:

z = (x − μ)/σ    (7)

After standardization, the features are centered at 0 with a standard deviation of 1, which is important for training many machine learning algorithms.

A pooling function replaces the output of the network at a given location with a summary statistic of the nearby outputs. For example, max pooling outputs the largest value within a rectangular neighbourhood; other popular pooling functions are the mean and minimum pooling functions. When the number of parameters in the next layer depends on the input image or feature map size, any reduction in input size also increases statistical efficiency and reduces the memory required to store the parameters. For an input of size N, pooling window size K and stride S, the number of pixels of the pooling output is calculated as follows:

O = (N − K)/S + 1    (8)

The Rectified Linear Unit (ReLU) is a type of activation function that has recently become popular. It computes the function f(x) = max(0, x); in other words, the activation is thresholded at zero.
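As a worked illustration of the formulas above, the short sketch below (with arbitrary example sizes, not the paper's architecture) computes convolution and pooling output sizes and applies standardization, ReLU and 2 × 2 max pooling with NumPy.

```python
import numpy as np

def output_size(n: int, k: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size of a convolution/pooling window (Eqs. 6 and 8)."""
    return (n - k + 2 * p) // s + 1

# A 32x32 patch convolved with a 3x3 kernel (no padding, stride 1) -> 30x30
print(output_size(32, 3))             # 30
# 2x2 max pooling with stride 2 halves the feature map: 30 -> 15
print(output_size(30, 2, s=2))        # 15

x = np.random.randn(4, 4)

z = (x - x.mean()) / x.std()          # standard score (Eq. 7)
a = np.maximum(z, 0.0)                # ReLU: threshold activations at zero

# 2x2 max pooling with stride 2: keep the largest value in each window
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled.shape)                   # (2, 2)
```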
There are a number of pros and cons to the use of ReLU. It has been found to significantly accelerate the convergence of stochastic gradient descent compared to the sigmoid/tanh functions, which is claimed to originate from its linear, non-saturating form. Whereas tanh/sigmoid neurons involve costly operations, ReLU can be implemented simply by thresholding an activation matrix at zero. ReLU units can, however, become fragile during the training phase. For example, a large gradient flowing through a neuron with a ReLU activation function can cause the weights to update in such a way that the neuron is never activated again at any data point. If this happens, the gradient flowing through the unit will be zero from that point on; that is, ReLU can kill units irrevocably during training. For example, if the learning rate is set too high, as much as 40% of the network may be dead. With an appropriate adjustment of the learning rate this occurs less frequently.

In fully connected layers, dropping nodes below a certain threshold has been observed to increase performance, so forgetting weak information improves learning. Some properties of the dropout value are as follows. The dropout value is generally 0.5, although different values are also common; it varies with the problem and the dataset. A random elimination method can also be used for dropout. When used as a threshold, the dropout value is defined in the range [0, 1]. It is not necessary to use the same dropout value in all layers; different values can be used in different layers.

The softmax function is a kind of classifier: it is the multi-class generalization of logistic regression. The term 1/Σ_j e^(f_j) normalizes the distribution so that the values sum to 1; the function therefore computes the probability that a sample belongs to each class. Given a test input x, the activation function is asked to estimate the probability p(y = j | x) for each class j = 1, ..., k; for example, it is desirable to estimate the probability of each of the possible values of the class label. Thus, the activation function produces a k-dimensional vector giving the predicted probabilities. For learning to occur, an error value must be computed, and for the softmax function this is the softmax (cross-entropy) loss. In the softmax classifier, the score mapping f(x_i; W) = W x_i remains unchanged, but the scores are now interpreted as unnormalized log probabilities for each class, and the following form of the cross-entropy loss is used:

L_i = −log( e^(f_(y_i)) / Σ_j e^(f_j) )    (9)

VGG-16, GoogleNet and ResNet-50 models were used for feature extraction. The feature vectors obtained with these models were fused to obtain higher-dimensional fusion features. In this way, the effect of insufficient features obtained from a single CNN network is minimized. In addition, there is a certain level of correlation and redundant information among the features, which increases running time and computational complexity. It is therefore necessary to rank the features. The t-test technique was used for feature ranking: it calculates the difference between two feature distributions and determines whether the difference is statistically significant [12] . In this way, the ranking process takes into account the frequency of the same features in the feature vector and the frequency with which the average feature is found.
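The following is a compact, illustrative sketch of the fuse, rank and classify pipeline described in this and the next paragraphs: deep feature vectors from the three pre-trained networks are concatenated, ranked with a two-sample t-test, and the top-ranked features are used to train a linear SVM with the squared hinge loss. The random feature arrays and the number of retained features are placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import LinearSVC

# Illustrative stand-ins for the 1000-dim feature vectors extracted per
# patch from VGG-16, GoogleNet and ResNet-50 (n samples each).
rng = np.random.default_rng(0)
n = 200
f_vgg, f_goog, f_res = (rng.normal(size=(n, 1000)) for _ in range(3))
y = rng.integers(0, 2, size=n)  # 1 = COVID-19, 0 = No findings

# Fusion: concatenate the three feature sets into one 3000-dim vector
fused = np.concatenate([f_vgg, f_goog, f_res], axis=1)

# Ranking: two-sample t-test per feature; larger |t| = more discriminative
t, _ = ttest_ind(fused[y == 1], fused[y == 0], axis=0)
top = np.argsort(-np.abs(t))[:500]  # keep the 500 best-ranked (placeholder)

# Classification: LinearSVC minimizes the squared hinge loss of Eq. 10
clf = LinearSVC(C=1.0, loss="squared_hinge")
clf.fit(fused[:, top], y)
print(clf.score(fused[:, top], y))
```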
After the feature fusion and ranking steps were performed, a binary SVM classifier was trained for classification. SVM uses kernel functions to map features into a space where they can be classified more easily [13] . A linear kernel function was used in the SVM. The SVM classifier was trained to minimize the squared hinge loss, given in Eq. 10:

min_w (1/2)‖w‖² + C Σ_n max(0, 1 − y_n wᵀ x_n)²    (10)

Here, x_n represents the fused and ranked feature vector, and the misclassification penalty is determined by the hyperparameter C in the loss function.

In the proposed method, the pre-trained CNN networks (VGG-16, GoogleNet and ResNet-50) were trained on Subset-1 and Subset-2 separately. During the test phase, patch images were given as input to the trained networks. The feature vectors obtained from these networks (1000 × 1 × 3 in total) provide a new feature set through the fusion process; correlation values between features were taken into consideration during fusion. The fused features were then ranked by the t-test method, in which features close to each other were eliminated according to feature frequency. In the last stage, the fused and ranked deep features were classified with the SVM. The proposed method is visualized in Figure 4.

There are 6000 16 × 16 CT patches in Subset-1, with an equal data distribution between the classes. 75% of these images were used for training and 25% for testing. Table 2 compares the classification performance of the pre-trained CNN networks and of the proposed method. Subset-2 includes 3000 COVID-19 and 3000 No findings 32 × 32 CT patches; the comparative classification results for Subset-2 are given in Table 3. As can be seen in Table 2, the best performance on Subset-1 was shown by the proposed method, with 95.60% accuracy. On Subset-2, the proposed method achieved the highest F1-score and MCC metrics, with 98.28% and 96.54% respectively (Table 3). The confusion matrices of the proposed method obtained on the Subset-1 and Subset-2 datasets are shown in Figure 5 and Figure 6.

The confusion matrix obtained for the proposed method on Subset-1 is given in Figure 5. In the class-wise evaluation, the COVID-19 class was classified with an accuracy rate of 97.9%, while the performance of the No findings class was lower, at 93.3%. A classification accuracy of 93.6% was obtained in the analysis of the positive class; for the negative class this rate was higher, at 97.8%. Subset-2 was then used in the training and testing process for the proposed method, and the confusion matrix for its test data is given in Figure 6. In the class-wise analysis, an accuracy rate of 97.6% was obtained for the COVID-19 class, and performance on the No findings class increased compared to Subset-1, with an accuracy rate of 98.9%. In the positive and negative class evaluation, classification accuracies of 98.9% and 97.6% were obtained respectively.

The first case of COVID-19 was found in the Wuhan region of China. COVID-19 is an epidemic disease that threatens the world health system and economy. The COVID-19 virus behaves similarly to other pandemic viruses, which makes it difficult to detect COVID-19 cases quickly; COVID-19 is therefore a candidate for a global epidemic. Radiological imaging techniques are used for a more accurate diagnosis in the detection of COVID-19, and more detailed information about COVID-19 can be obtained using CT imaging techniques.
When CT images are examined, shadowed regions come to the fore where COVID-19 is located; at the same time, a spread from the outer to the inner parts is observed. Images obtained with different CT devices were used in the study, and these images had different grey levels caused by the differing characteristics of the CT devices, which complicates the analysis of the images. In the study, deep features were obtained using pre-trained CNN networks, and these deep features were then fused and ranked. The dataset was generated by taking random patches from the CT images.

References
Clinical features of patients infected with 2019 novel coronavirus in Wuhan.
Added value of computer-aided CT image features for early lung cancer diagnosis with small pulmonary nodules: a matched case-control study.
Dermatologist-level classification of skin cancer with deep neural networks.
Mining X-ray images of SARS patients.
Lung Infection Quantification of COVID-19 in CT Images with Deep Learning.
Deep Learning System to Screen Coronavirus Disease 2019 Pneumonia.
Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks.
A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19).
Improving neural networks by preventing co-adaptation of feature detectors.
Non-Native Children Speech Recognition Through Transfer Learning.
A modified t-test feature selection method and its application on the hapmap genotype data.
Statistical learning theory: a tutorial.
Evaluation of the confusion matrix method in the validation of an automated system for measuring feeding behaviour of cattle.