key: cord-0681739-8kuitglm authors: de Carvalho Brito, Vitória; dos Santos, Patrick Ryan Sales; de Sales Carvalho, Nonato Rodrigues; de Carvalho Filho, Antonio Oseas title: COVID-index: A texture-based approach to classifying lung lesions based on CT images date: 2021-06-06 journal: Pattern Recognit DOI: 10.1016/j.patcog.2021.108083 sha: 19a89f6f37f94d2afc2da49ea8f41e5dd760ca82 doc_id: 681739 cord_uid: 8kuitglm COVID-19 is an infectious disease caused by a newly discovered type of coronavirus called SARS-CoV-2. Since the discovery of this disease in late 2019, COVID-19 has become a worldwide concern, mainly due to its high degree of contagion. As of April 2021, the number of confirmed cases of COVID-19 reported to the World Health Organization has already exceeded 135 million worldwide, while the number of deaths exceeds 2.9 million. Due to the impacts of the disease, efforts in the literature have intensified in terms of studying approaches aiming to detect COVID-19, with a focus on supporting and facilitating the process of disease diagnosis. This work proposes the application of texture descriptors based on phylogenetic relationships between species to characterize segmented CT volumes, and the subsequent classification of regions into COVID-19, solid lesion or healthy tissue. To evaluate our method, we use images from three different datasets. The results are promising, with an accuracy of 99.93%, a recall of 99.93%, a precision of 99.93%, an F1-score of 99.93%, and an AUC of 0.997. We present a robust, simple, and efficient method that can be easily applied to 2D and/or 3D images without limitations on their dimensionality. 
• We propose eight image texture descriptors, which do not require parameterization;
• The proposed descriptors do not need the images to be resized;
• We have developed a scalable method, since it can easily be used on 2D or 3D images, without restrictions regarding the quantization of the images;
• Our descriptors achieve results as promising as those of deep networks, and in some cases superior;
• Our descriptors do not require powerful hardware, unlike approaches based on deep neural networks; and
• Our descriptors do not need large numbers of images to achieve good results.
Since the discovery of a new coronavirus in China in late 2019, the disease has become a global concern, mainly due to its rapid spread. As of April 2021, the number of confirmed cases notified to the World Health Organization (WHO) has already exceeded 135 million, while the number of deaths has exceeded 2.9 million [1]. COVID-19 is an infectious disease caused by a recently discovered type of coronavirus called SARS-CoV-2. Although most people infected with COVID-19 recover without special treatment, older people and people with preexisting illnesses such as diabetes, cardiovascular disease, chronic respiratory disease, and cancer are more likely to be severely affected [1]. The early diagnosis of COVID-19 is therefore essential for the treatment of the disease. Real-time polymerase chain reaction (RT-PCR) or chest computed tomography (CT) examination are possible alternatives for early diagnosis. Several computer-aided detection (CAD) systems for the early diagnosis of COVID-19 have been developed in this context. A CAD system typically consists of three steps: (i) image acquisition; (ii) segmentation of candidate regions; and (iii) characterization and classification of these regions. In a CAD system, the segmentation stage is typically automatic, and needs to be able to handle numerous regions with similar characteristics (shape, density, or texture).
It is therefore essential to apply a stage that efficiently classifies all of these regions. Thus, the proposed method acts in the characterization and classification stages. Overall, CT images of cases of COVID-19 share certain specific features, such as the presence of ground-glass opacities (GGOs) in the early stages and lung consolidation in the advanced stages [2]. Pleural effusion may also occur in cases of COVID-19, but is less common than the other lesions. It is therefore important to point out some difficulties with this approach, as follows:
• Although the features of COVID-19 are found in most cases, CT images of some viral pneumonias also show these features, which can ultimately make diagnosis more difficult [2];
• In some cases of COVID-19, biopsies are needed [3];
• Correct classification is required between healthy and diseased regions, especially those with COVID-19 and other more serious diseases such as lung nodules; and
• According to [4, 5], COVID-19 regions are generally more rounded in shape.
In view of the above, we can see that there is a need to provide specialists with a method that can enable an individual analysis of the types of lesions found in CT scans. Through correct classification, our method can provide the individual details of the lesions, which can help the specialist in making decisions regarding the need for biopsies. Furthermore, based on the work in [4, 5], we believe that the techniques used in our method of texture characterization can provide more meaningful information for lesion classification, since the shape of the lesion is not considered in the analysis.
This work makes original contributions in the areas of both medicine and computer science, as follows:
• In the area of computer science:
- In the context of COVID-19, we propose phylogenetic and taxonomic diversity indexes for the characterization of image textures;
- We improve the efficiency of the index calculation by optimizing the phylogenetic tree assembly.
• In the medical field:
- We present a method that can be applied in patient triage and can therefore assist in the efficient management of healthcare systems;
- Our system can diagnose patients quickly, especially when the medical system is overloaded;
- Our approach can reduce the burden on radiologists and assist underdeveloped areas in making an accurate and early diagnosis.
The rest of the article is organized as follows: Section 2 reviews related works in the literature; Section 3 describes the proposed method; Section 4 presents the results obtained from an implementation of our approach; Section 5 discusses some relevant aspects of this work; and, finally, Section 6 presents the conclusion. Today, pattern recognition, and particularly intelligent analysis, is one of the most promising areas of computer science. Several studies are notable in this field, such as the method developed in [6], which used a 3D model to segment brain regions. In [7], intelligent solutions for expression recognition and landmark localization problems were presented, while the authors of [8] proposed a method based on adversarial learning to improve the efficiency of deep learning approaches in object detection. In the area of optimization, we highlight the work in [9], which introduced a consensus-based technique for a new form of data clustering and representation. Finally, we note the study in [10], which presented a method for recognizing patterns in low-resolution images using super-resolution networks.
Based on the aforementioned studies and global trends in this area, pattern recognition techniques have been used in numerous ways to help combat COVID-19 and to mitigate the damage caused by the pandemic. In this section, we note some relevant works on this topic and several works that have applied phylogenetic diversity to image texture analysis to find solutions to other problems. There is a great variety of diversity indexes, each of which has particular properties according to its categorization, for example: (i) those that exploit the richness of species and the abundance of individuals; (ii) those that explore the relationships between species; and (iii) those that explore the topology of the representations of common ancestors. There are several notable studies in this area. For example, the work in [11] used the phylogenetic distance and taxonomic diversity with a support vector machine (SVM) to classify pulmonary nodules. Other approaches have also used these indexes in automatic methods for the detection of glaucoma [12] and for the classification of breast lesions [13]. As can be seen from the studies described above, the literature contains recent works that have used phylogenetic diversity indexes as texture descriptors and have applied this approach to diverse problems with promising results. Since these were carried out in contexts that were different from ours, the studies listed above will not be used for a comparison of our results, but solely to highlight the contributions of the descriptors in the literature. In [14], a methodology based on X-ray images and deep learning techniques was presented for the classification of images into COVID-19 and non-COVID-19. The results were promising, with an accuracy of above 90%. The authors of [15] also analyzed X-ray images, and combined texture and morphological features to classify them into COVID-19, bacterial pneumonia, non-COVID-19 viral pneumonia, and normal.
The results showed an AUC of 0.87 for this multi-class classification scheme. Feature combination was also used in [16], where radiomic features and clinical details were used to describe CT images and a random forest algorithm was applied to classify the features into non-severe and severe COVID-19. Due to a lack of availability of public CT data on COVID-19, the authors of [17] built a dataset called COVID-CT, composed of 349 CT images that showed COVID-19 and 463 that did not. In [18], two subsets were extracted from a set of 150 CT images containing COVID-19 and non-COVID-19 patches, and classification was carried out using the deep feature fusion and ranking technique. The studies in [2] and [19] also applied CT images and deep learning approaches. Unlike the previous approaches, the authors of [22], [23] and [24] proposed methodologies for the detection of COVID-19 using 3D volumes of CT scans. In [22], each volume was segmented using a pre-trained UNet and later classified with a weakly-supervised 3D deep neural network called DeCoVNet. The authors of [23] also evaluated their proposed method using 3D CT regions, this time obtained from scans of 81 patients. Their scheme was a radiomic model that combined texture features with patients' clinical data to classify COVID-19 into common or severe types. Finally, the approach in [24] classified CT images into healthy, idiopathic pulmonary fibrosis (IPF) and COVID-19 using a 3D approach called three-dimensional multiscale fuzzy entropy. It can be observed from the above studies that solving the problem of COVID-19 diagnosis is not a simple task. Despite the application of various CNN-based methods to image classification, in which a CNN is responsible for extracting and selecting representative features in its convolutional layers, these feature maps are not always efficient enough to allow for classification.
Although recent work on a diverse range of imaging applications has used CNNs, with results that have surpassed those of other methods, in cases where representative features of a specific problem exist, the use of these features may be more efficient than CNN methods. Another problem encountered when using CNNs, which was also noted in [17, 18], is the large number of choices required to create models, in terms of both the architecture and its parameters. The authors of these studies therefore proposed the use of transfer learning instead. In addition, the training of a CNN requires considerable time in order to create a capable model, and several tests of architectures and parameters are required. Powerful machines are also needed to run these networks. Finally, the use of a CNN requires a large number of images, and data augmentation is often required to handle this issue. However, this is not a trivial task, as it requires countless tests and training of the whole network until satisfactory results are obtained. We therefore propose to use phylogenetic diversity indexes for the feature extraction task for COVID-19, solid lesions, and healthy tissue, in conjunction with the random forest and extreme gradient boosting classifiers. This section describes our methodology for classifying CT volumes as COVID-19, solid lesion, or healthy tissue. The images used in this study were acquired from the Lung Image Database Consortium Image Collection (LIDC-IDRI) [25] and from the MedSeg [26] repository, the latter of which contained two datasets with COVID-19 images. At the feature extraction stage, we applied phylogenetic diversity indexes. The extraction and classification algorithms developed in this study are available from our GitHub repository. Figure 1 illustrates the workflow of our methodology. We used three sets of volumes of interest (VOIs) extracted from the LIDC-IDRI and the MedSeg repositories to evaluate our method.
These were as follows: (i) a set of images extracted from the LIDC-IDRI that contained VOIs showing solid lesions, for which we used the markings made by specialists in the base documentation; (ii) a set of healthy tissue VOIs that were extracted from the LIDC-IDRI by applying the algorithm proposed in [27], to guarantee that the VOIs of healthy tissue did not intersect with those of solid-type lesions (this method was chosen as it could provide samples found in real scenarios); and (iii) a set of images acquired from the MedSeg repository [26], which contained some external datasets of various types of CT exams, including those diagnosed with COVID-19. Hence, we used two different sets of images containing lesions caused by COVID-19, i.e., regions with GGO lesions, consolidation, and pleural effusion. We used the specialists' markings that were available for the respective datasets to extract these lesions. Since MedSeg does not provide terminology for the COVID-19 datasets used here, we refer to these as COVID-19 (Dataset 1) and COVID-19 (Dataset 2). Table 1 shows the number of images in each dataset. In this step, we present the rationale for the proposed indexes for texture characterization. Each index corresponds to a certain characteristic, meaning that a total of eight characteristics are extracted from each analyzed image. Phylogenetics is a branch of biology in which the evolutionary relationships between species are studied and the similarities between them described. In phylogenetic trees, leaves represent species and nodes represent common ancestors. The phylogenetic tree used in this work is called a cladogram. Figure 2 illustrates an example of a cladogram that represents the genetic relationship between monkey and human species; it can be observed that from a genetic perspective, humans and chimpanzees are closer than the other pairs of species in the tree.
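The cladogram concept can be made concrete with a small sketch. The topology below, (((human, chimpanzee), gorilla), monkey), and the internal node names are illustrative assumptions based on the description of Figure 2, not the figure's exact tree; the distance is simply the number of edges on the path between two leaves.

```python
# Parent-pointer representation of an assumed cladogram topology:
# (((human, chimpanzee), gorilla), monkey). Node names are illustrative.
parent = {
    "human": "anc1", "chimpanzee": "anc1",   # most recent common ancestor
    "anc1": "anc2", "gorilla": "anc2",
    "anc2": "root", "monkey": "root",
}

def path_to_root(node):
    """Return the list of nodes from a leaf up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def distance(a, b):
    """Number of edges on the path between two leaves in the cladogram."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    # depth of each leaf below the deepest shared ancestor
    da = next(i for i, n in enumerate(pa) if n in common)
    db = next(i for i, n in enumerate(pb) if n in common)
    return da + db

print(distance("human", "chimpanzee"))  # 2: one edge up, one edge down
print(distance("human", "monkey"))      # 4
```

Under this toy topology, human–chimpanzee is the closest pair, mirroring the observation about Figure 2.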
A combination of phylogenetic trees and phylogenetic diversity indexes is used to analyze the evolutionary relationships between species and to measure the variation between species in a community. In order to be able to apply these concepts to the characterization of CT images, we need to define a correspondence between the definitions used in biology and those used in this work. This is illustrated in Figure 3. Using the correspondence shown in Figure 3 as a basis, we generate a cladogram for each study image, and an example of this is shown in Figure 4. The phylogenetic diversity (PD) index [28] is a measure that gives the sum of the distances of the phylogenetic branches in the tree. When the branch length is longer, the species become more distinct. Equation (1) shows the formula for the PD, where B represents the number of branches in the tree, L_i is the extension of branch i (the number of edges in that branch), and A_i refers to the average abundance of the species that share branch i:

PD = B \frac{\sum_{i=1}^{B} L_i A_i}{\sum_{i=1}^{B} A_i} \qquad (1)

The sum of phylogenetic distances (SPD) is a phylogenetic index that gives the sum of the distances between the pairs of species present in the tree [29]. Equation (2) is used to calculate this index, where S represents the number of species, and a_i and a_j correspond to the abundances of species i and j, respectively. The term \sum_{j=i+1}^{S} in the numerator represents the double sum of the products of the distances between all species in the tree and their abundances; in the denominator, it corresponds to the double sum of the products of the abundances of the species:

SPD = S(S-1) \frac{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} d_{ij}\, a_i a_j}{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} a_i a_j} \qquad (2)

The mean nearest neighbor distance (MNND) is a weighted average of the phylogenetic distance of the nearest neighbor of each species [30]. The weights represent the abundance of each species.
Equation (3) shows the formula used to calculate this index, where S represents the number of species in the community, \min(d_{ij}) represents the distance between species i and j, and a_i corresponds to the abundance of species i. In the case of d_{ij}, j refers to the closest relative of species i:

MNND = \frac{\sum_{i=1}^{S} \min(d_{ij})\, a_i}{\sum_{i=1}^{S} a_i} \qquad (3)

The phylogenetic species variability (PSV) index measures the variation between two species in a community, and quantifies the phylogenetic relationship between them. The PSV is calculated using Equation (4), where C is a matrix, \mathrm{tr}C is the sum of the diagonal values of this matrix, \sum c represents the sum of all of the values in the matrix, and S is the total number of species:

PSV = \frac{S \cdot \mathrm{tr}C - \sum c}{S(S-1)} \qquad (4)

The phylogenetic species richness (PSR) index calculates the richness of the species present in a community based on their variability [29]. As shown in Equation (5), this calculation is done by multiplying the number of species (S) by the PSV:

PSR = S \cdot PSV \qquad (5)

The mean phylogenetic distance (MPD) represents the average phylogenetic distance, which is calculated by analyzing combinations of all pairs of species in the community [30]. The equation for this index uses the total number of species, indicated by S; the phylogenetic distance between each pair of species, denoted by d_{ij}; and the variables p_i and p_j, which take a value of one if the species is present and zero otherwise. The term \sum_{j=i+1}^{S} indicates a double sum of the products of the distances between all species in the tree and the values indicating the presence or absence of the species, i.e., one or zero. Equation (6) shows the formula used to calculate the MPD:

MPD = \frac{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} d_{ij}\, p_i p_j}{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} p_i p_j} \qquad (6)

The taxonomic diversity index (Δ) represents the average phylogenetic distance between the individuals of the species [31]. This index takes into consideration the number of individuals of each species and the taxonomic relationships between them.
The formula for calculating Δ is defined by Equation (7), where a_i (i = 1, ..., S) represents the abundance of species i, a_j (j = 1, ..., S) represents the abundance of species j, S indicates the total number of species, n denotes the total number of individuals, and d_{ij} is the taxonomic distance between species i and j:

\Delta = \frac{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} d_{ij}\, a_i a_j}{n(n-1)/2} \qquad (7)

Finally, the taxonomic distinction index (Δ*), defined by Equation (8), expresses the average taxonomic distance between two individuals of different species [31]. In this expression, a_i (i = 1, ..., S) is the abundance of species i, a_j (j = 1, ..., S) is the abundance of species j, S is the total number of species, and d_{ij} is the taxonomic distance between species i and j:

\Delta^* = \frac{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} d_{ij}\, a_i a_j}{\sum_{i=1}^{S-1}\sum_{j=i+1}^{S} a_i a_j} \qquad (8)

The equations in Section 3.2.1 are derived in relation to biological concepts. For a better understanding of how these indexes are calculated for images, we present an example of a three-dimensional image with two slices, from which we extract the cladogram and calculate the distances and the eight indexes. We used a small image so that the calculations were not extensive in the paper. To calculate the phylogenetic distances based on the cladogram, we use distance equations defined between each pair of distinct species i and j, as illustrated in Figure 4. In our implementation of these indexes, we represent the cladogram as a histogram structure. Each position in the histogram represents a species (the intensities in the image), and each value refers to the abundance (the number of voxels with each intensity). We can then calculate the distances using the histogram. When constructing the cladogram, we apply a simple but efficient optimization, as illustrated in Figure 4. The resulting distances are shown in Table 2, while the abundance of each species can be seen in Figure 4. Since the calculation of the PD is more complex than that of the other indexes, we describe it using Table 3.
Based on the values in Table 3, the PD value for the example image can be computed. To calculate the SPD, as shown in Equation (2), we use the distances between each pair of intensities together with their abundances. For the MNND, \min(d_{ij}) represents the closest relative of i (intensity j, which refers to the intensity following i), and a_i denotes the abundance of species i (the number of voxels with intensity i). Since our cladogram has only one path between one species and another, the minimum path is the only path between species, and the MNND for our image follows directly. To calculate the PSV, as shown in Equation (4), we use the matrix C derived from the cladogram. The PSR, as shown in Equation (5), is obtained by multiplying the PSV by the number of species. To calculate the MPD, as defined in Equation (6), we consider the sum of the distances between species (d_{ij}) multiplied by the variables p_i and p_j, which take a value of zero if the species is not present, or one if the species is present. Thus, when a given intensity in the histogram exists in the image, p is set to one; otherwise, p is set to zero. In the calculation of Δ, as defined in Equation (7), d_{ij} represents the distance between the species (intensities) i and j, and a_i refers to the abundance of species i. Finally, the calculation of Δ*, as defined by Equation (8), is similar to that of Δ, with the difference that in the case of Δ* the denominator is the sum of the products of the abundances of the species.
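As a complement to the worked example, the sketch below computes several of the indexes directly from an image histogram. It is illustrative only: since Figure 4 and Table 2 are not reproduced here, the hypothetical two-slice volume, the linear distance d_ij = |i - j|, and the omission of PD, PSV, and PSR (which additionally require the assembled cladogram's branch lengths and covariance matrix) are all assumptions, not the paper's exact conventions.

```python
import numpy as np

def texture_indexes(hist):
    """Compute five of the eight diversity indexes from an intensity
    histogram: species = intensities present, abundance = voxel counts.
    The pairwise distance d_ij = |i - j| is an illustrative stand-in
    for the cladogram path length."""
    hist = np.asarray(hist, dtype=float)
    sp = np.flatnonzero(hist)                  # species (intensities)
    a = hist[sp]                               # abundances
    S, n = len(sp), hist.sum()                 # species count, individuals
    d = np.abs(sp[:, None] - sp[None, :]).astype(float)
    iu = np.triu_indices(S, k=1)               # pairs with i < j
    dd, aa = d[iu], (a[:, None] * a[None, :])[iu]

    spd = S * (S - 1) * (dd * aa).sum() / aa.sum()   # Eq. (2)
    np.fill_diagonal(d, np.inf)                      # exclude self-pairs
    mnnd = (d.min(axis=1) * a).sum() / a.sum()       # Eq. (3)
    mpd = dd.sum() / len(dd)                         # Eq. (6), all p_i = 1
    delta = (dd * aa).sum() / (n * (n - 1) / 2.0)    # Eq. (7)
    delta_star = (dd * aa).sum() / aa.sum()          # Eq. (8)
    return spd, mnnd, mpd, delta, delta_star

# Hypothetical two-slice volume of 2x2 voxels per slice.
volume = np.array([[[1, 1], [2, 3]], [[3, 3], [2, 1]]])
hist = np.bincount(volume.ravel(), minlength=4)      # abundance per intensity
print(texture_indexes(hist))
```

Because the histogram already encodes species and abundances, no explicit tree needs to be stored, which reflects the optimization of the cladogram assembly mentioned among the contributions.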