title: Computer-Aided Diagnosis of Spinal Tuberculosis From CT Images Based on Deep Learning With Multimodal Feature Fusion
authors: Li, Zhaotong; Wu, Fengliang; Hong, Fengze; Gai, Xiaoyan; Cao, Wenli; Zhang, Zeru; Yang, Timin; Wang, Jiu; Gao, Song; Peng, Chao
date: 2022-02-23
journal: Front Microbiol
DOI: 10.3389/fmicb.2022.823324

BACKGROUND: Spinal tuberculosis (TB) has the highest incidence in remote plateau areas, particularly in Tibet, China, owing to inadequate local healthcare services, which both facilitates the transmission of TB bacteria and increases the burden on grassroots hospitals. Computer-aided diagnosis (CAD) is urgently required to improve the efficiency of clinical diagnosis of TB using computed tomography (CT) images. However, classical machine learning with handcrafted features generally has low accuracy, and deep learning with self-extracted features relies heavily on the size of medical datasets. Therefore, CAD that effectively fuses multimodal features is an alternative solution for spinal TB detection.

METHODS: A new deep learning method is proposed that fuses four elaborate image features: three handcrafted features and one convolutional neural network (CNN) feature. Spinal TB CT images were collected from 197 patients with spinal TB, from 2013 to 2020, in the People's Hospital of Tibet Autonomous Region, China; 3,000 effective lumbar spine CT images were randomly screened into our dataset, from which two sets of 1,500 images each were classified as tuberculosis (positive) and health (negative). In addition, virtual data augmentation is proposed to enlarge the handcrafted features of the TB dataset. The proposed multimodal feature fusion CNN consists of four main sections: matching network, backbone (ResNet-18/50, VGG-11/16, DenseNet-121/161), fallen network, and gated information fusion network. Detailed performance analyses were conducted on the multimodal features, the proposed augmentation, model stability, and the model-focused heatmap.

RESULTS: Experimental results showed that the proposed model with VGG-11 and virtual data augmentation exhibited optimal performance in terms of accuracy, specificity, sensitivity, and area under the curve. In addition, an inverse relationship existed between the model size and test accuracy. The model-focused heatmap also shifted from irrelevant regions to the bone destruction caused by TB.

CONCLUSION: The proposed augmentation effectively simulated the real data distribution in the feature space. More importantly, all the evaluation metrics and analyses demonstrated that the proposed deep learning model achieves efficient fusion of multimodal features. Our study provides a profound insight into the preliminary auxiliary diagnosis of spinal TB from CT images applicable to the Tibetan area.

Spinal tuberculosis (spinal TB) is secondary to TB of the lung, gastrointestinal tract, or lymphatic tract, and it causes bone TB via the blood circulation route (Garg and Somvanshi, 2011; Rasouli et al., 2012; Khanna and Sabharwal, 2019). The insidious onset of spinal TB and the lack of specificity in its clinical manifestations can lead to serious symptoms, such as kyphosis, abscess formation, and spinal instability, further causing paraplegia or death (Qian et al., 2018; Vanino et al., 2020).
The incidence of TB is significantly higher in underdeveloped plateau regions, particularly in the Tibetan area of China (Du et al., 2017; Zhu et al., 2017); for example, the rate of reported TB cases in the Tibet Autonomous Region was 166.6 per 100,000 in 2017, the highest in China. Spinal TB accounts for approximately 2% of pulmonary TB, 15% of extrapulmonary TB, and 50% of bone and joint TB worldwide (Fuentes Ferrer et al., 2012). Moreover, the CT manifestation of spinal TB is complicated, comprising typical manifestations (destruction of the vertebral body, collapse of the vertebral space, abscess compression on the spinal cord or nerve roots, etc.) and atypical manifestations (vertebral body osteoid formation, vertebral body disruption in the anterior column, worm-like disruption of the vertebral body endplate, pus in the paravertebral soft tissue shadow, continuous unilateral bone disruption, and asymmetry between the imaging manifestations and symptoms) (Rauf et al., 2015). Local grassroots hospitals lack experienced specialists and multimodal medical imaging equipment; they have only CT or digital radiography (DR) machines. Therefore, most Tibetan grassroots doctors cannot make expeditious medical decisions. These poor health conditions lead to high rates of misdiagnosis, missed diagnosis, and delays in effective treatment, which result in severe complications that impose serious social burdens on Tibetan herdsmen (Wang et al., 2015).

Computer-aided diagnosis (CAD), including classical machine learning (ML) and deep learning (DL), is an effective method for assisting primary care physicians in treating patients with spinal TB; CAD builds mathematical models on computers using fuzzy mathematics, probability statistics, and even artificial intelligence to process patient information and propose diagnostic opinions and treatment plans. To the best of our knowledge, except for a few reports on the simple application of statistical analysis to the clinical diagnosis of spinal TB (Zhang et al., 2019; Liu et al., 2021), there are limited studies on artificial intelligence-aided diagnosis of spinal TB, including diagnostic classification, pathological grading, lesion segmentation, and prognostic analysis.

Radiomics, a typical example of traditional ML, is an automated high-throughput method that extracts a large number of quantitative handcrafted features from medical images (Lambin et al., 2012). These handcrafted features convert digital images into mineable data for subsequent analysis and decision support (Gillies et al., 2016), covering color, texture, shape, and statistical characteristics, including scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and Oriented FAST and Rotated BRIEF (ORB) (Abdellatef et al., 2020). Currently, although many handcrafted features have been designed for various clinical applications (Moradi et al., 2007; Aerts et al., 2014; Cook et al., 2014; Wang et al., 2014; Li et al., 2018; Tian et al., 2018; Song et al., 2021), classical ML cannot accurately perform ancillary diagnostics of TB owing to its limited accuracy (Goodfellow et al., 2016; Currie et al., 2019).
The design of handcrafted features often involves finding the right trade-off between accuracy and computational efficiency based on a subjective understanding of the key issues (Nanni et al., 2017); therefore, an inappropriate handcrafted feature typically results in poor generalization ability (Suzuki et al., 2012), which significantly hinders the development of ML diagnostic systems. In contrast, DL based on convolutional neural networks (CNNs) is another medical CAD method that enhances the identification of subtle differences in radiographical characteristics, and it is feasible for integrating multi-omics medical data by harnessing the power of computing (Altaf et al., 2019; Alkhateeb et al., 2021). Unlike traditional ML, the features extracted by DL are determined by the CNN during training, without elaborate design (Anwar et al., 2018). Various CNN models are applicable to different medical scenarios, such as common CNNs for grading (Yang et al., 2018; Swarnambiga et al., 2019), U-Net for segmentation (Ronneberger et al., 2015; Jackson et al., 2018; Deng et al., 2021), and GANs for the generation of synthetic images (Lei et al., 2019). The technological innovations of CAD show that DL could be a suitable candidate for auxiliary diagnosis in modern healthcare systems. However, a CNN needs large datasets to extract features automatically and requires significant training time to obtain a reliable model (Goodfellow et al., 2016; Currie et al., 2019), both of which are scarce resources in medical practice. Moreover, the lack of interpretability of DL is another important factor that hinders its development in rigorous clinical work.

Therefore, the effective fusion of multimodal features extracted by both ML and DL is one of the key directions for further improving the performance of CAD beyond either approach alone. This approach has had several successful applications with medical radiological images, such as the determination of tumor benignity and malignancy (Antropova et al., 2017; Xie et al., 2018; Khosravi et al., 2021), lesion segmentation (Su et al., 2021), survival prediction (Shboul et al., 2019; Guo et al., 2021), detection of COVID-19 from chest CT images (Wang S.-H. et al., 2021), and cancer diagnosis and prognosis (Chen et al., 2020). By contrast, published studies on spinal TB have mainly focused on its clinical manifestations and surgical protocols (Garg and Somvanshi, 2011; Zhu et al., 2017; Khanna and Sabharwal, 2019). Different feature fusion methods have been developed for different clinical purposes, such as Bayesian algorithm-based methods that realize fusion decisions over multiple features (Khaleghi et al., 2013), sparse representation-based methods that obtain a joint sparse representation of multiple features (Lai and Deng, 2018), and DL-based methods that strengthen the feature learning process of deep neural networks. However, most of the aforementioned fused features are different representations under the same modality, owing to the difficulty of multimodal fusion, and in cross-modal learning it is difficult to implement transfer learning between more than two modalities.
Conversely, the gated information fusion network (Arevalo et al., 2017; Kim et al., 2018) ensures that each single modality can work independently while transferring knowledge mutually, and it realizes the effective fusion of multimodal information, including histology images and genomic features (Chen et al., 2020). It adopts the Kronecker product of unimodal feature representations and controls the expressiveness of each single feature via a gated attention mechanism.

In this study, a multimodal feature fusion CNN is proposed to classify spinal TB CT images obtained from local grassroots hospitals in the Tibetan area. Although it implements only the binary tuberculosis-health classification of spinal TB CT images, it provides a breakthrough in the application area of spinal TB auxiliary diagnosis. Specifically, the proposed network fuses three elaborate handcrafted features, namely SIFT, SURF, and ORB, with the DL feature that originates from the convolutional output layer of common CNNs. A new augmentation algorithm for handcrafted features, which effectively simulates the data distribution in the feature space, is proposed as a substitute for image augmentation. Additionally, a model was designed to effectively integrate these individual features; it comprises four sections: a matching network for consistency of the different feature dimensions, a backbone for sparse representation of features, a fallen network for dimensionality reduction, and a fusion network for hybridizing multimodal features through a gated mechanism. We evaluated the hypothesis that the proposed method can effectively distinguish tubercular cases from healthy images by conducting experiments and performing several analyses. For convenience, from here on, "positive" and "negative" represent tuberculosis and health, respectively. Building on these initial results, further research will be conducted on other auxiliary diagnostics to form a complete auxiliary diagnostic process for spinal TB and address the long-standing problem of spinal tuberculosis in Tibet.

A multimodal image dataset was obtained from the People's Hospital of Tibet Autonomous Region, China, consisting of DR and CT images of 197 patients with spinal TB acquired between 2013 and 2020. The cases were screened by two physicians based on basic patient information, medical records, and imaging evaluation; all were surgically treated, with spinal tuberculosis pathologically confirmed according to the corresponding diagnostic guidelines (Liu et al., 2021). The inclusion and exclusion criteria for the spinal TB cases were as follows.

Inclusion criteria:
• Diagnosis of spinal tuberculosis confirmed by puncture biopsy or postoperative pathological examination;
• Preoperative DR and CT examinations performed;
• Complete case data (e.g., gender, age, medical history, physical examination, imaging, and pathology data);
• Patients who were first examined in primary care hospitals in less developed areas and had CT imaging data were prioritized for inclusion.

Exclusion criteria:
• Cases suspected of having spinal tuberculosis without pathological examination;
• A history of spinal trauma before the diagnosis of spinal tuberculosis;
• Incomplete case information.

Table 1 presents the patients' gender, age, and lesion segment.
Some patients had infections at multiple sites; therefore, the total number of female and male patients does not equal the total number of cases of cervical, thoracic, lumbar, and sacral vertebral infections. Middle-aged people (30-59 years) were the most frequently infected among all age groups, and more men than women were infected with spinal TB; this group also bears the heaviest social and family pressures. Furthermore, the lumbar vertebrae are the most susceptible to spinal TB infection; therefore, the current research was conducted mainly on TB of the lumbar vertebrae.

Although X-ray examinations are widely used in primary hospitals, they provide limited information. CT examinations are approximately 20-25 times more sensitive than X-ray-based tissue density tests and are currently one of the most effective clinical bone examination methods. Spiral CT provides a high-resolution visualization of the destruction, hyperplasia, sclerosis, and focal boundaries of vertebral bone. Furthermore, it reveals the position of dead and fragmented bone and their protrusion into the spinal canal, showing paravertebral abscesses and their density. Moreover, the lumbar spine has the highest incidence of tuberculosis because it has the most mobility and bears the heaviest load along the entire spine, as shown in Table 1. For this initial research on spinal TB, a total of 3,000 CT images of the lumbar vertebrae were randomly selected at the slice level from the abovementioned multimodal image dataset, comprising 1,500 slices each for negative and positive cases. The result is a small dataset of spinal TB CT images for exploring the flexibility of CAD on spinal TB CT images.

Feature engineering is a key step in the supervised classification of pathology images that directly affects the final classification result. Image feature extraction is the premise of image analysis and the most effective way to simplify the expression of high-dimensional image data. Based on the qualitative diagnostic characteristics described by orthopedists above, three handcrafted features were extracted from spinal TB CT images, namely three types of feature descriptors of the vertebral column in CT slices: SIFT (Lowe, 2004), SURF (Bay et al., 2006), and ORB (Rublee et al., 2011). In addition to these handcrafted features, deep features were extracted from the convolutional and fully connected layers of CNNs. These elaborate features require initial preprocessing to ensure dimensional consistency between different features before extracting the respective image characteristics. The meanings of the diverse features are as follows.

Several kinds of key points in an image, such as corners, edges, highlights, and dark spots, do not change with luminance, transformation, or noise. These image feature points, typically used for image matching and image recognition, can reflect the essential features of an image. Scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and Oriented FAST and Rotated BRIEF (ORB) are three widespread methods for describing such local feature points. Speckles and corners are two typical feature points that reflect key information in the image. Speckle points usually refer to areas whose color and grayscale differ from those of the surrounding regions.
Corner points are the intersections of two edges in the stable and informative areas of an image, and they exhibit characteristics such as rotation invariance, scale invariance, affine invariance, and illumination invariance. These feature descriptors have been applied in various medical scenarios, such as medical image classification (Khan et al., 2015), medical image stitching (Singla and Sharma, 2014; Win and Kitjaidure, 2018), medical image fusion (Wang L. et al., 2021), medical image registration (Lukashevich et al., 2011; Li et al., 2012), and medical image retrieval (Govindaraju and Kumar, 2016).

Scale-invariant feature transform uses the Difference of Gaussian (DoG) matrix, a speckle detection method, to detect scale-space extrema, and uses an orientation histogram to extract the key point direction. The essence of the SIFT algorithm is to identify key points and calculate their directions in different scale-spaces. The key points found by SIFT are mostly speckle points that are invariant to illumination, affine transformation, and noise, such as highlights in dark areas and dark spots in bright areas.

Speeded-up robust features is a scale- and rotation-invariant descriptor based on SIFT. Rather than using the difference of Gaussian matrix to detect scale-space extrema as in SIFT, it calculates an approximation of the Laplacian of Gaussian via a Hessian matrix. Instead of the orientation histogram used in SIFT, the Haar wavelet response is used to assign key point orientations in SURF. Therefore, the key points found by SURF differ significantly from the speckle points found by SIFT. SURF detects more key points than SIFT, whereas the vector dimension (64) of SURF is half the length (128) of SIFT.

As a very fast binary descriptor based on two algorithms, Features from Accelerated Segment Test (FAST) (Khan et al., 2015) and Binary Robust Independent Elementary Features (BRIEF) (Calonder et al., 2010), ORB is an improved algorithm that outperforms the SIFT and SURF algorithms in terms of nearest-neighbor matching and description efficiency. FAST is used to extract corner points whose gray values differ markedly from those of the surrounding pixels, and BRIEF is employed to describe the points extracted by FAST. ORB yields the fewest feature points and the lowest dimensionality (32) of the extracted features. In summary, it is a fast feature extraction and matching algorithm, albeit with lower descriptor quality than SIFT and SURF.

Before the feature points were extracted, the original spinal TB CT images were enhanced by adjusting the window width and level, and the vertebral region was isolated by watershed segmentation, which clearly presented the centrum and also eliminated noise interference from non-skeletal areas. Subsequently, the feature points of SIFT, SURF, and ORB were passed to the bag-of-words (BoW) and term frequency-inverse document frequency (TF-IDF) models to obtain 50-dimensional feature vectors, as illustrated in Figure 1. The BoW and TF-IDF models with virtual augmentation are explored in section "Feature Preprocessing." Finally, we obtained three eigenvectors, which are the local features of all TB images.
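To make this handcrafted pipeline concrete, the following is a minimal Python sketch of descriptor extraction, BoW coding, and TF-IDF weighting using OpenCV and scikit-learn. The file names are placeholders, and the choice of KMeans and TfidfTransformer as the BoW/TF-IDF machinery is an assumption consistent with the description above, not the authors' released code.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer

# Placeholder paths; in the paper these are windowed, watershed-segmented lumbar CT slices.
paths = ["slice_001.png", "slice_002.png"]

sift = cv2.SIFT_create()
# orb = cv2.ORB_create()                 # the analogous pipeline for ORB
# surf = cv2.xfeatures2d.SURF_create()   # SURF requires an opencv-contrib build

per_image = []
for p in paths:
    img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)  # (num_keypoints, 128) for SIFT
    per_image.append(desc)

# Bag of words: cluster all descriptors into 50 visual words (the paper's category count).
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(np.vstack(per_image))

# Each slice becomes a 50-bin histogram of visual-word occurrences.
hists = np.stack([np.bincount(codebook.predict(d), minlength=50) for d in per_image])

# TF-IDF weighting turns the histograms into 50-dimensional weighted feature vectors.
tfidf_vectors = TfidfTransformer().fit_transform(hists).toarray()
```

The same three steps are repeated independently for SURF and ORB, yielding the three 50-dimensional local feature vectors.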
In addition to the above handcrafted features, the DL characteristics extracted from the convolutional layers of the CNN constitute another critical feature containing highly abstract image information. It is generally assumed that local pixels are more closely related spatially than pixels farther apart. Thus, each neuron only needs to perceive a local area of the image rather than the global image, and the global information is obtained by combining local information at a higher level. A variety of CNNs have been applied to medical image processing tasks, such as ResNet, VGG, and DenseNet, and thus the DL features also differ from each other. Because of the black-box property of DL features (Guidotti et al., 2018), different CNNs were selected to form the backbone of the proposed network to explore the optimal classification performance on spinal TB CT images. Figure 2 shows the procedure for extracting the DL features. Because the TB image dataset was small, models with too many or too few parameters tended to overfit or underfit, respectively; accordingly, neural networks with different numbers of layers, ResNet-18 and ResNet-50, VGG-11 and VGG-16, and DenseNet-121 and DenseNet-161, were selected as the backbone of the proposed network to avoid overfitting or underfitting.

The elaborate features must be preprocessed for dimensional consistency between different features. The identical handcrafted features of each slice were stacked vertically into one larger characteristic set. Subsequently, two algorithms, BoW and TF-IDF, were used to handle the low-dimensional characteristic set extracted from the single-scale image. BoW adopted the K-means clustering method for unsupervised clustering of the large number of extracted SIFT, SURF, and ORB key points; features with strong similarities were assigned to the same cluster. TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF); it indicates the weight vector of features, where TF is the frequency of occurrence of a feature among all features, and IDF represents the uniqueness of a feature.

Figure 3 shows a flowchart of the feature preprocessing. First, we used the key point descriptors of SIFT, SURF, and ORB from the training sets to build a CodeBook using BoW. The clustering features are the statistics on the number of occurrences of each category after clustering the feature descriptors by searching the CodeBook. The number of categories was set to 50 after several pretraining experiments with individual features. Second, a new data augmentation algorithm was proposed to improve generalization on small datasets; Algorithm 1 describes this method, which is applicable only to the cluster features processed by the BoW model. Specifically, the clustering information of each feature point was calculated using the CodeBook, and perturbation noise obeying a normal distribution was used to jitter the clustering information, which increased the generalizability of the dataset. Finally, TF-IDF implemented feature weighting by counting the frequency information of each feature vector appearing in the augmented feature sets. None of the augmented feature vectors exists in a real TB image; that is, they correspond only to virtual SIFT, SURF, and ORB key points.
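Before the formal listing in Algorithm 1 below, the following is a minimal Python sketch of this jittering step. It assumes a (num_slices, 50) matrix of BoW cluster histograms such as `hists` from the earlier sketch; the noise scale `sigma` and the number of copies per sample `n_copies` are hypothetical hyperparameters, as the paper does not report their values.

```python
import numpy as np

def augment_cluster_features(cluster_feats, n_copies=5, sigma=1.0, seed=0):
    """Generate virtual cluster features by jittering each BoW histogram
    with normally distributed perturbation noise (sigma is an assumed value)."""
    rng = np.random.default_rng(seed)
    augmented = []
    for c in cluster_feats:
        for _ in range(n_copies):
            a = c + rng.normal(0.0, sigma, size=c.shape)
            augmented.append(np.clip(a, 0, None))  # keep histogram counts non-negative
    return np.asarray(augmented)

# Usage: virtual = augment_cluster_features(hists); TF-IDF weighting is then
# applied to the augmented set, exactly as for the real cluster features.
```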
Algorithm 1 | Virtual data augmentation for cluster features.
Input: cluster feature set C = {c_1, c_2, ..., c_m}
Output: augmented feature set A = {a_11, ..., a_1n, a_21, ..., a_2n, ..., a_m1, ..., a_mn}
for i ∈ [1, 2, ..., m] do
    for j ∈ [1, 2, ..., n] do
        a_ij ← c_i + ε_ij, where ε_ij is perturbation noise drawn from a normal distribution
    end for
end for

After comprehensively considering the characteristics of vertebral morphology, we extracted four features from the slice images: the SIFT, SURF, and ORB vectors and the CNN features. Although the extracted handcrafted and DL features cover a wide range of valuable information involving both local tissue and the global slice, an effective method is imperative to fuse these features from different scales so that the prediction accuracy surpasses that of any single feature. As shown in Figure 4, the proposed network consists of four phases: the matching network that adjusts the handcrafted features, the backbone (i.e., different common CNNs) for processing all features, the fallen network for dimensionality reduction, and the fusion network for blending the different characteristics. Each network is explained in the following sections.

FIGURE 4 | Proposed network for classifying tuberculosis images. The middle three histograms are the SIFT, SURF, and ORB vectors of length 50 extracted from raw CT images, and the green, blue, and orange blocks represent the matching network, backbone, and fallen network, respectively. The matching network and fallen network are illustrated in Figures 5A,B, respectively, and the last block is the fusion network, illustrated in Figure 5C.

Inconsistencies were present in the characteristic dimensions of the handcrafted and DL features. Specifically, all handcrafted features were stacked into one-dimensional features of size 1 × 50, which was inconsistent with the dimensions of the DL features. Therefore, a matching network was required to reconcile the mismatch in feature sizes between the handcrafted and DL features, that is, to convert the one-dimensional vectors into two-dimensional ones. The matching network consists of nine convolutional blocks, each including a fractionally strided convolution, batch normalization, and a ReLU activation function. The detailed architecture of the matching network is shown in Figure 5A. The one-dimensional feature of size 1 × 1 × 50 is mapped to a two-dimensional vector of size 224 × 224 × 3, which matches the common input size of CNN architectures such as ResNets and the two-dimensional space of the DL features. Hence, it is easier to tune hyperparameters and to fuse the handcrafted and DL features.

After the matching network, the handcrafted image features reside in the same dimensional space as the DL features. Subsequently, a common network was employed to process these different vectors, both handcrafted and DL. This integrated network includes two foundational parts: a backbone network and a fallen network. Various CNNs, such as ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and VGG-Net (Simonyan and Zisserman, 2015), serve as the backbone, and they have exhibited outstanding performance in different applications. The fallen network includes two convolutional operations and one average pooling, as shown in Figure 5B. It refines the output characteristics of the backbone by mapping them into a low-dimensional space; that is, the 7 × 7 two-dimensional output falls into a 1 × 1 one-dimensional representation.
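The following is a minimal PyTorch sketch of the matching and fallen networks under the stated shape constraints. The paper fixes the block count (nine), the block contents (fractionally strided convolution, batch normalization, ReLU), and the input/output sizes; the channel widths and the kernel/stride schedule used here are assumptions chosen only to satisfy those constraints.

```python
import torch
from torch import nn

def up_block(cin, cout, k, s, p):
    # One matching-network block: fractionally strided conv + batch norm + ReLU.
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, k, s, p),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

# Matching network: nine blocks mapping a 1 x 1 x 50 handcrafted vector to 224 x 224 x 3.
matching = nn.Sequential(
    up_block(50, 512, 7, 1, 0),   # 1 -> 7
    up_block(512, 512, 4, 2, 1),  # 7 -> 14
    up_block(512, 256, 3, 1, 1),  # 14 -> 14 (refinement)
    up_block(256, 256, 4, 2, 1),  # 14 -> 28
    up_block(256, 128, 4, 2, 1),  # 28 -> 56
    up_block(128, 128, 3, 1, 1),  # 56 -> 56 (refinement)
    up_block(128, 64, 4, 2, 1),   # 56 -> 112
    up_block(64, 64, 3, 1, 1),    # 112 -> 112 (refinement)
    up_block(64, 3, 4, 2, 1),     # 112 -> 224
)

# Fallen network: two convolutions plus average pooling that collapse the backbone's
# 7 x 7 output into a 128-dim vector (512 input channels assume a VGG-style backbone).
fallen = nn.Sequential(
    nn.Conv2d(512, 256, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 128, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(2, 50, 1, 1)                    # two 50-dim BoW/TF-IDF vectors
print(matching(x).shape)                        # torch.Size([2, 3, 224, 224])
print(fallen(torch.randn(2, 512, 7, 7)).shape)  # torch.Size([2, 128])
```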
Essentially, we obtained a series of one-dimensional feature vectors for the subsequent processing of the fusion network. All image features, including the handcrafted and DL features, were eventually converted into one-dimensional vectors of length 128 after being processed by the fallen network.

There was high collinearity between the handcrafted and DL characteristics; therefore, an early fusion method that gates the weight contribution of the different tensors at the feature level was used to blend the aforementioned four image features before the final classification. The structure of the gated fusion network is shown in Figure 5C. For each feature tensor from SIFT, SURF, ORB, and the CNN, the dimensions of the input vectors F_1, F_2, F_3, and F_C, respectively, are gradually reduced through fully connected layers with a dropout rate of 0.5. Once in the same dimension, because of the connections between the individually captured features, the feature expression of each handcrafted tensor is weighted by the gated mechanism in combination with the DL features, which reduces the size of the feature space. The gated mechanism consists of two pathways: one is the one-dimensional vector F_i of size 128 × 1 after the ReLU activation function, and the other is the vector ω_i of length 128, the output of a bilinear transform between F_i and the CNN features F_C, which evaluates the importance of each feature F_i relative to the more precise CNN features through this non-linear correlation. Subsequently, the Kronecker product, which models the interaction of different features across modalities, constructs a threefold Cartesian space defined by H_1, H_2, and H_3, corresponding to SIFT, SURF, and ORB, respectively. It also captures the trimodal interactions of all possible unimodal combinations, as shown in the upper right corner of Figure 5C. Finally, the predicted vector of size 96 and F_C of size 128 are vertically stacked into a larger one-dimensional vector of length 224, and the predicted classification values for the TB images are obtained after a fully connected layer operates on this concatenated vector. The detailed operations above can be summarized as

    ω_i = Bilinear(F_i, F_C),  H_i = ω_i ⊙ ReLU(F_i),  i ∈ {1, 2, 3},
    ŷ = FC([FC(H_1 ⊗ H_2 ⊗ H_3), F_C]),                                  (1)

where [x, y] denotes the concatenation of x and y, ⊙ is the element-wise product, and ⊗ is the Kronecker product.

FIGURE 7 | ROCs of the proposed network with different backbones.

A total of 3,000 spinal TB CT images were obtained and divided into two sets of 1,500 positive and 1,500 negative CT slices. For each type of CT image, 900, 300, and 300 slices were randomly selected from the small TB dataset as the training, validation, and test sets, respectively. In terms of training parameters, the optimizer was stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 0.001; the learning rate was set to 0.01 and decayed by a factor of 0.1 every 7 epochs; and the loss function was the cross-entropy loss, which describes the distance between two probability distributions. In addition, three common deep CNNs (DCNNs) were used as the backbone: ResNet, VGG, and DenseNet. For each DCNN, two networks with different numbers of layers were trained on our small TB dataset to generate models of different sizes: 18 vs. 50 layers for ResNet, 11 vs. 16 layers for VGG, and 121 vs. 161 layers for DenseNet.
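A minimal PyTorch sketch of the reported training setup follows. The plain torchvision VGG-11 and the random tensors stand in for the full fusion model and the CT data loader, which are not reproduced here; the optimizer, scheduler, and loss settings are those stated above.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
import torchvision

# Stand-in for the proposed fusion network: a plain VGG-11 with a two-class head.
model = torchvision.models.vgg11(num_classes=2)

# Dummy tensors standing in for 3 x 224 x 224 CT inputs (real split: 900/300/300 per class).
loader = DataLoader(
    TensorDataset(torch.randn(16, 3, 224, 224), torch.randint(0, 2, (16,))),
    batch_size=8, shuffle=True,
)

# Reported settings: SGD with momentum 0.9 and weight decay 0.001,
# learning rate 0.01 decayed by 0.1 every 7 epochs, cross-entropy loss.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):  # illustrative epoch count
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```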
The running environment was PyTorch 1.8.0, CUDA 11.1, and Python 3.7.1 on Windows 10, with a GeForce RTX 3090 GPU and an Intel Xeon W-2255 CPU. The accuracy and loss curves of the training, validation, and test sets are shown in Figure 6. The accuracy curves clearly level off, and the loss curves converge to equilibrium with slight fluctuations starting at epoch 10. Specifically, the test loss curve of ResNet-50 lies above the ResNet-18 loss curve, meaning that the corresponding accuracy curve lies below that of ResNet-18. Although there is little difference between the loss curves of VGG-11 and VGG-16, the accuracy follows the same pattern as ResNet: the more layers in the network, the lower the test accuracy. However, there was only a slight difference between DenseNet-121 and DenseNet-161. These phenomena are explained in the Discussion, and the evaluation indicators for these three backbones, including accuracy, specificity, sensitivity, and area under the curve (AUC), are discussed in the next section.

FIGURE 13 | Heatmap of the convolution layer weight visualization based on Grad-CAM++. The left column contains the original CT images, the middle column contains the heatmap generated by common VGG-11, and the right column contains the heatmap generated by the backbone VGG-11 of the proposed network.

Four quantitative indicators, namely accuracy, specificity, sensitivity, and AUC, illustrate the predictive performance on the 600 test images, as shown in Table 2. Accuracy, specificity, and sensitivity reflect the proportion of correct predictions among all samples, the proportion of correctly predicted negatives among all actual negative samples, and the proportion of correctly predicted positives among all actual positive samples, respectively. The AUC is the area enclosed by the coordinate axis under the receiver operating characteristic (ROC) curve. The proposed model with the VGG backbone exhibited the best performance compared with the other models; in particular, VGG-11 achieved an accuracy of 98.33%, specificity of 98.33%, sensitivity of 98.33%, and AUC of 99.84%. Although the proposed model with ResNet-50 had the worst accuracy among the models, its AUC was higher than that of ResNet-18, which demonstrated the existence of a superior classification threshold value for ResNet-18. For DenseNet, there were no significant differences between DenseNet-121 and DenseNet-161, both of which performed acceptably with an accuracy of 97.67%. Specifically, DenseNet-121 effectively predicted positive samples, whereas DenseNet-161 accurately predicted negative samples, owing to the higher sensitivity of the former and the higher specificity of the latter.

We also drew the ROC curves and calculated the AUC of the proposed model with each backbone, as shown in Figure 7, in which the above quantitative indices (specificity, sensitivity, and AUC) are visualized as the false positive rate, true positive rate, and AUCs. Focusing on the upper left area provides a more intuitive comparison of the differences among these networks: VGG-11 (yellow line) was closest to the perfect classification point in the upper left corner, reflecting the best trade-off between true positive rate and false positive rate across classification thresholds.
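For reference, these four indicators can be computed from test-set predictions as in the short sketch below; the dummy labels and scores are placeholders for the 600-slice test set, with 1 denoting tuberculosis (positive) and 0 denoting health (negative).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 0, 0, 1, 0])               # placeholder ground truth
y_score = np.array([0.9, 0.8, 0.2, 0.4, 0.3, 0.1])  # placeholder model probabilities
y_pred = (y_score >= 0.5).astype(int)               # default classification threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions among all samples
sensitivity = tp / (tp + fn)                 # correct positives among actual positives
specificity = tn / (tn + fp)                 # correct negatives among actual negatives
auc = roc_auc_score(y_true, y_score)         # threshold-independent ranking quality
```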
More importantly, we provide the confusion matrix for each network in Figure 8, which details the number of correct and incorrect identifications for each category. There was a total of 600 CT images, comprising 300 positive and 300 negative samples. The 295 true positives (TP) for VGG-11 and ResNet-18 and the 296 true negatives (TN) for DenseNet-161 were the highest correctly classified sample counts, representing the recognition capability for disease and health, respectively. Correspondingly, the five false negatives (FN) for VGG-11 and ResNet-18 and the four false positives (FP) for DenseNet-161 were the lowest incorrectly classified sample counts. Overall, VGG-11 had the highest TP together with a comparatively high TN and few misclassifications, showing a balanced capacity for predicting negative and positive samples, which demonstrates that VGG-11 was the optimal selection. Considering accuracy, stability, and convergence of the loss function among the six backbones, we recommend VGG-11 as the backbone of the proposed deep network for the auxiliary diagnosis of TB CT images. The subsequent analyses were therefore conducted on VGG-11.

An effective augmentation should approximate the real data distribution in the feature space. In this study, a new data augmentation method was proposed to simulate the real data distribution. The performance of spinal TB classification with the proposed augmentation versus image augmentation is shown in Figure 9, highlighting the strength of the proposed augmentation algorithm. The accuracy, specificity, and AUC of the proposed augmentation were all slightly higher than those of image augmentation, whereas the sensitivity of the former was slightly lower than that of the latter. Overall, the radar polygon of image augmentation is enclosed by that of the proposed augmentation; therefore, the proposed method shows clear superiority over direct augmentation of the images.

We employed ResNet, VGG, and DenseNet as the backbones, and the number of model layers influenced the classification accuracy, as shown in Figure 6. Figure 10 shows a strong correlation between the number of parameters of the selected DCNN models and the prediction accuracy on the test sets: a DL model that is too simple or too complex for the small dataset underfits or overfits, respectively. For ResNet and VGG, accuracy declined as the number of parameters increased, as shown in Figure 10, demonstrating that an excessive number of network layers in a DCNN leads to model overfitting. By contrast, the model size had little impact on the accuracy of DenseNet, because the short paths from the initial layers to subsequent layers of DenseNet alleviate the vanishing gradient problem and ensure maximum information transmission between layers in the network. Essentially, VGG exhibited the optimal training performance; in particular, VGG-11 had a superior test accuracy of 98.33%.

Four main characteristics were extracted from the CT images to identify spinal TB, namely three handcrafted local features (SIFT, SURF, and ORB) and one DCNN deep feature. As illustrated in Table 2, accurate classification performance was obtained by fusing the four different features based on the different backbones, particularly VGG-11. A thorough investigation was conducted to show the influence of the individual features on the proposed network.
As a comparison with the fused features, we analyzed the performance of each feature separately based on the proposed network with the VGG-11 backbone, as shown in Figure 11. The various characteristics yielded diverse performance, and no single handcrafted feature consistently outperformed the deep feature. Furthermore, the four evaluation indicators, namely accuracy, sensitivity, specificity, and AUC, differed significantly across the individual features, and none of them exceeded 90%. This shortcoming was effectively addressed when the handcrafted features and the deep feature were fused by the proposed DCNN with the VGG-11 backbone, as depicted in the last block of Figure 11. Specifically, with the assistance of the three handcrafted features, the accuracy, AUC, sensitivity, and specificity of the deep feature improved from 85.17, 91.53, 89.00, and 81.33% to 98.33, 99.84, 99.33, and 98.33%, respectively.

A new data augmentation method for handcrafted features was proposed, as described in section "Data Augmentation for Handcrafted Features." Direct augmentation of the images is a common method of data amplification and can produce an augmented feature dataset after the handcrafted features are extracted from the augmented images; moreover, it yields the same data scale as the proposed augmentation algorithm. Figure 9 provides an intuitive comparison of these two augmentation schemes using a radar map of the four indices. In this study, we also conducted a visual analysis of the retained original information in a low-dimensional feature space through t-distributed stochastic neighbor embedding (t-SNE), as shown in Figure 12. In column b, the t-SNE visualization of image augmentation, there are irregular gaps within the same category and considerable overlap among neighboring data points, which demonstrates that the CT slices obtained from image augmentation do not fully represent the real data distribution. By contrast, the binary data distribution (i.e., the red and green points) of the proposed augmentation (column a) is more uniform than that of image augmentation (column b), except for several outliers. This shows that the newly generated feature points can effectively fill the missing data in the spatial distribution.

As shown in Figure 11, a significant improvement was observed when the CNN features from VGG-11 were fused with the three different handcrafted features; the accuracy increased from 85.17% to 98.33%. Compared with the direct classification of VGG-11 on CT images, some changes were observed in the region of interest for the proposed fusion model with VGG-11 as the backbone. To explore the differences in the areas of interest between these two models, Grad-CAM++ (Chattopadhay et al., 2018) was used to generate heatmaps, as shown in Figure 13. Significant differences can be observed between the two methods in the heatmaps of model attention. Common VGG-11 focused on the vertebral foramen region of the TB images, regardless of whether the case was negative or positive, which created a significant distraction for the classification judgment. By contrast, the proposed fusion model focused on the areas of destruction of the vertebral bodies; although some unrelated regions received a little attention from the fusion model, this had less of an adverse effect on the final classification.
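To illustrate how such attention maps are produced, the following is a minimal sketch of the plain Grad-CAM variant (the simpler precursor of the Grad-CAM++ method used in the paper), applied to a stand-in torchvision VGG-11; the choice of target layer and the random input are assumptions for illustration only.

```python
import torch
import torchvision

model = torchvision.models.vgg11(num_classes=2).eval()
target_layer = model.features[-3]  # last Conv2d in VGG-11's feature extractor

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed CT slice
model(x)[0, 1].backward()        # backpropagate the tuberculosis-class logit

weights = grads["v"].mean(dim=(2, 3), keepdim=True)           # channel importance
cam = torch.relu((weights * acts["v"]).sum(dim=1)).squeeze()  # coarse attention map
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
# Upsampling cam to 224 x 224 and overlaying it on the slice gives a Figure 13-style heatmap.
```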
This study proposed a novel DL-based classification model that fuses four image features: three handcrafted features (SIFT, SURF, and ORB) and one CNN feature. During the feature engineering phase, the BoW and TF-IDF algorithms, combined with a new data augmentation algorithm, were used to extract the three handcrafted features, and the deep features were extracted from the convolutional layers of common DCNNs, including ResNet, VGG, and DenseNet. The proposed network consists of four main sections: matching network, backbone, fallen network, and fusion network. Specifically, the matching network adjusts the dimensions of the handcrafted features to match the image size, the fallen network integrates and reduces each single feature from two-dimensional to one-dimensional vectors, and the fusion network, composed of a gated information fusion network and a Kronecker product space, realizes the effective fusion of the different characteristics and outputs the final classification results for the TB images. Experimental results were obtained using six different backbones (ResNet-18/50, VGG-11/16, and DenseNet-121/161) and demonstrated that VGG-11 achieved the optimal performance in terms of accuracy, AUC, specificity, and sensitivity. Furthermore, we analyzed the performance of the individual features, the proposed augmentation algorithm, the model stability, and the model-focused heatmap to demonstrate the advancement of the proposed network.

The proposed method is interpretable in multimodal feature fusion and can be extended to more medical scenarios, which may aid clinical radiologists, particularly grassroots physicians. It has promising potential, although our research was limited to the positive-negative classification of spinal TB CT images. In subsequent studies, patient clinical data such as gender, age, and medical history, which are strongly related to the classification of spinal TB, are worth adding to the fusion network. In addition, we aim to extend the proposed method to CT images of all spinal segments, including the thoracic, sacral, cervical, and lumbar vertebrae. Further exploration will be conducted for DR images based on spinal TB CT images, which can form a more complete auxiliary diagnosis system applicable to grassroots hospitals in Tibet, China.

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

The studies involving human participants were reviewed and approved by the Ethics Committee of the People's Hospital of Tibet Autonomous Region, China. The patients/participants provided their written informed consent to participate in this study.

ZL: conceptualization, data curation, software, and writing original draft preparation. FW: conceptualization, data acquisition, and funding acquisition. FH and ZZ: data acquisition. XG, WC, TY, and JW: conceptualization and manuscript revision. SG: conceptualization, supervision, manuscript review, and funding acquisition. CP: data acquisition and funding acquisition. All authors contributed to the article and approved the submitted version.

We would like to thank the top journal editing plan service of Editage. English language and grammar were edited, the logical presentation of the ideas and the structure of the manuscript were checked, and a peer review of the manuscript contents was performed before submitting this manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References:
Fusion of deep-learned and hand-crafted features for cancelable recognition systems
Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach
Deep learning in multi-omics data integration in cancer diagnostic
Going deep in medical image analysis: concepts, methods, challenges, and future directions
A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets
Medical image analysis using convolutional neural networks: a review
Gated multimodal units for information fusion
SURF: speeded up robust features
BRIEF: binary robust independent elementary features
Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks
Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis
Radiomics in PET: principles and applications
CT and MR in the management of advanced spinal tuberculosis
Machine learning and deep learning in medical imaging: intelligent imaging
CTSpine1K: a large-scale dataset for spinal vertebrae segmentation
Secondary laryngeal tuberculosis in Tibet, China: a report of six cases
Tuberculosis of the spine. A systematic review of case series
Spinal tuberculosis: a review
Radiomics: images are more than pictures, they are data
Deep Learning
A novel content-based medical image retrieval using SURF features
A survey of methods for explaining black box models
A multimodal affinity fusion network for predicting the survival of breast cancer patients
Deep residual learning for image recognition
Imaging in children with spinal tuberculosis. A comparison of radiography, computed tomography and magnetic resonance imaging
Densely connected convolutional networks
Deep learning renal segmentation for fully automated radiation dose estimation in unsealed source therapy
Multisensor data fusion: a review of the state-of-the-art
Ensemble classification with modified SIFT descriptor for medical image modality
Spinal tuberculosis: a comprehensive review for the modern spine surgeon
A deep learning approach to diagnostic classification of prostate cancer using pathology-radiology fusion
Robust deep multi-modal learning based on gated information fusion network
Medical image classification based on deep features extracted by deep model and statistic feature fusion with multilayer perceptron
Radiomics: extracting more information from medical images using advanced feature analysis
MRI-only based synthetic CT generation using dense cycle consistent generative adversarial networks
An effective computer aided diagnosis model for pancreas cancer on PET/CT images
Evaluation of medical image registration by using 3D SIFT and phase-only correlation
A diagnostic model for differentiating tuberculous spondylitis from pyogenic spondylitis on computed tomography images
Distinctive image features from scale-invariant keypoints
Medical image registration based on SURF detector
Computer-aided diagnosis of prostate cancer with emphasis on ultrasound-based approaches: a review
Handcrafted vs. non-handcrafted features for computer vision classification
Risk factors for extrapulmonary dissemination of tuberculosis and associated mortality during treatment for extrapulmonary tuberculosis
Spinal tuberculosis: diagnosis and management
Spinal tuberculosis: our experience and a review of imaging methods
U-Net: convolutional networks for biomedical image segmentation
ORB: an efficient alternative to SIFT or SURF
Feature-guided deep radiomics for glioblastoma patient survival prediction
Very deep convolutional networks for large-scale image recognition
Medical image stitching using hybrid of SIFT & SURF techniques
New approaches in the classification and prognosis of sign clusters on pulmonary CT images in patients with multidrug-resistant tuberculosis
MBFFNet: multi-branch feature fusion network for colonoscopy
Machine learning in medical imaging
Medical image retrieval using ResNet-18
Radiomics strategy for glioma grading using texture features from multiparametric MRI
Spinal tuberculosis: proposed spinal infection multidisciplinary management project (SIMP) flow chart revision
A medical image fusion method based on SIFT and deep convolutional neural network in the SIST domain
Computer aided-diagnosis of prostate cancer on multiparametric MRI: a technical review of current research
COVID-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network
Primary care quality among different health care structures in Tibet, China
Biomedical images stitching using ORB feature based approach
Fusing texture, shape and deep model-learned information at decision level for automated classification of lung nodules on chest CT
Glioma grading on conventional MR images: a deep learning study with transfer learning
Image fusion meets deep learning: a survey and perspective
The value of MR imaging in comparative analysis of spinal infection in adults: pyogenic versus tuberculous
The burden and challenges of tuberculosis in China: findings from the Global Burden of Disease Study