key: cord-0975926-u92z1744
authors: Ter-Sarkisov, A.
title: One Shot Model For COVID-19 Classification and Lesions Segmentation In Chest CT Scans Using LSTM With Attention Mechanism
date: 2021-02-19
journal: nan
DOI: 10.1101/2021.02.16.21251754
sha: 8e58175b549f1d5f41876c9239d216c5fce4cbbd
doc_id: 975926
cord_uid: u92z1744

We present a model that fuses instance segmentation, a Long Short-Term Memory network and an attention mechanism to predict COVID-19 and segment chest CT scans. The model works by extracting a sequence of Regions of Interest that contain class-relevant information and applying two Long Short-Term Memory networks with attention to this sequence to extract class-relevant features. The model is trained in one shot: both the segmentation and classification branches are trained simultaneously, using two different sets of data. We achieve a 95.74% COVID-19 sensitivity, 98.13% Common Pneumonia sensitivity, 99.27% Control sensitivity and 98.15% class-adjusted F1 score on the main dataset of 21191 chest CT scan slices, and also run a number of ablation studies in which we achieve 97.73% COVID-19 sensitivity and 98.41% F1 score. All source code and models are available at https://github.com/AlexTS1980/COVID-LSTM-Attention.

Coronavirus (COVID-19) is an ongoing global pandemic that, as of late January 2021, had taken over 91,400 lives in the UK alone and over 2.069M worldwide, with the crisis worsening in some countries, measured both by the number of deceased and the number of new cases (https://www.worldometers.info/coronavirus). The pandemic caused a complete or partial lockdown in most countries and led to previously unseen pressure on healthcare, with radiology departments' workload exceeding their capacity and manpower. Analysis of chest CT scans using Deep Learning (DL) can assist radiology personnel in many ways. One of them is reducing the time it takes to process a scan slice from roughly 20 minutes to a few seconds or less [1]. DL algorithms can both rule out clear cases and draw the personnel's attention to suspicious images, e.g. by detecting and segmenting lesions. This gives rise to two types of errors the algorithm can make: failing to identify suspicious areas in scans (false negative) or raising a false alarm (false positive) by misclassifying a control image as COVID-19. One of the specific challenges that the personnel, and therefore DL algorithms, face is the misclassification of COVID-19 as other types of pneumonia, which is due to the large overlap between the ways these diseases manifest in chest CT scans (see [2, 3, 4, 5]).

Existing Deep Learning methodology for analyzing chest CT scans has two main limitations: either it relies on large amounts of data (and data manipulation tricks) to train the model, or the model was both trained and evaluated on small amounts of data, so the solution's ability to extend to larger datasets is debatable. Another problem that, to the best of our knowledge, all DL solutions suffer from is the transferability of results to other datasets without additional training.

Figure 1: Overview of the model (see Figure 2 for the details of the attention layer). Normal arrows: data; broken arrows: batches or samples; dotted arrows: labels. The RoI layer is shared between the segmentation (top) and classification (bottom) branches. Weights are copied from the segmentation RoI layer into the classification branch (red dotted arrow and cross). Full segmentation branch in purple, full classification branch in green. Best viewed in color.
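To make the data flow in Figure 1 concrete, below is a minimal PyTorch skeleton of the two-branch, one-shot design. All module and method names are hypothetical placeholders rather than the author's implementation; the point is the shared RoI features and the one-way weight copy from the trainable segmentation branch into the frozen classification branch.

```python
import torch.nn as nn

class OneShotCOVIDModel(nn.Module):
    """Skeleton of Figure 1: a shared backbone/RPN/RoI stack, a trainable
    segmentation branch and a frozen classification branch."""

    def __init__(self, backbone, rpn, seg_roi_head, cls_roi_head, classifier):
        super().__init__()
        self.backbone = backbone          # e.g. ResNeXt101 + FPN
        self.rpn = rpn                    # region proposal network
        self.seg_roi_head = seg_roi_head  # trainable: boxes, classes, masks
        self.cls_roi_head = cls_roi_head  # frozen: emits a batch of RoI mask features
        self.classifier = classifier      # LSTM + attention + fully connected layers
        for p in self.cls_roi_head.parameters():
            p.requires_grad = False       # the classification RoI branch is never trained

    def sync_roi_weights(self):
        # The red dotted arrow in Figure 1: copy weights from the segmentation
        # RoI branch into the classification RoI branch; per the paper this
        # happens when the segmentation loss improves.
        self.cls_roi_head.load_state_dict(self.seg_roi_head.state_dict())
```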
In [16], a combined convolutional neural net (ConvNet) and LSTM model was presented, in which the LSTM takes the last feature output of the ConvNet (size 512 × 7 × 7) as its input, and a final fully connected layer on top of the LSTM predicts the class of the image (COVID-19, Common Pneumonia or Control).

Attention is one of the most active research topics in deep learning. It was introduced for the machine-translation encoder-decoder framework in [13], in two versions: soft attention, which connects each decoder state to all encoder states (a weighted average), and hard attention, which connects it to a single state selected using the alignment score; [12] formulated it as global attention (connection to all encoder states) and local attention (connection to a window of states). [17] introduced the Transformer model, which replaces recurrent connections such as LSTM with multi-head attention embedded in an encoder-decoder framework; it has gained particular recognition in the natural language processing (NLP) community, e.g. for sentiment classification [18].

A number of well-received publications use a form of attention for COVID-19 prediction and lesion segmentation. In [19], a model with residual connections and attention-aware units was used to predict COVID-19 vs Negative. In [20], attention is computed between convolution maps from two different branches of the model: the 2-class and 3-class classification branches. In [21], attention was used to construct relationships among separate feature maps extracted by VGG16 [22] from x-ray images. We improve on this approach by replacing whole feature maps with RoI mask features extracted from the FPN. This yields better results because irrelevant areas in the feature maps are ignored; instead, the model focuses on the RoIs relevant to the class prediction. The sequence of RoIs is the input to an LSTM, and the attention mechanism is constructed between the output of the LSTM and the LSTM's hidden states.

As in [14, 15, 23, 24], we use the CNCB-NCOV dataset and the COVIDx-CT splits [24], except that our training data has only 3000 observations (1000/class). The test split of 21191 images was used in full. For the segmentation problem, 650 images were used to train and validate the model and 100 for testing. Further dataset details are presented in [15]. The labels for the segmentation problem are derived from the ground truth masks (pixel-level labelling) for 4 classes: background, Ground Glass Opacity (GGO), Consolidation (C) and clean lungs. The last 3 classes are deemed 'positive' (in the sense that they are not background), and each instance thereof has 3 parameters: class label, bounding box coordinates and ground truth mask, each used for model training and validation. This data is used to train the full segmentation branch of the model. Images labelled at a global (image) level are used to train the full classification branch (except the RoI classification branch, as explained above). An overview of the model is presented in Figure 1.

In this stage a state-of-the-art network with a Feature Pyramid Net (FPN) extracts features from the input image. In our setup only the last layer of the backbone, the FPN output, is used: in the RPN layer for box prediction and in the RoI layer for RoI feature extraction.
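A minimal sketch of this feature-extraction step, assuming torchvision ≥ 0.13 and using a ResNet-50 stand-in for the ResNeXt101+FPN backbone used in the paper (the 512 × 512 input size is illustrative):

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 stand-in for the ResNeXt101+FPN backbone used in the paper.
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)

image = torch.randn(1, 3, 512, 512)   # one CT slice as a 3-channel tensor
features = backbone(image)            # OrderedDict of FPN pyramid levels
# Only the last FPN output is used here: it feeds the RPN (box prediction)
# and the RoI layer (RoI feature extraction).
last_fpn_map = list(features.values())[-1]
print(last_fpn_map.shape)             # (1, 256, h, w)
```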
The RPN and RoI layers are inherited from Mask R-CNN. The RPN predicts positive (object-containing) bounding boxes and passes them to the RoI layer, computing a loss over the bounding box coordinates and the object vs background class. Our RoI concept differs from [7] in that it consists of a total of four branches rather than two. Each branch uses the RPN box predictions and the RoIAlign tool to crop the corresponding area in the FPN feature layer to a predefined size (see [6, 7] for details). The first two branches, a detection branch that predicts boxes and classes and a mask segmentation branch, are trainable and solve the segmentation problem using object-level labels (bounding boxes, class labels and masks). The remaining two constitute the classification branch and are described in Section 3.6. The segmentation branch is identical to the one in [7].

The RoI classification branch consists of two branches with architectures identical to those in the segmentation branch. They are not trainable; instead, they copy weights from the segmentation branch whenever the segmentation loss improves. This lets them detect RoIs and extract mask features from any image. They do not compute any loss, because their job is to construct and output a batch of RoI mask features of size β; see [15, 23] for details. Box predictions are ranked in decreasing order of confidence score. Since β is fixed, the batch size must be the same regardless of the distribution of the confidence scores. This is achieved by setting the confidence threshold θ = −0.01 to accept even very low-ranking (insignificant) boxes. From these boxes and the FPN layer, RoI mask features of fixed size β × C × H × W are extracted. Here β, H and W are hyperparameters, and C is the number of feature maps in the FPN, also a hyperparameter.

This is the key step and the novelty of our model. The attention layer consists of two main stages: RoI mask feature filtering and LSTM with attention; see Figure 2. In the first stage a small subnetwork filters the batch of RoI mask features to prepare them for the attention computation. RoIs in the batch are stored in decreasing order of their class confidence, and we maintain this order. The rationale is that the batch is an ordered sequence (see Section 4 for a discussion of the order), and LSTM learns sequential relationships. We filter the features through a mask sieve, which, for each RoI, first halves the number of feature maps and downsamples them with a convolution layer, then restores the number of maps and upsamples them with a transposed convolution layer, repeated a total of N times: N × (Conv2D → BatchNorm2d → Conv2DTranspose → ReLU). The output size therefore stays the same, β × C × H × W. This output is vectorized in 3 steps using convolution layers with kernel sizes 2 × 2, 2 × 2 and 7 × 7, such that the output of the third convolution layer has size β × C. Finally, before the LSTM input, we convert the batch dimension β into a sequence dimension, keeping the order of the vectorized mask features: 1 × β × C. The detailed architecture of the attention mechanism is presented in Figure 3. We use a form of soft attention [12] that considers all hidden states of the LSTM, H_t.
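Before the attention is formalized in Equations 1-7 below, here is a minimal PyTorch sketch of the mask sieve, the vectorization step and the LSTM with dot-product soft attention just described. β = 16 and the hidden size of 256 come from the experiments section; C = 256, H = W = 28, N = 2 and the exact kernel strides are our assumptions, not the author's exact configuration.

```python
import torch
import torch.nn as nn

BETA, C, HW, HIDDEN, N = 16, 256, 28, 256, 2  # assumed hyperparameters

class MaskSieve(nn.Module):
    """N x (Conv2D -> BatchNorm2d -> Conv2DTranspose -> ReLU); shape-preserving."""
    def __init__(self):
        super().__init__()
        blocks = []
        for _ in range(N):
            blocks += [
                nn.Conv2d(C, C // 2, kernel_size=3, stride=2, padding=1),  # halve maps, downsample
                nn.BatchNorm2d(C // 2),
                nn.ConvTranspose2d(C // 2, C, kernel_size=2, stride=2),    # restore maps, upsample
                nn.ReLU(),
            ]
        self.sieve = nn.Sequential(*blocks)
        # Vectorize each RoI in 3 steps with 2x2, 2x2 and 7x7 kernels: 28 -> 14 -> 7 -> 1.
        self.vectorize = nn.Sequential(
            nn.Conv2d(C, C, kernel_size=2, stride=2),
            nn.Conv2d(C, C, kernel_size=2, stride=2),
            nn.Conv2d(C, C, kernel_size=7),
        )

    def forward(self, rois):                  # rois: (beta, C, 28, 28)
        x = self.vectorize(self.sieve(rois))  # (beta, C, 1, 1)
        return x.flatten(1).unsqueeze(0)      # (1, beta, C): batch dim -> sequence dim

class LSTMAttention(nn.Module):
    """LSTM over the RoI sequence plus dot-product soft attention (Eqs. 1-7)."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(C, HIDDEN, batch_first=True)
        self.linear = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, seq):                   # seq: (1, beta, C)
        H, (h_n, _) = self.lstm(seq)          # H: all hidden states, (1, beta, HIDDEN)
        z = self.linear(h_n[-1])              # z_t from the last hidden state, (1, HIDDEN)
        scores = H.squeeze(0) @ z.squeeze(0)  # dot-product alignment, (beta,)
        alpha = torch.softmax(scores, dim=0)  # attention weights
        context = (alpha.unsqueeze(1) * H.squeeze(0)).sum(dim=0)  # c_t, (HIDDEN,)
        return z.squeeze(0) + context         # elementwise sum of z_t and c_t

features = MaskSieve()(torch.randn(BETA, C, HW, HW))
out = LSTMAttention()(features)               # (HIDDEN,) class-relevant feature vector
```

In the full model, two such LSTM+Attention modules run in parallel, one over class-positive and one over class-negative RoIs, as explained below.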
The main idea of the attention mechanism is to compute a score measuring the relationship between the vector z_t, which expresses the state of the model after the last LSTM state is passed through a linear filter, and each hidden LSTM state h_k. Unlike [12, 13], we do not have decoder states, as there is no output sequence; therefore the attention is computed for z_t. After softmax rescaling, the context vector c_t captures the relevant information:

(o_t, H_t) = LSTM(X_t)                                   (1)
z_t = W_z o_t + b_z                                      (2)
H_t = [h_1, h_2, ..., h_β]                               (3)
w_k = z_t · h_k                                          (4)
α_k = exp(w_k) / Σ_j exp(w_j)                            (5)
c_t = Σ_k α_k h_k                                        (6)
out = z_t + c_t                                          (7)

Equation 1 is the LSTM model that processes the RoI batch X_t and outputs the full history H_t (the stack of hidden states, Equation 3) and the last hidden state o_t. The dot-product w_k (Equation 4) is rescaled using the softmax function (Equation 5). The resulting weights α_k are used to weigh the stack of hidden states to obtain the context vector c_t (Equation 6). Finally, Equation 7 computes the elementwise sum of the linear vector z_t (Equation 2) and the context vector.

We want to separate the learning of class-relevant and class-irrelevant RoIs (regardless of their confidence scores), and therefore we create two LSTM+Attention modules: one for the class-relevant (class-positive) RoIs, the other for the class-negative RoIs. Their architectures are identical. The key idea of this separation is to prevent the class-positive LSTM from aligning with the same features regardless of the class.

The architecture of the image classification module is quite simple: two fully connected layers and an output class-logits layer with 3 neurons (one per class). The module accepts a vector of features from the attention layer as an input. As discussed above, the two LSTM+Attention modules output two feature vectors that become inputs to two different modules.

The model solves the segmentation and classification problems simultaneously using a linear combination of the two loss functions:

L = λ_SEG · L_SEG + λ_CLS · L_CLS                        (8)

The segmentation loss L_SEG is the same as in [7], with three sets of labels (box coordinates, class labels and masks) and 5 loss functions: box and objectness in the RPN, and box, class and mask in the RoI layer. We trained the model with a ResNeXt101 + Feature Pyramid Network (FPN) backbone [30, 31], using the Adam optimizer [32] with β_1 = 0.9, β_2 = 0.999, learning rate 1e-5 and weight decay factor 1e-3 for 50 epochs.
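A minimal sketch of this one-shot training loop, assuming the linear combination of Equation 8 uses unit weights; `model` is assumed to expose hypothetical `segmentation_loss`, `classification_loss` and `sync_roi_weights` methods, and the two loaders stand in for the object-level and image-level labelled data streams:

```python
import torch

def train_one_shot(model, seg_loader, cls_loader, epochs=50):
    """Sketch of one-shot training; method names on `model` are hypothetical."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                                 betas=(0.9, 0.999), weight_decay=1e-3)
    best_seg = float("inf")
    for _ in range(epochs):
        for seg_batch, cls_batch in zip(seg_loader, cls_loader):
            loss_seg = model.segmentation_loss(seg_batch)    # RPN box+object, RoI box+class+mask
            loss_cls = model.classification_loss(cls_batch)  # cross-entropy over the 3 classes
            loss = loss_seg + loss_cls                       # Equation 8, unit weights assumed
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss_seg.item() < best_seg:                   # copy weights into the frozen
                best_seg = loss_seg.item()                   # classification RoI branch
                model.sync_roi_weights()
```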
We compare the trained model to some of the best models available in the literature. The challenge of comparing results stems from a number of factors, among them the reported metrics: some publications report AUC/ROC (area under the receiver operating characteristic curve) instead of F1 score, overall accuracy and precision/recall, and not all publications report per-class results. This limits the number of publicly available benchmarks against which we can compare our results. We therefore trained and evaluated two different architectures and compared them to COVID-CT-Mask-Net [23] and a large suite of ResNet and DenseNet [33] models trained on our data. The reported models are:

1. One Shot Model [15]. The model uses a feature affinity mechanism and also performs segmentation + classification in a single shot. The batch size β is set to 16 and the number of affinities to 8 (see [15] for further details).

2. One Shot Model + LSTM with Attention. The model has the LSTM + attention layer discussed in Section 3.7. The LSTM has 256 hidden cells and β = 16. The linear-layer output z_t has 256 features, matching the hidden dimension of the LSTM; this is necessary for the computation of the dot-product alignment in Equation 5.

As discussed above, the LSTM uses a sequential input. RoIs output by the RoI classification branch are first ordered by decreasing confidence score, but since one of our objectives is to compute attention between spatial features, RoIs with different scores could be next to each other in a spatially-aware sequence. Empirically, we found that reordering the sequence of RoIs by their distance from the origin gives the LSTM a better sequential input than ordering by confidence scores.

For segmentation, we use the MS COCO 2017 main criterion [8] in addition to two Intersection over Union (IoU) thresholds, 50% and 75%. To compute the overlap, the dot-product between the predicted and ground-truth masks is computed. For the Average Precision (AP) computation we use the Pascal VOC 2012 interpolation: precision is padded with 1 at the start and 0 at the end, and recall is padded with 0 and 1, respectively. Segmentation results are reported in Table 1.

Classification accuracy is computed using sensitivity/recall (Equation 11) and precision/positive predictive value (Equation 12), both per class and for the overall model, and the F1 score for the overall model (Equation 13):

Sens(c) = TP / (TP + FN)                                 (11)
Prec(c) = TP / (TP + FP)                                 (12)
F1 = Σ_c w(c) · 2 · Prec(c) · Sens(c) / (Prec(c) + Sens(c))   (13)

Here w(c) is the class share in the data: w(Neg) = 45%, w(CP) = 35%, w(COVID) = 20%. Sensitivity measures how well the model captures the positive effect given that it exists (false negatives are missed positives); precision measures the ability of the model to avoid finding the effect where it does not exist (false positives). In our implementation of the F1 score, the weights (shares) of the classes in the test set are taken into consideration to avoid imbalanced results; a code sketch of these metrics follows the ablation datasets below. Classification results are reported in Table 2.

Table 2: Accuracy results of attention-based and attention-free models on the COVIDx-CT test split (21191 images). Per-class sensitivity and overall model accuracy are reported. Best results in bold. Results for the reference models were taken from the respective publications and were obtained by training and evaluating on different datasets; the COVID-Net CT-2, DenseNet and ResNet models were trained and evaluated on the same data as our model.

To establish the ability of the model to generalize further to unseen data, we perform an ablation study on two additional datasets:

1. The hold-out CNCB-NCOV dataset. This is the part of the COVIDx-CT train split that we did not use, a total of 58737 images: approximately 46% Negative, 36% CP and 18% COVID-19.

2. The iCTCF [34] dataset with 2 classes (Negative and COVID-19), from which we use 600 images (300/class) for training and validation and the remaining 12976 (COVID-19: 9275, Negative: 3701) for testing.
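A minimal sketch of the class-weighted metrics in Equations 11-13, using the test-set class shares given above; the function and variable names are our own:

```python
import numpy as np

CLASS_SHARES = {"Negative": 0.45, "CP": 0.35, "COVID-19": 0.20}  # w(c), test split

def per_class_metrics(y_true, y_pred, cls):
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    sens = tp / (tp + fn)          # Equation 11: sensitivity / recall
    prec = tp / (tp + fp)          # Equation 12: precision / PPV
    return sens, prec

def weighted_f1(y_true, y_pred):
    # Equation 13: per-class F1 weighted by the class shares w(c)
    # to avoid imbalanced results on the skewed test split.
    f1 = 0.0
    for cls, w in CLASS_SHARES.items():
        sens, prec = per_class_metrics(y_true, y_pred, cls)
        f1 += w * 2 * prec * sens / (prec + sens)
    return f1

# Toy usage with string labels:
y_true = np.array(["COVID-19", "CP", "Negative", "COVID-19"])
y_pred = np.array(["COVID-19", "CP", "Negative", "CP"])
print(weighted_f1(y_true, y_pred))
```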
All model hyperparameters were kept the same. The overall results in Tables 3 and 4 confirm the main findings: LSTM with attention outperforms the One Shot model with affinity by about 3% (F1 score) and 0.4% (COVID-19 sensitivity) on the CNCB-NCOV hold-out set, and by 1.7% (F1 score) and 2.9% (COVID-19 sensitivity) on the iCTCF test set.

In this paper we presented a novel methodology that combines LSTM with attention to explore relationships among Region of Interest mask features for the purpose of one-shot COVID-19 classification and segmentation. The highly accurate results confirm that this method can serve as an assisting diagnostic tool in radiology departments, both to segment instances of lesions and to classify whole chest CT scan slices. Our One Shot model with LSTM and attention mechanism achieved a 0.470 mean average precision on the test segmentation split, 95.74% COVID-19 sensitivity and a 98.15% F1 score, some of the best results in the literature on a dataset of this size. The source code of the LSTM with Attention model is available at https://github.com/AlexTS1980/COVID-LSTM-Attention.

References

[1] JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation.
[2] A comparative study on the clinical features of COVID-19 pneumonia to other pneumonias.
[3] CT scans of patients with 2019 novel coronavirus (COVID-19) pneumonia.
[4] Comparison of chest CT findings between COVID-19 pneumonia and other types of viral pneumonia: a two-center retrospective study.
[5] Detection and segmentation of lesion areas in chest CT scans for the prediction of COVID-19.
[6] Faster R-CNN: Towards real-time object detection with region proposal networks.
[7] Mask R-CNN.
[8] Microsoft COCO: Common objects in context.
[9] The Pascal Visual Object Classes (VOC) challenge.
[10] Fully convolutional networks for semantic segmentation.
[12] Effective approaches to attention-based neural machine translation.
[13] Neural machine translation by jointly learning to align and translate.
[14] Single-shot lightweight model for the detection of lesions and the prediction of COVID-19 from chest CT scans.
[15] One shot model for the prediction of COVID-19 and lesions segmentation in chest CT scans through the affinity among lesion mask features.
[16] A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using x-ray images.
[17] Attention is all you need.
[18] Enhancing attention-based LSTM with position context for aspect-level sentiment classification.
[19] COVID CT-Net: Predicting COVID-19 from chest CT images using attentional convolutional network.
[20] Prior-attention residual learning for more discriminative COVID-19 screening in CT images.
[21] Attention-based VGG-16 model for COVID-19 chest x-ray image classification.
[22] Very deep convolutional networks for large-scale image recognition.
[23] COVID-CT-Mask-Net: Prediction of COVID-19 from CT scans using regional features.
[24] COVIDNet-CT: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest CT images.
[25] Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images.
[26] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
[27] A light CNN for detecting COVID-19 from CT scans of the chest.
[28] COVID-CT-Dataset: A CT scan dataset about COVID-19.
[29] COVID-Net CT-2: Enhanced deep neural networks for detection of COVID-19 from chest CT images through bigger, more diverse learning.
[30] Deep residual learning for image recognition.
[31] Feature pyramid networks for object detection.
[32] Adam: A method for stochastic optimization.
[33] Densely connected convolutional networks.
[34] Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning.