key: cord-0147028-y0o1vpl4
authors: Konwer, Aishik; Bae, Joseph; Singh, Gagandeep; Gattu, Rishabh; Ali, Syed; Green, Jeremy; Phatak, Tej; Prasanna, Prateek
title: Attention-based Multi-scale Gated Recurrent Encoder with Novel Correlation Loss for COVID-19 Progression Prediction
date: 2021-07-18
journal: nan
DOI: nan
sha: 8b97581ecb1e43c8a7e712b38bcd0a85bab04102
doc_id: 147028
cord_uid: y0o1vpl4

COVID-19 image analysis has mostly focused on diagnostic tasks using single timepoint scans acquired upon disease presentation or admission. We present a deep learning-based approach to predict lung infiltrate progression from serial chest radiographs (CXRs) of COVID-19 patients. Our method first utilizes convolutional neural networks (CNNs) for feature extraction from patches within the concerned lung zone, and also from neighboring and remote boundary regions. The framework further incorporates a multi-scale Gated Recurrent Unit (GRU) with a correlation module for effective predictions. The GRU accepts CNN feature vectors from three different areas as input and generates a fused representation. The correlation module attempts to minimize the correlation loss between hidden representations of concerned and neighboring area feature vectors, while maximizing the loss between the same from concerned and remote regions. Further, we employ an attention module over the output hidden states of each encoder timepoint to generate a context vector. This vector is used as an input to a decoder module to predict patch severity grades at a future timepoint. Finally, we ensemble the patch classification scores to calculate patient-wise grades. Specifically, our framework predicts zone-wise disease severity for a patient on a given day by learning representations from the previous temporal CXRs. Our novel multi-institutional dataset comprises sequential CXR scans from N=93 patients. Our approach outperforms transfer learning and radiomic feature-based baseline approaches on this dataset.

the progression of the disease process. In the United States, chest radiographs (CXRs) are the most commonly used imaging modality for the monitoring of COVID-19. On CXR, COVID-19 infection has been found to manifest as opacities within lung regions. Previous studies have demonstrated that the location, extent, and temporal evolution of these findings can be correlated to disease progression [12]. Studies have shown that COVID-19 infection frequently results in bilateral lower lung opacities on CXR and that these opacities may migrate to other lung regions throughout the disease's clinical course [12, 7]. This suggests that COVID-19 progression may be appreciable on CXR via examination of the spatial spread of radiographic findings across multiple timepoints. Despite the many studies analyzing the use of CXRs in COVID-19, machine learning applications have been limited to diagnostic tasks including differentiating COVID-19 from viral pneumonia and predicting clinical outcomes such as mortality and mechanical ventilation requirement [6, 4]. Many of these studies have reported high sensitivities and specificities for the studied outcomes, but they remain constrained due to deficiencies in publicly available datasets [8]. Furthermore, none have attempted to computationally model the temporal progression of COVID-19 from an imaging perspective.
Significantly, most studies have also not explicitly taken into account the spatial evolution of CXR imaging patterns within lung regions, which has been demonstrated to correlate with disease severity and progression [12, 7]. In this study we take advantage of a unique longitudinal COVID-19 CXR dataset and propose a novel deep learning (DL) approach that exploits the spatial and temporal dependencies of CXR findings in COVID-19 to predict disease progression.

Fig. 1. (a) A CXR in which the lung fields have been divided into three equal zones. Disease information in patches from the primary zone (P_p) is more similar to that from the neighboring zone (N_p) than from the remote zone (R_p). (b)-(e) Serial CXRs taken for one patient over several days of COVID-19 infection. We note a progression of imaging findings beginning with lower lobe involvement in (b), with spread to middle lung involvement in (c) and upper lung region involvement in (d) and (e).

Previous deep learning (DL) based COVID-19 studies have mainly considered single timepoint CXRs [1, 10]. Unlike these studies, we analyze CXRs from multiple timepoints to capture lung infiltrate progression. Recurrent neural networks (RNNs) have been widely employed for time series prediction tasks in computer vision. Recently, RNNs have also found success in analyzing tumor evolution [18] and treatment response from serial medical images [14, 16]. A Gated Recurrent Unit (GRU) is an RNN which controls information flow using two gates, a reset gate and an update gate; relevant information from past timepoints is thus forwarded to future timepoints in the form of hidden states. GRUs have been used extensively to predict disease progression [9].

In this work, we aim to explore how the different zones of an image are correlated to each other. Many studies have demonstrated the spatial progression of COVID-19 seen on CXR imaging, with lung opacities generally noted in lower lung regions in earlier disease stages before gradually spreading to involve other areas such as the middle and upper lung [12, 15, 7]. Therefore, two neighboring lung zones should have a higher similarity measure than two far-apart zones. Unlike previous approaches, we propose a multi-scale GRU [17] which can accept three distinct inputs at the same timepoint. Apart from primary patches P_p of the concerned zone, patches from the neighboring (N_p) and remote (R_p) areas are also used as inputs to a GRU cell at a given timepoint. We include a correlation module to maximize the correlation measure between P_p and N_p, while minimizing the correlation between P_p and R_p. Finally, an attention layer is applied over the hidden states to obtain patch weights and give relative importance to patches collected from multiple timepoints.

The major contributions of this paper are the following: (1) Our work uses a multi-scale GRU framework to model the progression of lung infiltrates over multiple timepoints to predict the severity of imaging infiltrates at a later stage. (2) Disease patterns in adjacent regions tend to be spatially related to each other, and COVID-19 imaging infiltrates exhibit similar patterns of correlation across lung regions on CXRs. We are the first to use a dedicated correlation module within our temporal encoder that exploits this latent-state inter-zone similarity with a novel correlation loss.

Varying numbers of temporal images are available for each patient; the number of timepoints, d, varies from 4 to 13 for a given patient.
The images corresponding to these d timepoints are denoted by I_{t_1}, I_{t_2}, ..., I_{t_{d-1}}, I_{t_d}. The left and right lung masks are generated from these images using a residual U-Net model [1]. These masks are each further subdivided into 3 lung zones: Upper (L_1, R_1), Middle (L_2, R_2), and Lower (L_3, R_3). Our collaborating radiologists assigned severity grades to each of the 6 zones as g_0 = 0, g_1 = 1, or g_2 = 2 depending on the zonal infiltrate severity. This procedure mirrors the formulation of other scoring systems [6]. We train 6 different models, one for each of the six zones: M_{L_1}, M_{L_2}, M_{L_3}, M_{R_1}, M_{R_2}, and M_{R_3}. We adopt this zone-wise granular approach to overcome the need for image registration.

We implement an Encoder-Decoder framework based on the seq2seq model [2] in order to learn sequence representations. Specifically, our framework includes two recurrent neural networks: a multi-scale encoder and a decoder. Training the multi-scale encoder involves fusing three input patches at each timepoint, one each from P_p, N_p, and R_p (the concerned zone of interest, the neighboring zone, and the remote zone), to generate a joint feature vector. The attention-weighted context vector obtained from the encoder is finally used as input to the decoder. The decoder, at its first timepoint, attempts to classify this encoder context vector into the 3 severity labels. The multi-scale encoder is trained with the help of a correlation module to retain only relevant information from each of the patches of the three distinct zones.

Each image zone is divided into sixteen square grids. These grids are resized to dimension 128×128 and used as primary patches P_p for the concerned zone. For each zone, we also consider 8 patches from the boundaries of the two adjoining neighbor zones. For example, in the case of the L_1 zone, we use 4 patch grids from the R_1 boundary and 4 patch grids from the L_2 boundary. Similarly, in the case of the middle zone L_2, we use 4 patch grids each from the nearest L_1 and L_3 boundaries. Thus we build a pool of 8 neighboring N_p patches for each concerned lung zone. Additionally, we create a cluster of 8 R_p patches coming from the far-away boundaries of remote zones; e.g., R_p patches for L_1 are collected from the boundaries of L_3. For a particular model, say M_{L_1}, each P_p patch from zone L_1 is fed as input to a convolutional neural network (CNN) to predict the severity scores at a given timepoint. Similarly, one random patch from each of N_p and R_p is also passed into the same CNN. As output of the CNN, we obtain three 1 × 256 dimensional feature vectors. The CNN configuration contains five convolutional layers, each followed by a max-pooling operation, and terminates with a fully connected layer.

Multi-scale GRU. The GRU module used here is a multi-scale extension of the standard GRU. It houses the gating units of a GRU, the reset gate and the update gate, which control the flow of relevant information. The GRU module takes P_t, N_t, and R_t as inputs (denoted by X^i_t, i = 1, 2, 3). The computation within this module may be formally expressed as follows:

$$
\begin{aligned}
r_t &= \sigma\Big(\textstyle\sum_{i=1}^{3} w_t^i\, W_r X_t^i + U_r h_{t-1} + b_r\Big),\\
z_t &= \sigma\Big(\textstyle\sum_{i=1}^{3} w_t^i\, W_z X_t^i + U_z h_{t-1} + b_z\Big),\\
\tilde{h}_t &= \phi\Big(\textstyle\sum_{i=1}^{3} w_t^i\, W_h X_t^i + U_h (r_t \odot h_{t-1}) + b_h\Big),\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where σ is the logistic sigmoid function and φ is the hyperbolic tangent function, r and z are the inputs to the reset and update gates, and h and h̃ represent the activation and candidate activation, respectively, of the standard GRU [3]. W_r, W_z, W_h, U_r, U_z, and U_h are the weight parameters learned during training, and w^i_t (i = 1, 2, 3) are also learned parameters. b_r, b_z, and b_h are the biases. X^i_t (i = 1, 2, 3) are the CNN feature vectors of patches from the three zones P_p, N_p, and R_p.

Correlation module. In order to obtain a better joint representation for temporal learning, we introduce an important component into the multi-scale encoder: one that explicitly captures the correlation between the three distinct inputs. Our model explicitly applies a correlation-based loss term in the fusion process. The principle of our model is to maximize the correlation between features from P_p and N_p, and to minimize the correlation between features from P_p and R_p. The Pearson coefficient is used to compute the correlation. Hence, this module computes the correlation between the projections h^1_t and h^2_t, and also between h^1_t and h^3_t, obtained from the GRU module. We denote the correlation-based loss function as

$$
\mathcal{L}_{corr} = -\,\mathrm{corr}\big(h_t^1, h_t^2\big) + \mathrm{corr}\big(h_t^1, h_t^3\big),
$$

where corr(·, ·) is the Pearson correlation coefficient. For all patients, independently for each patch from the P_p and N_p zones, we maximized the correlation function; similarly, we minimized the correlation function for each patch from the P_p and R_p zones.

Attention module. The hidden state from each GRU cell is passed through an attention network. The attention weights α_1, α_2, ..., α_{d-1} are computed for each timepoint. These scores are then fed to a softmax layer to obtain a probability weight distribution, such that the attention weights over the available d − 1 encoder timepoints sum to 1. We compute a weighted summation of these attention weights and the GRU hidden-state vectors to construct a holistic context vector for the encoder output.

The attention-weighted context vector obtained from the d − 1 encoder timepoints is used as input to the decoder. A linear classifier and softmax layer are applied on the GRU decoder's hidden state to obtain the three severity grades g_0, g_1, and g_2. For each patient and zone, we predict 16 such patch classification scores for the I_{t_d} image. We employ majority voting as an ensemble procedure on these scores to obtain the final patient-wise grade.

Fig. 2. Architecture of the proposed approach. We show here model M_{L_1}, which deals with patches from the L_1 zone. At each timepoint, three patches, one each from P_p, N_p, and R_p, are input to the CNN. The generated CNN features are passed into a GRU cell. The fused hidden-state GRU output h_t is used to calculate attention weights. The attention-weighted summation of multiple such hidden states forms the context vector for decoding.

Our multi-institutional dataset, COVIDProg [5], contains 621 antero-posterior CXR scans from 93 COVID-19 patients, collected over multiple days. 23 cases were obtained from Newark Beth Israel Medical Center; the remaining 70 cases were curated from Stony Brook University Hospital. All the CXRs were of dimension 3470 × 4234. Additional details can be found in Supplementary section 3.

For training the CNN and GRU, a cross-entropy loss function was used along with the designed correlation loss discussed earlier (a sketch of this combined objective is given below). Optimization of the network was done using the Adam optimizer. Each of the 6 models is trained for 300 iterations with a batch size of 30 and a learning rate of 0.001; the total number of epochs is 20. We used packed padded sequences to mask out all losses that surpassed the required sequence length, thereby nullifying the effect of missing timesteps for a patient in the dataset. We adopted a 5-fold cross-validation approach to predict the I_{t_d} image severity grades for the 93 patients, using the d − 1 preceding images as encoder input.
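As a concrete illustration of the fused gating above, the following is a minimal PyTorch-style sketch of one multi-scale GRU cell. It is a sketch under our own assumptions, not the authors' released implementation: the class name, the use of nn.Linear projections, the fusion weights being shared across timepoints, and the per-zone projections returned for the correlation loss are all our choices, reflecting one plausible reading of the equations.

```python
import torch
import torch.nn as nn


class MultiScaleGRUCell(nn.Module):
    """Sketch of a GRU cell that fuses three 256-d patch features
    (concerned, neighboring, remote zone) before the standard GRU gating."""

    def __init__(self, input_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # W_* act on the fused input; U_* act on the previous hidden state
        # (the biases b_r, b_z, b_h live inside the U_* linear layers).
        self.W_r = nn.Linear(input_dim, hidden_dim, bias=False)
        self.W_z = nn.Linear(input_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(input_dim, hidden_dim, bias=False)
        self.U_r = nn.Linear(hidden_dim, hidden_dim)
        self.U_z = nn.Linear(hidden_dim, hidden_dim)
        self.U_h = nn.Linear(hidden_dim, hidden_dim)
        # Learned fusion weights w^i (shared across timepoints in this sketch).
        self.w = nn.Parameter(torch.ones(3))

    def forward(self, x_p, x_n, x_r, h_prev):
        # Weighted fusion of the three zone features (X^1_t, X^2_t, X^3_t).
        x = self.w[0] * x_p + self.w[1] * x_n + self.w[2] * x_r
        r = torch.sigmoid(self.W_r(x) + self.U_r(h_prev))        # reset gate
        z = torch.sigmoid(self.W_z(x) + self.U_z(h_prev))        # update gate
        h_cand = torch.tanh(self.W_h(x) + self.U_h(r * h_prev))  # candidate activation
        h = (1.0 - z) * h_prev + z * h_cand                      # fused hidden state
        # Per-zone projections: one reading of h^1_t, h^2_t, h^3_t used by the
        # correlation module (an assumption; the paper does not spell this out).
        h1 = torch.tanh(self.W_h(x_p) + self.U_h(r * h_prev))
        h2 = torch.tanh(self.W_h(x_n) + self.U_h(r * h_prev))
        h3 = torch.tanh(self.W_h(x_r) + self.U_h(r * h_prev))
        return h, (h1, h2, h3)
```

An encoder would unroll such a cell over the available d − 1 timepoints, feeding the three CNN feature vectors at every step and collecting the hidden states for the attention module.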
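The combined training objective referred to above can be summarized as the following sketch. It assumes the per-zone projections h^1, h^2, h^3 from the GRU cell sketched earlier and a weighting factor lam between the cross-entropy and correlation terms; the weighting, and the exact reduction over patches and timepoints, are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn.functional as F


def pearson_corr(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pearson correlation along the feature dimension of two (batched) vectors."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)


def total_loss(logits, target, h1, h2, h3, lam: float = 1.0):
    """Cross entropy on the decoder's severity prediction plus the two-part
    correlation term: reward corr(h^1, h^2), penalize corr(h^1, h^3)."""
    ce = F.cross_entropy(logits, target)
    corr_term = -pearson_corr(h1, h2).mean() + pearson_corr(h1, h3).mean()
    return ce + lam * corr_term
```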
First baseline approach (B_1). We trained 6 different models based on a transfer-learning framework, illustrated in Supplementary section 4. All the pretrained convolutional weights of a VGG-16 network [11] were kept fixed. The last two layers of the network were replaced with two new fully connected layers to handle the 3-class classification problem. For a particular model, M_{L_1}, 64 × 64 patches were extracted from the L_1 zone using a sliding window approach with a stride of 32. After passing these patches through the VGG-16, we obtained a P × 4096 feature matrix, where P denotes the total number of patches extracted for a patient from the L_1 zones of images collected from timepoints t_1, t_2, ..., t_{d-1}. We used simple feature averaging to obtain a 1 × 4096 feature vector from the P × 4096 matrix for each patient. Finally, a 1-D neural network was trained to classify these features into severity grades g_0, g_1, and g_2 for the I_{t_d} image (a minimal sketch of this pipeline appears after the ablation study below). Majority voting was used as an ensemble procedure to convert the patch classification grades into a patient-wise grade.

Second baseline approach (B_2). We built a radiomic feature based pipeline. 445 texture-based radiomic features [13] were extracted from the concerned lung zone. These features were similarly averaged into a single feature vector and classified using a random forest classifier.

Averaged results are presented after 5 runs of model testing. Accuracy is computed for each of the 6 lung zones, while precision and recall are measured for each of the severity grades g_0, g_1, and g_2. The results using our approach and the two baseline methods are reported in Tables 1 and 2 for the left and the right lung zones, respectively. In all zones except R_3, our method performed significantly better than both baseline approaches. For example, in the left lung upper zone, we achieved an accuracy of 75.26%, while the baseline accuracies were 60.21% and 56.98% for B_1 and B_2, respectively.

Ablation study. In order to capture the gradual improvement of our framework through its different stages, we conducted a serial ablation study and built two sub-variants of our framework. 1) Variant-1: This variant uses only multi-scale GRU cells which concatenate the inputs from two distinct patches, P_p and N_p, to generate the fused representation. Both the Correlation module and the Attention module were removed. Though neighboring patches are taken into consideration, this variant does not exploit the explicit correlation between P_p and N_p or between P_p and R_p. Also, the encoder output vector does not consider the relative importance of the hidden states generated at multiple timepoints. 2) Variant-2: This variant includes the Correlation module; however, the Attention module is omitted and equal importance is assigned to all the zone patches collected from the multiple timepoints' images. Thus we gradually converged on our full framework, which outperforms the sub-variants by a large margin in most zones. The results in Tables 1 and 2 suggest that exploiting the correlation between nearby-zone and remote-zone patches leads to an increase in prediction performance. Moreover, the use of an attention layer to provide individual patch importance further boosts the accuracy. As an example, it can be seen that for the left lung middle zone, our M_{L_2} accuracy is 72.04%, while for Variant-1 and Variant-2 it is 67.74% and 70.96%, respectively.
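For reference, the following is a minimal sketch of the B_1 transfer-learning baseline described earlier, assuming a PyTorch/torchvision setup. Replicating the grayscale CXR patches to three channels, resizing them to VGG-16's usual 224 × 224 input, and the exact architecture of the small trainable head are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen VGG-16 backbone; keep everything up to the penultimate FC layer (4096-d output).
vgg = models.vgg16(pretrained=True)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
for p in vgg.parameters():
    p.requires_grad = False
vgg.eval()

# Small trainable head mapping the averaged 4096-d patient feature to the 3 severity grades.
head = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 3))


def patient_grade_logits(patches: torch.Tensor) -> torch.Tensor:
    """patches: (P, 3, 224, 224) tensor holding all L1-zone patches of one patient
    from timepoints t1..t_{d-1}, already replicated to 3 channels and resized."""
    with torch.no_grad():
        feats = vgg(patches)                  # (P, 4096) patch features
    avg = feats.mean(dim=0, keepdim=True)     # simple feature averaging -> (1, 4096)
    return head(avg)                          # logits over g0, g1, g2
```

Training the head with a cross-entropy loss over these logits, with the backbone frozen, captures the spirit of B_1; the radiomics baseline B_2 instead replaces the VGG features with 445 texture features and a random forest classifier.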
Testing with d − 2 timepoints as encoder input. We designed an experimental setup to analyze how the framework performs when patches from only the first d − 2 images are used as input to our GRU encoder, while the task remains to predict the severity scores of the I_{t_d} image.

COVID-19 CXRs reveal varied spatial correlations among the lung infiltrates across different zones; adjacent zones are generally found to be more correlated than two distant regions. We build a multi-scale GRU based encoder-decoder framework which accepts multiple inputs from different lung zones at a single timepoint. Unlike generative approaches, our model does not require registration between images from different timepoints. A novel two-component correlation loss is introduced to explore the spatial correlations between nearby and distant lung fields in the latent representation. Finally, we use an attention layer to judge the relative importance of the images from the available timepoints for computing the disease severity score at a future timepoint.

We implemented our framework on a server with an 11 GB Nvidia RTX 2080 Ti GPU. Each model in the proposed approach was trained in 3.4 hours for 30 epochs; Baselines 1 and 2 took 2 hours and 1.25 hours, respectively. CXRs from Stony Brook University Hospital were acquired using the portable DRX Revolution machine developed by Carestream Health with an AP imaging technique. CXRs from Newark Beth Israel Medical Center were acquired using GE Optima XR240 AMX portable machines. Each lung zone severity score was determined by agreement among three expert readers (≥ 15, ≥ 3, and ≥ 2 years of experience, respectively).

References
[1] Predicting mechanical ventilation requirement and mortality in COVID-19 using radiomics and deep learning on chest radiographs: A multi-institutional study
[2] Neural machine translation by jointly learning to align and translate
[3] Learning phrase representations using RNN encoder-decoder for statistical machine translation
[4] Role of standard and soft tissue chest radiography images in COVID-19 diagnosis using deep learning
[5] Predicting COVID-19 lung infiltrate progression on chest radiographs using spatio-temporal LSTM based encoder-decoder network
[6] Combining Initial Radiographs and Clinical Variables Improves Deep Learning Prognostication in Patients with COVID-19 from the Emergency Department
[7] Review of Chest Radiograph Findings of COVID-19 Pneumonia and Suggested Reporting Language
[8] Current limitations to identify COVID-19 using artificial intelligence with chest X-ray imaging
[9] GRU based deep learning model for prognosis prediction of disease progression
[10] Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation and Diagnosis for COVID-19
[11] Very deep convolutional networks for large-scale image recognition
[12] Clinical and Chest Radiography Features Determine Patient Outcomes In Young and Middle Age Adults with COVID-19
[13] Computational radiomics system to decode the radiographic phenotype
[14] Toward predicting the evolution of lung tumors during radiotherapy observed on a longitudinal MR imaging study via a deep learning algorithm
[15] Frequency and Distribution of Chest Radiographic Findings in COVID-19 Positive Patients
[16] Deep learning predicts lung cancer treatment response from serial medical imaging
[17] Deep multimodal representation learning from temporal data
[18] Convolutional invasion and expansion networks for tumor growth prediction

Acknowledgment: Reported research was supported by the OVPR and IEDM seed grants, 2020 at Stony Brook University, NIGMS T32GM008444, and NIH
75N92020D00021 (subcontract). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.