Title: COVID-19 Pneumonia Severity Prediction using Hybrid Convolution-Attention Neural Architectures
Authors: Nam Nguyen; J. Morris Chang
Date: 2021-07-06

Abstract: This study proposes a novel framework for COVID-19 severity prediction that combines data-centric and model-centric approaches. First, we propose a data-centric pre-training scheme for the extremely scarce data scenario of the investigated dataset. Second, we propose two hybrid convolution-attention neural architectures that leverage self-attention from the Transformer and the Dense Associative Memory (Modern Hopfield network). Our proposed approach achieves a significant improvement over the conventional baseline. The best model from our proposed approach achieves $R^2 = 0.85 \pm 0.05$ and Pearson correlation coefficient $\rho = 0.92 \pm 0.02$ for geographic extent, and $R^2 = 0.72 \pm 0.09$, $\rho = 0.85 \pm 0.06$ for opacity prediction.

The coronavirus disease 2019 (COVID-19) was declared a global pandemic by the World Health Organization in early 2020; 184 million cases and approximately 4 million deaths had been recorded up to July 2021 [1]. Early detection not only improves the survival rate of COVID-19 patients but also prevents the spread of the disease. Moreover, severity prediction significantly impacts resource allocation in hospitals [2], [3], [4], which is crucial during the pandemic. Many studies [5], [6], [7], [8] show a high correlation between the severity progression of COVID-19 and the length of hospital stay and ICU admission, which is valuable for optimal planning of follow-up medical care. Computer-aided diagnosis based on machine learning and deep learning has become a potential solution for COVID-19 detection [9] and severity prediction [10], [11], [12]. The dominant solution for COVID-19 detection is delivered through transfer learning, for which databases are abundant and adequate for training well-performing models. In contrast, severity prediction copes with extremely small cohorts, where the number of samples is inadequate to deliver well-calibrated deep learning solutions. Similar works in the literature tackle COVID-19 severity prediction with existing deep neural architectures [10], [11].

In this work, we propose a novel approach for COVID-19 severity prediction that combines data-centric and model-centric improvements. We summarize our contributions as follows:
1) We propose a data-centric pre-training framework to tackle the extremely scarce data scenario of COVID-19 severity prediction. Our proposed data-centric pre-training significantly improves the predictive performance of deep neural architectures.
2) We propose two hybrid convolution-attention neural architectures that leverage self-attention from the state-of-the-art Transformer and the Dense Associative Memory.
3) The experimental results show a noticeable improvement over conventional counterparts, which include transfer learning from ImageNet and existing neural architectures.

The organization of our work is as follows: Section 2 briefly introduces related works, Section 3 gives a detailed description of our proposed approach, Section 4 reports our experimental design and results, and Section 5 gives the discussion and conclusion of our study.
The design of deep neural architectures can be categorized into two approaches: (1) manual and (2) automated. In manual design, we aim to develop the architecture of neural blocks, which requires considerable expert knowledge. For example, the residual block introduced in [13] enables more convenient optimization through residual connections from the inputs, and Inception blocks approximate an optimal sparse neural structure via a "split-transform-merge" strategy [14]. On the other hand, automated neural architecture search (NAS) attempts to find the optimal neural architecture for a given dataset [15], [16], [17]. The dominant approach to neural encoding in NAS algorithms is through directed acyclic graphs, which represent blocks of a neural architecture (also known as cells). The discovered cells are then stacked to form the final neural architecture. Recent developments in manual neural architecture design leverage self-attention to capture the global contextual information within the input space. The common choice for the self-attention module is the Transformer [18], which was originally designed for natural language processing. The Vision Transformer (ViT) [19] splits the input image into a sequence of patches, which are taken as the input of the Transformer encoder; note that ViT uses only the Transformer encoder. The Detection Transformer leverages the full Transformer architecture for object detection tasks [20]. HybridCA [21] introduced learnable image representation queries for the full Transformer model, which enhances the capacity of the neural architecture. CoAtNet [22] introduced a family of hybrid models that improve generalization, capacity, and efficiency.

The improvement of an AI system can be achieved through two main approaches: (1) model-centric and (2) data-centric development. In the model-centric approach, we aim to develop AI algorithms on given datasets, which are commonly fixed throughout the process. The main focus of this approach is delivering optimized models for the desired learning tasks, enabling state-of-the-art neural solutions. Its advantage is the convenient comparison between algorithms due to fixed collections of pre-defined datasets. However, several issues are associated with the model-centric approach: its performance depends heavily on the data scenario, which in some cases is intrinsically challenging. For example, inadequate training samples in scarce-data scenarios potentially lead to poorly performing models and non-robust inference. In the data-centric approach, we leverage additional data to improve the performance of the AI system. The fundamental assumption of this approach is simple but practical: the performance of ML/DL algorithms can be improved with more relevant data samples. Transfer learning can be considered an example of the data-centric approach, as it transfers knowledge from a large source-domain dataset to a smaller target dataset. This approach assists in learning low-level representations from the source domain, enabling more efficient learning on the target set. However, an inherent limitation of transfer learning is negative transfer, which occurs when the source and target datasets are irrelevant to each other.
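To make the transfer-learning setup just described concrete, here is a minimal sketch (our own illustration, assuming PyTorch/torchvision; the DenseNet121 backbone, the two-output head, and the checkpoint name are assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn
from torchvision import models

# Conventional transfer learning: inherit ImageNet weights for the backbone
# and replace the classification head for the (much smaller) target task.
backbone = models.densenet121(pretrained=True)        # ImageNet initialization
backbone.classifier = nn.Linear(backbone.classifier.in_features, 2)

# The data-centric alternative studied in this paper instead initializes the
# backbone from weights pre-trained on a curated proxy dataset, e.g.
# (hypothetical checkpoint name):
# backbone.load_state_dict(torch.load("proxy_densenet121.pt"), strict=False)

optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
```

The only difference between the two setups is the source of the initial weights, which is exactly the point of comparison in the experiments below.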
Directly addressing these issues, we propose a data-centric framework for pre-training deep neural architectures, which is illustrated in Figure 1.

Fig. 1: Illustration of our proposed data-centric pre-training framework. We highlight the difference between transfer learning and data-centric pre-training in different boxes.

In the pre-training phase, a proxy dataset must be curated that is highly similar to, yet strictly separated from, the dataset of interest. The size of the proxy data can be smaller or larger than the target dataset, depending on the desired purpose. Take automated neural architecture search (NAS) as an example, where we aim to discover the best neural solution for a given dataset. Early works on NAS [15], [16] search and evaluate on ImageNet [23] with 14 million samples, which leads to an extensive search time of 2250-3000 GPU-days. Follow-up works show that CIFAR-10 [24] (60k samples) is a good proxy for ImageNet [16], reducing the search time to only 1-4 GPU-days [17]. In our use case, we aim to develop a high-performance model for an extremely small dataset. Thus, the desired proxy dataset needs to include more samples while maintaining similarity to the dataset of interest. Moreover, the pre-training task enables learning good representations while preserving the similarity to the dataset of interest. The choice of pre-training task depends on the availability of labels in the proxy data. For a labeled proxy, we can adopt supervised learning tasks such as classification or object detection for the model to learn data representations. For unlabeled proxy datasets, unsupervised and self-supervised learning [25], [26] can be used to pre-train deep neural networks. The main objective of the data-centric pre-training phase is to help the neural architecture learn good representations, which can lead to significant improvements on the downstream tasks. Good representations are expressive: a reasonably sized representation can capture the abstraction of a vast number of inputs and mitigate variance [27]. The design of the data-centric pre-training for COVID-19 severity prediction is given in Section 4.

We generalize the hybrid convolution-attention (HybridCA) neural architecture of [21], which includes two main components: (1) a convolutional backbone module and (2) a self-attention module (Figure 2). First, the convolutional backbone transforms the input image into intermediate feature maps. These image representations are then projected and vectorized across channels to form a collection of entities {x_1, x_2, ..., x_n}, in which each x_i is a p-dimensional vector in the latent space. These embedded vectors are taken as the inputs of the self-attention module to extract the global contextual information and the relationships amongst entities. Since the attention module possesses the permutation-invariance property, we apply a fixed positional encoding before the self-attention encoder. The encoded entities are then fed forward into the self-attention decoder together with same-sized learnable image representation queries (IRQs), which can be considered learnable parameters of the architecture. The final prediction of the HybridCA architecture is delivered by a multilayer perceptron, customized to the desired learning task. The original HybridCA architecture only considers the full Transformer as the self-attention module; in this work, we extend the study by investigating the effectiveness of an additional self-attention model, the Dense Associative Memory.
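As a concrete illustration of this pipeline, the following is a minimal PyTorch sketch (our own illustrative code, not the authors' implementation): treating each spatial location of the projected feature map as one entity is a common DETR-style choice that may differ from the exact tokenization in [21], and the backbone, hyperparameters, and two-output head are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models


def sinusoidal_encoding(n_tokens, d_model, device):
    # Fixed (non-learnable) sinusoidal positional encoding, as in [18].
    pos = torch.arange(n_tokens, device=device, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(0, d_model, 2, device=device, dtype=torch.float32)
    angles = pos / (10000.0 ** (idx / d_model))
    pe = torch.zeros(n_tokens, d_model, device=device)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


class HybridCA(nn.Module):
    """CNN backbone -> projected feature-map entities -> Transformer
    encoder/decoder with learnable image representation queries (IRQ)
    -> MLP head. All hyperparameters here are illustrative."""

    def __init__(self, d_model=256, n_heads=8, n_layers=2, n_queries=4, n_outputs=2):
        super().__init__()
        self.backbone = models.densenet121(pretrained=True).features   # conv backbone
        self.project = nn.Conv2d(1024, d_model, kernel_size=1)         # channel projection
        self.attention = nn.Transformer(d_model=d_model, nhead=n_heads,
                                        num_encoder_layers=n_layers,
                                        num_decoder_layers=n_layers,
                                        batch_first=True)
        self.irq = nn.Parameter(torch.randn(n_queries, d_model))       # learnable IRQ
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(n_queries * d_model, n_outputs))

    def forward(self, x):
        fmap = self.project(self.backbone(x))              # (B, d_model, H, W)
        tokens = fmap.flatten(2).transpose(1, 2)           # (B, H*W, d_model) entities
        tokens = tokens + sinusoidal_encoding(tokens.size(1), tokens.size(2), x.device)
        queries = self.irq.unsqueeze(0).expand(x.size(0), -1, -1)
        decoded = self.attention(tokens, queries)          # (B, n_queries, d_model)
        return self.head(decoded)                          # e.g., two severity scores


model = HybridCA()
scores = model(torch.randn(1, 3, 224, 224))                # -> tensor of shape (1, 2)
```

Here `nn.Transformer` stands in for the self-attention module; the Hopfield-based variant discussed next would swap in a Dense Associative Memory (Modern Hopfield) attention layer in its place.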
The Transformer is an encoder-decoder neural architecture that contains a stack of encoder layers followed by decoder layers. In the Transformer's encoder, the core component is a multi-head self-attention block followed by an element-wise feed-forward network. Residual connections are established within the encoder, together with layer-wise normalization. The Transformer's decoder is similar to the encoder, except that it requires an additional multi-head attention over the encoded representation entities. The building block of the Transformer model is multi-head attention, which is formed from the self-attention mechanism.

Self-attention: Given a set of image representation entities {x_1, x_2, ..., x_n}, stacked as the rows of a matrix $X \in \mathbb{R}^{n \times p}$, self-attention attempts to learn the relationships amongst the input entities, producing encoded representations that capture the global contextual information of the entities. The task requires learning three weight matrices, $W^Q \in \mathbb{R}^{p \times d_k}$, $W^K \in \mathbb{R}^{p \times d_k}$, and $W^V \in \mathbb{R}^{p \times d_v}$. The collection of input entities $X$ is projected onto the three learnable matrices as follows:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V.$$
The encoded representation $Z \in \mathbb{R}^{n \times d_v}$ is computed as
$$Z = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V,$$
where $1/\sqrt{d_k}$ is the temperature of the dot product inside the softmax, preventing extremely small gradients [18]. As a result, each row of the encoded representation matrix $Z$ is a weighted sum of all original entities in the latent space, where the weights are computed from the dot products of its query with all keys.

Multi-head attention: Given $B$ blocks of self-attention, multi-head attention is formed by computing the individual self-attention heads simultaneously. We denote $\{W^{Q(i)}, W^{K(i)}, W^{V(i)}\}$, $i = 1, 2, \ldots, B$, for each self-attention head and $Z^{(i)}$ for the corresponding encoded entities. The output of multi-head attention is formed by projecting the concatenation of all $Z^{(i)}$ onto $W^O \in \mathbb{R}^{Bd_v \times d}$, i.e., $\mathrm{MultiHead}(X) = \mathrm{concat}(Z^{(1)}, \ldots, Z^{(B)})\, W^O$. Hence, the outputs of multi-head attention capture multiple complex interactions from the projected convolutional feature maps in parallel, providing a larger receptive field.

The Dense Associative Memory (or Modern Hopfield network) was introduced in [28] and is extended to continuous-valued patterns and states in [29]. The new energy function introduced in [29] enables an exponential number of stored patterns with exponentially small retrieval errors; it is given as
$$E = -\mathrm{lse}\big(\beta, X^\top p\big) + \tfrac{1}{2}\, p^\top p + \beta^{-1} \log n + \tfrac{1}{2} M^2,$$
where $\mathrm{lse}(\cdot)$ is the log-sum-exp function, $\beta$ is the temperature, $p$ is the state pattern, $X = (x_1, \ldots, x_n)$ collects the $n$ stored representations, and $M$ is the largest norm among all stored representations. The update rule of this network, derived using the Concave-Convex Procedure, is
$$p^{\mathrm{new}} = X\, \mathrm{softmax}\big(\beta\, X^\top p\big).$$
This update rule is equivalent to the self-attention used in the Transformer model, which enables attention learning from the input data. Moreover, the new energy function with its associated update rule ensures convergence to a local minimum of the energy, leading to fast convergence.
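The correspondence between the two update rules can be seen in a small numerical sketch (ours, with arbitrary toy dimensions; in the actual architectures these operations act on the projected convolutional feature-map entities):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, p, d_k = 6, 16, 8                      # n entities, each p-dimensional
X = torch.randn(n, p)                     # image-representation entities (rows)

# (1) Scaled dot-product self-attention: Z = softmax(Q K^T / sqrt(d_k)) V.
W_q, W_k, W_v = (torch.randn(p, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
Z = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V    # (n, d_k) encoded entities

# (2) Modern-Hopfield retrieval of a state pattern xi from the stored
# patterns (rows of X): xi_new = X^T softmax(beta * X xi). This has the same
# "softmax-weighted sum of stored patterns" form as the attention update.
beta = 1.0 / p ** 0.5
xi = torch.randn(p)                       # state (query) pattern
xi_new = X.T @ F.softmax(beta * (X @ xi), dim=0)   # retrieved p-dimensional pattern
```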
In the pre-training phase, we collect two databases to form the proxy dataset for COVID-19 severity prediction: (1) the NIH Chest X-ray Dataset [30], which includes 112,120 non-COVID-19 X-ray images with 14 disease labels from 30,805 patients, and (2) a COVID-19 database of 3,671 images collected from various sources [31], [32]. Note that these databases are completely separated from the dataset used for severity prediction. The pre-training task on the proxy data is multi-label classification. The target vector (label) for each input instance is a 15-dimensional vector (14 types of disease plus the COVID-19 class) with binary entries, representing the presence of the related diseases. In other words, the zero vector y = [0, ..., 0] represents a normal case, while a unit entry at position C represents the presence of the C-th disease. This label encoding guarantees that the model cannot infer the normal and disease classes concurrently, and the learning task is treated as a regression-like, multi-label problem.

Within the scope of this study, we investigate backbone convolutional neural networks with different model complexities: DenseNet121 [33], ResNet50 [13], and EfficientNet-B1 to B5 [34]. The loss function for the pre-training task is the binary cross-entropy loss. To optimize the model parameters, we use the AdamW optimizer with an initial learning rate of $10^{-6}$ and weight decay 0.01. We pre-train the backbone CNNs, initialized with ImageNet weights, for 100 epochs. We discuss the experimental results of the pre-training phase in Section 4.

The COVID-19 dataset for severity prediction is from [10] and is completely separated from the proxy dataset. The database includes 94 posteroanterior (PA) chest X-ray images. All patients in the cohort were reported positive for COVID-19 between December 2019 and March 2020. The labels of the database are based on radiological scoring involving three blinded experts: two chest radiologists (with 20 years of experience) and a radiology resident scored the COVID-19 severity following [35], covering the extent of lung involvement (geographic extent) and the degree of opacity (opacity). The COVID-19 severity prediction dataset can be considered extremely small, so we perform 5-fold cross-validation to evaluate the competitors' performance. First, we split the dataset into five independent folds, which guarantees no overlapping patients across folds. The evaluation metrics for severity prediction are: (1) mean squared error (MSE), (2) mean absolute error (MAE), (3) R-squared ($R^2$), and (4) the Pearson correlation between the actual and predicted scores. The loss function used in this phase is the smoothed L1 loss with $\beta = 1$, given by
$$\ell(y, \hat{y}) = \begin{cases} 0.5\,(y - \hat{y})^2 / \beta, & \text{if } |y - \hat{y}| < \beta,\\ |y - \hat{y}| - 0.5\,\beta, & \text{otherwise.}\end{cases}$$
We train the model on each fold for 400 epochs with the SGD optimizer, an initial learning rate of $10^{-3}$, momentum 0.9, and weight decay $3 \times 10^{-5}$. To prevent over-fitting, we reduce the learning rate of the CNNs with a decay rate of 0.98.

In terms of the multi-label classification pre-training task, the mean area under the curve (AUC) scores of DenseNet121 and EfficientNet-B1, B2, and B3 are nearly the same, while EfficientNet-B4 achieves a slightly higher mean AUC of 0.8018 over the 15 classes. On the other hand, ResNet50 achieves the lowest AUC score of 0.7495, although it has the largest model complexity. Table 1 reports the per-class AUC scores of the backbone CNN models. As we can see, the AUC score of the COVID-19 class is very close to perfect prediction. However, this phenomenon is potentially attributable to the cross-domain design of the proxy dataset, in which the COVID-19 images come from entirely different institutions [36]. Thus, we do not attempt to compare the results of the pre-training phase with other works or to emphasize COVID-19 detection ability. Instead, the main objective of the pre-training phase is to help the backbone models learn valuable representations for the downstream task, which is severity prediction. We investigate the effectiveness of this knowledge transfer in the remainder of Section 4.
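For reference, the smoothed L1 objective above and the four evaluation metrics can be computed as in the following sketch (our illustrative code, assuming NumPy, SciPy, and PyTorch; the function names are ours, and the metric definitions are the standard ones rather than the authors' exact implementation):

```python
import numpy as np
import torch
from scipy.stats import pearsonr


def smooth_l1(pred, target, beta=1.0):
    """Smoothed L1 loss: quadratic below beta, linear above (beta = 1 here)."""
    diff = torch.abs(pred - target)
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()


def regression_metrics(y_true, y_pred):
    """MSE, MAE, R^2, and Pearson correlation for one severity score (1-D arrays)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rho, _ = pearsonr(y_true, y_pred)
    return mse, mae, r2, rho
```

In the 5-fold setup, these metrics are computed per fold (separately for geographic extent and opacity) and then averaged.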
We report the main results of the COVID-19 severity prediction task in Table 2, which contains six blocks corresponding to the choices of backbone CNN. In the first line of each block, we report the performance of the stand-alone CNNs under the transfer learning setup, in which the models' weights are inherited from ImageNet. The outcomes are consistent across all backbone CNNs, showing that transfer learning is ineffective in the case of extremely small and domain-irrelevant target datasets. The test-set performance of higher-complexity models such as ResNet50 and EfficientNet-B4 is lower than that of smaller networks, even though they over-fit the training set. Moreover, the diagnosis of the learning curves (not shown here) indicates that these models stop gaining test accuracy after half of the training process, even when over-fitting prevention such as adaptive learning rates or dropout is applied.

In the second line of each block, we train the stand-alone backbone CNNs with weights from our data-centric pre-training task. A consistent pattern appears across all models, yielding noticeable improvements compared to transfer learning from ImageNet. First, the test accuracy improves significantly, which can be observed through the test MAE and MSE. Moreover, the $R^2$ and Pearson correlation between the actual and predicted values increase by a considerably large margin, indicating more precise predictions. The largest accuracy gain is observed for DenseNet121, while deeper CNNs such as ResNet50 gain only a minor improvement. However, the agreement between the actual and predicted values is still not remarkable: DenseNet121 achieves only $R^2$ of 0.74 and 0.55 for predicting geographic extent and opacity, respectively. Figure 4 illustrates the cross-validation loss computed with the smoothed L1 loss defined above; transfer learning fails to achieve good performance in comparison to data-centric pre-training.

The third and fourth lines of Table 2 show the performance of the proposed hybrid convolution-attention architectures with different self-attention modules; we denote HCT for the Transformer and HCH for the Hopfield network. The initial weights for these experiments are adopted from the data-centric pre-training phase. Overall, the MSE of the hybrid neural architectures drops by approximately 0.2 points across all backbone models, while the MAE drops by 1.5 points on average. We can see a significant improvement when observing the $R^2$ of each hybrid model. For example, the $R^2$ for geographic extent with DenseNet121 increases from 0.74 to 0.82, while that for opacity prediction improves from 0.55 to 0.73. Moreover, Figure 3 shows that the CV loss of the hybrid architectures is generally lower than that of the stand-alone CNNs, while the difference between the two self-attention modules is not noticeable. We report the alignment of DenseNet121, HCT-DenseNet121, and HCH-DenseNet121 in Figure 5. In general, the hybrid architectures achieve better performance across all folds. Moreover, the alignment for geographic extent is slightly better than that for the opacity predictions.

We have presented a novel framework for COVID-19 severity prediction. Our data-centric pre-training design enables high-performance models when transferring knowledge to the downstream task. Moreover, we introduce a new class of deep neural architectures that captures global contextual information from the input space through self-attention modules. Further improvements of this work could consider different self-attention modules for the hybrid architecture. Additionally, extending the framework to more applications is a potential research direction.
Acknowledgment: Effort sponsored in part by the United States Special Operations Command (USSOCOM), under Partnership Intermediary Agreement No. H92222-15-3-0001-01. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the United States Special Operations Command.

References
[2] Early prediction of disease progression in COVID-19 pneumonia patients with chest CT and clinical characteristics
[3] Development of a prognostic model for mortality in COVID-19 infection using machine learning
[4] Viral pneumonia screening on chest X-ray images using confidence-aware anomaly detection
[5] Sensitivity of chest CT for COVID-19: comparison to RT-PCR
[6] Time course of lung changes on chest CT during recovery from 2019 novel coronavirus
[7] Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection
[8] CT quantification of pneumonia lesions in early days predicts progression to severe illness in a cohort of COVID-19 patients
[9] Application of deep learning for fast detection of COVID-19 in X-rays using nCOVnet
[10] Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
[11] Integrating deep learning CT-scan model, biological and clinical variables to predict severity of COVID-19 patients
[12] COVID-19 in CXR: From detection and severity scoring to patient disease monitoring
[13] Deep residual learning for image recognition
[14] Going deeper with convolutions
[15] Regularized evolution for image classifier architecture search
[16] Learning transferable architectures for scalable image recognition
[17] Contrastive self-supervised neural architecture search
[18] Attention is all you need
[19] An image is worth 16x16 words: Transformers for image recognition at scale
[20] End-to-end object detection with transformers
[21] Attention learning for classification of dermoscopy image
[22] CoAtNet: Marrying convolution and attention for all data sizes
[23] ImageNet: A large-scale hierarchical image database
[24] Learning multiple layers of features from tiny images
[25] Self-supervised learning of pretext-invariant representations
[26] A simple framework for contrastive learning of visual representations
[27] Representation learning: A review and new perspectives
[28] Dense associative memory for pattern recognition
[29] Hopfield networks is all you need
[30] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
[31] Can AI help in screening viral and COVID-19 pneumonia?
[32] Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images
[33] Densely connected convolutional networks
[34] EfficientNet: Rethinking model scaling for convolutional neural networks
[35] Frequency and distribution of chest radiographic findings in patients positive for COVID-19
[36] On the limits of cross-domain generalization in automated X-ray prediction