title: A Multi-Stage Attentive Transfer Learning Framework for Improving COVID-19 Diagnosis
authors: Liu, Yi; Ji, Shuiwang
date: 2021-01-14

Computed tomography (CT) imaging is a promising approach to diagnosing COVID-19. Machine learning methods can be employed to train models from labeled CT images and predict whether a case is positive or negative. However, no publicly available, large-scale CT dataset exists for training accurate models. In this work, we propose a multi-stage attentive transfer learning framework for improving COVID-19 diagnosis. Our proposed framework consists of three stages to train accurate diagnosis models by learning knowledge from multiple source tasks and data of different domains. Importantly, we propose a novel self-supervised learning method to learn multi-scale representations for lung CT images. Our method captures semantic information from the whole lung and highlights the functionality of each lung region for better representation learning. The method is then integrated into the last stage of the proposed transfer learning framework to reuse the complex patterns learned from the same CT images. We use a base model integrating self-attention layers (ATTNs) and convolutional operations. Experimental results show that networks with ATTNs induce greater performance improvement through transfer learning than networks without ATTNs. This indicates that attention exhibits higher transferability than convolution. Our results also show that the proposed self-supervised learning method outperforms several baseline methods.

The COVID-19 pandemic has spread rapidly and infected millions of people worldwide. A critical step in fighting the spread of COVID-19 is effective diagnosis of infected cases [1], [2]. The commonly used approach for COVID-19 diagnosis is reverse transcription polymerase chain reaction (RT-PCR). It usually takes hours to produce results, and test kits are in great shortage in some countries and areas [3], [4]. Computed tomography (CT)-aided diagnosis has become an important alternative due to its wide availability and easy accessibility [3], [5], [6]. To accelerate the reading of CT images, machine learning approaches have been employed to learn patterns from labeled images and then automatically make predictions for newly obtained CT images [7], [8], [9], [10].

A major challenge of CT-aided automatic diagnosis is the lack of labeled data [7], [11], [12]. To date, the largest publicly available labeled CT dataset [12] contains only several hundred images. Models trained on such small-scale datasets may generate unsatisfactory prediction results for newly obtained CT images. It is natural to leverage transfer learning to train more powerful models for accurate COVID-19 diagnosis. Transfer learning is a popular machine learning technique that learns a model from a source task where labeled data is sufficient and transfers the learned knowledge to a target task [13], [14], [15]. Existing work [7], [12] mainly focuses on using pretrained deep neural networks (DNNs) to improve the prediction performance for the target task of COVID-19 prediction. However, there is no work examining which network component, such as a convolutional layer or an attention layer, induces a larger performance improvement through transfer learning.
In addition, there is little work presenting a unified transfer learning framework for medical image analysis, especially for COVID-19 CT image prediction.

In this work, we propose a multi-stage attentive transfer learning framework for improving CT-based COVID-19 diagnosis. Our proposed framework is composed of three stages based on the machine learning approaches and data used in the source tasks. First, we perform supervised transfer learning from natural images (STL-N) and supervised transfer learning from medical images (STL-M) to learn knowledge from large-scale labeled natural images and medical images, respectively. After that, we design a novel self-supervised task and perform self-supervised transfer learning from medical images (SSTL-M) to extract complex patterns from the medical CT images themselves. For the networks used in the transfer learning framework, we integrate self-attention layers (ATTNs) into convolutional neural networks (CNNs) such as ResNets to compare the transferability of convolutional layers and ATTNs. Specifically, we categorize networks into two groups, one of which contains ResNets and the other contains the same ResNets with ATTNs inserted. Then both groups are pretrained on the same source tasks and data to compare the transferability of the two groups.

We perform self-supervised transfer learning in the last stage of the proposed transfer learning framework. Existing self-supervised methods [16], [17], [18], [19], [20] achieve superior results on natural images but usually generate poor predictions on medical images. By referring to biological domain knowledge of the substructures of the human lung, we design a novel self-supervised method to learn multi-scale representations for lung CT images. Our method is capable of learning representations at both the image level and the region level. By doing this, sufficient semantic information from the whole lung is captured and the functionality of each lung region is highlighted for better representation learning. Then self-supervised transfer learning is performed to reuse the complex inherent patterns learned from the same CT images to improve the performance of the target task.

We conduct extensive experiments to evaluate our proposed approach. Experimental results show that after being pretrained with our proposed multi-stage transfer learning framework, networks with ATTNs achieve much better performance for CT-aided COVID-19 prediction than the baseline ResNets. This indicates the effectiveness of integrating ATTNs into our transfer learning framework. More importantly, it is shown that networks with ATTNs obtain much larger performance improvements through transfer learning than networks with only convolutional layers. This points out that, compared with convolution, attention can transfer knowledge from source tasks to target tasks more easily, which essentially reveals that attention exhibits higher transferability than convolution. In addition, we show that our proposed self-supervised learning method achieves the best performance compared with several state-of-the-art (SOTA) baselines. This demonstrates the effectiveness of our method of learning multi-scale representations of lung CT images and highlighting the functionality of each lung region. Our qualitative results demonstrate that using attention for transfer learning can successfully detect important regions for prediction.
Overall, our major contributions are summarized as follows:

• We propose a multi-stage attentive transfer learning framework for improving CT-aided COVID-19 diagnosis. The proposed framework successfully learns and transfers knowledge from multiple source tasks and data of different domains for accurate COVID-19 diagnosis.
• We propose a novel self-supervised learning method for medical images. Our method enables multi-scale representation learning for lung CT images and outperforms existing self-supervised learning methods.
• We not only show that networks with attention layers are more powerful through transfer learning, but also demonstrate that attention has higher transferability than convolution. To the best of our knowledge, this is the first work to compare the transferability of attention and convolution.

In this section, we introduce related work on transfer learning and the attention mechanism.

Transfer learning aims at transferring knowledge across different tasks [13], [15]. Generally, it learns knowledge from a source task and transfers the knowledge to a target task. In practice, it is usually difficult to collect sufficient training data for a target task, and training a model on insufficient data may result in unsatisfactory prediction results. Transfer learning is used to first train a model on a source task where the training data is sufficient. The pretrained model then serves as the starting point and is finetuned on the target task. For instance, in visual recognition, a model is usually trained on ImageNet, which contains millions of labeled training samples. After that, the trained weights are used as initial weights for downstream tasks such as semantic segmentation and object detection. Transfer learning has achieved success across various artificial intelligence domains, including natural language processing [21], computer vision [22], and biomedical image analysis [23], [24].

There exist several categorization criteria for transfer learning [13], [14]. Based on the machine learning approach used in the source task, transfer learning can be categorized into supervised transfer learning, unsupervised transfer learning, semi-supervised transfer learning, etc. Generally, supervised and semi-supervised learning have been studied intensively, while unsupervised learning remains a promising research area because labeling is usually expensive. Self-supervised learning is a type of unsupervised learning strategy that has gained increasing popularity recently [16], [17], [25]. It aims at supervised feature learning where the supervision is provided by the data itself. The design of the supervised (pretext) tasks is the key to self-supervised learning. Earlier work on such tasks for images mainly predicts the position or context of a local patch [16], [17]. Recent work [26], [27], [28] mainly performs two random sets of data augmentations on a pair of images and predicts whether the two images are the same or not. The contrastive loss is commonly used for these methods. The input to the contrastive loss contains a query vector x_q from an image X, a key vector x_{k+} from the same image X, and another n key vectors from n images that are different from X. The contrastive loss is then essentially the log-loss of an (n + 1)-way softmax classifier that tries to classify x_q as x_{k+} rather than any of the other n key vectors.

In this section, we describe the attention mechanism, which captures long-range dependencies from the input [29], [30].
Given the input tensor X ∈ R^{h×w×c}, the attention mechanism first performs three 1 × 1 convolutions to obtain three tensors: the query Q ∈ R^{h×w×d_k}, the key K ∈ R^{h×w×d_k} and the value V ∈ R^{h×w×d_v}. These tensors are unfolded into three matrices along mode-3 [31], resulting in Q ∈ R^{d_k×hw}, K ∈ R^{d_k×hw} and V ∈ R^{d_v×hw}, respectively. Then the intermediate output is computed as

O = V Softmax(K^T Q),   (1)

where Softmax(·) is performed on columns such that every column sums to 1. Finally, the obtained matrix O is folded back into a tensor O ∈ R^{h×w×d_v} as the final output of the attention mechanism. Essentially, (K^T Q) generates a matrix of size hw × hw, which can be treated as hw attention heatmaps. Each heatmap contains hw attention weights, and O is computed as a weighted sum of all vectors in V. In this way, the response at each position of the output O depends on all positions of V, which is itself obtained by a linear transformation of the input X. As a result, long-range dependencies in the input are captured by attention. Notably, the output of attention is input-dependent. Different from convolution, where the weights are learnable parameters, attention weights are computed from the input.

In this section, we introduce our proposed multi-stage attentive transfer learning framework and network settings. Given a new CT image, our objective is to predict whether it is COVID-19 positive or negative based on the trained model. However, existing COVID-19 CT data is small-scale and insufficient to train a powerful model, which usually leads to poor prediction performance. It is natural to leverage transfer learning to obtain more powerful models and boost the performance of COVID-19 prediction. An illustration of our proposed multi-stage transfer learning framework is provided in Figure 1a. Specifically, we first conduct two supervised source tasks, namely supervised transfer learning from natural images (STL-N) and supervised transfer learning from medical images (STL-M), to learn models from large-scale labeled data for the target task. After that, we perform self-supervised transfer learning from medical images (SSTL-M) to learn complex inherent patterns from the CT images used in the target task.

For the networks used in our transfer learning framework, we categorize them into two groups to compare the transferability of self-attention layers (ATTNs) and convolutional layers. The first group contains standard CNNs such as ResNet-50 and ResNet-101. For the other group, we follow the settings in [29] and insert ATTNs into these backbone ResNets. An illustration of the network settings is provided in Figure 1b. Generally, there are four residual blocks in the ResNet family, namely res_2, res_3, res_4 and res_5 [32]. It has been shown that networks with ATTNs inserted in res_3 and res_4 obtain the best performance [29]. We adopt a similar strategy and insert most ATTNs in res_3 and res_4. In addition, we propose to insert another ATTN in res_5 before the global average pooling to qualitatively demonstrate the effectiveness of ATTNs in transfer learning. Then both groups of networks are applied to our multi-stage transfer learning framework to compare the transferability of the two groups. Essentially, networks with and without ATTNs are pretrained on the same source tasks to compare which of ATTNs and convolutional layers can transfer knowledge from these source tasks more easily.
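To make the ATTN layer concrete, the following is a minimal PyTorch sketch of the attention operation in Equation 1, intended for insertion into a ResNet block. The choices of d_k, d_v and the residual connection around the layer are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATTN(nn.Module):
    """Sketch of the self-attention layer: 1x1 convolutions produce Q, K, V,
    which are unfolded along mode-3 and combined as O = V Softmax(K^T Q)."""

    def __init__(self, c, d_k=64, d_v=64):
        super().__init__()
        self.query = nn.Conv2d(c, d_k, kernel_size=1)
        self.key = nn.Conv2d(c, d_k, kernel_size=1)
        self.value = nn.Conv2d(c, d_v, kernel_size=1)
        # Project back to c channels so the layer fits inside a residual block.
        self.out = nn.Conv2d(d_v, c, kernel_size=1)

    def forward(self, x):                        # x: (B, c, h, w)
        b, _, h, w = x.shape
        q = self.query(x).flatten(2)             # (B, d_k, h*w)
        k = self.key(x).flatten(2)               # (B, d_k, h*w)
        v = self.value(x).flatten(2)             # (B, d_v, h*w)
        # K^T Q has size (h*w) x (h*w); softmax over each column so that every
        # column sums to 1, giving h*w attention heatmaps.
        attn = F.softmax(torch.bmm(k.transpose(1, 2), q), dim=1)
        o = torch.bmm(v, attn)                   # weighted sum of value vectors
        o = o.view(b, -1, h, w)                  # fold back to (B, d_v, h, w)
        return x + self.out(o)                   # residual connection (assumption)
```

Because the attention weights are computed from the input rather than stored as learned parameters, the same module can be dropped into res_3, res_4 or res_5 of either backbone without changing the surrounding convolutional layers.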
The labeled CT data for COVID-19 diagnosis is limited, and training networks directly on these CT images may result in poor performance for COVID-19 detection. However, there exist large-scale labeled datasets from other domains or diseases. We use supervised transfer learning to learn and transfer knowledge from these labeled data to facilitate CT-aided COVID-19 diagnosis.

We first perform STL-N on ImageNet, a large-scale collection of natural images and the most popular labeled dataset for model pretraining. Both groups of networks introduced in Section 3.2 are pretrained on ImageNet to learn knowledge from natural images. When applied to new tasks such as COVID-19 CT image prediction, the transferability of the two categories can be estimated by the performance improvement induced by transfer learning. Notably, networks with and without ATTNs are pretrained on ImageNet to compare the transferability of ATTNs and convolutional layers from natural images.

It is obvious that natural images and medical images (such as CT images) follow different distributions. Pretraining on natural images enables models to learn common patterns shared by natural and medical images, but fails to capture distinguishing patterns for medical images. Hence, we conduct STL-M to pretrain models on existing large-scale labeled medical images. Even though labeled CT images for COVID-19 diagnosis are scarce, there exist abundant sources of annotated medical images from other domains, such as chest X-ray (CXR) images for COVID-19 or CT images for regular pneumonia. Pretraining on these medical images enables models to learn inherent patterns in medical images and extract strong features for accurate COVID-19 diagnosis. Similar to STL-N, the two groups of networks, with and without ATTNs, are both pretrained on the labeled medical images to compare the transferability of the two groups.

Notably, by performing the two stages of supervised transfer learning, STL-N and STL-M, we compare the transferability of ATTNs and convolutional layers in two scenarios, where the source data follows a distribution that is either different from or similar to that of the target data.
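Both supervised stages follow the standard pretrain-then-finetune recipe. The sketch below is a minimal illustration of that recipe, assuming an ImageNet-pretrained ResNet-50, a two-class head for COVID-19 positive versus negative, and a toy stand-in for the target data loader; none of these names or hyperparameters are taken from the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Toy stand-in for the labeled target CT data (illustrative only).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
target_loader = DataLoader(TensorDataset(images, labels), batch_size=4)

# STL-N starting point: an ImageNet-pretrained backbone. A checkpoint obtained
# from STL-M or SSTL-M could be loaded into the same backbone instead.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # COVID-19 positive vs. negative

optimizer = torch.optim.SGD(model.parameters(), lr=2e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for x, y in target_loader:                       # finetune on the target task
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```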
We have so far proposed to transfer knowledge from labeled natural and medical images by performing supervised transfer learning, STL-N and STL-M. However, there still exists divergence between the distributions of the labeled source data and the target data. Notably, medical images from other modalities or diseases still exhibit a domain shift from the lung CT images for COVID-19. Hence, we perform self-supervised transfer learning as the last stage in our framework to obtain knowledge from the same CT images. Existing self-supervised learning methods achieve good performance on natural images but usually result in poor performance on medical images such as CT images. In this section, we introduce a novel self-supervised learning method for medical images and transfer the learned knowledge to COVID-19 detection. We denote this stage as self-supervised transfer learning from medical images (SSTL-M), as illustrated in Figure 1a.

The objective of SSTL-M is to learn complex and inherent patterns from CT images by performing a carefully designed self-supervised task, so designing an appropriate source task that extracts rich information from the CT images is vital. Currently, self-supervised learning methods, including MoCo (v1 [26] and v2 [27]) and SimCLR [28], have achieved SOTA performance on tasks for natural images, but usually result in poor performance on tasks related to medical images [33]. Essentially, these methods apply two sets of random data augmentations to the same image and then force the network to make a positive prediction; a positive pair thus contains two views of the same image (one image, in effect). Negative pairs are added, where a negative pair contains two different images. By doing this, inherent patterns of the input images are learned and can be transferred to other target tasks. However, the data augmentations used in these tasks are commonly used techniques (such as rotation and flipping) and may not be strong enough to extract semantic patterns from medical images. In addition, medical images such as lung CT images are usually symmetric in structure, and applying data augmentation to the whole image may fail to extract distinguishing features from a specific region.

In this work, we propose a novel self-supervised learning method to learn multi-scale representations for lung CT images, as illustrated in Figure 2. Our proposed method is composed of two branches: image-scale representation learning for the whole lung structure, and region-scale learning for the different substructures of a lung. We use a contrastive self-supervised pipeline [26] for the former. For the latter, we design a task based on prior domain knowledge of the biological structure of the human lung. Specifically, humans have two lungs, a right lung and a left lung. The right lung has three lobes: the upper lobe, the middle lobe and the lower lobe. The left lung is a little smaller and is composed of two lobes: the upper lobe and the lower lobe [34]. These lobes play different roles in biological processes, and for some diseases, infection of a specific lobe can serve as an important indicator for medical diagnosis [5], [34]. Hence, it is important to learn inherent patterns for each specific lobe and highlight the structural divergence among all lobes.

We first perform image-scale learning based on the whole CT image. A contrastive self-supervised learning framework is used, an input sample of which contains a CT image X to form a positive pair, and a set of different CT images {X_i | i = 1, ..., n} to form negative pairs. Specifically, two sets of random augmentations are applied to the same image X, resulting in two images X_q and X_k. X_q is passed to the query encoder to generate a query vector x_q, and X_k is passed to the key encoder to generate a key vector x_{k+}. In addition, the images {X_i | i = 1, ..., n} are also taken by the same key encoder to obtain a set of key vectors {x^i_{k-} | i = 1, ..., n}. As a result, (x_q, x_{k+}) forms a positive pair and the (x_q, x^i_{k-}) form n negative pairs for the contrastive loss, which is expressed as

L_cons = -log [ exp(x_q · x_{k+} / τ) / ( exp(x_q · x_{k+} / τ) + Σ_{i=1}^{n} exp(x_q · x^i_{k-} / τ) ) ],   (2)

where τ is a temperature parameter. Intuitively, L_cons is the log-loss of an (n + 1)-way softmax classifier whose input is x_q and whose correct label is x_{k+}. Generally, the query encoder and the key encoder share the same network architecture but differ in how their weights are updated. The weights of the query encoder are updated by back-propagation. Different from the query encoder, the key encoder takes as input a large set of images {X_i | i = 1, ..., n}, which usually makes it intractable to update its weights by back-propagation. MoCo [26], [27] adopts a momentum update strategy for the key encoder:

θ_k ← m θ_k + (1 − m) θ_q,   (3)

where m is the momentum parameter, which usually takes a large value such as 0.999, and θ_k and θ_q denote the weights of the key encoder and the query encoder, respectively.
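A compact sketch of Equations 2 and 3 follows, assuming the encoder outputs are already L2-normalized vectors (a common convention in MoCo-style pipelines, not stated explicitly above):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x_q, x_k_pos, x_k_negs, tau=0.07):
    """Equation 2: an (n+1)-way softmax classifier that should assign the
    query x_q to its positive key rather than any of the n negative keys.
    Shapes: x_q and x_k_pos are (B, d); x_k_negs is (n, d)."""
    pos = (x_q * x_k_pos).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg = x_q @ x_k_negs.t() / tau                           # (B, n)
    logits = torch.cat([pos, neg], dim=1)                    # (B, n + 1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)   # positive is class 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    """Equation 3: theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_k, p_q in zip(key_encoder.parameters(), query_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```

In training, the query encoder is updated by back-propagating contrastive_loss, while momentum_update is called once per iteration to slowly pull the key encoder toward the query encoder.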
Notably, we also use the two groups of networks for the query and key encoders to compare the transferability of ATTNs and convolutional layers.

In region-scale learning, we propose to learn distinguishing patterns of the five lobes. To this end, we carefully design a task to predict the positions of all five lobes. Given an input lung CT image X, we first generate five regions that cover the five lobes as

(X_ru, X_rm, X_rl, X_lu, X_ll) = Generator(X),   (4)

where X_ru, X_rm, X_rl, X_lu and X_ll denote the regions covering the right upper lobe, the right middle lobe, the right lower lobe, the left upper lobe and the left lower lobe, respectively, and Generator denotes a composition of operators that generates these five regions from an input lung CT image. Generator is composed of three operators: the locate, crop and resize operators. The locate operator contains three steps. First, we compute a location tuple (x_i, y_i, w_i, h_i) for each region i ∈ {ru, rm, rl, lu, ll} from Figure 2, where x_i and y_i denote the coordinates of the center point of region i, and w_i and h_i denote the width and height of region i, respectively. After that, for a given lung CT image, we generate a boundary map that separates the lung from its peripheral tissue; based on this boundary map, an image that contains only the lung can be obtained. To generate a boundary map, we slide a 2D window of size 3 × 3 pixel by pixel over the input lung image. If a window contains more than one distinct pixel value, the center pixel of the window is marked as a boundary pixel. Finally, we compute the five location tuples for the obtained lung image based on the location tuples (x_i, y_i, w_i, h_i), where i ∈ {ru, rm, rl, lu, ll}. This is possible because the position and spatial size of each region of the human lung are roughly fixed. We then use the crop operator to crop out the five regions based on the location tuples. Finally, the resize operator is applied such that each region is resized to the size of the original lung CT image.

We employ either of the two groups of networks to generate a representation for each region. In this way, the networks are forced to understand and learn inherent knowledge for each lobe of a lung. After that, classifiers are used to predict the positions of all five lobes. This is essentially a multi-task learning problem: the input is the five regions, each of which covers a lobe, and for each region the correct class is its predefined index. Formally, given an input lung CT image X, the region-aware loss L_ra for region-scale learning is computed as

L_ra = Σ_{i ∈ {ru, rm, rl, lu, ll}} CE(x_i, y_i),   (5)

where CE(x_i, y_i) is the cross-entropy loss for region i ∈ {ru, rm, rl, lu, ll}, x_i ∈ R^5 is the output of the classifier, and y_i ∈ R^5 is a one-hot vector indicating the correct class for region i. Notably, the parameters are shared across all five input regions.

The proposed region-scale learning task has two advantages. Firstly, it learns specific patterns for each lobe, thereby extracting subtle information at the lobe level. Secondly, by treating each lobe as a center lobe, the other four lobes can be viewed as the context for the center lobe. Thus the center lobe is highlighted, and doing well on this task requires a distinguishing representation of each lobe.
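A minimal sketch of the region branch follows. The normalized lobe boxes, the encoder and the classifier below are hypothetical placeholders: the paper's actual location tuples come from Figure 2 and the boundary-map step, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

# Hypothetical normalized location tuples (center x, center y, width, height)
# for the five lobe regions; illustrative values only.
LOBE_BOXES = {
    "ru": (0.30, 0.25, 0.45, 0.40),   # right upper lobe
    "rm": (0.30, 0.50, 0.45, 0.25),   # right middle lobe
    "rl": (0.30, 0.75, 0.45, 0.40),   # right lower lobe
    "lu": (0.70, 0.30, 0.45, 0.50),   # left upper lobe
    "ll": (0.70, 0.75, 0.45, 0.45),   # left lower lobe
}
REGION_INDEX = {name: idx for idx, name in enumerate(LOBE_BOXES)}

def generate_regions(ct, out_size=224):
    """Crop and resize operators of Equation 4: cut out the five lobe regions
    and resize each back to the input resolution. ct has shape (C, H, W)."""
    _, h, w = ct.shape
    regions = {}
    for name, (cx, cy, rw, rh) in LOBE_BOXES.items():
        x0, x1 = int(max(cx - rw / 2, 0) * w), int(min(cx + rw / 2, 1) * w)
        y0, y1 = int(max(cy - rh / 2, 0) * h), int(min(cy + rh / 2, 1) * h)
        crop = ct[:, y0:y1, x0:x1].unsqueeze(0)
        regions[name] = F.interpolate(crop, size=(out_size, out_size),
                                      mode="bilinear", align_corners=False)[0]
    return regions

def region_aware_loss(encoder, classifier, regions):
    """Equation 5: sum of 5-way cross-entropy losses, one per lobe region.
    The encoder and classifier are shared across all five regions."""
    loss = 0.0
    for name, region in regions.items():
        logits = classifier(encoder(region.unsqueeze(0)))        # (1, 5)
        target = torch.tensor([REGION_INDEX[name]])
        loss = loss + F.cross_entropy(logits, target)
    return loss
```

Because each region is labeled only by its own index, a shared encoder can drive this loss down only if its representations distinguish the lobes from one another.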
As introduced in Section 5.2, an input sample for our self-supervised learning framework contains a CT image X to form a positive pair, and a set of different CT images {X_i | i = 1, ..., n} to form negative pairs. Formally, the final loss L for this input sample is computed as a weighted sum of the contrastive loss of image-scale learning and the proposed region-aware loss of region-scale learning:

L = α_1 L_cons + α_2 ( L_ra + Σ_{i=1}^{n} L^i_ra ),   (6)

where α_1 and α_2 are hyper-parameters, L_cons is defined in Equation 2, L_ra is the region-aware loss for the image X, and L^i_ra is the region-aware loss for an image X_i from the image set {X_i | i = 1, ..., n}. During back-propagation, the network is forced to learn representations at both the lung level and the lobe level, distinguishing the functionality of each lobe and capturing sufficient semantic information to achieve better representations for lung CT images. Note that we use the same network in both branches; that is to say, the weights of the feature extractors used in the two branches are shared. Similar to supervised transfer learning, we apply each of the two groups of networks as the backbone network in the self-supervised learning framework and then transfer the knowledge to the target task. The transferability of ATTNs and convolutional layers in self-supervised learning is then explored.

We use two approaches to estimate the transferability of learned representations. We denote a network that is not pretrained on a source task as N_n, and the same network that is pretrained on a source task as N_p. First, we train both N_n and N_p on the target task and record the prediction performance. The metrics include accuracy, F1-score and AUC. For any metric, the performance of N_n and N_p is denoted as P_n and P_p, respectively. The divergence of each metric between P_n and P_p can then be used to estimate the transferability of the representations learned by the employed network on the source task.

In addition to evaluating transferability by running experiments on the target task, we employ LEEP [35] to directly estimate transferability based on the pretrained model and statistics of the target dataset. LEEP can only be applied to supervised transfer learning, where the source data has labels. Assume labels of the source data are in a label set Z, input instances of the target data are in the domain X ⊆ R^N, and labels of the target data are in a label set Y. Given a pretrained network N and a target dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where n is the number of data samples in the target set and x_i ∈ X is obtained by flattening an image into a vector, the LEEP score L can be computed as

L(N, D) = (1/n) Σ_{i=1}^{n} log ( Σ_{z ∈ Z} P(y_i | z) N(x_i)_z ),   (7)

where P(y_i | z) is the empirical conditional distribution of the target label y_i given the source label z, and N(x_i)_z is the probability of the source label z output by the network when taking x_i as input. Similarly, the LEEP scores based on N_n and N_p with the same target dataset can be denoted as L_n and L_p, respectively. The divergence between L_n and L_p can be used to estimate the transferability of N_n and N_p from the same source task.
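The LEEP score of Equation 7 can be computed in a few lines once the pretrained network's source-label probabilities have been collected for the target samples. The sketch below uses NumPy and hypothetical input arrays; it is an illustration of the formula, not the reference implementation of [35].

```python
import numpy as np

def leep_score(source_probs, target_labels, num_target_classes):
    """LEEP (Equation 7): average log of the expected empirical prediction.

    source_probs:  (n, |Z|) array; row i holds N(x_i)_z, the probabilities the
                   pretrained network assigns to each source label z for x_i.
    target_labels: (n,) integer array of target labels y_i.
    """
    n = source_probs.shape[0]
    # Empirical joint distribution P(y, z) over target and source labels.
    joint = np.zeros((num_target_classes, source_probs.shape[1]))
    for probs, y in zip(source_probs, target_labels):
        joint[y] += probs / n
    # Empirical conditional distribution P(y | z) = P(y, z) / P(z).
    cond = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)
    # Expected empirical prediction for each sample: sum_z P(y_i | z) N(x_i)_z.
    eep = (source_probs * cond[target_labels]).sum(axis=1)
    return float(np.log(np.maximum(eep, 1e-12)).mean())
```

Comparing a pretrained model N_p with its non-pretrained counterpart N_n then amounts to comparing their leep_score values on the same target data, without any finetuning.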
We use two datasets to pretrain models and perform supervised transfer learning. First, we use ImageNet, which contains millions of natural images, to pretrain the models. Even though natural images follow different distributions from CT images, pretraining on ImageNet enables models to learn abundant patterns that are shared by natural and medical images. After that, we use COVIDx [7], a collection of images from the medical domain, to extract similar patterns that are shared by medical images. We use COVID19-CT to perform self-supervised transfer learning and CT-based diagnosis of COVID-19. To the best of our knowledge, it is the largest publicly available CT dataset for COVID-19.

The COVIDx dataset is a publicly available labeled dataset containing chest X-ray (CXR) images for COVID-19 detection. The dataset contains 16898 images in total, among which 573 images are COVID-19 cases, 5559 images are regular (non-COVID-19) pneumonia cases, and the remaining 8066 are normal cases. The dataset is generated from five sources: the COVID-19 Image Data Collection [36], the COVID-19 Chest X-ray Dataset Initiative [37], the ActualMed COVID-19 Chest X-ray Dataset Initiative [38], the RSNA Pneumonia Detection Challenge dataset [39] and the COVID-19 radiography database [40]. Generally, CXR imaging is a low-cost, first-look technique compared with CT scanning. CXR images usually have lower quality than CT images but can be obtained much faster. Because they are easier to obtain, existing CXR datasets are much larger than the CT datasets for COVID-19 diagnosis. Nevertheless, the CXR imaging and CT scanning techniques have much in common. CXR imaging uses a small amount of radiation that passes through the chest to form an image, and CT scanning is essentially a more detailed type of CXR that produces more comprehensive views of the chest. In this sense, the resulting CXR and CT images follow similar distributions and may share common patterns for image representation learning. It is a vital step to learn and transfer knowledge to the target dataset from a source dataset that is larger but follows a similar distribution to the target dataset.

We employ two backbone networks, ResNet-50 and ResNet-101, in which only convolutional layers are used to extract features from images. We then investigate two scenarios for each backbone network: adding 1 ATTN and adding 5 ATTNs. For the former, the ATTN is inserted in the res_3 block. For the latter, 2 ATTNs are inserted in each of the res_3 and res_4 blocks, and another ATTN is inserted in the res_5 block. This results in a total of six networks: each of ResNet-50 and ResNet-101 has three variants, namely the baseline, the baseline with 1 ATTN and the baseline with 5 ATTNs. Notably, all ATTNs are added at the end of the corresponding residual blocks.

We first perform STL-N and pretrain all six networks on the ImageNet ILSVRC 2012 image classification dataset [41], which contains 1.2 million natural images for training, 50 thousand for validation and another 50 thousand for testing, covering 1000 classes in total. We adopt the same data augmentation pipeline as in [32]. Specifically, each image is scaled to 256×256 and a patch of size 224×224 is randomly cropped as a training sample. A horizontal flip is randomly applied to each cropped patch with a probability of 0.5. We employ dropout [42] with a rate of 0.8 and a weight decay of 1e-4 to avoid over-fitting. To optimize the models, we employ the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and train for 90 epochs. The initial learning rate is set to 0.1 and decays by 0.1 every 30 epochs. We use 8 TITAN Xp GPUs and the batch size is set to 512 for training.
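The STL-N augmentation pipeline above maps directly onto standard torchvision transforms. In the sketch below, the normalization statistics are the usual ImageNet values, an assumption not stated in the text:

```python
from torchvision import transforms

# Training-time augmentation for STL-N: scale to 256x256, randomly crop a
# 224x224 patch, and flip horizontally with probability 0.5.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

# A deterministic counterpart with a center crop, matching the inference-time
# protocol described later for the target task.
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```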
We then perform STL-M to pretrain the models on the COVIDx dataset. The data augmentation scheme is composed of several techniques, including translation, rotation, horizontal flip and intensity shift, each applied randomly with a probability of 0.5. SGD is used with a learning rate of 2e-4, and dropout with a keep rate of 0.5 is adopted to avoid over-fitting. The number of epochs is set to 30 and the batch size is 64 during the pretraining procedure.

During SSTL-M, we follow the same data augmentation scheme as in [26]. Specifically, all images as well as cropped regions are scaled to 256×256, and a patch of spatial size 224×224 is randomly cropped from each image. Then color jittering, horizontal flip and grayscale conversion are applied randomly with a probability of 0.5. The temperature parameter τ in Equation 2 is set to 0.07. The weights in Equation 6 are set to α_1 = 0.8 and α_2 = 0.8. SGD is employed as the optimizer with a weight decay of 1e-4 and a momentum of 0.9. Training is performed for 200 epochs. The initial learning rate is 0.3 and decays by 0.1 at the 120th and 160th epochs. The batch size is set to 256 with 8 TITAN Xp GPUs. When comparing our method with MoCo and SimCLR, we use the same hyperparameters as in their papers [26], [28].

After finishing the pretraining procedures, we use each pretrained model as a starting point and finetune it on the target dataset COVID19-CT for COVID-19 prediction. We use the same optimizer and hyperparameters as in SSTL-M. The data augmentation scheme is also the same during training. During inference, a center-cropped patch of size 224×224 is used for each image, and the other augmentation techniques are the same.

We first investigate how powerful the networks with ATTNs are on ImageNet. We denote ResNet-50 as R-50 and ResNet-101 as R-101 for convenience. Each of R-50 and R-101 has three variants, including the baseline CNN, the baseline with 1 ATTN and the baseline with 5 ATTNs; we denote them as baseline, 1 ATTN and 5 ATTNs for convenience. We report the top-1 and top-5 accuracies on the ImageNet validation set, as the test data has no labels. In addition, we also compute the performance improvements compared with the baselines. The results are reported in Table 1. We can observe from the table that by adding ATTNs, the performance improvement is less than or around 1% on the ImageNet validation set.

We then conduct experiments to examine the overall performance of all the models on the target dataset. These models are all pretrained on the three source tasks STL-N, STL-M and SSTL-M in order. The metrics include accuracy, F1-score and AUC. For each metric, we also compute the performance improvements compared with the baselines. The results are reported in Table 2.

TABLE 2: Overall performance for COVID-19 CT image prediction of all six models pretrained with STL-N, STL-M and SSTL-M in terms of accuracy (%), F1-score (%) and AUC (%). There are two columns for each metric: the value column provides the original results, and the improv. column shows the improvements over the baselines.

We can find from the table that adding ATTNs to the networks and then performing transfer learning can significantly improve prediction performance. In terms of accuracy, adding 1 ATTN for transfer learning leads to an average improvement of 4.5%, and adding 5 ATTNs leads to an average improvement of 5.3%, compared with the baseline ResNets. Similar results can be found for F1-score and AUC. These results indicate the effectiveness of integrating ATTNs into our proposed multi-stage transfer learning framework.

We next investigate the effectiveness of our proposed self-supervised learning method. For both the R-50 and R-101 groups, we first conduct the same supervised transfer learning tasks STL-N and STL-M.
After that, we apply the trained models to two state-of-the-art self-supervised learning frameworks, MoCo [26] and SimCLR [28], as well as to our method, which we denote as X w MoCo, X w SimCLR and X w Ours, respectively, where X denotes the baseline, adding 1 ATTN or adding 5 ATTNs. The experimental results are reported in Table 3. We can observe from the table that models with our method consistently outperform models with MoCo or SimCLR on all three metrics. Specifically, considering all three networks together, our method outperforms MoCo and SimCLR by average margins of 0.6% and 0.5% in terms of accuracy for the R-50 group, and by average margins of 0.7% and 0.6% in terms of accuracy for the R-101 group. Similar results are obtained for F1-score and AUC. This indicates that, by using a multi-scale learning framework and considering distinguishing patterns of local lobes, our method can successfully extract useful inherent patterns from the CT data, thereby leading to performance improvements for COVID-19 diagnosis based on CT images.

We also design experiments to explore whether ATTNs bring benefits to transfer learning. For both the R-50 and R-101 groups, we conduct two sets of experiments. First, we directly optimize the networks on the target dataset COVID19-CT without transfer learning, denoted X w/o TL, where X denotes the baseline, adding 1 ATTN or adding 5 ATTNs. Second, we apply all the networks to our multi-stage transfer learning framework; that is, we perform all three stages of pretraining for all the networks and then finetune the pretrained models on the target dataset. We denote such models as X w TL, where X has the same meaning as above. The performance is evaluated on the test set of the COVID19-CT dataset. We then compute the improvements obtained by using transfer learning for the two groups of networks, and the results are reported in Table 4.

TABLE 4: Results for COVID-19 CT image prediction of all six networks with and without transfer learning in terms of accuracy (%), F1-score (%) and AUC (%). There are two columns for each metric: the value column provides the original results, and the improv. column shows the performance improvements of networks with transfer learning compared with networks without transfer learning.

We can observe from the table that models with the proposed transfer learning framework consistently improve the prediction performance on the target dataset. This indicates that our multi-stage transfer learning framework successfully extracts important patterns shared between the source images and the target images. More importantly, the table shows that networks with ATTNs achieve much larger performance improvements through transfer learning than the baseline ResNets. In terms of accuracy on ResNet-50, the improvement for the baseline is 4.5%, while the improvements for adding 1 ATTN and adding 5 ATTNs are 8.2% and 9.2%, respectively. The benefits induced by attention are 2.7% and 3.7%. Similar results can be observed for ResNet-101, where the benefits induced by attention are 2.5% and 2.6%. Consistent conclusions that attention helps transfer learning can be drawn for F1-score and AUC. For the two settings with ATTNs, the benefits brought by attention are 5.9% and 7.0% in terms of F1-score for ResNet-50, and 4.7% and 5.9% for ResNet-101. In terms of AUC, attention helps with margins of 2.5% and 4.2% for ResNet-50, and margins of 3.1% and 5.7% for ResNet-101.
These results indicate that, compared with convolution, attention helps transfer more useful knowledge from the source tasks to the target task in transfer learning. By adding ATTNs, the network is capable of learning important common patterns of images in the pretraining stage, thereby leading to significant performance improvements on the target task.

In this section, we conduct an ablation study to examine the benefits attention brings to the two supervised transfer learning tasks STL-N and STL-M. Instead of performing the two tasks sequentially, we conduct pretraining on natural images and on medical images separately to explore how attention benefits transfer learning from different domains. Specifically, we name the models X w/o TL, X w STL-N and X w STL-M, respectively, where X denotes the baseline, adding 1 ATTN or adding 5 ATTNs. We compute the improvements for each stage of transfer learning for both the R-50 and R-101 groups, and the results are reported in Table 5. We can observe from the table that attention consistently benefits both STL-N and STL-M. Specifically, on ResNet-50, adding 1 ATTN benefits STL-N by a margin of 1.8% in terms of accuracy, 4.6% in terms of F1-score, and 1.7% in terms of AUC, and benefits STL-M by a margin of 2.0% in terms of accuracy, 4.7% in terms of F1-score, and 3.5% in terms of AUC. Still on ResNet-50, adding 5 ATTNs benefits STL-N by a margin of 2.8% in terms of accuracy, 4.7% in terms of F1-score, and 1.5% in terms of AUC, and benefits STL-M by a margin of 1.7% in terms of accuracy, 3.0% in terms of F1-score, and 1.3% in terms of AUC. Similar results are obtained on ResNet-101, where attention consistently benefits both STL-N and STL-M on all three metrics. These results indicate that attention helps transfer much more useful knowledge than convolution in supervised transfer learning, regardless of whether the source data follows a distribution different from or similar to that of the target data.

In this section, we compute LEEP scores to examine the functionality of attention in transfer learning. Instead of optimizing parameters and making predictions on the target dataset, we directly compute LEEP scores based on the source models and statistics of the target data. As LEEP scores can only be computed for supervised transfer learning, we conduct experiments on STL-N and STL-M to obtain LEEP scores. We add STL-N and STL-M in order, which results in the models X w/o TL, X w STL-N and X w STL-N w STL-M, respectively, where X denotes the baseline, adding 1 ATTN or adding 5 ATTNs. The results are reported in Table 6 and shown in Figure 3. The curve for each network setting in Figure 3 is composed of two line segments, each of which indicates the improvement from adding one stage of pretraining; the slope of each segment reflects the improvement in the LEEP score obtained by adding the corresponding stage of transfer learning. We can observe from the figure that both adding 1 ATTN and adding 5 ATTNs achieve larger improvements than the baseline ResNets for both stages. This again demonstrates that attention helps transfer learning regardless of the divergence between the distributions of the source datasets and the target dataset.

In addition to the quantitative results in the above sections, we provide qualitative results to show the capability of our proposed multi-stage attentive transfer learning framework in detecting important regions in CT images for COVID-19 diagnosis. We use ResNet-50 as the baseline and then add 5 ATTNs to it.
Both networks are pretrained on the three source tasks and finetuned on the target task. For the baseline CNN, we use Grad-CAM [43] to visualize the feature maps of the last convolutional layer right before the global average pooling. For the network with 5 ATTNs, we visualize the last ATTN, inserted in res_5, by averaging the attention score maps over all pixels to obtain the final attention map. As shown in Figure 4, after performing transfer learning, an ATTN layer can successfully detect regions that are severely infected by the virus. However, convolutional layers fail in some cases, where uninfected regions are highlighted. This again demonstrates the effectiveness of using attention in our proposed multi-stage transfer learning framework.

We propose a unified transfer learning framework and examine how attention facilitates transfer learning for improving COVID-19 diagnosis. We first design a multi-stage transfer learning framework, which consists of supervised transfer learning from natural images (STL-N), supervised transfer learning from medical images (STL-M) and self-supervised transfer learning from medical images (SSTL-M). This framework allows transferring knowledge from data of different domains, such as large-scale labeled natural images, large-scale labeled medical images and the same CT images. As existing self-supervised learning methods usually generate poor results on tasks related to medical images, we propose a novel self-supervised learning method based on an understanding of the substructures of the human lung. The method is integrated as the last stage in our transfer learning framework, and self-supervised transfer learning is performed to reuse the complex patterns learned from the same CT images. Experimental results show that our method outperforms several SOTA baseline methods. For the networks used in our transfer learning framework, we integrate self-attention layers into ResNets and apply them to the proposed transfer learning framework. Experimental results demonstrate that attention has higher transferability than convolution. To the best of our knowledge, this is the first work to compare the transferability of attention and convolution when the source tasks and data are the same in transfer learning.

[1] Laboratory diagnosis of covid-19: current issues and challenges
[2] Diagnosing covid-19: the disease and tools for detection
[3] Chest ct for typical 2019-ncov pneumonia: relationship to negative rt-pcr testing
[4] Clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan, china: a retrospective cohort study
[5] Chest ct findings in coronavirus disease-19 (covid-19): relationship to duration of infection
[6] Coronavirus disease 2019 (covid-19): role of chest ct in diagnosis and management
[7] Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images
[8] Rapid ai development cycle for the coronavirus (covid-19) pandemic: Initial results for automated detection & patient monitoring using deep learning ct image analysis
[9] Artificial intelligence and machine learning to fight covid-19
[10] Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
[11] A deep learning algorithm using ct images to screen for corona virus disease
[12] Sample-efficient deep learning for covid-19 diagnosis based on ct scans
[13] A survey on transfer learning
[14] A survey of transfer learning
[15] How transferable are features in deep neural networks?
[16] Unsupervised visual representation learning by context prediction
[17] Unsupervised representation learning by predicting image rotations
[18] Rethinking data augmentation: Self-supervision and self-distillation
[19] Revisiting self-supervised visual representation learning
[20] S4l: Self-supervised semi-supervised learning
[21] Bert: Pre-training of deep bidirectional transformers for language understanding
[22] Faster r-cnn: Towards real-time object detection with region proposal networks
[23] Transfusion: Understanding transfer learning for medical imaging
[24] Dermatologist-level classification of skin cancer with deep neural networks
[25] Discriminative unsupervised feature learning with exemplar convolutional neural networks
[26] Momentum contrast for unsupervised visual representation learning
[27] Improved baselines with momentum contrastive learning
[28] A simple framework for contrastive learning of visual representations
[29] Non-local neural networks
[30] Attention is all you need
[31] Tensor decompositions and applications
[32] Deep residual learning for image recognition
[33] Self-supervised learning for medical image analysis using image context restoration
[34] Gray's Anatomy for Students E-Book
[35] Leep: A new measure to evaluate transferability of learned representations
[36] Covid-19 image data collection
[37] Bworld robot control software
[38] Actualmed COVID-19 chest x-ray data initiative
[39] Rsna pneumonia detection challenge
[40] COVID-19 radiography database
[41] ImageNet: A Large-Scale Hierarchical Image Database
[42] Dropout: a simple way to prevent neural networks from overfitting
[43] Grad-cam: Visual explanations from deep networks via gradient-based localization

This work was supported by National Science Foundation grants DBI-1922969 and IIS-1908220.