key: cord-0795113-3i3scjt4
authors: Hou, Junlin; Xu, Jilan; Jiang, Longquan; Du, Shanshan; Feng, Rui; Zhang, Yuejie; Shan, Fei; Xue, Xiangyang
title: Periphery-aware COVID-19 Diagnosis with Contrastive Representation Enhancement
date: 2021-05-06
journal: Pattern Recognit
DOI: 10.1016/j.patcog.2021.108005
sha: 79fd384870207d888d41a445e85cf09306a83a62
doc_id: 795113
cord_uid: 3i3scjt4

Computer-aided diagnosis has been extensively investigated for more rapid and accurate screening during the outbreak of the COVID-19 epidemic. However, it remains challenging to distinguish COVID-19 in the complex scenario of multi-type pneumonia classification and to improve the overall diagnostic performance. In this paper, we propose a novel periphery-aware COVID-19 diagnosis approach with contrastive representation enhancement to identify COVID-19 from influenza-A (H1N1) viral pneumonia, community acquired pneumonia (CAP), and healthy subjects using chest CT images. Our key contributions include: 1) an unsupervised Periphery-aware Spatial Prediction (PSP) task which is designed to introduce important spatial patterns into deep networks; 2) an adaptive Contrastive Representation Enhancement (CRE) mechanism which can effectively capture the intra-class similarity and inter-class difference of various types of pneumonia. We integrate PSP and CRE to obtain representations which are highly discriminative in COVID-19 screening. We evaluate our approach comprehensively on our constructed large-scale dataset and two public datasets. Extensive experiments on both volume-level and slice-level CT images demonstrate the effectiveness of our proposed approach with PSP and CRE for COVID-19 diagnosis.

Coronavirus Disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is spreading quickly worldwide. To date (December 29th 2020), more than 79 million confirmed cases have been reported in over 200 countries and territories, with a mortality rate of 2.2% [1].
Considering the pandemic of COVID-19, early detection and treatment are of great importance for slowing viral transmission and controlling the disease. Currently, Reverse Transcription-Polymerase Chain Reaction (RT-PCR) testing remains the diagnostic gold standard for COVID-19. As a reliable complement to the RT-PCR assay, thoracic computed tomography (CT) has also been recognized as a major tool for clinical diagnosis in many hard-hit regions. As shown in Figure 1 (a), some characteristic radiological patterns can be clearly detected in chest CT images, including ground-glass opacity (GGO), crazy-paving pattern, and later consolidation. Bilateral, peripheral, and lower zone predominant distributions are mostly observed [2]. To assist radiologists in more efficient pneumonia screening during the outbreak of COVID-19, we aim to develop a deep learning based approach for automatic diagnosis in complex diagnostic scenarios using chest CT images.

Deep learning approaches have demonstrated significant improvement in the field of medical image analysis. In fighting against COVID-19, these approaches have also been widely applied to lung and infection region segmentation [3, 4, 5, 6], as well as to clinical diagnosis and assessment [7, 8, 9, 10]. The review articles [11, 12] have thoroughly investigated the advanced deep learning research and techniques associated with COVID-19. However, there are still remaining issues to be addressed: 1) Some typical spatial patterns

In this paper, we propose a novel COVID-19 diagnosis approach based on CNNs with spatial pattern prior knowledge and a representation enhancement mechanism to distinguish COVID-19 in the complex scenario of multi-type pneumonia diagnosis. Based on the spatial patterns of COVID-19, we design an unsupervised Periphery-aware Spatial Prediction (PSP) to endow our pre-trained network with spatial location awareness.
We then construct an adaptive joint learning framework that employs Contrastive Representation Enhancement (CRE) as an auxiliary task for a more discriminative and accurate COVID-19 diagnosis. In summary, the main contributions of this paper that significantly differ from earlier works are the following three aspects:

1) We propose a novel diagnosis approach to efficiently diagnose COVID-19 from H1N1, CAP, and healthy cases. To the best of our knowledge, our approach is the first to exploit the spatial pattern prior and contrastive-learning-based representation enhancement to further improve the diagnostic performance.

2) We integrate spatial patterns by introducing a Periphery-aware Spatial Prediction (PSP) to pre-train the network without any human annotations. We also construct an adaptive Contrastive Representation Enhancement (CRE) mechanism by discriminating pneumonia-type guided positive and negative pairs in the latent space.

3) Extensive experiments and analyses are conducted on our constructed large-scale dataset and two public datasets. The quantitative and qualitative results demonstrate the superiority of our proposed model on COVID-19 diagnosis with both 3D and 2D CT data types.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 describes in detail the periphery-aware COVID-19 diagnosis framework with the contrastive representation enhancement mechanism. Section 4 discusses the experimental settings, results, comparisons and analyses on our constructed CT dataset and two public datasets. Finally, conclusions and future work are given in Section 5.

Although these attempts have demonstrated their validity in COVID-19 screening, some drawbacks remain in clinical application. The spatial pattern, as an important basis for clinical diagnosis, has not been effectively integrated into the deep network.
Additionally, the intra-class similarity and inter-class difference among various types of pneumonia need to be further explored.

Contrastive learning, as the name implies, aims to learn data representations by contrasting positive and negative samples. It is at the core of recent works on self-supervised learning, which generally use a contrastive loss called the InfoNCE loss. Contrastive Predictive Coding (CPC) [19] was a pioneering work that first put forward the concept of the InfoNCE loss. It learned representations by predicting "future" information in a sequence. Moreover, instance-level discrimination has become a popular paradigm in recent contrastive learning works. It aims to learn representations by imposing transformation invariances in the latent space. Two transformed images from the same image are regarded as positives, which should be closer together than those from different images in the representation space. Several recent works, such as SimCLR [20] and MoCo [21], have followed this pipeline and achieved great empirical success in self-supervised representation learning.

There are also some studies that have attempted to explore the poten-

al. [24] adopted a prototypical network for few-shot COVID-19 diagnosis, which is pre-trained by the momentum contrastive learning method [21]. Li et al. [25] presented a contrastive multi-task CNN which can improve the generalization on unseen CT or X-ray samples for COVID-19 diagnosis. In our work, we fully exploit contrastive learning to obtain more discriminative representations between COVID-19 and other types of pneumonia, i.e. H1N1 and CAP, in chest CT images. We further introduce the pneumonia-type information into contrastive learning for better exploration of intra-class similarity and inter-class difference. In particular, we adopt it as an auxiliary learning task that can effectively improve the classification of COVID-19 from other types of pneumonia.
To make a more accurate diagnosis of COVID-19 from other types of pneumonia, we construct a periphery-aware deep learning network with contrastive representation enhancement (CRE), as illustrated in Figure 2. It takes a CT sample as the input and outputs the pneumonia classification result in an end-to-end manner. We first perform the Periphery-aware Spatial Prediction (PSP) to obtain the pre-trained encoder network. Then the CRE, as an auxiliary task, is adaptively combined with the pneumonia classification for more discriminative representation learning. We use ResNet [26] as the backbone architecture since it has proven effective in previous works.

contrastive learning manner, discriminating the positive (orange arrows) and negative (blue arrows) pairs after being mapped by the projection network into the $d_p$-dimensional space. The enhanced representations can promote more precise diagnostic performance.

To integrate the spatial patterns of infections into the pneumonia classification, we design an unsupervised Periphery-aware Spatial Prediction (PSP) to pre-train the encoder network without any human annotations. Given a CT image $X$ and its automatically generated boundary distance map $D$, we train a neural network to solve the boundary distance prediction problem $D = F(X)$. By learning the PSP, we aim to obtain a feature encoder that is endowed with spatial location awareness.

Inspired by the radiological sign that COVID-19 infections mostly appear in the subpleural area, the boundary distance map is created to encode, for each pixel, whether it belongs to the interior of the lung region as well as its distance to the region boundary. As shown in Figure 3, given a CT image, we first generate the segmented lung region mask by our segmentation algorithm without manual labeling, detailed in Section 4 (Preprocessing).
Then, let $Q$ denote the set of pixels on the region boundary, and $C_{lung}$ the set of pixels inside the lung region. For every pixel $p$ in the mask, we compute the truncated boundary distance $D(p)$ as:

$$D(p) = \alpha_p \, \min\Big( \min_{q \in Q} d(p, q),\; R \Big), \tag{1}$$

where $d(p, q)$ is the relative Euclidean distance between the pixels $p$ and $q$. It is divided by the segmented lung volume and then normalized to the interval [0, 1]. $R$ denotes the truncation threshold, which is the largest distance to represent. The distance is additionally weighted by the sign function $\alpha_p$ to represent whether the pixel lies inside or outside the lung region. This prediction task can be trained with the $L_2$ regression loss, as shown in Eq. (2).

Another feasible solution is to translate the dense map prediction problem into a set of pixel-wise binary classification tasks. We quantize the values in the pixel-wise map into $K$ uniform bins. To be specific, we encode the truncated distance of the pixel $p$ using a $K$-dimensional binary vector $b(p)$ as:

$$D(p) = \sum_{k=1}^{K} r_k \, b_k(p), \qquad \sum_{k=1}^{K} b_k(p) = 1, \tag{3}$$

where $r_k$ is the distance value corresponding to the $k$-th bin. The continuous pixel-wise map is thus converted into a set of $K$ binary pixel-wise maps by this one-hot encoding. A standard cross-entropy loss between the prediction and ground truth is used, as shown in Eq. (4). The classification loss has been found to be more effective for the boundary distance prediction task [27].

In our experiments, we use $R = 0.6$ and $K = 6$. We employ a UNet-style prediction network with an encoder-decoder architecture. We adopt ResNet as the encoder and construct the decoder as a mirrored version of the encoder by replacing the pooling layers with bilinear upsampling layers. A typical predicted result is shown in Figure 3. After solving the PSP problem, we discard the decoder and take the pre-trained periphery-aware encoder for the downstream pneumonia classification.
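The construction of the PSP targets described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper names (`boundary_pixels`, `truncated_distance_map`, `quantize`) are hypothetical, the distance is normalized by the largest boundary distance in the image instead of the segmented lung volume, the sign is assumed to be +1 inside the lung and -1 outside, and the $K$ bins are assumed to cover the signed range $[-R, R]$ uniformly.

```python
import math

def boundary_pixels(mask):
    """Lung pixels with a non-lung 4-neighbour or lying on the image border (the set Q)."""
    h, w = len(mask), len(mask[0])
    boundary = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if not (0 <= ni < h and 0 <= nj < w) or not mask[ni][nj]:
                    boundary.append((i, j))
                    break
    return boundary

def truncated_distance_map(mask, R=0.6):
    """Signed, truncated distance to the lung boundary for every pixel.

    Assumptions (not taken from the paper): distances are divided by the
    largest boundary distance in the image, and the sign is +1 inside the
    lung and -1 outside. The mask must contain at least one lung pixel.
    """
    Q = boundary_pixels(mask)
    h, w = len(mask), len(mask[0])
    raw = [[min(math.hypot(i - qi, j - qj) for qi, qj in Q) for j in range(w)]
           for i in range(h)]
    d_max = max(max(row) for row in raw) or 1.0
    return [[(1.0 if mask[i][j] else -1.0) * min(raw[i][j] / d_max, R)
             for j in range(w)] for i in range(h)]

def quantize(d, R=0.6, K=6):
    """One-hot vector b(p) over K uniform bins of the assumed range [-R, R]."""
    k = min(int((d + R) / (2 * R) * K), K - 1)
    return [1 if idx == k else 0 for idx in range(K)]

# Toy 5x5 slice with a 3x3 "lung" in the centre.
mask = [[1 if 1 <= i <= 3 and 1 <= j <= 3 else 0 for j in range(5)] for i in range(5)]
D = truncated_distance_map(mask)
targets = [[quantize(d) for d in row] for row in D]
```

Each pixel then carries a $K$-way classification target, which is what a pixel-wise cross-entropy loss would consume. On real CT slices a distance transform from an image-processing library would replace the brute-force search over boundary pixels.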
To learn more discriminative representations of different types of pneumonia, we develop an adaptive Contrastive Representation Enhancement (CRE) as an auxiliary task for a precise COVID-19 diagnosis. Our network architecture with CRE comprises the following components.

• A data augmentation function, $A(\cdot)$, which transforms an input CT sample $x$ into a randomly augmented sample $\tilde{x}$. We generate $m$ randomly augmented volumes from each input CT sample. To be concrete, we sequentially apply three augmentations for CT samples: 1) random cropping

Then, the representations are normalized to the unit hypersphere.

• A projection network, $P(\cdot)$, which is used to map the representation vector $r$ to a relatively low-dimensional vector $z = P(r) \in \mathbb{R}^{d_p}$ for the contrastive loss computation. A multi-layer perceptron (MLP) can be employed as the projection network. This vector is also normalized to the unit hypersphere, which enables the inner product to measure distances in the projection space. This network is only used for training with the contrastive loss.

• A classifier network, $C(\cdot)$, which classifies the representation vector $r \in \mathbb{R}^{d_e}$ to the pneumonia prediction $\hat{y}$. It is composed of a fully connected layer and the Softmax operation.

Given a minibatch of $N$ randomly sampled CT images and their pneumonia-type labels $\{(x_k, y_k)\}_{k=1,\dots,N}$, we can generate a minibatch of $2N$ samples $\{(\tilde{x}_k, \tilde{y}_k)\}_{k=1,\dots,2N}$ after performing data augmentations, where $\tilde{x}_{2k-1}$ and $\tilde{x}_{2k}$ are two randomly augmented CT samples of $x_k$, and $\tilde{y}_{2k-1} = \tilde{y}_{2k} = y_k$. The InfoNCE loss function [28] is defined on this augmented minibatch, where $z = P(E(\tilde{x}))$ stands for the representation vector of $\tilde{x}$ through the encoder $E(\cdot)$ and projection network $P(\cdot)$, $\mathbb{1} \in \{0, 1\}$ is an indicator function, and $\tau > 0$ denotes a scalar temperature hyper-parameter. Samples $z_j$ which have the same label as sample $z_i$ (i.e., $\tilde{y}_i = \tilde{y}_j$) are the positives. $N_{\tilde{y}_i}$ is the total number of samples in a minibatch that have the same label $\tilde{y}_i$.
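The label-guided contrastive loss described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, and the normalization (averaging over the positives of each anchor and then over anchors that have at least one positive) is an assumption, since the exact normalization by $N_{\tilde{y}_i}$ is not recoverable from the text here. The sketch uses plain Python lists; a real training loop would use a tensor library.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

def supervised_info_nce(z, labels, tau=0.1):
    """Label-guided InfoNCE over a minibatch of projected vectors z.

    Samples sharing the anchor's label are positives; all other samples
    are negatives. Assumption: per-anchor losses are averaged over the
    positives, then over anchors with at least one positive.
    """
    z = [l2_normalize(v) for v in z]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [k for k in range(n) if k != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        # Inner products of unit vectors, scaled by the temperature.
        sims = {k: sum(a * b for a, b in zip(z[i], z[k])) / tau for k in others}
        # Log of the Softmax denominator, computed stably.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / max(anchors, 1)

# Four projected samples: the first two and the last two point in similar directions.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_info_nce(z, [0, 0, 1, 1])   # labels agree with geometry
loss_shuffled = supervised_info_nce(z, [0, 1, 0, 1])  # labels disagree with geometry
```

The loss is small when samples of the same pneumonia type are close in the projection space and far from other types, which is the intra-class similarity and inter-class difference the CRE is meant to enforce.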
Besides, the inner product is used to measure the similarity between the normalized vectors $z_i$ and $z_j$ in the $d_p$-dimensional space. Within the context of Eq. (6), the encoder is trained to maximize the similarity between positive samples $z_i$ and $z_j$ from the same class $\tilde{y}_i$, while simultaneously minimizing the similarity between negative pairs. We compute the probability using a Softmax, and Eq. (5) sums the loss over all pairs of indices $(i, j)$ and $(j, i)$. As a result, the encoder learns to discriminate positive and negative samples, enhancing the intra-class similarity and inter-class difference.

Finally, we learn the classifier $C$ to predict the pneumonia classification results $\hat{y}$ using the standard cross-entropy loss $L_{ce}$, where $\tilde{y}_i$ denotes the one-hot vector of the ground truth label and $\hat{y}_i$ is the predicted probability of the sample $x_i$ ($i = 1, \dots, 2N$).

Different from the previous works [20, 21, 28] which separate contrastive learning and classification into two stages, we train the network with both loss functions, i.e., $L_{con}$ for the representation enhancement and $L_{ce}$ for the pneumonia classification. To improve the pneumonia classification accuracy with the enhanced representations from CRE, we design a combined objective function with adaptive weights [29] to balance the two loss functions for better joint learning performance:

$$L = \frac{1}{\sigma_1^2} L_{con} + \frac{1}{\sigma_2^2} L_{ce} + \log \sigma_1 + \log \sigma_2,$$

where $\sigma_1$ and $\sigma_2$ are used to learn the relative weights of the losses $L_{con}$ and $L_{ce}$ adaptively. We derive our joint loss function by maximizing the Gaussian likelihood with homoscedastic uncertainty. The detailed derivation is explained in the Appendix section. Algorithm 1 summarizes our proposed diagnosis approach.

Algorithm 1
[...] Temperature $\tau$, adaptive loss weights $\sigma_1$, $\sigma_2$.
[...]
[...] $L_{ce} + \log \sigma_1 + \log \sigma_2$  # adaptive joint loss
Update networks $E(\cdot)$, $P(\cdot)$, $C(\cdot)$, and weights $\sigma_1$, $\sigma_2$ to minimize $L$
end for
return Encoder $E(\cdot)$, classifier $C(\cdot)$, and weights $\sigma_1$, $\sigma_2$; throw away projection $P(\cdot)$

public COVID-19 CT datasets, namely the Covid-CT dataset [30] and the CC-CCII dataset [13]. In the following sections, we introduce our constructed 3D and 2D CT datasets and the public datasets in detail.

Besides, the CAP and healthy cases were randomly selected between Jan 3rd, 2019 and Jan 30th, 2020, and the CAP cases were confirmed positive by bacterial culture. To ensure the diversity of our dataset, we kept a gap of at least three days between infected CT samples taken from the same patient.

2D CT Dataset. In addition to the 3D volume-level labels obtained from clinical diagnosis reports, our constructed dataset provides annotations on 2D slice-level images. A slice image from the three types of pneumonia is annotated with the corresponding CT volume's category if it is determined to have infected lesions; otherwise, it is randomly selected as "Healthy". The slices taken out of every three slices from healthy controls are annotated as "Healthy". In this way, we obtain a total of 88,734 CT slices of the four classes, which is a substantial amount of CT data compared with all the existing public COVID-19 CT datasets. Moreover, the slice-level labels are annotated by five professional experts and supervised by two radiologists with more than five years of clinical experience. These annotations make our dataset versatile for both 3D and 2D diagnosis research and development.

The Covid-CT dataset [30] comprises 349 CT scans for COVID-19 and 397 CT scans that are normal or contain other types of pneumonia. The

We adopt several metrics to measure the diagnostic performance of models thoroughly.
To be concrete, we report the Sensitivity, Specificity, and Area Under Curve (AUC) for each class, as well as Accuracy, Macro-average AUC, and F1 score for overall comparison. Sensitivity and Specificity measure the proportion of positives and negatives that are correctly identified, respectively. Accuracy is the percentage of correct predictions among the total number of samples. F1 score is the harmonic mean of precision and recall. We further present statistical analysis based on the independent two-sample t-test. Besides, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) [31], which identifies the infected regions that are most relevant to the predictions.

In this work, raw data are converted into Hounsfield units (HU), a standard quantitative scale for describing radiodensity. As Figure 4 illustrates, our proposed preprocessing procedure is as follows.

For our 3D dataset, we use 3-fold cross validation. For our 2D dataset, we randomly divide it into training and testing sets with an approximate ratio of 2:1, comprising 59,413 and 29,321 slice images, respectively. In the training process, we adopt the random oversampling strategy [33] to mitigate the imbalance of training samples across different classes. The 3D ResNet18 and 2D ResNet50 models are adopted as backbone architectures for the 3D and 2D datasets, respectively. The parameter $d_e$ is 512 for 3D ResNet18 and 2,048 for 2D ResNet50.

inputs a CT volume with its lung mask and outputs the diagnosis probabilities. We retrain their approaches on our 3D dataset, and the related comparison results are presented in Table 3. As can be seen, our method "Ours (Res18+PSP+CRE)" significantly outperforms these existing methods (p<0.001) on the 3D CT dataset.

We evaluate the impact of the proposed PSP by comparing our model "Ours

We investigate the effectiveness of the proposed CRE by comparing our model "Ours (Res18+CRE)" with three baselines on our 3D dataset.
We discard the CRE and train two additional models ("Res18 (m=1, d=64)" and "Res18 (m=1, d=128)") to predict pneumonia classification as the baselines. To rule out effects specific to the data augmentation designed for the contrastive learning framework, the baselines are run under three different augmentation settings (i.e., (m=2, d=64), (m=1, d=64), and (m=1, d=128)). It is observed from the 6th to 8th rows of Table 3

As illustrated in Eq. (9), we design a joint training loss with adaptive weights, $\sigma_1$ and $\sigma_2$, to combine the contrastive loss and cross-entropy loss. During the training process, the two learnable weights are optimized along with the network's parameters. Taking the average value of the 3-fold cross validation results, we get the learned weights $1/\sigma_1^2 = 0.51$ and $1/\sigma_2^2 = 1.33$. We normalize them to the interval [0, 1] and obtain $\lambda_1^* = 0.28$ and $\lambda_2^* = 0.72$.

To analyze the validity of the learned weights, we use a grid search to compare the performance of different fixed weights. Let the joint training loss be balanced by fixed weights $\lambda_1$ and $\lambda_2$ ($\lambda_1 + \lambda_2 = 1$); it can then be expressed as:

$$L = \lambda_1 L_{con} + \lambda_2 L_{ce}.$$

We implement three tests with the configurations of $\lambda_1 = 0.05, 0.50, 0.75$, respectively. Note that when $\lambda_1 = 0$, our whole framework degenerates to the baseline that is trained with only the cross-entropy loss. The results of different weights in the loss function are illustrated in the 9th to 11th rows of Table 3. "Ours (Res18+CRE)" with the adaptively learned weights ($\lambda_1^* = 0.28$ and $\lambda_2^* = 0.72$) consistently outperforms all three fixed-weight models, with improvements of 2% to 3% on Acc and F1 score. It also achieves a comparable AUC score with marginal differences (0.05% to 0.28%). The results indicate that it is effective to learn the weights adaptively for balancing the contrastive loss and cross-entropy loss in our diagnosis approach.
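The reported values $\lambda_1^* = 0.28$ and $\lambda_2^* = 0.72$ can be reproduced with a short calculation, assuming that "normalize to the interval [0, 1]" means dividing each learned weight by their sum. That reading is an assumption, but it matches the reported numbers. The helper names below are hypothetical.

```python
def normalize_weights(w_con, w_ce):
    """Rescale two positive loss weights so that they sum to one."""
    total = w_con + w_ce
    return w_con / total, w_ce / total

def fixed_weight_loss(l_con, l_ce, lam1):
    """Joint loss with fixed weights lam1 and 1 - lam1, as used in the grid search."""
    return lam1 * l_con + (1.0 - lam1) * l_ce

# Learned weights reported in the text: 1/sigma_1^2 = 0.51 and 1/sigma_2^2 = 1.33.
lam1, lam2 = normalize_weights(0.51, 1.33)
print(round(lam1, 2), round(lam2, 2))  # 0.28 0.72
```

With `lam1 = 0`, `fixed_weight_loss` returns the cross-entropy term alone, which corresponds to the baseline described in the text.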
To show the impact of backbone models, we run experiments with 3D ResNet10, ResNet18, and ResNet34 [26], and the results are shown in the last two rows of Table 3. It can be seen that 3D ResNet10 and ResNet34 achieve inferior overall performance to ResNet18. Besides, both ResNet10 and ResNet34 obtain high sensitivity but extremely low specificity in COVID-19 diagnosis, which indicates that they tend to make more false positive predictions of COVID-19. In contrast, our ResNet18 model achieves a better overall AUC for COVID-19. We therefore adopt the best-performing 3D ResNet18 model as the backbone architecture.

which demonstrates the reliability and interpretability of the diagnosis results from our model. Therefore, the attention maps could be used as a basis for deriving the COVID-19 diagnosis in clinical practice.

We conduct low-data classification experiments to investigate the impact of our PSP as an unsupervised pre-training scheme. We start by evaluating the performance of a supervised network that is trained from scratch. Concretely, we train a separate network on each subset with various proportions of labeled data from 0.1% to 100%, and then evaluate each model's performance on our entire testing set. The detailed numbers of 2D training images corresponding to each proportion are presented in Table 4. As can be seen in Figure 6 (blue curve), the model trained from scratch tends to overfit more severely with decreasing amounts of data. When the proportion of labeled data decreases from 100% to 0.1%, the accuracy drops from 92.47% to 58.24%. We next evaluate our pre-trained periphery-aware model on the same data proportions.
Here, we pre-train the feature encoder on our entire unlabeled training set, and then learn the classifier and fine-tune the encoder using a subset

We compare our approach with the ImageNet pre-trained model, a commonly used model pre-trained on a large-scale natural image dataset [34]. Figure 6 shows that our PSP pre-trained model even surpasses the ImageNet pre-trained model (green line) in the regimes with 0.1%, 0.2%, 10%, and 20% of the data, and achieves comparable performance with 1% and 5% of the data. These results demonstrate that our unsupervised PSP pre-training method effectively introduces the important spatial pattern prior, which largely improves the diagnosis performance in most regimes, especially in the low-data regime.

To further improve the diagnostic performance in the low-data regime, we train the PSP task on the ImageNet pre-trained model and then fine-tune the "ImageNet+PSP" pre-trained model. As can be seen from the yellow curve in Figure 6, it significantly boosts the diagnosis accuracy compared with other models when the number of images decreases. It even yields a 16% improvement over the model trained from scratch in the 0.1% data regime. The results suggest that the diagnostic performance can be further improved by combining ImageNet and PSP pre-training.

Since it is difficult to acquire a large number of annotated medical images in many real-world scenarios, we further show the effect of our PSP and CRE on 1% of the CT samples of our 2D dataset, as illustrated in Table 5. Compared with the ResNet50 trained from scratch ("Res50"), the PSP pre-trained model "Res50+PSP" clearly improves the performance on both COVID-19 diagnosis and overall classification (p<0.001). The model "Res50+Mask" indicates that we reduce our PSP pre-training task to the general lung segmentation task, in which the network is trained to predict the lung region masks obtained by our unsupervised segmentation algorithm.
It can be observed that the network pre-trained by our proposed PSP task ("Res50+PSP") achieves better overall performance than the lung segmentation task ("Res50+Mask"), with 2% improvements on both Acc and F1 metrics. In COVID-19 diagnosis, "Res50+PSP" even improves the sensitivity and the AUC score by over 11% and 1.5%, respectively. The results confirm the significance of PSP with the boundary distance prediction and indicate the effectiveness of PSP in multi-type pneumonia classification and COVID-19 diagnosis.

Then, we evaluate the importance of our CRE. For each initialization (i.e., random initialization, PSP, ImageNet, ImageNet+PSP), incorporating our CRE improves the performance on all the overall metrics, as shown in the last four rows of Table 5. Our method "Res50+ImageNet+PSP+CRE" reaches the best

[22, 23, 9, 14]. Notably, in the Covid-CT dataset, COVID-19 is defined as negative (label = 0) and non-COVID-19 is positive (label = 1). It is observed from Table 6 that our method "Ours (Res50+ImageNet+CRE ($\lambda_1^*$=0.38))" clearly outperforms all

We also conduct an ablation study on the Covid-CT dataset, as shown in Table 7. With random initialization, our model "Res50+CRE ($\lambda_1^*$=0.37)" significantly outperforms the baseline model "Res50 (m=1)" by 2% Acc, 11% AUC, and 2% F1 score. When pre-trained on ImageNet, our model "Res50+ImageNet+CRE ($\lambda_1^*$=0.38)" also achieves improvements of 2% to 3% on the three metrics, compared with the baselines "Res50+ImageNet (m=1)" and "Res50+ImageNet (m=2)". We can also observe that our joint training strategy learns better adaptive loss weights ($\lambda_1^*$=0.38), which outperform the models with other fixed weights ($\lambda_1$=0.25, 0.50, 0.75).
As for the network architecture, ResNet50 is verified to be a preferable backbone architecture, compared with

To train our model "Res50+ImageNet+PSP+CRE" on the CC-CCII dataset, we first use the 2D ImageNet pre-trained ResNet50 to perform the 2D PSP task with a small number of labeled lung masks (750 masks). Then, we initialize the 3D ResNet50 model with the inflated 2D weights pre-trained on PSP [35]. Finally, we train the 3D model with CRE on the entire dataset.

Table 8 shows the results of our method and other existing approaches on the CC-CCII dataset. "Binary cls" denotes the classification between non-COVID-19 and COVID-19; "Ternary cls" is the classification among normal controls, COVID-19, and common pneumonia (CP). For "Binary cls", our model achieves 98.18% Acc, 97.75% Sen., 98.44% Spec., and 99.29% AUC, which surpasses the other approaches by a large margin. For "Ternary cls", our model obtains improvements of 4.24% on Acc and 1.37% on AUC compared with Zhang et al. [13]. Both the binary and the ternary classification tasks demonstrate the superiority of our method on the CC-CCII dataset.

In this work, we propose a periphery-aware COVID-19 diagnosis approach with contrastive representation enhancement, which detects COVID-19 in complex scenarios of multi-type pneumonia using chest CT images. In particular, we design an unsupervised Periphery-aware Spatial Prediction (PSP) pre-training task, which can effectively introduce the important spatial pattern prior to networks. It is confirmed that the model pre-trained by our proposed PSP task even outperforms the fully-supervised ImageNet pre-trained model in the low-data regime. The design of the unsupervised PSP pre-training task provides a new paradigm to inject location information into neural networks.
To obtain more discriminative representations, we build a joint learning framework that integrates Contrastive Representation Enhancement (CRE) as an adaptive auxiliary task to pneumonia classification. Our CRE extends general contrastive learning and can capture the intra-class similarity and inter-class difference for a more precise multi-type pneumonia diagnosis. We also construct a large-scale COVID-19 dataset with four categories for fine-grained differential diagnosis. The effectiveness and clinical interpretation of our proposed approach have been verified on our in-house 3D and 2D CT datasets as well as the public Covid-CT and CC-CCII datasets.

In the future, we will further explore the potential of our proposed CRE and PSP in multiple lesion diagnosis tasks, including lesion detection and segmentation. Contrastive learning helps the network learn lesion-specific features in images or volumes, making the network a stronger feature extractor. As our proposed PSP embeds a location prior about COVID-19, it can also be extended to location-sensitive tasks such as lesion localization.

where $L(\theta) = -\log \mathrm{Softmax}(y, f_\theta(x))$ is the cross-entropy loss of $y$. Here, we derive the log likelihood of the InfoNCE loss in detail and obtain our joint loss with adaptive weights. The InfoNCE loss can be formulated as the instance-level classification objective using the Softmax criterion.
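Given $n$ samples and their features, for a sample $x$ with feature $z = f_\theta(x)$, the probability of it being recognized as the $i$-th example is:

$$P(y = i \mid f_\theta(x)) = \frac{\exp\big(f_\theta(x_i)^T \cdot f_\theta(x)/\tau\big)}{\sum_{j=1}^{n} \exp\big(f_\theta(x_j)^T \cdot f_\theta(x)/\tau\big)}.$$

We adapt the likelihood to squash a scaled version of the model output with a positive scalar $\sigma$:

$$P(y = i \mid f_\theta(x), \sigma) = \frac{\exp\big(\frac{1}{\sigma^2} f_\theta(x_i)^T \cdot f_\theta(x)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\frac{1}{\sigma^2} f_\theta(x_j)^T \cdot f_\theta(x)/\tau\big)}.$$

The log likelihood can then be written as:

$$\log P(y = i \mid f_\theta(x), \sigma) = \frac{1}{\sigma^2}\big(f_\theta(x_i)^T \cdot f_\theta(x)/\tau\big) - \log \sum_{j} \exp\Big(\frac{1}{\sigma^2}\big(f_\theta(x_j)^T \cdot f_\theta(x)/\tau\big)\Big).$$

In maximum likelihood inference, we maximize the log likelihood of the network, which is equivalent to minimizing:

$$-\log P(y = i \mid f_\theta(x), \sigma) = -\frac{1}{\sigma^2}\big(f_\theta(x_i)^T \cdot f_\theta(x)/\tau\big) + \log \sum_{j} \exp\Big(\frac{1}{\sigma^2}\big(f_\theta(x_j)^T \cdot f_\theta(x)/\tau\big)\Big) \approx \frac{1}{\sigma^2} L(\theta) + \log \sigma,$$

where $L(\theta) = -\log \mathrm{Softmax}\big(f_\theta(x_i)^T \cdot f_\theta(x)/\tau\big)$ is written for the contrastive InfoNCE loss. The second term is approximately equal to $\log \sigma$ when $\sigma \to 1$.

In the case of multiple network outputs, the likelihood is defined to factorize over the outputs. We define $f_\theta(x)$ as the sufficient statistics and obtain the following likelihood:

$$p(y_1, y_2 \mid f_\theta(x)) = p(y_1 \mid f_\theta(x)) \cdot p(y_2 \mid f_\theta(x)),$$

with the model outputs $y_1$ and $y_2$ standing for contrastive learning and classification, respectively, so that the indices agree with the weights $\sigma_1$ and $\sigma_2$ of $L_{con}$ and $L_{ce}$. Therefore, our joint loss $L(\theta, \sigma_1, \sigma_2)$ is given as:

$$\begin{aligned} L(\theta, \sigma_1, \sigma_2) &= -\log p(y_1 = i, y_2 = c \mid f_\theta(x)) \\ &= -\log p(y_1 = i \mid f_\theta(x), \sigma_1) - \log p(y_2 = c \mid f_\theta(x), \sigma_2) \\ &\approx \frac{1}{\sigma_1^2} L_{con}(\theta) + \frac{1}{\sigma_2^2} L_{ce}(\theta) + \log \sigma_1 + \log \sigma_2. \end{aligned}$$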
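The joint loss derived above can be illustrated with a small numerical sketch. It is not the authors' training code: the function names are hypothetical, and parameterizing each weight by $s = \log \sigma$ (so that $\sigma$ stays positive) is an implementation assumption. For a fixed loss value $L$, the term $L/\sigma^2 + \log \sigma$ is minimized at $\sigma^2 = 2L$, so a larger loss receives a smaller weight $1/\sigma^2$. The gradient-descent loop below recovers that minimizer.

```python
import math

def joint_loss(l_con, l_ce, log_sigma1, log_sigma2):
    """(1/sigma1^2) * L_con + (1/sigma2^2) * L_ce + log sigma1 + log sigma2."""
    w1 = math.exp(-2.0 * log_sigma1)  # 1 / sigma1^2
    w2 = math.exp(-2.0 * log_sigma2)  # 1 / sigma2^2
    return w1 * l_con + w2 * l_ce + log_sigma1 + log_sigma2

def fit_log_sigma(loss_value, lr=0.1, steps=300):
    """Gradient descent on s = log sigma for one fixed loss value.

    d/ds [loss * exp(-2s) + s] = -2 * loss * exp(-2s) + 1.
    """
    s = 0.0
    for _ in range(steps):
        grad = -2.0 * loss_value * math.exp(-2.0 * s) + 1.0
        s -= lr * grad
    return s

# With sigma1 = sigma2 = 1 the joint loss is the plain sum of the two terms.
plain_sum = joint_loss(1.0, 2.0, 0.0, 0.0)

# For a fixed loss of 2.0 the optimum is sigma^2 = 4, i.e. a weight of 0.25.
s_opt = fit_log_sigma(2.0)
weight = math.exp(-2.0 * s_opt)
```

In actual training the two loss values change at every step and $\sigma_1$, $\sigma_2$ are updated together with the network parameters, so this fixed-loss case only shows the direction in which the weights move.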
[1] COVID-19 weekly epidemiological update
[2] Frequency and distribution of chest radiographic findings in COVID-19 positive patients
[3] A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT
[4] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
[5] Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography
[6] AI-assisted CT imaging analysis for COVID-19 screening: building and deploying a medical AI system in four weeks
[7] Automatically discriminating and localizing COVID-19 from community-acquired pneumonia on chest X-rays
[8] MetaCOVID: A siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients
[9] A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)
[10] Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images
[11] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19
[12] A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19
[13] Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography
[14] COVID-19 classification with deep neural network and belief functions
[15] Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia
[16] Prior-attention residual learning for more discriminative COVID-19 screening in CT images
[17] A deep learning system to screen novel coronavirus disease 2019 pneumonia
[18] M3Lung-Sys: A deep learning system for multi-class lung pneumonia screening from CT imaging
[19] Representation learning with contrastive predictive coding
[20] A simple framework for contrastive learning of visual representations
[21] Momentum contrast for unsupervised visual representation learning
[22] Sample-efficient deep learning for COVID-19 diagnosis based on CT scans
[23] Contrastive cross-site learning with redesigned net for COVID-19 CT classification
[24] Momentum contrastive learning for few-shot COVID-19 diagnosis from chest CT images
[25] Multi-task contrastive learning for automatic CT and X-ray diagnosis of COVID-19
[26] Deep residual learning for image recognition
[27] Boundary-aware instance segmentation
[28] Annual Conference on Neural Information Processing Systems 2020
[29] Multi-task learning using uncertainty to weigh losses for scene geometry and semantics
[30] COVID-CT-Dataset: a CT scan dataset about COVID-19
[31] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[32] Evaluate the malignancy of pulmonary nodules using the 3-D deep leaky noisy-or network
[33] Imbalanced learning: foundations, algorithms, and applications
[34] 2009 IEEE Conference on Computer Vision and Pattern Recognition
[35] Quo vadis, action recognition? A new model and the Kinetics dataset

Our joint loss is comprised of the InfoNCE loss in the contrastive representation enhancement and the cross-entropy loss in the pneumonia classification task. From [29], the log likelihood of the classification task can be written as:

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.