title: Multi-task Contrastive Learning for Automatic CT and X-ray Diagnosis of COVID-19
authors: Li, Jinpeng; Zhao, Gangming; Tao, Yaling; Zhai, Penghua; Chen, Hao; He, Huiguang; Cai, Ting
date: 2021-01-26
journal: Pattern Recognit
DOI: 10.1016/j.patcog.2021.107848

Computed tomography (CT) and X-ray are effective methods for diagnosing COVID-19. Although several studies have demonstrated the potential of deep learning in the automatic diagnosis of COVID-19 using CT and X-ray, the generalization to unseen samples needs to be improved. To tackle this problem, we present the contrastive multi-task convolutional neural network (CMT-CNN), which is composed of two tasks. The main task is to distinguish COVID-19 from other pneumonia and normal controls. The auxiliary task is to encourage local aggregation through a contrastive loss: first, each image is transformed by a series of augmentations (Poisson noise, rotation, etc.). Then, the model is optimized so that representations of the same image are similar, while representations of different images are dissimilar, in a latent space. In this way, CMT-CNN is capable of making transformation-invariant predictions, and the spread-out properties of the data are preserved. We demonstrate that this apparently simple auxiliary task provides powerful supervision to enhance generalization. We conduct experiments on a CT dataset (4,758 samples) and an X-ray dataset (5,821 samples) assembled from open datasets and data collected in our hospital. Experimental results demonstrate that contrastive learning (as a plugin module) brings solid accuracy improvements for deep learning models on both CT (5.49%-6.45%) and X-ray (0.96%-2.42%) without requiring additional annotations. Our code is accessible online.

In December 2019, the coronavirus disease broke out in Wuhan, China.
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was identified as the cause of COVID-19 and spread rapidly to other parts of China and the world. By December 1, 2020, there were more than 63,492,196 confirmed cases and 1,469,003 deaths worldwide. The World Health Organization (WHO) has declared COVID-19 a global pandemic. The prognosis of COVID-19 is poor. According to several studies, over 60% of patients died once the disease had developed to the severe/critical stage [1, 2]. The main causes of death include massive alveolar damage and progressive respiratory failure [3]. Moreover, SARS-CoV-2 can be transmitted between humans even in the absence of symptoms. Therefore, fast and accurate screening and diagnosis of COVID-19 is of great significance for planning early interventions, blocking the transmission path and formulating clinical schemes to improve the prognosis [4].

There are two main diagnostic methods for COVID-19. The first is nucleic acid detection implemented with the real-time polymerase chain reaction (RT-PCR) test. RT-PCR has been widely used in clinical diagnosis, discharge assessment and recovery follow-up. However, the sensitivity of RT-PCR on swab samples is low [5], which may result in a substantial number of false negatives [6]. The second approach to detecting COVID-19 is chest medical imaging. Clinical studies have suggested that certain manifestations on computed tomography (CT), such as multiple small patches and ground-glass shadows, are associated with COVID-19 [7]. From the perspective of pathology, CT can provide detailed information to facilitate a quantitative assessment of pulmonary changes, which may have prognostic implications [8]. Despite its high sensitivity (97%) [5], CT is not suitable for large-scale screening due to its relatively high cost. Besides, CT involves a high radiation dose, which is harmful to the human body [7].
Therefore, CT can be used for accurate clinical diagnosis, but is not recommended in clinical applications where repeated data acquisition is required (e.g., recovery assessment and follow-up analysis). X-ray is another medical imaging modality for detecting COVID-19. Because X-ray cannot provide 3D information like CT, radiologists generally use X-ray as a preliminary screening step before CT diagnosis. However, X-ray has certain advantages. Its low radiation dose causes less damage to the human body. In addition, its relatively low cost makes it suitable for less developed countries and regions as an important method for COVID-19 diagnosis. At present, most research has focused on CT diagnosis [5, 7, 8], whereas X-ray diagnosis has been less investigated [3].

Benefiting from the strong representation learning ability of deep learning, artificial intelligence (AI) has demonstrated impressive capability in the automatic diagnosis of COVID-19 based on both CT [9] [10] [11] [12] and X-ray [13, 14]. AI has four advantages: (1) It diagnoses quickly, especially when the medical system is overloaded. (2) It reduces the burden on radiologists. (3) It helps underdeveloped areas achieve accurate diagnosis. (4) Most importantly, because COVID-19 is a new pandemic, there is not yet a systematic consensus on its sensitive and specific manifestations. AI can automatically learn discriminative features in a data-driven manner, especially to distinguish COVID-19 from other pneumonia [4, 10, 13].

Despite the success of AI in both CT and X-ray diagnosis of COVID-19, the generalization of these models needs to be improved. This paper proposes CMT-CNN, a novel multi-task framework for improved generalization on unseen data. Different from typical multi-task models, CMT-CNN requires no additional supervision from related tasks, but seeks annotation-free performance improvement based on self-supervised learning [15] [16] [17].
Our work is quite different from that of Doersch and Zisserman [18], which explored multiple self-supervised tasks in a multi-task framework without explicitly appointing a main task. Furthermore, different from recent advances in self-supervised learning that have focused on generating well-aggregated embeddings [19] [20] [21], CMT-CNN pays more attention to fulfilling specific tasks.

The main contributions of this paper are: (1) A novel CMT-CNN model that brings solid improvement in generalization without additional annotations. While completing specific tasks, the model seeks to evolve into an embedding function with fine spatial aggregating properties. (2) We present a series of effective augmentation methods for CMT-CNN based on distortion, painting and perspective transformations. They are not used as a data-preprocessing trick, but to enhance representation learning at the model level. More importantly, these methods are closely related to the characteristics of CT/X-ray images, and thereby have good interpretability. (3) Experimental results on a large-scale CT dataset and an X-ray dataset both demonstrate that CMT-CNN has a significant advantage over CNN for distinguishing COVID-19 from other pneumonia and normal controls.

We first review some representative works using AI to diagnose COVID-19 (both CT and X-ray). Then, from the perspective of methodology, we review concepts and works that are highly relevant to our method.

Several studies have reported various results concerning deep learning in CT-based diagnosis. Li et al. [9] applied ResNet-50 as the backbone and added a fully-connected … Ouyang et al. [10] proposed a dual-sampling attention network to distinguish COVID-19 from other pneumonia, where segmented lesions are used to refine the attention localization. Wang et al. [11] proposed a weakly-supervised framework to improve the lesion localization for distinguishing COVID-19 from other pneumonia. Kang et al.
[12] incorporated multiple radiomic features and handcrafted features into a multi-view learning framework to separate COVID-19 from other pneumonia. The model used the complementary information from multiple types of features and achieved an accuracy of 94% using 2,522 CT samples. Others have discussed the feasibility of CT in screening and early diagnosis [7], multi-modal diagnosis [23] and the correlation between CT and RT-PCR results [5, 6].

Compared with CT, X-ray has been less studied in the automatic diagnosis of this pandemic. Cohen et al. [24] … Oh et al. [25] proposed a patch-based CNN for COVID-19 diagnosis, where decisions are made by majority voting over multiple patches at random locations within the lungs on X-rays. Patch-based methods allow for a relatively small number of training samples and trainable parameters.

The medical imaging datasets used in the above studies are much smaller than those in computer vision tasks (e.g., ImageNet). To enhance the generalization ability of the diagnostic models, we integrate contrastive learning into CNN, and thereby propose a multi-task learning framework: CMT-CNN. In the following, we review some concepts and works that are highly relevant to our method.

Multi-task Learning (MTL). MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks [26, 27]. MTL typically involves a main task and an auxiliary task. The auxiliary task enables the model to learn representations that are shared with or helpful for the main task. Many studies have presented both theoretical [28] and empirical [29, 30] evidence that machine learning models generalize better by sharing representations between related tasks. In a typical MTL scenario, labeled data for the auxiliary task are available. When labeled data are unavailable, pseudo-labels can be defined and used.
For example, Ganin and Lempitsky [31] used the domain label as the pseudo-label and applied an adversarial training scheme to reduce the representational differences between domains. Finding an effective and general approach to improve model generalization in the absence of explicitly annotated auxiliary tasks is a challenging and interesting topic. We think that self-supervised learning, which has developed rapidly in recent years, is a feasible solution.

Self-supervised Learning (SSL). SSL designs pretext tasks to synthesize pseudo-labels, and then formulates representation learning as the task of predicting them [15]. For example, Gidaris et al. [17] proposed to augment 2D images by applying rotations, and then learn image features to recognize the rotations. They demonstrated that this task provided a powerful supervisory signal for semantic feature learning and significantly closed the gap between supervised learning and unsupervised learning. Doersch et al. [32] proposed to learn representations by predicting context information of local patches. Noroozi and Favaro [33] approached SSL by predicting the position of randomly rearranged local patches of images. Pathak et al. [34] used inpainting to learn representations. Chen et al. [35] proposed an SSL strategy based on context restoration to exploit unlabeled medical images. They conducted classification on 2D ultrasound images, localization on CT images and segmentation on magnetic resonance (MR) images. Experimental results showed that the semantic features learned by context restoration improved the performance of machine learning models on these tasks.

In terms of whether manual annotations are needed, SSL belongs to unsupervised learning. In terms of the learning principle, SSL belongs to discriminative learning. Discriminative approaches learn representations through objective functions associated with pretext tasks, and the models are optimized in a supervised manner.
SSL approaches rely on heuristics to design pretext tasks. In particular, we regard Instance Recognition (IR) as one type of SSL, whose pretext task is to identify each instance. The model can be parametric or non-parametric. The Exemplar CNN [36] is a parametric example. The instance discrimination method [37] is a non-parametric example.

Contrastive Learning. This approach learns representations by contrasting sample pairs. IR is typically implemented using contrastive learning. The Exemplar CNN [36] represented each instance as a vector and trained a network to recognize each instance. Sampling is an important issue in contrastive learning. Wu et al. [37] constructed a memory bank to store instance vectors, and several works have …

Recent studies have shown that SSL can improve the quality of few-shot learning [39], transfer learning [40] and semi-supervised learning [37]. However, the value of SSL in the MTL framework remains unexplored. We argue that SSL is able to provide annotation-free supervision to improve the generalization performance of supervised learning in an MTL scheme. MTL, in turn, provides a clear task-specific orientation for SSL. The CMT-CNN proposed in this paper integrates SSL as a plugin module into MT-CNN, which can improve the diagnostic accuracy of COVID-19 when the amount of data is not very large.

Our goal is to learn a parametric model $f_\theta(\cdot)$ from a set of labeled images $\{(x_i, y_i)\}_{i=1}^{n}$, where $n$ denotes the number of samples in the dataset. We hope that the representations generated by $f_\theta(\cdot)$ are able to (1) facilitate a high-accuracy classification model, denoted as $g_\phi(\cdot)$, and (2) exhibit good embedding properties. The former can be achieved by supervised training to yield inductive bias. The latter improves the quality of the representations to enhance the generalization on unseen samples, especially when the amount of data is small.
Recent works have shown evidence that high-quality embeddings can achieve good classification performance with limited labeled data [16, 41], or even in the absence of labeled data [16, 38]. We therefore propose the CMT-CNN, an MTL solution to achieve these goals simultaneously. In the following, we introduce the flowchart of CMT-CNN in an intuitive manner, formulate the objective function, and then explain the optimization procedure.

We conduct the CT experiments using 3D volumes, which contain a large number of voxels. To avoid overfitting, they are uniformly sampled into smaller volumes to feed the CMT-CNN. The X-ray images are originally stored as 3-channel images. We resize every X-ray image and convert it to a greyscale image. For every 3D CT volume and every 2D X-ray image, we normalize the pixels to [0, 1] using min-max scaling, and then subtract 0.5 from every pixel. Therefore, the pixel values are normalized to [-0.5, 0.5].

The flowchart of the proposed framework is summarized in Figure 1. Note that the test images do not need to be transformed; the original images are fed directly into the model.

The transformations used in the relevant literature differ considerably. Giving full consideration to the characteristics of CT/X-ray diagnosis of pneumonia, we define three categories of transformations; new transformations can be easily added to our framework. Figure 2 visualizes the applied transformations, using a randomly selected COVID-19 X-ray for illustration.

(1) Distortion methods. These methods change the values of pixels in the image. Because the pixel values at the micro level have changed, the model can pay more attention to macro features such as shape and texture. We believe that this characteristic is beneficial to the diagnosis of COVID-19. We adopt three distortion methods. The first is the Hounsfield transformation, which was recently proposed by Zhou et al. [42] to provide strong pixel-wise supervision.
Restoring images distorted by the Hounsfield transformation can focus the model on learning lesion appearance in terms of shape and intensity distribution. We use the Bézier curve, a smooth and monotonic function, to assign every pixel a value. The second is local pixel shuffling. For a given image, this method randomly samples some windows, and then shuffles the order of the pixels within each window. The pixel values themselves do not change, but the relative positions of the pixels do. Zhou et al. [42] have demonstrated that to recover from local pixel shuffling, the model must memorize the global geometry, lesion spatial layout, local boundaries and texture, which encourages the model to learn the shapes and boundaries of disease-related lesions as well as the relative layout of their different parts. To avoid changing the global content of the image, the window size should be smaller than the receptive field of the model. The third is Poisson noise. We propose Poisson noise as a new training scheme for SSL. Both CT and X-ray imaging are based on X-rays. Physicists and radiologists have pointed out that the quantum noise of X-ray imaging obeys the Poisson distribution [43], rather than the Gaussian distribution widely used for noise in computer vision tasks [16]. Different from the Hounsfield transformation, which computes the value of every pixel based on its context pixels, the Poisson noise applies random noise to CT and X-ray images to improve the robustness to random noise, in addition to the benefits brought by the Hounsfield transformation. The first row of Figure 2 gives an example of how the distortion methods change an image.

(2) Painting methods. These methods block part of the image using random windows. As part of the image is lost, the model has to seek complementary information in other pixels across the image. This characteristic leads to context-awareness.
Different from the context encoder [34], which conducts in-painting at the central region of images, we adopt both in-painting and out-painting, which are suitable for SSL in medical image analysis. For in-painting, the pixels inside the window are replaced by a constant, and the pixels outside the window are preserved. This allows the model to learn local continuities of lesions via interpolation [42]. For out-painting, the pixels inside the window are preserved, and the pixels outside the window are replaced by a constant. This compels the model to learn global geometry and lesion layout via extrapolation [42]. The in-painting and out-painting operations are complementary to each other. COVID-19 often manifests as multiple pathological changes on CT and X-ray images, and the painting operations allow the model to make context-aware predictions. We illustrate the effect of the painting methods in the second row of Figure 2.

(3) Perspective methods. These methods view the same image from different geometric perspectives, and then compel the model to recognize the same semantic concept in all of them. We adopt two perspective methods. The first is rotation, and the second is flip. Inspired by several recent studies in SSL [16, 17], we rotate the image by an angle drawn from a predefined angle set. Different from rotation prediction [17], which decodes the rotation at the output of the model via a classification task, we focus on making rotated images with the same semantic information share similar representations. We argue that, for a model to generate rotation-invariant representations, it must understand the semantic concepts of objects (lesions) in the image, including the location, the shape, the appearance, and in particular, the …

Contrastive learning is the auxiliary task of the CMT-CNN. Figure 3 shows its effect in representation learning.
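The three categories of transformations described above can be illustrated with a short NumPy sketch. It is not the authors' implementation: it assumes a 2D greyscale image already normalized to [-0.5, 0.5], and the window sizes, the number of shuffled windows, the photon-count scale `peak`, the fill constant and the restriction of rotations to multiples of 90 degrees are all hypothetical choices made only for illustration.

```python
import numpy as np

def bezier_intensity(img, rng, n=1000):
    """Monotonic intensity remapping along a cubic Bezier curve (distortion)."""
    p1, p2 = np.sort(rng.uniform(0, 1, size=(2, 2)), axis=0)  # random inner control points
    t = np.linspace(0.0, 1.0, n)[:, None]
    # End points are fixed at P0 = (0, 0) and P3 = (1, 1).
    curve = 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3
    xs, ys = np.sort(curve[:, 0]), np.sort(curve[:, 1])
    return np.interp(img + 0.5, xs, ys) - 0.5

def local_pixel_shuffle(img, rng, n_windows=10, win=4):
    """Shuffle pixel positions inside small random windows; values are unchanged."""
    out = img.copy()
    h, w = img.shape
    for _ in range(n_windows):
        r = rng.integers(0, h - win + 1)
        c = rng.integers(0, w - win + 1)
        patch = out[r:r + win, c:c + win].ravel()
        out[r:r + win, c:c + win] = rng.permutation(patch).reshape(win, win)
    return out

def poisson_noise(img, rng, peak=255.0):
    """Poisson (quantum) noise; `peak` is an assumed photon-count scale."""
    counts = rng.poisson(np.clip(img + 0.5, 0.0, 1.0) * peak)
    return np.clip(counts / peak, 0.0, 1.0) - 0.5

def _random_window(shape, rng, frac):
    """Pick a random axis-aligned window covering `frac` of each side."""
    h, w = shape
    wh, ww = max(1, int(h * frac)), max(1, int(w * frac))
    r = rng.integers(0, h - wh + 1)
    c = rng.integers(0, w - ww + 1)
    return slice(r, r + wh), slice(c, c + ww)

def in_paint(img, rng, fill=0.0, frac=0.25):
    """Replace the pixels inside a random window by a constant."""
    out = img.copy()
    rows, cols = _random_window(img.shape, rng, frac)
    out[rows, cols] = fill
    return out

def out_paint(img, rng, fill=0.0, frac=0.75):
    """Keep the pixels inside a random window; replace everything else by a constant."""
    out = np.full_like(img, fill)
    rows, cols = _random_window(img.shape, rng, frac)
    out[rows, cols] = img[rows, cols]
    return out

def rotate(img, rng):
    """Rotate by a random multiple of 90 degrees (assumed angle set)."""
    return np.rot90(img, k=int(rng.integers(1, 4)))

def flip(img, rng):
    """Flip along a randomly chosen axis."""
    return np.flip(img, axis=int(rng.integers(0, 2)))
```

Under these assumptions, two views of a training image could be obtained by calling two of these functions with a shared generator such as `np.random.default_rng(0)`; test images would be passed to the model unchanged.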
Numerous studies have demonstrated that well-optimized embeddings obtained through representation learning can provide high-quality representations for downstream tasks [15, 16, 38]. Motivated by this, we introduce contrastive learning as the auxiliary task to enhance the embedding quality and thereby improve model generalization. Note that the feature extractor $f_\theta(\cdot)$ can be seen as an embedding function, which is trained with self-supervision. The self-supervision signals we use are the instance labels, i.e., the network is trained to recognize every instance in a transformation-invariant manner, and the distance between paired instances is encouraged.

Contrastive learning can be seen as an information retrieval task. Assuming that $z_i$ is fixed, we use $z_i$ as a query to search a dictionary $\{z_k\}$, where $z_j$ is the correct retrieval. Conversely, the correct retrieval for $z_j$ is $z_i$. Now we present the contrastive learning scheme within a training batch consisting of $N$ randomly sampled images from the training set. We apply two transformations to the training batch, and $2N$ data points are thus obtained. Let $\{z_i, z_j\}$ be a positive pair, i.e., they are representations of two views of the same image. When $z_i$ is fixed, the loss function can be written as

$$\ell(i, j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}(k \neq i)\,\exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}, \tag{1}$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity between two representations and $\tau$ is a temperature parameter controlling the concentration level of the sample distribution. Several studies have demonstrated that an appropriate $\tau$ can help learn from hard negatives [16, 21, 44]. The value of $\tau$ is generally small (e.g., 0.1). The indicator function $\mathbb{1}(k \neq i)$ is defined as

$$\mathbb{1}(k \neq i) = \begin{cases} 1, & k \neq i, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$

The loss function in equation (1) has been used in several works [37], and was termed the normalized temperature-scaled cross entropy (NT-Xent), which has shown a significant advantage over NT-Logistic and Margin Triplet in contrastive learning [16].

Diagnosing COVID-19 is the main task of CMT-CNN, which is formulated as a binary or ternary classification task. The latent representation of an image sample $x$ is $f_\theta(x)$, and $\tilde{y} = g_\phi(f_\theta(x))$ is the predicted label.
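The contrastive loss in equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: the similarity is taken to be cosine similarity as in NT-Xent [16], the two views of image k are assumed to be stored in adjacent rows 2k and 2k+1 of the batch, and the function name `nt_xent` and the default temperature are hypothetical.

```python
import numpy as np

def nt_xent(z, tau=0.1):
    """Mean NT-Xent loss over 2N embeddings; rows 2k and 2k+1 form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors: dot product = cosine
    sim = z @ z.T / tau                               # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # indicator 1(k != i): exclude self-pairs
    # Numerically stable log-softmax over each row (the query's dictionary).
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0])
    partner = idx ^ 1                                 # 0<->1, 2<->3, ...: index of the other view
    return -log_prob[idx, partner].mean()
```

With this layout, the loss is close to zero when the two views of each image coincide and different images are orthogonal, and it equals log(2N - 1) when all embeddings are identical, since every dictionary entry is then an equally good retrieval.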
Our goal is to make $\tilde{y}$ similar to $y$, the true label of the sample. In a training batch, we use the Kullback-Leibler (K-L) divergence to measure the distance between the true label distribution $p(y)$ and the predicted label distribution $p(\tilde{y})$:

$$D_{KL}\big(p(y)\,\|\,p(\tilde{y})\big) = \sum p(y)\log p(y) - \sum p(y)\log p(\tilde{y}),$$

where the first term is the (negative) entropy of $p(y)$, a constant in a training batch. The second term is the cross entropy loss, which we denote as $\mathcal{L}_{ce}$. We seek parameters $\theta$ and $\phi$ for the feature extractor $f_\theta(\cdot)$ and the classifier $g_\phi(\cdot)$ that minimize $\mathcal{L}_{ce}$. We write the overall loss function $\mathcal{L}$ of CMT-CNN as a weighted combination of $\mathcal{L}_{ce}$ and the contrastive loss $\mathcal{L}_{con}$, where the weight $\lambda$ balances the main task and the auxiliary task. In the optimization process, we use the following method to update the model parameters:

$$\theta \leftarrow \theta - \eta\,\frac{\partial \mathcal{L}}{\partial \theta}, \qquad \phi \leftarrow \phi - \eta\,\frac{\partial \mathcal{L}}{\partial \phi}, \tag{9}$$

where $\eta$ is the learning rate. Note that the update procedure shown in (9) is applied on training batches with $N$ instances. A large batch size tends to smooth the training curves, and existing studies have demonstrated that a large batch size is beneficial to contrastive learning methods such as SimCLR [16]. In our training scheme, $\mathcal{L}_{ce}$ is applied to optimize both $f_\theta(\cdot)$ and $g_\phi(\cdot)$, whereas $\mathcal{L}_{con}$ is applied to optimize $f_\theta(\cdot)$ only. We regard the performance on the test data as the proxy to evaluate the generalization ability of CMT-CNN. For the binary classification, the performance is …

We evaluate the method on a CT dataset [4] and an X-ray dataset [13] for both the binary and ternary classifications. We use the CC-CCII …

CMT-CNN can be built on any kind of CNN without restriction. To demonstrate the advantages of contrastive learning in improving the generalization ability of the model, we explore the following three CNNs, which are either the most commonly used classical models or the latest and most powerful models in computer vision tasks. Note that all three CNNs have been applied to diagnose COVID-19 under single-task schemes [9, 13, 22]. (1) VGG-19. This CNN took first place in localization and second place in classification at ILSVRC 2014, with an architecture similar to AlexNet.
VGG-19 has 16 convolution layers using 3×3 kernels with stride 1 for feature extraction, followed by three FC layers to fulfill tasks. Note that five max-pooling layers are inserted between the sequential convolutions. VGG-19 has been used for analyzing X-ray images for COVID-19 diagnosis [13]. We use the output of the first FC layer (4,096-D) as the latent representation of a given image. (2) ResNet-50. This CNN won the classification task on ILSVRC 2015. Since then, it has been widely applied as a backbone for computer vision tasks, including medical image analysis. ResNet-50 introduces shortcut connections to combat the problem of vanishing gradients in deep structures. ResNet-50 has been used for the automatic diagnosis of COVID-19 based on both X-ray [13] and CT [9]. For each CT/X-ray image, we adopt the output of the last max-pooling layer (4,096-D) as the latent representation. (3) EfficientNet. EfficientNet B4 has been used in a recent study for the diagnosis of COVID-19 [22], and therefore we adopt this version. We use the output of the first FC layer (4,096-D) as the latent representation of a given image.

The models are trained from scratch using PyTorch with the Adam optimizer and a learning rate of 1e-3. A large batch size (e.g., 2,048 or 4,096) seems to be necessary in SSL for providing reliable gradients [45]. However, this requires large memory and demanding hardware, such as training on multiple TPUs. In our experiment, CMT-CNN also receives supervision from the disease labels, which allows for a small batch size for SSL during training. We find that the CMT-CNN works well with a small batch size, and set the batch size to 8 and 64 for CT and X-ray, respectively. We adopt the setting suggested in the literature [16] for the temperature parameter. The optimal value of the balancing weight is determined by the validation accuracies.

We use five-fold cross-validation to validate the CMT-CNN on the CT dataset. The results are summarized in … [4], where the best binary accuracy and ternary accuracy were both 92.49%.
Their model also uses a 3D CNN to analyze 3D CT volumes. Although using 2D slices yields more training samples, 3D models are more straightforward in CT analysis. However, Zhang et al. [4] used a two-stage scheme: pulmonary parenchyma segmentation followed by classification on the cropped pulmonary area. In comparison, the proposed CMT-CNN is a single-stage model working in an end-to-end manner.

We use five-fold cross-validation to validate the CMT-CNN on the X-ray dataset. … EfficientNet with respect to ResNet-50. We find that EfficientNet has a significant advantage over VGG-19 across tasks and metrics, and is significantly better than ResNet-50 under most of the metrics, except for specificity (for binary classification) and accuracy (for ternary classification). Generally, EfficientNet has better performance and fewer parameters. For the binary classification, we achieve an AUC of 92.13% with a sensitivity of 92.97% and a specificity of 91.91%. For the ternary classification, we achieve an accuracy of 93.49%, which is essentially the same as the best result reported in the existing literature (93.48%) [13].

Five-fold cross-validation is used to validate the model when only the classification loss is applied, and the results (both CT and X-ray) are summarized in Table 3. For both imaging modalities and both the binary and ternary classifications, across all the considered metrics, EfficientNet outperforms ResNet-50 and VGG-19 in terms of mean value. Therefore, EfficientNet is not only suitable for the multi-task scenario, but also shows an advantage in the single-task scenario. For all the results, the single-task CNN using the classification loss only is inferior to the proposed CMT-CNN in terms of mean value. We use a symbol to mark statistically significant declines (evaluated by paired t-test) of the classification-only CNN with respect to the CMT-CNN.
For CT, the best-performing EfficientNet suffers a significant decline in accuracy (6.25%), sensitivity (6.45%), specificity (5.63%) and AUC (3.45%) for the binary classification, as well as a significant decline in accuracy (5.49%) for the ternary classification. For X-ray, the EfficientNet suffers a significant decline in accuracy (2.42%) for the binary classification. These observations indicate that CT benefits more from the auxiliary task.

To quantitatively measure the contribution of each transformation, we summarize the accuracy improvements of CMT-CNN with respect to the baseline CNN in Figure 4. We use the EfficientNet B4 for evaluation. The transformations are applied individually or in sequential pairs, which is similar to the protocol in SimCLR [16]. For each matrix, the last column shows the accuracy improvement averaged over each row.

To further illustrate the semantic learning ability of SSL, Figure 6 shows an example of the contrastive losses between X-rays with different semantic labels. For a given COVID-19 X-ray, its contrastive losses with respect to other COVID-19 X-rays are small (e.g., 0.2), its contrastive losses with respect to other pneumonia are larger (e.g., 1.03), and its contrastive losses with respect to normal controls are the largest (e.g., 1.52). This suggests that the contrastive loss enables discrimination between semantic labels. Figure 6 is only an intuitive illustration. Considering the purpose of this paper, we did not evaluate the semantic classification performance by adding a simple classifier on top of the representations. More in-depth analysis can be found elsewhere [16, 21].

Figure 6. Contrastive losses computed between a given COVID-19 X-ray and other X-rays from the dataset. We split the dataset into a training set and a test set. An EfficientNet is trained using the contrastive loss only.
Then, we randomly pick a COVID-19 X-ray from the test set and contrast it with every X-ray in the test set. We use different colors to represent different semantic labels. The lines represent the connections between X-ray pairs. The smaller the contrastive loss, the bolder the line and the higher the similarity between the pair. SSL implemented with the contrastive loss shows excellent ability to distinguish semantic labels.

In existing studies [4, 10, 12, 13, 25], the binary and ternary classification accuracies on CT are mainly between 90% and 96%, and the accuracies on X-ray are between 89% and 98%. Table 5 compares CMT-CNN with the state of the art. In CT diagnosis, the metrics obtained by CMT-CNN are basically the same as those of Zhang et al. [4], except that the sensitivity is lower by 4%. One reason is that they segmented the lung parenchyma to make the model focus more on the lesion. In comparison, segmentation is not a precondition in CMT-CNN, and we implement the diagnosis in an end-to-end manner. Under an end-to-end framework, Ouyang et al. [10] generated attention maps to help the model focus on the lesions, and exploited a dual-sampling algorithm to reduce the model bias. Although superior to CNN, their model is inferior to the CMT-CNN by large margins. One reason is that their training data (2,186 scans) are fewer than ours (3,806 scans in each fold). The structured latent multi-view representation learning proposed by Kang et al. [12] shows a 4% advantage in sensitivity over CMT-CNN by considering 93 radiomic features and 96 handcrafted features, which require substantial feature engineering and annotation. It is noteworthy that Ouyang et al. [10] and Kang et al. [12] only considered the binary classification, and did not present ternary classification results. In X-ray diagnosis, Oh et al. [25] proposed a patch-based CNN to address the shortage of annotated data in the binary classification.
The accuracy (88.9%) is much lower than that of CMT-CNN (97.23%) due to the lack of data. Apostolopoulos et al. [13] conducted both the binary and ternary classification on a relatively small dataset (1,427 images) using CNNs pre-trained on ImageNet, and obtained superior sensitivity (98.66%) and specificity (96.46%), although the accuracies are basically the same as those of CMT-CNN.

Ablation results show that the gap between supervised learning and unsupervised learning has largely been bridged. For CT (EfficientNet), the gaps between supervised learning and unsupervised learning are 0.75% (binary accuracy), 2.39% (AUC) and 1.77% (ternary accuracy). For X-ray, the gaps are 3.09% (binary accuracy), 1.50% (AUC) and 2.06% (ternary accuracy). These results imply that SSL (unsupervised learning) has a bright future in (1) unsupervised semantic feature learning and (2) assisting supervised learning in a multi-task framework.

We have proposed the CMT-CNN, an effective multi-task learning framework for the automatic diagnosis of COVID-19. We have validated that the contrastive learning module can be easily integrated into existing CNN models without constraints to bring solid performance improvements. The module is based on self-supervision and requires no additional human annotations. Transformations are a core element of self-supervised learning, and we are the first to categorize them into distortion methods, painting methods and perspective methods. These transformations have good interpretability in medical image analysis. Among all the transformations, the Hounsfield transformation, out-painting and flip have been identified as the three most effective in both the CT and the X-ray analysis. Experimental results on a CT dataset and an X-ray dataset show that CMT-CNN significantly outperforms CNNs in diagnosing COVID-19, and the results are comparable to the state of the art.
CMT-CNN shows good generalization ability, which can help doctors make more objective diagnostic decisions.

This study has two limitations. First, we validate the effectiveness of CMT-CNN on the classification task, but not on the localization and segmentation tasks. We believe that CMT-CNN can also improve the performance in localization and segmentation tasks since the contrastive loss can effectively focus the model on the lesions. Once datasets with box-level and pixel-level annotations are available, we will evaluate our method in localization (under both the normal setting and the weakly-supervised setting) and segmentation tasks. Second, like existing studies, we group pneumonias such as MERS, SARS and ARDS together as 'other pneumonia' without further discrimination. One reason is that there are few samples for each type of pneumonia and the samples are unevenly distributed. Developing effective algorithms to handle long-tailed datasets is a valuable research direction. The main methods include, but are not limited to, more sophisticated cost-sensitive learning and sampling strategies for medical images.

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Clinical Characteristics of Coronavirus Disease 2019 in China
Pathological findings of COVID-19 associated with acute respiratory distress syndrome
An Infectious cDNA Clone of SARS-CoV-2
Chest CT for Typical Coronavirus Disease 2019 (COVID-19) Pneumonia: Relationship to Negative RT-PCR Testing
CT screening for early diagnosis of SARS-CoV-2 infection
Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy
Dual-Sampling Attention Network for Diagnosis of COVID-19 from Community Acquired Pneumonia
A Weakly-Supervised Framework for COVID-19 Classification and Lesion Localization from Chest CT
Structured Latent Multi-View Representation Learning
Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
MetaCOVID: A Siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients, Pattern Recognit
Self-supervised visual feature learning with deep neural networks: A survey
A simple framework for contrastive learning of visual representations
Unsupervised representation learning by predicting image rotations
Multi-task Self-Supervised Visual Learning
Local aggregation for unsupervised learning of visual embeddings
Unsupervised deep learning by neighbourhood discovery
Unsupervised embedding learning via invariant and spreading instance feature
Artificial Intelligence Augmentation of Radiologist Performance in Distinguishing COVID-19 from Pneumonia of Other Origin at Chest CT
Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
COVID-19 Image Data Collection
Deep Learning COVID-19 Features on CXR Using Limited Training Data Sets
An Overview of Multi-Task Learning in Deep Neural Networks, ArXiv
The benefits of target relations: A comparison of multitask extensions and classifier chains
Learning multiple tasks using shared hypotheses
Representation learning using multi-task deep neural networks for semantic classification and information retrieval
Multitask multiclass support vector machines: Model and experiments
Unsupervised domain adaptation by backpropagation, 32nd Int. Conf. Mach. Learn. ICML 2015
Unsupervised visual representation learning by context prediction
Unsupervised learning of visual representations by solving jigsaw puzzles
Context Encoders: Feature Learning by Inpainting
Self-supervised learning for medical image analysis using image context restoration
Discriminative unsupervised feature learning with exemplar convolutional neural networks
Unsupervised Feature Learning via Non-parametric Instance Discrimination
Momentum Contrast for Unsupervised Visual Representation Learning
Boosting few-shot visual learning with self-supervision
Self-supervised domain adaptation for computer vision tasks
Rethinking few-shot image classification: A good embedding is all you need?
Models Genesis with the whole Supplementary Materials
Models Genesis: Generic Autodidactic Models for 3D Medical Image Analysis, MICCAI
A dictionary learning approach for Poisson image Deblurring
Big Self-Supervised Models are Strong Semi-Supervised Learners
Very deep convolutional networks for large-scale image recognition

Author Biographies
His research interests include machine learning, deep learning, transfer learning algorithms and their applications in brain-computer interfaces and medical image analysis.
Pattern Recognition and Intelligent System from Institute of Automation, Chinese Academy of Sciences. His research interests mainly include deep learning and its applications in computer vision tasks.
Immunology from Institute of Zoology, Chinese Academy of Sciences.
Her research interests mainly include immunology, oncology, and clinical medical imaging.

This work was supported in part by Zhejiang Provincial Natural Science