title: Learning Diagnosis of COVID-19 from a Single Radiological Image
authors: Zhang, Pengyi; Zhong, Yunxin; Tang, Xiaoying; Deng, Yunlin; Li, Xiaoqiong
date: 2020-06-06

Abstract. Radiological images are currently adopted as visual evidence for COVID-19 diagnosis in clinical practice. Using deep models to realize automated infection measurement and COVID-19 diagnosis is important for faster examination based on radiological imaging. Unfortunately, collecting large training datasets systematically in the early stage of an outbreak is difficult. To address this problem, we explore the feasibility of learning deep models for COVID-19 diagnosis from a single radiological image by resorting to synthesizing diverse radiological images. Specifically, we propose a novel conditional generative model, called CoSinGAN, which can be learned from a single radiological image with a given condition, i.e., the annotations of the lung and COVID-19 infection. CoSinGAN is able to capture the conditional distribution of the visual findings of COVID-19 infection, and further synthesize diverse and high-resolution radiological images that match the input conditions precisely. Both deep classification and segmentation networks trained on samples synthesized by CoSinGAN achieve notable detection accuracy of COVID-19 infection. Such results are significantly better than those of the counterparts trained on the same extremely small number of real samples (1 or 2 real samples) with strong data augmentation, and approach those of the counterparts trained on a large dataset (2846 real images). This confirms that our method can significantly reduce the performance gap between deep models trained on extremely small datasets and on large datasets, and thus has the potential to realize learning COVID-19 diagnosis from few radiological images in the early stage of a COVID-19 pandemic. Our code is publicly available at https://github.com/PengyiZhang/CoSinGAN.

The highly contagious Coronavirus Disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus [1][2][3], has spread rapidly across the world, and millions of people have been infected. This surge in infected patients has overwhelmed healthcare systems in a short time. Due to close contact with patients, many medical professionals have also been infected, further worsening the healthcare situation. To date (May 19th, 2020), COVID-19 has resulted in over 4.8 million infections and 310,000 deaths. Early detection of COVID-19 is critically important to prevent the spread of this epidemic. Reverse transcription polymerase chain reaction (RT-PCR) is the de facto gold standard for COVID-19 diagnosis [4][5]. However, the global shortage of RT-PCR test kits has severely limited the extensive detection of COVID-19. Meanwhile, current clinical experience implies that RT-PCR has low sensitivity [6][7][8], especially in the early outbreak of COVID-19. That means multiple tests may be required to rule out false negative cases [9], which may delay the confirmation of suspected patients and take up huge healthcare resources. Since most patients infected by COVID-19 are initially diagnosed with pneumonia [10], radiological examinations, including computed tomography (CT) and X-ray, are able to provide visual evidence of COVID-19 infection for diagnosis and patient triage.
Existing chest CT findings in COVID-19 infection [11] have implied that chest CT screening of patients at the early stage of COVID-19 presents superior sensitivity over RT-PCR [7] and can even confirm the false negative cases given by RT-PCR [4]. Therefore, radiological examinations are currently used as parallel testing in COVID-19 diagnosis. However, as the number of infected patients dramatically increases, clinicians need to analyze radiographs repeatedly, which places a huge burden on them. Therefore, there is an immediate need for methods for automated infection measurement and COVID-19 diagnosis based on radiological images to reduce the efforts of clinicians and accelerate the diagnosis process. Many approaches, mostly using deep models, have been proposed for automated COVID-19 diagnosis based on chest CT [9][12][13][14] or chest X-ray [10][15], and have claimed notable detection accuracy of COVID-19 infection. However, such approaches tend to lag slightly behind the outbreak of the COVID-19 pandemic, probably because accumulating the sufficient radiological images required to train deep models is difficult in the early stage of the pandemic. To solve the dilemma of training deep models on insufficient training samples and realize automated COVID-19 diagnosis in the early stage, some studies resort to shallow networks [10], prior knowledge [10], transfer learning [15][16], and data augmentation methods based on generative adversarial networks (GANs) [16][17]. However, these methods still require relatively large training datasets, and thus cannot respond immediately to the outbreak of the COVID-19 pandemic. An ideal solution is to learn COVID-19 diagnosis from a single radiological image. In this paper, we explore the feasibility of learning deep models for COVID-19 diagnosis from a single radiological image by resorting to synthesizing diverse radiological images. Specifically, we propose a novel conditional generative model, called CoSinGAN, which can be learned from a single radiological image with a condition, i.e., the annotations of the lung and COVID-19 infection. Inspired by SinGAN [18], we build CoSinGAN with a pyramid of GANs, each of which has a two-stage UNet-style [19][20] generator and is responsible for translating a condition mask into a radiological image at a different scale. Unlike SinGAN, which estimates the 'unconditional' distribution of a single natural image, our CoSinGAN is designed to capture the 'conditional' distribution of a single radiological image. Estimating a conditional distribution from a single image is much more difficult, because one must prevent the generators from 'overfitting' to the single input condition while, at the same time, 'overfitting' them to the single training image as much as possible. Therefore, we design the two-stage generator at each scale of CoSinGAN to cooperate with the multi-scale architecture by progressively adding image details and enhancing the condition constraints. Besides, we introduce a mixed reconstruction loss and a hierarchical data augmentation module to train CoSinGAN smoothly. The mixed reconstruction loss consists of a weighted pixel-level loss (WPPL), a multi-scale feature-level VGG [21] loss, a multi-scale feature-level UNet [19] loss and a multi-scale structural similarity (MS-SSIM) [22][23] loss. The mixed reconstruction loss is able to provide rich and stable gradient information for the optimization of the generators.
The hierarchical data augmentation module can produce data augmentation with different intensities for the training of the two-stage generators at different scales, which facilitates the balance between fitting conditions and fitting images. Moreover, to enable CoSinGAN to generate diverse radiological images, we provide two effective approaches, namely randomizing the input conditions and fusing images of different modalities. Extensive ablation experiments strongly confirm the efficacy of our proposed methods. Compared to the popular pix2pix [20] model, our CoSinGAN is able to synthesize diverse and high-resolution (512×512) radiological images that match the input conditions and the visual findings of the lung and COVID-19 infection more precisely. Both deep classification and segmentation networks trained on samples synthesized by CoSinGAN achieve notable detection accuracy of COVID-19 infection. Such results are significantly better than those of the counterparts trained on the same extremely small number of real samples (1 or 2 real samples) with strong data augmentation, and approach those of the counterparts trained on a large dataset (2846 real images). This confirms that our method can significantly reduce the performance gap between deep models trained on extremely small datasets and on large datasets, and thus has the potential to realize learning COVID-19 diagnosis from few radiological images in the early stage of a COVID-19 pandemic.

CoSinGAN consists of three key components: a multi-scale architecture with a pyramid of two-stage GANs, a mixed reconstruction loss, and a hierarchical data augmentation module.

Figure 2. Multi-scale architecture of the proposed CoSinGAN. CoSinGAN consists of a pyramid of GANs, each of which has a two-stage generator and is responsible for translating a condition mask into a radiological image at a different scale. The input to $G_i$ is an augmented condition mask together with the generated radiological image from scale $i-1$, upsampled to scale $i$ (except for scale 0). Through iterating optimizations from small image scales to large image scales, CoSinGAN progressively learns to generate realistic and high-resolution (512×512) radiological images with COVID-19 infection.

Overall. Learning a generative model to synthesize high-resolution and high-quality images is quite difficult due to the unstable training process of GANs. A useful trick, adopted by SinGAN [18], is to learn a pyramid of GANs to increase the resolution of generated images progressively. We borrow this trick and build CoSinGAN with a multi-scale architecture as depicted in Fig. 2. It is worth noting that we expect to use the synthesized radiological images with COVID-19 infection to train both classification and segmentation models for COVID-19 diagnosis. Thus, the synthesized images should match the given input conditions precisely, especially in the infection regions. To achieve that, CoSinGAN is designed to capture the 'conditional' distribution of a single radiological image rather than the 'unconditional' distribution of a single natural image as done by SinGAN. Estimating a conditional distribution from a single image is much more difficult, because one should pay more attention to preventing the deep models from 'overfitting' to the single input condition, and meanwhile needs to 'overfit' them to the single training image as much as possible.
To tackle this problem, at each scale we design a two-stage GAN to cooperate with the pyramid hierarchy, as illustrated in Fig. 3. The first stage is mainly responsible for fitting the input condition and increasing the resolution of the radiological image, and the second stage is responsible for restoring image details that may not be reconstructed in the first stage. Through iterative optimization of enhancing condition constraints and restoring image details across all scales of GANs, our CoSinGAN is able to generate realistic and high-resolution radiological images that match the given input conditions precisely, as demonstrated in Fig. 1 and Fig. 9.

Multi-scale architecture. As shown in Fig. 2, conditioning the input of the GANs across all N image scales enforces the output of CoSinGAN to match the given conditions strictly, which is exactly what we expect.

Two-stage GAN. At a specific image scale $i$, we design a two-stage GAN as depicted in Fig. 3. The generator $G_i$ in its first stage, called $G_i^s$, is designed to perform conditional image super-resolution, responsible for learning the condition constraints and increasing the resolution of the radiological image simultaneously; its second stage, called $G_i^r$, is responsible for restoring image details. Thus, the full image generation process of the proposed two-stage GAN can be formulated as
$$O_i = G_i^r\big(G_i^s(c_i, \uparrow O_{i-1})\big),$$
where $c_i$ denotes the (augmented) condition mask at scale $i$ and $\uparrow O_{i-1}$ denotes the output of scale $i-1$ upsampled to scale $i$ (omitted at scale 0). We specially design a hierarchical data augmentation module, which can produce strong augmentation and weak augmentation (detailed in section 2.3), to train such a two-stage GAN. We first perform strong augmentation on the training sample to train $G_i^s$, which enhances the given condition constraints, and then perform weak augmentation to train $G_i^r$ to restore image details as much as possible, despite the possibility of violating the given conditions. Therefore, the two-stage GAN is actually trained by a two-step optimization: (a) enhance the given condition constraints, which may blur image details, and (b) restore image details, which may violate the given conditions. By iterating such a two-step optimization through all image scales progressively, the two-stage GANs at larger image scales are able to generate high-resolution radiological images that match the given conditions strictly and, meanwhile, have clear and accurate image details.

Implementation details. A total of 9 image scales are used in our implementation of CoSinGAN for synthesizing high-resolution chest CT slices: 32×32, 48×48, 64×64, 96×96, 128×128, 192×192, 256×256, 384×384 and 512×512. We purposely choose these image scales to facilitate the design of multi-scale generators with different numbers of down-sampling layers. Specifically, we choose a network architecture similar to the popular pix2pix model [20], including a UNet-style generator and a patch discriminator. Considering the reusability of trained models between two adjacent image scales, we set the numbers of 2× downsampling layers in the UNet-style generators of CoSinGAN to (4, 4, 5, 5, 6, 6, 7, 7, 8), respectively. Meanwhile, the numbers of convolutional layers in the discriminators are set accordingly to (6, 6, 7, 7, 8, 8, 9, 9, 10). At the i-th image scale, we train the generator $G_i$ in the manner of adversarial learning to obtain realistic images. This is done by training $G_i$ to minimize the reconstruction loss $\mathcal{L}_{rec}$ and the adversarial loss $\mathcal{L}_{adv}$ simultaneously, thereby fooling the discriminator $D_i$, i.e., maximizing the probability of the generated image being classified as real. Therefore, our objective for optimizing $G_i$ is
$$G_i^{*} = \arg\min_{G_i}\, \mathcal{L}_{adv}(G_i, D_i) + \lambda\, \mathcal{L}_{rec}(G_i),$$
and the objective for optimizing $D_i$ is
$$D_i^{*} = \arg\max_{D_i}\, \mathcal{L}_{adv}(G_i, D_i).$$
The same adversarial loss $\mathcal{L}_{adv}$ as in pix2pix [20] is adopted in our implementation.
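To make the coarse-to-fine, two-stage training schedule concrete, below is a minimal PyTorch-style sketch. It is not the authors' released implementation: the generators, augmentation and loss are reduced to stand-ins, the adversarial term is omitted, and the iteration counts are shortened; only the scale list, the stage ordering and the Adam settings follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SCALES = [32, 48, 64, 96, 128, 192, 256, 384, 512]  # the 9 image scales used by CoSinGAN

class StageGen(nn.Module):
    """Tiny stand-in for one UNet-style stage generator (the real ones use 4-8 downsampling layers)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, cond, prev):
        return self.net(torch.cat([cond, prev], dim=1))

def augment(cond, img, size, strong=True):
    """Placeholder for the hierarchical augmentation module (random crop/rotation/flip,
    plus elastic transform when strong); here only a paired flip plus resize."""
    cond = F.interpolate(cond, size=(size, size))
    img = F.interpolate(img, size=(size, size))
    if torch.rand(1).item() < 0.5:
        cond, img = cond.flip(-1), img.flip(-1)
    return cond, img

def synthesize(pyramid, cond_full, size):
    """Run the already-trained scales 0..i-1 on the augmented condition, upsampled step by step."""
    out = torch.zeros(cond_full.shape[0], 1, SCALES[0], SCALES[0])
    for s, (g_s, g_r) in zip(SCALES, pyramid):
        c = F.interpolate(cond_full, size=(s, s))
        out = g_r(c, g_s(c, F.interpolate(out, size=(s, s))))
    return F.interpolate(out, size=(size, size)).detach()

def train_cosingan(img, cond, iters_stage1=200, iters_stage2=100):
    """Coarse-to-fine training: at every scale, stage 1 fits the condition under strong
    augmentation, then stage 2 restores details under weak augmentation."""
    pyramid = []
    for size in SCALES:
        g_s, g_r = StageGen(), StageGen()
        for gen, iters, strong, lr in [(g_s, iters_stage1, True, 2e-4),
                                       (g_r, iters_stage2, False, 1e-4)]:
            opt = torch.optim.Adam(gen.parameters(), lr=lr, betas=(0.5, 0.999))
            for _ in range(iters):
                c, x = augment(cond, img, size, strong)
                prev = synthesize(pyramid, c, size)
                fake = g_s(c, prev) if gen is g_s else g_r(c, g_s(c, prev).detach())
                loss = F.l1_loss(fake, x)  # stands in for the mixed reconstruction + adversarial losses
                opt.zero_grad()
                loss.backward()
                opt.step()
        pyramid.append((g_s, g_r))
    return pyramid
```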
Besides, we propose a mixed reconstruction loss, consisting of the weighted pixel-level loss (WPPL), the multi-scale structural similarity (MS-SSIM) [22][23] loss, the multi-scale feature-level VGG [21] loss (MS-FVL) and the multi-scale feature-level UNet [19] loss (MS-FUL). Such a mixed reconstruction loss is able to provide rich and stable gradient information for the optimization of the generators.

WPPL. WPPL computes the weighted mean of the L1 distances between the pixels of the generated image $\hat{I}$ and the real image $I$, where the weight of each pixel is determined by its category, i.e., background, lung or COVID-19 infection:
$$\mathcal{L}_{WPPL} = \frac{1}{P}\sum_{p=1}^{P} M(c_p)\,\big|\hat{I}_p - I_p\big|,$$
where $p$ is the pixel index, $P$ is the total number of pixels, $c_p$ is the category of pixel $p$ and $M$ denotes a mapping function from category to weight. We use the L1 loss rather than the mean squared error (MSE) loss because optimizing the MSE loss tends to produce over-smoothed image details. We suggest relatively higher weights for the regions of the lung and COVID-19 infection to emphasize their reconstruction.

MS-SSIM loss. Different from mean-based metrics like the L1 distance and MSE, SSIM [24] and MS-SSIM [22] are perceptually motivated metrics that evaluate image similarity based on local structure. As discussed in [23], the MS-SSIM loss is differentiable and thus can be back-propagated to optimize the parameters of CoSinGAN. We adopt the MS-SSIM loss [23] to optimize the reconstruction of local anatomical structures.

MS-FVL. The distance between the deep features of two images extracted by a pre-trained CNN classifier is frequently used as a perceptual loss [25][26][27] in image restoration tasks. Compared with pixel-level metrics, perceptual losses are able to obtain visually appealing results. The multi-scale feature-level VGG loss [27] used at the i-th scale of CoSinGAN is formulated as
$$\mathcal{L}_{MS\text{-}FVL} = \sum_{j} \frac{\lambda_j}{P_j}\,\big\lVert F_j(\hat{I}) - F_j(I) \big\rVert_1,$$
where $F_j$ denotes the $j$-th layer, with $P_j$ elements, of the VGG network [21] and $\lambda_j$ denotes the weight of the $j$-th feature scale.

MS-FUL. Similar to MS-FVL, we design a multi-scale feature-level UNet loss, which measures the similarity of two images using the deep features from a pre-trained UNet [19]:
$$\mathcal{L}_{MS\text{-}FUL} = \sum_{k} \frac{\lambda_k}{P_k}\,\big\lVert F_k(\hat{I}) - F_k(I) \big\rVert_1,$$
where $F_k$ denotes the $k$-th layer, with $P_k$ elements, of the UNet network [19] and $\lambda_k$ denotes the weight of the $k$-th feature scale. Compared to the VGG features, which are trained for classification tasks, the UNet features, trained for semantic segmentation, encode much more positional and structural information, and thus are more sensitive to the positional distribution of pixels.

As described in section 2.1, to learn a conditional distribution from one single image, one needs to handle two things well: (a) ensure the generator can generalize to different input conditions, and (b) fit the single image as closely as possible for visually accurate and appealing results. Performing strong data augmentation on the single training image is an effective approach to avoid overfitting, whereas it may corrupt the real data distribution and put an additional burden on the generator, and thus lead to blurry and unreal images. It is critical to design an appropriate data augmentation module to tackle this problem. Accordingly, we propose a hierarchical data augmentation module, involving strong augmentation and weak augmentation, to collaborate with the proposed two-stage GANs at multiple image scales. Specifically, at the i-th image scale, the hierarchical data augmentation module produces strong augmentation (SA) to train $G_i^s$, and produces weak augmentation (WA) to train $G_i^r$. Meanwhile, as the image scale increases, the intensity of SA decreases gradually, whereas WA remains unchanged.
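As a concrete illustration of how the four reconstruction terms described above might be combined in code, here is a minimal PyTorch-style sketch. The term weights (10/1/10/10) and category weights (0.1/0.5/1.0) are those reported later in the experiments; the VGG layer choice, the single-scale SSIM stand-in, and the shape of the UNet feature extractor are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=11):
    """Simplified single-scale SSIM with a uniform window (stand-in for MS-SSIM [22])."""
    mu_x = F.avg_pool2d(x, win, 1, win // 2)
    mu_y = F.avg_pool2d(y, win, 1, win // 2)
    var_x = F.avg_pool2d(x * x, win, 1, win // 2) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, win // 2) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, win // 2) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return s.mean()

class MixedReconstructionLoss(nn.Module):
    """WPPL + SSIM + VGG-feature (+ optionally UNet-feature) reconstruction terms."""
    def __init__(self, unet_encoder=None):
        super().__init__()
        self.vgg = torchvision.models.vgg16(pretrained=True).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.vgg_layers = (3, 8, 15, 22)        # assumed ReLU outputs used as feature scales
        self.unet = unet_encoder                # a pre-trained, nn.Sequential-style UNet encoder
        self.cat_w = {0: 0.1, 1: 0.5, 2: 1.0}   # background / lung / infection weights

    def wppl(self, fake, real, cat_mask):
        # weighted L1 distance: weight of each pixel depends on its category
        w = torch.zeros_like(fake)
        for c, cw in self.cat_w.items():
            w = w + cw * (cat_mask == c)
        return (w * (fake - real).abs()).mean()

    def feat_loss(self, extractor, fake, real):
        # L1 distance between deep features at several depths of a frozen extractor
        loss, f, r = 0.0, fake.repeat(1, 3, 1, 1), real.repeat(1, 3, 1, 1)  # 1-channel CT -> 3 channels
        for idx, layer in enumerate(extractor):
            f, r = layer(f), layer(r)
            if idx in self.vgg_layers:
                loss = loss + F.l1_loss(f, r)
            if idx == max(self.vgg_layers):
                break
        return loss

    def forward(self, fake, real, cat_mask):
        loss = 10.0 * self.wppl(fake, real, cat_mask)
        loss = loss + 1.0 * (1.0 - ssim(fake, real))                # MS-SSIM term (single-scale stand-in)
        loss = loss + 10.0 * self.feat_loss(self.vgg, fake, real)   # MS-FVL
        if self.unet is not None:
            loss = loss + 10.0 * self.feat_loss(self.unet, fake, real)  # MS-FUL (same layer indices assumed)
        return loss
```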
Some augmented images and conditions produced by the hierarchical data augmentation module are visualized in Fig. 4. We design the hierarchical data augmentation module with the following advantages: (1) SA is critical to enable $G_i^s$ to generalize to different input conditions. (2) WA facilitates fitting the real image distribution without introducing an additional learning burden. (3) Decreasing the intensity of SA as the image scale increases balances fitting the conditions against fitting the image. Specifically, we implement the hierarchical data augmentation module based on random cropping, random rotation, random horizontal flipping, random vertical flipping and elastic transforms. SA is designed by composing all these transforms, where the cropping size is between 0.5 and 1 times the image size and the parameters of the elastic transform are set according to the specific image size. In comparison, WA does not use the elastic transform, and the cropping size is between 0.75 and 1 times the image size. It is worth noting that the augmentation imposed on images must always be consistent with the augmentation imposed on conditions. At the i-th image scale, to obtain consistent input for the generator $G_i$, we perform SA or WA on the conditions and use the augmented conditions to generate the previous output $O_{i-1}$ from scale 0 to scale i-1, rather than directly imposing SA or WA on the $O_{i-1}$ generated from the original conditions. Besides, we treat the augmented samples as different samples and thus construct batched samples to realize mini-batch training.

In this paper, we explore the feasibility of learning COVID-19 diagnosis from a single radiological image. We resort to synthesizing diverse radiological images with COVID-19 infection and thus propose a novel conditional GAN, i.e., CoSinGAN, to realize the radiological image generation process. Therefore, we first conduct experiments to evaluate the effectiveness of CoSinGAN in synthesizing high-resolution and high-quality radiological images that match the given conditions and the visual findings of the lung and COVID-19 infection well. Next, we evaluate the effectiveness of the synthesized radiological images from CoSinGAN for training both classification and segmentation networks for COVID-19 diagnosis.

We use the public COVID-19-CT-Seg dataset [28], which consists of 20 public COVID-19 CT scans with pixel-level annotations of the left lung, right lung and COVID-19 infection. The annotations, first labeled by junior annotators, were refined by two radiologists with 5 years of experience, and further verified and refined by a senior radiologist with more than 10 years of experience in chest radiology. In our experiments, we randomly select 15 scans for training and leave the other 5 scans for testing. We slice the scans into 2D slices and resize them to 512×512, thus constituting a training set of 2846 chest CT slices and a test set of 674 chest CT slices. We observe that the CT slices in our materials mainly present two distinct appearances, as shown in Fig. 5, which may be caused by using different Hounsfield unit (HU) ranges during the CT imaging process. For convenience, we roughly treat them as two different modalities, named modality 1 and modality 2 to indicate the difference. We use two representative slices of different modalities, as depicted in Fig. 5, from the training set to train two individual CoSinGANs separately.
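For illustration, such appearance differences can be mimicked by intensity windowing of the raw CT values; the sketch below is an assumption about this preprocessing step (the paper does not specify the windows used), not part of the authors' pipeline.

```python
import numpy as np

def window_slice(hu_slice: np.ndarray, hu_min: float, hu_max: float) -> np.ndarray:
    """Clip a CT slice (in Hounsfield units) to a window and rescale it to [0, 255]."""
    clipped = np.clip(hu_slice.astype(np.float32), hu_min, hu_max)
    return ((clipped - hu_min) / (hu_max - hu_min) * 255.0).astype(np.uint8)

# Hypothetical windows: different window choices produce visually different slice
# appearances, loosely corresponding to what the text calls "modality 1" and "modality 2".
# modality_1 = window_slice(hu, -1000.0, 400.0)
# modality_2 = window_slice(hu, -1400.0, 200.0)
```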
We first conduct ablation experiments on the three key components of CoSinGAN to verify their efficacy. Second, we perform an evaluation and comparison of the image quality of the synthesized radiological images. We finally conduct experiments to test the ability of CoSinGAN to generate diverse samples.

We train CoSinGAN with 9 image scales (32×32, 48×48, 64×64, 96×96, 128×128, 192×192, 256×256, 384×384 and 512×512), which means 9 two-stage GANs need to be trained sequentially from the coarsest to the finest scale. The loss weights of WPPL, MS-SSIM, MS-FVL and MS-FUL in the proposed mixed reconstruction loss are empirically set to 10.0, 1.0, 10.0 and 10.0. As suggested in section 2.2, we set the category weights of background, lung and COVID-19 infection in WPPL to 0.1, 0.5 and 1.0, respectively, to emphasize the reconstruction of the lung and COVID-19 infection. Meanwhile, we set the pixel values of these three categories in the input conditions to 0, 128 and 255, respectively. We apply strong augmentation to train the first stage of these two-stage GANs for 4000 epochs with a mini-batch size of 4, using the Adam optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.999$ and an initial learning rate of 0.0002 that is linearly decayed by 0.05% each epoch after 2000 epochs. Correspondingly, we apply weak augmentation to train the second stage of these two-stage GANs for 2000 epochs with a mini-batch size of 4, using the Adam optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.999$ and an initial learning rate of 0.0001 that is linearly decayed by 0.1% each epoch after 1000 epochs. The batch size at the image scale of 512×512 is set to 2 due to memory limitations. The trained models are further used to synthesize radiological images given input conditions from the training set of our materials.

Mixed reconstruction loss. We introduce the mixed reconstruction loss to provide rich and stable gradient information for the optimization of the generators. To evaluate its efficacy, we train CoSinGAN with a single scale of 256×256 on the single training image of modality 1 by adopting WPPL, the MS-SSIM loss, MS-FVL, MS-FUL, and the mixed reconstruction loss as the reconstruction loss function separately. The training curves, including the adversarial learning curves and reconstruction loss curves, are depicted in Fig. 6. We directly use the first stage of the two-stage GAN to perform this evaluation, thus simplifying CoSinGAN to an architecture similar to the well-known pix2pix model [20]. These trained models are used to generate radiological images with given input conditions, as shown in Fig. 7. Compared to WPPL and the MS-SSIM loss, MS-FVL, which uses deep features from the pre-trained VGG network, tends to produce visually pleasing images with less noise, but at the cost of losing more local image details, as indicated by the pink arrows in the fourth column of Fig. 7. MS-FUL achieves a visual impression similar to MS-FVL, but reconstructs more image details, such as sharp contours and edges (highlighted by the yellow arrows in the fifth column of Fig. 7), than MS-FVL. We argue this is because the deep features from UNet, designed for semantic segmentation, encode much more positional and structural information, which makes MS-FUL sensitive to the positional distribution of pixels. Correspondingly, WPPL and the MS-SSIM loss, which use raw pixel features, can synthesize many more image details, as pointed out by the green arrows in Fig. 7, at the cost of a visually less pleasing result with more background noise.
By combining WPPL, the MS-SSIM loss, MS-FVL and MS-FUL, our mixed reconstruction loss inherits their advantages, lets them complement each other, and thus produces visually pleasing images with less noise and more local details (highlighted by the white arrows). Moreover, as depicted in Fig. 6(b), the model trained with the mixed reconstruction loss achieves consistently smaller WPPL, MS-FVL and MS-FUL values than the same model trained with only one of the loss terms. This strongly confirms the mutual collaboration between the different terms of the mixed reconstruction loss.

Multi-scale architecture and two-stage GAN. We train a complete CoSinGAN with 9 two-stage GANs on the single training image of modality 1. We plot the adversarial learning curve and reconstruction loss curve of CoSinGAN at the image scale of 256×256 in Fig. 6. It is worth noting that all the models compared in Fig. 6 use the same training configuration, except that the complete CoSinGAN is trained gradually from the scale of 32×32 to the scale of 256×256. As can be seen from Fig. 6(a), the complete CoSinGAN presents a better adversarial learning curve than the other models trained with a single scale: the adversarial loss values of the generator G and the discriminator D remain close to each other throughout the entire training process. This indicates that the adversarial training of CoSinGAN is stable and thus G is able to capture the distribution of real images gradually through continuous competition with D. The reconstruction loss curve in Fig. 6(b) also shows that the complete CoSinGAN, trained with the multi-scale architecture, achieves a lower fitting error. Besides, the radiological images produced by the complete CoSinGAN, as depicted in Fig. 7, present a significantly better visual impression with realistic and sharp image details (highlighted by the red arrows). These results strongly verify the effectiveness of the multi-scale architecture. Moreover, we use the complete CoSinGAN to generate images at all 9 scales and compare them in Fig. 8. Each scale includes two synthesized images, one from the first stage and the other from the second stage. We first use red arrows to track the contour of the lung and thus highlight the efficacy of the proposed multi-scale architecture and two-stage GAN in enhancing the condition constraints: the contour of the lung in the synthesized radiological images gradually matches the input conditions. Besides, we use green arrows to track the details of the lung and COVID-19 infection in the synthesized images as the image scale increases. Intuitively, the image details are enhanced progressively. Such results strongly confirm our claim that the multi-scale architecture is able to collaborate with the two-stage GANs by iteratively enhancing conditions and details.

Figure 8. Illustration of the synthesized images at all 9 scales from CoSinGAN (panels show the input condition, scales 0-2 and scales 6-8). Each scale consists of two synthesized images, one from the first stage and the other from the second stage. All images are resized to 512×512 for better visualization.

Hierarchical data augmentation. As shown in Fig. 4, our hierarchical data augmentation module is able to produce strong augmentation (SA) for the first stage of each GAN and weak augmentation (WA) for the second stage. SA is designed to enhance conditions, while WA is used to facilitate the restoration of image details. As can be seen from the first three scales of synthesized images in
Fig. 8, the generators trained with SA in the first stage exhibit strong generalization to the input condition, because the shape of the lung is synthesized consistently with the input condition. The generators trained with WA in the second stage show weaker generalization to the input condition, as the shape of the lung is not as well maintained. Besides, the synthesized images from the second stage tend to be more realistic and have more details than those from the first stage. As the image scale increases, the intensity of SA gradually decreases, whereas the output of CoSinGAN contains more and more condition information. Benefiting from the outputs of previous scales, the generators at later scales, trained with relatively weaker SA, are still able to generalize to input conditions, so more capacity can be devoted to optimizing image details. Such results clearly confirm that our hierarchical data augmentation module provides a good balance between preventing the generator from overfitting to the input condition and allowing it to fit the single training image as closely as possible when learning a conditional distribution from a single image.

Baselines. The pix2pix model [20] is a well-known conditional GAN framework for image-to-image translation. In our implementation, we directly adopt an enhanced pix2pix model to build the two-stage GAN of CoSinGAN. Specifically, we replace the L1 reconstruction loss in the pix2pix model with the proposed mixed reconstruction loss to obtain the enhanced pix2pix model. Accordingly, we use the pix2pix model and the enhanced pix2pix model as our baselines to compare with CoSinGAN and highlight our contributions. It is worth noting that the baseline pix2pix model is also implemented with the weighted L1 reconstruction loss, i.e., WPPL, to emphasize the reconstruction of the lung and COVID-19 infection. We first train these models on the single image of modality 1 using the training settings detailed in section 3.2.1, and then train them on the single image of modality 2.

Qualitative comparison. We first show the synthesized images of modality 1 and modality 2 from our CoSinGANs in Fig. 1 and Fig. 9 to give a direct visual impression. As can be seen, CoSinGAN is very sensitive to the input conditions, as even small isolated regions of COVID-19 infection are reconstructed very well (highlighted by red circles). Meanwhile, these synthesized radiological images present sharp and rich image details with low noise and a clean background, comparable to the single 512×512 training image. The visual appearance of the lung and COVID-19 infection is also synthesized consistently with the training image. This property of CoSinGAN is very important, and facilitates the construction of synthesized training samples with pixel-level annotations of the lung and COVID-19 infection for exploring the feasibility of learning COVID-19 diagnosis from a single radiological image. We then compare the results of different models given the same input conditions in Figs. 10, 11 and 12. These input conditions are sampled from different CT scans, whose corresponding real images, called reference ground-truth images, may present different modalities. As can be seen, our CoSinGAN produces visually appealing results with clear image details and a clean background, significantly better than the baselines and comparable to the reference ground-truth images.
First, the results of the pix2pix model contain too many grid artifacts, leading to visually unpleasant results. Meanwhile, the synthesized lung details are not clear and appear as artifacts, which makes it very difficult to distinguish the synthesized COVID-19 infection from these artifacts (highlighted by yellow arrows). Next, benefiting from the mixed reconstruction loss, the enhanced pix2pix model achieves a better visual impression with fewer grid artifacts and richer lung details compared to pix2pix. Despite that, the synthesized lung details are still not clear enough to be distinguished from the synthesized COVID-19 infection (indicated by the yellow arrows in the third column). Such synthesized images from the pix2pix and enhanced pix2pix models cannot readily be used to learn COVID-19 diagnosis. In comparison, our CoSinGAN effectively solves the problems of grid artifacts and blurred lung details, and can produce high-quality radiological images with accurate details of the lung and COVID-19 infection, facilitating the training of deep models for COVID-19 diagnosis. We finally compare our results with the reference ground-truth images and find that our results achieve comparable image sharpness. We also find that the COVID-19 infection regions in the reference ground-truth images present different but correlated visual appearances in different CT scans, which motivates us to test the ability of CoSinGAN to synthesize diverse radiological images with COVID-19 infection.

Figure: Synthesized radiological images of modality 1 and modality 2 given the same input condition. The input condition and the reference ground-truth radiological image are depicted in the first column. The last three columns are the results of pix2pix, enhanced pix2pix and CoSinGAN, respectively, where in each column the top is modality 1 and the bottom is modality 2.

To quantify the quality of the generated radiological images, we follow the same evaluation method as in [20][27]. We use the baseline segmentation networks, i.e., ENet [29] and UNet [19] (detailed later in section 3.3), that are well trained for lung and COVID-19 infection segmentation on the training set of our materials to segment the synthesized images, and compare how well the segmentation outputs match the corresponding inputs of CoSinGAN. The intuition is that if CoSinGAN can produce realistic radiological images, the segmentation networks trained on real images should be able to segment them well. The common Dice similarity coefficient (DSC) is computed as the image quality score to compare different models. Specifically, we first use pix2pix, enhanced pix2pix and CoSinGAN to synthesize the same number (2846) of radiological images from all the given conditions in the training set of our materials. We then perform lung and COVID-19 infection segmentation on these synthesized images, and calculate the mean DSC scores of the lung and COVID-19 infection as the image quality scores to compare different models. As detailed in Table 1, CoSinGAN achieves the highest image quality scores on both the lung and COVID-19 infection, surpassing the baseline methods by a large margin. This indicates that CoSinGAN has reconstructed the visual appearance of the lung and COVID-19 infection more precisely at the locations specified in the input conditions. Such results strongly confirm the effectiveness of CoSinGAN in learning the conditional distribution of radiological images from a single radiological image.
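A minimal sketch of this segmentation-based image quality score (the `segment` callable stands for a trained ENet/UNet and is an assumption; labels 1 and 2 are assumed to denote lung and infection in the condition masks):

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, label: int, eps: float = 1e-6) -> float:
    """Dice similarity coefficient for one class (e.g., 1 = lung, 2 = infection)."""
    p, t = (pred == label), (target == label)
    return (2.0 * np.logical_and(p, t).sum() + eps) / (p.sum() + t.sum() + eps)

def image_quality_score(segment, synthesized_images, condition_masks):
    """Segment each synthesized image with a model trained on real data and
    compare the prediction against the condition mask used to generate it."""
    lung, infection = [], []
    for img, cond in zip(synthesized_images, condition_masks):
        pred = segment(img)  # e.g., a trained ENet/UNet wrapped as a callable
        lung.append(dice(pred, cond, label=1))
        infection.append(dice(pred, cond, label=2))
    return float(np.mean(lung)), float(np.mean(infection))
```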
Images synthesized by a GAN tend to lack diversity and present a single modality, which does not facilitate the training of deep models. Given an input condition, we expect CoSinGAN to generate diverse samples that are different but correlated in visual appearance. We explore three approaches, collectively called data diversification methods, to enable this ability: applying dropout at test time, randomizing the input condition, and fusing synthesized images of different modalities. We expect these methods to improve the diversity of the synthesized radiological images and thus enable deep models to be trained effectively on synthesized radiological images for COVID-19 diagnosis.

Figure 13. Diversification results of synthesized radiological images. The rows from top to bottom represent the input conditions, synthesized images without any data diversification method, results of applying dropout at test time, results of randomizing the input condition and results of fusing synthesized images of different modalities, respectively. Red and yellow arrows highlight the differences between diversification results.

Applying dropout (AD). Applying dropout at inference time with a dropout rate of 50% can add randomness to the forward propagation of CoSinGAN by randomly inactivating some activation units of the neural network. As can be seen, this operation has only a slight effect on CoSinGAN's output, such as the fading of some image details (indicated by the arrows in the third row of Fig. 13). Thus, this approach may not contribute much to the diversity of the synthesized radiological images.

Randomizing the input condition (RC). During training, the pixel values of background, lung and COVID-19 infection in the input conditions are set to 0, 128 and 255, respectively. After training, we can randomize the input condition by adding random noise to it to synthesize diverse images. Specifically, the pixel values of background, lung and COVID-19 infection are randomly perturbed around their nominal values 0, 128 and 255 by random noises whose magnitudes are bounded by $\delta_b$, $\delta_l$ and $\delta_i$, respectively. In our experiments, we set these magnitudes to 16, 16 and 32, respectively. It is worth noting that such randomness also exists in the multi-scale input conditions. That means the input condition at each image scale may be different, which further promotes the synthesis of diverse samples. As can be seen, RC produces relatively diverse radiological images with notable differences in the background, local lung details and COVID-19 infection (highlighted by the arrows in the fourth row of Fig. 13). Besides, RC does not spoil the sharpness and local details of the synthesized images. Despite the lack of clinical evidence for the diverse appearance of the synthesized COVID-19 infection, such results still confirm that RC is an effective data diversification method.

Image fusion (IF). The radiological images in different chest CT scans present different modalities. A CoSinGAN trained only on the single radiological image of modality 1 cannot smoothly generate radiological images of modality 2. Thus, we propose to directly fuse the synthesized images from two CoSinGANs trained separately, one on the single image of modality 1 and the other on the single image of modality 2. Given the same input condition, the two CoSinGANs are able to generate paired images of two different modalities that match each other pixel-by-pixel.
Accordingly, we can simply fuse the two paired images, without losing image details and sharpness, as
$$I_{fused} = \alpha\, I_1 + (1-\alpha)\, I_2,$$
where $\alpha$ is the fusion coefficient and $I_1$, $I_2$ denote the paired synthesized images of modality 1 and modality 2. We introduce diversity by randomly setting the value of $\alpha$ between 0.0 and 1.0 in our experiments. As can be seen, IF can synthesize radiological images of intermediate modalities. Moreover, the COVID-19 infection regions synthesized by IF are more realistic and may have more clinical relevance than those of RC. Therefore, extending CoSinGAN with IF may have the potential to realize learning deep models for COVID-19 diagnosis from few representative radiological images.

Most studies have adopted classification models, segmentation models, or both to realize automated COVID-19 diagnosis [8]. A classification model takes a radiological image as input and outputs a binary scalar, where 1 indicates COVID-19 infection and 0 represents no COVID-19 infection. In comparison, a segmentation model takes a radiological image as input and outputs a ternary mask that indicates where the lung and COVID-19 infection are located. Thus, we conduct our experiments on both classification and segmentation models to explore the feasibility of learning COVID-19 diagnosis from a single radiological image.

Baselines. We train two baseline classification networks, i.e., ResNet18 and ResNet50 [30], on the training set of our materials for automated COVID-19 diagnosis. Deep residual networks are popular architectures; ResNet18 is a lightweight version whereas ResNet50 is a heavyweight version. Using these two baselines makes the experimental results more convincing.

Training sets. Given all the conditions in the training set, we use the pre-trained CoSinGAN of modality 1 to synthesize radiological images, called the originally synthesized training samples (O-STS). We then impose RC, as described in section 3.2.4, on CoSinGAN to get the first version of diversified training samples, called RC-STS. Similarly, we use the IF method to get the second version of diversified training samples, called IF-STS. We also build one training set with the same single image of modality 1 that is used to train CoSinGAN, called Sin-TS, and another training set with the same two images of modality 1 and modality 2 that are used to train the two individual CoSinGANs, called Two-TS. For convenience, we call the original complete training set of our materials OC-TS. We finally train the baseline networks on OC-TS, Sin-TS, Two-TS, O-STS, RC-STS and IF-STS separately. It is worth noting that CoSinGAN can in theory synthesize an unlimited number of radiological images, but we only use the conditions in OC-TS so as to obtain the same number of radiological images as OC-TS. This means that each training set contains the same number of training samples, i.e., 2846 radiological images, which ensures that the parameters of the models are updated the same number of times during each training process.

Training details and evaluation metrics. We train these baseline models for 10 epochs with a mini-batch size of 16, using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ and an initial learning rate of 0.0001 that is linearly decayed by 20% each epoch after 5 epochs. All the models are trained and evaluated with 3 input channels and an image size of 256×256. We use weights pre-trained on ImageNet to initialize the parameters of our models.
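For concreteness, a minimal sketch of the baseline classifier setup just described (ImageNet-pretrained ResNet backbone with a single binary output; the logit-based BCE variant and the helper name are assumptions, not the authors' exact code):

```python
import torch
import torch.nn as nn
import torchvision

def build_covid_classifier(arch: str = "resnet18") -> nn.Module:
    """Baseline classifier: ImageNet-pretrained ResNet18/50 with a single-logit head
    (1 = COVID-19 infection present in the slice, 0 = absent)."""
    model = getattr(torchvision.models, arch)(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 1)  # replace the 1000-way ImageNet head
    return model

model = build_covid_classifier("resnet50")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = nn.BCEWithLogitsLoss()                 # binary cross-entropy applied to the logit
```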
We perform the same strong augmentation (SA) that is used to train CoSinGAN on Sin-TS, Two-TS, O-STS, RC-STS and IF-STS, and the same weak augmentation (WA) on OC-TS. The binary cross-entropy (BCE) loss is adopted. All the trained models are finally evaluated and compared on the test set (674 real chest CT slices from 5 CT scans) of our materials using the commonly used metrics of sensitivity, specificity and accuracy. After that, we repeat the same training and evaluation process with an input image size of 512×512.

Results and discussion. We report our evaluation results in Table 2 and Table 3. First, in the training process on Sin-TS and Two-TS, we observe that the training loss decreases rapidly to less than 0.01 within 1 or 2 epochs even with strong data augmentation, leading to poor classification results, as shown in the second and third rows of Tables 2 and 3. Next, the models trained on synthesized images achieve consistently better classification accuracy than those trained on Sin-TS and Two-TS. Specifically, RC-STS, synthesized by randomizing the input conditions of CoSinGAN, achieves slightly higher accuracy than the originally synthesized training samples O-STS, except for the case of ResNet50 trained with an image size of 512×512 (the last columns in the third and fourth rows of Table 3). We argue this is because RC can synthesize radiological images with diverse appearances of COVID-19 infection, which facilitates the training of deep models for COVID-19 diagnosis; however, such synthesized infection patterns, not clinically confirmed, may also mislead deep models. Finally, IF-STS, obtained by fusing the paired images of two different modalities from two CoSinGANs, achieves notable classification accuracy of COVID-19 infection, significantly better than Sin-TS, Two-TS, O-STS and RC-STS, and even comparable to OC-TS. Such results confirm that our CoSinGAN has the potential to realize learning COVID-19 diagnosis from few representative radiological images.

Table 2. Evaluation results of the baseline classification networks trained on different training sets with an image size of 256×256. The second column of this table gives the number of real samples used in the entire training process (including the training of the CoSinGANs). As can be seen, the fused radiological images synthesized by our CoSinGAN using only two real images achieve notable classification accuracy of COVID-19 infection. The 95% confidence intervals for the evaluation results on the 5 CT scans in the test set of our materials are calculated using Student's t-distribution with (5 - 1) degrees of freedom (although values larger than 1.0 or smaller than 0.0 are meaningless, we keep them to highlight the differences between results).

Table 3. Evaluation results of the baseline classification networks trained on different training sets with an image size of 512×512. The second column of this table gives the number of real samples used in the entire training process (including the training of the CoSinGANs). As can be seen, the fused radiological images synthesized by our CoSinGAN using only two real images achieve notable classification accuracy of COVID-19 infection. The 95% confidence intervals for the evaluation results on the 5 CT scans in the test set of our materials are calculated using Student's t-distribution with (5 - 1) degrees of freedom (although values larger than 1.0 or smaller than 0.0 are meaningless, we keep them to highlight the differences between results).

Baselines.
We train two baseline segmentation networks, i.e., ENet [29] and UNet [19], on the training set of our materials. ENet is a well-known segmentation network that shows a good trade-off between accuracy and inference speed [29][31]. UNet is one of the most successful segmentation frameworks in medical imaging. In comparison, ENet is a lightweight network whereas UNet is a heavyweight network. Using these two baselines makes it easier to obtain convincing results.

Training sets. We use the same training sets as detailed in section 3.3.1, including OC-TS, Sin-TS, Two-TS, O-STS, RC-STS and IF-STS. It is worth noting that each training set contains the same number of training samples, i.e., 2846 radiological images, which ensures that the parameters of the models are updated the same number of times during each training process.

Training details and evaluation metrics. We train these baseline models for 50 epochs using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We adopt a mini-batch size of 8 for ENet and 2 for UNet due to memory limitations. We use an initial learning rate of 0.0001 that is linearly decayed by 4% each epoch after 25 epochs. All the models are trained from scratch with 1 input channel and an image size of 256×256. We perform the same strong augmentation (SA) that is used to train CoSinGAN on all training sets. Besides, a category-weighted cross-entropy loss is adopted to emphasize the optimization of COVID-19 infection segmentation, where the weights of background, lung and COVID-19 infection are set to 0.1, 1.0 and 5.0. All the trained models are finally evaluated and compared on the test set (674 real chest CT slices from 5 CT scans) of our materials using the Dice similarity coefficient (DSC). Meanwhile, we also compute the DSC scores on the subset of modality 1 and on the subset of modality 2 separately to make a more detailed comparison.

Table 4. Evaluation results of the baseline segmentation networks. The second column of this table gives the number of real samples used in the entire training process (including the training of the CoSinGANs). As can be seen, the fused radiological images synthesized by our CoSinGAN using only two real annotated images achieve notable segmentation accuracy of the lung and COVID-19 infection. The 95% confidence intervals for the overall evaluation results on the 5 CT scans in the test set of our materials are calculated using Student's t-distribution with (5 - 1) degrees of freedom (although values larger than 100.0 or smaller than 0.0 are meaningless, we keep them to highlight the differences between results).

Results and discussion. The segmentation scores measured by DSC are detailed in Table 4. As can be seen, the synthesized training sets, including O-STS, RC-STS, and IF-STS, consistently outperform Sin-TS and Two-TS by a large margin in COVID-19 infection segmentation. Considering the domain discrepancy between different modalities, we first compare Sin-TS, O-STS, and RC-STS, which all use one real image of modality 1, specifically on the subset of modality 1; we find that O-STS and RC-STS achieve notable infection segmentation scores, much higher (by more than 20%) than Sin-TS and even comparable (within a gap of less than 20%) to OC-TS, which contains 2846 real images.
Such results indicate that deep segmentation models trained on samples synthesized by CoSinGAN can generalize to other image modalities better than the same models trained directly on a single real image with strong data augmentation. Besides, we also notice that RC-STS obtains higher infection segmentation scores than O-STS, and such gaps are more obvious on the subset of modality 2 (3.4% for ENet and 24.6% for UNet). We argue this is caused by the use of the proposed RC method (i.e., randomizing the input condition of CoSinGAN) in RC-STS. We designed the RC method to enable CoSinGAN to generate diverse samples, expecting to improve the generalization ability of deep models trained on synthesized samples; thus, such results confirm the efficacy of RC. Next, we compare the results of Two-TS and IF-STS, which both use an additional real image of modality 2. We observe that the additional real image significantly improves the infection segmentation scores on the subset of modality 2. Besides, we find that IF-STS achieves notable infection segmentation scores, much higher (by 9.6% for ENet and 26% for UNet) than Two-TS, and even approaching (with gaps of 19.6% for ENet and 21.8% for UNet) OC-TS, which contains 2846 real images. Such results strongly confirm that our methods have the potential to reduce the segmentation performance gap between deep models trained on extremely small image datasets and on large image datasets.

The highly contagious COVID-19 has spread rapidly and overwhelmed healthcare systems across the world. Automated infection measurement and COVID-19 diagnosis at the early stage are critical to prevent the further evolution of the COVID-19 pandemic. Unfortunately, collecting large training data systematically in the early stage is difficult. To address this problem, in this paper we explore approaches for learning deep models for COVID-19 diagnosis from a single radiological image by resorting to synthesizing diverse radiological images. We propose CoSinGAN, which can precisely learn the conditional distribution of the visual findings of COVID-19 infection from a single radiological image and effectively synthesize diverse, high-resolution and high-quality radiological images with COVID-19 infection. Both deep classification and segmentation networks trained on samples synthesized by CoSinGAN (using 1 or 2 real images) achieve notable detection accuracy of COVID-19 infection. This strongly confirms that our method has the potential to realize learning deep models for COVID-19 diagnosis from few radiological images in the early stage of a COVID-19 pandemic. Owing to its strong ability to learn the conditional distribution of the visual findings of COVID-19 infection from a single radiological image, our CoSinGAN can also be used to perform semantic manipulation, for instance, the addition and removal of COVID-19 infection. By adding COVID-19 infection to off-the-shelf radiological images, we may obtain training samples that are much more diverse and thus may achieve much better detection accuracy of COVID-19 infection.
References

[1] J. T. Wu, K. Leung, and G. M. Leung, Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study.
[2] Pathological findings of COVID-19 associated with acute respiratory distress syndrome.
[3] Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study.
[4] Chest CT for typical 2019-nCoV pneumonia: relationship to negative RT-PCR testing.
[5] Correlation of chest CT and RT-PCR testing in Coronavirus Disease 2019 (COVID-19) in China: a report of 1014 cases.
[6] Coronavirus disease 2019 (COVID-19): role of chest CT in diagnosis and management.
[7] Sensitivity of chest CT for COVID-19: comparison to RT-PCR.
[8] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19.
[9] Artificial intelligence-enabled rapid diagnosis of COVID-19 patients.
[10] Deep learning COVID-19 features on CXR using limited training data sets.
[11] Chest CT findings in 2019 novel coronavirus (2019-nCoV) infections from Wuhan, China: key points for the radiologist.
[12] Diagnosis of coronavirus disease 2019 (COVID-19) with structured latent multi-view representation learning.
[13] COVID-19 chest CT image segmentation -- a deep convolutional neural network solution.
[14] Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
[15] COVID-19: automated detection from X-ray images utilizing transfer learning with convolutional neural networks.
[16] Within the lack of chest COVID-19 X-ray dataset: a novel detection model based on GAN and deep transfer learning.
[17] CovidGAN: data augmentation using auxiliary classifier GAN for improved COVID-19 detection.
[18] SinGAN: learning a generative model from a single natural image.
[19] U-Net: convolutional networks for biomedical image segmentation.
[20] Image-to-image translation with conditional adversarial networks. In: Computer Vision and Pattern Recognition (CVPR).
[21] Very deep convolutional networks for large-scale image recognition.
[22] Multiscale structural similarity for image quality assessment.
[23] Loss functions for image restoration with neural networks.
[24] Image quality assessment: from error visibility to structural similarity.
[25] Generating images with perceptual similarity metrics based on deep networks.
[26] Perceptual losses for real-time style transfer and super-resolution.
[27] High-resolution image synthesis and semantic manipulation with conditional GANs.
[28] Towards efficient COVID-19 CT annotation: a benchmark for lung and infection segmentation.
[29] ENet: a deep neural network architecture for real-time semantic segmentation.
[30] Deep residual learning for image recognition.
[31] Constrained-CNN losses for weakly supervised segmentation.