GASNet: Weakly-supervised Framework for COVID-19 Lesion Segmentation
Zhanwei Xu, Yukun Cao, Cheng Jin, Guozhu Shao, Xiaoqing Liu, Jie Zhou, Heshui Shi, Jianjiang Feng
2020-10-19

Segmentation of infected areas in chest CT volumes is of great significance for further diagnosis and treatment of COVID-19 patients. Due to the complex shapes and varied appearances of lesions, a large number of voxel-level labeled samples are generally required to train a lesion segmentation network, which is a main bottleneck for developing deep learning based medical image segmentation algorithms. In this paper, we propose a weakly-supervised lesion segmentation framework that embeds a Generative Adversarial training process into a Segmentation Network, called GASNet. GASNet is optimized to segment the lesion areas of a COVID-19 CT with the segmenter, and to replace the abnormal appearance with a generated normal appearance using the generator, so that the restored CT volumes are indistinguishable from healthy CT volumes by the discriminator. GASNet is supervised by chest CT volumes of many healthy and COVID-19 subjects without voxel-level annotations. Experiments on three public databases show that, using as few as one voxel-level labeled sample, GASNet performs comparably to fully-supervised segmentation algorithms trained on dozens of voxel-level labeled samples.

Statistics related to the lesion area are part of the criteria for determining the severity of COVID-19 [3] [4]. However, lesion segmentation is challenging because the lesion areas are extremely varied. Three typical COVID-19 CT scans from a public dataset [5] are shown in Fig. 1. The lesions range from small to large, and their appearance may be ground-glass opacity, consolidation, or a mixed type. Due to the blurry and indistinguishable boundaries between infected and healthy areas, voxel-level labeling of lesions is not only time-consuming but also prone to inconsistency between annotators. Fig. 1 also shows the infection annotation provided as ground truth (GT) with the dataset, along with manual segmentation results by two other radiologists. Since the boundary of the infected area is very fuzzy, even the segmentations given by two experienced radiologists are visibly inconsistent with the GT. Deep learning has shown encouraging performance for lesion segmentation of COVID-19 CT, but only when a sufficient amount of labeled data, such as thousands of slices, is available [6] [7] [8] [9]. It takes more than 200 minutes on average to annotate the lesion area of one COVID-19 CT volume [8]. The high cost of collecting expert annotations is a major obstacle to developing medical image segmentation algorithms for new diseases like COVID-19. Data augmentation [10] [11] and image synthesis [12] [13] may alleviate the lack of pixel/voxel-level annotations to a varying degree. Self-learning or active learning [14] [15] [16] updates the segmentation model by iteratively assigning pseudo labels to unlabeled data, hoping to gradually improve precision. Other methods try to make up for the lack of pixel/voxel-level supervision by using image/volume-level labels, for example via Class Activation Maps (CAMs) [17], Generative Adversarial Networks (GANs) [18], or Multiple Instance Learning (MIL) [19].
However, these methods have more or less the following problems: (1) a certain number of samples with pixel/voxel-level annotations are still necessary; (2) using pseudo-label data may introduce noise; and (3) the mapping from volume-level annotation information to voxel-level segmentation is usually not accurate enough. A detailed discussion of related medical image segmentation methods is provided in section II.

Our idea is to 'restore' the CT volume of a COVID-19 patient to its status when he/she was healthy by combining a segmentation network and a generative network. Restoration performance is supervised by a discriminator that is trained using CT scans of many healthy people and COVID-19 patients (without voxel-level labeling of lesion areas). This scheme is feasible since a large number of volume-level labels, indicating whether a CT volume is COVID-19 positive or not, are directly available from diagnosis results in COVID-19 designated hospitals and are more reliable [20] than manually obtained voxel-level annotations.

Fig. 1. Typical CT scans of three COVID-19 patients from the public Dataset-A [5]. The second to fourth rows show the annotations provided with the dataset (ground truth) and by two other radiologists (A and B) from Wuhan Union Hospital. The differences between the annotations are obvious.

The proposed framework, referred to as GASNet, is designed to mine the potential knowledge contained in many COVID-19 positive and negative CT volumes by embedding Generative Adversarial training in a standard Segmentation Network; hence its demand for voxel-level annotations is very small. Fig. 2 shows the pipeline of GASNet. Both the generator and the segmenter take a COVID-19 CT volume as input, and their two outputs, together with the original CT volume, are fused to form a synthetic healthy volume. Both real and synthetic healthy volumes are fed to the discriminator. During training, the goal of the discriminator is to distinguish between synthetic and real healthy volumes, while the goal of the generator and the segmenter is to deceive the discriminator. Such an adversarial training strategy pushes the segmenter to segment the lesion areas of a COVID-19 CT as precisely as possible. We also propose a simple but effective strategy of synthesizing COVID-19 CT volumes with voxel-level pseudo-labels during the adversarial training process, which further improves the segmentation performance of GASNet. A detailed description of the algorithm is given in section III. Compared with other weakly-supervised methods, a major advantage of GASNet lies in utilizing volume-level labels in an adversarial learning way, alleviating the burden of voxel-level annotation while maintaining good segmentation performance. When using only one voxel-level labeled sample in training, GASNet obtains a 70% Dice score on a public COVID-19 lesion segmentation dataset [5], comparable to representative fully-supervised algorithms (U-Net [21], V-Net [22], and UNet++ [23]) requiring a large number of voxel-level annotated samples. Code of GASNet is available at https://github.com/xzwthu/GASNet. Details of the experiments are in section IV and section V.
In this section, we first introduce existing public COVID-19 CT databases containing lesion annotations, then review current COVID-19 lesion segmentation methods, both fully-supervised and weakly-supervised, and finally describe recent weakly-supervised and GAN-based methods in general medical image segmentation.

Performance evaluation on public datasets is very important for comparing different image segmentation algorithms. Being an emerging research direction, most COVID-19 studies were conducted independently [6] [7] [8], using non-public data. Very recently, a few public databases have become available [5] [24] [25]. A brief description of these databases is necessary before we discuss the performance of different algorithms. We summarize the current public COVID-19 CT segmentation datasets in Table I; detailed descriptions are in section IV.

Most COVID-19 lesion segmentation methods [6] [7] [8] are based on the U-Net structure [21] or its modifications, containing an encoding path and a decoding path connected by skip connections at the corresponding resolutions. Zhang et al. [6] adopt a two-stage segmentation framework for segmenting lung lesions into five classes. They train on a total of 4,695 CT slices with voxel-level annotations and obtain an mDice score of 58.7%. Wu et al. [7] jointly train a segmentation network and a classification network, using over 144K slices including 3,855 voxel-level labeled CT slices from 200 COVID-19 patients. They obtain a 78.3% Dice score on their dataset. Shan et al. [8] use a 3D VB-Net as the backbone and employ a Human-In-The-Loop (HITL) strategy to train the network on 400 CT volumes. The HITL strategy reduces the annotation time and improves the accuracy, but the net time spent on labeling data is still more than 176 hours; they report a 91% Dice score on their own dataset.

Fig. 2. Three modules with optimizable parameters compose the framework of GASNet: the segmenter (S), the generator (G), and the discriminator (D). Only the S part is needed during the test. The input of the pipeline is presented in 2D for clarity; in fact, the input to each module of GASNet is the entire 3D volume.

However, [7] [8] have not published their code, nor did they report their performance on public datasets. We reproduce the network structures of these works and test their performance on Dataset-A [5] following a 5-fold cross-validation strategy. The performance is slightly improved compared with a plain U-Net, with Dice scores of 64% and 63%, versus 62% for U-Net. For more details, please refer to section V. Besides U-Net, other deep models have also been used for COVID-19 lesion segmentation. Fan et al. [27] propose modules named Parallel Partial Decoder and Reverse Attention Module to improve lesion segmentation performance. They also conduct a test with a semi-supervised strategy, collecting an unlabeled dataset and assigning pseudo labels iteratively, and obtain a Dice score of 59.7% on Dataset-B [24]. Qiu et al. [9] propose a lightweight 2D model pre-trained on the ImageNet dataset [28] and obtain performance comparable to heavy models like the fully convolutional network (FCN) [29] (77% vs. 75%) on a dataset consisting of 110 axial CT slices from ∼60 patients with COVID-19 [30]. The latest research begins to explore lesion segmentation of COVID-19 volumes in weakly-supervised scenarios. Laradji et al. [31] propose to train a neural network with active learning in a point-level annotation scenario.
Yao et al. [32] design a set of simple operations to synthesize lesion-like appearances, generate paired training data by superimposing synthesized lesions on the lung regions of healthy images, and train a model to predict the healthy lung part of the input. A set of specially designed steps combining threshold selection, morphological processing, and region growing is used to determine the lesion segmentation at test time. Zhang et al. [33] also use a GAN as we do, but the purpose of the GAN in their method is to perform data augmentation based on existing voxel-level labeled samples, so as to generate more paired samples with pseudo labels. Two segmentation networks, ENet [34] and U-Net, are trained to verify the effectiveness of the proposed data augmentation. Different from the above methods, we focus on designing a weakly-supervised segmentation framework under volume-level label supervision. Our framework simultaneously trains the GAN and the segmentation network and dynamically extracts the volume-level annotation information through adversarial learning, thus minimizing the requirement for voxel-level annotations. A comparison of GASNet with the above methods in terms of dataset division, number of annotations, and segmentation performance is given in Table II, and a more detailed comparison will be given in section IV.

Various methods using weak annotations have been proposed in the medical image segmentation area. Several works are devoted to the use of extra but sparse annotations, including scribbles [35], points [36] [37], and bounding boxes [38] [39]. Scribbles and points require labeling at least one scribble or point for each RoI, and the labeled areas are used to calculate the segmentation loss directly. As for the unlabeled part, Wang et al. [36] propose generating initial segments via a random walker algorithm [40] and then training a fully-supervised segmentation network. Qu et al. [37] propose a similar pipeline with a different label generation method, combining K-means clustering results and a Voronoi partition diagram. Instead of generating pseudo labels for the unlabeled areas, Valvano et al. [35] directly predict the segmentation results, adding shape constraints through multiple GANs to make the segmentation results look realistic at multiple scales. Bounding boxes provide a better-refined position constraint for segmentation but are more time-consuming to annotate [38] [39]. The major limitation of the aforementioned approaches is their reliance on additional annotations, which are time-consuming to collect and prone to errors (for example, not all voxels in a bounding box should be positive, and scribble and point annotations can miss challenging samples), and these errors can propagate into the models during training. GAN-based methods such as [35] even need unpaired real segmentation masks, which are voxel-level labels, as the real samples for the discriminator. Weakly-supervised learning under volume-level label supervision earns increasing interest in medical image segmentation because it adds no annotation burden. Xu et al. [41] enrich volume-level labels to instance-level labels by multiple instance learning (MIL) and segment histopathology images using only volume-level labels. However, MIL shows unsatisfactory performance on lesion segmentation of COVID-19, as shown in section IV. Feng et al.
[42] propose a method specifically for pulmonary nodule segmentation that learns discriminative regions from the class activation maps (CAMs) of convolution units in an image classification model. Ouyang et al. [43] employ the attention masks derived from a volume-level classification model as voxel-level masks for weakly-annotated data. Because the attention masks are rough and inaccurate, hundreds of voxel-level annotations are still necessary for accurate lesion segmentation, e.g., pneumothorax segmentation in chest X-ray [43]. GAN is increasingly adopted as an aid to medical image segmentation. The mainstream directions of GAN-based methods include: (1) synthesizing more available training sample pairs [12] [13] [44], where the GAN is a tool for data augmentation and the training of the segmentation network gives no feedback on the quality of the synthetic data; (2) domain adaptation to leverage external labeled datasets [45] [46] [35], where the external dataset is required to contain enough pixel/voxel-level labeled training samples; and (3) treating the segmentation network as a generator and designing the discriminator as an FCN [29] that outputs a confidence map of the segmentation prediction, which then guides the optimization of the segmentation network [15] [47] [48]. Such methods do not use volume-level annotations, and their requirement for voxel-level labeled samples is considerable.

In this section, we first illustrate the pipeline of GASNet. Then, we describe the auxiliary constraint terms, in the form of loss functions, used to make training more stable and GASNet perform better. We also detail a simple but effective method of generating COVID-19 positive CT volumes with voxel-level pseudo-labels to improve the segmentation performance of GASNet. Finally, we provide the implementation details, including the specific structure, data preprocessing, and the training hyperparameters.

GASNet consists of three modules: the generator (G), the discriminator (D), and the segmenter (S). The data input to GASNet includes a small amount of voxel-level labeled data I_l and a large amount of volume-level labeled data I_d and I_h, where I_d denotes diseased volumes and I_h denotes healthy volumes. Our method is based on a simple fact: the appearance of the lesion area is the most distinctive feature separating COVID-19 CT from healthy CT. We train a segmenter that provides segmentation masks and utilize a generator to replace the predicted lesion area with a generated one whose appearance is close to the uninfected area, while keeping the uninfected area unchanged. If the synthetic healthy volumes successfully deceive the binary classifier, which is the discriminator in GASNet, we can obtain a sufficiently accurate segmentation result. The synthetic volume is formulated by

I_s = M̂ ⊙ I_g + (1 − M̂) ⊙ I_d, (1)

where ⊙ denotes voxel-wise multiplication, M̂ = S(I_d; θ_S) is the predicted lesion mask, I_g = G(I_d; θ_G) is the generated volume, and θ_S, θ_G are the learnable parameters of S and G, respectively. To fully deceive the discriminator, the segmenter needs to segment all infected areas and the generator needs to generate confusing appearances in the predicted lesion area. In contrast, the discriminator tries to distinguish the synthetic volume from the real healthy one. We label the synthetic volume as 1 and the real healthy volume as 0, and train GASNet in an adversarial way via the following minimax game:

min_{θ_S, θ_G} max_{θ_D} L_GAN(G, D, S),

where the objective function L_GAN is given by

L_GAN = E_{I_d}[log D(I_s; θ_D)] + E_{I_h}[log(1 − D(I_h; θ_D))],

where D(I; θ_D) is the prediction of D, I_s is formulated by Eq. 1, and θ_D represents the learnable parameters of D. (For simplicity, we write L_GAN for L_GAN(G, D, S) and E_{I_h} for E_{I_h ∼ p_data(I_h)}, and likewise for E_{I_d}.)
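To make the formulation concrete, the following is a minimal PyTorch sketch of Eq. 1 and the adversarial objective, with binary cross-entropy standing in for the log terms of the minimax game. The module interfaces (a segmenter returning a soft mask in [0, 1], a generator returning a volume, a discriminator returning a probability) are assumptions for illustration, not the authors' exact implementation; see the GitHub repository for that.

```python
# Sketch of Eq. 1 and the adversarial losses. Assumed interfaces:
# segmenter(x) -> soft lesion mask in [0, 1]; generator(x) -> generated volume;
# discriminator(x) -> probability that x is synthetic (label 1).
import torch
import torch.nn.functional as F

def synthesize(segmenter, generator, I_d):
    """Eq. 1: paste the generated appearance into the predicted lesion area."""
    M_hat = segmenter(I_d)                    # (B, 1, D, H, W), values in [0, 1]
    I_g = generator(I_d)
    I_s = M_hat * I_g + (1.0 - M_hat) * I_d   # synthetic "healthy" volume
    return I_s, M_hat, I_g

def d_step_loss(discriminator, I_s, I_h):
    """D maximizes L_GAN: synthetic volumes -> 1, real healthy volumes -> 0."""
    p_fake = discriminator(I_s.detach())      # detach so only D is updated here
    p_real = discriminator(I_h)
    return (F.binary_cross_entropy(p_fake, torch.ones_like(p_fake)) +
            F.binary_cross_entropy(p_real, torch.zeros_like(p_real)))

def sg_step_loss(discriminator, I_s):
    """S and G minimize L_GAN: push D to call the synthetic volume real (0)."""
    p_fake = discriminator(I_s)
    return F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
```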
As the formation of the synthetic volume involves both the prediction mask and the generated volume, the gradient of L_GAN feeds back to both S and G. Also, we add a basic segmentation loss measuring the difference between the output of S and the GT of the small number of voxel-level labeled samples: L_S = CEL(M̂_l, M_l), where M̂_l = S(I_l; θ_S), M_l is the ground truth of the voxel-level labeled data I_l, and CEL(·,·) denotes the cross-entropy loss (we write L_S for L_S(S) for simplicity).

Logically and theoretically, provided that we train G, D, and S carefully, the synthesized volume will be nearly indistinguishable from a healthy volume. However, frameworks with GANs are generally difficult to train [48] [49] [50] [51], and the quality of the generator and the discriminator is the crux of our ultimate goal of segmenting infected areas accurately. Several auxiliary constraints are therefore added to the loss functions to make the adversarial training more stable, leading to better performance.

First, the naive GASNet suffers from input bias: the segmenter is fed only diseased volumes. This sample bias leads to false-positive predictions on healthy samples during testing. For healthy CT volumes, we expect the predicted segmentation maps to be all zero and the output of the generator to be a reconstruction of the original input. Therefore, healthy volumes are also fed into the segmenter and the generator, and the cross-entropy loss between S(I_h; θ_S) and M_h, where M_h is all zero, is added to L_S.

Second, no supervision constrains the parts of the generated volume where the segmentation values are close to zero, because they are not used to form the synthetic volume; the quality of the generated volume is thus uncontrollable due to the lack of a supervision signal, which becomes a bottleneck for the final performance of the segmenter. A reconstruction loss on the output of the generator, L_recons = MSE(G(I_h; θ_G), I_h), where MSE(·,·) is the mean squared error, alleviates this problem. We also introduce an additional loss L_IgToD = E_{I_d}[log D(G(I_d; θ_G); θ_D)] into L_GAN for further improvement, feeding the generated volume of I_d into D. Fig. 3 compares the generated volumes before and after adding L_recons and L_IgToD: the quality of the generated volume and the synthetic volume is significantly improved, and so is the performance of the segmenter, as detailed in section V. When the input of S is a healthy volume, the forward propagation of synthesis and discrimination is not needed, and L_GAN is not calculated.

As training proceeds and the synthetic volume gets closer to a real healthy volume, the lesion signal that can be captured by D becomes weaker and weaker. D then tends to learn the noise signal between I_s and I_h rather than the pathological signal, which leads to performance collapse of GASNet. Fig. 4 gives an example where performance collapse happens during training: the segmenter segmented not only the lesion area but also healthy areas, modifying both the pathological and the noise signals of the synthetic volume to confuse the discriminator, which leads to extremely low segmentation performance.
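The auxiliary constraints introduced so far could be sketched as follows, continuing the assumed interfaces of the previous snippet; binary cross-entropy again stands in for the paper's CEL, and the helper name is illustrative.

```python
# Sketch of the first group of auxiliary constraints (assumed interfaces as above).
def auxiliary_losses(segmenter, generator, discriminator, I_h, I_d):
    # Healthy volumes are also fed to S; the target mask M_h is all zero,
    # which counters the input bias toward diseased volumes.
    M_h_hat = segmenter(I_h)
    loss_healthy = F.binary_cross_entropy(M_h_hat, torch.zeros_like(M_h_hat))

    # L_recons: on healthy inputs the generator should act as an autoencoder.
    loss_recons = F.mse_loss(generator(I_h), I_h)

    # L_IgToD: the generated version of a diseased volume is also shown to D.
    # From D's side it is labeled 1 (not a real healthy volume); S and G in
    # turn try to drive this prediction toward 0, as in the minimax game.
    p_g = discriminator(generator(I_d))
    loss_ig_to_d = F.binary_cross_entropy(p_g, torch.ones_like(p_g))

    return loss_healthy, loss_recons, loss_ig_to_d
```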
Inspired by the idea of dropout in weakly-supervised localization [52] [53] [54], where a dropout layer randomly decides whether distinguishing features are passed to the next layer of the classification network, we also feed the original diseased volume I_d to D in order to maintain the sensitivity and discriminability of the discriminator to the lesion signal during training (this corresponds to a dropout ratio fixed at 0.5). A constraint loss L_IdToD = E_{I_d}[log D(I_d; θ_D)] is added to L_GAN, so that D can always distinguish between volumes of patients and of healthy people. Fig. 5 compares the training curves with and without the auxiliary constraint L_IdToD and shows that L_IdToD markedly alleviates the performance collapse of GASNet. Finally, as the data I_d has no voxel-level annotations, the final S may fail to segment any lesion area for some mildly infected CT volumes. Inspired by MIL [19], we add a MIL loss to L_S: L_MIL = −log(max(S(I_d; θ_S))), meaning that at least one voxel of a diseased volume should be predicted as positive.

To sum up, we extend the loss functions L_GAN and L_S by adding four new losses as auxiliary constraints: L_GAN is extended with L_IgToD and L_IdToD, and L_S with L_recons and L_MIL. The final objective function is

min_{θ_S, θ_G} max_{θ_D} (L_GAN + L_IgToD + L_IdToD) + λ_s (L_S + L_recons + L_MIL),

where λ_s balances the two groups of losses (its value is given in the implementation details).

With the losses detailed in subsection III-B, GASNet can be trained stably and achieves good performance. We can further improve the segmentation performance by synthesizing COVID-19 positive CT volumes with voxel-level pseudo-labels during the training process. Given an unlabeled COVID-19 volume I_d with its predicted lesion segmentation mask M̂ = S(I_d; θ_S), and a healthy volume I_h with its predicted lung mask M_lung, which can be obtained by existing automatic algorithms [55], we synthesize a COVID-19 positive volume I_ps and its corresponding pseudo-label M_ps by linearly blending the predicted lesion area of I_d into the lung region of I_h with a fusion weight ξ; in M_ps, label 1 marks lesion voxels, label 0 the background, and label 2 voxels whose labels are not considered when calculating the loss. We set ξ to 0.3 in our experiments. Different from the synthesis method in [32], in which the distribution and shape of the lesions added to healthy volumes are artificially defined, the lesion areas of our synthetic data are dynamically extracted from the real COVID-19 positive volumes. Different from [33], in which the synthetic volume is generated through complex cascaded generative networks given a label map of lesion and lung, our synthetic data is formed by simple linear weighted fusion of real infectious areas and real healthy volumes.

Fig. 6. The lesion area is labeled (1), the yellow area is the part ignored when calculating the loss (2), and the other areas represent the background (0). The last column in Fig. 7 follows the same rule.

Fig. 7 gives three examples of I_ps and the corresponding M_ps. The synthetic COVID-19 volumes look natural and diverse. Relying on the generated paired data I_ps and M_ps, we can alleviate the problem of insufficient voxel-level labeled samples by adding the corresponding voxel-level cross-entropy loss L_ps = CEL(S(I_ps; θ_S), M_ps) to L_S during training. L_ps boosts the segmentation performance of GASNet by 5.5% in our experiments, as shown in section V.

a) Data pre-processing: The 3D volume of each CT is cropped along the lung mask, and the cropped volume is resized to 40×160×160. Following the advice of [5], we clip the values of CT volumes to [-1250, 250]. As a tanh operation produces the generated volume and its output ranges from -1 to 1, we normalize the input volume to the same range.
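As a sketch, the pre-processing described above might look as follows; the approximate lung mask is assumed to be available (the next paragraph describes how it is obtained), and the function name is illustrative.

```python
# Sketch of the pre-processing: crop to the lung bounding box, clip HU values
# to [-1250, 250], normalize to [-1, 1] (matching the generator's tanh output),
# and resize to 40x160x160. An approximate lung mask is assumed to be given.
import numpy as np
import torch
import torch.nn.functional as F

def preprocess(volume_hu: np.ndarray, lung_mask: np.ndarray) -> torch.Tensor:
    # Crop along the bounding box of the (approximate) lung mask.
    zs, ys, xs = np.nonzero(lung_mask)
    vol = volume_hu[zs.min():zs.max() + 1,
                    ys.min():ys.max() + 1,
                    xs.min():xs.max() + 1].astype(np.float32)

    # Clip HU values and map [-1250, 250] linearly onto [-1, 1].
    vol = np.clip(vol, -1250.0, 250.0)
    vol = (vol + 1250.0) / 1500.0 * 2.0 - 1.0

    # Resize to the fixed input size by trilinear interpolation.
    t = torch.from_numpy(vol)[None, None]        # (1, 1, D, H, W)
    t = F.interpolate(t, size=(40, 160, 160), mode="trilinear", align_corners=False)
    return t[0]                                  # (1, 40, 160, 160)
```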
The automatic lung segmentation algorithm is based on an open-source pre-trained U-Net model [55]. This lung segmentation may not be perfect in some cases, but we only use it to obtain an approximate bounding box around the lung area; lung segmentation of COVID-19 volumes is never used after pre-processing.

b) The structure of GASNet: Without loss of generality, we adopt the standard U-Net structure as the segmenter of GASNet. Considering the memory usage of 3D volumes, the number of base channels is reduced from 64 in the original paper to 16. The generator and discriminator follow the structure of CycleGAN [56]. Following the advice in [57], we add spectral normalization to the discriminator.

c) Training strategy and hyperparameters: Four datasets are required to train GASNet and select the best model, as shown in Algorithm 1. Since the objectives of training S and G and of training D are adversarial, we iteratively update their parameters using the corresponding losses in each step. Both L_GAN and L_S contribute to the optimization of the segmenter, so a hyperparameter λ_s balances the two losses: L_GAN + λ_s L_S. As G and D are trained alternately, a hyperparameter θ_i controls the ratio of the number of times D and S/G are trained in each alternation. Validation is carried out every Val_iter iterations, and only the parameters with the best performance on the validation dataset are saved.

We test the performance of our method on three public COVID-19 CT segmentation datasets [24] [25] [5]. Another public dataset with only slice-level annotations [30], used in [27] [9], is not suitable for GASNet for two reasons: (1) GASNet takes 3D CT volumes rather than 2D slices as input; (2) slice-level annotations, indicating whether a slice contains a lesion area, are not directly available from diagnosis results. Dataset-A [5] consists of 20 CT volumes. Lungs and areas of infection were labeled by two radiologists and verified by an experienced radiologist. The CT values of 10 volumes have been transformed to the range [0, 255]; considering that the original CT values are unavailable, some work [32] did not test performance on these volumes. We divide the dataset into subset 1 (the 10 original CTs) and subset 2 (the 10 transformed CTs), like [32] [33], and report the separate and overall performances. Dataset-B [24] consists of 9 COVID-19 CT volumes with voxel-level annotations by a radiologist. Dataset-C and Dataset-D (volume-level annotation) are from MosMed [25], which consists of 856 CT volumes with COVID-19 related findings as well as 254 CT volumes without such findings. 50 COVID-19 cases have voxel-level annotations of lesions by experts, and these form Dataset-C. The rest of the data, consisting of the 254 healthy volumes and the 806 COVID-19 volumes excluding the 50 voxel-level labeled samples, forms Dataset-D; the diagnosis results of these CT volumes are used directly as volume-level labels. Dataset-E (volume-level annotation) is a large dataset with volume-level annotations that we collected, in which 1,678 COVID-19 CT volumes come from Wuhan Union Hospital, whose patients were diagnosed as COVID-19 positive by nucleic acid testing, and 1,031 healthy CT volumes come from routine physical examinations. For training, all volume-level labeled data in Dataset-E is used to optimize GASNet.
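A schematic version of the alternating optimization described above (cf. Algorithm 1) is sketched below, reusing the loss helpers from the earlier snippets. The data loader interface, the optimizer objects, and the validation helper are assumptions; a real implementation would also include the L_IdToD, L_MIL, and L_ps terms at the points indicated in the comments.

```python
# Schematic alternating training loop (cf. Algorithm 1). Assumes a loader that
# yields a diseased batch I_d, a healthy batch I_h, and the voxel-level labeled
# pair (I_l, M_l); opt_d and opt_sg are optimizers over D and over S+G.
lambda_s = 100   # weight of L_S against L_GAN
theta_i = 5      # D updates per S/G update

for it, (I_d, I_h, I_l, M_l) in enumerate(train_loader):
    # --- D steps (a real implementation would draw fresh batches each step
    #     and add the L_IdToD / L_IgToD terms to the discriminator loss) ---
    for _ in range(theta_i):
        I_s, _, _ = synthesize(segmenter, generator, I_d)
        loss_d = d_step_loss(discriminator, I_s, I_h)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # --- S/G step: adversarial term plus the (extended) segmentation loss ---
    I_s, M_hat, _ = synthesize(segmenter, generator, I_d)
    loss_seg = F.binary_cross_entropy(segmenter(I_l), M_l)   # plus L_recons, L_MIL, L_ps
    loss_sg = sg_step_loss(discriminator, I_s) + lambda_s * loss_seg
    opt_sg.zero_grad()
    loss_sg.backward()
    opt_sg.step()

    if it % val_iter == 0:
        validate_and_save_best(segmenter)   # hypothetical helper: keep best model
```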
As for voxel-level labeled data, one volume randomly selected from Dataset-A is used for training, and all the rest, including the other 19 cases of Dataset-A and all volumes from Dataset-B and Dataset-C, are used for testing. Since Dataset-D comes from the same source as Dataset-C, we finetune GASNet using the volume-level Dataset-D when testing on Dataset-C; the finetuned model is denoted GASNet_finetune. As for the hyperparameters, λ_s is set to 100, and θ_i is set to 5, meaning GASNet optimizes D 5 times each time it optimizes S and G. GASNet is trained jointly from scratch (without pretraining), with a batch size of 4 and learning rates of 1e-5 for D and G and 1e-4 for S. L_ps is not calculated in the first 7,000 iterations, as we found the predicted mask for I_d is prone to errors at first. Simple data augmentation techniques, including random cropping, Gaussian noise, and rotation, lead to slight improvements on the test dataset. Training takes about 24 hours (∼14,000 iterations) on a Titan RTX GPU with 24 GB of memory. During testing, voxels greater than 0.5 in the probabilistic segmentation mask M̂ are predicted as lesion (1), and those smaller than 0.5 are predicted as healthy (0).

We adopt typical metrics in COVID-19 lung infection quantification [58] [8], i.e., the Dice score, Sensitivity, and Specificity. The Dice score measures the overlap between the prediction and the ground truth: Dice = 2TP / (2TP + FP + FN). Sensitivity measures the fraction of real positive samples that are predicted correctly: Sensitivity = TP / (TP + FN). Specificity measures the fraction of real negative samples that are predicted correctly: Specificity = TN / (FP + TN).

Quantitative results of GASNet and other small-sample learning work on COVID-19 segmentation on the three public datasets are shown in Tables III and IV. We also reproduce the MIL strategy to represent mainstream weakly-supervised methods in general medical image segmentation, together with a standard segmentation network, which is a U-Net in our experiments. Because different methods used inconsistent dataset divisions, the tables also show the number of training and testing samples used by each method on each dataset. To understand the difficulty of COVID-19 lesion segmentation, two radiologists from Wuhan Union Hospital independently annotated the cases of Dataset-A at voxel level, and their performance is measured against the ground truth of Dataset-A. The Dice scores of the two radiologists are 73.5% and 73.9%, while GASNet achieves 70.3%. Compared with existing works, only LabelFree [32] and CoSinGAN [33] use fewer voxel-level labeled samples in training (zero, and one slice from one sample, respectively) than GASNet, but GASNet exceeds their performance by a large margin. Other methods, including [31] [59], rely on considerably more voxel-level labeled samples (see Tables III and IV).

Qualitatively, the output of GASNet on four samples from the test datasets is visualized in Fig. 8. The generated volume looks like a blurry version of the original input, except for the predicted lesion areas, whose appearance changes substantially, making the generated volume look closer to a real healthy volume. GASNet therefore replaces the lesion areas of the original diseased volume with the corresponding parts of the generated volume, which makes the synthetic volume look quite similar to real healthy volumes.
These examples show that GASNet does optimize its parameters toward the goal of restoring the original healthy CT volume, as we expect. The segmentation results of three samples using different methods are shown in Fig. 9. Compared with the standard 3D U-Net baseline and the Multiple Instance Learning method, GASNet holds obvious advantages in eliminating both false positives and false negatives. Fig. 10 shows three cases where GASNet performs relatively poorly. The first CT volume contains a small lesion that GASNet misses. In the second case, the lesion segmentation of GASNet is partially missing near the edge of the lung. In the third case, the lesion area is so complex that the annotations of the two radiologists and the ground truth are inconsistent with each other, while the segmentation of GASNet is closer to those of the radiologists.

To understand the impact of the number of training samples with voxel-level annotations, we use 1, 4, 20, and 45 samples from Dataset-C as voxel-level labeled samples and Dataset-D as volume-level labeled samples to train GASNet, and test the performance on Dataset-A. We also train the baseline, i.e., a standard U-Net using only the corresponding number of voxel-level labeled samples. Besides 3D U-Net, we also adopt 3D VB-Net [22] and UNet++ [23] as the segmenter of GASNet and test the performance in the same experimental scenario. The Dice scores on the test dataset for all these experiments are shown in Fig. 11. Note that 3D VB-Net is also the network used in [8]. The performance of GASNet, no matter which segmentation model is used as its segmenter, is always better than that of the corresponding baseline, which demonstrates the robustness of the framework.

As demonstrated in subsection III-B, the auxiliary constraints added as loss functions benefit the training of GASNet and the final performance. We quantitatively analyze the contribution of each constraint to the final segmentation performance by gradually adding the constraint losses to the framework. The quantitative results are shown in Table V. Each auxiliary constraint benefits the performance, with L_IgToD and L_IdToD contributing the most. As shown in Fig. 3 and Fig. 5, L_IgToD improves the quality of the generated volume and L_IdToD alleviates the performance collapse of GASNet. Compared with the original GASNet without any auxiliary constraints, the Dice score rises cumulatively by more than 10 percentage points, showing the great impact of the auxiliary constraints on network training. L_ps further improves the segmentation performance of GASNet from 64.75% to 70.3% by adding a reliable voxel-level supervision signal to the segmenter of GASNet.

We propose a weakly-supervised framework for COVID-19 infection segmentation, named GASNet. Utilizing volume-level annotation information, GASNet needs only a single voxel-level labeled sample to obtain performance comparable to fully-supervised methods. Several auxiliary constraint losses benefit the training of GASNet, improving the segmentation performance and the quality of the synthetic volumes. Extensive experiments demonstrate the robustness of the algorithm. Given that volume-level labels are directly available as diagnosis results, GASNet is valuable in medical practice.
However, more research on explaining and improving the framework is necessary, including embedding state-of-the-art segmentation structures to raise performance and relaxing some of the constraints used in this study. In the future, we will try to extend GASNet to handle multi-class segmentation tasks. Experiments on segmenting lesions of other diseases will also be carried out to validate the generalization of GASNet.

References:
Sensitivity of Chest CT for COVID-19: Comparison to RT-PCR
Chest CT Findings in 2019 Novel Coronavirus (2019-nCoV) Infections from Wuhan, China: Key Points for The Radiologist
CT Imaging Features of 2019 Novel Coronavirus (2019-nCoV)
Radiological Findings from 81 Patients with COVID-19 Pneumonia in Wuhan, China: A Descriptive Study
Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation
Clinically Applicable AI System for Accurate Diagnosis, Quantitative Measurements, and Prognosis of COVID-19 Pneumonia Using Computed Tomography
JCS: An Explainable COVID-19 Diagnosis System by Joint Classification and Segmentation
Lung Infection Quantification of COVID-19 in CT Images with Deep Learning
MiniSeg: An Extremely Minimum Network for Efficient COVID-19 Segmentation
3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation
When Unseen Domain Generalization is Unnecessary? Rethinking Data Augmentation
Medical Image Synthesis for Data Augmentation and Anonymization Using Generative Adversarial Networks (in Simulation and Synthesis in Medical Imaging)
CT-realistic Lung Nodule Simulation from 3D Conditional Generative Adversarial Networks for Robust Lung Segmentation
Semi-supervised Learning for Network-based Cardiac MR Image Segmentation
ASDNet: Attention Based Semi-supervised Deep Networks for Medical Image Segmentation
Self-learning to Detect and Segment Cysts in Lung CT Images without Manual Annotation
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Multi-Instance Learning by Treating Instances As Non-I.I.D. Samples
Development and Evaluation of An AI System for COVID-19 Diagnosis
U-Net: Convolutional Networks for Biomedical Image Segmentation
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation
UNet++: A Nested U-Net Architecture for Medical Image Segmentation (in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support)
COVID-19 CT Segmentation Dataset
MosMedData: Chest CT Scans with COVID-19 Related Findings
Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images
ImageNet Classification with Deep Convolutional Neural Networks
Fully Convolutional Networks for Semantic Segmentation
Italian Society of Medical and Interventional Radiology COVID-19 Dataset
A Weakly Supervised Region-Based Active Learning Method for COVID-19 Segmentation in CT Images
Label-Free Segmentation of COVID-19 Lesions in Lung CT
Learning Diagnosis of COVID-19 from A Single Radiological Image
ENet: A Deep Neural Network Architecture for Real-time Semantic Segmentation
Weakly Supervised Segmentation with Multi-scale Adversarial Attention Gates
Weakly Supervised Segmentation from Extreme Points (in Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, and Hardware Aware Learning for Medical Imaging and Computer Assisted Intervention)
Weakly Supervised Deep Nuclei Segmentation Using Partial Points Annotation in Histopathology Images
Bounding Boxes for Weakly Supervised Segmentation: Global Constraints Get Close to Full Supervision
DeepCut: Object Segmentation from Bounding Box Annotations Using Convolutional Neural Networks
Random Walks for Image Segmentation
CAMEL: A Weakly Supervised Learning Framework for Histopathology Image Segmentation
Discriminative Localization in CNNs for Weakly-Supervised Segmentation of Pulmonary Nodules
Weakly Supervised Segmentation Framework with Uncertainty: A Study on Pneumothorax Segmentation in Chest X-ray
Efficient Active Learning for Image Classification and Segmentation Using A Sample Selection and Conditional Generative Adversarial Network
Task Driven Generative Modeling for Unsupervised Domain Adaptation: Application to X-ray Image Segmentation
Adversarial Image Synthesis for Unpaired Multi-modal Cardiac Data
Learning to Segment Skin Lesions from Noisy Annotations (in Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data)
Adversarial Confidence Learning for Medical Image Segmentation and Synthesis
Improved Techniques for Training GANs
Unsupervised Representation Learning with Deep Convolutional Neural Network for Remote Sensing Images
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Attention-Based Dropout Layer for Weakly Supervised Object Localization
Attention-based Dropout Layer for Weakly Supervised Single Object Localization and Semantic Segmentation
ADCM: Attention Dropout Convolutional Module
Automatic Lung Segmentation in Routine Imaging is A Data Diversity Problem, Not A Methodology Problem
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks
Spectral Normalization for Generative Adversarial Networks
Large-scale Screening of COVID-19 from Community Acquired Pneumonia Using Infection Size-aware Classification
A Weakly Supervised Consistency-based Learning Method for COVID-19 Segmentation in CT Images
nnU-Net: Self-adapting Framework for U-Net-based Medical Image Segmentation