title: Generative Adversarial U-Net for Domain-free Medical Image Augmentation
authors: Chen, Xiaocong; Li, Yun; Yao, Lina; Adeli, Ehsan; Zhang, Yu
date: 2021-01-12

The shortage of annotated medical images is one of the biggest challenges in the field of medical image computing. Without a sufficient number of training samples, deep learning based models are very likely to suffer from over-fitting. The common remedy is image manipulation such as rotation, cropping, or resizing. These methods help relieve over-fitting by introducing more training samples, but they do not truly introduce new images with additional information, and they may cause data leakage when the test set contains samples similar to those in the training set. To address this challenge, we propose to generate diverse images with a generative adversarial network. In this paper, we develop a novel generative method named generative adversarial U-Net, which combines a generative adversarial network with U-Net. Unlike existing approaches, our newly designed model is domain-free and generalizes to various medical images. Extensive experiments are conducted over eight diverse datasets including computed tomography (CT) scans, pathology, X-ray, etc. The visualization and quantitative results demonstrate the efficacy and good generalization of the proposed method in generating a wide array of high-quality medical images.

In the past decade, deep learning has attracted increasing research interest in medical image computing and its applications. One of the biggest challenges in applying deep learning to the medical imaging domain is learning generalizable feature patterns from small datasets or a limited number of annotated samples. Deep learning based methods require a large number of annotated training samples to support inference, a requirement that is hard to fulfill in medical image analysis [1]-[3]. In medical imaging tasks, annotations are produced by radiologists with expert knowledge of the data and the related tasks. Thanks to increasingly released medical datasets and grand challenges, the dataset shortage has been relieved to some extent. However, these datasets are still limited in size, as annotation inevitably requires laborious work from radiologists [4]. To overcome this problem, data augmentation has been widely utilized. The most commonly used augmentation strategy is dataset manipulation, i.e., simple modifications of the data such as translation, rotation, flipping, cropping, and scaling [5], [6]. These methods have been widely applied to enrich training sets, thereby improving model performance in various computer vision tasks [7], [8]. Image modification can introduce some pixel-level side information that improves performance. However, pixel-level modification cannot introduce new images, only variants of the original ones, and hence is still likely to suffer from over-fitting. Synthetic data augmentation is considered a more appealing alternative, as it can generate more sophisticated types of data based on the original images. Generative adversarial networks (GANs) [9] are a representative synthetic augmentation method, capable of providing more variability to enrich the dataset.
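As a quick illustration of the pixel-level manipulation described above, the sketch below applies the usual flip, rotation, crop, and rescale operations with TensorFlow's tf.image utilities. It is a minimal sketch, not the pipeline of any cited work; the function name, the 256 × 256 × 3 input size, and the crop size are our own illustrative assumptions. Note that every output is merely a variant of the same underlying image, which is exactly the limitation that motivates GAN-based augmentation.

```python
import tensorflow as tf

def classical_augment(image):
    """Pixel-level augmentation: flip, rotate, crop, and rescale.

    `image` is assumed to be a 256 x 256 x 3 float tensor. Each call
    returns a randomized variant of the same image; no genuinely new
    sample is created.
    """
    image = tf.image.random_flip_left_right(image)                 # random horizontal flip
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)  # 0-3 quarter turns
    image = tf.image.rot90(image, k=k)                             # random rotation
    image = tf.image.random_crop(image, size=(224, 224, 3))        # random crop
    image = tf.image.resize(image, (256, 256))                     # rescale back to 256 x 256
    return image
```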
Inspired by game theory, a GAN aims to reach a Nash equilibrium [9] between its components. A GAN consists of two networks trained jointly in an adversarial setting: one network generates fake images based on inputs, while the other distinguishes the generated images from real ones. GANs have been increasingly applied to image synthesis [9], [10], including denoising [11], image translation [12], etc. In addition, multiple GAN variants have been proposed to generate high-quality realistic natural images [13], visualize stories [14], and synthesize high-resolution images from low-resolution inputs [15]. Recently, several studies in medical imaging have adopted GANs and their variants as the main framework [4], [16], [17], mostly for generating images or for cross-modality translation. Zhang et al. [16] applied a GAN to reduce the intrinsic noise in a multi-source dataset, since different devices generate different types of noise and such site effects significantly affect the data distribution. Zhang et al. [18] proposed SkrGAN, which incorporates global contextual information in the form of fine foreground structures to improve image quality. Xue et al. [19] used two GANs to learn the relationship between brain MRI images and brain tumor segmentation maps. Frid-Adar et al. [4] employed GAN-based synthetic augmentation to improve CNN performance in liver lesion classification.

Fig. 1: Structure of the proposed model. Given an arbitrary class c_t ∈ C, our model generates corresponding images based on a sampled image x_j. Two different samples are drawn from the given class c_t each time to support the discriminator. x_j is encoded into a latent representation and concatenated with a Gaussian variable to form the final representation used for generation. The generated image x_g is fed into the discriminator together with the real data x_i, x_j. The discriminator is designed to distinguish two distributions, i.e., the true distribution {x_i, x_j} and the fake distribution {x_i, x_g}, while the generator aims to make those two distributions as similar as possible.

GANs have also been successfully applied to segmentation. Dong et al. [20] adopted a GAN for neural architecture search to find the best segmentation model for chest organs. Khosravan et al. [21] introduced a projection module into a GAN to boost the performance of a lung segmentor. However, most existing studies focus on a specific task or domain, and there is no robust method that generalizes across domains. In this study, we aim to design a domain-free GAN structure that is suitable for any domain, such as X-ray, CT scan, or pathology, rather than a specific one. In addition, vanilla GANs suffer from training instability, which makes convergence difficult [22]; we therefore employ the Wasserstein GAN as the main framework of our model, as it has shown higher training stability. U-Net is a well-known structure in medical image analysis, especially for segmentation [5], [23]. Segmentation aims to find pixel-level abnormalities, which requires strong feature extraction capability, and the generator in a GAN requires a similar capability. Hence, we utilize U-Net as the generator in our study. U-Net is similar to an auto-encoder in that it learns a latent representation and reconstructs an output with the same size as the input.
In order to fulfill the generation requirement, we concatenate a Gaussian variable into the latent representation to ensure that the model does not generate the same image each time. The contribution of this paper can be summarized as follows: • We propose a new variant of GAN, named generative adversarial U-Net, for domain-free medical image augmentation. Images generated by the proposed method have better quality than those of vanilla GANs and their well-known variant, conditional GANs. • To leverage its superior feature extraction capability, we first disassemble U-Net into an encoder and a generator, and then assemble the generator into the GAN structure to generate images. • Extensive experiments are conducted on eight datasets from different domains, including CT scans, pathology, chest X-ray, dermatoscopy, ultrasound, and optical coherence tomography. Experimental results demonstrate high generalizability and robustness across various data domains.

In this section, the proposed generative adversarial U-Net is introduced. We first describe the overall structure of the developed deep learning model, and then explain its components, including the residual U-Net generator, the discriminator, and the training strategy. The overall flowchart of our method is illustrated in Fig. 1.

U-Net was first proposed in [24] and has been widely used in medical image segmentation [5]. It is an artificial neural network that uses an auto-encoder structure with skip connections. The encoder is designed to extract features from the given images, and the decoder constructs the segmentation map from those extracted spatial features. The encoder follows a structure similar to fully convolutional networks (FCN) [25], with stacked convolutional layers. Specifically, the encoder consists of a sequence of down-sampling blocks, each containing several convolutional layers followed by a max-pooling layer. The number of filters in the convolutional layers is doubled after each down-sampling operation. In the end, the encoder outputs a learned feature map for the input image. The decoder, in contrast, is designed for up-sampling and constructing the image segmentation. It first utilizes a deconvolutional layer to up-sample the feature map generated by the encoder; the deconvolutional layer contains a transposed convolution operation and halves the number of filters in the output. This is followed by a sequence of up-sampling blocks, each consisting of two convolutional layers and a deconvolutional layer. Another convolutional layer is then used as the final layer to generate the segmentation result. The final layer adopts the Sigmoid function as its activation, while all other layers use the ReLU function. In addition, U-Net concatenates parts of the encoder features with the decoder, similar to the skip connections in ResNet [26]. For each block in the encoder, the result of the convolution before max-pooling is transferred symmetrically to the decoder. Each decoder block receives the corresponding feature representation learned by the encoder and concatenates it with the output of its deconvolutional layer; the concatenated result is then propagated forward to the next block. This concatenation helps the decoder recover features possibly lost through max-pooling [23].
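To make the encoder/decoder/skip-connection layout above concrete, here is a minimal TensorFlow/Keras sketch of a shallow U-Net. The paper's implementation is in TensorFlow, but the depth, filter counts, and names below are our own illustrative assumptions, not the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(input_shape=(256, 256, 3), base_filters=32):
    """Illustrative U-Net: two down-sampling levels, a bottleneck, and
    two up-sampling levels with skip concatenations."""
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: convolutions followed by max-pooling; filters double per level.
    c1 = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(base_filters * 2, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck: the learned feature map for the input image.
    b = layers.Conv2D(base_filters * 4, 3, padding="same", activation="relu")(p2)

    # Decoder: transposed convolutions halve the filters; each level
    # concatenates the symmetric encoder features (the skip connection).
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(b)
    u2 = layers.Concatenate()([u2, c2])
    c3 = layers.Conv2D(base_filters * 2, 3, padding="same", activation="relu")(u2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(c3)
    u1 = layers.Concatenate()([u1, c1])
    c4 = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(u1)

    # Final 1x1 convolution with Sigmoid, as in the segmentation setting above.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return tf.keras.Model(inputs, outputs)
```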
As mentioned previously, U-Net demonstrates state-of-the-art performance on medical image segmentation tasks, showing its superiority in medical image feature extraction. Hence, we utilize U-Net as the main structure of the proposed generative model.

GANs are generative models that learn a mapping from a random noise vector z to an output image x_g, G : z → x_g [9], with the objective

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].    (1)

However, the images generated in this way are unconditioned, which makes it hard to assign them labels. Hence, we use conditional GANs [13] instead. Conditional GANs learn a mapping from a random noise vector z and an observed image x_i of class c_t to the output image x_g, G_c : (z, x_i) → x_g. The generator G is trained to generate images that cannot be distinguished from "real" images by an adversarially trained discriminator D, while D is trained to detect the fake images produced by the generator. Equivalently, GANs measure the distribution discrepancy between the generated data and the real data. The objective function of the conditional GAN can be expressed as:

L_cGAN(G_c, D) = E_{x_i, x_j}[log D(x_i, x_j)] + E_{x_i, z}[log(1 − D(x_i, G_c(z, x_i)))],    (2)

where G_c tries to minimize the objective function against D, which tries to maximize it. Mathematically, it is formulated as follows:

G* = arg min_{G_c} max_D L_cGAN(G_c, D),    (3)

where G* is the resulting generator when Eq.(3) reaches the Nash equilibrium. However, conditional GANs share the limitations of traditional GANs, namely training instability and mode collapse. Hence, we use the Wasserstein GAN [22] as the main structure of our generative model. To be specific, a normal GAN minimizes the JS-divergence implicit in Eq.(2), whereas the objective function of the Wasserstein GAN is:

min_G max_{D∈D_L} E_{x∼P_r}[D(x)] − E_{x_g∼P_g}[D(x_g)],    (4)

where D_L is the set of 1-Lipschitz functions, P_r the real data distribution, and P_g the generator distribution. Furthermore, a gradient penalty is introduced [10] to enforce the Lipschitz constraint of the Wasserstein GAN:

L_GP = λ E_{x̂∼P_x̂}[(‖∇_x̂ D(x̂)‖_2 − 1)^2],    (5)

where P_x̂ samples uniformly along straight lines between pairs of points drawn from the data distribution and the generator distribution; that is, x̂ = ε x + (1 − ε) x_g is a combination of an original image and a generated one, with the control factor ε sampled from the uniform distribution U(0, 1).

A generative adversarial network can thus be used to conduct data augmentation. Given a certain class c_t and a corresponding data point x, we learn a representation r_x of the input image through the encoder, such that r_x = g(x), where g(·) denotes the encoder network. In addition, a latent Gaussian variable z_i is introduced into the learned representation to provide variation, in the following form:

r̃_x = [g(x); g_l(z_i)],    (6)

where g_l is a linear projection that maps the Gaussian noise into vector form so that it can be concatenated with the learned representation. Once the representation is formed, it is fed into the generator to generate images.

In the proposed method, U-Net is split into an encoder and a generator. The structure of the U-Net can be found on the right side of Fig. 1. The ResNet block is used as the basic unit, defined as

k_{j+1} = k_j + r_k(k_j, w_j),    (7)

where k_j is the output of the j-th layer, w_j is the corresponding trainable weight, and r_k is the residual mapping. In addition, unlike the traditional U-Net, we use the leaky ReLU f(x) = max(0, x) + α min(0, x) as the activation function.

It is worth noting that both the generated image x_g and the original image x_j are provided to the discriminator. We want to ensure that the generator is capable of producing images that are related to, but different from, the original image x_j; that is, the generated image x_g should be drawn from the same class as x_j rather than being a duplicate or a simple modification of x_j.
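The following TensorFlow sketch illustrates the two mechanisms just described: the noise-augmented representation of Eq.(6) and the gradient penalty of Eq.(5). The `encoder`, `linear_proj`, and `critic` callables, the latent size `z_dim`, and the function names are our own assumptions for illustration, not the authors' code; inputs are assumed to be NHWC image batches.

```python
import tensorflow as tf

def noisy_representation(encoder, linear_proj, x, z_dim=128):
    """Eq.(6): concatenate the encoded image g(x) with a linearly
    projected Gaussian variable g_l(z), so repeated calls on the same
    image yield different generations."""
    r_x = encoder(x)                                   # r_x = g(x), shape [batch, d]
    z = tf.random.normal([tf.shape(x)[0], z_dim])      # latent Gaussian z_i
    return tf.concat([r_x, linear_proj(z)], axis=-1)   # [g(x); g_l(z_i)]

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    """Eq.(5): penalize deviation of the critic's gradient norm from 1
    at points interpolated between real and generated images."""
    eps = tf.random.uniform([tf.shape(x_real)[0], 1, 1, 1], 0.0, 1.0)
    x_hat = eps * x_real + (1.0 - eps) * x_fake        # straight-line interpolation
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = critic(x_hat)
    grads = tape.gradient(d_hat, x_hat)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return lam * tf.reduce_mean(tf.square(norm - 1.0))

def critic_loss(critic, x_real, x_fake):
    """Wasserstein critic objective of Eq.(4) plus the gradient penalty."""
    w = tf.reduce_mean(critic(x_fake)) - tf.reduce_mean(critic(x_real))
    return w + gradient_penalty(critic, x_real, x_fake)
```

The generator is trained with the opposite sign, i.e., by minimizing −E[D(x_g)], which corresponds to making the fake distribution indistinguishable from the real one.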
By providing the current image x_j, we prevent the generator from simply encoding it. In addition, the class information is provided so that the generator can better learn generalized patterns across all classes. The generator contains eight blocks, where each block has four 3×3 convolutional layers with batch renormalization [27], followed by a downscaling or upscaling layer. Downscaling layers are convolutions with stride 2 followed by leaky ReLU, batch renormalization, and dropout. Upscaling layers are deconvolutions with stride 1/2 followed by leaky ReLU, batch renormalization, and dropout. As mentioned above, we also maintain the skip connections inside the generator; following a strategy similar to ResNet, we use a 1×1 convolutional layer to pass features between blocks. DenseNet [28] is adopted as the discriminator. Layer normalization is applied instead of batch normalization, as we find it performs better. The discriminator contains four dense blocks and four transition layers, where each dense block contains four convolutional layers and ends with a dropout layer; dropout is applied at the last layer to avoid over-fitting.

To optimize our networks, we follow the standard approach introduced in [9]: we alternate one gradient descent step on D with one step on G. As suggested in the original WGAN paper, we train the model based on Algorithm 1.

Algorithm 1: Training algorithm for our model.
Input: gradient penalty coefficient λ; Adam parameters α, β_1, β_2; batch size m; input image x_i; initial discriminator parameters w_0; initial generator parameters θ_0.
1: while θ has not converged do
2:   for t = 1, · · · do
3:     for i = 1, · · · , m do
4:       Sample real data x_j, random noise z as in Eq.(1), and a random number ε ∼ U(0, 1);
5:       x_g ← G_θ(z, x_j), x̂ ← ε x_j + (1 − ε) x_g;
6:       L^(i) ← D_w(x_g) − D_w(x_j) + λ(‖∇_x̂ D_w(x̂)‖_2 − 1)^2;
7:     end for
8:     w ← Adam(∇_w (1/m) Σ_i L^(i), w, α, β_1, β_2);
9:   end for
10:  θ ← Adam(∇_θ (1/m) Σ_i −D_w(G_θ(z, x_j)), θ, α, β_1, β_2);
11: end while

The experiments are conducted on multiple publicly available datasets [29]-[35]; for the volumetric data, we follow [36] to transfer the scans into 2D images in the axial view. OpenCV2 is utilized to resize all the medical images to 256 × 256 × 3. We split each dataset into training, validation, and test sets at the patient level with a 7:1:2 ratio. The information on all these datasets is summarized in Table I. The experiments were conducted on a machine with eight GPUs: six NVIDIA TITAN X Pascal GPUs and two NVIDIA TITAN RTX GPUs. The model is implemented in TensorFlow.

Quality evaluation of the generated images is a challenging problem [37]. Traditional metrics, such as the per-pixel mean squared error, hardly reflect perceptual quality. Hence, we use the Fréchet Inception Distance (FID) [38] to measure the distance between the generated distribution and the real distribution; a lower FID indicates that the generated images have higher quality (a minimal computation sketch is given after the baselines below). In addition, we also use the per-pixel accuracy (PA) to measure the discriminability of the generated images [12], [37], [39]. In order to demonstrate the superiority of the proposed method, we select the following baselines for quality comparison: • Vanilla GAN [9]: the original version of the GAN. • Conditional GAN [13]: a conditional GAN that incorporates label information. It is worth mentioning that although U-Net is a type of auto-encoder, auto-encoders cannot be used as baselines: they are widely used for reconstruction rather than augmentation, and are therefore not suitable here. For classification, we report several standard metrics.
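For reference, FID fits a Gaussian to the Inception-network activations of the real and generated images and measures the Fréchet distance between the two Gaussians. Below is a minimal NumPy/SciPy sketch of this computation given precomputed activation matrices; the function name and interface are our own, and extracting the activations (e.g., from an Inception model) is assumed to happen elsewhere.

```python
import numpy as np
from scipy import linalg

def fid(act_real, act_fake):
    """Frechet Inception Distance between two activation matrices
    (rows = images, columns = features):

        ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^(1/2))

    Lower values mean the generated distribution is closer to the
    real one."""
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_f = np.cov(act_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```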
In this part, we briefly present the parameter settings used in our experiments. The gradient penalty coefficient λ is 10. For the Adam optimizer [40], we set α = 0.0001, β_1 = 0, β_2 = 0.9. The growth rate of the dense blocks is k = 64, and the number of training epochs is 5,000. The classifiers are trained for 5,000 epochs with the Adam optimizer, using α = 0.001, β_1 = 0.9, β_2 = 0.99.

Fig. 2: Generated images in three different domains: lung CT, chest X-ray, and ultrasound. From left to right: the original images, our method, GAN-generated images, and conditional-GAN-generated images. The images generated by our method are clearly more similar to the original ones.

We first compare the image generation performance of GAN, cGAN, and our method. The results summarized in Table II show that our method achieves the best results on both metrics, compared with GAN and conditional GAN. To better analyze the generated images, we visualize them using three datasets, Luna, ChestXray8, and BreastUltra, as demos; the visualization is shown in Fig. 2. The images generated by our model are clearly more similar to the ground truth. We also provide the classification results with and without augmentation on all eight datasets, with four different classifiers, in Table III. We find that the performance of all the classifiers improves significantly after augmentation.

The lack of annotated medical images poses a significant challenge for imaging-based medical studies, such as rapid diagnosis and disease prediction, and data augmentation is a popular approach to relieve this problem. In this paper, we propose a new method named generative adversarial U-Net for medical image augmentation. It can be used to generate multi-modality data to relieve the data shortage commonly faced in medical imaging research. Specifically, we adjust the structure of the generative adversarial network to fit the U-Net. We conduct extensive experiments on eight datasets with different modalities, ranging from binary to multi-class classification. Our experimental results demonstrate the superior performance of the proposed method over state-of-the-art approaches on all of these datasets. In the future, we plan to extend our work to the more challenging few-shot and semi-supervised learning scenarios [41], [42], in which only a few or even no samples are available for certain classes. We also plan to investigate transfer learning to enrich the current model and augment its ability to deal with unseen samples from brand-new classes.
[1] Improving computer-aided detection using convolutional neural networks and random view aggregation.
[2] A survey on deep learning in medical image analysis.
[3] Multi-task generative adversarial learning on geometrical shape reconstruction from EEG brain signals.
[4] GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification.
[5] Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images.
[6] Momentum contrastive learning for few-shot COVID-19 diagnosis from chest CT images.
[7] Unpaired image-to-image translation using cycle-consistent adversarial networks.
[8] Unsupervised image-to-image translation networks.
[9] Generative adversarial nets.
[10] Improved training of Wasserstein GANs.
[11] Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss.
[12] Image-to-image translation with conditional adversarial networks.
[13] Conditional generative adversarial nets.
[14] StoryGAN: A sequential conditional GAN for story visualization.
[15] Photo-realistic single image super-resolution using a generative adversarial network.
[16] Noise adaptation generative adversarial network for medical image analysis.
[17] Medical image synthesis with context-aware generative adversarial networks.
[18] SkrGAN: Sketching-rendering unconditional generative adversarial networks for medical image synthesis.
[19] SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation.
[20] Neural architecture search for adversarial medical image segmentation.
[21] PAN: Projective adversarial network for medical image segmentation.
[22] Wasserstein GAN.
[23] UNet++: A nested U-Net architecture for medical image segmentation.
[24] U-Net: Convolutional networks for biomedical image segmentation.
[25] Fully convolutional networks for semantic segmentation.
[26] Deep residual learning for image recognition.
[27] Batch renormalization: Towards reducing minibatch dependence in batch-normalized models.
[28] Densely connected convolutional networks.
[29] Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study.
[30] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.
[31] The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
[32] Identifying medical diagnoses and treatable diseases by image-based deep learning.
[33] Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge.
[34] Automated breast ultrasound lesions detection using convolutional neural networks.
[35] The liver tumor segmentation benchmark (LiTS).
[36] Efficient multiple organ localization in CT image using 3D region proposal network.
[37] Improved techniques for training GANs.
[38] GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
[39] Generative image modeling using style and structure adversarial networks.
[40] Adam: A method for stochastic optimization.
[41] Adversarial variational embedding for robust semi-supervised learning.
[42] Learning from less for better: Semi-supervised activity recognition via shared structure discovery.