key: cord-0195250-gddg2kc3
authors: Liang, Jiamin; Yang, Xin; Huang, Yuhao; Li, Haoming; He, Shuangchi; Hu, Xindi; Chen, Zejian; Xue, Wufeng; Cheng, Jun; Ni, Dong
title: Sketch guided and progressive growing GAN for realistic and editable ultrasound image synthesis
date: 2022-04-14
journal: nan
DOI: nan
sha: db6d6bde546670094d0356d740a666bfd0eaca91
doc_id: 195250
cord_uid: gddg2kc3

Ultrasound (US) imaging is widely used for anatomical structure inspection in clinical diagnosis. The training of new sonographers and of deep learning based algorithms for US image analysis usually requires a large amount of data. However, obtaining and labeling large-scale US imaging data are not easy tasks, especially for diseases with low incidence. Realistic US image synthesis can alleviate this problem to a great extent. In this paper, we propose a generative adversarial network (GAN) based image synthesis framework. Our main contributions include: 1) we present the first work that can synthesize realistic B-mode US images with high-resolution and customized texture editing features; 2) to enhance the structural details of generated images, we propose to introduce auxiliary sketch guidance into a conditional GAN, superposing the edge sketch onto the object mask and using the composite mask as the network input; 3) to generate high-resolution US images, we adopt a progressive training strategy to gradually generate high-resolution images from low-resolution images; in addition, a feature loss is proposed to minimize the difference of high-level features between the generated and real images, which further improves the quality of generated images; 4) the proposed US image synthesis method is quite universal and can also be generalized to US images of anatomical structures other than the three tested in our study (lung, hip joint, and ovary); 5) extensive experiments on three large US image datasets are conducted to validate our method. Ablation studies, customized texture editing, user studies, and segmentation tests demonstrate promising results of our method in synthesizing realistic US images.

Ultrasound (US) imaging is prevalent in routine clinical examinations because of its relatively low cost, real-time imaging capability, and avoidance of radiation exposure (Kutter et al., 2009; Alessandrini et al., 2015). During US diagnosis, sonographers first manually operate imaging equipment to produce the images required for diagnosis, and then review and analyze the images to find abnormalities (Doi, 2007). This process relies heavily on sonographers' knowledge and experience, and it usually takes a long time for novices to acquire operating and diagnostic skills. This is especially true when diagnosing rare diseases, due to the lack of training on real data (Mattausch et al., 2017). In recent years, we have witnessed considerable progress in computational medical image analysis for the detection, diagnosis, and treatment of diseases (Cheng et al., 2020). Compared with medical image interpretation by human experts, automated analysis is more efficient, more objective, and does not suffer from inter-observer variations (Cheng et al., 2016). In the stream of applying machine learning, especially deep learning, to data analysis, large-scale datasets and annotations lie at the heart of its success in accomplishing target tasks. For example, the ImageNet database, designed for visual object recognition, contains more than one million annotated images.
However, in medical applications, usually only a very limited number of images are available, and annotations require expert knowledge about the data and the task. Therefore, the lack of large-scale datasets and annotations remains a major obstacle hindering the successful application of deep learning algorithms to medical images (Gao et al., 2019). Researchers have been trying to circumvent this obstacle via data augmentation. The most common method is affine transformation, including translation, rotation, and scaling. This technique simply modifies the original images to expand the dataset for model training. Although the sample size can be remarkably increased in this way, little additional information is introduced into the dataset because the content changes are small (e.g., rotating an image by an angle) (Frid-Adar et al., 2018; Salehi et al., 2020). In this regard, there is an urgent need for a new data augmentation method that can enrich the dataset with more variability, so that a model trained on a small dataset can still generalize well to unseen data (Yi et al., 2019).

Image synthesis is a newer and more sophisticated augmentation method. It can be classified into physics-based and learning-based methods. Well-known US simulation packages such as Field II (Jensen, 1997, 2004) and k-Wave (Treeby and Cox, 2010) can be used to simulate B-mode ultrasound images, though they are not designed solely for image synthesis. k-Wave is designed for the time-domain simulation of propagating acoustic waves and can account for both linear and nonlinear wave propagation, while Field II is a linear ultrasound simulation tool. Ramírez et al. (2004) proposed a physical model for simulating intravascular US (IVUS) images. Burger et al. (2012) built deformable mesh models from CT volumes to achieve real-time simulation on the GPU. Cardiac US sequences were simulated based on an electromechanical model (Prakosa et al., 2012) and a warping strategy (Zhou et al., 2017). Although these methods follow US imaging principles, their computational complexity is often high due to the modeling of the wave propagation process (Hu et al., 2017). They are very time-consuming, especially for generating high-resolution images. Moreover, their performance may be affected by the quality of pre-built models, which are often essential and difficult to construct.

In the past few years, deep learning based synthesis methods have gained increasing interest. Among them, the generative adversarial network (GAN) (Goodfellow et al., 2014) is the most promising approach. Fujioka et al. (2019) employed a deep convolutional GAN (DCGAN) (Radford et al., 2015) to synthesize breast US images without additional constraints. Hu et al. (2017) first proposed a novel spatially-conditioned GAN based on conditional GANs (cGANs) (Mirza and Osindero, 2014) to synthesize US images from fetal phantoms; the proposed architecture improves training stability by taking pixel coordinates as conditioning input. Tom et al. (2018) introduced a multi-stage method including two different cGANs to transform tissue maps into synthetic IVUS images. Although the cascaded cGANs are hard to train, this system was the first to use tissue labels as conditioning input to enhance training stability. Although the cGAN (Mirza and Osindero, 2014; Isola et al., 2017) is effective and enables user-controlled image generation, the synthesized images often have low resolution and checkerboard artifacts.
To make the structural details of generated images more realistic, auxiliary guidance information, such as the sketch and edge of the background, was introduced (Shin et al., 2018; Zhang et al., 2019). However, it is still challenging to synthesize high-resolution images. Because high-resolution images contain more details, the discriminator can easily recognize the differences between generated and real images, which may lead to the vanishing gradient problem and make training difficult. Additionally, training such a model is memory-intensive, which limits the use of a large batch size to improve training stability. To address the above issues, we devise a novel sketch guided and progressive growing GAN (spGAN) to synthesize US images. The main contributions of our work include:

1. To the best of our knowledge, this is the first work that can synthesize realistic B-mode US images with high-resolution and customized texture editing features. A software tool and a video demo of our method are available at GitHub (https://github.com/Carmenliang/UI synthesis).
2. To enhance the fidelity of synthesized structure details, we propose to introduce auxiliary sketch guidance into a cGAN. Specifically, we superpose the edge sketch onto the object mask and use the composite mask as the network input. Customized editing of the edge sketch and object mask makes our method quite flexible in generating different US images for training new sonographers and augmenting data in deep learning models.
3. To generate high-resolution US images, we adopt a progressive training strategy (Karras et al., 2017) to gradually generate high-resolution images from low-resolution images. In addition, a feature loss (FL) is proposed to minimize the difference of high-level features between the generated and real images, which further improves the quality of generated images.
4. The proposed US image synthesis method is quite universal and can also be generalized to US images of anatomical structures other than the three tested in our study (lung, hip joint, and ovary).
5. Extensive experiments on three large US image datasets are conducted to validate the efficacy of our method, including ablation studies, customized texture editing, user studies, and segmentation comparison between real and synthesized images.

Fig. 1. Examples of the US images and the corresponding label maps for three different datasets. On the left are the annotated label maps in pseudo colors, and on the right are the real US images. The red arrows indicate the non-target background regions in US images, which are difficult to synthesize realistically due to the lack of background information in the label maps.

Some preliminary results of this study have been published in the ISBI 2020 conference (Liang et al., 2020). In this paper, we make substantial extensions in the following aspects. 1) A new regularization term is introduced into the loss function to make the generated images and real images alike in terms of high-level features, which successfully removes the artifacts present in our previous study and in other advanced GAN-based synthesis methods. 2) Besides the ovary dataset used in our previous study, we collect two additional large datasets of US images (COVID-19 and infant hip joint) to further validate our method. 3) We add segmentation experiments to demonstrate the efficacy of our method as a data augmentation approach.
Compared with traditional data augmentation such as image translation and rotation, our GAN-based augmentation method can provide greater variability with editable operations and therefore has great potential to improve performance. 4) We add extensive ablation studies to verify the effectiveness of each component of our method. 5) We investigate the effect of three key parameters on the performance of our method. 6) We release our US image synthesis tool to a public repository (https://github.com/Carmenliang/UI synthesis), which can be readily used by other researchers.

This study was approved by local institutional review boards. A robust US simulation framework is expected to synthesize photo-realistic images with various characteristics, such as different structural shapes, positions, and echo patterns. Hence, three representative datasets of B-mode US images were collected and used in our study: lung US for diagnosis of COVID-19 (COVID-19), infant hip joint US (hip joint), and ovary and follicle US (ovary). Each of the three datasets has its own special characteristics. For the COVID-19 dataset, observing the specific echo patterns is important for diagnosis. For the hip joint dataset, attention is focused on the relative position of different anatomical structures. For the ovary dataset, doctors analyze the size of the ovary and the size and number of follicles. All US images had corresponding segmentation maps annotated by experienced doctors. Example images and the corresponding segmentation maps are shown in Fig. 1. The details of each dataset are described as follows:

COVID-19. This dataset contained 6054 images in total, of which 4849 images were used as the training set and the remaining 1205 images as the test set. These images had different resolutions, with the height ranging from 179 to 799 pixels and the width ranging from 109 to 1104 pixels. All images were resized to 256×256 pixels and 512×512 pixels for training the low-resolution and high-resolution synthesis models, respectively. The annotated artifacts in lung US images included the pleura line, A-line, B-line, and consolidation. The COVID-19 dataset was collected from multiple centers in Wuhan, including Cancer Center of Union Hospital, West of Union Hospital, Jianghan Cabin Hospital, Jingkai Cabin Hospital, and Leishenshan Hospital. Various ultrasound machines were used, including Mindray M7, M8, and M9, and GE Logiq E9 and Logiq E Portable Ultrasound Machine.

Hip Joint. This dataset contained 1231 images in total, of which 992 images were used as the training set and the remaining 239 images as the test set. To remove the characters in the original US images, we cropped them to 512×512 pixels and then resized them to 256×256 pixels; both the cropped and the resized images were used for training the backbone structure of the GAN. Four structures were annotated in the segmentation maps: the ilium, lower limb, labrum, and co-junction. The hip joint dataset was collected from Guangdong Women and Children Hospital with two different machines (Hitachi HI-Vision Preirus and Philips iU22). The frequencies of the Hitachi transducer are between 5 and 13 MHz, while those of the other machine are between 3 and 9 MHz.

Ovary. This dataset contained 3261 ovarian images in total, of which 2848 images were used as the training set and the remaining 413 as the test set. The image size was non-uniform, with the height ranging from 380 to 530 pixels and the width ranging from 610 to 860 pixels. All images were resized to 256×256 pixels and 512×512 pixels for training.
Ovary and follicles were annotated in the segmentation maps. The ovary dataset was collected from The Third Affiliated Hospital of Guangzhou Medical University with two different machines (Mindray Resona 7S and GE Voluson 6S). The frequencies of the Mindray transvaginal transducer are between 3 and 9 MHz, while those of the other machine are between 4 and 10 MHz.

To synthesize high-fidelity and high-resolution US images from simple segmentation maps, we proposed a GAN-based image synthesis framework, as illustrated in Fig. 2. Specifically, we first added a fine-grained edge sketch to the original label maps, resulting in composite label maps that help the generator create images with realistic texture (Section 2.3). Next, the designed backbone GAN structure was used for a warm-up training on low-resolution images of size 256×256 (Section 2.4). Then, a progressive growing scheme was introduced for synthesizing high-resolution images (Section 2.5). To enable a smooth transition between layers, fade-in blocks (FIBs) were incorporated into the backbone structure (Section 2.6). Finally, a feature loss was employed to further improve the texture fidelity of synthesized images (Section 2.7). These key components of our image synthesis framework are described in detail in the following sections.

There are two common image-to-image translation tasks in the field of medical image analysis: translation between different imaging modalities (such as MRI T1-to-T2 (Liu, 2019; Dar et al., 2019), CT-to-MRI (Jiang et al., 2018), MRI-to-CT (Nie et al., 2017; Zhao et al., 2018), CT-to-PET (Ben-Cohen et al., 2017), and PET-to-MRI (Choi and Lee, 2018)) and transformation from segmentation maps to medical images. Image synthesis from segmentation maps can generate different images by simply modifying the content in the maps, which is a desirable feature for data augmentation. Since segmentation maps only contain the shape of target structures and lack background details, the transformation from segmentation maps to medical images is generally more difficult than translation between different imaging modalities. For US images, synthesizing from segmentation maps is even harder due to the large amount of noise in US images. Motivated by previous work (Shin et al., 2018; Zhang et al., 2019), we used the edge information of the background texture as auxiliary sketch guidance to achieve high-fidelity synthesis and customized editing of US images. Specifically, the Canny algorithm (Canny, 1986) was applied to real images to extract a binary edge sketch because it is robust against noise. We then updated the original segmentation map of the target object O by superposing the edge sketch S onto it, resulting in the composite label Õ, which is defined as:

Õ = O + S ⊗ (1 − M),

where M (M ∈ {0, 1}) denotes the binary map indicating the area of annotated structures and ⊗ refers to element-wise multiplication. Through the above operation, the auxiliary sketch S is superposed onto the original mask O without affecting the area of the target objects, as shown in Fig. 3. With the additional auxiliary sketch of the background provided for the GAN to learn, our method can generate images with realistic background texture.
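As a concrete illustration of the composite-label construction above, the following Python sketch composes Õ from an object mask and a Canny edge sketch using OpenCV and NumPy. The Canny thresholds and the label id assigned to sketch pixels are illustrative assumptions, not values taken from the paper.

    import cv2
    import numpy as np

    def make_composite_label(us_image, object_mask, low=50, high=150):
        """Build a composite label O_tilde = O + S * (1 - M) from a real US image
        and its annotated object mask. `low`/`high` are illustrative Canny thresholds."""
        # Binary edge sketch S extracted from the real image with the Canny detector.
        sketch = (cv2.Canny(us_image, low, high) > 0).astype(object_mask.dtype)
        # M = 1 inside annotated structures, 0 elsewhere.
        annotated = (object_mask > 0).astype(object_mask.dtype)
        # Superpose the sketch only outside the annotated structures, so the
        # target objects themselves are left untouched.
        sketch_label = object_mask.max() + 1  # spare label id for sketch pixels (assumption)
        return object_mask + sketch_label * sketch * (1 - annotated)

In practice, such a composite map would be computed for every training image before being fed to the generator and discriminator.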
The backbone GAN structure, which synthesizes images of size 256×256 (middle of Fig. 2), is the basic structure for the subsequent high-resolution image synthesis. The composite label generated in the previous section was used as the input of both the generator and the discriminator.

Fig. 2. Overview of our proposed US image synthesis framework. The composite labels contain the original annotated structures as well as auxiliary sketches. For the generator, the input is the composite labels and the output is the generated US images. For the discriminator, the input is the US images and the corresponding composite labels. In the progressive training scheme, the left backbone structure is adopted as a pre-trained model for low-resolution image (256×256) synthesis, and then fade-in blocks (FIB-D, FIB-U) are added for synthesizing realistic high-resolution images (512×512). Moreover, an additional feature extraction network is employed for calculating the mean and covariance of the generated and real images, which are then used to construct the feature loss aimed at further improving synthesis quality.

For the generator, the conditional input was the composite labels and the output was the synthesized US images. We utilized an encoder-decoder architecture with n residual blocks in between. The encoder and decoder comprised three down-sampling blocks and three up-sampling blocks, respectively. Each down-sampling or up-sampling block comprised one convolution or deconvolution layer with a stride of 2, one instance normalization (IN) (Ulyanov et al., 2016) layer, and one ReLU activation function. The number of residual blocks controls the feature extraction ability of the model and can therefore differ across datasets. Each residual block contained two convolution layers with a stride of 1. Except for the first and last convolution layers, which use a kernel size of 1×1 for channel size adaptation, all other convolution and deconvolution layers used a kernel size of 3×3.

For the discriminator, we used the PatchGAN (Isola et al., 2017), whose input was the concatenation of the composite labels and the generated/real images. The PatchGAN can be designed with different output sizes. Each unit of the output reflects the probability of an image patch being real, which is used to calculate the adversarial loss during training. Because the PatchGAN places more restrictions on local regions and feeds more high-frequency information back to the generator, it often performs better than an image-based discriminator. Therefore, we used the PatchGAN structure, which consists of five convolution layers, as the discriminator.

The objective function of the generator, L_G, was composed of a conditional adversarial loss L_GAN^G and an L1 loss L_L1 for low-frequency restriction. In the standard cGAN form, they can be written as:

L_G = L_GAN^G + λ_1 L_L1,
L_GAN^G = E_x[log(1 − D(x, G(x)))],
L_L1 = E_{x,y}[||y − G(x)||_1],

where G and D represent the generator and discriminator, respectively, x denotes the conditional composite labels, y denotes the ground truth US images, and G(x) denotes the images synthesized from input x. The hyperparameter λ_1 was set to 1. The objective function of the discriminator, L_D, is calculated as:

L_D = E_{x,y}[log D(x, y)] + E_x[log(1 − D(x, G(x)))].

The objective function of the discriminator was maximized while the objective function of the generator was minimized during the adversarial training process. The discriminator training alternated with the generator training in each epoch, and this alternating process was repeated until the generated images were sufficiently realistic.
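To make the training objective concrete, the following PyTorch sketch shows one way to implement the generator and discriminator losses above. It assumes a BCE-with-logits form of the adversarial term and a PatchGAN discriminator that takes the channel-wise concatenation of labels and images; it is an illustration, not the authors' exact code.

    import torch
    import torch.nn.functional as F

    lambda_1 = 1.0  # weight of the L1 term, as reported in the text

    def generator_loss(D, x, y, G_x):
        """x: composite labels, y: real US images, G_x: generated images."""
        pred_fake = D(torch.cat([x, G_x], dim=1))      # PatchGAN output, e.g. N x 1 x 30 x 30
        adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
        l1 = F.l1_loss(G_x, y)                         # low-frequency restriction
        return adv + lambda_1 * l1

    def discriminator_loss(D, x, y, G_x):
        pred_real = D(torch.cat([x, y], dim=1))
        pred_fake = D(torch.cat([x, G_x.detach()], dim=1))
        loss_real = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
        loss_fake = F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
        return loss_real + loss_fake

During training, the two losses would be optimized alternately, matching the epoch-wise alternation described above.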
Compared to low-resolution images (256×256), high-resolution images (512×512) contain more fine structures. The backbone architecture alone, as described in Section 2.4, is not capable of extracting enough information for generating high-resolution, realistic images. In order to synthesize high-resolution US images with high fidelity, we adopted the progressive growing scheme (Karras et al., 2017) to decompose the task into incremental learning subtasks. This scheme enables us to use only one generator and one discriminator, with fast and smooth learning, for high-resolution, realistic synthesis. Specifically, we started from the easier task of synthesizing low-resolution images in several warm-up epochs with the backbone structure, and then the weights of the backbone structure were shared with the generator and discriminator for high-resolution US image synthesis.

The entire training process can be divided into four phases. In phase 1, the backbone architecture of the generator and discriminator (Section 2.4) was applied for low-resolution (256×256) US image synthesis. After several warm-up epochs until convergence, this pre-trained architecture enabled good-quality synthesis of low-resolution US images. In phases 2 and 3, we trained the discriminator and generator for high-resolution image synthesis, sequentially and respectively. In this process, new layers were added to the networks, and we faded them in smoothly with the FIBs to avoid sudden shocks to the already well-trained layers (see Fig. 2). The discriminator was trained earlier than the generator to replenish the gradient information and force the generator to learn to synthesize higher-resolution images. Finally, in phase 4, we trained the discriminator and generator together (the rightmost part of Fig. 2) for several more epochs to further enhance performance.

To avoid the sudden shock during training when new layers are added, FIBs were adopted for a smooth transition in both the generator and the discriminator. Fig. 4 illustrates the structure of the FIB. Following the design of residual neural networks, the FIB also uses a skip connection: the side branch skips over the convolutional layers in the main branch and then merges into the main branch through a weighted sum. The weight α (α ∈ [0, 1]) controls the balance between the main and side branches: α = 0 indicates that only the side branch determines the output of the FIB, while α = 1 means that the output depends only on the main branch. The use of FIBs not only smooths the transition between different resolutions but also makes weight sharing possible and training more efficient.

Fig. 4. Structure of the fade-in blocks. These two blocks are used in the progressive growing scheme, with α increasing during the transition phases. The "from image" block represents a layer projecting image channels to feature vectors using a 1×1 convolution, and the "to image" block functions the opposite way. The 512×512 block contains two 3×3 convolution layers, and the 256×256 block represents the original backbone structures adjacent to the newly added structures.

Two kinds of FIB were designed, for down-sampling and up-sampling, denoted FIB-Down (FIB-D) and FIB-Up (FIB-U), respectively. The difference between FIB-D and FIB-U is the direction of resizing: FIB-D halves the resolution using average pooling, while FIB-U doubles the resolution using bilinear interpolation. FIB-D was used in both the generator and the discriminator, while FIB-U was only used in the generator (Fig. 2).
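The fade-in idea can be illustrated with a minimal PyTorch module for an up-sampling block (FIB-U): a lightweight 1×1 side branch and a deeper main branch are blended by the weight α. The channel sizes, layer counts, and placement of normalization below are assumptions for illustration rather than the paper's exact architecture.

    import torch.nn as nn
    import torch.nn.functional as F

    class FadeInBlockUp(nn.Module):
        """Sketch of an up-sampling fade-in block (FIB-U)."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # Main branch: new 3x3 convolution layers for the higher resolution.
            self.main = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.InstanceNorm2d(out_ch), nn.ReLU(True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.InstanceNorm2d(out_ch), nn.ReLU(True),
            )
            # Lightweight side branch: a 1x1 projection only.
            self.side = nn.Conv2d(in_ch, out_ch, 1)
            self.alpha = 0.0  # fade-in weight, raised step by step during training

        def forward(self, x):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            # Weighted sum: alpha = 0 -> side branch only, alpha = 1 -> main branch only.
            return self.alpha * self.main(x) + (1.0 - self.alpha) * self.side(x)

A FIB-D would look the same except that the bilinear up-sampling is replaced by average pooling; during training, alpha is raised toward α_max following the schedule described next.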
Fig. 4 shows the two-branch structure of the FIB. The lightweight side branch helps the pre-trained network adapt easily, while the more complex architecture of the main branch has stronger feature extraction capabilities. To combine these advantages, α is introduced to gradually switch from the side branch to the main one. Specifically, it is increased from 0 to α_max with a fixed step. We noted that updating α in both the discriminator and the generator simultaneously may decrease stability, and thus we increased α alternately when training the two modules. Besides, we also noticed that α_max had an impact on the performance of the generator: a lower α_max may limit feature extraction capability, while a larger one tends to cause a sudden increase in the loss. In comparison, the loss of the discriminator varies smoothly even with a larger α_max. For these reasons, we set α_max to 0.5 for the generator and 1.0 for the discriminator.

By employing the sketch guidance in segmentation maps, the progressive training scheme, and the FIBs, we achieved editable, high-resolution synthesis with undistorted texture. However, there were some small flaws in the generated high-resolution images, which are also commonly seen in other GAN-based synthesis methods. The generated images appeared a little blurry compared with the ground truth images and had some salt-and-pepper noise, especially in regions without annotated labels and sketches. These flaws might be caused by the pixel-wise L1 loss used when training the generator, since this loss only restricts pixel-wise differences and neglects the spatial similarity between pixels (Blau and Michaeli, 2018). However, removing the L1 loss term from the objective function led to poor performance due to the lack of low-frequency restriction. Therefore, we proposed a feature loss to add extra restrictions for better texture synthesis. Inspired by the perceptual loss (Johnson et al., 2016) originally used in style transfer and super-resolution tasks, we calculated the high-level features of the real and generated images and tried to minimize the differences in their mean and covariance. To extract high-level features, images were fed to a ResNet-50 (He et al., 2016) model which was trained with 1500K prenatal US images for standard plane detection, and the output of the conv4 layer of the model was used as the high-level features. We refer to the feature extraction layers of the ResNet-50 as the feature extraction network (FEN). Denoting the FEN features of an image by F(·), and their mean and covariance by μ(·) and C(·), the feature loss L_F is given by:

L_F = ||μ(F(y)) − μ(F(G(x)))|| + ||C(F(y)) − C(F(G(x)))||.

The updated objective function of the generator is given by:

L_G = L_GAN^G + λ_1 L_L1 + λ_2 L_F.

The hyperparameter λ_2 was set to different values for different datasets, which will be discussed in Section 3.
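The following PyTorch sketch shows one possible implementation of such a mean/covariance feature loss. How the statistics are aggregated (here, per image over the spatial positions of the FEN activations) and the choice of norm are assumptions; the paper only states that the mean and covariance of the high-level features of real and generated images are matched.

    import torch

    def feature_stats(feat):
        """feat: FEN activations of shape (N, C, H, W) -> per-image channel mean and covariance."""
        n, c, h, w = feat.shape
        f = feat.reshape(n, c, h * w)                  # treat spatial positions as samples
        mu = f.mean(dim=2)                             # (N, C)
        centered = f - mu.unsqueeze(2)
        cov = centered @ centered.transpose(1, 2) / (h * w - 1)   # (N, C, C)
        return mu, cov

    def feature_loss(feat_real, feat_fake):
        """Penalize differences in the mean and covariance of high-level features."""
        mu_r, cov_r = feature_stats(feat_real)
        mu_f, cov_f = feature_stats(feat_fake)
        return (torch.norm(mu_r - mu_f, dim=1).mean()
                + torch.norm(cov_r - cov_f, dim=(1, 2)).mean())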
Our experiments were implemented using PyTorch with a single NVIDIA GeForce RTX 2080 Ti GPU. The training of the proposed spGAN can be split into four phases, as described in Section 2.5. In phase 1, we first trained the backbone structure of the GAN for low-resolution image synthesis. We used a batch size of 4 with the Adam optimizer. The learning rates of the generator and discriminator were set to 0.001 and 0.0001, respectively. The weight of the feature loss term in the generator objective λ_2, the number of residual blocks in the generator n, and the output size of the PatchGAN s were 10, 15, and 30×30 for the COVID-19 dataset, 10, 15, and 120×120 for the hip joint dataset, and 5, 10, and 30×30 for the ovary dataset. Note that the above settings were determined by grid search and used in all experiments unless specified otherwise. For the COVID-19 and ovary datasets, the models in phases 2 and 3 were trained for 50 epochs, with α increasing by a step of 1/50. For the hip joint dataset, the models in phases 2 and 3 were trained for 100 epochs, with α increasing by a step of 1/100. In phase 4, the full model was trained for 200 epochs for the COVID-19 and ovary datasets, and for 400 epochs for the hip joint dataset.

We evaluated the performance of our proposed method on three US datasets: COVID-19, hip joint, and ovary images. Both qualitative and quantitative results are presented. For qualitative results, we compared the real and synthesized US images and showed heatmaps of the pixel-level differences between them. For quantitative evaluation, four numerical metrics were adopted: the Fréchet inception distance (FID) (Heusel et al., 2017), the kernel inception distance (KID) (Bińkowski et al., 2018), the multi-scale structural similarity (MS-SSIM) (Wang et al., 2003), and the learned perceptual image patch similarity (LPIPS). Both FID and KID measure the distance between images at the feature level; the difference is that the KID estimates are unbiased. Lower FID and KID values indicate better performance. MS-SSIM measures the similarity of paired images, ranging from 0 to 1, where a larger value means better performance. For LPIPS, we used a pre-trained ResNet-50 (He et al., 2016) network to calculate the perceptual differences in multiple layers, with smaller differences meaning better performance.

In this section, we present the results of low-resolution (256×256) image synthesis for the different US datasets. To demonstrate the effectiveness of the sketch guidance and the feature loss used in our synthesis framework, we compared the results of three methods: baseline, baseline+S, and baseline+S+FL. The baseline used only the backbone structure of the GAN (Section 2.4), baseline+S incorporated the auxiliary sketch guidance (Section 2.3), and baseline+S+FL additionally incorporated the feature loss (Section 2.7). Due to the different numbers of training samples and different texture complexities of the three datasets, the three methods were trained for different numbers of epochs: 150, 350, and 300 for the COVID-19, hip joint, and ovary datasets, respectively. The qualitative results are shown in Fig. 5, with the difference heatmap between the generated and ground truth (GT) images shown in the bottom right corner of the generated images. The colors in the heatmap from blue to red correspond to differences from small to large. As shown in Fig. 5, the addition of sketch guidance to the baseline remarkably improved the quality of the synthesized images. The checkerboard pattern in the COVID-19 images generated by the baseline method was successfully removed when the baseline+S method was used. Moreover, we observed further visual improvement when the feature loss was applied: some scattered dark dots present in the COVID-19 images generated with the baseline+S method were successfully removed when baseline+S+FL was used. The quantitative results in Table 1 show this more clearly, with large improvements in all performance metrics for all datasets when the sketch guidance was added to the baseline. Adding the feature loss to the baseline+S method further improved the performance for all datasets in terms of all metrics except the MS-SSIM and LPIPS for the hip joint dataset.
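For reference, the FID values reported in these comparisons follow the standard definition of Heusel et al. (2017): the Fréchet distance between Gaussian fits to the Inception features of the real and generated image sets. A minimal NumPy/SciPy sketch of that computation (not the authors' evaluation code) is:

    import numpy as np
    from scipy import linalg

    def frechet_inception_distance(feats_real, feats_fake):
        """Standard FID between two sets of Inception features, each of shape (N, D)."""
        mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        sigma1 = np.cov(feats_real, rowvar=False)
        sigma2 = np.cov(feats_fake, rowvar=False)
        covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
        covmean = covmean.real  # discard small imaginary parts from numerical error
        return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * covmean))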
For synthesizing realistic, high-resolution (512×512) US images, our proposed spGAN included three key components: the auxiliary sketch guidance in segmentation maps, the pre-trained model for low-resolution image synthesis, and the progressive growing of α in the FIBs. The proposed spGAN v2 included an additional key component, the feature loss. We conducted extensive ablation studies to evaluate these components. The proposed spGAN was compared with three variants, each with one of the three components removed, as shown in Table 2: spGAN-sketch, spGAN-pretrain, and spGAN-growing. Further, we compared spGAN and spGAN v2 to explore the effect of the feature loss. In addition, we also included as a comparison method the bilinear interpolation of the low-resolution images generated by baseline+S+FL (see Section 3.2), denoted as 256+interp.

The quantitative results are presented in Table 2. Compared with the interpolation results based on the generated low-resolution images, the proposed spGAN v2 achieved large performance gains in almost all evaluation metrics for all three datasets, except the MS-SSIM metric for the hip joint dataset. This means that the proposed high-resolution image synthesis framework can learn feature representations with more realistic details than simple image interpolation. Removing any one of the three components used in the spGAN degraded the performance in most cases, which confirms the effectiveness of these components. Of the three components, the most important one is the auxiliary sketch guidance: removing it led to a significant decrease in performance for all metrics and all datasets, indicating that the sketch guidance is essential for the model to learn convincing texture. Figs. 6, 7, and 8 show examples of the generated images and difference heatmaps for the COVID-19, hip joint, and ovary images, respectively. Consistent with the quantitative results, we observed that the proposed spGAN produced images that were more realistic than those of the other baselines in most cases.

To conduct the user study for each dataset, we randomly chose 200 images of four different types: GT and the images generated by three methods (i.e., spGAN-sketch, spGAN, and spGAN v2, as explained in Section 3.3), each type with 50 images. Five doctors were asked to view these images and tell whether they were real or fake. For each of the four types, we report the accuracy as the fraction of images correctly classified. The results are provided in Table 3. A lower accuracy for an image synthesis method means that the images generated by this method are more realistic, while the accuracy for GT should be close to 1. Based on the results in Table 3, radar charts were drawn for each dataset for a more intuitive comparison, as shown in Fig. 9; each vertex in the radar chart indicates the accuracy for the corresponding type of image source. Most doctors were good at recognizing GT images, and therefore the average accuracy (last row in Table 3) for GT images was high. However, on the ovary dataset, the evaluations of Doctor 2 and the other doctors differed considerably: Doctor 2 tended to classify almost all images as fake, whereas the other doctors tended to classify most images as real. Among the different methods, the accuracy for spGAN-sketch (i.e., the spGAN without sketch guidance) was higher, especially on the COVID-19 dataset.
This indicates that the sketch guidance plays an important role in synthesizing realistic images. Without the sketch guidance, the generated COVID-19 images present obvious checkerboard patterns (Fig. 6), which makes it very easy for the doctors to distinguish them from real images. We also observed that the accuracy for the spGAN-sketch method on the hip joint and ovary datasets was much lower than that on the COVID-19 dataset, indicating that the sketch guidance is particularly necessary for lung US image synthesis. Among the three image synthesis methods, spGAN v2 achieved a lower average accuracy than the other two methods on all datasets, except that it achieved a higher average accuracy than spGAN on the ovary dataset. This means that, in general, the use of the sketch guidance and feature loss can improve the quality of the generated images.

Furthermore, we developed a platform for editable image synthesis. Via customized editing of label maps, this platform can be used as a convenient tool to simulate various, meaningful US images for training sonographers and deep neural networks. Fig. 10 shows some examples synthesized with the proposed spGAN v2 method after editing the label maps. For COVID-19, artifacts such as the pleura line, A-line, B-line, and lung consolidation are evaluated for grading disease severity; we can simulate a more severe COVID-19 case by adding B-lines to the label map. In hip joint US images, the diagnosis of infant hip dysplasia depends on the relative position of different structures; by changing the relative position of the structures in the label map of a normal case, we can create a case of developmental dislocation of the hip with high fidelity. Finally, we can easily synthesize new ovary US images by changing the size of the ovary and the number of follicles in the label map. A video demo of the editable synthesis using our developed platform is provided in the Supplementary file Editabledemo.mp4.

To explore whether using our method for data augmentation can improve segmentation performance, we compared the results of three settings: no augmentation (baseline), traditional augmentation (trad), and traditional plus GAN-based augmentation (trad+GAN). Furthermore, in addition to using the full training data, we also experimented with 20% of the training data to see how these data augmentation methods performed on small datasets. We used U-Net (Ronneberger et al., 2015) as the segmentation model and DICE as the performance metric. We employed online augmentation and set the probability to 0.3 for both traditional and GAN-based augmentation. For traditional augmentation, the operations included rotation, translation, scaling, blurring, gamma transformation, and adding Gaussian noise. For GAN-based augmentation, we randomly edited the label maps by morphological operations and used spGAN v2 to synthesize images.
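The snippet below sketches what one step of this online GAN-based augmentation could look like: with probability 0.3 the label map is randomly edited and a matching image is synthesized. The choice of dilation/erosion as the morphological edit and the generator interface are assumptions made for illustration.

    import random
    import numpy as np
    import cv2

    AUG_PROB = 0.3  # online augmentation probability used in our experiments

    def gan_augment(label_map, us_image, generator):
        """`generator` is a trained spGAN v2-style model mapping composite labels
        to US images; its exact interface is assumed here."""
        if random.random() >= AUG_PROB:
            return label_map, us_image  # leave the sample unchanged
        # Randomly edit the label map with a simple morphological operation
        # (dilation or erosion of the annotated structures, as an illustrative choice).
        kernel = np.ones((5, 5), np.uint8)
        op = random.choice([cv2.dilate, cv2.erode])
        edited = op(label_map.astype(np.uint8), kernel, iterations=1)
        # Synthesize a new image consistent with the edited label map.
        synthetic = generator(edited)
        return edited, synthetic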
The segmentation results are presented in Table 4. For the hip joint and ovary datasets, traditional augmentation improved segmentation performance when 20% of the training data were used but failed when the full training data were used. For the COVID-19 dataset, traditional augmentation did not improve performance regardless of whether a portion or all of the training data was used. The reason may be that the lesion areas in COVID-19 images have rich styles, and thus it is hard to gain additional information through the simple operations of traditional augmentation. In contrast, our GAN-based augmentation method can provide greater variability with editable operations and therefore has great potential to improve performance. As shown in Table 4, our method in general not only achieved superior segmentation results when a small training set was available but also further improved performance even when the full training set was used.

In our proposed spGAN v2 method, there are three important parameters: the weight λ_2 of the feature loss term in the generator objective, the number of residual blocks n in the generator, and the output size s of the discriminator. To investigate the effect of these parameters on the performance of spGAN v2, we tested values of λ_2 from {0, 5, 10, 15}, n from {0, 5, 10, 15, 20}, and s from {1×1, 30×30, 60×60, 120×120, 256×256}. Figs. 11, 12, and 13 show the results for each of the three parameters, respectively.

According to the results in Fig. 11, we chose λ_2 = 10 for the COVID-19 and hip joint datasets and λ_2 = 5 for the ovary dataset. A larger λ_2 means that more emphasis is put on the similarity of high-level features (feature loss) instead of the pixel-wise or structural similarity (L1 loss). Compared with the COVID-19 and hip joint datasets, the structures in the ovary US images (i.e., ovary and follicles) have a clear boundary and regular shape; therefore, a smaller λ_2 is more suitable for the ovary dataset.

Based on the results in Fig. 12 and visual evaluation of the synthesized images, we set the number of residual blocks n to 15, 15, and 10 for the COVID-19, hip joint, and ovary datasets, respectively. Although n = 20 achieved better MS-SSIM and LPIPS performance than n = 15 for the hip joint dataset, the visual quality of the synthesized images was worse, so n = 15 was used for the hip joint dataset instead. Generally, more residual blocks in the generator mean a stronger ability to learn features; however, too many residual blocks tended to degrade performance. As shown in Fig. 12, fewer residual blocks are needed for the ovary dataset than for the other two datasets, probably because the texture in the ovary images is simpler than that in the other two types of images.

The PatchGAN was used as the discriminator in our synthesis framework. Each unit of the discriminator output corresponds to a local receptive field, so a smaller output size indicates that each unit of the output represents a larger region. For instance, an output size of 1×1 means that the discrimination is made from the whole image, without local information fed back to the generator. Conversely, an output size of 256×256 means that the discrimination is made from only a few pixels, neglecting the contextual information of the surrounding region. As shown in Fig. 13, the results with a 1×1 output size are worse on all three datasets, and an output size that is too small or too large often degrades performance. According to the results in Fig. 13 and the visual quality of the generated images, we set the output size to 30×30, 120×120, and 30×30 for the COVID-19, hip joint, and ovary datasets, respectively.

Table 3. The classification accuracies of five doctors for each of the four types of images: the GT and the synthesized images by three methods (b: spGAN-sketch, e: spGAN, f: spGAN v2).
In this paper, to address the challenge of lacking enough data for training sonographers and deep neural networks, we propose an image-to-image translation framework aimed at generating high-resolution and high-fidelity US images from segmentation label maps. The proposed spGAN v2 method consists of four key components: auxiliary sketch guidance, a progressive growing scheme, fade-in blocks, and a feature loss. Specifically, the auxiliary sketch guidance provides auxiliary information for generating realistic background texture. The progressive growing scheme, comprising a pre-trained model for generating low-resolution images and the fade-in blocks, is employed for a smooth transition from low resolution to high resolution. To further improve the quality of the generated images, the feature loss is adopted to suppress noise and deblur the images. Extensive qualitative and quantitative experiments on the COVID-19, hip joint, and ovary datasets demonstrate the effectiveness of the proposed spGAN v2 method. Another important feature of our work is that we have developed an editable image synthesis platform that can easily create various and meaningful US images by modifying segmentation label maps. Overall, our study provides a useful and convenient tool for generating high-resolution and high-fidelity US images from label maps.

References
A pipeline for the generation of realistic 3D synthetic echocardiographic sequences: methodology and open-access database.
Virtual PET images from CT data using deep convolutional networks: initial results.
Demystifying MMD GANs.
The perception-distortion tradeoff.
Real-time GPU-based ultrasound simulation using deformable mesh models.
A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Computational analysis of pathological images enables a better diagnosis of TFE3 Xp11.2 translocation renal cell carcinoma.
Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans.
Generation of structural MR images from amyloid PET: application to MR-less quantification.
Image synthesis in multi-contrast MRI with conditional generative adversarial networks.
Computer-aided diagnosis in medical imaging: historical review, current status and future potential.
GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification.
Breast ultrasound image synthesis using deep convolutional generative adversarial networks. Diagnostics 9.
Convolutional neural networks for computer-aided detection or diagnosis in medical image analysis: an overview.
Deep residual learning for image recognition.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
Freehand ultrasound image simulation with spatially-conditioned generative adversarial networks, in: Molecular Imaging, Reconstruction and Analysis of Moving Body Organs, and Stroke Imaging and Treatment.
Image-to-image translation with conditional adversarial networks.
Field: a program for simulating ultrasound systems.
Simulation of advanced ultrasound systems using Field II.
Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation.
Perceptual losses for real-time style transfer and super-resolution.
Progressive growing of GANs for improved quality, stability, and variation.
Visualization and GPU-accelerated simulation of medical ultrasound from CT images. Computer Methods and Programs in Biomedicine 94.
Synthesis and edition of ultrasound images via sketch guided progressive growing GANs.
SUSAN: segment unannotated image structure using adversarial network.
Deep learning in medical ultrasound analysis: a review. Engineering.
Comparison of texture synthesis methods for content generation in ultrasound simulation for training.
Conditional generative adversarial nets.
Medical image synthesis with context-aware generative adversarial networks.
Generation of synthetic but visually realistic time series of cardiac images combining a biophysical model and clinical images.
Unsupervised representation learning with deep convolutional generative adversarial networks.
Simulation model of intravascular ultrasound images.
U-Net: convolutional networks for biomedical image segmentation.
Generative adversarial networks (GANs): an overview of theoretical model, evaluation metrics, and recent developments.
Abnormal colon polyp image synthesis using conditional adversarial networks for improved detection performance.
Simulating patho-realistic ultrasound images using deep generative networks with adversarial learning.
k-Wave: MATLAB toolbox for the simulation and reconstruction of photoacoustic wave fields.
Instance normalization: the missing ingredient for fast stylization.
Multiscale structural similarity for image quality assessment.
Generative adversarial network in medical imaging: a review.
The unreasonable effectiveness of deep features as a perceptual metric.
SkrGAN: sketching-rendering unconditional generative adversarial networks for medical image synthesis.
Craniomaxillofacial bony structures segmentation from MRI with deep-supervision adversarial learning.
A framework for the generation of realistic synthetic cardiac ultrasound and magnetic resonance imaging sequences from the same virtual patients.