key: cord-0444462-kxks62fc authors: Uzolas, Lukas; Rico, Javier; Coupé, Pierrick; SanMiguel, Juan C.; Cserey, Gyorgy title: Deep Anomaly Generation: An Image Translation Approach of Synthesizing Abnormal Banded Chromosome Images date: 2021-09-20 journal: nan DOI: nan sha: 822092c3769ddcdce2ccabda0b2a1ab44e9a23c5 doc_id: 444462 cord_uid: kxks62fc

Advances in deep-learning-based pipelines have led to breakthroughs in a variety of microscopy image diagnostics. However, a sufficiently big training data set is usually difficult to obtain due to high annotation costs. In the case of banded chromosome images, the creation of big enough libraries is difficult for multiple pathologies due to the rarity of certain genetic disorders. Generative Adversarial Networks (GANs) have proven to be effective in generating synthetic images and extending training data sets. In our work, we implement a conditional adversarial network that allows generation of realistic single chromosome images following user-defined banding patterns. To this end, an image-to-image translation approach based on self-generated 2D chromosome segmentation label maps is used. Our validation shows promising results when synthesizing chromosomes with seen as well as unseen banding patterns. We believe that this approach can be exploited for data augmentation of chromosome data sets with structural abnormalities. Therefore, the proposed method could help to tackle medical image analysis problems such as data simulation, segmentation, detection, or classification in the field of cytogenetics.

In the field of cytogenetics, a discipline that studies the structure and function of chromosomes, karyotyping analysis is an important diagnostic procedure [1]. Using staining techniques, Giemsa staining being the most widely used [2], the 23 pairs of metaphase chromosomes that determine the chromosome complement of an individual are divided by stripes known as banding patterns, or G-bands when Giemsa stained.
These bands, which serve as landmarks on each healthy chromosome type, are standardized in diagrams called ideograms. Hence, all bands present in an ideogram are present in a properly stained healthy chromosome, while the opposite is not necessarily true [3]. Chromosomal abnormalities are the direct cause of genetic diseases [4] and exist in two main types, affecting either chromosome number or chromosome structure. The former can be identified by simply counting [5]. In contrast, the latter are caused by a small portion of a chromatid arm breaking, recombining, or going missing, hence reflecting a deviation from the standardized ideogram; their detection is an important problem in cytogenetics [6].

Extensive work is being undertaken to study how machine learning can help pathology diagnosis. Li et al. [6], for example, developed CS-GANomaly for chromosome anomaly detection. However, the lack of data is often mentioned [7], especially for diseases with complex karyotypes involving multiple structural abnormalities (e.g. leukemia [8, 9]). Many applications adopt Generative Adversarial Networks (GANs) as a Data Augmentation (DA) technique to improve the performance of deep learning models [10, 11]. This has become a trend in medical imaging, where plenty of examples exist, such as lung nodules [12], liver lesions [13], and COVID-19 CT scans [14]. Wu et al. [15] proposed MD-GAN to generate various data modes of healthy chromosomes to train a classifier for karyotyping. Furthermore, approaches can be found for DA of images presenting user-defined anomalies. C. Han et al. [16], for example, used a conditional PGGAN-based DA to generate Magnetic Resonance (MR) images with brain metastases at desired positions, improving disease detection [17]. Another approach to generating data for a desired acquisition modality is through image-translation networks, such as pix2pix [18] or CycleGAN [19].
For example, Choi and Lee [20] used pix2pix to generate structural MR images from Positron Emission Tomography images, and Yan et al. [21] used CycleGAN for domain adaptation of MR images. While MD-GAN is used to generate healthy chromosomes [15], and CS-GANomaly is employed for detection of abnormal ones [6], we attempt to generate abnormal chromosomes through specific control over the banding patterns. To the best of our knowledge, this has not been attempted before. To achieve this, we condition a pix2pix network on 2D banded chromosome segmentation masks to synthesize realistic single chromosome images following user-defined banding patterns, allowing for the simulation of chromosomes with abnormalities of the structural type.

We use pix2pix [18] for conditionally generating chromosome images based on a user-defined banding pattern and a shape mask. We automatically extract 1D binary banding patterns, which are back-projected onto chromosome shape images to generate 2D banded segmentation masks. The chromosome images, in combination with their respective 2D banded segmentation masks, can be used to train a pix2pix network (see Section 2.1). After training, realistic-looking chromosomes can be generated that display a desired band configuration (see Section 2.2). An overview of this approach can be seen in Figure 1. More details are given in the following sections.

Banding Pattern Extraction. An automated banding pattern extraction procedure is implemented, as the corresponding ideograms cannot be used as banding pattern approximations. This is due to several reasons: firstly, the lack of annotations; secondly, the variable regions of healthy chromosomes [3]; and thirdly, the presence of unhealthy chromosomes in the data set. Let $y_{real} \in \mathbb{R}^{M \times N}$ be a real grayscale chromosome image of spatial dimension $M \times N$ (see Figure 2a).
The banding pattern extraction function $f_{BPE}$ takes a chromosome image as input and produces a binary banding pattern vector $bp_{real} \in \{0, 1\}^{K}$ such that $f_{BPE}(y_{real}) = bp_{real}$, where $K$ denotes the length of the vector (see the 2D representation of this vector in Figure 2g). The function $f_{BPE}$ is based on extracting the density profile of a chromosome [22], and follows the approach suggested in [23] with some modifications. The density profile is extracted by approximating the medial axis of a chromosome shape, $s_{real}$, via several line segments. The shape mask itself is generated through binary segmentation of $y_{real}$ (see Figure 2b). At each point $k$ on the medial axis, a perpendicular line is constructed within the shape mask; we denote the rasterized points making up one such line as $P_k$ (see Figure 2d, perpendicular lines). All grayscale values of $y_{real}$ at $P_k$ are averaged and make up the density profile value at $k$ (see Figure 2e). Unlike [23], we interpolate between the angles of these perpendicular lines, which results in less over- and under-sampling of chromosome areas. As in [23], a non-linear filter is applied to the density profile, which transforms each band into a uniform density (see Figure 2f), where areas between peaks and valleys correspond to black and white bands respectively. Additionally, we split ambiguous saddle points into two, assigning each half to the neighboring value, producing the final banding pattern $bp_{real}$ (see Figure 2g).

Banded Segmentation Masks Generation. Pix2pix needs a paired set of training data, displaying the same image in different domains. We define the source domain to be banded segmentation masks (see Figure 2h), while the target domain is defined as realistic-looking chromosome images.
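The source-domain masks depend on the banding pattern extraction described above. A heavily simplified sketch of that extraction is given here for orientation only. It assumes the chromosome is straight and vertical, so that the perpendicular lines $P_k$ reduce to image rows, and it replaces the non-linear filter of [23] with a plain midpoint threshold. The function names `density_profile` and `binarize_profile` and the encoding (0 for a dark band, 1 for a light band) are hypothetical and not taken from the paper.

```python
import numpy as np


def density_profile(image, shape_mask):
    """Mean gray value per image row, restricted to the shape mask.

    Simplification: the medial axis is assumed to be a vertical line, so each
    image row plays the role of one perpendicular line P_k.
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(shape_mask, dtype=bool)
    occupied = mask.any(axis=1)                 # rows that contain chromosome pixels
    sums = (image * mask).sum(axis=1)[occupied]
    counts = mask.sum(axis=1)[occupied]
    return sums / counts


def binarize_profile(profile):
    """Stand-in for the non-linear filter: threshold at the profile midpoint.

    Returns 0 for dark (black) bands and 1 for light (white) bands.
    """
    profile = np.asarray(profile, dtype=float)
    threshold = 0.5 * (profile.min() + profile.max())
    return (profile > threshold).astype(np.uint8)
```

On a curved chromosome the rows would have to be replaced by rasterized lines perpendicular to the medial axis, which is the part this sketch leaves out.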
We define a function $f_{SEG}$ that takes a banding pattern and a binary chromosome shape mask as arguments, such that $f_{SEG}(bp_{real}, s_{real}) = \tilde{x}_{real}$, where $\tilde{x} \in \{0, 127, 255\}^{M \times N}$ is an approximation of the real banded segmentation mask $x$ of the same dimension. The numbers correspond to black bands, white bands, and background respectively. To do so, each value $bp(k)$ is replicated along its corresponding perpendicular points, $P_k$, within the binary shape mask. Some points in the shape mask might not have been sampled during the extraction process. This happens frequently at points located far away from the medial axis, where the chromosome is curving. In these cases, holes are filled with the value of the nearest neighbor.

Training of pix2pix. Pix2pix is a conditional GAN (cGAN) that can translate images from a source domain to a target domain. We use it to translate images from banded chromosome segmentation masks (source domain $X$) into realistic-looking chromosomes (target domain $Y$). It is based on a U-Net [24] type generator $G$ and a patch-based discriminator $D$. Both networks are pitted against each other in a minimax optimization game: $G$ learns how to generate $y \in Y$ with $x \in X$ as input, while $D$ is tasked to decide whether $y$ is synthetic or not. As no real banded segmentation masks exist, we utilize the approximations $f_{SEG}(f_{BPE}(y_{real}), s_{real}) = \tilde{x}$. This builds a tuple of paired data $(\tilde{x}, y)$ for training, where $\tilde{x}$ is translated into $y$ (see Figure 1a). The learning objective of the network otherwise remains unchanged, and is given in [18]. After training, $D$ is discarded, as only $G$ is necessary to generate synthetic chromosomes.

To mimic abnormal chromosomes, we randomly generate banding patterns based on Perlin noise. Considering that real banding patterns consist of clustered black or white regions of various sizes, random noise would be inadequate to mimic the coherence of real banding patterns.
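A minimal sketch of how such a pattern could be drawn and turned into a banded segmentation mask is given below. It uses a plain single-octave 1D gradient (Perlin) noise thresholded at zero, and a back-projection that only handles straight, vertical shapes by copying one pattern value per image row. The paper does not state the noise parameters or the threshold, so `n_cells`, the zero threshold, the 0/1 encoding, and the function names are all assumptions made for illustration.

```python
import numpy as np


def perlin_1d(length, n_cells, rng):
    """Single-octave 1D gradient noise sampled at `length` points."""
    grads = rng.uniform(-1.0, 1.0, size=n_cells + 1)   # one gradient per lattice point
    x = np.linspace(0.0, n_cells, num=length, endpoint=False)
    cell = np.floor(x).astype(int)
    t = x - cell                                       # position inside the cell
    left = grads[cell] * t                             # contribution of left lattice point
    right = grads[cell + 1] * (t - 1.0)                # contribution of right lattice point
    fade = t * t * t * (t * (t * 6.0 - 15.0) + 10.0)   # smoothstep 6t^5 - 15t^4 + 10t^3
    return left + fade * (right - left)


def perlin_banding_pattern(length, n_cells=6, seed=None):
    """Binary banding pattern: 0 = black band, 1 = white band (assumed encoding)."""
    rng = np.random.default_rng(seed)
    return (perlin_1d(length, n_cells, rng) > 0.0).astype(np.uint8)


def back_project(pattern, shape_mask):
    """Toy f_SEG for straight vertical shapes: one pattern value per occupied row.

    Output values follow the paper's convention: 0 black, 127 white, 255 background.
    """
    pattern = np.asarray(pattern)
    mask = np.asarray(shape_mask, dtype=bool)
    rows = np.flatnonzero(mask.any(axis=1))
    # Nearest-neighbour resampling of the pattern to the number of occupied rows.
    idx = np.arange(len(rows)) * len(pattern) // len(rows)
    row_values = np.where(pattern[idx] == 0, 0, 127).astype(np.uint8)
    seg = np.full(mask.shape, 255, dtype=np.uint8)
    for r, v in zip(rows, row_values):
        seg[r, mask[r]] = v
    return seg
```

Because neighbouring samples of the noise are correlated, the thresholded output consists of runs of equal values rather than isolated pixels, which is the coherence property the text asks for. A curved chromosome would need the perpendicular-line sampling and nearest-neighbor hole filling described above instead of the row-wise copy.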
Perlin noise has been used, for example, to generate synthetic breast tissues [25]. To distinguish them, we denote these abnormal banding patterns as $bp_{perlin}$ and the real ones as $bp_{real}$. The Perlin bands can be back-projected onto arbitrary chromosome shapes in the same manner as real ones. This results in abnormal banded segmentation masks, which are then used to generate chromosome images exhibiting the given band configuration (see Figure 1b).

Dataset. Karyotypes from the PKi-3 data set [26] and self-collected karyotypes from various public online resources compose our final G-banding chromosome data set, displaying 400-550 band level resolutions. The former source consists of 612 karyotypes of size 768 × 582 × 1 with chromosomes at pro- or metaphase stage. The latter is a compilation of healthy and unhealthy karyotypes, consisting of 445 human karyotype RGB images with varying dimensions and quality. Single chromosomes are extracted from the karyotypes, resulting in a total of 42684 images. We apply the following operations to all images: (1) the images are transformed to grayscale, (2) square padding is applied to meet the size of the biggest chromosome in the karyotype, (3) the images are resized to 128 × 128. We split the chromosome images into a train, validation, and test set with a 0.7, 0.15, 0.15 ratio. Furthermore, we create an additional set of banded segmentation masks, where random Perlin bands are back-projected onto the real chromosome shapes of the test set. This set is called $test_{perlin}$, in contrast to $test_{real}$.

Metrics. Evaluating the quality of GAN-based synthetic medical images is an ongoing research topic [27, 28]. However, the focus of our method lies in conditioning a GAN on banding patterns. Thus, we adopt appropriate metrics. Firstly, we calculate the dice score between the banded segmentation masks $\tilde{x}_{input}$ and $\tilde{x}_{fake}$, where input can be either real or perlin, omitting the background.
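One way to compute such a score on three-valued masks is sketched below. The text does not spell out how the two band classes are combined, so averaging the per-class dice of the black and white labels while never scoring the background label is an assumption, as are the name `banded_dice` and the label constants.

```python
import numpy as np

BLACK, WHITE, BACKGROUND = 0, 127, 255   # mask values as defined for the banded masks


def banded_dice(mask_a, mask_b):
    """Mean dice over the black and white band labels; background is not scored."""
    a = np.asarray(mask_a)
    b = np.asarray(mask_b)
    scores = []
    for label in (BLACK, WHITE):
        in_a = a == label
        in_b = b == label
        denom = in_a.sum() + in_b.sum()
        if denom == 0:
            continue                      # label absent from both masks
        overlap = np.logical_and(in_a, in_b).sum()
        scores.append(2.0 * overlap / denom)
    # Two masks without any band pixels are treated as a perfect match.
    return float(np.mean(scores)) if scores else 1.0
```

With this definition, two masks that share a shape but carry unrelated, balanced banding patterns would score around 0.5, which is the kind of chance-level baseline the results refer to.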
As the generator yields $G(\tilde{x}_{input}) = y_{fake}$ only, we create the banded segmentation masks for the synthetic chromosome in the same manner as for the real ones, namely $\tilde{x}_{fake} = f_{SEG}(f_{BPE}(y_{fake}))$ (see Figure 1b). The dice score measures the area of overlap between two segmentation masks (1 best, 0 worst) and is commonly used in medical segmentation tasks [29]. Furthermore, A. S. Pires et al. [30] used the dice coefficient to measure the similarity of fungi minichromosome banded profiles. We measure the dice score on the 2D banded segmentation mask, instead of employing a 1D metric on the banding patterns directly, as it is more robust to variation in the extracted medial axis of $\tilde{x}_{fake}$. We further propose to measure the Mean Absolute Error Number of Bands (MAENB, the lower the better), which measures the mean deviation in the number of bands of the synthetic chromosomes, per band, relative to the input banding pattern:

$$\mathrm{MAENB} = \frac{1}{N} \sum_{i=1}^{N} \alpha_i \left( \left| \, |bp^{i}_{input}|_b - |bp^{i}_{fake}|_b \, \right| + \left| \, |bp^{i}_{input}|_w - |bp^{i}_{fake}|_w \, \right| \right)$$

where $|bp|_b$ and $|bp|_w$ denote the number of black and white bands in an extracted banding pattern respectively, $N$ is the number of evaluated samples, and $\alpha$ is a per-sample normalization factor defined as $\alpha = 1/(|bp_{input}|_w + |bp_{input}|_b)$. Hence, MAENB can give an insight into the discrepancy between the number of input and output bands.

Training and implementation details. The implementation is realized in Python and makes use of OpenCV [31] as well as scikit-learn [32]. The pix2pix network is taken from the PyTorch implementation by the original authors, including the hyperparameters, except for the batch size, which is set to 32. The network was trained for 100 epochs on the training set on a shared cluster with NVIDIA V100 and V100S GPUs, and validated every 10 epochs. Each epoch took roughly 17 minutes to train. By default, the network upscales the images to 256 × 256; for evaluation, however, the images are downscaled to 128 × 128.

Quantitative Results. Using the metrics, we can define a point of convergence for pix2pix.
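For reference when reading the MAENB values, one reading of the metric that is consistent with the definitions in the Metrics paragraph (per-sample absolute differences in black and white band counts, scaled by α, then averaged over samples) can be sketched as follows. The encoding of a banding pattern as a 0/1 vector with 0 for black and 1 for white, and the names `count_bands` and `maenb`, are assumptions for illustration.

```python
import numpy as np


def count_bands(pattern):
    """Return (black, white) band counts, i.e. the number of runs of 0s and of 1s."""
    p = np.asarray(pattern).astype(int)
    if p.size == 0:
        return 0, 0
    # A run starts at index 0 and wherever the value differs from its predecessor.
    starts = np.concatenate(([True], p[1:] != p[:-1]))
    run_values = p[starts]
    black = int((run_values == 0).sum())
    white = int((run_values == 1).sum())
    return black, white


def maenb(input_patterns, fake_patterns):
    """Mean over samples of alpha * (|black diff| + |white diff|)."""
    errors = []
    for bp_in, bp_fake in zip(input_patterns, fake_patterns):
        b_in, w_in = count_bands(bp_in)
        b_fk, w_fk = count_bands(bp_fake)
        alpha = 1.0 / (b_in + w_in)       # normalise by the number of input bands
        errors.append(alpha * (abs(b_in - b_fk) + abs(w_in - w_fk)))
    return float(np.mean(errors))
```

Under this reading, a value of 0.1 would mean that, on average, the number of extracted bands deviates from the input by about one band in ten.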
Table 1 shows that the network yields the best results around epoch 50 with a dice score of 81.5% and a MAENB value of 0.111. Final test results for the real bands are nearly identical to those on the validation set (see Table 3). Nonetheless, a decline in performance can be observed for the $test_{perlin}$ set, with a drop of roughly 10% in dice score and an increase of MAENB by around 0.05. We further compare the corresponding $\tilde{x}^{i}_{perlin} \in test_{perlin}$ with $\tilde{x}^{i}_{real} \in test_{real}$ for all $i$ as a baseline, where $i$ denotes the sample number. Note that these corresponding masks have the same shape and differ only in their banding patterns. Doing so reveals that the real and Perlin bands are uncorrelated, since the resulting dice score is close to that of random binary guessing. It further serves as a baseline, as it represents the use case where only the shape, but not the banding pattern, can be conditioned. This shows that the generator is able to partially impose the abnormal banding patterns, as $test_{perlin}$ achieves a dice score of 71%, significantly higher than the baseline. Other state-of-the-art methods employing the dice score as a measure of similarity for a segmentation task involving cGANs for image translation report similar scores. For example, C. Chen et al. [33] obtained a maximum average dice score of 83% on a semi-supervised segmentation task. In addition, Yan et al. [21] used CycleGAN for domain adaptation of MR images, trained a U-Net, and achieved dice scores of 80.5% and 86.7% on two data sets respectively.

The dice score is relatively consistent across chromosome classes, with the exception of higher classes with Perlin bands as input (see Figure 4). Chromosomes 1 to 22 are assigned their numbers based on size, apart from chromosome 22, which is slightly larger than 21 [5]. Hence, the drop in performance might be due to less variability in expressible banding patterns of short length.
Similarly, the performance on class 23 likely increases again because we do not differentiate between the longer X and the shorter Y chromosome during evaluation.

Qualitative Results. Figure 5a shows images $y_{fake}$ which are generated by pix2pix with real banding patterns as input. The quality of the synthetic chromosomes is similar to that of the real ones ($y_{real}$). In most cases, $\tilde{x}_{real}$ and $\tilde{x}_{fake}$ look alike; however, some bands are split in two or merged into a single one. This might also be a fault of the hand-crafted nature of the banding pattern extraction itself. Further comparing $\tilde{x}_{real}$ with $y_{fake}$ reveals that the synthetic chromosomes indeed display the desired input banding patterns. In the same manner, Figure 5b shows synthetic abnormal chromosomes generated with $\tilde{x}_{perlin}$ as input. Overall, the generation quality seems to be on a similar level compared to real bands. In some cases, $\tilde{x}_{perlin}$ and $\tilde{x}_{fake}$ are very similar and differ only slightly in the size of the bands (see Figure 5b, top left). In other cases, some of the bands are correctly positioned and have the correct size, but other bands are missing or added. This can be seen in Figure 5b, bottom left row, where $\tilde{x}_{fake}$ resembles a mix between $\tilde{x}_{perlin}$ and $\tilde{x}_{real}$, which suggests that the generator partially overfits on chromosomes from the training set. We further analyze how the generator behaves on banded segmentation masks with the same shape mask but varying banding patterns (two rows on the right of Figure 5b). Even though the same shape is used, the bands are clearly distinct from each other and mostly exhibit the input banding patterns. Overall, the results suggest that this approach can be utilized to generate abnormal chromosomes with more fine-grained control over the exhibited banding patterns. However, the input banding pattern is not always completely imposed.

Figure caption: The generator outputs $y_{fake}$; the banded segmentation mask $\tilde{x}_{fake}$ of the output visualizes how well the banding patterns are imposed by the generator. In addition, $y_{real}$ and $\tilde{x}_{real}$ are given for comparison's sake. For naming conventions see Figure 1.

In this work, we propose a method of conditionally generating chromosome images based on banding patterns. Our method can be applied to chromosome data sets without annotations, as banded segmentation masks are created in an automated fashion. We validate on a test set of 6432 single chromosome images while using real as well as abnormal patterns. Synthetic chromosomes are of high visual quality when conditioning on real as well as fake banding patterns, up to a promising extent. However, banding patterns are not always strictly imposed, which could be addressed through domain-specific adaptation of the learning objective in future work. We believe that this approach can be exploited for data augmentation of healthy and unhealthy chromosomes, displaying deviations from the standardized ideograms such as the recombination of sections or missing bands, improving on existing image-based methods in the field of cytogenetics.

[1] Automatic karyotype analysis
[2] Characterization of Giemsa dark- and light-band DNA
[3] An international system for human cytogenetic nomenclature
[4] Chromosome aberrations: past, present and future. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis
[5] The principles of clinical cytogenetics
[6] CS-GANomaly: A supervised anomaly detection approach with ancillary classifier GANs for chromosome images
[7] A New Multiple-Distribution GAN Model to Solve Complexity in End-to-End Chromosome Karyotyping
[8] Acute myeloid leukemia with myelodysplasia related changes. Atlas of Genetics and Cytogenetics in Oncology and Haematology
[9] Machine learning applications in the diagnosis of leukemia: Current trends and future directions
[10] GAN-based novel approach for data augmentation with improved disease classification
[11] Automatic data augmentation for 3D medical image segmentation
[12] How to fool radiologists with generative adversarial networks? A visual Turing test for lung cancer diagnosis
[13] GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification
[14] A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images
[15] End-to-end chromosome Karyotyping with data augmentation using GAN
[16] Learning more with less: conditional PGGAN-based data augmentation for brain metastases detection using highly-rough annotation on MR images
[17] Progressive growing of GANs for improved quality, stability, and variation
[18] Image-to-image translation with conditional adversarial networks
[19] Unpaired image-to-image translation using cycle-consistent adversarial networks
[20] Generation of structural MR images from amyloid PET: application to MR-less quantification
[21] The domain shift problem of medical image segmentation and vendor-adaptation by Unet-GAN
[22] Application of statistical and syntactical methods of analysis and classification to chromosome data
[23] A rule-based computer scheme for centromere identification and polarity assignment of metaphase chromosomes
[24] U-net: Convolutional networks for biomedical image segmentation
[25] Application of the fractal Perlin noise algorithm for the generation of simulated breast tissue
[26] Automatic segmentation of metaphase cells based on global context and variant analysis
[27] MedGAN: Medical image translation using GANs
[28] Generative adversarial network in medical imaging: A review
[29] Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation
[30] Cytogenomic characterization of Colletotrichum kahawae, the causal agent of coffee berry disease, reveals diversity in minichromosome profiles and genome size expansion
[31] The OpenCV Library. Dr. Dobb's Journal of Software Tools
[32] Scikit-learn: Machine learning in Python
[33] Realistic adversarial data augmentation for MR image segmentation

We thank András Kozma and Robert Caldwell for their feedback.