title: Light In The Black: An Evaluation of Data Augmentation Techniques for COVID-19 CT's Semantic Segmentation
authors: Krinski, Bruno A.; Ruiz, Daniel V.; Todt, Eduardo
date: 2022-05-19

With the COVID-19 global pandemic, computer-assisted diagnosis of medical images has gained much attention, and robust methods for the Semantic Segmentation of Computed Tomography (CT) scans have become highly desirable. Semantic Segmentation of CT is one of many research fields in the automatic detection of COVID-19 and has been widely explored since the COVID-19 outbreak. In this work, we propose an extensive analysis of how different data augmentation techniques improve the training of encoder-decoder neural networks on this problem. Twenty different data augmentation techniques were evaluated on five different datasets. Each dataset was validated through a five-fold cross-validation strategy, thus resulting in over 3,000 experiments. Our findings show that spatial-level transformations are the most promising for improving the learning of neural networks on this problem.

Since 2019, the world has struggled with the new coronavirus (COVID-19) pandemic, with millions of infections and deaths worldwide [Wang et al. 2020]. As of February 22nd, 2022, there have been a total of 427,169,421 global cases and 5,902,878 global deaths [of Medicine 2022]. Due to the virus's quick dissemination, early diagnosis is highly desirable for faster treatment and for tracking infected people [Chen et al. 2020a]. Automatic detection of COVID-19 infections in Computed Tomography (CT) scans has shown to be of great help for early diagnosis [Shi et al. 2021], with the Semantic Segmentation [Cao and Bao 2020] of CTs being widely explored since the COVID-19 outbreak [Shi et al. 2021].

Deep Learning techniques and Deep Neural Networks have achieved impressive results in the segmentation of COVID-19 CTs [Shi et al. 2021, Krinski et al. 2021]. However, they face two limiting factors. The first is that labeling images for Semantic Segmentation is a labor-intensive and time-consuming process, as each pixel of the image must receive the correct label; otherwise, the network could converge to wrong results [Shi et al. 2021, Cao and Bao 2020]. The second is that labeling CT segmentation datasets must be done by highly specialized doctors to properly label the lesion regions of the image [Shi et al. 2021].

With new approaches being proposed quickly, an urgency aggravated by the global pandemic, the need for a proper evaluation becomes apparent. A broad benchmark of architectures was presented by [Krinski et al. 2021], and one of their conclusions was that the models' generalization was impaired by the small number of samples in the field's datasets, which also suffer from class imbalance, introducing some bias into the models. Data augmentation can mitigate this issue; however, the influence of data augmentation during training was left out of that evaluation.

In this work, we propose an extensive analysis of how different data augmentation techniques improve the training of encoder-decoder neural networks on this problem. Twenty different data augmentation techniques were evaluated (see Section 4) in three distinct experiments using five CT datasets: MedSeg [MedSeg 2021], Zenodo [Jun et al. 2020], CC-CCII [Zhang et al. 2020], MosMed [Morozov et al. 2020], and Ricord1a [Tsai et al. 2020]. Each dataset was validated through a five-fold cross-validation strategy, thus resulting in over 3,000 experiments. The code for running these same experiments is publicly available¹.
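For scale, one plausible accounting of this figure, inferred from the experimental setup described later in the paper (twenty augmentations, two application probabilities, five datasets, three training configurations, five folds) rather than stated explicitly by the authors:

    % Assumed breakdown of the experiment count (an inference, not the authors' stated accounting)
    \underbrace{20 \cdot 2 \cdot 5 \cdot 3 \cdot 5}_{\text{augmented runs}}
      + \underbrace{5 \cdot 3 \cdot 5}_{\text{baseline runs}} = 3000 + 75 = 3075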
Data augmentation aims to generate a synthetic image by applying different operations to a preexisting labeled image [Ruiz et al. 2020b]. The most common operations are variations of affine transformations, such as flips, translations, rotations, and scaling.

Random Erasing [Zhong et al. 2020] is one example of a data augmentation that adds information to the original image. In this technique, a rectangle filled with random values is positioned in the image, and the rectangle's height, width, and center point are also chosen at random. This data augmentation makes the network's learning process more robust to object occlusions and reduces overfitting during training. CutMix [Yun et al. 2019] follows the same idea as Random Erasing; however, instead of using constant or even random values, the technique mixes two images by pasting a region of image A into image B, which reduces information loss and encourages generalization (a minimal sketch of this mixing operation is shown below). Following the same line as CutMix [Yun et al. 2019], the study presented by [Summers and Dinneen 2019] evaluated several different non-linear mixing algorithms and showed that non-linear mixing algorithms are as effective as linear mixing data augmentations. [Hendrycks et al. 2020] proposed a mixing data augmentation called AugMix, in which sequences of data augmentations are applied in parallel, generating a different image for each sequence; in the end, an elementwise convex combination is applied to mix all generated images.

In [Kisantal et al. 2019], the authors proposed a data augmentation technique for small objects to improve their detection and segmentation. The authors applied a "copy-and-paste" strategy to create several copies of the objects of interest and showed that this strategy increases the number of anchor boxes generated by Mask R-CNN [He et al. 2017], which helps the network learn to detect small objects. The ANDA [Ruiz et al. 2019] and IDA [Ruiz et al. 2020a] techniques follow the idea of introducing new objects. However, since they focus on the generic problem of Salient Object Detection (SOD), some additional operations are necessary, such as image inpainting to erase the original object, extra computation to choose which combination of background and object produces a significant salience, and the affine transformations to be applied to the new object that replaces the original one.

In GridMask [Chen et al. 2020b], a mask composed of uniformly distributed black squares is generated and applied on top of the image. This augmentation follows the same idea as Random Erasing, forcing the network to learn from occluded regions of the image and reducing overfitting. Its advantage over methods that remove regions at random is that random removal can erase relevant regions of the image. InstaBoost [Fang et al. 2019] uses an inpainting technique [Bertalmio et al. 2001] to remove the object of interest from the image and place it in another region of the image. An appearance consistency heatmap [Field et al. 1993] is used to estimate the new region of the image where the object will be placed.
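To make the mixing idea concrete, the sketch below shows a CutMix-style operation adapted to segmentation image/mask pairs (the original CutMix mixes classification labels in proportion to the patch area). The function name, the patch-size choices, and the assumption that both images have the same shape are illustrative and are not taken from any of the cited implementations.

    import numpy as np

    def cutmix_pair(image_a, mask_a, image_b, mask_b, rng=None):
        """Paste a random rectangular patch of image A (and of its label mask)
        into image B. Illustrative sketch of the CutMix idea for segmentation
        pairs; assumes both images share the same height and width."""
        rng = rng or np.random.default_rng()
        h, w = image_b.shape[:2]

        # Sample the patch size (here up to half of each dimension) and position.
        ph = int(rng.integers(1, h // 2 + 1))
        pw = int(rng.integers(1, w // 2 + 1))
        y = int(rng.integers(0, h - ph + 1))
        x = int(rng.integers(0, w - pw + 1))

        # Copy the patch from A into B, for both the image and its mask.
        mixed_image, mixed_mask = image_b.copy(), mask_b.copy()
        mixed_image[y:y + ph, x:x + pw] = image_a[y:y + ph, x:x + pw]
        mixed_mask[y:y + ph, x:x + pw] = mask_a[y:y + ph, x:x + pw]
        return mixed_image, mixed_mask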
The authors of [Liu et al. 2019] proposed a data augmentation based on image-to-image translation, divided into two steps: training and deployment. The proposed method uses several images of different classes in the training step, called source images, and learns to translate images between these classes. Then, in the deployment step, a small set of images from the target class is used, with the proposed method being able to translate from the source classes to the target class. In Super-Mix [Dabouei et al. 2020], a mixing binary mask with the salience information of each image to be mixed is generated. Then, a teacher model already trained on the problem is applied to mix the images, optimize the position of the salient region in the mixed image, and ensure that both salient regions are present in the final mixed image.

These Data Augmentations (DAs) are, in general, intended for generic segmentation problems, and there is no proper comparison of DA methods for the COVID-19 segmentation problem. In this work, we focus on performing an extensive comparison of generic DA methods on this problem.

The models were trained and evaluated across five different CT datasets: MedSeg [MedSeg 2021], Zenodo [Jun et al. 2020], CC-CCII [Zhang et al. 2020], MosMed [Morozov et al. 2020], and Ricord1a [Tsai et al. 2020]. The MedSeg dataset has 929 images and labels for four classes, with the following pixel proportions: Background (0.98563), Ground Glass Opacity (GGO) (0.01072), Consolidation (0.00351), and Pleural Effusion (0.0001). The Zenodo dataset has 3,520 images and labels for four classes, with the following pixel proportions: Background (0.89893), Left Lung (0.04331), Right Lung (0.04923), and Infections (0.00852). The MosMed dataset is composed of 2,049 images, with labels for two classes and the following pixel proportions: Background (0.99810) and GGO-Consolidation (0.00189). The Ricord dataset is divided into three sets: 1a, 1b, and 1c. Set 1a is the only one with segmentation masks and has 9,166 images with labels for two classes, with the following pixel proportions: Background (0.95295) and Infections (0.04704). We also used a subset of CC-CCII with segmentation masks, composed of 750 images with labels for four classes and the following pixel proportions: Background (0.87152), Lung Field (0.11691), GGO (0.00802), and Consolidation (0.00353).

One of the problems pointed out in [Krinski et al. 2021] was the class imbalance due to several images containing only the background class; in fact, recent work has shown that several problems suffer from class imbalance [Laroca et al. 2021, Laroca et al. 2022]. Therefore, to mitigate this problem, in the first step of this evaluation, we removed from the datasets the images with no lesion in the ground-truth mask. In the CC-CCII dataset, 4 images were removed; in the MedSeg dataset, 457 images were removed; in MosMed, 1,264 images were removed; in the Zenodo dataset, 547 images were removed; and in Ricord1a, no image was removed. The datasets were split into 80% for training and 20% for testing. Then, a five-fold cross-validation strategy further divided the training set into training and validation sets. The metrics used for evaluation were the F-score, described by Equation 1, and the Intersection over Union (IoU), described by Equation 2.
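A standard pixel-wise formulation of these two metrics, in terms of true-positive (TP), false-positive (FP), and false-negative (FN) pixel counts, and assumed here to match Equations 1 and 2, is:

    % Assumed standard forms of Equations 1 and 2
    \text{F-score} = \frac{2\,TP}{2\,TP + FP + FN}
    \qquad
    \text{IoU} = \frac{TP}{TP + FP + FN}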
In general, when dealing with the COVID-19 CT segmentation problem, it is usual to either completely neglect a dedicated evaluation of the impact of DA techniques or to merely report the use of a limited set of generic DAs, not optimized or specially designed for medical images, such as Flip and Rotation operations, as in [Zhao et al. 2021, Qiblawey et al. 2021, Raj et al. 2021, Müller et al. 2020, Chen et al. 2020c, Xu et al. 2020].

To properly measure the impact of data augmentation on the COVID-19 CT segmentation problem, we evaluated the following twenty data augmentation techniques: CLAHE, Coarse Dropout, Elastic Transform, Emboss, Flip, Gaussian Blur, Grid Distortion, Grid Dropout, Image Compression, Median Blur, Optical Distortion, Piecewise Affine Transformation, Posterize, RBC, Random Crop, Random Gamma, Random Snow, Rotate, Sharpen, and Shift Scale Rotate. Figure 1 illustrates the twenty data augmentation techniques applied to the CT image in Figure 1(a). All data augmentation techniques evaluated here are available in the Albumentations library [Buslaev et al. 2020].

The encoder-decoder network chosen to evaluate the data augmentations combines the RegNetx-002 [Xu et al. 2021] encoder with the U-net++ [Zhou et al. 2018] decoder. Since the encoders achieved close results in the comparison performed in [Krinski et al. 2021], the RegNetx-002 was chosen for having the smallest number of parameters, making it faster to train. The U-net++ was chosen because it achieved the highest F-score compared with the other decoders [Krinski et al. 2021]. The evaluation of how data augmentation affects the results of different encoders and decoders was left for future work. Also, all experiments were evaluated through a five-fold cross-validation strategy.
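As an illustration of how this setup could be assembled, the sketch below composes a few of the evaluated transforms with the Albumentations library and instantiates a U-net++ with a RegNetx-002 encoder through the segmentation_models_pytorch library. The model library, the encoder identifier, the channel and class counts, and the transform parameters are assumptions made for this sketch, not the authors' exact configuration; note also that in the paper each augmentation is evaluated individually, whereas here several are composed only to keep the example short.

    import albumentations as A
    import segmentation_models_pytorch as smp

    # A few of the evaluated spatial-level transforms; each is applied
    # independently with the same probability used in the experiments.
    p = 0.1  # or 0.2, the two probabilities evaluated in the paper
    train_augmentation = A.Compose([
        A.Flip(p=p),
        A.Rotate(limit=30, p=p),
        A.GridDistortion(p=p),
        A.ElasticTransform(p=p),
        A.ShiftScaleRotate(p=p),
    ])

    # Albumentations applies the same spatial transform to image and mask:
    # augmented = train_augmentation(image=ct_slice, mask=lesion_mask)
    # image, mask = augmented["image"], augmented["mask"]

    # Encoder-decoder pair used in the paper (RegNetx-002 + U-net++); the
    # encoder name and input/output sizes below are assumptions for this sketch.
    model = smp.UnetPlusPlus(
        encoder_name="timm-regnetx_002",
        encoder_weights="imagenet",
        in_channels=1,   # single-channel CT slices (assumed)
        classes=2,       # e.g., background vs. lesion (varies per dataset)
    )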
In total, we performed three sets of data augmentation evaluations, with each set varying the network training parameters: the number of training epochs, the learning rate, and the learning rate decay.

In the first evaluation, the architecture was trained with a learning rate of 0.001, divided by 10 every 10 epochs. Two probabilities of applying the data augmentation were evaluated: 0.1 and 0.2. As presented in Table 1, most of the data augmentations did not improve the F-score or the IoU. The MosMed dataset was the only one in which applying data augmentations improved the results, with F-score improvements of 1% or 2% for most of the augmentations applied. In the Zenodo dataset, most of the augmentations achieved results similar to the baseline; however, Grid Distortion with probability 0.1 and Optical Distortion with probability 0.2 improved the F-score by 1%. In MedSeg, only the Shift Scale Rotate augmentation with probability 0.2 achieved better results, with an F-score 1% higher than the baseline. The CC-CCII and Ricord1a datasets did not achieve improvements with data augmentation. To perform a statistical analysis of the data augmentation evaluation, the one-sided Wilcoxon signed-rank test was applied, with the null hypothesis that the F-scores of the distribution without data augmentation are greater than those of the distributions with data augmentation (a minimal SciPy sketch of this test is given after the experimental results). The CC-CCII and MosMed datasets were the most sensitive to data augmentation and achieved better F-scores in seven data augmentations when the data augmentation was applied with probability 0.1 and in eight data augmentations when applied with probability 0.2. In the MosMed dataset, the null hypothesis was rejected for most of the data augmentations at both probabilities. In MedSeg, the null hypotheses were rejected for only one data augmentation when applied with probability 0.1 and two data augmentations when applied with probability 0.2. The Ricord1a and Zenodo datasets did not achieve statistical significance to reject the null hypotheses for any data augmentation.

In the second evaluation of data augmentations, the architecture was trained for 100 epochs with a learning rate of 0.001, divided by 10 every 20 epochs. Two probabilities of applying the data augmentation were evaluated: 0.1 and 0.2. As presented in Table 2, MosMed achieved the most significant improvements, with Grid Distortion with probability 0.1 and Elastic Transform with probability 0.2 increasing the F-score by 2% compared with the baseline. However, unlike in the first evaluation, MosMed achieved better F-scores in only seven augmentations instead of fourteen. CC-CCII achieved better F-scores with Grid Distortion, Rotate, and Shift Scale Rotate. Also, MedSeg had better F-scores in nine augmentations with probability 0.1 and seven augmentations with probability 0.2. Data augmentation did not improve the F-score in the Zenodo and Ricord1a datasets. The statistical analysis pointed out that CC-CCII rejected the null hypotheses for seven data augmentations at both probabilities. Training for 100 epochs made MedSeg achieve better F-scores in eleven data augmentations with probability 0.1 and seven data augmentations with probability 0.2. The results achieved in the MosMed dataset got worse when compared with the results presented in Table 1, with only seven data augmentations rejecting the null hypotheses. In the Ricord1a and Zenodo datasets, although the average F-scores of the data augmentations were very close to the average F-score without data augmentation, the null hypothesis was rejected for two data augmentations with probability 0.2 in Ricord1a, and for three data augmentations with probability 0.1 and two with probability 0.2 in Zenodo.

In the third evaluation, the architecture was trained for 100 epochs with a learning rate of 0.0001, divided by 10 every 25 epochs. Two probabilities of applying the data augmentation were evaluated: 0.1 and 0.2. As presented in Table 3, in MosMed, Shift Scale Rotate with probability 0.1 and Rotate with probability 0.2 increased the F-score by 2% compared with the baseline. Also, MosMed achieved the best F-scores in nine augmentations with probability 0.1 and eight augmentations with probability 0.2, indicating that this training configuration was the most effective for this dataset. This training configuration also showed significant effectiveness in the Zenodo dataset, which achieved the best F-scores in six augmentations with probability 0.1 and seven augmentations with probability 0.2. However, it did not significantly affect the other datasets. MedSeg achieved higher F-scores in only three augmentations with probability 0.1 and five augmentations with probability 0.2. Also, the CC-CCII and Ricord1a datasets did not achieve improvements with any data augmentation. Statistical analysis was also performed and, although the average F-scores of the data augmentations were very close to the baseline without data augmentation, the Zenodo and Ricord1a datasets achieved the highest number of data augmentations with the null hypotheses rejected. MosMed was the dataset with the most promising results, with the null hypotheses rejected for twelve data augmentations. The CC-CCII and MedSeg datasets had fewer data augmentations with the null hypotheses rejected compared with the previous experiments, showing that this training configuration is not well suited to data augmentation on these datasets.
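The one-sided comparisons reported above could be run along the following lines with SciPy; the paired values, variable names, and significance level are illustrative assumptions, not the authors' data or exact procedure.

    from scipy.stats import wilcoxon

    # Paired F-scores on the same folds, without and with a given augmentation
    # (placeholder values for illustration only, not results from the paper).
    fscores_baseline = [0.610, 0.630, 0.600, 0.620, 0.640]
    fscores_augmented = [0.625, 0.641, 0.608, 0.634, 0.662]

    # Null hypothesis (as in the paper): F-scores without augmentation are
    # greater than with augmentation. alternative="less" tests baseline < augmented.
    stat, p_value = wilcoxon(fscores_baseline, fscores_augmented, alternative="less")

    alpha = 0.05  # assumed significance level
    if p_value < alpha:
        print(f"Null hypothesis rejected (p = {p_value:.4f}): augmentation helped.")
    else:
        print(f"Failed to reject the null hypothesis (p = {p_value:.4f}).")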
These three experiments demonstrated that, although necessary, the generic data augmentation techniques evaluated did not substantially improve the results on the COVID-19 segmentation problem. MosMed achieved the most significant improvements with data augmentation, with the F-score being up to 2% higher in comparison with the baseline; this dataset was the most sensitive to data augmentation techniques due to its imbalance problem. Data augmentation also improved the CC-CCII and MedSeg results, but it was necessary to train the network for more epochs, and the augmentations achieved only 1% of improvement in the F-score. The Ricord1a and Zenodo datasets were the most challenging and did not show improvements with data augmentation: although they achieved statistical significance to reject the null hypotheses for many data augmentations, the average F-score improved only slightly.

Another result of these data augmentation experiments is that spatial-level transformations such as Elastic Transform, Flip, Grid Distortion, Piecewise Affine, Rotate, and Shift Scale Rotate proved to be the most favorable data augmentations for this problem. These data augmentations improved the results in most of the experiments performed and are thus the most promising techniques for future experiments with data augmentation. The evaluation of more domain-specific data augmentation techniques was left for future work.

[Bertalmio et al. 2001] Navier-Stokes, fluid dynamics, and image and video inpainting.
[Buslaev et al. 2020] Albumentations: Fast and flexible image augmentations.
[Cao and Bao 2020] A survey on image semantic segmentation methods with convolutional neural network.
[Chen et al. 2020a] Key to successful treatment of COVID-19: accurate identification of severe risks and early intervention of disease progression. medRxiv.
[Chen et al. 2020b] GridMask data augmentation.
[Chen et al. 2020c] Residual attention U-Net for automated multiclass segmentation of COVID-19 chest CT images.
[Dabouei et al. 2020] SuperMix: Supervising the mixing data augmentation.
[Fang et al. 2019] InstaBoost: Boosting instance segmentation via probability map guided copy-pasting.
[Field et al. 1993] Contour integration by the human visual system: Evidence for a local "association field".
[He et al. 2017] Mask R-CNN.
[Hendrycks et al. 2020] AugMix: A simple data processing method to improve robustness and uncertainty.
[Jun et al. 2020] COVID-19 CT lung and infection segmentation dataset.
[Kisantal et al. 2019] Augmentation for small object detection.
[Krinski et al. 2021] Spark in the dark: Evaluating encoder-decoder pairs for COVID-19 CT's semantic segmentation.
[Laroca et al. 2021] Towards image-based automatic meter reading in unconstrained scenarios: A robust and efficient approach.
[Laroca et al. 2022] On the cross-dataset generalization in license plate recognition.
[Liu et al. 2019] Few-shot unsupervised image-to-image translation.
[MedSeg 2021] COVID-19 CT segmentation dataset.
[Morozov et al. 2020] MosMedData: Chest CT scans with COVID-19 related findings dataset. medRxiv.
[Müller et al. 2020] Automated chest CT image segmentation of COVID-19 lung infection based on 3D U-Net.
[Qiblawey et al. 2021] Detection and severity classification of COVID-19 in CT images using deep learning.
[Raj et al. 2021] ADID-UNET-a segmentation model for COVID-19 infection from lung CT scans.
[Ruiz et al. 2019] ANDA: A novel data augmentation technique applied to salient object detection.
[Ruiz et al. 2020a] IDA: Improved data augmentation applied to salient object detection.
[Ruiz et al. 2020b] Can giraffes become birds? An evaluation of image-to-image translation for data generation.
[Shi et al. 2021] Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19.
[Summers and Dinneen 2019] Improved mixed-example data augmentation.
[Tsai et al. 2020] Medical Imaging Data Resource Center - RSNA International COVID Radiology Database Release 1a - Chest CT COVID+ (MIDRC-RICORD-1a).
[Wang et al. 2020] A novel coronavirus outbreak of global health concern.
[Xu et al. 2021] RegNet: Self-regulated network for image classification.
[Xu et al. 2020] GASNet: Weakly-supervised framework for COVID-19 lesion segmentation.
[Yun et al. 2019] CutMix: Regularization strategy to train strong classifiers with localizable features.
[Zhang et al. 2020] Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography.
[Zhao et al. 2021] D2A U-Net: Automatic segmentation of COVID-19 lesions from CT slices with dilated convolution and dual attention mechanism.
[Zhong et al. 2020] Random Erasing data augmentation.
[Zhou et al. 2018] UNet++: A nested U-Net architecture for medical image segmentation.

The authors would like to thank the Coordination for the Improvement of Higher Education Personnel (CAPES) for the PhD scholarship. We gratefully acknowledge the founders of the publicly available datasets, the support of NVIDIA Corporation with the donation of the GPUs used for this research, and the C3SL-UFPR group for the computational cluster infrastructure.