key: cord-0234436-mh9c395y
authors: Roychowdhury, Sohini
title: QU-net++: Image Quality Detection Framework for Segmentation of Medical 3D Image Stacks
date: 2021-10-27
journal: nan
DOI: nan
sha: 91792697cc84600e257f4de90fc63ac0a983046f
doc_id: 234436
cord_uid: mh9c395y

Automated segmentation of pathological regions of interest aids medical image diagnostics and follow-up care. However, accurate pathological segmentations require high-quality annotated data that can be both cost- and time-intensive to generate. In this work, we propose an automated two-step method that detects a minimal image subset required to train segmentation models by evaluating the quality of medical images from 3D image stacks using a U-net++ model. The images that reveal a lack of training quality can then be annotated and used to fully train a U-net-based segmentation model. The proposed QU-net++ model detects this lack of training quality based on the disagreement between the segmentations produced by its final two output layers. The proposed model isolates around 10% of the slices per 3D image stack and scales across imaging modalities to segment cysts in OCT images and ground glass opacity (GGO) in lung CT images with Dice scores in the range 0.56-0.72. Thus, the proposed method can be applied for cost-effective multi-modal pathology segmentation tasks.

Machine learning (ML) solutions for medical 3D image stacks rely on well-annotated, high-quality images to classify or detect regions of interest (ROIs) corresponding to pathology [1]. The recent trend of re-using previously trained and deployed models and fine-tuning them for specific use-cases, also known as transfer learning, has significantly reduced the number of annotated image samples required to optimally train an ML model. However, there continues to be a need to identify a minimal training subset of medical images for model fine-tuning purposes, since the process of annotating medical images is both cost- and time-intensive. In this work, we present a novel two-stage system that identifies a minimal subset of images from medical 3D image stacks useful for training semantic segmentation models. Most existing works [1], [2] rely on manual selection, random sampling, or previously trained ML models to identify batches of image data required to train a segmentation model. In contrast, we present a novel framework, shown in Fig. 1, that scales across medical imaging modalities to identify a small training data subset. First, we identify an initial subset of the medical images/slices to be annotated based on the quality of the images and the pixel variance captured within the annotated regions. Next, this image subset is used to train a 4-level multi-node U-Net++ model [3], such that the resized outputs from each level are analyzed; if high variance is detected between the outcomes of the final two layers, the input image is considered to have new, unlearned qualities/characteristics. Such an input image is then appended to the training subset of images that need further annotation to fully train a segmentation model. Thus, at the end of the proposed two steps, a minimal training subset of images is identified to fully train a U-net++ model for semantic segmentation of all remaining images. This paper makes two key contributions. First, we introduce a novel two-step image quality analysis method that starts from an initial set of 5-10 images to train/fine-tune a U-net++ model with deep supervision.
The newly trained U-net++ model is then used to generate test segmentation masks for all remaining images from the 3D image stack. The resized test segmentation masks from the last two levels of the model are then used to detect images that reveal a lack of training quality for the segmentation model. This process isolates 8-15% of the overall images from the 3D image stacks and achieves state-of-the-art pathology segmentation performances on the test stacks. Second, we demonstrate the scope of transfer learning of the proposed U-Net++ based image quality detection model (QU-Net++) across medical image modalities and observe that kernel weights transfer from computed tomography (CT) to optical coherence tomography (OCT) image stacks, thereby reducing the overall number of training samples needed for fine-tuning. Descriptions of the Lung CT and OCT 3D image stacks and the proposed QU-net++ methods are presented below.

The first 3D image stack under analysis consists of Lung CT images for COVID-19 segmentation of ground glass opacity (GGO), taken from the Kaggle dataset [4]. In this dataset, 100 individual images annotated for GGO with available lung masks form the Lung-CT-med subset, and 829 images from a 3D volumetric scan are available as the Lung-med-rad subset. Each image/slice is [512x512] pixels and is resized to [256x256] for the U-net++ model. The second 3D image stack consists of OCT images from the OPTIMA cyst segmentation challenge (OCSC) dataset as described in [2]. We use 3 stacks of images per vendor type, along with annotations from observer G_1, from the Spectralis, Nidek, Topcon and Cirrus vendors. This results in 647 OCT slices from 3D image stacks that are cropped to the intra-retinal regions and resized to [256x256] for the U-net++. Samples of the datasets used here are shown in Fig. 2. It is noteworthy that the grayscale OCT slices need to be cropped to include the intra-retinal layers as shown in [2]. Also, we observe that the CT slices may include metadata text written on them. Since the CT 3D image stacks include masks that isolate the lung regions where the GGO regions exist, we use the masked-lung CT images as inputs for segmentation, as shown in Fig. 2.

As a first step for minimal training subset identification, we begin with an unsupervised process that detects an initial subset of good-quality medical images with significant pixel variation within the annotated regions. Here, the raw images are analyzed for blurriness using the variance-of-Laplacian method [5]. Blurriness is defined as the inverse of the pixel variance obtained by applying the Laplacian operator to a grayscale image, such that a higher blurriness score indicates lower image focus and quality. Next, we evaluate the contrast in raw images using the inverse PSNR metric (PSNR-inv), defined as the ratio between the variance of pixels in a difference image, produced by subtracting a median-filtered version of the image from the image itself, and the maximum pixel strength in the image. A high PSNR-inv value indicates high variance in pixel regions and low maximum foreground pixel strength, which together indicate low contrast of the foreground regions. Thus, we isolate raw images that have below-average blurriness and PSNR-inv values as an initial training image set S_0 used to train/fine-tune a U-net++ model, as shown in Fig. 3. By varying the thresholds for blurriness and PSNR-inv, we can isolate about 10-20% of the original samples for initial training.
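Both raw-image quality measures can be computed with standard OpenCV/NumPy operations. The sketch below is a minimal illustration only, not the paper's implementation: the 5x5 median-filter kernel, the 8-bit grayscale inputs, and the use of the stack-average values as the selection thresholds are assumptions.

```python
import cv2
import numpy as np

def blurriness(gray):
    """Inverse of the variance of the Laplacian response: higher = blurrier."""
    lap_var = cv2.Laplacian(gray, cv2.CV_64F).var()
    return 1.0 / (lap_var + 1e-8)

def psnr_inv(gray, ksize=5):
    """Variance of (image - median-filtered image) over the maximum pixel strength."""
    diff = gray.astype(np.float64) - cv2.medianBlur(gray, ksize).astype(np.float64)
    return diff.var() / (float(gray.max()) + 1e-8)

def initial_subset(slices):
    """Return indices of slices whose blurriness and PSNR-inv are both below the stack averages."""
    b = np.array([blurriness(s) for s in slices])
    p = np.array([psnr_inv(s) for s in slices])
    keep = (b < b.mean()) & (p < p.mean())
    return np.flatnonzero(keep).tolist()
```

Tightening or relaxing the two thresholds changes the size of the initial set S_0, consistent with the 10-20% range reported above.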
To further reduce the number of samples from the initial set that require annotation, we analyze the foreground region quality for each image in S_0. Here, we work with the masked images of the intra-retinal layers in OCT images and the masked lung regions in the CT images. We evaluate two metrics for the pixels within the masked regions, namely: the coefficient of variation (CoV), defined as the ratio between the variance and the maximum pixel strength for all pixels within the ROI, and the mean pixel value within the ROI. Image samples that lie within zero distance of other samples in this metric space are considered similar to those samples and are thus eliminated from the initial training set (S_0). The goal is to reduce the initial set to fewer than 10 images per set that need annotation.

Starting from the initial training image set, we train a U-net++ model [3] and identify further images that represent a different image quality than the images previously selected for model training. Here, we apply a 4-level U-net++ model [3] to evaluate the quality of the resized segmented masks, where the compositions of the encoder (convolution and pooling) layers, decoder (transposed strided convolution) layers and skip connections are shown in Fig. 4. For an optimal U-net++ model, we apply batch normalization to the encoder layers only and dropout at layers X^(4,1) and X^(5,1) only. The primary difference between a U-net model and a U-net++ model [3] is the use of nested up-sampling layers and additional skip connections. In a U-net++ model, the goal is to amplify signal strength at each transposed convolution layer (layers X^(4,2), X^(3,3), X^(2,4), X^(1,5)) by concatenating with intermediate layers, as shown in Fig. 4. This increases the number of trainable parameters from 7,767,457 in a U-net to 9,045,540 in the U-net++ model. For our application, we train the U-net++ model with the loss function shown in (1),

loss = -\frac{2\sum_{l_p} P_{l_p} Y_{l_p}}{\sum_{l_p} P_{l_p} + \sum_{l_p} Y_{l_p}},    (1)

where l_p counts through all pixels in the segmented image, P represents the predicted segmentation at level 4 (L^(4)) and Y represents the annotated pathological ground-truth. Next, we analyze the outputs at levels 1-4 (L^(1..4)) from the dense layer (X^(5,1)), using deep-supervision settings. The outputs at levels 1-3 are converted to the original image dimensions using the resizing (r) operation. As the transposed convolutions move further away from the dense feature layer, only higher-order abstraction features at a global level get added to the semantic segmentation output. Thus, for a well-trained U-net++ model, the initial transposed convolution layers closer to the X^(5,1) layer bring major value to the semantic segmentation task, while the farther layers (X^(1,2), X^(1,3), X^(1,4)) have a lesser impact on the outcome. For this reason, we evaluate the intersection-over-union or Jaccard score (J) between the resized level-3 and level-4 outcomes of the U-net++ model as representative of image content quality (Q_i). If Q_i for a particular test sample i lies below a threshold q_0, then the image is considered important for model fine-tuning and is added to the training set S. Examples of resized outputs from levels 1-4 for a Lung-CT-med image are shown in Fig. 5. Here, the Q_i score computed from the outputs of levels 3 and 4 is 0.96 (high), so this image is not used for further model fine-tuning. Algorithm 1 represents the steps for selecting the minimal training set of images S_m needed to fine-tune a U-net++ model well.
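Before detailing Algorithm 1, the per-image quality check it relies on can be sketched as follows. This sketch assumes a Keras-style U-net++ whose four deep-supervision outputs (L^(1..4)) are returned already resized to the input resolution, a (batch, height, width, channel) input layout, a 0.5 binarization cut-off, and an example threshold q0 = 0.8; none of these values are specified in the paper.

```python
import numpy as np

def jaccard(a, b, eps=1e-8):
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return (inter + eps) / (union + eps)

def select_for_finetuning(model, images, q0=0.8):
    """Return indices of images whose level-3 and level-4 outputs disagree (Q_i < q0)."""
    selected = []
    for i, img in enumerate(images):
        # Deep supervision: the model returns one segmentation per level, L^(1)..L^(4).
        l1, l2, l3, l4 = model.predict(img[None, ..., None], verbose=0)
        m3 = np.squeeze(l3) > 0.5   # resized level-3 output
        m4 = np.squeeze(l4) > 0.5   # level-4 output (final prediction)
        if jaccard(m3, m4) < q0:
            selected.append(i)
    return selected
```

The threshold q_0 trades annotation cost against segmentation quality; per the results reported here, this check ends up flagging roughly 8-15% of the slices per stack for annotation.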
The input to this algorithm is the initial training image set (S_0), selected through the raw-image and annotation qualities described in Section II-B, and an empty set for S_m. The U-net++ model is run with deep supervision to return the resized outputs at levels 1 through 4, and the quality index (Q) determines whether the image must be used for further fine-tuning of the U-net++ model. Once the minimal training set (S_m) is identified with m samples, such that m >= 0, the U-net++ model is further trained with these samples, and the L^(4) level output per test image is thereafter considered the final prediction. Here, I and Y represent the raw image and the annotated segmented mask, respectively, and n is the number of test images/slices.

This work aims to optimally train a U-net++ segmentation model with the minimal number of training samples from 3D image stacks, identified based on image quality. To analyze the performance of the QU-net++ framework in isolating a minimal training set, we perform two experiments. First, we baseline the U-net and U-net++ models on the OCT and Lung CT stacks separately, based on existing works in [6], [7], using randomly sampled training images. Second, we implement the proposed framework for minimal training set detection and analyze the segmentation performances of models trained on a fraction of the images per stack on the remaining test images. The results and explanations are as follows.

Based on existing works in [2], where 5-10 images per 3D image stack have been shown to train a U-net model for semantic segmentation, we randomly sample 25% of the total number of slices per 3D image stack and use those images to train the U-net and U-net++ models. This process is repeated for 20 runs and the averaged results are analyzed. The segmentation performances on the remaining test images are evaluated using the following metrics: precision (Pr), the fraction of correctly predicted regions over all predicted regions; recall (Re), the fraction of correctly predicted regions over all actual ground-truth regions; Jaccard score (J), the intersection-over-union between the predicted and actual regions; Dice score (D), the negative of the loss function defined in (1); and accuracy (Acc), the ratio of correctly classified foreground and background pixels over all pixels. The numbers of training and test images are shown in the first column of Table I, and the segmentation performances on the test images, in comparison with existing works, are also shown in Table I.

The proposed two-step framework is then used for minimal training set detection, and the resulting segmentation performances are shown in Table I. To fine-tune the U-net++ model, we apply data augmentations that include rotation, width shift, height shift and shear in the range 0.2, with vertical flipping disabled. The Adam optimizer with a learning rate of 10^-4 is then used to train the U-Net++ model for 60 epochs with batch sizes of 20 images. Loss curves for training the U-net++ model with deep supervision on the minimal training set for the OCT stacks are shown in Fig. 6. Here, we observe that the loss curve trends for L^(3,r) and L^(4) are very similar during training, further supporting our hypothesis that levels 3 and 4 of the U-net++ model should produce similar outcomes. Some example outputs of the finally trained model from the proposed framework are shown in Fig. 7. The first 2 columns represent GGO and cyst segmentations with low J scores (below 0.2), while the last 2 columns represent segmentations with high J scores (above 0.6).
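A sketch of this fine-tuning configuration in Keras/TensorFlow is given below. The Dice-based loss mirrors the negative Dice score in (1), with a smoothing constant added for numerical stability (an assumption); the augmentation ranges, optimizer, learning rate, epoch count and batch size follow the values quoted above, while the U-net++ builder itself is a hypothetical placeholder.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def dice_loss(y_true, y_pred, smooth=1.0):
    """Negative Dice score, matching the loss in (1), with an added smoothing term."""
    y_t = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_p = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_t * y_p)
    dice = (2.0 * intersection + smooth) / (tf.reduce_sum(y_t) + tf.reduce_sum(y_p) + smooth)
    return -dice

# Augmentation settings quoted above: rotation/width/height/shear range 0.2, no vertical flips.
augmenter = ImageDataGenerator(rotation_range=0.2, width_shift_range=0.2,
                               height_shift_range=0.2, shear_range=0.2,
                               vertical_flip=False)

# model = build_unet_pp()  # hypothetical 4-level U-net++ builder with deep supervision
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=dice_loss)
# model.fit(x_train, y_train, batch_size=20, epochs=60)
# In practice, paired image/mask generators with a shared random seed are needed so that the
# same augmentation is applied to both the slices and their annotation masks.
```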
In this work, we propose a novel image quality-based framework for isolating a minimal training set of images from 3D image stacks for semantic segmentation. We apply a trained U-net++ model with deep supervision and analyze the resized outputs from the final two levels to decide whether the image under consideration should be used to further train the U-net++ model. This method extracts 8-15% of all samples for training and results in overall pathology segmentation performances with Dice scores in the range 0.56-0.72. Future work will be directed towards extending this QU-net++ model to support multi-class segmentations.

References
[1] V-net: Fully convolutional neural networks for volumetric medical image segmentation.
[2] Few shot learning framework to reduce inter-observer variability in medical images.
[3] Unet++: Redesigning skip connections to exploit multiscale features in image segmentation.
[4] COVID-CT-dataset: a CT scan dataset about COVID-19.
[5] Blur image detection using Laplacian operator and OpenCV.
[6] Self-supervised deep learning model for COVID-19 lung CT image segmentation highlighting putative causal relationship among age, underlying disease and COVID-19.
[7] Inf-net: Automatic COVID-19 lung infection segmentation from CT images.