ROAM: Random Layer Mixup for Semi-Supervised Learning in Medical Imaging
Tariq Bdair, Nassir Navab, Shadi Albarqouni
2020-03-20

Abstract. Medical image segmentation is one of the major challenges addressed by machine learning methods. Yet, deep learning methods profoundly depend on a huge amount of annotated data, which is time-consuming and costly to obtain. Semi-supervised learning methods approach this problem by leveraging an abundant amount of unlabeled data along with a small amount of labeled data in the training process. Recently, the MixUp regularizer [32] has been successfully introduced to semi-supervised learning methods, showing superior performance [3]. MixUp augments the model with new data points through linear interpolation of the data at the input space. In this paper, we argue that this option is limited. Instead, we propose ROAM, a random layer mixup, which encourages the network to be less confident for interpolated data points at a randomly selected space, hence avoiding over-fitting and enhancing the generalization ability. We validate our method on the publicly available whole-brain image segmentation dataset (MALC), achieving state-of-the-art results in fully supervised (89.8%) and semi-supervised (87.2%) settings, with relative improvements of up to 2.75% and 16.73%, respectively.

Image segmentation plays an essential role in the medical field since it provides the means to analyze and quantify human organs [29, 25]. Still, manual segmentation is a tedious task that requires highly skilled physicians and is subject to intra-/inter-observer variability [5, 16]. Therefore, a fully automatic segmentation framework is of high importance to tackle such challenges. Deep learning-based methods have achieved state-of-the-art performance in medical image segmentation [6, 13, 28]. Yet, this success is matched by the need for a large amount of annotated data, which is often not available in medical imaging. Fortunately, the semi-supervised learning (SSL) paradigm allows us to mitigate this problem by leveraging a huge amount of unlabeled data along with a few annotated samples. SSL methods can be categorized into entropy minimization [1, 12], generative models [33], consistency regularization [30, 7], and graph-based methods [2, 9]. Next, we briefly review SSL works in medical imaging.

Baur et al. [2] introduced a regularization term, based on the graph Laplacian, into the cost function for MS lesion segmentation. The term minimizes the distance, in the feature space, between similar unlabeled and labeled data points. Entropy minimization enforces the decision boundary to lie in low-density regions. One way to achieve this in SSL is to generate pseudo labels for the unlabeled data using a model trained on the labeled data, then repeat the training process using both labeled and pseudo-labeled data. This idea was utilized by Bai et al. [1] for cardiac image segmentation, where the pseudo labels were further refined using a CRF [21]. Zhang et al. [33] utilized adversarial learning [11] for gland image segmentation by encouraging the discriminator to distinguish between the segmentation results of unlabeled and labeled images while encouraging the segmenter to produce results that fool the discriminator.
Following the success of consistency regularization methods such as Mean-Teacher [30], Cui et al. [7] introduced a segmentation consistency loss to minimize the discrepancy between the output segmentations of unlabeled data under different perturbations.

Modern regularization methods have recently been introduced to avoid over-fitting by encouraging the model to be less confident for interpolated data points at the input space, i.e. MixUp [32], or at the latent space, i.e. Manifold Mixup [31]. Both have been successfully employed in a fully supervised fashion for cardiac image segmentation [4], brain tumor segmentation [8], knee segmentation [24], and prostate cancer segmentation [18]. Eaton-Rosen et al. [8] and Panfilov et al. [24] have shown the effectiveness of MixUp over standard data augmentation methods in medical imaging. Recently, MixMatch [3], which is closely related to our work, introduced MixUp to the SSL paradigm, achieving state-of-the-art (SOTA) results in image classification. In contrast to our work, MixMatch augments the model with interpolated data at the input space only. We hypothesize that performing the mixup operation at randomly selected hidden representations provides the network with novel representations and an additional training signal that suits the complexity of medical image segmentation. Thus, our contributions are threefold: First, we introduce the RandOm lAyer Mixup (ROAM) to overcome the limitation of MixMatch by encouraging the network to be less confident for interpolated data points at a randomly selected layer, hence reducing over-fitting and generalizing well to unseen data. Second, we demonstrate that such a method is effective on whole-brain image segmentation, achieving state-of-the-art results in both supervised and semi-supervised settings. Third, we perform extensive experiments and an ablation study showing the importance of our design choices.

Fig. 1: Illustration of our proposed method. (a) First, initial labels for the unlabeled batch are produced from a pre-trained model; then, a sharpening step is applied to fine-tune the labels. (b) Second, the labeled and unlabeled batches are concatenated before being fed to the network, and mixed at a random layer, e.g., the input layer in this figure.

Given a set of labeled data S_L = {X_L, Y_L} and unlabeled data S_U = {X_U}, where X = {x_1, . . . , x_L, . . . , x_{L+U}} are input images, x ∈ R^{H×W}, and Y_L = {y_1, . . . , y_L} are the segmentation maps for C organs, y ∈ R^{H×W×C}, our goal is to build a model F(x; Θ) that takes an input image x and outputs its segmentation map ŷ. The model is initially trained on the labeled data, for several epochs, to minimize the cross-entropy loss L_CE(Y_L, F(X_L; Θ)) (Eq. 1). Next, we leverage the unlabeled data along with the labeled data in two steps: i) calibrate the initial guess for the unlabeled data, and ii) mix up the labeled and unlabeled data at random layers, as described below and summarized in Alg. 1. First, the unlabeled data are fed to the model to produce the initial guess ŷ_i = F(x_i; Θ) (Eq. 2), which is then post-processed by a sharpening operation, parameterized with a temperature T, highly inspired by the entropy minimization and calibration literature [14].
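The exact form of the sharpening operation is only referenced above; as a minimal sketch, assuming the temperature-sharpening form popularized by MixMatch [3] (the tensor shapes and the default temperature are illustrative assumptions, not the authors' implementation):

```python
import torch

def sharpen(y_hat: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    """Temperature sharpening of soft predictions (illustrative sketch).

    y_hat: per-pixel class probabilities of shape (B, C, H, W),
           e.g. softmax(F(x; Theta)) for an unlabeled batch.
    T:     temperature in (0, 1]; smaller T pushes each pixel's
           distribution towards one-hot, i.e. lower-entropy pseudo labels.
    """
    y_sharp = y_hat.pow(1.0 / T)
    # Re-normalize over the class dimension so probabilities sum to 1 per pixel.
    return y_sharp / y_sharp.sum(dim=1, keepdim=True)

# Example usage (hypothetical model and unlabeled batch x_u):
# y_tilde = sharpen(torch.softmax(model(x_u), dim=1), T=0.5)
```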
The pseudo label set is then defined as Ỹ_U = {ỹ_1, . . . , ỹ_U}, with ỹ_i = Sharpening(ŷ_i, T). Given the unlabeled data X_U and its pseudo labels Ỹ_U, along with the labeled data X_L and its one-hot encoded labels Y_L, we concatenate the two sets as X = [X_L; X_U] and Y = [Y_L; Ỹ_U]. To enable running the mixup operation at a randomly selected latent space, we define H as H = F_κ(X), where F_κ(·) is the hidden representation of the input data at layer κ. Note that when κ = 0, we select the input data itself.

Alg. 1 (ROAM training, excerpt): Require a pre-trained model F(·; Θ^(0)), a labeled dataset S_L, an unlabeled dataset S_U, a batch size B, a number of iterations K, and the hyper-parameters {T, α, β}. In each iteration, sample labeled and unlabeled batches B_L ∼ (X_L, Y_L) and B_U ∼ X_U, compute initial labels ŷ_i = F(x_i; Θ) for x_i ∈ B_U (Eq. 2), and compute pseudo labels ỹ_i = Sharpening(ŷ_i, T); the remaining steps follow the mixing and optimization procedure described below.

To introduce noisy interpolated data, a permuted version of the original data is created, (H̃, Ỹ) = Permute(H, Y), and fed to the MixUp operation as

H̄ = λ H + (1 − λ) H̃,   Ȳ = λ Y + (1 − λ) Ỹ,

where H̄ and Ȳ are the interpolated mixed-up data. To favour the original data over the permuted one, we set λ = max(λ, 1 − λ), where λ ∈ [0, 1] is sampled from a Beta(α, α) distribution with α as a hyper-parameter. To this end, the mixed-up data H̄ are fed to the model from layer κ to the output layer, at which the segmentation maps P are predicted. Eventually, P is split back into labeled and unlabeled predictions, P = {P_L, P_U}, and similarly Ȳ into Ȳ_L and Ȳ_U. Our overall objective function is the sum of the cross-entropy loss L_CE on the mixed-up labeled data and a β-weighted mean squared consistency loss L_MSE on the mixed-up unlabeled data:

arg min_Θ  L_CE(Ȳ_L, P_L) + β L_MSE(Ȳ_U, P_U).
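To make the training step above concrete, the following is a minimal PyTorch-style sketch of one ROAM iteration. It is an illustration under stated assumptions, not the authors' implementation: the forward_to/forward_from helpers (running the backbone up to and from layer κ), the candidate layer set, and the default values of α and β are hypothetical.

```python
import random
import torch
import torch.nn.functional as F

def roam_step(model, x_l, y_l, x_u, y_u_pseudo,
              layers=(0, 1, 2), alpha=0.75, beta=1.0):
    """One ROAM training step (illustrative sketch).

    model       : backbone exposing hypothetical helpers forward_to(x, k) and
                  forward_from(h, k) that run the network up to / from layer k.
    x_l, y_l    : labeled images (B, 1, H, W) and one-hot labels (B, C, H, W).
    x_u         : unlabeled images (B, 1, H, W).
    y_u_pseudo  : sharpened pseudo labels for x_u, shape (B, C, H, W).
    layers      : candidate mixing layers; kappa = 0 means the input space.
    """
    n_l = x_l.size(0)

    # Concatenate labeled and unlabeled batches and their targets.
    x = torch.cat([x_l, x_u], dim=0)
    y = torch.cat([y_l, y_u_pseudo], dim=0)

    # Pick a random layer kappa and compute the representation H there.
    kappa = random.choice(layers)
    h = x if kappa == 0 else model.forward_to(x, kappa)

    # Sample the mixing coefficient and favour the original data over the permuted one.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)

    # Mix with a permuted version of the batch (same permutation for H and Y).
    perm = torch.randperm(x.size(0))
    h_mix = lam * h + (1.0 - lam) * h[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]

    # Run the remaining layers, then split predictions back into labeled/unlabeled parts.
    p = torch.softmax(model.forward_from(h_mix, kappa), dim=1)
    p_l, p_u = p[:n_l], p[n_l:]
    t_l, t_u = y_mix[:n_l], y_mix[n_l:]

    # Cross entropy on mixed labeled data + beta-weighted MSE on mixed unlabeled data.
    loss_ce = -(t_l * torch.log(p_l.clamp_min(1e-8))).sum(dim=1).mean()
    loss_mse = F.mse_loss(p_u, t_u)
    return loss_ce + beta * loss_mse
```

Setting layers=(0,) recovers an input-space-only mixup in the spirit of MixMatch, while sampling κ over several layers gives the random-layer behaviour studied in the ablation.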
First, we compare our method with other SSL methods for medical image segmentation (Sec. 3.1). Then, we compare our method with the state-of-the-art method for whole-brain segmentation in a fully supervised fashion (Sec. 3.1). Also, we conduct an ablation study to observe the effect of each component of our method on the performance (Sec. 3.2). Finally, we perform extensive experiments following the recommendations of [23] (Sec. 3.3).

Datasets: We opt for three publicly available datasets: (i) MALC [22], which consists of 30 T1 MRI volumes; 15 volumes are split into 3 labeled (∼500 slices), 9 unlabeled (∼1500 slices), and 3 validation (∼500 slices) volumes, and 15 volumes are kept for testing (∼2500 slices); (ii) IBSR [26], which consists of 18 T1 MRI volumes (∼2000 slices); and (iii) CANDI [19], which consists of 13 T1 MRI volumes (∼1500 slices).

Implementation details: We employ U-Net [27] as the backbone architecture. The weights are initialized with Xavier initialization [10] and trained using the Adam optimizer [20]. The learning rate is set to 0.0001, and the weight decay and batch size to 0.0001 and 8, respectively. All input images have a dimension of 256×256, with a resolution ranging from ∼0.86 to 1 mm.

Evaluation Metrics: We report the statistical summary of the Dice score, in addition to the Hausdorff distance (HD) and the mean surface distance (MSD). The relative improvement (RI) w.r.t. the baseline is also reported.

Baselines: One baseline is the initial model, denoted the lower bound, trained on 3 labeled volumes. All SSL models are trained using the same 3 labeled volumes and 9 additional unlabeled volumes. To compare the SSL methods with a fully supervised model, we also define an upper bound, where the lower bound is further trained on the same 9 additional volumes with their labels revealed. Note that we use the MALC dataset for training and testing. For our method, we examine various choices for the mixed layers, among them MixMatch [3] when κ = 0.

Comparison with SSL methods. We compare our method with recent SSL methods applied to medical imaging, namely Bai et al. [1], Baur et al. [2], Cui et al. [7], and Zhang et al. [33]. Table 1 shows that our method outperforms the lower bound and all previous works with statistical significance (p < 0.001). The best results are obtained when mixing the data at the second hidden layer, Ours (κ = 2), with a reported average Dice of 87.2% and an RI of about 16.73%. Note that the best variant of our method outperforms Ours (κ = 0), which is similar to MixMatch [3], and even outperforms the upper bound on some metrics. This could be due to the fact that our approach avoids over-fitting by generating novel, previously unseen data points that introduce a lot of variation. Similar performance is reported for our model at κ = {0, 1, L}. Also, our method achieves the best HD and MSD, with values of 3.78 and 0.99, respectively. At the level of individual brain structures, our method significantly outperforms all other SSL methods on most structures, and even outperforms the upper bound on the Right Hippocampus and 3rd Ventricle (cf. Fig. 2). Besides, the performance of our method is consistent across different structures, as clearly shown for the Left Pallidum, 3rd Ventricle, Left Amygdala, and Right Hippocampus. Although our model achieves lower performance on the Left Cortical GM, the difference is not statistically significant. The qualitative results, with statistical significance indicated by (*), are presented in Fig. 4.

Comparison with SOTA for Whole Brain Segmentation. We also run our method in a fully supervised fashion to assess the effectiveness of our random mixup using only labeled data. To realize this, we train our model on the whole MALC training set (15 volumes) and evaluate it on the testing set (15 volumes). We compare our method with the SOTA methods U-Net [27] and QuickNAT [28]. In contrast to our models and U-Net, which are trained from scratch on the MALC training set only, QuickNAT is pre-trained using 581 labeled volumes from the IXI dataset. Table 2 shows that our models significantly outperform both U-Net and QuickNAT without any sophisticated pre-training mechanism. Again, our models achieve lower standard deviations compared to the other methods. We show that our simple yet elegant ROAM operation leads to state-of-the-art results without the need for large datasets or increased complexity, which is in line with the findings of Isensee et al. [17].

Our ablation involves κ, the sharpening, and the concatenation. Fig. 5 includes the results for models trained to mix the data at different layers. We notice that mixing the data at different random layers achieves better results than using only one fixed layer, except for κ = 2. This emphasizes the importance of alternating the mixing layer. To study the effect of the concatenation step, i.e. mixing labeled with unlabeled data, and of the sharpening step, we run an ablation and report the average Dice in Fig. 6. A drastic drop in the Dice scores is observed when removing one or both steps. The worst result is obtained when mixing the data without applying the sharpening step, because mixing the initial labels without minimizing their entropy harms the quality of the mixed-up data. Finally, we study the influence of the hyper-parameters α and β on the performance (cf. Table 3). As expected, our ablation study shows the essential role of each component in our method, justifying its design selection.

Changing the amount of data.
At first, we fix the amount of unlabeled data at 1500 slices while varying the amount of labeled data from 100 to 500 slices. The more labeled data, the higher the performance and confidence of our model compared to the others (cf. Fig. 7.a), whereas the confidence level is inconsistent for the other models. Next, we fix the amount of labeled data at 500 slices while reducing the amount of unlabeled data from 1500 to 500 slices. In contrast to the other methods, our model shows its superiority w.r.t. a varying amount of unlabeled data. Yet, Cui et al. [7] achieves an insignificantly higher Dice at 1000 unlabeled slices.

Domain shift results. Finally, we test all models in the presence of domain shift. We take the already trained models and test them on the IBSR and CANDI datasets. The results in Fig. 3 show a drastic drop for all models, including the baselines. The drop is larger on the IBSR dataset; however, our model at κ = {0, 1, L} performs well in both cases and is less sensitive to the domain shift problem compared with the other models and our variants, in particular the variants trained to mix only the input space (κ = 0) or a fixed hidden space (κ = 2). Although our model at κ = 2 achieves the SOTA on the MALC dataset, it has less generalization ability than our model at κ = {0, 1, L}.

In this paper, we introduce a random layer mixup for semi-supervised learning in medical image segmentation. We show that our method is less prone to over-fitting and has better generalization properties. Our experiments show superior, state-of-the-art performance of our method on whole-brain image segmentation in both supervised and semi-supervised settings. Our comprehensive ablation study shows that our method utilizes both labeled and unlabeled data efficiently, proving its stability, superiority, and consistency. So far, the quality of the pseudo labels mainly depends on the initial guess and the mixup coefficient λ; one could think of modeling this coefficient as a function of uncertainty measures. Also, to generate more realistic mixed-up data, one could think of performing the mixup operation on disentangled representations [15]. Our experiments demonstrate robust performance of our method under domain shift. Nevertheless, domain-invariant SSL methods should be further investigated.
References:
[1] Semi-supervised learning for network-based cardiac MR image segmentation.
[2] Semi-supervised deep learning for fully convolutional networks.
[3] MixMatch: A holistic approach to semi-supervised learning.
[4] Semi-supervised and task-driven data augmentation.
[5] Automatic segmentation of seven retinal layers in SDOCT images congruent with expert manual segmentation.
[6] AssemblyNet: A novel deep decision-making process for whole brain MRI segmentation.
[7] Semi-supervised brain lesion segmentation with an adapted mean teacher model.
[8] Improving data augmentation for medical image segmentation.
[9] Semi-supervised learning for segmentation under semantic constraint.
[10] Understanding the difficulty of training deep feedforward neural networks.
[11] Generative adversarial nets.
[12] Semi-supervised learning by entropy minimization.
[13] CE-Net: Context encoder network for 2D medical image segmentation.
[14] On calibration of modern neural networks.
[15] beta-VAE: Learning basic visual concepts with a constrained variational framework.
[16] Automatic lung segmentation for accurate quantitation of volumetric X-ray CT images.
[17] nnU-Net: Self-adapting framework for U-Net-based medical image segmentation.
[18] Prostate cancer segmentation using manifold mixup U-Net.
[19] CANDIShare: A resource for pediatric neuroimaging data.
[20] Adam: A method for stochastic optimization.
[21] Efficient inference in fully connected CRFs with Gaussian edge potentials.
[22] MICCAI 2012 workshop on multi-atlas labeling. In: Medical Image Computing and Computer Assisted Intervention Conference.
[23] Realistic evaluation of deep semi-supervised learning algorithms.
[24] Improving robustness of deep learning based knee MRI segmentation: Mixup and adversarial domain adaptation.
[25] Current methods in medical image segmentation.
[26] Image similarity and tissue overlaps as surrogates for image registration accuracy: Widely used but unreliable.
[27] U-Net: Convolutional networks for biomedical image segmentation.
[28] QuickNAT: A fully convolutional network for quick and accurate segmentation of neuroanatomy.
[29] Automated medical image segmentation techniques.
[30] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.
[31] Manifold mixup: Learning better representations by interpolating hidden states.
[32] mixup: Beyond empirical risk minimization.
[33] Deep adversarial networks for biomedical image segmentation utilizing unannotated images.

Acknowledgments: T.B. is financially supported by the German Academic Exchange Service (DAAD). S.A. is supported by the PRIME programme of the German Academic Exchange Service (DAAD) with funds from the German Federal Ministry of Education and Research (BMBF).