key: cord-0266327-zsyptoef
authors: Basak, Hritam; Hussain, Rukhshanda; Rana, Ajay
title: DFENet: A Novel Dimension Fusion Edge Guided Network for Brain MRI Segmentation
date: 2021-05-17
journal: nan
DOI: 10.1007/s42979-021-00835-x
sha: e8a8f1eb48e132da3396bfafbdfe14205b7a5834
doc_id: 266327
cord_uid: zsyptoef

The rapid increase in the morbidity of brain stroke over the last few years has been a driving force towards fast and accurate segmentation of stroke lesions from brain MRI images. With the recent development of deep learning, computer-aided segmentation of ischemic stroke lesions has become useful to clinicians for early diagnosis and treatment planning. However, most of these methods suffer from inaccurate and unreliable segmentation results because of their inability to capture sufficient contextual features from the MRI volumes. To meet these requirements, 3D convolutional neural networks have been proposed, which, however, suffer from huge computational requirements. To mitigate these problems, we propose a novel Dimension Fusion Edge-guided network (DFENet) that can meet both of these requirements by fusing the features of 2D and 3D CNNs. Unlike other methods, our proposed network uses a parallel partial decoder (PPD) module for aggregating and upsampling selected features that are rich in important contextual information. Additionally, we use edge guidance and an enhanced mixing loss for constantly supervising and improving the learning process of the network. The proposed method is evaluated on the publicly available Anatomical Tracings of Lesions After Stroke (ATLAS) dataset, resulting in mean DSC, IoU, Precision and Recall values of 0.5457, 0.4015, 0.6371, and 0.4969 respectively. These results outperform other state-of-the-art methods by a significant margin. The proposed model is therefore robust, accurate, superior to the existing methods, and can be relied upon for biomedical applications.

Cerebrovascular accident (stroke) is among the most common diseases, prevalent in the 40-60 age group and contributing to a large percentage of deaths worldwide every year [11]. Studies have shown that strokes can cause disabilities lasting 2-5 years in about 37-71% of reported adult cases globally [31]. Rehabilitation may be constructive for an eventual recovery in acute conditions, with its effectiveness depending on the neurological developments and the damage caused by the stroke in the patient. Significant improvements in neuroimaging, including brain image analysis and T1-weighted magnetic resonance imaging (MRI), have helped researchers diagnose patients, improve treatment procedures, and assess the likelihood of regaining functions such as motor speech [28]. Brain MRI segmentation is an important task because it influences the entire diagnosis, with the subsequent processing steps depending on accurate segmentation of the various anatomical regions. MRI combined with other diagnostic procedures thus helps in the detection of minor strokes and of ischemia that may result in ischemic strokes. In recent years, convolutional neural network (CNN) and deep neural network (DNN) models have proven remarkably effective for classification and segmentation tasks [5, 14, 18]. These methods differ from classical image processing methods in several aspects, including their automated feature extraction framework [7].
Deep-learning based methods, as compared to traditional ones, are efficient, robust and effective for clinical applications [6]. 2D CNNs treat the volumetric MRI data as a stack of two-dimensional slices and predict the result slice by slice; with each iteration, the network improves its prediction by minimizing a loss function. However, as recent studies have shown that considerable spatio-temporal information is lost in this approach, researchers have been shifting towards 3D CNNs. 3D CNNs are trained to utilize this crucial information in volumetric data and to segregate and outline abnormalities in medical images. Nonetheless, the memory and computational requirements of 3D CNNs are difficult to meet, so they have mostly been avoided even though they can extract important volumetric information. U-Net architectures have been quite instrumental for this purpose [3, 27, 34]; however, their 2D operations cannot extract enough information from 3D MRI data.

To segment stroke lesion areas precisely, in this paper we propose a novel dimensionally fused U-Net framework which, despite being built as a 2D framework, can also incorporate relevant 3D spatial information alongside the 2D information, with considerably lower memory and data requirements. Besides, instead of generating a segmentation map through traditional upsampling layers, we propose a parallel partial decoder (PPD) module that aggregates the features from different layers of the CNN for a better saliency map; it uses this contextual information to generate a global segmentation map that acts as the prediction of the proposed network. MRI images often contain regions with a certain level of similarity between the two adjacent classes (foreground and background); hence, accurate prediction of the class boundary is difficult in regions carrying high-level semantic information from both classes. To mitigate this problem, an edge attention module is included for a better representation of the edge information of the stroke lesions. We also propose an enhanced mixing loss function that combines the standard Binary Cross-Entropy (BCE) loss with a weighted Intersection over Union (IoU) loss, which enhances gradient propagation and achieves faster convergence. As a result of this enhanced deep supervision, the proposed model learns sufficient gradient information about pixel intensity and lesion boundary, leading to an accurate and improved segmentation map.

This paper proposes a novel segmentation framework that utilizes both 3D and 2D CNN information for the accurate segmentation of stroke lesions. The contributions of the paper are as follows:

1. We propose a novel dimension fusion block to integrate the features from 2D and 3D CNNs for better spatio-temporal feature representation. The proposed model is lightweight compared to 3D CNNs while providing superior performance to 2D CNNs.
2. Instead of simple upsampling layers, we propose a parallel partial decoder (PPD) module to aggregate features from the deep layers of the CNN, which contain high-level semantic information, for a better saliency map.
3. The shallow features obtained from the first convolution layer of the CNN are used for edge guidance, enabling accurate mapping of image regions near the lesion boundary.
4. We propose an enhanced mixing loss function integrating the traditional weighted IoU and BCE loss functions for better supervision and faster convergence.
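To make the computational argument behind contribution 1 concrete, the following minimal PyTorch check (not from the paper; the layer sizes are purely illustrative) compares the parameter counts of a single 3 × 3 2D convolution and a 3 × 3 × 3 3D convolution with the same channel widths:

```python
import torch
import torch.nn as nn

conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv3d = nn.Conv3d(64, 64, kernel_size=3, padding=1)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(conv2d))  # 64*64*3*3 + 64   = 36,928
print(params(conv3d))  # 64*64*3*3*3 + 64 = 110,656

# Activations scale similarly: a single slice vs. a (sub-)volume.
x2d = torch.randn(1, 64, 192, 192)
x3d = torch.randn(1, 64, 16, 192, 192)  # even a 16-slice sub-volume is 16x larger
print(conv2d(x2d).shape, conv3d(x3d).shape)
```

The 3D kernel carries three times the weights per channel pair and its activations grow with the extra depth axis, which is why the fusion scheme described later restricts 3D convolutions to the early stages of the network.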
Recent research in the literature mostly addresses the problem of lesion segmentation from brain MRI as a semantic segmentation task, producing dense pixel-wise predictions for every image slice. Noteworthy results have been attained using handcrafted features for brain MRI segmentation in the last few years [33]. A multivariate CTP-based segmentation technique for MRI images, predicting the chance of infarction voxel-wise, was presented by Kemmling et al. [19], whereas Nabizadeh et al. [26] proposed a gravitational histogram optimization algorithm using DWI for ischemic lesion segmentation. A multimodal MRI localization method was then proposed by Mitra et al. [25] for feature extraction, followed by a Random Forest method for lesion segmentation, which decreased the false-positive rate considerably.

Recent advancements in deep learning have inspired researchers to effectively utilize CNN-based approaches for supervised lesion segmentation. UNet [32] is often considered the state-of-the-art for biomedical image segmentation, but several modifications have been proposed in recent years to improve the overall learning and feature representation [2, 4, 20, 21]. Lyksborg et al. [24] used a 3-path ensemble of CNNs, one for each of the canonical axial, sagittal and coronal views. Chen et al. [9] proposed a novel ensemble of a multi-scale convolutional label evaluation net (MUSCLE Net) and a DeconvNet that outperformed existing methods on a private MRI dataset. Badrinarayanan et al. [1] proposed the SegNet architecture, consisting of an encoder-decoder followed by a pixel-wise classification layer; compared with widely adopted segmentation frameworks, it proved substantially better. A multi-path U-Net was proposed by Dolz et al. [10], initially to address the variability of ischemic strokes' location and shape, and was later adopted in several other medical image segmentation tasks. However, all these methods fail to extract the important 3D context in volumetric data, so the prediction may lose continuity due to the limitations of 2D slices.

To address these shortcomings, 3D CNNs have recently shown great potential. A fully connected 3D DenseNet was proposed by Zhang et al. [36], with a comparatively deep architecture and tight connections for improved performance. A two-path 3D CNN with a fully connected 3D conditional random field was proposed by Kamnitsas et al. [18], which achieved competitive results on the ISLES2015 dataset. A 3D SegNet architecture was proposed by Hu et al. [15], consisting of a 3D residual framework with a 3D voxel-wise segmentation pipeline; it produced promising results on the ISLES2017 dataset. Later, Feng et al. [12] proposed a method that extracts both spatial and temporal information using 3D convolutions and can capture important dynamic semantic features from adjacent frames. However, all these models require huge computational cost and long training times. It is therefore very difficult to fine-tune the training hyperparameters, and these models are prone to overfitting on small datasets. As a remedy, models combining 2D and 3D methods, connected in a cascaded manner, were proposed.
A hybrid densely connected U-Net (H-DenseUNet) was proposed by Li et al. [21], where a 2D densely connected U-Net was used for initial segmentation, followed by a 3D CNN to correct spatial continuity. However, such methods often suffer from loss of contextual information in fine-grained boundary regions through the series of downsampling and pooling operations in deep CNNs. Secondly, training traditional CNN models requires a lot of labelled data, which is often unavailable for brain MRI datasets. These challenges motivate shallow networks with carefully arranged layers that require fewer parameters and less contextual information to achieve superior performance.

In this section, we explain the workflow of our proposed method: the network architecture, the different modules and blocks used, the proposed loss function, and the hyperparameters involved. The primary architecture of the proposed network is built upon U-Net as the base model, with a few additional modifications. The overall framework, presented in Fig. 1, has been developed to extract both high-level semantics and low-level surface information from the image data. The proposed network consists of two parallel branches used to extract spatial information of different dimensionality from the volumetric MRI data through a series of 2D and 3D convolutions; the branches are later merged using a dimension transfer block, as shown in Fig. 2. This fusion scheme enables the network to learn refined edge information and helps the model identify small stroke regions. As the network deepens, the total number of trainable parameters increases dramatically due to the 3D convolution operations; hence, we use the fusion scheme only in the early stages of the network.

The dimension transfer block performs three major operations: (1) 3D dimensionality reduction, (2) squeeze-and-excitation, and (3) feature fusion. The idea was adopted from the recently developed Squeeze-and-Excitation (SE) network by Hu et al. [16]. The parameter r inside the SE block is the reduction ratio, which regulates the computational cost and capacity of the block. This architectural block activates channel-wise dynamic re-calibration of the 3D and 2D feature branches and enhances the fusion of the two different dimensional features. The excitation part of the SE block weights every feature map in the side network at a relatively low computational cost. Thus, even small stroke lesion regions, which are of utmost importance in many medical applications, can be detected with this approach.

Let F_2d and F_3d denote the 2D and 3D feature maps respectively, which are the inputs to the dimension transfer block, and let D, W, H, C and N represent the depth, width, height, number of channels and batch size of a feature map respectively. First, the channel dimension of F_3d is compressed from C to 1 using a 3D 1 × 1 × 1 convolution, converting F_3d from dimension N × H × W × D × C to N × H × W × D × 1; a squeeze operation then reduces this to N × H × W × D. To maintain consistency with the two-dimensional feature maps, this tensor is converted to dimension N × H × W × C by a 2D 3 × 3 convolution (treating the depth axis as channels). The resulting tensor is denoted

F*_3d = f_d(F_3d),

where f_d represents the overall dimensionality reduction operation. Next, the SE block is used to aggregate the feature channels of the two different dimensions, where the channel outputs are weighted for better proficiency in feature expression before fusion. Mathematically, the fused feature map F (of dimension N × H × W × C) obtained as the output of the SE block can be written as

F = f_s(F_2d, F*_3d),

where f_s represents the SE block that squeezes as well as excites. In this step, the features from the two different dimensions are fused. Further parameters of the entire network and its architecture are described in Fig. 2.

[Fig. 2: Architecture of the dimension transfer block, with two branches as input from the parallel 2D and 3D networks. First the feature channel of the 3D branch is compressed to 1, followed by a squeeze block and a 2D 3 × 3 convolution, making the feature dimension consistent with the 2D branch. The features are then passed through an SE block, where r denotes the reduction ratio. Finally, the two branches are merged to form the fused output.]
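The paper specifies the block through Fig. 2 rather than code, so the following PyTorch sketch is only one plausible reading of it: the concatenation before the SE block, the channel widths, and the way the depth axis is mapped to 2D channels are our assumptions (note also that PyTorch orders tensors as N × C × H × W rather than the paper's N × H × W × C):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation (Hu et al. [16]); r is the reduction ratio."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                         # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                    # squeeze: global average pool -> (N, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1)  # excite: per-channel weights
        return x * w                              # channel-wise re-calibration

class DimensionTransfer(nn.Module):
    """Fuses a 3D-branch feature (N, C3, D, H, W) with a 2D-branch feature (N, C, H, W)."""
    def __init__(self, c3d, depth, c2d, r=16):
        super().__init__()
        self.compress = nn.Conv3d(c3d, 1, kernel_size=1)             # channels -> 1
        self.to2d = nn.Conv2d(depth, c2d, kernel_size=3, padding=1)  # depth axis -> C channels
        self.se = SEBlock(2 * c2d, r)
        self.merge = nn.Conv2d(2 * c2d, c2d, kernel_size=1)          # fuse back to C channels

    def forward(self, f3d, f2d):
        x = self.compress(f3d).squeeze(1)   # (N, D, H, W): depth becomes the channel axis
        x = self.to2d(x)                    # (N, C, H, W), consistent with the 2D branch
        fused = torch.cat([f2d, x], dim=1)  # aggregate the two branches
        return self.merge(self.se(fused))   # channel re-weighting, then fusion

# Example with assumed shapes: f3d from a 3D stage, f2d from the matching 2D stage.
f3d = torch.randn(2, 32, 16, 96, 96)
f2d = torch.randn(2, 64, 96, 96)
out = DimensionTransfer(c3d=32, depth=16, c2d=64)(f3d, f2d)
print(out.shape)  # torch.Size([2, 64, 96, 96])
```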
Accurate segmentation of regions near the lesion boundary in biomedical images is quite challenging due to the high-level semantic information shared between the two adjacent classes (foreground and background) in these regions. Zhao et al. [39] showed that fine-grained boundary constraints can provide useful supervision for the feature extraction task in image segmentation, leading to accurate localization of the ROI. Similar claims were later made in Et-Net [38], where the authors utilized edge attention representations in the early encoding stage and transferred them to the multi-stage decoding layers for biomedical image segmentation. Hence, inspired by the original work of DANet [13], we introduce a supervised edge attention (EA) module into the segmentation framework to effectively learn edge information from the instances. Here, the position attention module, as suggested in Ref. [13], enhances the representation capability of local features by aggregating a wide range of contextual information into them. In addition to the original DANet, we add extra convolution layers after the position attention module to obtain edge attention features of the same depth as the feature map.

Let F_A ∈ ℝ^(h×w×c) be a local feature, where h, w and c represent the height, width and number of channels respectively. F_A is fed to three convolution layers to produce three new feature maps F_B, F_C and F_D, where {F_B, F_C, F_D} ∈ ℝ^(h×w×c). F_B, F_C and F_D are then reshaped to n × c, where n = h × w is the number of pixels. F_B is multiplied with the transpose of F_C, followed by a softmax, to obtain the edge attention map M_E ∈ ℝ^(n×n):

M_E(j, i) = exp(F_B(i) · F_C(j)) / Σ_(k=1..n) exp(F_B(k) · F_C(j)),

where M_E(j, i) represents the impact of the pixel at the i-th position on that at the j-th position. In parallel, F_D is multiplied with the transpose of M_E and the output is reshaped to h × w × c. Finally, the result is scaled by a learnable multiplication factor λ and element-wise summed with F_A to produce the final output of the EA module:

O_EA(j) = λ Σ_(i=1..n) M_E(j, i) F_D(i) + F_A(j).

The output from the EA module is then fed to the PPD block for better edge supervision of the overall segmentation process. The workflow of the EA block is shown in Fig. 3.
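A compact sketch of the edge attention module under the formulation above; following DANet [13], we assume 1 × 1 convolutions for F_B, F_C and F_D and a learnable scale λ initialized to zero, neither of which is stated explicitly here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAttention(nn.Module):
    """Position-attention style edge module (after DANet [13])."""
    def __init__(self, c):
        super().__init__()
        self.conv_b = nn.Conv2d(c, c, 1)         # produces F_B
        self.conv_c = nn.Conv2d(c, c, 1)         # produces F_C
        self.conv_d = nn.Conv2d(c, c, 1)         # produces F_D
        self.lam = nn.Parameter(torch.zeros(1))  # learnable scale, init 0 as in DANet

    def forward(self, fa):                       # fa: (N, c, h, w)
        nb, c, h, w = fa.shape
        fb = self.conv_b(fa).view(nb, c, h * w)  # (N, c, n) with n = h*w
        fc = self.conv_c(fa).view(nb, c, h * w)
        fd = self.conv_d(fa).view(nb, c, h * w)
        # M_E(j, i): softmax over positions i of F_B(i) . F_C(j)
        energy = torch.bmm(fc.transpose(1, 2), fb)   # (N, n, n), rows indexed by j
        me = F.softmax(energy, dim=-1)
        out = torch.bmm(fd, me.transpose(1, 2))      # sum_i M_E(j, i) F_D(i) -> (N, c, n)
        out = out.view(nb, c, h, w)
        return self.lam * out + fa                   # O_EA

x = torch.randn(2, 64, 48, 48)
print(EdgeAttention(64)(x).shape)  # torch.Size([2, 64, 48, 48])
```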
Traditional segmentation models in the U-Net family utilize a symmetrical encoder-decoder architecture, giving similar importance to high-level and low-level features when producing the final segmentation map. However, Wu et al. [35] suggested that low-level features contribute very little to the final prediction map, leading to unnecessary use of computational resources due to their high spatial resolution. To mitigate this problem, instead of aggregating features from all levels of the CNN followed by sequential upsampling, we use a partial decoder module that takes inputs from selected CNN layers only. Inspired by the Receptive Field Block (RFB) [23], our proposed PPD module captures global contextual information to produce an accurate segmentation map. The initial convolution layers of the CNN are considered shallow layers and provide information of little significance to the final prediction; hence, they are excluded from the inputs of the PPD module. Specifically, the features from the two dimension transfer blocks are the major inputs of the PPD module because of their richness in important spatio-temporal information in both 2D and 3D space. Additionally, the feature from the final downsampling layer and the output of the EA module are also fed to the PPD module. To accelerate feature propagation, we add a series of convolution and batch normalization operations, as shown in Fig. 4. Short skip connections are added in the PPD module, similar to the original RFB module. After obtaining the different discriminating features from the different layers, we finally multiply them to reduce the gap between the feature levels. Thus, the PPD module produces a global segmentation map through a series of element-wise multiplication and concatenation operations. The overall architecture of the PPD module is shown in Fig. 4.
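Since the exact wiring of the PPD is given only in Fig. 4, the sketch below follows the cascaded-partial-decoder idea it builds on [35] rather than the paper's exact design: three selected deep features (assumed already projected to a common channel width c) are refined by conv-BN-ReLU units, multiplied across levels after upsampling, and concatenated into a single global map. The specific feature choices and channel widths are assumptions:

```python
import torch
import torch.nn as nn

def cbr(cin, cout):  # conv + batch norm + ReLU, used to speed feature propagation
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class PPD(nn.Module):
    """Aggregates selected deep features (e.g. the two dimension-transfer
    outputs and the deepest encoder feature) into one global map."""
    def __init__(self, c):
        super().__init__()
        self.b1, self.b2, self.b3 = cbr(c, c), cbr(c, c), cbr(c, c)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.head = nn.Conv2d(3 * c, 1, 1)   # 1-channel global segmentation map

    def forward(self, f_deep, f_mid, f_shallow):
        # f_deep is the coarsest level; upsample and multiply into the finer levels
        x3 = self.b3(f_deep)
        x2 = self.b2(f_mid) * self.up(x3)        # element-wise multiplication
        x1 = self.b1(f_shallow) * self.up(x2)    # narrows the gap between levels
        out = torch.cat([x1, self.up(x2), self.up(self.up(x3))], dim=1)
        return self.head(out)

# Example with assumed shapes: features at strides 1x, 2x and 4x apart.
f1, f2, f3 = (torch.randn(2, 64, s, s) for s in (48, 24, 12))
print(PPD(64)(f3, f2, f1).shape)  # torch.Size([2, 1, 48, 48])
```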
To supervise the overall learning and segmentation process of the proposed network, we use an enhanced mixing loss function to mitigate the problems of the standard Least Absolute Deviations (L1) and Least Squares (L2) loss functions. This hybrid loss incorporates weighted Binary Cross-Entropy (BCE) and weighted Intersection over Union (IoU) losses, leading to fast convergence and efficient global and local supervision. The output of the EA module, O_EA, is supervised against the actual edge map G_E, obtained by computing the gradient of the ground truth. The standard BCE loss is used to measure the dissimilarity between the two:

L_E = − Σ_i [ G_E(i) log O_EA(i) + (1 − G_E(i)) log(1 − O_EA(i)) ],

where i indexes the pixels of the segmentation map. The weighted BCE loss, commonly known as WCE [29], is a modification of the standard BCE loss, useful in biomedical applications where class imbalance between foreground and background pixels is evident. The formulation of the WCE loss in our case is

L_WCE = − Σ_i [ β G(i) log O_S(i) + (1 − G(i)) log(1 − O_S(i)) ],

where β is the correcting factor used to tune the false positive and false negative predictions, G is the ground truth, and O_S is the prediction map. Similarly, we calculate the weighted IoU loss L_WIoU and define the overall loss function L as

L = L_E + L_WCE + L_WIoU.
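A minimal sketch of the mixing loss under the equations above, assuming sigmoid outputs in (0, 1); the paper does not print the value of β or the exact weighting of the IoU term here, so β = 1.5 and the plain soft-IoU form below are placeholders:

```python
import torch
import torch.nn.functional as F

def edge_bce(o_ea, g_edge):
    """Standard BCE between the EA output and the gradient-derived edge map."""
    return F.binary_cross_entropy(o_ea, g_edge)

def weighted_bce(o_s, g, beta=1.5):
    """WCE: beta > 1 penalizes missed pixels of the (rare) lesion class more."""
    loss = -(beta * g * torch.log(o_s + 1e-7)
             + (1 - g) * torch.log(1 - o_s + 1e-7))
    return loss.mean()

def weighted_iou(o_s, g, eps=1e-7):
    """Soft IoU loss over the segmentation map (placeholder weighting)."""
    inter = (o_s * g).sum(dim=(1, 2, 3))
    union = (o_s + g - o_s * g).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def total_loss(o_s, o_ea, g, g_edge):
    # L = L_E + L_WCE + L_WIoU, as defined above
    return edge_bce(o_ea, g_edge) + weighted_bce(o_s, g) + weighted_iou(o_s, g)
```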
In this section, we describe the results obtained, both quantitatively and qualitatively, the experiments performed, the ablation studies, and the comparison of our results with the state-of-the-art, to evaluate the comparative performance of the proposed method. To evaluate the performance of the proposed model on the supervised stroke lesion segmentation task, we use four standard and widely used evaluation metrics, described as follows.

Dice Similarity Coefficient (DSC): a spatial overlap metric, computed for a predicted mask S and ground truth G as (Eq. 8)

DSC = 2 |S ∩ G| / (|S| + |G|).

Intersection over Union: also known as the Jaccard Index (JI), it measures the accuracy of segmentation as the ratio of the intersection of the objects to their union when projected onto the same plane (Eq. 9):

IoU = |S ∩ G| / |S ∪ G|,

where S is the predicted segmentation mask and G is the original ground-truth mask of the image.

Precision: refers to the purity of the positive detections with respect to the actual ground truth, i.e., how many pixels in the segmentation map match the verified ground-truth observations (Eq. 10):

Precision = TP / (TP + FP),

where TP and FP are the true positives and false positives respectively. A false negative error arises when a pixel inside a stroke region is misclassified as a normal region; likewise, a false positive occurs when a pixel of the non-lesion class is misclassified as lesion.

Recall: expresses how complete the positive predictions are with respect to the actual ground truth; among the total pixels annotated in the verified ground truth, it refers to the fraction captured as positive predictions (Eq. 11):

Recall = TP / (TP + FN),

where FN indicates the false negatives.
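All four metrics reduce to confusion-matrix counts over the binarized masks; a self-contained implementation consistent with Eqs. 8-11 (the 0.5 threshold is an assumption):

```python
import torch

def seg_metrics(pred, gt, thr=0.5, eps=1e-7):
    """DSC, IoU, precision and recall for binary masks (Eqs. 8-11)."""
    s = (pred > thr).float()      # binarize the predicted map
    g = (gt > 0.5).float()
    tp = (s * g).sum()            # lesion pixels correctly detected
    fp = (s * (1 - g)).sum()      # background misclassified as lesion
    fn = ((1 - s) * g).sum()      # lesion misclassified as background
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return dsc, iou, precision, recall
```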
The proposed method was implemented in Python, utilizing an Nvidia K80 GPU with 12 GB of available RAM. A Stochastic Gradient Descent (SGD) optimizer is used with an initial learning rate of 1e−4; the learning rate is reduced by a factor of 0.1 when the performance metrics plateau on the validation set. The proposed method was evaluated on the Anatomical Tracings of Lesions After Stroke (ATLAS) [22] dataset, consisting of 229 T1-weighted 3D MRI images for stroke lesion segmentation, collected from 11 cohorts worldwide. Each image consists of 189 slices, manually segmented for stroke lesions by expert practitioners. The original images have dimensions of 233 × 197 and are resized to 192 × 192. Several image augmentation methods, including random flips, random rotations and colour augmentation, were incorporated to address the possibility of over-fitting. The reduction factor r of the SE block is set to 16, the batch size to 8, and the maximum number of iterations to 200, with early stopping included. The dataset was divided into train-test-validation splits in a ratio of 8:1:1. 5-fold cross-validation was performed and the average value is reported throughout the experiments. Figure 5 shows the variation of the different segmentation metrics with respect to epochs, for both the training and validation sets; we plot the mean over the 5 folds together with a ±1 standard deviation band.

To study the importance of the different modules of the proposed DFENet, we performed ablation studies by removing a particular module or part of the architecture while keeping all other learning parameters and the network unchanged. The dimension fusion blocks improve the segmentation performance (Table 1) by aggregating sufficient contextual and spatio-temporal information, which cannot be captured using a mere 2D UNet. Replacing the upsampling block of UNet with the proposed PPD further increases the segmentation performance.

[Table 1: 'TPR' and 'TNR' denote the true positive and true negative rates respectively; 'DSC' denotes the Dice similarity score.]

To validate the effectiveness of the proposed DFENet, we compare our results quantitatively with several other existing methods that have been used successfully for accurate stroke lesion segmentation on the ATLAS dataset. Table 2 shows this comparison with UNet [32], SegNet [1], ResUNet [37], PSPNet [40], DeepLab V3 [8], XNet [30], 3D CNN+CRF [18], and Brain SegNet3D [15]. To analyze the superiority of the proposed method over most of the existing ones visually, a comparison of selected segmentation instances is shown in Fig. 6. It is evident from the figure that SegNet often misses important minor lesion regions (row 1), leading to poor overall performance, as shown in Table 2.

[Fig. 6: Visual comparison of the segmentation results of our proposed method and different existing methods. The first column shows the original input image of the Apparent Diffusion Coefficient (ADC) MRI sequence in 2D format, with its cut coordinate along the x-axis; the second column shows the ground truth; the third column shows the segmentation map from our method. DSC scores are included alongside the visual comparison to indicate the accuracy of segmentation.]

Though UNet performs quite well in other biomedical segmentation applications and is considered a state-of-the-art model, it fails to extract sufficient information for accurate lesion segmentation on the ATLAS dataset compared to the proposed DFENet, as shown in Tables 1 and 2. XNet and ResUNet, on the other hand, consistently produce segmentation maps very close to those of the proposed DFENet, which is also reflected in Table 2. 3D segmentation frameworks such as 3D CNN+CRF [18] and Brain SegNet3D [15] outperform the proposed method marginally, as shown in Table 2, but these methods are computationally expensive compared to the proposed method. We can therefore conclude that the proposed DFENet outperforms all the 2D segmentation methods on this particular segmentation task and produces results comparable to 3D CNNs at a lower computational cost. We also compare the total number of model parameters of our proposed method with other existing frameworks in the literature, shown in Table 3. It is evident from the table that our proposed method contains only 9.64M parameters, compared to 22M for 3D UNet. Existing state-of-the-art methods such as SegNet, XNet, ResUNet, and DeepLab V3 suffer from extremely high computational cost compared to our proposed method, with 14.7M, 15.1M, 12.5M, and 21.3M parameters respectively. UNet, with only 7.77M parameters, is lighter than DFENet but produces a poor segmentation map, as shown in Table 2.

To gracefully address the shortcomings of existing segmentation methods, in this paper we present an end-to-end segmentation framework. The proposed DFENet can effectively extract information-rich contextual features and predict an accurate segmentation map by fusing 2D and 3D features. The proposed method is robust and effective, and when compared to traditional segmentation models it outperforms them quite significantly, adding to its reliability and clinical importance. Our main focus therefore lies in developing a novel and effective segmentation framework. As is clear from the discussion, our approach is suitable for volumetric image segmentation (since we also capture 3D features), and brain MRI being one such volumetric modality, we have evaluated the model's performance on this particular dataset. However, in future we would like to extend the experiments to other datasets to analyze the model's performance. In future, we also plan to extend the model to semi-supervised segmentation to address the problem of insufficient labelled data. This paper can serve as a testbed for further experimentation and the development of diverse segmentation frameworks for other biomedical applications as well.

Funding: The authors did not receive support from any organization for the submitted work. No funding was received to assist with the preparation of this manuscript.

Conflict of Interest: Hritam Basak declares that he has no conflict of interest. Rukhshanda Hussain declares that she has no conflict of interest. Ajay Rana declares that he has no conflict of interest.

Ethical Approval: This article does not contain any studies with human participants or animals performed by any of the authors.
References

[1] Segnet: a deep convolutional encoder-decoder architecture for image segmentation
[2] F-unet: a modified u-net architecture for segmentation of stroke lesion
[3] Single image super-resolution using residual channel attention network
[4] Cervical cytology classification using pca & gwo enhanced deep features selection
[5] Comparative study of maturation profiles of neural cells in different species with the help of computer vision and deep learning
[6] Deep convolutional neural networks for brain image analysis on magnetic resonance imaging: a review
[7] Multi-scale attention u-net (msaunet): a modified u-net architecture for scene segmentation
[8] Encoder-decoder with atrous separable convolution for semantic image segmentation
[9] Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks
[10] Dense multi-path u-net for ischemic stroke lesion segmentation in multiple image modalities
[11] Stroke in the 21st century: a snapshot of the burden, epidemiology, and quality of life
[12] A deep learning approach for targeted contrast-enhanced ultrasound based prostate cancer detection
[13] Dual attention network for scene segmentation
[14] White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks
[15] Brain segnet: 3d local refinement network for brain lesion segmentation
[16] Squeeze-and-excitation networks
[17] net for skull stripping in brain MRI
[18] Efficient multi-scale 3d CNN with fully connected CRF for accurate brain lesion segmentation
[19] Multivariate dynamic prediction of ischemic infarction and tissue salvage as a function of time and degree of recanalization
[20] Fuzzy rank-based fusion of CNN models using gompertz function for screening COVID-19 CT-scans
[21] H-denseunet: hybrid densely connected U-net for liver and tumor segmentation from CT volumes
[22] A large, open source dataset of stroke anatomical brain images and manual lesion segmentations
[23] Receptive field block net for accurate and fast object detection
[24] An ensemble of 2d convolutional neural networks for tumor segmentation
[25] Lesion segmentation from multimodal MRI using random forest following ischemic stroke
[26] Automatic ischemic stroke lesion segmentation using single mr modality and gravitational histogram optimization based brain segmentation
[27] Automated segmentation of acute stroke lesions using a data-driven anomaly detection on diffusion weighted MRI
[28] Interrater agreement for final infarct MRI lesion delineation
[29] Weighted rank aggregation of cluster validation measures: a monte carlo cross-entropy approach
[30] X-net: brain stroke lesion segmentation based on depthwise separable convolution and long-range dependencies
[31] Stroke mortality and trends from 1990 to 2006 in 39 countries from Europe and central Asia: implications for control of high blood pressure
[32] U-net: convolutional networks for biomedical image segmentation
[33] Querying representative and informative super-pixels for filament segmentation in bioimages
[34] Nas-unet: neural architecture search for medical image segmentation
[35] Cascaded partial decoder for fast and accurate salient object detection
[36] Automatic segmentation of acute ischemic stroke from DWI using 3-d fully convolutional densenets
[37] Road extraction by deep residual u-net
[38] Et-net: a generic edge-attention guidance network for medical image segmentation
[39] Egnet: edge guidance network for salient object detection
[40] Pyramid scene parsing network