key: cord-0043195-q01kjv8s
authors: Huang, Yongfeng; Li, Xueyang; Yan, Cairong; Liu, Lihao; Dai, Hao
title: MIRD-Net for Medical Image Segmentation
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_16
sha: 12112efc0b97a588ae6740e952f26807a2a8280e
doc_id: 43195
cord_uid: q01kjv8s

Medical image segmentation is a fundamental and challenging problem in medical image analysis, because adjacent tissues have similar pixel values near their boundaries and the features relating pixels are non-linear. Although fully convolutional neural networks such as U-Net have demonstrated impressive performance on medical image segmentation, distinguishing subtle features between different categories after pooling layers is still difficult, which affects the segmentation accuracy. In this paper, we propose a Mini-Inception-Residual-Dense (MIRD) network named MIRD-Net to deal with this problem. The key component of MIRD-Net is the MIRD Block. It takes advantage of Inception, the Residual Block (RB) and the Dense Block (DB), so that the network obtains more features and improves segmentation accuracy. There is no pooling layer in MIRD-Net; this design avoids loss of information during forward propagation. Experimental results show that our framework significantly outperforms U-Net on six different image segmentation tasks while using only about 1/50 of U-Net's parameters.

Medical image segmentation is the key to determining whether medical images can provide a reliable basis for clinical diagnosis and treatment. However, the borders between tissues in medical images may be blurred by image acquisition, which increases the difficulty of segmentation.
Classical (non-fully-convolutional) CNNs such as [18] and the residual network (Res-Net) [6] can only classify whole examples rather than individual pixels, because the fully connected layers at the end of the network assign one category to the whole image and not to each pixel. Nevertheless, many medical imaging tasks, and segmentation in particular, require a class label for each pixel. The breakthrough by Ciresan et al. [3] came from a sliding-window setup that predicts the class label of each pixel from a local context (neighbor region) around that pixel. They won the EM segmentation challenge at ISBI 2012. However, this approach has some limitations: (1) training is slow because the network must be run separately for each neighbor region, and the overlap between neighboring regions introduces some redundancy; (2) it is hard to balance context against localization accuracy. Smaller neighbor regions give the network little context, while larger neighbor regions need more pooling layers, which reduce the localization accuracy. The fully convolutional network (FCN) replaces the fully connected layers with convolutional layers, producing a probability for each pixel rather than a single score for the whole image, which improves segmentation accuracy [17]. Moreover, an FCN can take a whole image and its segmentation as training inputs, rather than being fed every sub-image centered on each labelled pixel as in Ciresan et al. [3]. Inspired by FCN, U-Net, a symmetrical fully convolutional network, was proposed [16] and has been widely used because of its elegant architecture. The network has a contracting path and an expanding path that is more or less symmetric to it, yielding an architecture shaped like the letter U.
In the expanding path, pooling operators are replaced by upsampling operators to increase the resolution of the output. The high-resolution features from the contracting path are combined with the upsampled output through skip connections, which allows the network to learn more precise features from this information. However, the U-Net architecture has one drawback: it is difficult to improve its performance by making it shallower or deeper. In principle, a deeper network should learn more features and produce better segmentation, but gradients may vanish during training, making the network hard to train [8, 20]. In recent years, several variants of U-Net have been proposed [9, 13, 25]. These networks share roughly the same backbone, consisting of downsampling layers, upsampling layers, and skip connections (see Fig. 1). They differ in the modules they use and in the way layers are connected. The Residual Block (RB) [6] and the Dense Block (DB) [7] are widely integrated into U-Net because of their scalability. They also make deep networks easier to train, enabling the network to learn more representative information. Furthermore, the concatenation in Dense-Net lets the final classifier use features from all previous layers (unlike classical CNN approaches), resulting in better classification performance. The challenge is to create a network that achieves high accuracy without gradient vanishing and with fewer parameters. Motivated by previous work and the existing problems in U-Net, we propose a new symmetrical network named MIRD-Net. It integrates the Residual Block (RB) [6] and the Dense Block (DB) [7] into the inception architecture [20], aiming for high accuracy with fewer parameters. The exploration of our network consists of four steps. First, we choose ten layers (including the pooling layers, the convolutional layers, and the upsampling layers) as the backbone of the network.
Second, we add RBs as functional modules to the up-down sampling path and determine their positions through experiments. Third, once the best positions of the RBs are determined, the backbone is equipped with two DBs. We combine the two DBs with RBs to obtain two Mini-Inception-Residual-Dense (MIRD) Blocks, which replace the two RBs located in the downsampling path. Finally, the pooling layers are replaced with 3 × 3 convolutional layers. The main contributions are as follows: (1) a shallower backbone that decreases the number of parameters, in which the combination of the inception architecture, RB and DB lets the network learn more representative features; (2) a simple and flexible implementation of the proposed network architecture; (3) strong performance on challenging medical image segmentation tasks.

CNNs have reached the state of the art in medical segmentation since FCN was proposed. These networks consist of a symmetrical backbone with a downsampling path and an upsampling path, which allows the features extracted by downsampling to be combined with the features recovered by upsampling through skip connections. Korez et al. [10] proposed a 3D version of FCN to process MRI images of the human spine. Zhou et al. [24] combined a 2D FCN with a 3D majority voting algorithm, achieving strong performance in three-dimensional segmentation of human torso CT. Olaf Ronneberger et al. [16] extended FCN to a symmetrical U-Net and won first prize in the ISBI cell tracking challenge 2015. Compared with FCN, one important modification in U-Net is the skip connections, which let the network fuse the information of the up-down sampling path and generate higher-resolution, more accurate masks. In addition, the U-shaped architecture can be approximately straightened into a line-shaped network, which is similar to Dense-Net, where skip connections are also used [7]. Inspired by Dense-Net, Z. Zhou et al.
[25] altered U-Net by transforming the skip connections into dense skip connections, so that each node is connected with all previous nodes as in Dense-Net. Drozdzal et al. [4] demonstrated the importance of skip connections in U-Net and combined cross entropy and the Dice coefficient as a loss function. Cicek et al. [2] proposed a 3D version of U-Net that performs 3D image segmentation from consecutive 2D slices. Milletari et al. [14] converted the 3D version of U-Net into V-Net and used the Dice coefficient instead of binary cross entropy as the loss function to segment prostate MRI images. Brosch et al. [1] added skip layers to the first downsampling layer and the last upsampling layer of U-Net, which allows lesions in brain MRI to be located precisely. X. Li et al. [12] proposed H-DenseUNet with mixed dense connections, reducing GPU memory consumption during training and performing well in the Liver MICCAI 2017 challenge. Steven Guan et al. [5] designed FD-UNet to remove artifacts from 2D PAT images reconstructed from sparse data and compared it with the standard U-Net in terms of reconstructed image quality.

In addition to improvements in architecture, advances are being made in some functional operations. Pooling layers are a basic module widely used in CNNs; they enlarge the Receptive Field (RF) so that the network obtains more effective information during training. However, pooling operations also lose some spatial information because they reduce the size of the images. We cannot simply remove the pooling layers and enlarge the convolutional kernels instead, because larger kernels increase the computational cost. A larger kernel can, however, be replaced by multiple smaller kernels, which keeps the number of parameters low and can be seen as imposing a regularization on the larger kernel [18].
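The trade-off between one large kernel and several stacked small kernels can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the standard output-size relation N = (W − F + 2P)/S + 1, stride 1, no padding, equal input and output channel counts, and no bias terms. The function names and the values 512 and 64 are hypothetical choices, not taken from the paper's implementation.

```python
def conv_output_size(w, f, p=0, s=1):
    """Output width of one convolution: N = (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


def stacked_output_size(w, f, layers, p=0, s=1):
    """Output width after applying the same convolution `layers` times."""
    for _ in range(layers):
        w = conv_output_size(w, f, p, s)
    return w


def conv_weights(kernel, channels):
    """Weights of a kernel x kernel convolution with `channels` in and out (no bias)."""
    return kernel * kernel * channels * channels


# Hypothetical example values: a 512-pixel-wide image and 64 channels.
width, channels = 512, 64

size_7x7 = conv_output_size(width, 7)            # one 7 x 7 convolution
size_3x3_x3 = stacked_output_size(width, 3, 3)   # three 3 x 3 convolutions

weights_7x7 = conv_weights(7, channels)          # 49 * channels^2
weights_3x3_x3 = 3 * conv_weights(3, channels)   # 27 * channels^2

print(size_7x7, size_3x3_x3)        # both output widths are equal
print(weights_7x7, weights_3x3_x3)  # the stacked version needs fewer weights
```

Under these assumptions the two configurations cover the same 7 × 7 receptive field and give the same output width, while the stacked version uses 27/49 of the weights.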
Suppose we apply a 3 × 3 kernel three times in succession and a 7 × 7 kernel once to the same image. According to (1), both yield the same output size if the other conditions (S and P) are the same. Moreover, if F denotes the number of channels of both the input and the output, a single 7 × 7 convolution requires $7^2 F^2 = 49F^2$ parameters, whereas three stacked 3 × 3 convolutions require $3 \cdot (3^2 F^2) = 27F^2$. The output size is given by

$$N = \frac{W - F + 2P}{S} + 1 \qquad (1)$$

where W is the size of the input image, N is the size of the output image after the convolution, F is the kernel size (in (1), as opposed to the channel count above), P is the padding size and S is the stride. F. Yu et al. [22] used dilated convolution to replace the pooling operation, which has two advantages. First, it enlarges the RF without the loss of information caused by pooling. Second, it works well in situations where the image requires global information. Conditional Random Fields (CRF) have been used in image segmentation since 2011 [11]. Later, the CRF was added as a functional module at the back end of the neural network to refine the segmentation result [23].

[Fig. 2 caption, partially recovered] ... Fig. 3, Fig. 4 and Fig. 5, respectively. The tuples (3, 5, 8, 10), (2, 5, 8, 11), (3, 5, 8, 12) and (2, 5, 8, 12) give the positions of the Residual Blocks in the Residual-Shallow U-Net.

The proposed MIRD-Net is briefly shown in Fig. 2.

Residual Block. Experiments have shown that feature extraction is affected by the depth of the network [19, 20]. Adding layers lets a network learn more features, but it can also bring over-fitting, vanishing gradients and other issues, so that the extracted features are not fully used. K. He et al. [6] proposed the residual network, which reuses the features from the previous layer (see Fig. 3) and eases the training of deeper networks:

$$x_l = H_l(x_{l-1}) + x_{l-1}$$

where $x_l$ represents the output of the current layer, $x_{l-1}$ is the output of the previous layer, and $H_l(\cdot)$ is the non-linear calculation, including Conv, ReLU [21] and BN [8], in the Residual Block.
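The residual relation described above, in which the block output is the non-linear transform of the input plus the input itself, can be illustrated with a toy example. This is a minimal sketch on plain Python lists, not the paper's Keras implementation: `relu_scale` is a hypothetical stand-in for the Conv-BN-ReLU computation H, and the only property it shares with the real block is that it preserves the input shape.

```python
def relu_scale(x, weight=0.5):
    """Hypothetical stand-in for H(.): scale each value, then apply ReLU."""
    return [max(0.0, weight * v) for v in x]


def residual_block(x, transform):
    """Return transform(x) + x element-wise (the identity shortcut)."""
    h = transform(x)
    if len(h) != len(x):
        raise ValueError("transform must preserve the input length")
    return [hi + xi for hi, xi in zip(h, x)]


features = [1.0, -2.0, 4.0]
print(residual_block(features, relu_scale))

# If the transform contributes nothing, the block reduces to the identity,
# which is why the shortcut keeps earlier features available to later layers.
print(residual_block(features, lambda x: [0.0] * len(x)))
```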
Inspired by that, we first reduce the number of layers of U-Net [16] to keep the parameter count low, and then place four Residual Blocks (RB) on the up-down sampling path (two RBs on the upsampling path and two on the downsampling path) to optimize the performance of the network. In principle, the number of Residual Blocks can be chosen freely, but given the goals of few parameters and good performance, four Residual Blocks are a reasonable choice. With four Residual Blocks, we further examine where the blocks should be located (see Fig. 2(a-d)). After the positions are determined, we optimize the Residual Block to obtain a more elegant block in the same position.

Dense Block. Within the Dense Block [7], each layer is connected to all previous layers through concatenation, as used in U-Net [16], which has several advantages: (1) it strengthens feature propagation; (2) it alleviates gradient vanishing during training; (3) it encourages feature reuse. Figure 4 shows the layout of a Dense Block. Formally, layer $x_l$ is connected with all previous layers $(x_{l-1}, x_{l-2}, \ldots, x_0)$:

$$x_l = f[*(x_{l-1}, x_{l-2}, \ldots, x_0)]$$

where $x_l$ represents the output of the current layer, $x_{l-1}, x_{l-2}, \ldots, x_0$ are the outputs of all previous layers connected to $x_l$, $*(\cdot)$ is the concatenation operation, and $f[\cdot]$ is the non-linear calculation, including Conv, ReLU [21] and BN [8], in the Dense Block.

Fig. 4. The architecture of a Dense Block with m convolution layers. $c_0$ is the number of channels of the input image and $l_i$ (growth rate) is the number of channels of the convolved image. ReLU [21] and BN [8] are attached to each convolution layer in a Dense Block.

The Dense Block is effective for our proposed network and brings three major advantages: (1) the parameter space can be managed simply through $l_i$ (the growth rate); (2) it is generally hard to ensure that gradients flow smoothly in back propagation.
But the dense connections in the Dense Block can alleviate gradient vanishing; (3) the datasets used in our experiments are small, so it is important to reuse features, which gives the network more information. The dense connections make comprehensive use of the features from all previous layers (instead of only the last layer), which makes it easier to obtain a smooth decision function with better performance.

Motivated by the Residual Block and the Dense Block, we integrate them into the inception architecture [20] to form the Mini-Inception-Residual-Dense Block (see Fig. 5). We place two MIRD Blocks on the downsampling path, replacing the two Residual Blocks located there, and remove the pooling layers. We remove the pooling layers because pooling operations can discard some pixel-level information. Let $x_l$ be the output of the MIRD Block and $x_{l-1}$ its input; the relation between $x_l$ and $x_{l-1}$ is defined in (4), where G(·) is the function of the Inception Block, H(·) is the calculation in the Dense Block, and F(·) is the calculation in the Residual Block.

We use the number of parameters of each network and the well-known Dice coefficient for evaluation. Each dataset used in our experiments is small, like the cells dataset used in U-Net (only 30 images), so it is inappropriate to divide it into three parts (training set, validation set and test set). Therefore, we split each dataset equally into five subsets (F1-F5) and run a 5-fold cross-validation as in [15]. The MDice (mean Dice coefficient) and StdDice (standard deviation of the Dice coefficient) are defined in (5) and (6), where $A_{ij}$ is the predicted image, $B_{ij}$ is the ground truth corresponding to $A_{ij}$, m is the number of images in one subset, and r is the fold used in cross-validation. The medical segmentation tasks in our experiments are binary classification problems, so the ground truth $B_{ij}$ is a 0-1 matrix.
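The evaluation protocol above can be sketched in code. This is an illustration under stated assumptions, not the authors' evaluation script: it assumes the standard Dice definition 2|A∩B| / (|A| + |B|) on binary masks, averages Dice over the images of each fold, and takes the population standard deviation (dividing by the number of folds) of the per-fold means. The exact normalisation in the paper's Eqs. (5) and (6) is not recoverable from this text, and all function names are hypothetical.

```python
import math


def dice(pred, truth):
    """Dice coefficient 2|A & B| / (|A| + |B|) for two 0-1 masks (nested lists)."""
    a = [v for row in pred for v in row]
    b = [v for row in truth for v in row]
    intersection = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def fold_dice(preds, truths):
    """Mean Dice over the m images of one held-out subset."""
    return sum(dice(p, t) for p, t in zip(preds, truths)) / len(preds)


def mdice_stddice(fold_scores):
    """Mean and (population) standard deviation of the per-fold Dice scores."""
    r = len(fold_scores)
    mean = sum(fold_scores) / r
    variance = sum((s - mean) ** 2 for s in fold_scores) / r
    return mean, math.sqrt(variance)


def split_folds(n_items, k=5):
    """Split item indices 0..n_items-1 into k contiguous, equally sized subsets."""
    size = n_items // k
    return [list(range(i * size, (i + 1) * size)) for i in range(k)]


folds = split_folds(30)  # 30 images -> five subsets of six images
pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
print(len(folds), len(folds[0]))
print(dice(pred, truth))
print(mdice_stddice([0.8, 0.9, 1.0]))
```

In a full run, each of the five subsets would serve once as the held-out set, `fold_dice` would be computed on it, and the five resulting scores would be passed to `mdice_stddice`.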
The experiments were conducted on a computer with an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz, an Nvidia GeForce GTX 1080 Ti, 16 GB RAM, and a Samsung SSD 850 EVO 500 GB. The operating system is Windows 10 (1801). All experiments were run with the Keras framework.

The electron microscope cell dataset used in U-Net contains 30 images [16], each of 512 × 512 pixels. To allow comparison with U-Net, we choose 30 images from each of the other five datasets (Retinal vessel extraction, Nuclei, Lung, Cervical Cytology, Skin Lesion), so that all datasets have the same size. Detailed information about the datasets is presented in Table 1.

First, we explore the impact of depth on U-Net [16] by reducing and increasing its number of layers. Second, starting from the shallower U-Net, we remove more layers to obtain a smaller backbone, add four Residual Blocks (RB) to the up-down sampling path (two RBs on the upsampling path and two on the downsampling path), and examine the positions of the Residual Blocks. Third, using the best positions of the RBs, Inception, the Dense Block and the Residual Block are combined into the Mini-Inception-Residual-Dense Block, which replaces the RBs in the downsampling path, and the pooling layers are removed.

Each convolution in the block is followed by BN [8] and ReLU [21]. We use the Adam optimizer with the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e−8. The sigmoid function is used in the last layer because our target is a binary classification problem. Because of the limited GPU memory, a batch size of 3 is used and the number of epochs is set to 30. Each training image and its corresponding label are rotated together counterclockwise by 90°, 180° and 270° to enlarge the dataset. The kernel size is 3 × 3 and the stride is 1 in all convolutional layers except for specific layers in the block.
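The rotation-based augmentation described above can be sketched as follows. This is a minimal illustration on nested lists and not the authors' Keras pipeline; the function names are hypothetical. The point it shows is that the image and its label must be rotated by the same angle so that they stay aligned.

```python
def rot90_ccw(matrix):
    """Rotate a 2D nested list counterclockwise by 90 degrees."""
    # Transposing and then reversing the row order gives a counterclockwise turn.
    return [list(row) for row in zip(*matrix)][::-1]


def augment(image, mask):
    """Return the original pair plus copies rotated by 90, 180 and 270 degrees."""
    pairs = [(image, mask)]
    for _ in range(3):
        image, mask = rot90_ccw(image), rot90_ccw(mask)
        pairs.append((image, mask))
    return pairs


image = [[1, 2],
         [3, 4]]
mask = [[1, 0],
        [0, 0]]

for rotated_image, rotated_mask in augment(image, mask):
    print(rotated_image, rotated_mask)
```

With this scheme each training image yields four samples, so a training split of 24 images (four of the five subsets of a 30-image dataset) would grow to 96 samples.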
For the block used in the experiments, a 1 × 1 convolutional layer is attached to the output of the MIRD Block, and f(·) in the Dense Block (Eq. (2)) actually consists of BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3). Cross-entropy is used as the loss function for all networks.

We apply the deeper U-Net (DU-Net), U-Net, the shallower U-Net (SU-Net), the Residual-Shallow U-Net with different positions of the Residual Blocks (see Fig. 2(a-d)) and MIRD-Net to six segmentation tasks (see Table 1). The Dice coefficients and the parameter counts of the networks are reported in Table 2 and Table 3. The segmentation results on some example images are shown in Fig. 6.

Table 2. Average Dice coefficient and its standard deviation for 5-fold cross validation.

Compared with U-Net, DU-Net decreases the accuracy, whereas SU-Net performs better on Nuclei and Vessel (Table 2). This suggests that DU-Net is prone to overfitting. The position of the Residual Blocks in the up-down sampling path affects the performance of the network. RSU(3, 5, 8, 10) (Fig. 2(a)) performs best among the four RSU-Nets we examined and outperforms U-Net on all six datasets. Moreover, MIRD-Net brings a clear further improvement. MIRD-Net has only about 1/50 of the parameters of U-Net (Table 3), which saves storage. Where the differences are slight and hard to see directly, we highlight them with red and green circles (Fig. 6). Red circles mark regions where the predicted mask is incomplete or incorrect compared with the label; green circles in the MIRD-Net results mark regions where MIRD-Net performs better than the other networks. Although a few incomplete or incorrect masks remain in the final results, MIRD-Net outperforms the other networks discussed here in segmenting tiny structures and the edges of targets.
MIRD-Net segments better than U-Net for the following reasons: (1) there are no pooling layers in MIRD-Net, a design that helps alleviate the loss of information during forward propagation; (2) the different kernels (1 × 1 and 3 × 3) used in the MIRD Block let the network capture large-structure and tiny-structure information simultaneously; (3) MIRD-Net not only uses the standard skip connections of U-Net but also reuses the features from previous layers inside the MIRD Block, so that the network learns more representative features; (4) the connections used in the MIRD Block alleviate gradient vanishing during training.

In this paper, we propose a new symmetric deep neural network for medical image segmentation. The new network takes advantage of Inception, Res-Net and Dense-Net and outperforms U-Net on six different image segmentation tasks, with only about 1/50 of U-Net's parameters. Furthermore, the MIRD Block of the proposed architecture can easily be added to other backbones as a functional module. One shortcoming is the way the position of the MIRD Block is selected: we have not proven theoretically that the chosen position is the best one. Future research will focus on the relationship between performance and the position of the MIRD Block in different backbones, on finding a better strategy to determine that position, and on simplifying the structure.
[1] Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation
[2] 3D U-Net: learning dense volumetric segmentation from sparse annotation
[3] Deep neural networks segment neuronal membranes in electron microscopy images
[4] The importance of skip connections in biomedical image segmentation
[5] Fully dense UNet for 2D sparse photoacoustic tomography artifact removal
[6] Deep residual learning for image recognition
[7] Densely connected convolutional networks
[8] Batch normalization: accelerating deep network training by reducing internal covariate shift
[9] The one hundred layers Tiramisu: fully convolutional DenseNets for semantic segmentation
[10] Model-based segmentation of vertebral bodies from MR images with 3D CNNs
[11] Efficient inference in fully connected CRFs with Gaussian edge potentials
[12] H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes
[13] Y-Net: joint segmentation and classification for diagnosis of breast biopsy images
[14] V-Net: fully convolutional neural networks for volumetric medical image segmentation
[15] Mitosis detection for invasive breast cancer grading in histopathological images
[16] U-Net: convolutional networks for biomedical image segmentation
[17] Fully convolutional networks for semantic segmentation
[18] Very deep convolutional networks for large-scale image recognition
[19] Rethinking the inception architecture for computer vision
[20] Going deeper with convolutions
[21] Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS)
[22] Multi-scale context aggregation by dilated convolutions
[23] 2015 IEEE International Conference on Computer Vision (ICCV)
[24] Three-dimensional CT image segmentation by combining 2D fully convolutional network with 3D majority voting
[25] UNet++: a nested U-Net architecture for medical image segmentation