key: cord-0058285-xvcvv32t
authors: Pham, Duc Duy; Dovletov, Gurbandurdy; Pauli, Josef
title: A Differentiable Convolutional Distance Transform Layer for Improved Image Segmentation
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_31
sha: c825769845c588f846766e4998ba549cf5df89bc
doc_id: 58285
cord_uid: xvcvv32t

In this paper we propose a novel differentiable convolutional distance transform layer for segmentation networks such as U-Net to regularize the training process. In contrast to related work, we do not need to learn the distance transform, but use an approximation that can be achieved by means of the convolutional operation. The distance transform is therefore directly applicable without previous training, and it is differentiable, ensuring the gradient flow during backpropagation. First, we present the derivation of the convolutional distance transform by Karam et al. [6]. Then we address the problem of numerical instability for large images by presenting a cascaded procedure with locally restricted convolutional distance transforms. Afterwards, we discuss the issue of non-binary segmentation outputs for the convolutional distance transform and present our solution attempt for the incorporation into deep segmentation networks. We then demonstrate the feasibility of our proposal in an ablation study on the publicly available SegTHOR data set.

In medical image computing, semantic segmentation of anatomical structures from various imaging modalities is a crucial task to aid in image-based diagnostics. Research on automated segmentation methods is therefore a major topic in the medical computing domain, since manual segmentation is expensive and time-consuming. Deep learning strategies in particular have become popular approaches to achieve state-of-the-art results. In supervised settings, these usually require a desired ground truth segmentation, used to calculate a loss function that is minimized during training. Mean squared error, categorical cross entropy, and Dice loss are common error functions which directly make use of the ground truth and are applied for segmentation tasks. These kinds of error functions, however, usually employ a pixel-to-pixel comparison and therefore reduce the segmentation task to a pixel-wise classification task. They do not directly leverage information about higher-order features, such as shape or texture. Specifically, pixels are considered independently of each other; thus, error correction for one pixel does not influence the error of another pixel. Consequently, recent research aims at incorporating shape priors into the segmentation process by means of an additional regularization term within the loss function that also captures higher-order information. One possible option is to infer shape information by means of a learned latent representation of the ground truth segmentation. Oktay et al. [9] utilize a pre-trained autoencoder for shape preservation. The autoencoder's encoding component is used to regularize the weight adaptation process of a generic segmentation network during training. This is motivated by Girdhar et al.'s work on establishing 3D representations of objects from 2D images [5]. Pham et al. [11] present a 2D end-to-end architecture in which an autoencoder, trained for shape representation, is imitated in latent space by a separate encoder, to directly leverage the autoencoder's decoder for shape-consistent segmentation.
Another strategy is to include a well-established shape representation, particularly the distance transform, into the learning process. Compared to a one-hot encoded segmentation mask, the corresponding distance transform contains, in each channel and in every pixel, distance information about the closest object boundary, whereas a binary mask only holds binary information about whether the structure of interest is present or not. In Fig. 1 the differences between binary images and (Manhattan) distance transforms are illustrated on a simple toy example in which two pixel values are swapped. For the binary representation, one can notice that only the affected pixels yield a difference, whereas in the corresponding distance transforms, the simple swap has a larger impact on the distance transform's landscape. Rousson et al. [13] leverage the idea of shape representation by means of signed distance transforms, proposed by Paragios et al. [10], to incorporate shape priors into the level set framework. Cremers et al. [3] also base their work on the distance transform's shape representation to enforce shape priors. Naturally, incorporating the distance transform into deep neural networks is a plausible step to model inter-pixel relationships, as also noted in Ma et al.'s work [7]. Dangi et al. [4] apply distance map regression in a multi-task learning setting for cardiac MR image segmentation. They propose a regularization framework by formulating a Euclidean distance map regression objective that is pursued by a sub-network of their segmentation architecture. In a related fashion, Bai and Urtasun [1] incorporate the watershed transform by essentially learning the distance transform within image objects for instance segmentation. Bui et al. [2] propose a similar multi-task approach, in which the geodesic distance is approximated as a learning task for neonatal brain segmentation. Similarly, Navarro et al. [8] also include the learning task of distance transform approximation in their multi-task segmentation approach. In these contributions, however, the distance transform needs to be learned, since implementations of the distance transform are often not differentiable. In this work we propose using Karam et al.'s [6] derivation of a convolutional distance transform approximation for application in a deep learning context. To the best of our knowledge, this is the first time an ad hoc differentiable convolutional distance transform layer is proposed for deep segmentation networks. In Sect. 2 we discuss the underlying methods, starting with Karam et al.'s [6] Convolutional Distance Transform, subsequently proposing a cascaded variant for large images, and suggesting how to embed it into deep learning frameworks. In Sect. 3 we present our experimental setting and discuss our results in Sect. 4 before concluding with Sect. 5.

Fig. 1 (caption fragment): For the corresponding (Manhattan) distance transforms (c) and (d), differences propagate to further pixels, emphasized in (d), as these change the foreground shape and thus the distance landscape.
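To make the toy example of Fig. 1 concrete, the following minimal sketch (ours, not taken from the paper) computes exact Manhattan distance transforms with SciPy and shows that swapping two pixel values changes the binary images in exactly two pixels, but changes the distance landscape over a much larger region:

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

def manhattan_dt(binary):
    """Distance of every pixel to the closest foreground (value 1) pixel."""
    # SciPy measures distances from nonzero pixels to the nearest zero pixel,
    # so we invert the mask to measure distances *to* the foreground.
    return distance_transform_cdt(1 - binary, metric='taxicab')

img = np.zeros((8, 8), dtype=np.uint8)
img[2, 2] = 1
swapped = img.copy()
swapped[2, 2], swapped[2, 3] = 0, 1   # swap two neighboring pixel values

# The binary difference is confined to the two swapped pixels ...
print((img != swapped).sum())                            # -> 2
# ... but the distance transforms differ in far more pixels.
print((manhattan_dt(img) != manhattan_dt(swapped)).sum())
```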
A distance map of a binary image yields the distance of each pixel to its closest foreground boundary pixel. Common applicable distances are the Manhattan and the Euclidean distance. A major advantage of this type of image representation is the provision of information about boundary, shape, and location of the object of interest. We denote the distance between two pixel positions $p_i$ and $p_j$ in an image as $d(p_i, p_j)$. Then a distance transform $D_I : M \times N \to \mathbb{R}^+_0$ for a binary image $I$ of resolution $M \times N$ can be defined pixel-wise as:

$$D_I(p_i) = \min_{p_j : I(p_j) = 1} d(p_i, p_j). \tag{1}$$

To incorporate an ad hoc distance transform into the deep learning setting, we follow Karam et al.'s [6] derivation and only consider translation invariant distances, i.e. distances satisfying

$$d(p_i + p_k, p_j + p_k) = d(p_i, p_j) \tag{2}$$

for any image positions $p_i, p_j \in M \times N$ and any translation $p_k \in \mathbb{R} \times \mathbb{R}$. Although most distances are translation invariant, this restriction needs to be mentioned, as there are also counterexamples, such as the SNCF distance. For the computation of the distance transform, Eq. (1) shows that we need to find the minimal distance to a boundary pixel from a given point. The minimum function can be approximated by a log-sum-exponential. Let $d_1, \ldots, d_n$ denote $n$ distances; then the minimum function can be reformulated as

$$\min(d_1, \ldots, d_n) \approx -\lambda \ln\left( \sum_{l=1}^{n} \exp\left(-\frac{d_l}{\lambda}\right) \right) \tag{3}$$

for $1 > \lambda > 0$. The idea is that the exponential yields very small values the larger the distances are, as these are artificially increased by $\lambda$ and negated in the argument. Therefore, larger distances have a significantly smaller impact on the sum than small distances. In the extreme case the exponentials of large distances approach zero, leaving only the exponential of the smallest distance in the sum. The subsequent logarithmic function then reverts the exponential operation, leaving an approximation of the minimum function. With Eq. (3), it is possible to reformulate the distance transform of Eq. (1) to

$$D_I(p_i) \approx -\lambda \ln\left( \sum_{p_j : I(p_j) = 1} \exp\left(-\frac{d(p_i, p_j)}{\lambda}\right) \right). \tag{4}$$

Since a translation invariant distance is assumed, the distance between two points can be rewritten to

$$d(p_i, p_j) = d(p_i - p_j, 0). \tag{5}$$

Therefore, the distance transform can be formulated as

$$D_I(p_i) \approx -\lambda \ln\left( \sum_{p_j} I(p_j) \cdot \exp\left(-\frac{d(p_i - p_j, 0)}{\lambda}\right) \right), \tag{6}$$

in which the sum is the definition of a convolution. Thus, for a small $\lambda > 0$, the distance transform can be approximated by means of a convolution of the binary image $I$ with a kernel $\exp\left(-\frac{d(\cdot, 0)}{\lambda}\right)$, i.e.:

$$D_I \approx -\lambda \ln\left( I * \exp\left(-\frac{d(\cdot, 0)}{\lambda}\right) \right), \tag{7}$$

where $*$ is the convolutional operator. Since all operations are differentiable, this approximation may be integrated as a differentiable convolutional distance transform layer into current deep learning frameworks. It shall be noted that Karam et al.'s work [6] also proposes variants of this convolutional distance transform. Initial experiments, however, showed the most promising results for the presented formulation.

A major drawback of the convolutional design of the distance transform (in all variants) is that the kernel size theoretically needs to be as large as the diagonal of the input image. This is to ensure that even very sparse binary images can be distance transformed by the proposed method. Otherwise, background pixels that are not within kernel reach of a foreground pixel would be assigned a distance of 0. This circumstance, however, yields the following two issues:

• The large kernel size leads to an increased computational complexity for the convolutional operation.
• For very large distances the exponential term for the kernel design in Eq. (7) may approach zero, decreasing the numeric stability of the logarithmic expression within the convolutional distance transform (CDT). This issue particularly arises for large images with only few foreground pixels.

Figure 2(c) shows the CDT of a toy example image (Fig. 2(a)). It is clearly visible that the CDT was only capable of calculating the distances for a specific range before becoming unstable, in comparison with a standard Manhattan distance transform implementation in Fig. 2(b). We address these issues by proposing a cascade of local distance transforms to reduce the computational complexity and overcome the numerical instability.
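As an illustration, the following NumPy/SciPy sketch implements Eq. (7) for the Manhattan distance (our reading of the formula, not the authors' code). The full-size kernel and the guard on the logarithm make both issues above tangible:

```python
import numpy as np
from scipy.signal import fftconvolve

def convolutional_dt(binary, lam=0.35, eps=1e-35):
    """Approximate D_I = -lam * ln(I * exp(-d(.,0)/lam)) as in Eq. (7)."""
    h, w = binary.shape
    # Kernel covering the whole image, as the derivation theoretically requires:
    # entry (dy, dx) holds exp(-(|dy| + |dx|) / lam), the Manhattan distance to 0.
    ys = np.abs(np.arange(-(h - 1), h))[:, None]
    xs = np.abs(np.arange(-(w - 1), w))[None, :]
    kernel = np.exp(-(ys + xs) / lam)
    conv = fftconvolve(binary.astype(np.float64), kernel, mode='same')
    # eps guards the logarithm where the sum underflows to zero; this is exactly
    # the numerical instability discussed above for large, sparse images.
    return -lam * np.log(np.maximum(conv, eps))

img = np.zeros((32, 32))
img[16, 16] = 1.0
print(np.round(convolutional_dt(img)[16, 20], 2))  # ~4.0: Manhattan distance to (16, 16)
```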
Instead of directly computing the distance transform with a large kernel, we suggest cascading distance transforms with smaller kernels to approximate the actual transform. Since the kernel size determines the maximal distance that can be measured, it is necessary to accumulate the calculated distances to form the final distance transform approximation. Let $k$ denote the kernel size. Then the maximal distance to a foreground point that can be captured by the CDT is limited to a range of $\frac{k}{2}$. For all background points that are further away than $\frac{k}{2}$ from a foreground point, Eq. (7) yields a distance of 0, as within the kernel range there are only background points. The idea is to iteratively extend the binary input image by the area for which a distance calculation was possible by the locally restricted CDT, i.e. by all points which fulfill the condition that the calculated distance is greater than zero. This extended binary image can then be used to compute a new locally restricted distance transform by means of the small kernel. The calculated distances can then be combined with the distances of the previous iterations to form the final distance transform. For the $i$-th iteration, let $I^{(i)}$ denote the extended binary image, and let $D_I^{(i)}$ denote the local CDT of $I^{(i)}$. For the $i$-th iteration we assume that the original foreground area has been extended by a margin of $i \cdot \frac{k}{2}$. Therefore this offset distance is additionally added to the current distances to compensate for the smaller kernel size. Thus, the cascaded distance transform $D^*_I$ is updated with the current distances by adding $i \cdot \frac{k}{2} + D_I^{(i)}$ at the newly covered points. At most $\lceil \mathrm{diag} / \frac{k}{2} \rceil$ of such local distance transforms are necessary to cover the whole image, where diag denotes the image diagonal. Algorithm 1 summarizes this suggested procedure.

Let $w, h$ denote width and height of the input image and $k$ the kernel size used to compute the CDT. Then in general its computational complexity is given by $O(w \cdot h \cdot k^2)$ operations. Our proposed procedure can drastically reduce the number of operations from an initial $O(w \cdot h \cdot \mathrm{diag}^2)$ (for a naive implementation without using separable kernels) to $O(w \cdot h \cdot k \cdot \mathrm{diag})$ (also for a naive implementation without using separable kernels), if the kernel size is chosen much smaller than the image diagonal, i.e. $k \ll \mathrm{diag}$.

Since the maximally possible measured distance $d(\cdot, 0)$ in Eq. (7) is restricted by the kernel size, a small kernel size additionally yields a more stable logarithmic expression. Comparing the cascaded result with the standard distance transform in Fig. 2(b), however, it becomes apparent that the offset assumption after each iteration yields an error that is propagated to points with further distances. This error decreases with increasing kernel size. Thus, with our proposed procedure there is a trade-off between numerical stability by means of smaller kernel sizes and accuracy through larger kernel sizes that needs to be considered. We argue that for the purpose of considering inter-pixel relationships in the weight optimization process of training a convolutional neural network, this approximation of the distance transform suffices. Karam et al. [6] also address this issue by multiplexing multiple $\lambda$ values; however, initial experiments showed that the choice of these values heavily influences the result.
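For concreteness, the following sketch is our reconstruction of the cascaded procedure summarized in Algorithm 1, under the assumptions stated above (not the authors' implementation); the kernel size k and λ are illustrative choices:

```python
import numpy as np
from scipy.ndimage import convolve

def cascaded_cdt(binary, k=9, lam=0.35):
    """Cascade of locally restricted CDTs with a small k x k Manhattan kernel."""
    offs = np.abs(np.arange(k) - k // 2)
    kernel = np.exp(-(offs[:, None] + offs[None, :]) / lam)  # exp(-d(.,0)/lam)
    extended = binary.astype(np.float64)   # I^(0): the original binary image
    total = np.zeros_like(extended)        # D*_I: accumulated distances
    n_iter = int(np.ceil(np.hypot(*binary.shape) / (k / 2)))
    for i in range(n_iter):
        conv = convolve(extended, kernel, mode='constant')
        # Newly reached points: some foreground now lies inside the kernel
        # window, and the point was not covered in any previous iteration.
        reached = (conv > 0) & (extended == 0)
        if not reached.any():
            break
        local = -lam * np.log(conv[reached])     # locally restricted CDT
        total[reached] = i * (k / 2) + local     # compensate with offset i*k/2
        extended[reached] = 1.0                  # I^(i+1): grow the binary mask
    return total
```

Direct convolution (scipy.ndimage.convolve) is used here instead of an FFT so that unreached points yield an exact zero, which keeps the `reached` test clean.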
The previous section describes an ad hoc cascaded convolutional approximation method of the distance transform for binary images. We propose using this approximation to extend common segmentation networks, such as Ronneberger et al.'s U-Net [12], in order to equip the segmentation loss with an additional regression loss, which compares the distance transform of the network's prediction with the distance transform of the ground truth. Since distance transforms are particularly sensitive to distortions, comparing the distance transformed ground truth to the distance transformed prediction may lead to less noisy segmentation results. Figure 3 shows the general idea of how to extend the U-Net segmentation network with the proposed distance transform layer. In addition to the usual segmentation loss, e.g. the Dice loss, the predicted segmentation and the ground truth segmentation are both passed through the cascaded CDT layer to obtain the distance transforms of prediction and ground truth, respectively. These distance transforms contribute to a regression loss, e.g. the mean squared error, that considers inter-pixel relationships through the distance transforms.

However, it needs to be noted that the segmentation output is usually not binary. Assuming a final softmax or sigmoid layer, the output values for each channel vary between 0 and 1. It was necessary to assume a binary image to be able to restructure Eqs. (4) to (6). A major disadvantage of Eq. (6) is that for gray scale images, $I(p_i)$ may be a lot larger than the exponential for large distances, even if $I(p_i)$ is rather small. Therefore, even small probabilities of the segmentation output are considered in the sum and may be depicted as foreground pixels, distorting the actual distance map, as can be seen in Fig. 4. Here a toy example of a gray scale image is shown (Fig. 4(a)), in which the lower left square is set to a very low intensity of 0.001. The computed CDT (Fig. 4(c)), however, appears nearly identical to the CDT of the corresponding binary toy example (Fig. 4(b)). As can be observed in Fig. 4(d), the only differences between the computed distance transforms occur within the low intensity pixels, whereas the remaining distance transform landscape does not show any changes. This is very problematic, as in the segmentation prediction even pixels with very low probabilities would then be considered as foreground pixels in the CDT. We address this problem by proposing the following soft-threshold workaround. Let $C$ denote the number of classes, and let $y_c(p_i)$ be the prediction for class $c$ at position $p_i$. Then we can soft-threshold the prediction by

$$\tilde{y}_c(p_i) = \max\left( y_c(p_i) - \frac{C-1}{C},\; 0 \right).$$

This soft-threshold sets any prediction score below $\frac{C-1}{C}$ to zero. Therefore, we enforce strong and correct predictions, as weak correct prediction scores are not registered for the distance transform and negatively impact the regression loss. Figure 4(e) shows the resulting CDT after applying the proposed soft-threshold, assuming two classes. Low intensity pixels are considered as background, so that a significant change in the distance transform landscape compared to the reference can be observed in Fig. 4(f).
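As a minimal sketch of how the soft-threshold and the combined objective might look in TensorFlow (our interpretation; `dice_loss` and the `cdt_layer` wrapper are hypothetical placeholders, not the authors' code):

```python
import tensorflow as tf

def soft_threshold(y, num_classes):
    # Zero out prediction scores below (C-1)/C; ReLU keeps it differentiable.
    return tf.nn.relu(y - (num_classes - 1) / num_classes)

def dice_loss(y_true, y_pred, eps=1e-6):
    # Standard soft Dice loss over all channels.
    inter = tf.reduce_sum(y_true * y_pred)
    return 1.0 - 2.0 * inter / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

def total_loss(y_true, y_pred, cdt_layer, num_classes, w_dist=0.5):
    # cdt_layer: the cascaded convolutional distance transform layer applied
    # per channel; a hypothetical stand-in for the layer sketched above.
    dt_true = cdt_layer(y_true)                               # ground truth is binary
    dt_pred = cdt_layer(soft_threshold(y_pred, num_classes))  # thresholded prediction
    dist = tf.reduce_mean(tf.square(dt_true - dt_pred))       # MSE regression loss
    return dice_loss(y_true, y_pred) + w_dist * dist          # L_total = L_dice + w * L_dist
```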
We conducted ablation studies on the example of thoracic segmentation from CT scans of the thorax, using the publicly available SegTHOR CT data set [14], which consists of 40 CT volumes with corresponding ground truths of esophagus, heart, trachea, and aorta for training, and 20 CT volumes for testing. The goal of our ablation study is to investigate the influence of our proposed distance transform layer on the segmentation output. Therefore, we did not aim to outperform optimized ensemble methods within the SegTHOR challenge, but placed value on a valid comparison. Thus, we trained a U-Net architecture with and without our proposed layer on the same training and validation set to ensure fair comparability. In a hold-out validation manner, we trained both models on the 40 available training volumes and submitted the predictions. For evaluation, the Dice Similarity Coefficient (DSC) and the Hausdorff Distance (HDD) were considered as evaluation metrics, both provided by the challenge's submission platform.

We implemented a 2D U-Net and the convolutional distance transform layer in TensorFlow 1.12 with an input size of 256 × 256. Thus, we resized the CT slices to the corresponding image size. Our U-Net implementation comprises 5 image size levels with 2 convolutional layers, batch normalization, and a max-pooling layer in each level. Starting with 32 kernels for each convolutional layer in the first size level of the contracting path, we doubled the number of kernels for each size level on the contracting side and halved the number of kernels on the expansive side. We used a kernel size of 3 × 3 for every convolutional layer and 2 × 2 max-pooling in each architecture. We used the standard Dice loss L_dice as loss function for U-Net and the mean squared error for the regression loss L_dist of the distance transforms. We constructed a total loss function L_total := L_dice + w_dist · L_dist with weight w_dist := 0.5 to train the U-Net equipped with our additional distance transform layer. The optimization was performed with an Adam optimizer with an initial learning rate of 0.001. The training slices were augmented by means of random translation, rotation, and zooming. With a batch size of 4, we trained both models for 200 epochs and chose the model with the best validation loss for evaluation. For the distance transform layer, we chose λ := 0.35, as suggested by Karam et al. [6]. The experiments were conducted on a GTX 1080 TI GPU. It should be noted that the main focus is the ablation study, and that resizing the images to 256 × 256 naturally decreases the segmentation quality in comparison to methods that use the full image size.

Table 1 shows the achieved DSC scores of the trained models for esophagus, heart, trachea, and aorta. It is observable that with the extension of the convolutional distance transform the scores increase for all organs except the aorta. In this case the standard U-Net yields marginally better results. While the DSC score improves by more than 1% for the other organs, for the aorta the additional layer does not seem to bring any benefit. We applied Wilcoxon significance tests with a significance level of 5% to our evaluation results. We found that the improvements in DSC for trachea and esophagus are significant, with p-values of 0.0169 and 0.0040, respectively. For the DSC improvement in heart segmentation and the DSC difference in aorta segmentation, however, we could not find any significance. The improvements can also be noticed in Table 2, in which the Hausdorff distances are depicted. For esophagus, heart, and trachea the distances decrease with our proposed layer, showing an improvement of approximately 25-30%. However, for the aorta a slightly worse mean distance is observed. This may be due to the fact that the aorta is a rather simple structure that U-Net can already extract easily.
The improvements in Hausdorff distance are especially noteworthy, as we use a distance-based regularization technique to improve the segmentation. Regarding the Wilcoxon significance test, we could observe significant improvements for trachea, esophagus, and heart, with p-values of 0.0072, 0.0008, and 0.0400, respectively. This underlines our assumption that our proposed layer adds significant value for more complex shapes, whereas simple structures such as the almost circular aorta and heart slices are already well extracted by a standard U-Net. Figure 5 shows exemplary segmentations of both models on test data slices. The top images indicate better performance for esophagus segmentation with the proposed layer, while the bottom images show superior segmentation results with our layer for the trachea. In both the top and bottom rows, the segmentations of the aorta do not show much difference.

In this paper we propose a novel differentiable convolutional distance transform layer for segmentation networks that can be used ad hoc without prior training. We present a cascaded procedure to reduce the computational complexity and to overcome the numerical instability. Additionally, we suggest a soft-threshold workaround to address the demonstrated issue regarding non-binary segmentation outputs. The proposed layer is used to regularize the training process. We conducted ablation studies on the example of the segmentation of thoracic organs and used the SegTHOR data set for training and evaluation. The experiments show promising results, as compared to an equally trained U-Net our extension yields significant improvements for most organs, particularly regarding the Hausdorff distance. We demonstrated on this example that a combination of proven non-deep-learning concepts, such as the distance transform, with deep learning methods may yield great potential. In the future, we aim at extending our proposed layer to 3D segmentation networks.

References
[1] Bai, M., Urtasun, R.: Deep watershed transform for instance segmentation
[2] Bui, T.D., et al.: Multi-task learning for neonatal brain segmentation using 3D dense-Unet with dense attention guided by geodesic distance
[3] Cremers, D., et al.: Towards recognition-based variational segmentation using shape priors and dynamic labeling
[4] Dangi, S., et al.: A distance map regularized CNN for cardiac cine MR image segmentation
[5] Girdhar, R., et al.: Learning a predictable and generative vector representation for objects
[6] Karam, C., et al.: Fast convolutional distance transform
[7] Ma, J., et al.: How distance transform maps boost segmentation CNNs: an empirical study
[8] Navarro, F., et al.: Shape-aware complementary-task learning for multi-organ segmentation
[9] Oktay, O., et al.: Anatomically Constrained Neural Networks (ACNNs): application to cardiac image enhancement and segmentation
[10] Paragios, N., et al.: Matching distance functions: a shape-to-area variational approach for global-to-local registration
[11] Pham, D.D., et al.: Deep learning with anatomical priors: imitating enhanced autoencoders in latent space for improved pelvic bone segmentation in MRI
[12] Ronneberger, O., et al.: U-Net: convolutional networks for biomedical image segmentation
[13] Rousson, M., et al.: Shape priors for level set representations
[14] Trullo, R., et al.: Multiorgan segmentation using distance-aware adversarial networks