title: Chest X-ray lung and heart segmentation based on minimal training sets
authors: Maga, Balázs
date: 2021-01-20

As the COVID-19 pandemic aggravated the excessive workload of doctors globally, the demand for computer aided methods in medical imaging analysis increased even further. Such tools can result in more robust diagnostic pipelines which are less prone to human errors. In our paper, we present a deep neural network to which we refer as Attention BCDU-Net, and apply it to the task of lung and heart segmentation from chest X-ray (CXR) images, a basic but arduous step in the diagnostic pipeline, for instance for the detection of cardiomegaly. We show that the fine-tuned model exceeds previous state-of-the-art results, reaching $98.1 \pm 0.1\%$ Dice score and $95.2 \pm 0.1\%$ IoU score on the dataset of the Japanese Society of Radiological Technology (JSRT). Besides that, we demonstrate the relative simplicity of the task by attaining surprisingly strong results with training sets of size 10 and 20: in terms of Dice score, $97.0 \pm 0.8\%$ and $97.3 \pm 0.5\%$, respectively, while in terms of IoU score, $92.2 \pm 1.2\%$ and $93.3 \pm 0.4\%$, respectively. To achieve these scores, we capitalize on the mixup augmentation technique, which yields a remarkable gain above $4\%$ IoU score in the size 10 setup.

All around the world, a plethora of radiographic examinations are performed day to day, producing images using different imaging modalities such as X-ray, computed tomography (CT), diagnostic ultrasound and magnetic resonance imaging (MRI). According to the publicly available, official data of the National Health Service [13], in the period from February 2017 to February 2018, the count of imaging activity was about 41 million in England alone. The thorough examination of this vast quantity of images imposes a huge workload on radiologists, which increases the number of avoidable human mistakes. Consequently, automated methods aiding the diagnostic processes are sought-after. The examination of medical images customarily includes various segmentation tasks, in which detecting and pixelwise annotating different tissues and certain anomalies are vital. Common examples include lung nodule segmentation in the diagnosis of lung cancer, lung and heart segmentation in the diagnosis of cardiomegaly, or plaque segmentation in the diagnosis of thrombosis. Even in the case of 2-dimensional modalities, such segmentation tasks can be extremely time-demanding, and the situation gets even worse in three dimensions. Taking into consideration that these tasks are easier to formalize as a standard computer vision exercise than the identification of a particular disease, it is not surprising that they sparked much activity in the field of automated medical imaging analysis. Semantic segmentation, that is, assigning a pre-defined class to each pixel of an image, requires a high level of visual understanding, in which state-of-the-art performance is attained by methods utilizing Fully Convolutional Networks (FCN) [4]. An additional challenge of the field is posed by the strongly limited quantity of training data on which one can train machine learning models, as annotating medical images requires specialists, in contrast to "real-life" images.
To overcome this difficulty, the so-called U-Net architecture was proposed: its capability of being efficiently trained on small datasets was demonstrated in [5]. Over the past few years several modifications and improvements have been proposed to the original architecture, some of which involved different attention mechanisms designed to help the network detect the important parts of the images. In the present paper we introduce a new network primarily based on the ideas of [12] and [8], to which we refer as Attention BCDU-Net. We optimize its performance through hyperparameter tests on the depth of the architecture and the loss function, and we demonstrate the superiority of the resulting model compared to the state-of-the-art network presented in [15] in the task of lung and heart segmentation on chest X-rays. Besides that, we also give an insight into two phenomena arising during our research which might be interesting for the AI medical imaging community: one of them is the very small data requirement of this particular task, while the other one is the peculiar evolution of the loss curves over the training.

2 Deep learning approach

2.1 Related work

As already mentioned in Section 1, [5], introducing U-Nets, is of paramount importance in the field. Since then U-Nets have been used to cope with diverse medical segmentation tasks, and numerous papers aimed to design U-Net variants and mechanisms so that the resulting models better tackle the problem considered. Some of these paid primary attention to the structure of the encoder and the decoder, that is, the downsampling and upsampling paths of the original architecture. For example in [18], the authors developed a network (CoLe-CNN) with multiple decoder branches and an Inception-v4-inspired encoder to achieve state-of-the-art results in 2-dimensional lung nodule segmentation. In [10] and [14], the authors introduced U-Net++, a network equipped with intermediate upsampling paths and additional convolutional layers, leading to essentially an efficient ensemble of U-Nets of varying depths, and demonstrated its superiority compared to the standard U-Net in many image domains. Other works put emphasis on the design of skip connections and the way the higher resolution semantic information joins the features coming through the upsampling branch. In [12], the authors proposed the architecture BCDU-Net, in which, instead of the simple concatenation of the corresponding filters, the features of different levels are fused using a bidirectional ConvLSTM layer, which introduces nonlinearity into the model at this point and makes more precise segmentations available. In [8] it was shown that for medical image analysis tasks the integration of so-called Attention Gates (AGs) improved the accuracy of the segmentation models, while preserving computational efficiency. In [15], this network was enhanced by a critic network in a GAN-like scheme following [9], and achieved state-of-the-art results in the task of lung and heart segmentation. Other attention mechanisms were introduced in [17] and in [16]. The network architecture Attention BCDU-Net we propose is a modification of the Attention U-Net, shown in Figure 1.

(Figure caption: the attention gate of [8], in which the tensor addition to be altered is highlighted by an arrow.)
In [12], the authors demonstrated that it is beneficial to use bidirectional ConvLSTM layers to introduce nonlinearity in the step of merging the semantic information gained through skip connections and the features arriving through the decoder. This inspired us to modify the attention gates (see Figure 2) in a similar manner: in the original gate, these pieces of information are merged via tensor addition, which is a linear operation as well. This addition is replaced by a bidirectional ConvLSTM layer, to which the outputs of $W_g$ and $W_x$ (the processed features and the semantic information, respectively) are fed in this order. We note that, to the best of our knowledge, there is a slight ambiguity about the structure of the resampling steps in the attention gate: while the official implementation is in accordance with the figure, there are widely utilized implementations in which the output of $W_g$ is upsampled instead of downsampling the output of $W_x$ in order to match their shapes. We tested both solutions and did not experience a measurable difference in performance. We also experimented with the usage of additional spatial and channel attention layers as proposed by [17]; however, we found that they do not improve the performance of our model. The depth of the network is to be determined by hyperparameter testing. Our tests confirmed that four downsampling steps result in the best performance; however, the differences are minuscule.

A standard score to compare segmentations is the Intersection over Union (IoU): given two sets of pixels $X, Y$, their IoU is

$$\mathrm{IoU}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.$$

In the field of medical imaging, the Dice Score Coefficient (DSC) is probably the most widespread and simple way to measure the overlap ratio of the masks and the ground truth, and hence to compare and evaluate segmentations. It is a slight modification of IoU: given two sets of pixels $X, Y$, their DSC is

$$\mathrm{DSC}(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}.$$

If $Y$ is in fact the result of a test about which pixels are in $X$, we can rewrite it with the usual notation of true/false positives (TP/FP) and false negatives (FN) to be

$$\mathrm{DSC} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.$$

We would like to use this concept in our setup. The class $c$ we would like to segment corresponds to a set, but it is more appropriate to consider its indicator function $g$, that is, $g_{i,c} \in \{0, 1\}$ equals 1 if and only if the $i$th pixel belongs to the object. On the other hand, our prediction is a probability for each pixel, denoted by $p_{i,c} \in [0, 1]$. Then the Dice Score of the prediction, in the spirit of the above description, is defined to be

$$\mathrm{DSC}_c = \frac{2\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} p_{i,c} + \sum_{i=1}^{N} g_{i,c} + \varepsilon},$$

where $N$ is the total number of pixels, and $\varepsilon$ is introduced for the sake of numerical stability and to avoid division by 0. The IoU of the prediction can be calculated in a similar manner. The linear Dice Loss (DL) of the multiclass prediction is then

$$\mathrm{DL} = \sum_{c}\left(1 - \mathrm{DSC}_c\right).$$

A deficiency of the Dice Loss is that it penalizes false negative and false positive predictions equally, which results in high precision but low recall. For example, practice shows that if the regions of interest (ROIs) are small, false negative pixels need to have a higher weight than false positive ones. Mathematically this obstacle is easily overcome by introducing weights $\alpha, \beta$ as tuneable parameters, resulting in the definition of the Tversky similarity index [1]:

$$\mathrm{TI}_c = \frac{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \alpha \sum_{i=1}^{N} \overline{p}_{i,c}\, g_{i,c} + \beta \sum_{i=1}^{N} p_{i,c}\, \overline{g}_{i,c} + \varepsilon},$$

where $\overline{p}_{i,c} = 1 - p_{i,c}$ and $\overline{g}_{i,c} = 1 - g_{i,c}$, that is, the overline simply stands for the complement of the class. The Tversky Loss is obtained from the Tversky index as the Dice Loss was obtained from the Dice Score Coefficient:

$$\mathrm{TL} = \sum_{c}\left(1 - \mathrm{TI}_c\right).$$

Another issue with the Dice Loss is that it struggles to segment small ROIs as they do not contribute to the loss significantly.
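To make the modified gate concrete, below is a minimal sketch, assuming a TensorFlow/Keras implementation; the function name, filter counts, merge mode and the choice of downsampling the output of $W_x$ (rather than upsampling that of $W_g$) are our illustrative assumptions, not specifics taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_gate_biconvlstm(x, g, inter_channels):
    """x: skip-connection features (higher resolution), g: gating signal from the decoder."""
    # 1x1 convolutions (W_x and W_g) projecting both inputs to a common channel count;
    # the skip features are downsampled here to the gating signal's resolution.
    theta_x = layers.Conv2D(inter_channels, 1, strides=2, padding="same")(x)
    phi_g = layers.Conv2D(inter_channels, 1, padding="same")(g)

    # Stack W_g(g) and W_x(x) as a length-2 sequence along a new "time" axis and fuse
    # them with a bidirectional ConvLSTM instead of the original tensor addition.
    seq = layers.Lambda(lambda t: tf.stack(t, axis=1))([phi_g, theta_x])  # (batch, 2, H, W, C)
    fused = layers.Bidirectional(
        layers.ConvLSTM2D(inter_channels, 3, padding="same", return_sequences=False),
        merge_mode="ave",
    )(seq)

    # Standard attention-gate tail: ReLU, 1x1 convolution with sigmoid to obtain the
    # attention map, upsampling back to the skip resolution, and gating of the skip features.
    f = layers.Activation("relu")(fused)
    psi = layers.Conv2D(1, 1, padding="same", activation="sigmoid")(f)
    psi_up = layers.UpSampling2D(size=2, interpolation="bilinear")(psi)
    return layers.Lambda(lambda t: t[0] * t[1])([x, psi_up])  # broadcast the map over channels
```

With four downsampling steps, a gate of this kind would be applied at each of the four skip connections before the fused features enter the corresponding decoder block.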
This difficulty was addressed in [11], where the authors introduced the Focal Tversky Loss in order to improve the performance of their lesion segmentation model:

$$\mathrm{FTL} = \sum_{c}\left(1 - \mathrm{TI}_c\right)^{1/\gamma},$$

where $\gamma \in [1, 3]$. In practice, if a pixel is misclassified with a high Tversky index, the Focal Tversky Loss is unaffected. However, if the Tversky index is small and the pixel is misclassified, the Focal Tversky Loss will decrease significantly. In our work we use multiclass DSC and IoU to evaluate segmentation performance. As our initial tests demonstrated that training our network with the Focal Tversky Loss results in better scores, we use this loss function. The optimal $\alpha, \beta, \gamma$ parameters should be determined by extensive hyperparameter testing and grid search. We worked below with $\alpha = 0.6$, $\beta = 0.4$, $1/\gamma = 0.675$.

For training and validation data, we used the public Japanese Society of Radiological Technology (JSRT) dataset ([3]), available at [2]. The JSRT dataset contains a total of 247 images, all of them in 2048 × 2048 resolution with 12-bit grayscale levels. Both lung and heart segmentation masks are available for this dataset. In terms of preprocessing, similarly to [15], the images were first resized to the resolution 512 × 512. X-rays are grayscale images with typically low contrast, which makes their analysis a difficult task. This obstacle might be overcome by using some sort of histogram equalization technique. The idea of standard histogram equalization is spreading out the most frequent intensity values to a higher range of the intensity domain by modifying the intensities so that their cumulative distribution function (CDF) on the complete modified image is as close to the CDF of the uniform distribution as possible. Improvements might be made by using adaptive histogram equalization, in which the above method is not utilized globally, but separately on pieces of the image, in order to enhance local contrasts. However, this technique might overamplify noise in near-constant regions, hence our choice was to use Contrast Limited Adaptive Histogram Equalization (CLAHE), which counteracts this effect by clipping the histogram at a predefined value before calculating the CDF, and redistributing the clipped part equally among all the histogram bins.

Concerning data augmentation, we follow [7], in which the method mixup was used to improve glioma segmentation on brain MRIs. This slightly counter-intuitive augmentation technique was introduced by [6]: training data samples are obtained by taking random convex combinations of original image-mask pairs. That is, for image-mask pairs $(x_1, y_1)$ and $(x_2, y_2)$, we create a random mixed up pair $x = \lambda x_1 + (1-\lambda)x_2$, $y = \lambda y_1 + (1-\lambda)y_2$, where $\lambda$ is chosen from the beta distribution $B(\delta, \delta)$ for some $\delta \in (0, \infty)$. In each epoch, the original samples are paired randomly, hence during the course of the training, a multitude of training samples are fed to the network. (From the mathematical point of view, as the coefficient $\lambda$ is chosen independently in each case from a continuous probability distribution, the network will encounter pairwise distinct mixed up training samples with probability 1, modulo floating point inaccuracy.) In [6], the authors argue that generating training samples via this method encourages the network to behave linearly in-between training examples, which reduces the amount of undesirable oscillations when predicting outside the training examples.
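For illustration, below is a minimal sketch of the loss function and the augmentation described above, assuming TensorFlow/Keras and NumPy with channels-last, one-hot mask tensors; the function names, the batch-wise pairing, and the defaults (set to the $\alpha = 0.6$, $\beta = 0.4$, $1/\gamma = 0.675$ values quoted above and to $\delta = 0.2$) are our own choices, not code from the paper.

```python
import numpy as np
import tensorflow as tf

def focal_tversky_loss(y_true, y_pred, alpha=0.6, beta=0.4, inv_gamma=0.675, eps=1e-7):
    """Multiclass Focal Tversky Loss for tensors of shape (batch, H, W, classes)."""
    axes = [0, 1, 2]  # sum over the batch and spatial dimensions, keep the class axis
    tp = tf.reduce_sum(y_pred * y_true, axis=axes)
    fn = tf.reduce_sum((1.0 - y_pred) * y_true, axis=axes)  # weighted by alpha
    fp = tf.reduce_sum(y_pred * (1.0 - y_true), axis=axes)  # weighted by beta
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return tf.reduce_sum(tf.pow(1.0 - tversky, inv_gamma))  # sum of (1 - TI_c)^(1/gamma)

def mixup_batch(images, masks, delta=0.2, rng=None):
    """Pair the samples of a batch randomly and take convex combinations of image-mask pairs."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(delta, delta, size=(images.shape[0], 1, 1, 1)).astype(images.dtype)
    perm = rng.permutation(images.shape[0])
    return lam * images + (1.0 - lam) * images[perm], lam * masks + (1.0 - lam) * masks[perm]
```

In a training loop, mixup_batch would be applied to every batch before it is fed to the network, and focal_tversky_loss would be passed to the optimizer as the training objective.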
The choice of $\delta$ should be determined by hyperparameter testing for any network and task considered. In [6], $\delta \in [0.1, 0.4]$ is proposed, while in [7] $\delta = 0.4$ is applied. In our main tests, the JSRT dataset was randomly split so that 85% of it was used for training and the rest for validation and testing. This split was carried out independently in each case, enhancing the robustness of our results. Besides that, we also experimented with small dataset training, in which rather modest sets of 10 and 20 X-rays were utilized as training sets. (The test set remained the same.) This enabled us to measure the benefits of mixup more transparently. In each of these cases, we trained our network with the Adam optimizer: in the former case for 50 epochs, while in the latter cases for 1000 and 500 epochs, respectively. As these epoch numbers are approximately inversely proportional to the size of the training sets, these choices correspond to each other in terms of training steps.

Table 1 summarizes the numerical results we obtained during the testing of Attention BCDU-Net with different train sizes and choices of $\delta$, while Figures 3-5 display visual results. Note that the highest DSC scores slightly exceed the ones attained by the state-of-the-art, adversarially enhanced Attention U-Net introduced in [15] (97.6 ± 0.5%), with higher stability. The effect of augmentation is the most striking in the case of training on an X-ray set of size 10, when the choice $\delta = 0.2$ results in a 5% increase of IoU compared to the no mixup case. In general, we found this case particularly interesting: it was surprising that we could achieve IoU and DSC scores of this magnitude using such a small training set. Nevertheless, the predictions have some imperfections, displayed by Figure 3: the contours of the segmentation are less clear, and both the heart and the lung segmentations tend to contain small spots far from the ground truth. However, such conspicuous faults are unlikely to occur in the case of the best models for 20 train X-rays (Figure 4), which is still remarkable. The sufficiency of such small training sets is probably due to the relative simplicity of the task. Notably, lung and heart regions exhibit large similarity across a set of chest X-rays, and they are strongly correlated with simple intensity thresholds. Consequently, even small datasets have high representative potential. We note that as $\delta$ gets smaller, the probability density function of $B(\delta, \delta)$ gets more strongly skewed towards the endpoints of the interval $[0, 1]$, which results in mixed up samples being closer to the original samples in general. The perceived optimality of $\delta = 0.2$ in the small dataset cases shows that a considerable augmentation is beneficial and desirable, yet it is inadvisable to use too wildly modified samples.

The beneficial effect of mixup becomes more obscure as we increase the size of the training set. Notably, the results of different augmentation setups are almost indistinguishable from each other. We interpret this phenomenon as another consequence of the similarity of masks from different samples, which inherently drives the network towards simpler representations in the case of a sufficiently broad training set, even without using mixup. We also note that in the case of 10 training samples, while the IoU differences between the no mixup and the mixup regimes are striking, the gain in DSC is less remarkable. This hints that it is inadvisable to rely merely on DSC when evaluating and comparing segmentation models.
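This is partly a consequence of how the two scores are related: for a single hard mask, expressing both in terms of TP, FP and FN (a short derivation of ours, not taken from the paper) gives

$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{DSC} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}, \qquad \frac{\mathrm{d}\,\mathrm{DSC}}{\mathrm{d}\,\mathrm{IoU}} = \frac{2}{(1 + \mathrm{IoU})^2},$$

so near $\mathrm{IoU} \approx 1$ the derivative is close to $1/2$, and an improvement in IoU appears as a roughly twice smaller improvement in DSC; averaging over classes and images blurs the exact correspondence, but the compression near the top of the scale remains.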
We would also like to draw attention to the peculiar loss curves we primarily encountered during the small dataset trainings, as displayed in Figure 6. Notably, the curve of the validation DSC flattens far below the also flattening curve of the train DSC, strongly suggesting the use of early stopping. (Train DSC in fact reaches essentially 1, which is unsurprising with such a small training set.) However, in the later stages the validation DSC catches up, even though the train DSC does not have any room for further improvement. We were especially puzzled by this behaviour in the 10-sized training setup, in which both the train and validation DSC seem completely stabilized from epoch 50 to epoch 400, yet the validation DSC skyrockets in the later stages in a very short amount of time. The same behaviour was experienced during each test run. We have yet to find an intuitive or theoretical explanation for this phenomenon, namely how the generalizing ability of the model can improve further when it seems to be in a perfect state from the training perspective. We note that these observations naturally led us to experiment with even longer trainings, but to no avail.

In the present work, we addressed the problem of automated lung and heart segmentation on chest X-rays. We introduced a new model, Attention BCDU-Net, a variant of Attention U-Net equipped with modified attention gates, and surpassed previous state-of-the-art results. We also demonstrated its ability to attain surprisingly reasonable results with strongly limited training sets. Performance in these cases was enhanced using the mixup augmentation technique, resulting in a highly notable gain in IoU score. Concerning future work, a natural extension of this work would be adding a structure correcting adversarial network to the training scheme, similarly to [9] and [15], and measuring its effect on the performance, especially in the setup of limited training sets. We would also like to find some kind of explanation for the phenomenon of the peculiar loss curves.

The project was supported by the grant EFOP-3.6.3-VEKOP-16-2017-00002.

References

[1] Features of similarity.
[3] Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules.
[4] Fully convolutional networks for semantic segmentation.
[5] U-Net: Convolutional Networks for Biomedical Image Segmentation.
[6] mixup: Beyond Empirical Risk Minimization.
[7] Improving Data Augmentation for Medical Image Segmentation.
[8] Attention U-Net: Learning where to look for the pancreas.
[9] SCAN: Structure Correcting Adversarial Network for Organ Segmentation in Chest X-Rays.
[10] UNet++: A Nested U-Net Architecture for Medical Image Segmentation.
[11] A novel focal Tversky loss function with improved attention U-Net for lesion segmentation.
[12] Bi-Directional ConvLSTM U-Net with Densley Connected Convolutions.
[13] Diagnostic imaging dataset statistical release.
[14] UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation.
[15] Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation.
[16] Enhancing U-Net with Spatial-Channel Attention Gate for Abnormal Tissue Segmentation in Medical Imaging.
[17] SCAU-Net: Spatial-Channel Attention U-Net for Gland Segmentation.
[18] CoLe-CNN: Context-learning convolutional neural network with adaptive loss function for lung nodule segmentation.