key: cord-0058332-6zzdn8l0
authors: Wang, Mingyu; Yuan, Chenglang; Wu, Dasheng; Zeng, Yinghou; Zhong, Shaonan; Qiu, Weibao
title: Automatic Segmentation and Classification of Thyroid Nodules in Ultrasound Images with Convolutional Neural Networks
date: 2021-02-23
journal: Segmentation, Classification, and Registration of Multi-modality Medical Imaging Data
DOI: 10.1007/978-3-030-71827-5_14
sha: cb03555f734c25528a9f82908e14ef1c09d091ea
doc_id: 58332
cord_uid: 6zzdn8l0

Ultrasound imaging plays an important role in the diagnosis of thyroid disease. Accurate segmentation and classification of thyroid nodules are challenging due to their heterogeneous appearance. In this paper, we propose an efficient cascaded segmentation framework and a dual-attention ResNet-based classification network to automatically achieve accurate segmentation and classification of thyroid nodules, respectively. We evaluate our methods on the training dataset of the TN-SCUI 2020 Challenge. The 5-fold cross-validation results demonstrate that the proposed methods achieve an average IoU of 81.43% in the segmentation task and an average F1 score of 83.22% in the classification task. Finally, our method ranked first in the segmentation task on the test set in the final online verification. The source code of the proposed methods is available at https://github.com/WAMAWAMA/TNSCUI2020-Seg-Rank1st.

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-71827-5_14) contains supplementary material, which is available to authorized users.

Thyroid nodules are among the most commonly diagnosed nodular lesions in the adult population. Nodules detected at an early stage are highly curable, so accurate differentiation between malignant and benign thyroid nodules is necessary to ensure proper clinical management of malignant nodules [1]. Because ultrasound imaging is noninvasive, real-time, and radiation-free, it is the key tool for the diagnosis of thyroid nodules. However, blurred boundaries and large variations in the appearance and intensity of thyroid nodules across ultrasound images make it challenging to recognize the subtle differences between malignant and benign nodules [2]. In this paper, we propose an efficient cascaded segmentation network and a dual-attention ResNet-based classification network to achieve automatic and accurate segmentation and classification of thyroid nodules, respectively.

Due to different acquisition protocols, some thyroid ultrasound images contain irrelevant regions (as shown in Fig. 1). First, we remove these regions, which may introduce redundant features, using a threshold-based approach. Specifically, we average the original images (pixel values from 0 to 255) along the x and y axes, respectively, and remove rows and columns whose mean value is less than 5. The processed images are then resized to 256 × 256 pixels as the input of the first segmentation network.

Our cascaded segmentation pipeline is shown in Fig. 1. We train two networks that share the same encoder-decoder structure and are optimized with the Dice loss. The first segmentation network (stage I of the cascade) is trained to provide a rough localization of nodules, and the second segmentation network (stage II of the cascade) is trained for fine segmentation based on this rough localization.
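As a concrete illustration of the threshold-based preprocessing described above, the following minimal NumPy sketch trims rows and columns whose mean intensity is below 5 and resizes the result to 256 × 256 pixels. The function name and the use of OpenCV for resizing are our own choices and are not taken from the authors' released code.

```python
import numpy as np
import cv2  # assumed here only for resizing; any resize routine would do


def preprocess_ultrasound(img, thresh=5.0, size=256):
    """Trim rows/columns whose mean intensity is below `thresh`, then resize.

    `img` is a single-channel uint8 ultrasound image with values in [0, 255].
    """
    keep_rows = np.where(img.mean(axis=1) >= thresh)[0]  # mean over each row
    keep_cols = np.where(img.mean(axis=0) >= thresh)[0]  # mean over each column
    trimmed = img[np.ix_(keep_rows, keep_cols)]          # drop near-black rows/columns
    return cv2.resize(trimmed, (size, size), interpolation=cv2.INTER_LINEAR)
```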
To our knowledge, in current cascaded segmentation frameworks, the actual output (mask or probability map) or a pseudo-label output of the first network is generally fed into the training of the second network so that the second network receives contextual information [3, 4]. However, our preliminary experiments show that the contextual information provided by the first network may not play a significant auxiliary role in the refinement performed by the second network. Therefore, we train the second network only on images within the region of interest (ROI) obtained from the ground truth (GT). When training the second network, we expand the nodule ROI obtained from the GT, and the image within the expanded ROI is cropped out and resized to 512 × 512 pixels.

We observe that, in most cases, large nodules have clear boundaries, while the gray values of small nodules differ markedly from those of the surrounding normal thyroid tissue (Fig. 2). Therefore, background information (the tissue around the nodule) is important for segmenting small nodules. As shown in Fig. 3, in the preprocessed image of 256 × 256 pixels, the minimum enclosing square of the nodule ROI is obtained first; the expansion margin m is then set to 20 pixels if the edge length n of the square is greater than 80 pixels, and to 30 pixels otherwise.

To focus on and learn the features that are most relevant for identifying thyroid nodules, we propose a dual-attention ResNet framework. Specifically, we adopt ResNeSt200 [5] as the backbone architecture for the classification of thyroid nodules. ResNeSt introduces the Split-Attention block, which enables feature-map attention across different feature-map groups and improves the learned feature representations to boost model performance. In addition, we employ an online attention mechanism based on class activation mapping to capture the intrinsic relationship between the feature information of thyroid nodules and their clinical characteristics. The core idea is to project the weights of the fully connected layer onto the convolutional feature maps to generate attention maps and to dynamically optimize the network's decision. As shown in Fig. 4, we define F as the feature maps before the global average pooling operation and W as the weight matrix of the fully connected layer. The attention map is defined as

\mathrm{AM} = \mathrm{Norm}\!\left( \sum_{c=1}^{C} F_{c} \cdot W_{c} \right),

where Norm denotes the normalization process and the summation is performed along the channel axis. Finally, our loss function consists of two parts: an MSE loss between the attention maps and the segmentation ground truths, which constrains the model's attention to the location of the thyroid nodules, and a BCE loss between the predictions and the classification ground truths, which reduces confidence errors.

In both tasks, the following data augmentation methods are applied: 1) horizontal flipping, 2) vertical flipping, 3) random cropping, 4) random affine transformation, 5) random scaling, 6) random translation, 7) random rotation, and 8) random shearing. In addition, one of the following methods is randomly selected for additional augmentation: 1) sharpening, 2) local distortion, 3) contrast adjustment, 4) blurring (Gaussian, mean, median), 5) addition of Gaussian noise, and 6) erasing.
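The ROI-expansion rule above can be sketched as follows. This is a minimal illustration under our own assumptions: the paper does not specify how the expanded square is centred or clamped at the image borders, and the function name is hypothetical.

```python
import numpy as np
import cv2  # assumed for resizing; any image library would work


def crop_expanded_roi(img, mask, out_size=512):
    """Crop the nodule with an expanded square ROI, following the m = 20/30 rule.

    `img` and `mask` are 256x256 arrays; `mask` is the binary ground-truth mask.
    """
    ys, xs = np.where(mask > 0)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()

    # Edge length of the minimum enclosing square and its centre.
    n = max(y1 - y0 + 1, x1 - x0 + 1)
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2

    # Expansion margin: 20 px for larger nodules (n > 80), 30 px otherwise.
    m = 20 if n > 80 else 30
    half = n // 2 + m

    # Clamp the expanded square to the image borders before cropping.
    top, bottom = max(cy - half, 0), min(cy + half, img.shape[0])
    left, right = max(cx - half, 0), min(cx + half, img.shape[1])
    crop = img[top:bottom, left:right]

    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```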
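A rough PyTorch sketch of the online attention map and the two-part loss described above is given below. It assumes per-image min-max normalization for Norm, equal weighting of the MSE and BCE terms, and nearest-neighbour resizing of the segmentation ground truth to the feature-map resolution; none of these details are stated in the text, and the function names are our own.

```python
import torch
import torch.nn.functional as nnf


def attention_map(features, fc_weight, class_idx):
    """Class-activation-style attention map.

    features:  (B, C, H, W) feature maps taken before global average pooling.
    fc_weight: (num_classes, C) weight matrix of the final fully connected layer.
    class_idx: (B,) class index whose weights are projected onto the features.
    """
    w = fc_weight[class_idx]                      # (B, C)
    am = (features * w[:, :, None, None]).sum(1)  # weighted sum along the channel axis -> (B, H, W)

    # Per-image min-max normalization so the map lies in [0, 1] (our assumption for Norm).
    am_min = am.flatten(1).min(1).values[:, None, None]
    am_max = am.flatten(1).max(1).values[:, None, None]
    return (am - am_min) / (am_max - am_min + 1e-8)


def dual_attention_loss(logits, cls_gt, att_map, seg_gt):
    """BCE on the class prediction plus MSE between the attention map and the
    segmentation ground truth resized to the feature-map resolution."""
    bce = nnf.binary_cross_entropy_with_logits(logits, cls_gt.float())
    seg_gt_small = nnf.interpolate(seg_gt[:, None].float(),
                                   size=att_map.shape[-2:], mode='nearest')[:, 0]
    mse = nnf.mse_loss(att_map, seg_gt_small)
    return bce + mse  # equal weighting of the two terms is our assumption
```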
Test-time augmentation (TTA) generally improves the generalization ability of the segmentation model. In our framework, the TTA for the segmentation task includes vertical flipping, horizontal flipping, and 180° rotation.

We validate our method on the training dataset of the TN-SCUI 2020 challenge, which includes 2003 malignant nodules and 1641 benign nodules.

Cross Validation with a Size and Category Balance Strategy. 5-fold cross-validation was used to evaluate the performance of our proposed method. In our opinion, it is necessary to keep the size and category distributions of nodules similar across the training and validation sets. In practice, the size of a nodule is the number of pixels it covers after the preprocessed image is resized to 256 × 256 pixels. We stratified the sizes into three grades: 1) less than 1722 pixels, 2) between 1722 and 5666 pixels, and 3) greater than 5666 pixels. These two thresholds, 1722 and 5666 pixels, are close to the tertiles, and the size stratification was significantly associated with the benign and malignant categories by the Chi-square test (p < 0.01). We divided the images of each size grade into 5 folds and then combined the corresponding folds across grades into the final folds. This strategy ensured that the final 5 folds had similar size and category distributions.

Implementation Details. We implemented our framework in PyTorch using 3 NVIDIA GTX 1080 Ti GPUs. For both tasks, we chose Adam as the optimizer and used a learning rate schedule consisting of a 5-epoch warm-up followed by cosine decay over 350 epochs. In the segmentation task, all networks were built with segmentation_models_pytorch, and the batch sizes were 10 and 3 for the first and second networks, respectively. The learning rate increased from 1e−12 to 1e−4 during the warm-up phase and then gradually decreased to 1e−12 during the cosine-decay phase. When training the segmentation networks with EfficientNet encoders, we initialized the encoders with weights pretrained on ImageNet. In the classification task, the batch size was 16, and the learning rate increased from 1e−12 to 2.5e−3 during the warm-up phase and then gradually decreased to 1e−12 during the cosine-decay phase.

We employed the Dice similarity coefficient (DSC) and intersection over union (IoU) to evaluate segmentation performance, and the area under the receiver operating characteristic curve (AUROC), sensitivity (SEN), specificity (SPE), accuracy (ACC), and F1 score to evaluate classification performance.

Several experiments were conducted for extensive comparison. We tested different network structures on the validation set of the first fold, and the segmentation results without TTA are shown in Table 1. To make the test results as close to the real situation as possible (such as testing on the final test set of the TN-SCUI challenge), all indicators in the table were calculated against the GT of the original image, so the predicted masks were restored to the size of the unprocessed original images. The results show that, among the different networks, DeepLabV3+ with a pretrained EfficientNet-B6 encoder works best, reaching a DSC of 0.8699 and an IoU of 79.00%. We therefore chose DeepLabV3+ with a pretrained EfficientNet-B6 encoder for both segmentation networks of the cascade; the cascaded segmentation results are shown in Table 2. All metrics in the table were calculated using the GT corresponding to the original image.
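The size- and category-balanced splitting could be approximated as in the sketch below, which stratifies on a joint size-grade/category label with scikit-learn's StratifiedKFold; this is our own approximation of the per-grade splitting described above, not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def balanced_folds(nodule_sizes, labels, n_splits=5, seed=0):
    """Split indices into folds with similar size-grade and category distributions.

    nodule_sizes: pixel count of each nodule in the 256x256 preprocessed image.
    labels:       0 = benign, 1 = malignant.
    """
    sizes = np.asarray(nodule_sizes)
    grades = np.digitize(sizes, bins=[1722, 5666])   # size grades 0, 1, 2
    joint = grades * 2 + np.asarray(labels)          # joint size-grade / category label
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [val_idx for _, val_idx in skf.split(sizes, joint)]
```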
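The learning-rate schedule reported in the implementation details (warm-up followed by cosine decay) could be written roughly as follows; the linear shape of the warm-up is our assumption, and the default values mirror the segmentation setting (base_lr would be 2.5e−3 for the classification task).

```python
import math


def lr_at_epoch(epoch, base_lr=1e-4, min_lr=1e-12, warmup_epochs=5, decay_epochs=350):
    """Learning rate for a given epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:                   # warm-up from min_lr up to base_lr
        return min_lr + (base_lr - min_lr) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / decay_epochs  # progress through the cosine-decay phase
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * min(t, 1.0)))
```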
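For reference, the DSC and IoU used in the evaluation can be computed from binary masks as in this short sketch, after the predicted mask has been restored to the original image size as described above.

```python
import numpy as np


def dsc_and_iou(pred_mask, gt_mask, eps=1e-8):
    """Dice similarity coefficient and IoU between two binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    iou = inter / (np.logical_or(pred, gt).sum() + eps)
    return dsc, iou
```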
The results of the first fold show that, by applying the cascade and using TTA in both the first and second networks, IoU improved from 79.00% to 81.44% and DSC improved from 0.8699 to 0.8864. The results of the remaining 4 folds show the same trend (Supplementary Table 1). Overall, we obtained an average DSC of 0.8873 and an average IoU of 81.43% in the 5-fold cross-validation.

As shown in Table 3, owing to its powerful feature representation ability, ResNeSt200 achieves an F1 score of 0.8186, outperforming ResNet and EfficientNet. Moreover, the online attention mechanism brings a further improvement of 0.0136 in F1 score thanks to its effective guidance toward important activation positions.

We proposed two efficient and automatic frameworks based on convolutional neural networks to segment and classify thyroid nodules, respectively. The experimental results demonstrate that both the cascade strategy and TTA effectively improve segmentation performance. For the classification task, our proposed dual-attention ResNet achieved better performance than the original ResNet as well as ResNeSt.

References
[1] The changing incidence of thyroid cancer.
[2] Thyroid nodule segmentation in ultrasound images based on cascaded convolutional neural network.
[3] An attempt at beating the 3D U-Net.
[4] CascadePSP: toward class-agnostic and very high-resolution segmentation via global and local refinement.
[5] ResNeSt: split-attention networks.