key: cord-0073408-7jaroh4k
authors: Zhao, Jun; Zhou, Xiaosong; Shi, Guohua; Xiao, Ning; Song, Kai; Zhao, Juanjuan; Hao, Rui; Li, Keqin
title: Semantic consistency generative adversarial network for cross-modality domain adaptation in ultrasound thyroid nodule classification
date: 2022-01-13
journal: Appl Intell (Dordr)
DOI: 10.1007/s10489-021-03025-7
sha: ba52dec74dfbc48a1f26bfe8742c9d1cfc037a05
doc_id: 73408
cord_uid: 7jaroh4k

Correspondence: Juanjuan Zhao, zhaojuanjuan@tyut.edu.cn
Affiliations: 1 College of Information and Computer, Taiyuan University of Technology, Taiyuan, China; 2 College of Information, Shanxi University of Finance and Economics, Taiyuan, China; 3 Hunan University and State University of New York, Albany, NY, USA

Deep convolutional networks have been widely used for various medical image processing tasks. However, the performance of existing learning-based networks is still limited due to the lack of large training datasets. When a general deep model is directly deployed to a new dataset with heterogeneous features, the effect of domain shift is usually ignored, and performance degradation occurs. In this work, by designing the semantic consistency generative adversarial network (SCGAN), we propose a new multimodal domain adaptation method for medical image diagnosis. SCGAN performs cross-domain collaborative alignment of ultrasound images and domain knowledge. Specifically, we utilize a self-attention mechanism for adversarial learning between the dual domains to overcome visual differences across modal data and preserve the domain invariance of the extracted semantic features. In particular, we embed nested metric learning in the semantic information space, thus enhancing the semantic consistency of cross-modal features. Furthermore, the adversarial learning of our network is guided by a discrepancy loss for encouraging the learning of semantic-level content and a regularization term for enhancing network generalization. We evaluate our method on a thyroid ultrasound image dataset for benign and malignant diagnosis of nodules. The experimental results of a comprehensive study show that the accuracy of the SCGAN method for the classification of thyroid nodules reaches 94.30%, and the AUC reaches 97.02%. These results are significantly better than those of state-of-the-art methods.

Thyroid nodules, described as abnormal growths of glandular tissue, are the most common thyroid disorder [2]. Over the past 30 years, thyroid cancer has been one of the most prevalent and fastest-growing cancers of all types [15]. Therefore, early diagnosis of the benignity or malignancy of nodules is essential to reduce the morbidity and mortality of thyroid cancer [8]. Ultrasonography has become the preferred choice for diagnosing benign and malignant thyroid nodules. However, there are still some challenges in analyzing thyroid ultrasound images. First, ultrasound images are susceptible to speckle noise and echo fluctuations, making the texture distribution in ultrasound images blurred and non-uniform [37]. Second, the diagnosis of thyroid ultrasound images is subjective and highly dependent on the physicians' extensive experience and cognitive ability [14]. In contrast, the use of computer-aided diagnosis (CAD) systems can significantly reduce physicians' workload and misdiagnosis rate. Thyroid image classification has therefore become a research hotspot in computer-aided thyroid disease diagnosis [5].

Traditional methods of thyroid nodule classification

Acharya et al.
[1] used the Gabor transform to extract features of benign and malignant thyroid images and compared the classification performance of SVM, MLP, KNN, and C4.5 classifiers. Raghavendra et al. [26] extracted high-order spectral (HOS) entropy features and used particle swarm optimization (PSO) with a support vector machine (SVM) model to distinguish benign and malignant lesions. Prochazka et al. [23] used two-threshold binary decomposition to extract direction-independent features for random forest (RF) and SVM classifiers. These traditional training methods are computationally inexpensive and do not require a large number of training images. However, they still have apparent limitations: 1) they rely on many manually extracted image features and on classifier selection, 2) feature engineering is a tedious and unstable process, and 3) the resulting models may generalize poorly.

Thyroid nodule classification methods based on deep learning

Compared with traditional methods, deep learning methods can extract global and local features more accurately. In 2017, Ma et al. [16] applied a convolutional neural network for the first time to identify benign and malignant thyroid nodules. Wang et al. [36] designed an effective EM algorithm to train a CNN-based nodule classification model. Zhou et al. [50] proposed an online transfer learning (OTL) method to improve the diagnostic effect of ultrasound examination of thyroid nodules. Wang et al. [37] extracted image features from multiple views acquired in one examination and combined them with an attention-based feature aggregation network. All of the above methods are trained and evaluated on single-modality data. In contrast, the actual medical imaging process is expected to fuse data from different domains. Still, the following problems exist when constructing such models: 1) The scale of medical datasets remains a significant bottleneck for deep learning models. Data collection and manual annotation for each new modality or domain are time-consuming and expensive; for thyroid imaging in particular, few large-scale image datasets exist due to the anatomical specificity of the thyroid location. 2) Distribution differences between different types of data, known as dataset bias or domain shift, mean that deep networks trained on a large labeled dataset cannot be well generalized to new datasets and new tasks, resulting in significant degradation of the generalization performance of the model. We adopt a domain adaptation (DA) algorithm [38] to address the above challenges. The DA algorithm aims to learn models from the source domain data distribution that nevertheless work well on target domains with different but related data distributions. The principle behind DA is that the source and target domains can learn collaboratively and transfer their learned knowledge to each other during the entire training process, making the model robust to noise in the data. Currently, there is no work that effectively uses cross-modal data to construct a DA framework for nodule diagnosis in thyroid ultrasound images. In general, the working pattern of the ultrasound physician is to combine information from both ultrasonography reports and ultrasound images and then come up with a diagnosis. This working pattern stimulated our interest in exploring the content of the reports. We find that the performance of image generation and image classification tasks can be improved by transferring the semantically rich feature representation associated with the images in the reports.
In contrast, existing models lack the reasoning ability to imitate a physician's interpretation of semantic information and ignore important domain and expert knowledge [41] related to the specific task of thyroid diagnosis. Therefore, in our approach, we incorporate disease keywords extracted from ultrasonography reports as the textual component of the multimodal data, as shown in Fig. 1. In this paper, we propose a new multi-task cascaded deep learning framework for diagnosing thyroid ultrasound images. First, we propose a self-attention-based semantic consistency generative adversarial network as a domain adaptation backbone to improve the quality of generated images. Second, to jointly analyze multimodal data features, the critical domain knowledge extracted from ultrasonography reports is fed into the generator through text modeling to promote the semantic consistency of the generated images. Finally, the network integrates a modified classification model, ResNet-50, which uses the combined features to classify thyroid nodules in ultrasound images as benign or malignant. The main contributions of this paper are summarized as follows:

• We propose an effective model: the semantic consistency generative adversarial network (SCGAN). To the best of our knowledge, this work is the first to apply cross-modal domain adaptation based on generative adversarial networks to the classification of benign and malignant thyroid nodules.

• We propose a new cross-modal alignment self-attention module (CASAM) to facilitate domain adaptation and achieve higher generative performance. A semantic alignment layer is used in CASAM to efficiently guide the semantic alignment of image and knowledge features.

• We introduce two advanced techniques: a visual discrepancy loss that dynamically balances the need for the generator to learn domain-invariant features, and a cross-domain fusion zero-centered gradient penalty (CD-GP) that is incorporated into the discriminator to synthesize more realistic images that are semantically consistent with the domain knowledge.

The rest of the paper is organized as follows. We present related work on domain adaptation, generative adversarial networks, and attention mechanisms for medical images in Section 2. The details of our approach are presented in Section 3. Section 4 describes our thyroid ultrasound image dataset and the experimental evaluation results that validate the effectiveness of our approach. Finally, conclusions and future work are presented in Section 5.

Domain Adaptation

In the context of medical image analysis, most studies on domain adaptation have focused on adjusting data distributions across clinical centers, scanning protocols, and scanning sites. Dou et al. [6] pioneered a plug-and-play adversarial domain adaptation network (PnP-AdaNet), which combines multiple adversarial domain adaptation layers to spatially align the latent features of the target domain and the source domain; they tested it on cardiac MRI/CT images. Zhang et al. [49] introduced a collaborative unsupervised domain adaptation (CoUDA) algorithm for medical image diagnosis. This algorithm uses the collective intelligence of two peer subnetworks to conduct transferability-aware domain adaptation on whole-slide images (WSIs) and microscopy images (MSIs) of colon datasets. However, it is often difficult to find a source domain with the same feature and category space as the target domain.
Therefore, this paper focuses on a more realistic and challenging scenario, addressing the correlation of cross-domain data observed in different feature spaces, namely heterogeneous domain adaptation (HDA) [44]. Learning domain-invariant representations for classification tasks across source and target datasets with generative adversarial networks [45] has been studied extensively [38]. Chen et al. [7] investigated the SIFA domain adaptation framework, which applies a deeply supervised mechanism of synergistic image and feature alignment to handle domain transfer through adversarial learning, and conducted extensive experiments on bidirectional cross-modality adaptation over multiple tasks. Ren et al. [27] considered the joint feature distribution between source- and target-domain images and classified histological images obtained with different staining procedures via adversarial learning. Gu et al. [11] explored a two-step progressive transfer learning technique to improve the recognition performance for cross-domain skin diseases and, at the same time, adopted cycle-consistent adversarial learning to extend the model to cross-modal learning tasks such as melanoma detection. Although existing adversarial domain adaptation methods are effective on different tasks, the semantic correlation between domains has not yet been elucidated. Nowadays, the attention mechanism has become a necessary element for capturing the inter-domain dependencies of a model. Wang et al. [40] proposed the transferable attention for domain adaptation (TADA) model, which focuses on core regions to enhance the transferability of images. Wang et al. [34] argued that adding an attention branch to Thorax-Net enhances the correlation between class labels and the locations of pathological abnormalities. Furthermore, three attention modules [35] can be merged into a unified framework for the joint learning of channels, elements, and scales. For thyroid ultrasound nodule diagnosis, we demonstrate an improved version of a well-established self-attention mechanism to further improve diagnostic performance; it helps localize important regions of ultrasound images and enhances the correlation of cross-domain features.

This section illustrates the proposed semantic consistency generative adversarial network (SCGAN) for ultrasound image nodule classification. First, we introduce the selection criteria for domain knowledge and the process of integrating it into deep networks. Second, we present the overall structure of SCGAN, including the composition of the generator and discriminator, and focus on the contribution of the cross-modal alignment self-attention module to semantic consistency. Then, we explain the proposed visual discrepancy loss and regularization method. Finally, we give details of the modifications to the classifier.

Ultrasonography report preprocessing

The ultrasonography report [13] summarizes all clinical findings and physician impressions identified during the ultrasound examination. Ultrasonography reports usually contain comprehensive patient information, but they may also contain inconclusive descriptions or descriptions irrelevant to the disease. For example, in the "Ultrasound Findings" section of the ultrasonography report, as shown in Fig. 1, normal/abnormal conditions are recorded for each site of the thyroid examination, such as the location, size, and severity of the nodules.
In addition, patients' personal information, medical history, and suspicious findings may lead to additional or follow-up studies. Therefore, parsing the content of ultrasonography reports is a complex and challenging task. The Thyroid Imaging Reporting and Data System (TI-RADS) [32] provides standardized terminology to describe thyroid nodule features in ultrasound images. Using TI-RADS as a guide, we screen the disease keywords in ultrasonography reports as domain knowledge, such as boundary, calcification, and echo pattern. By learning text embeddings, this domain knowledge can facilitate the acquisition of semantic information from ultrasonography reports and improve the diagnostic performance of the downstream classification task. We use a pre-trained text encoder φ to learn the semantic information described by the domain knowledge. Each textual description t_i is encoded as a one-hot vector that is then mapped to an embedding and enriched with contextual information. The text embedding φ(t_i) is fed into the LSTM proposed in [10]. At each time step, the LSTM takes the current element of the text embedding sequence {φ_1, ..., φ_n} as input and iteratively applies its transition function to generate the hidden state h_t = LSTM(φ_t, h_{t−1}), which allows the extraction of high-dimensional semantic vectors from the domain knowledge. Domain knowledge contains meaningful disease features, and the key is to maintain diversity and independence among them. To this end, we extract the hidden state corresponding to each disease keyword and obtain a text representation sequence T_s^enc = [h_1, ..., h_n] ∈ R^E. The advantage of this strategy is that it enables the network to select relevant semantic features adaptively, ensuring that they are helpful for disease labeling (as shown in the experimental results).
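As a concrete illustration of the text-encoding step just described, the sketch below shows the keyword-index → embedding φ(t_i) → LSTM → per-keyword hidden state pipeline. It is written in PyTorch purely for illustration (the paper's implementation uses TensorFlow with a pre-trained encoder), and the vocabulary size, dimensions, and keyword indices are hypothetical placeholders.

```python
# Illustrative sketch only, not the authors' implementation:
# disease-keyword indices -> embedding phi(t_i) -> LSTM -> hidden states h_1..h_n = T_s^enc
import torch
import torch.nn as nn

class KeywordTextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # maps a keyword index to phi(t_i)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, keyword_ids):
        # keyword_ids: (batch, n) integer indices of the screened disease keywords
        emb = self.embed(keyword_ids)                      # (batch, n, E)
        hidden_states, _ = self.lstm(emb)                  # h_t = LSTM(phi_t, h_{t-1})
        return hidden_states                               # T_s^enc = [h_1, ..., h_n]

# Hypothetical usage: encode three keyword indices (e.g., boundary, calcification, echo pattern)
encoder = KeywordTextEncoder()
T_enc = encoder(torch.tensor([[12, 47, 301]]))             # shape (1, 3, 128)
```

Keeping one hidden state per keyword (rather than only the final state) preserves the diversity and independence of the disease features, as discussed above.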
Our network architecture is shown in Fig. 2. It consists of a pre-trained text encoder, a domain adaptation generator, and a discriminator. The generator is trained to generate images from the text describing their content, and the discriminator is trained to determine the authenticity of the images conditioned on the semantics defined by the given text. We use the following notation. The domain adaptation generator G_{s→t} has two inputs: the text sequence T_s^enc of the source domain, and a noise vector Z ∈ R^Z, Z ∼ N(0, 1), sampled from a Gaussian distribution to guarantee the diversity of the generated images. First, Z is fed into a fully connected layer and then passed, together with T_s^enc, through a series of upblocks that upsample the images and integrate semantic information with image features during the image generation process. G_{s→t} uses upblocks as its network backbone, each including convolutional layers, a self-attention layer, residual blocks, and an upsampling layer. The self-attention layer brings more non-linearity to G_{s→t}, which is conducive to generating semantically consistent images from different textual descriptions. Therefore, G_{s→t} synthesizes realistic pseudo-target-domain images by I_{s→t} = G_{s→t}(Z, T_s^enc). Then I_{s→t} is regularized using the visual discrepancy loss to be consistent with the corresponding region in the original image. The discriminator D_t attempts to compete with G_{s→t} by distinguishing between the synthetic pseudo-target-domain image I_{s→t} and the real target-domain image I_t. D_t converts I_{s→t} into a feature map and downsamples it through a series of downblocks. Here, the intermediate layers of D_t have a smaller receptive field, which forces G_{s→t} to pay more attention to finer details. The last few layers generally derive information from larger image regions and guide G_{s→t} to produce an image with better global consistency. Then T_s^enc is replicated and spliced onto the image features. Formally, D_t has to distinguish three text-conditioned input pairs: real images I_t^match with matching text, real images I_t^mis with mismatched text, and synthetic images I_{s→t}.

To address the data heterogeneity between the source-domain text representation and the target-domain images, we propose the cross-modal alignment self-attention module (CASAM). The self-attention module efficiently computes long-range dependencies between features, allowing the generator to effectively model the relationships between widely separated spatial regions. CASAM leverages semantic association to guide the alignment process while directing attention to important image features and text representations, providing more prominent and meaningful embeddings for the image generation task. As shown in Fig. 3, the module accepts two inputs: the image feature map F_i and the text representation sequence T_s^enc. First, following the attention mechanism adopted in AttnGAN [42], the three-dimensional image features (width × height × channel) of F_i ∈ R^{w×h×c} are flattened into a two-dimensional sequence (wh × channel, where wh = width × height) and transformed into the query feature map Q_{s→t} to facilitate the calculation of attention. Two convolution layers with 1 × 1 filters are then applied to T_s^enc to generate the key and value feature maps K_s and V_s, respectively. Intuitively, the key K_s focuses on matching with Q_{s→t}, while the projected value V_s can be better optimized to refine Q_{s→t} and obtain a better F_i. We add a semantic alignment layer (SAL) to the module to strengthen the semantic relevance between Q_{s→t} and K_s through metric learning [22]; a geometric similarity is used to measure the relationship between the latent feature spaces of Q_{s→t} and K_s. In consideration of building a reasonable distance metric [4, 19, 22], the cosine similarity [9, 18, 33], cos(q, k) = q·k / (‖q‖‖k‖), is chosen in this paper. The cosine similarity focuses on describing the similarity of semantic classes; for the feature vector of each image subregion in Q_{s→t}, the better the alignment, the shorter the distance. Attention weights over the feature maps Q_{s→t} and K_s are then generated to achieve a more discriminative feature representation, yielding the attention map A. The aggregation operation captures more refined features through the dot product between A and V_s for feature adaptation. The obtained attention weights are normalized using the softmax function to convert the values into relative probabilities. The features are updated by combining the collected attention weights for each feature with the original feature mapping to obtain contextual information.
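The following is a minimal sketch of a CASAM-style cross-modal attention step, assuming cosine similarity is used as the query-key score in the semantic alignment layer, followed by softmax normalization and aggregation with the value projection. The projection sizes and the residual update rule are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch of cross-modal attention between an image feature map F_i and
# the keyword representation T_s^enc, with cosine similarity as the alignment score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CASAMBlock(nn.Module):
    def __init__(self, img_channels=256, text_dim=128, attn_dim=128):
        super().__init__()
        self.to_q = nn.Linear(img_channels, attn_dim)              # query from flattened image features
        self.to_k = nn.Conv1d(text_dim, attn_dim, kernel_size=1)   # 1x1 conv on T_s^enc -> K_s
        self.to_v = nn.Conv1d(text_dim, attn_dim, kernel_size=1)   # 1x1 conv on T_s^enc -> V_s
        self.out = nn.Linear(attn_dim, img_channels)

    def forward(self, feat_map, text_seq):
        # feat_map: (B, C, H, W) image features F_i; text_seq: (B, N, E) keyword hidden states
        B, C, H, W = feat_map.shape
        q = self.to_q(feat_map.flatten(2).transpose(1, 2))          # (B, HW, d) query Q_{s->t}
        k = self.to_k(text_seq.transpose(1, 2)).transpose(1, 2)     # (B, N, d) key K_s
        v = self.to_v(text_seq.transpose(1, 2)).transpose(1, 2)     # (B, N, d) value V_s
        # semantic alignment layer: cosine similarity between each pixel query and each keyword key
        sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)  # (B, HW, N)
        attn = sim.softmax(dim=-1)                                   # attention map A as relative probabilities
        ctx = attn @ v                                               # aggregation: dot product of A and V_s
        out = self.out(ctx).transpose(1, 2).reshape(B, C, H, W)
        return feat_map + out                                        # contextual update of F_i (assumed residual)
```

In this sketch each spatial position of the image feature map attends over the keyword sequence, so the generator can map every subregion to the most relevant fine-grained piece of domain knowledge.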
We propose a new visual discrepancy loss for the generator. The visual discrepancy loss encourages the network to capture discrepancy features; without it, the requirement for G_{s→t} to learn domain-invariant information would be weaker. Thus, co-training with the visual discrepancy loss is an implicit facilitator of network adaptation and plays a crucial role in improving the quality and consistency of the final generated images. The loss is defined as the L2 norm between the feature maps of the real image I_t and the generated image I_{s→t}, L_vd = Σ_j ‖φ_j(I_t) − φ_j(I_{s→t})‖_2, where φ_j(·) represents the process of extracting the j-th image feature map.

Recently, Mescheder et al. [17] introduced a zero-centered gradient penalty, adding regularization terms that push the discriminator's gradient with respect to its input toward zero. Extending this idea to our domain adaptation task, we propose a cross-domain fusion zero-centered gradient penalty (CD-GP) to improve the discriminator's generalization capability. We impose penalty terms on the real and the generated data, respectively, weighted by hyperparameters α and β that balance the effectiveness of the gradient penalty and cannot both be zero. Compared with adding extra discriminators to ensure the semantic consistency of the generated images, our CD-GP does not introduce additional networks to compute semantic similarity and therefore does not increase the complexity of the domain adaptation process or the number of training parameters.

To stabilize and converge the training process of SCGAN, inspired by the SAGAN architecture [47], we evaluate the authenticity of the generated images and their consistency with the input semantics by minimizing the hinge version of the adversarial loss [3]. Formally, we denote the two outputs of D_t as D_t^u(·), the unconditional image score, and D_t^c(·), the conditional image score. Correspondingly, the objective function L_D for D_t is composed of an unconditional term L_D^uncond and a conditional term L_D^cond, defined over P_r, the real data distribution, P_g, the generated data distribution, and P_mis, the mismatching data distribution. On the other side, G_{s→t} is trained to generate images that trick D_t into giving high scores for visual realism and agreement with the text. Similarly, the objective function L_G minimized by G_{s→t} consists of the corresponding terms L_G^uncond and L_G^cond. Taking into account the adversarial loss, the visual discrepancy loss, and the cross-domain fusion zero-centered gradient penalty, our total loss is defined as the weighted sum of these losses, where λ_1 and λ_2 are regularization parameters that balance the trade-off between L_vd, L_CD-GP, and the other terms.
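To summarize the training objectives, the hedged sketch below writes them as stand-alone functions: the hinge adversarial terms follow the standard SAGAN-style form cited above, L_vd is an L2-type distance between feature maps of real and generated images, and CD-GP penalizes the discriminator's input gradients on both real and generated data. The exact expectations and weighting in the paper may differ; the λ values follow the settings reported later in the experiments (λ_1 = 0.2, λ_2 = 2), and the discriminator call signature is an assumption.

```python
# Hedged sketch of the SCGAN-style objectives; function names and the discriminator
# interface are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

def hinge_d_loss(d_real, d_fake):
    # discriminator hinge loss on (conditional or unconditional) scores
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_fake):
    # generator hinge loss: push discriminator scores on generated images up
    return -d_fake.mean()

def visual_discrepancy_loss(feats_real, feats_fake):
    # L_vd: L2-type distance (here mean-squared error) between feature maps
    # phi_j(I_t) and phi_j(I_{s->t}) extracted from real and generated images
    return sum(F.mse_loss(fr, ff) for fr, ff in zip(feats_real, feats_fake))

def cd_gp(discriminator, real_img, fake_img, text, alpha=1.0, beta=1.0):
    # cross-domain zero-centered gradient penalty on real and generated inputs
    gp = 0.0
    for weight, img in ((alpha, real_img), (beta, fake_img)):
        img = img.detach().requires_grad_(True)
        score = discriminator(img, text).sum()                     # assumed D(img, text) interface
        grad = torch.autograd.grad(score, img, create_graph=True)[0]
        gp = gp + weight * grad.pow(2).flatten(1).sum(dim=1).mean()
    return gp

# Illustrative total objectives (weights as reported in the experiments):
# loss_G = hinge_g_loss(d_fake) + 0.2 * visual_discrepancy_loss(feats_real, feats_fake)
# loss_D = hinge_d_loss(d_real, d_fake) + 2.0 * cd_gp(D, real_img, fake_img, text)
```

Because CD-GP is a regularizer on the existing discriminator, it adds no extra network or trainable parameters, consistent with the motivation given above.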
Each residual block of the ResNet-50 [12] network uses a bottleneck structure, which helps overcome the vanishing-gradient problem in large models. To adapt the ResNet-50 network to our problem of classifying benign and malignant nodules, the base layers of the model are frozen, and custom layers are then added to form the final framework. Specifically, we remove its last fully connected layer and add three fully connected layers of 2048, 1024, and 2 neurons, respectively. The weights of the final fully connected layers are fine-tuned using back-propagation with a gradient descent optimization algorithm to minimize the cost function. The final output of the model is obtained using the sigmoid activation function.

Our study uses a dataset provided by the local hospital, consisting of ultrasound examination images and ultrasonography reports of 1083 patients; the hospital institutional review board approved the entire collection process. Due to the variable size of nodules, we exclude nodules with tumor size < 0.60 cm or > 3.00 cm and finally include 1937 nodules from the ultrasound examinations in the analysis. Ultrasonography reports correlated with the ultrasound findings are available for 867 patients. Ultrasound images are screened by experienced thyroid ultrasound physicians (physicians with more than eight years of experience in thyroid ultrasound imaging) based on suspicious features in TI-RADS: solid components, hypoechoic or markedly hypoechoic appearance, microlobulated or irregular margins, microcalcifications, and taller-than-wide shapes. According to the "Ultrasound Findings", nodules are classified into two categories: benign or malignant. There are 1032 benign nodules; the remaining nodules are malignant. As shown in Fig. 4, to extract regions of interest (ROIs) containing nodules, the metadata text placed on the images (e.g., information about the scanner, location, and patient) is discarded to obtain the actual ultrasound image regions. We measure the horizontal and vertical diameters of all nodules so that the nodule marked with the cross symbols lies at the center of the patch image. Each patch is zero-padded to a square of 64 × 64 pixels to maintain the image aspect ratio, and the pixels are normalized to zero mean and unit variance.

Classification results are quantitatively evaluated by the mean and standard deviation of the obtained accuracy, sensitivity/recall, specificity, and area under the receiver operating characteristic curve (AUC). In this paper, the inception score (IS) [31] is chosen to measure the quality of the images generated by SCGAN; IS is a classical metric for evaluating GANs. Since IS does not reflect whether the generated images depend well on the given text representation, we combine it with physician evaluation. The semantic consistency of SCGAN is evaluated by experienced ultrasound physicians who compare the generated images with the corresponding domain knowledge descriptions. We consider that the physicians need to perform two tasks: one is to discriminate the authenticity of the image and determine whether the image matches the corresponding semantic information; the other is to diagnose the benignity or malignancy of the nodule.

The entire network is implemented using the TensorFlow framework based on Python 3.6 and trained on a workstation with Ubuntu 18.04 LTS, a 2.90 GHz Intel(R) Xeon(R) W-2102 CPU, and two NVIDIA GTX Titan XP GPUs. For the text encoder, the dimension E is set to 128, and the word-sequence length is set to 30. In order to compare with previous work, the parameters of our text encoder are fixed during training. In the generator, the dimension of Z is set to 512. In the experiments, the network is trained using the Adam optimizer with β_1 = 0.9 and β_2 = 0.999. On our dataset, training is run for 300 epochs with a minibatch size of 16. For our proposed method, we set up three variants: 1) removing the CASAM of SCGAN and the additional loss functions L_vd and L_CD-GP, that is, directly concatenating the text representation with the image features in G_{s→t} (DAGAN); 2) keeping CASAM but removing both additional losses (SCGAN−L_vd−L_CD-GP); and 3) removing only the visual discrepancy loss (SCGAN−L_vd). The effectiveness of our proposed method is demonstrated through the following experiments.
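Before presenting the results, the following is a minimal sketch of how the reported quantitative metrics (accuracy, sensitivity/recall, specificity, and AUC) can be computed from binary predictions. It assumes the malignant class is labeled 1 and uses scikit-learn for the AUC; the authors' evaluation code is not available, so the names here are illustrative.

```python
# Hedged sketch: computing the reported classification metrics from predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

def classification_metrics(y_true, y_pred, y_score):
    # y_true, y_pred: 0/1 labels and predictions (1 = malignant, an assumption);
    # y_score: predicted probability of the malignant class
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)            # recall for the malignant class
    specificity = tn / (tn + fp)
    auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
    return accuracy, sensitivity, specificity, auc
```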
In our research, the SCGAN model is based on an intelligent combination of knowledge and images. To evaluate the advantages of this cross-modal domain adaptation approach for feature extraction, we first construct GAN models for nodule feature extraction using only images as input. The experimental results obtained by the different classification methods are shown in Table 2 and Fig. 5. Here, the SCGAN model is simplified to DCGAN [25] when no domain knowledge is added and only unimodal data is used for feature extraction. The accuracy, sensitivity, specificity, and AUC obtained using the DCGAN+modified ResNet-50 model are 85.26 ± 1.62, 87.46 ± 3.14, 83.14 ± 1.69, and 84.80 ± 2.51, respectively. By adding class labels to DCGAN as auxiliary information to form ACGAN [20], the ACGAN+modified ResNet-50 model shows a slight improvement in all metrics. However, the classification performance of the above methods is far inferior to that of the GAN models that use a multimodal combination of domain knowledge and images. Compared with DAGAN, the metrics of SCGAN are improved by a further 4.37%, 2.09%, 6.59%, and 4.04%, respectively. This suggests that integrating the domain knowledge from ultrasonography reports into the deep learning model can effectively improve the classification performance for nodules. It can also be concluded that the standard deviation of the classification results is smaller when domain knowledge is used, which means that the inclusion of domain knowledge can effectively improve the stability of nodule classification. In addition, to verify the classification stability of SCGAN on unbalanced samples, we randomly reduce the number of malignant nodules by half while the parameters of the fixed pre-trained model remain unchanged. On this unbalanced dataset, the fluctuation of each metric value is slight, and the classification performance of SCGAN remains excellent.

To evaluate whether the self-attention mechanism helps the domain adaptation process generate higher-quality and semantically consistent images, we use both direct concatenation (i.e., DAGAN) and CASAM alignment (i.e., the variant SCGAN−L_vd−L_CD-GP) for the cross-domain fusion of the text representation and images in G_{s→t}. Compared with DAGAN, SCGAN−L_vd−L_CD-GP further improves the quantitative performance, as shown in Tables 3 and 5, indicating that aligning domains in a brute-force manner does not resolve the strong heterogeneity that exists between them. DAGAN is essentially a pixel-level superposition of data from two different modalities, and mixing data produced by different imaging principles affects the feature extractor's judgment of the target data's feature distribution. Conversely, CASAM does not affect the independence of the feature distribution of each domain's data. In particular, the semantic alignment layer can calculate the similarity between the generated image and the textual description before generating new image features; it can discover the semantic relationship between each pixel and each word, mapping the image features to the corresponding fine-grained text representation. We also quantitatively and qualitatively investigate the effects of L_vd and L_CD-GP. Compared with SCGAN−L_vd−L_CD-GP, SCGAN−L_vd adds the gradient penalty L_CD-GP to the discriminator to ensure the quality of the generated images. That is because L_CD-GP pushes the gradient at I_t^match toward the lowest point of the loss curve while ensuring the smoothness of its adjacent regions, whereas other input images, such as I_{s→t}, are placed at high points of the curve. As shown in Fig. 8, the IS is significantly improved, indicating that L_CD-GP gives the generator a more explicit convergence target, guiding it to generate images that are more realistic and semantically consistent with the ultrasonography reports. Further, our proposed SCGAN adds L_vd to learn discrepancy features.
In principle, in our cross-modal domain adaptation task, the data of the two modalities differ in the visual layer but converge in the semantic layer. If the generator only learned the low-level visual-layer features of the source domain, the prediction results mapped into the target domain would deviate from our expectations and be penalized by the adversarial loss. However, the results of our SCGAN converge significantly, indicating that G_{s→t} learns high-level semantic-layer features. Thus, L_vd in turn encourages G_{s→t} to deceive D_t in the presence of domain shifts, requiring G_{s→t} to capture high-level, semantically domain-invariant features across the source and target domains. As shown in Table 3, the accuracy and specificity are significantly improved, and in Table 5, the IS is also boosted. The loss curves of the generator and discriminator in SCGAN are shown in Fig. 6. The experimental results confirm the validity of our techniques. Specifically, we tune the parameters λ_1 and λ_2 in the loss function of Equation (15), and the results are shown in Table 4 and Fig. 7. The IS significantly increases from 4.14 to 4.23 when λ_2 is changed from 0 to 2. Meanwhile, the IS increases to 4.26 when λ_1 is changed from 0 to 0.2, verifying the effectiveness of combining these two techniques. The IS score decreases significantly when λ_2 is changed from 2 to 4 or λ_1 from 0.2 to 0.5, possibly because the penalty is too large, leading to the loss of some important features. Therefore, in SCGAN, we set λ_1 and λ_2 to 0.2 and 2, respectively.

Table 5 reports the IS scores of SCGAN and the compared methods. We observe that the inclusion of domain information guides the direction in which the generator generates images, which gives the generator less freedom to generate images from random noise alone and reduces the uncertainty of the image generation process. In contrast, in the multi-generator and multi-discriminator structures of StackGAN [48] and AttnGAN [42], the quality of the images generated in the initial layers affects the final refinement, so their results are poorer. In conclusion, SCGAN can generate visually more realistic images with higher quality and better diversity than existing methods (Fig. 8). In Table 6, we compare L_vd with the losses used in different methods. For example, SD-GAN [46] proposed a contrastive loss to improve the consistency between images generated from the same text description. Oord et al. [21] measured the mutual information between two representations by learning with the InfoNCE loss function and thereby obtained useful representations. Wang et al. [39] used a triplet loss to make video patches from the same trajectory closer in the embedding space than random patches. However, for the contrastive loss and the InfoNCE loss, all positive and negative matching pairs of each sample need to be sampled separately, whereas our L_vd does not need to mine negative samples, which reduces the complexity of training. Adding the triplet loss to the baseline reduces the quality score of the generated images. This result suggests that the stronger disentanglement of the triplet loss may separate the relationships between features too much and reduce the smoothness of interpolation.
Table 7 gives the performance metrics of the SCGAN classification models when pre-trained with VGGNet [29], GoogLeNet [30], ResNet-50, ResNet-101, and ResNet-152, respectively. The results show that the highest accuracy values are achieved using ResNet-50. Moreover, the trade-off between classification results and network optimization is crucial: considering the dimensionality and parameter complexity of deep networks such as ResNet-101 and ResNet-152, and the relatively stable performance obtained with the ResNet series, we choose ResNet-50. Therefore, we use the modified ResNet-50 model to train our dataset and use it as the baseline classification method. Figure 9 visualizes 24 images generated using DAGAN, SCGAN−L_vd−L_CD-GP, and SCGAN. Through human perception, we can find that, compared with benign nodules, malignant nodules contain calcifications (abnormal white spots) and irregular edge contours. In terms of generated image quality, DAGAN without domain adaptation synthesizes nodules with irregular shapes, rough texture distributions, and a lack of rich detail. In contrast, the details of the nodules generated with CASAM gradually become clearer, although the marginal areas of some nodules change greatly, which may be related to the scarcity of semantic information about margins. The images generated by our SCGAN model are visually convincing: the internal grayscale differences of the nodules are obvious, the tissue texture is clear, benign nodules have smooth borders, and malignant nodules are accompanied by clustered calcifications. This shows that the effective combination of CASAM and the two losses can ensure the quality of the generated images.

We collaborate with three senior physicians who treat thyroid diseases. The whole process is divided into two parts. First, the three physicians independently determine the authenticity of the images, the semantic consistency, and the benignity or malignancy of the nodules and give their respective diagnoses. Then, in the second part, the three physicians discuss and give the final results of their consultation. The accuracy of each physician and the mean values are shown in Table 8. Overall, our proposed model performs better than the ultrasound physicians. The experimental results indicate that the highest individual score among the three physicians is 75.67%, and in determining the authenticity of the ultrasound images the consultation score is higher than the average score of the three physicians. For the diagnosis of benign and malignant nodules, the consultation score is higher than each physician's independent score, and its overall accuracy is higher than that of authenticity discrimination. We discuss the results further with the physicians and analyze them in detail; physicians make more accurate judgments for nodules with apparent characteristic features.

Comparison with state-of-the-art methods

Table 9 shows the performance comparison between the proposed SCGAN and nine state-of-the-art classification methods for thyroid nodules. The results show that the proposed model achieves better classification performance. Since most of the datasets used to train the compared models are private and the code is not open source, it is impossible to directly compare SCGAN with the other methods on the same datasets. Therefore, Table 9 lists the performance of these methods as reported in the original publications.
Among the compared methods, two also exploit multimodal information: the former used information from different modalities to train DScGANS models to facilitate the diagnosis of benign and malignant thyroid nodules, while the latter focused on comparing the effects of different fusion strategies and different classification network structures on classification performance. Compared with the method of Qin, our method has higher sensitivity and similar accuracy, specificity, and AUC. However, all of the above methods are constrained by the limited availability of annotated data. In contrast, Shi et al. [28] used standardized terminology in KACGAN to assist the extraction of ultrasound image features and facilitate thyroid nodule image augmentation. This method is similar in spirit to ours, but our method performs better in cross-modal alignment and obtains higher metric values. As mentioned above, cross-domain fusion using multimodal data to improve the classification performance for thyroid nodules has become a trend in thyroid nodule diagnosis.

In this paper, we propose a new deeply fused semantic consistency generative adversarial network (SCGAN) to diagnose benign and malignant nodules in thyroid ultrasound images. The method organically combines image features with textual information. The domain adaptation of these two cross-modal data sources is accomplished jointly through the self-attention mechanism and metric learning, using their semantic consistency to reduce domain shift during training. Two new techniques that guide the adversarial hinge loss promote the convergence of the network and improve the quality of image generation. The experimental results demonstrate the effectiveness of SCGAN in improving the performance of target-domain classification networks, and the approach has potential clinical applications. In future work, we will develop a training model that can be applied to more types of ultrasound images and domain knowledge; for example, including richer knowledge such as blood-flow signals or ultrasound elasticity images may further improve diagnostic accuracy. In addition, the embedding process for our domain knowledge relies on pre-trained text encoders, which, unlike those for natural datasets, require parameter tuning for medical datasets. As a next step, we will add an attention mechanism to the text encoder to achieve state-of-the-art performance.
[1] Thyroid lesion classification in 242 patient population using Gabor transform features from high resolution ultrasound images
[2] Multimodal feature fusion and knowledge-driven learning via experts consult for thyroid nodule classification
[3] Large scale GAN training for high fidelity natural image synthesis
[4] Extracting semantic representations from word co-occurrence statistics: a computational study
[5] Deep learning: a primer for radiologists
[6] Semantic-aware generative adversarial nets for unsupervised domain adaptation in chest X-ray segmentation
[7] Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation
[8] A review of thyroid gland segmentation and thyroid nodule segmentation methods for medical ultrasound images
[9] Semantic image synthesis via adversarial learning
[10] LSTM: a search space odyssey
[11] Progressive transfer learning and adversarial domain adaptation for cross-domain skin disease classification
[12] Deep residual learning for image recognition
[13] Ultrasonographic thyroid nodule classification using a deep convolutional neural network with surgical pathology
[14] An improved deep learning approach for detection of thyroid papillary cancer in ultrasound images
[15] Classification of thyroid nodules with stacked denoising sparse autoencoder
[16] A pre-trained convolutional neural network based method for thyroid nodule diagnosis
[17] Which training methods for GANs do actually converge?
[18] Transformer reasoning network for image-text matching and retrieval
[19] Visual object tracking via the local soft cosine similarity
[20] Conditional image synthesis with auxiliary classifier GANs
[21] Representation learning with contrastive predictive coding
[22] Towards zero-shot learning generalization via a cosine distance loss
[23] Patch-based classification of thyroid nodules in ultrasound images using direction independent features extracted by two-threshold binary decomposition
[24] Diagnosis of benign and malignant thyroid nodules using combined conventional ultrasound and ultrasound elasticity imaging
[25] Unsupervised representation learning with deep convolutional generative adversarial networks
[26] Optimized multi-level elongated quinary patterns for the assessment of thyroid nodules in ultrasound images
[27] Adversarial domain adaptation for classification of prostate histopathology whole-slide images
[28] Knowledge-guided synthetic medical image adversarial augmentation for ultrasonography thyroid nodule classification
[29] Very deep convolutional networks for large-scale image recognition
[30] Going deeper with convolutions
[31] Rethinking the inception architecture for computer vision
[32] ACR thyroid imaging, reporting and data system (TI-RADS): white paper of the ACR TI-RADS committee
[33] SliderGAN: synthesizing expressive face images by sliding 3D blendshape parameters
[34] Thorax-Net: an attention regularized deep neural network for classification of thoracic diseases on chest radiography
[35] Triple attention learning for classification of 14 thoracic diseases using chest radiography
[36] Learning from weakly-labeled clinical data for automatic thyroid nodule classification in ultrasound images
[37] Automatic diagnosis for thyroid nodules in ultrasound images by deep neural networks
[38] Deep visual domain adaptation: a survey
[39] Unsupervised learning of visual representations using videos
[40] Transferable attention for domain adaptation
[41] A survey on domain knowledge powered deep learning for medical image analysis
[42] AttnGAN: fine-grained text to image generation with attentional generative adversarial networks
[43] DScGANS: integrate domain knowledge in training dual-path semi-supervised conditional generative adversarial networks and S3VM for ultrasonography thyroid nodules classification
[44] Heterogeneous domain adaptation via soft transfer network
[45] Generative adversarial network in medical imaging: a review
[46] Semantics disentangling for text-to-image generation
[47] Self-attention generative adversarial networks
[48] StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks
[49] Collaborative unsupervised domain adaptation for medical image diagnosis
[50] Online transfer learning for differential diagnosis of benign and malignant thyroid nodules with ultrasound images

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors declare that they have no conflict of interest.