Title: HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free Breast Cancer Diagnosis in Ultrasound Images
Authors: Yuhao Mo, Chu Han, Yu Liu, Min Liu, Zhenwei Shi, Jiatai Lin, Bingchao Zhao, Chunwang Huang, Bingjiang Qiu, Yanfen Cui, Lei Wu, Xipeng Pan, Zeyan Xu, Xiaomei Huang, Zaiyi Liu, Ying Wang, Changhong Liang
Date: 2022-05-17

Abstract: Ultrasonography is an important routine examination for breast cancer diagnosis due to its non-invasive, radiation-free and low-cost properties. However, it is still not the first-line screening test for breast cancer because of its inherent limitations. It would be a tremendous success if we could precisely diagnose breast cancer from breast ultrasound (BUS) images. Many learning-based computer-aided diagnostic methods have been proposed for breast cancer diagnosis/lesion classification. However, most of them require a pre-defined ROI and then classify the lesion inside the ROI. Conventional classification backbones, such as VGG16 and ResNet50, can achieve promising classification results with no ROI requirement, but these models lack interpretability, which restricts their use in clinical practice. In this study, we propose a novel ROI-free model for breast cancer diagnosis in ultrasound images with interpretable feature representations. We leverage the anatomical prior knowledge that malignant and benign tumors have different spatial relationships between different tissue layers, and propose a HoVer-Transformer to formulate this prior knowledge. The proposed HoVer-Trans block extracts the inter- and intra-layer spatial information horizontally and vertically. We construct and release an open dataset, GDPH&GYFYY, for breast cancer diagnosis in BUS. The proposed model is evaluated on three datasets against four CNN-based models and two vision transformer models via five-fold cross-validation. It achieves state-of-the-art classification performance with the best model interpretability.

Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death in women globally [1]. Early breast cancer screening can reduce mortality and increase survival rates [2]. Breast ultrasound (BUS) is an important imaging modality for breast cancer diagnosis and screening because it is low-cost, non-invasive, radiation-free, and relatively more sensitive for dense breast tissue [3]. In addition, ultrasound is effective at differentiating cysts from solid lesions [4]. Therefore, a precise BUS-based screening method would be meaningful for breast cancer patients, especially for patients with dense breasts in Asia. Currently, ultrasound image evaluation generally relies on the subjective assessment of sonographers. However, high-quality screening ultrasound is constrained by the limited number of specialized sonographers. In addition, high intra- and inter-observer variability exists even among expert sonographers. To overcome such difficulties, computer-aided diagnosis (CAD) systems [5], [6] have been constructed to help sonographers perform more efficient and more precise breast cancer screening. With recent advances in deep learning, diagnostic models can even outperform expert sonographers [7]. Even though existing models have already achieved outstanding diagnostic performance, most of them are still 'black boxes' that lack interpretability.
Furthermore, since open-source data in the community are very limited, existing models have been evaluated either on relatively small open datasets (BUSI [8] or UDIAT [9]) or on their own private datasets. The clinical usability of these diagnostic models should therefore be further assessed. In this paper, we promote automatic breast cancer screening in ultrasound images in two aspects: data resources and methodology. First, we release a large breast lesion classification dataset, GDPH&GYFYY, collected from two medical centers, with 886 benign and 1525 malignant images, for a total of 2411 BUS images. We only provide the whole BUS images, without ROI annotations. Then, we propose an interpretable breast cancer diagnosis model for BUS. We find that the sequential data analysis nature of the transformer fits the anatomical prior of the breast ultrasound image perfectly. As shown in Fig. 1, there are four layers from top to bottom: the subcutaneous fat layer, the breast parenchyma layer, the muscle layer and the chest wall layer. Malignant tumors always start from the breast parenchyma layer and invade toward the deeper layers, whereas benign breast tumors typically originate in the glandular tissue and disrupt the continuity of the gland. Therefore, we design an anatomy-aware model, called HoVer-Transformer (HoVer-Trans for short), which considers the prior knowledge of the anatomical structure in BUS. We propose a HoVer-Trans block to extract the inter-layer spatial information horizontally and the intra-layer spatial information vertically. In HoVer-Trans, we introduce convolutional layers to join two adjacent transformer stages, fusing the horizontal and vertical image features and introducing inductive bias. The proposed HoVer-Trans is evaluated by extensive experiments, including comparisons with SOTA methods on several datasets, model interpretability analysis and ablation studies. HoVer-Trans achieves comparable quantitative performance on all the datasets. The visualization heatmaps also demonstrate that HoVer-Trans pays attention to the malignant lesion boundary (invasive margin), which shows that the horizontal and vertical design successfully learns the anatomical prior knowledge. Ablation studies demonstrate the effectiveness of each specific technical design. The main contributions of this paper are summarized as follows.
• We release a new breast cancer classification dataset, GDPH&GYFYY, which is the largest open dataset in this field.
• We propose an anatomy-aware model, HoVer-Trans, to fully automatically classify breast lesions, achieving comparable performance to six baseline models.
• HoVer-Trans is able to provide interpretable evidence to support the decision of the model.

In this section, we summarize previous research on breast cancer diagnosis in ultrasound images [10], [11] and transformer-based medical image classification models [12]. Ultrasonography is one of the most common non-invasive imaging modalities for breast cancer screening and diagnosis. Precisely detecting and diagnosing malignant tumors allows early intervention to reduce mortality. Diagnosis generally relies on the subjective evaluation of sonographers. However, manual assessment depends heavily on clinical experience, due to the heterogeneity of malignant tumors and the low image quality of ultrasound. Therefore, it is essential to design CAD algorithms to automatically and objectively evaluate breast ultrasonography.
With the recent advances in artificial intelligence techniques, such as radiomics [13] and deep learning [14], researchers have started to solve various clinical prediction tasks in a data-driven manner and have achieved outstanding performance in breast cancer diagnosis, including lesion classification [4], [6], axillary lymph node status prediction [15], [16], sentinel lymph node status prediction [17], [18] and even molecular status prediction [19], [20]. Currently, deep learning-based models have come to dominate breast lesion classification. Flores et al. [21] explored the predictive value of morphological and texture features for breast lesion classification on ultrasonography. Byra et al. [22] transferred a model pre-trained on ImageNet to fine-tune a breast mass classification model, which is a popular approach for small- or mid-sized data. Some researchers [23], [24] attempted to ensemble the deep features from multiple classification architectures and applied machine learning classifiers for breast ultrasonography image classification. Zhuang et al. [25] proposed an image decomposition and enhancement method to enrich the information in the ultrasound image. Qian et al. [4] aggregated multimodal ultrasound images for an explainable prediction to support the clinical decision-making of sonographers and increase the confidence levels of the decisions. Cui et al. [26] proposed FMRNet to fuse combined tumoral, intratumoral and peritumoral regions to represent the heterogeneity of the whole tumor. Di et al. [27] introduced a saliency-guided approach to differentiate the foreground and background regions with two separate branches; a hierarchical feature aggregation branch was proposed to fuse the features from both branches and make the inference. Qi et al. [28] designed two identical CNN backbones to identify malignant tumors and solid nodules separately, with the class activation maps generated from the two backbones used to guide each other; they validated the proposed model on a large dataset with 8145 breast ultrasonography images. Unfortunately, the dataset in that paper is private. Shen et al. [5] discussed how a CAD system helps sonographers reduce the false-positive rate.

For breast lesion classification, even though existing models have already achieved performance comparable to sonographers, it is still worthwhile to keep exploring the potential value of deep learning models from different perspectives. People now pay more attention to clinical usability rather than only considering accuracy. Clinical usability is reflected in the following factors. (1) Accuracy and concordance: whether the model produces more precise predictions as well as a higher concordance rate with sonographers. (2) Interpretability: whether the model can provide any sonographic symptom or evidence to support its decision. (3) Model convenience: whether the model is fully automated without any user input, such as manual segmentation or pre-defined ROIs.

Fig. 2. HoVer-Trans formulates the anatomical prior knowledge in breast ultrasound images and is designed to extract the intra-layer and inter-layer relationships of the anatomical layers in the breast. It consists of four branches: the horizontal branch and the vertical branch are designed to extract the inter-layer and intra-layer relationships, respectively, while the H2V and V2H branches are introduced to fuse the horizontal and vertical features. The output features from each branch in the HoVer-Trans block are regarded as the input features of the next HoVer-Trans block. (c) The Conv block is applied to connect two stages and to introduce inductive bias.
The transformer [29] was originally designed for natural language processing and has been widely used in sequential data analysis thanks to its elegant self-attention mechanism [30]. The invention of the vision transformer (ViT) [31] brought transformer-based models to computer vision applications by cropping the image into several small tiles (visual words). The Swin transformer [32] introduces multi-scale information, as a CNN model does, through a hierarchical structure and shifted windows. Soon after, various transformer models [33] were proposed for medical image classification, such as COVID-ViT for chest CT COVID-19 classification [34], TransMIL for pathology image classification [35] and MIL-VT for fundus image classification [36].

Instead of designing a complex black-box model for breast cancer prediction, we take model interpretability, generalizability and convenience into consideration. By formulating the anatomical prior knowledge into the transformer model design, the proposed model demonstrates superior predictive ability while providing interpretable features to support its decisions. Since the anatomical structures of the breast are clearly visible in ultrasound images, we leverage this prior knowledge and propose an anatomy-aware model for fully automatic breast cancer diagnosis in ultrasound images.

In this section, we describe the methodology of the proposed model, shown in Fig. 2. First, we introduce the key idea of the anatomy-aware formulation in Sec. III-A. Based on this idea, the HoVer-Trans stage is proposed in Sec. III-B. Next, we define the overall network structure of the proposed model in Sec. III-C. Sec. III-D presents the implementation details.

According to the ultrasound imaging principles and the anatomical structure of the breast, different breast tissues clearly form different layers in the ultrasound images, as shown in Fig. 1. The size, location and morphological appearance of the lesion, together with its spatial relationship with the different layers, determine the malignancy of the lesion. Conventional CNN models are good at extracting representative local features but are less effective at representing spatial relationships. That is why most existing breast cancer diagnosis algorithms for ultrasound images need a pre-defined ROI of the lesion to remove the redundant area and let the CNN model classify the ROI. The self-attention nature of the transformer introduces strong spatial relationships among the visual words, as shown in Fig. 3 (a). To further exploit the intra-layer and inter-layer spatial correlations in BUS, we transform the square-shaped visual words into horizontal and vertical strips to bring the anatomical prior knowledge into the model, as shown in Fig. 3 (b). Since our proposed model is constructed on top of the ViT structure, we first briefly introduce ViT and then show how the proposed HoVer-Trans stage is constructed. 1) Vision Transformer: The vision transformer (ViT) [31] was the first to bring this popular natural language processing technique into the computer vision world.
It tessellates the input image x ∈ R^{H×W×C} into patches x_p ∈ R^{N×(P^2·C)} and regards them as visual words (tokens), where (H, W, C) and (P, P, C) are the resolutions (with channels) of the input image and of the patches, respectively, and N is the number of patches. Each visual word is flattened from a 2D patch into a 1D vector and linearly projected, which is called patch embedding. The multi-head self-attention mechanism then builds spatial correlations across the different tokens; a standard ViT block can be written as z'_l = MSA(LN(z_{l-1})) + z_{l-1} and z_l = MLP(LN(z'_l)) + z'_l, where MSA denotes multi-head self-attention and LN denotes layer normalization. In our proposed HoVer-Trans, we use several ViT blocks with exactly the same structure, but without the class embedding, to construct the HoVer-Trans block. Thus, we denote all the ViT blocks in the remainder of this paper as Trans(·).

2) Embedding: To formulate the anatomical prior knowledge into the transformer model, we introduce two additional embedding schemes, shown in Fig. 2 (a), following the idea presented in Sec. III-A. Given an input BUS image I ∈ R^{H×W×C}, patch embedding, horizontal strip embedding and vertical strip embedding are performed before feeding the image into the model. The patch embedding cuts the input image into N × N patches x_p^{(r,c)}, where r and c denote the row and column indices. After flattening and linear projection, we obtain a group of 1D vectors z_p. Horizontal strip embedding is introduced to represent the visual words of the same anatomical layer with M horizontal strips, which are embedded analogously (flattened and linearly projected). Vertical strip embedding is introduced to represent the visual words across anatomical layers with M vertical strips, embedded in the same way.

3) HoVer-Trans Block: The architecture of the HoVer-Trans block is depicted in Fig. 2 (b). We design a symmetric structure with four branches in one HoVer-Trans block: the H branch (horizontal), the V branch (vertical), the H2V branch (horizontal to vertical) and the V2H branch (vertical to horizontal). Let us define the features at the l-th block in the s-th stage as z^{s,l}_{{h,v,h2v,v2h}}. The HoVer-Trans block takes the outputs of the previous block and generates the features for the next block. The H and V branches are two auxiliary branches that extract the inter-layer and intra-layer spatial correlations, using the horizontal strip embedding and the vertical strip embedding defined above, respectively. The anatomy-aware spatial features z^{s,l}_h and z^{s,l}_v are passed into the two main branches (H2V and V2H) and are also regarded as inputs of the next HoVer-Trans block. The H2V and V2H branches serve as the main feature extraction branches, which fuse the features from the two auxiliary branches (H and V). For example, in the H2V branch, the horizontal features z^{s,l}_h are added to the features from the previous HoVer-Trans block z^{s,l-1}_{h2v}; after a transformer encoder, the vertical features z^{s,l}_v are added:

z^{s,l}_{h2v} = Trans(Trans(z^{s,l}_h + z^{s,l-1}_{h2v}) + z^{s,l}_v).   (12)

The V2H branch is the mirror of the H2V branch, with the roles of the horizontal and vertical features exchanged. The output features z^{s,l}_{h2v} and z^{s,l}_{v2h} are passed into the next block. Note that, for the last HoVer-Trans block in each stage, the features are passed into a Conv block, described below.
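To make the four-branch update concrete, the snippet below is a minimal sketch (not the authors' released implementation) of one HoVer-Trans block implementing Eq. (12) and its V2H mirror in PyTorch. torch.nn.TransformerEncoderLayer is used as a stand-in for the ViT block Trans(·), and for simplicity all four branches are assumed to carry token sequences of the same length and dimension; how the strip tokens are aligned with the patch tokens is an implementation detail not specified in the text here.

```python
import torch
import torch.nn as nn

def vit_block(dim: int, heads: int = 4) -> nn.Module:
    # Stand-in for Trans(.): a pre-norm multi-head self-attention + MLP block.
    return nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                      dim_feedforward=4 * dim,
                                      batch_first=True, norm_first=True)

class HoVerBlockSketch(nn.Module):
    """One HoVer-Trans block: H and V auxiliary branches, H2V and V2H main branches."""

    def __init__(self, dim: int):
        super().__init__()
        self.trans_h = vit_block(dim)    # horizontal-strip (H) branch
        self.trans_v = vit_block(dim)    # vertical-strip (V) branch
        self.trans_h2v = nn.ModuleList([vit_block(dim), vit_block(dim)])
        self.trans_v2h = nn.ModuleList([vit_block(dim), vit_block(dim)])

    def forward(self, z_h, z_v, z_h2v, z_v2h):
        # Auxiliary branches; their outputs also feed the next HoVer-Trans block.
        z_h = self.trans_h(z_h)
        z_v = self.trans_v(z_v)
        # Eq. (12): z_h2v = Trans(Trans(z_h + z_h2v_prev) + z_v), and its mirror.
        z_h2v = self.trans_h2v[1](self.trans_h2v[0](z_h + z_h2v) + z_v)
        z_v2h = self.trans_v2h[1](self.trans_v2h[0](z_v + z_v2h) + z_h)
        return z_h, z_v, z_h2v, z_v2h

# Usage: four token streams of shape (batch, num_tokens, dim).
block = HoVerBlockSketch(dim=96)
streams = [torch.randn(2, 64, 96) for _ in range(4)]
z_h, z_v, z_h2v, z_v2h = block(*streams)
```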
4) Conv Block: The transformer is good at processing sequential data and extracting spatial correlations, but it lacks inductive bias. To leverage the strengths of both the transformer and the CNN, we introduce a convolutional block (Fig. 2 (c)) after the last HoVer-Trans block of each stage to fuse the H2V and V2H features and to introduce inductive bias. The Conv block consists of three convolutional layers. The first convolutional layer has a 1 × 1 kernel and doubles the number of channels. The second convolutional layer uses a 3 × 3 kernel. The third convolutional layer, again with a 1 × 1 kernel, compresses the number of channels to fit the next stage. The features from the two main branches (H2V and V2H) are reshaped into 2D feature maps and concatenated before being processed by the convolutional block, i.e., Conv(Concat(R(z^{s}_{h2v}), R(z^{s}_{v2h}))), where R(·) reshapes the 1D feature vectors into 2D feature maps.

The overall structure of our model is shown in Fig. 2. The model consists of four stage modules. Each stage module consists of several HoVer-Trans blocks, one Conv block and one pooling layer. Given a BUS image I ∈ R^{H×W×3}, we first use a convolutional stem [37] to downscale the image to H/4 × W/4 × C. The resolutions of the feature maps in the next three stages are H/8 × W/8 × 2C, H/16 × W/16 × 4C and H/32 × W/32 × 8C, which is similar to the structure of traditional convolutional neural networks [38], [39]. To fuse the horizontal and vertical information, a Conv block is introduced to connect two adjacent stages, so the input of each stage is a 2D image or 2D feature maps, and embedding or flattening is applied to fit the input of the transformer. In the last stage, a fully connected layer is applied for inference. The model is optimized by the cross-entropy loss.

We use Python 3.6 and PyTorch 1.8 to implement all the models. All the experiments are run on an 11 GB NVIDIA GPU. We train for 250 epochs with the AdamW [40] optimizer, a batch size of 32, a weight decay of 0.1, 10 warm-up epochs and an initial learning rate of 0.0001 with a cosine decay learning rate scheduler. The augmentation strategy includes blurring, noise, horizontal flipping, and brightness and contrast adjustment. Because the order of the tissue layers is fixed, we do not use vertical flipping for data augmentation. All images are resized to 256 × 256.

In this paper, we use three datasets to evaluate the diagnostic performance of our model, two of which are public datasets and one of which is our own constructed dataset. The first public dataset is a small dataset named UDIAT [9], which contains a total of 163 BUS images, with 109 images of benign lesions and 54 images of malignant lesions. All the images were collected from the UDIAT Diagnostic Centre of the Parc Tauli Corporation, Sabadell, Spain. The average size of the images is 760 × 570 pixels, with sizes ranging from 307 × 233 to 791 × 641. The second dataset, BUSI [8], consists of a total of 780 BUS images from the Baheya Hospital for Early Detection and Treatment of Women's Cancer, Cairo, Egypt. The BUSI dataset includes 437 images with benign lesions, 210 images with malignant lesions, and 133 normal BUS images without lesions. Furthermore, each image has a pixel-level ground truth of the lesion. Since we only differentiate malignant and benign lesions in this paper, the normal BUS images are excluded. In total, 647 images are utilized, with an average resolution of 608 × 494 pixels and sizes ranging from 190 × 335 to 916 × 683. In this study, we also construct a publicly available dataset of BUS images, GDPH&GYFYY, for breast cancer diagnosis. All of the images are labeled as benign or malignant.
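As an illustration of the preprocessing described above (resize to 256 × 256; blur, noise, horizontal flipping and brightness/contrast augmentation; no vertical flipping because the layer order is fixed), here is a hedged sketch using torchvision. The folder layout, jitter strengths and noise level are illustrative assumptions, not the released dataset format or the authors' exact settings.

```python
import torch
from torchvision import datasets, transforms

# Training-time augmentation; vertical flips are deliberately omitted because
# the anatomical layers always appear in a fixed top-to-bottom order in BUS.
train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    # Simple additive Gaussian noise as a stand-in for the "noise" augmentation.
    transforms.Lambda(lambda x: (x + 0.01 * torch.randn_like(x)).clamp(0.0, 1.0)),
])

eval_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# Hypothetical layout: <root>/benign/*.png and <root>/malignant/*.png.
train_set = datasets.ImageFolder("bus_dataset/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=4)
```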
A. Experimental Setting. 1) Evaluation Metrics: To comprehensively evaluate the performance of the proposed model, we report the following metrics: area under the ROC curve (AUC), accuracy (ACC), specificity, precision, recall and F1 score. 2) Competitors: In this paper, we compare our proposed model with six state-of-the-art (SOTA) models, including two of the most popular CNN-based classification models, ResNet50 [38] and VGG16 [39]; two vision transformer models, ViT [31] and TNT-s [41]; and two CNN-based models tailored for breast cancer diagnosis in ultrasound images, JBHI2020 [42] and MICCAI2020 [6]. Although JBHI2020 and MICCAI2020 are both designed for breast cancer diagnosis, they differ slightly from our proposed model: neither processes the entire ultrasound image. These two models first require a pre-defined region of interest (ROI) of the mass and then classify the malignancy of the corresponding lesion. Furthermore, JBHI2020 uses an additional BI-RADS score in the training phase. Therefore, for the two existing public datasets, UDIAT and BUSI, the quantitative performance of these two models is taken directly from the corresponding papers. On the GDPH&GYFYY dataset, we only compare our proposed model with the other four image classification baseline models, due to the lack of lesion ROIs and BI-RADS scores. In all the experiments, we use five-fold cross-validation to evaluate the models.

Table I shows the quantitative results on the three datasets. On the UDIAT dataset, since MICCAI2020 and JBHI2020 classify the lesion within a given ROI, their task is much easier than classifying the lesion without an ROI. Among the other five models, VGG16 and our model achieve promising classification results even though there are only 163 BUS images. On the BUSI dataset (647 images), our proposed model achieves the best ACC of 0.855, precision of 0.867 and F1-score of 0.758. A larger dataset with more training samples allows the neural network models to learn better feature representations. The performance of our model on the BUSI dataset is even comparable to that of the ROI-based model JBHI2020. On our GDPH&GYFYY dataset (2411 images), we only compare with the four ROI-free baseline models due to the lack of lesion ROIs. As can be seen, the proposed HoVer-Trans achieves the best classification performance in terms of AUC, ACC, precision, recall and F1-score. With enough training data, the advantage of the anatomy-aware design is demonstrated.

Besides the quantitative results, we also show where the models focus via heatmaps to evaluate their interpretability (on the GDPH&GYFYY dataset). Visualizations of the two CNN-based models, ResNet50 and VGG16, are shown in Fig. 5 (c)&(d). ResNet50 suffers from the same problem as the transformer-based models. VGG16 achieves better visualization results than the previous three models, but it also pays attention to the dark areas caused by signal attenuation. Our proposed model, shown in Fig. 5 (b), gives the best visualization results, with more accurate lesion locations and more focused attention, and achieves the best F1-score of 0.911. Fig. 6 shows more heatmaps of the HoVer-Trans model on GDPH&GYFYY.
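The extracted text does not specify how these heatmaps are produced; for a hybrid transformer-CNN classifier, a common choice is a Grad-CAM-style map computed on the activations of a late convolutional layer. A minimal hedged sketch follows, where model and target_layer (e.g., the Conv block of the last stage) are placeholders for whatever implementation is actually used.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Grad-CAM-style heatmap for one image tensor of shape (1, 3, H, W)."""
    store = {}
    fwd = target_layer.register_forward_hook(
        lambda m, i, o: store.update(act=o))
    bwd = target_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(grad=go[0]))
    try:
        model.eval()
        logits = model(image)                        # (1, num_classes)
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1))
        model.zero_grad()
        logits[0, class_idx].backward()

        act, grad = store["act"], store["grad"]      # (1, C, h, w)
        weights = grad.mean(dim=(2, 3), keepdim=True)   # GAP of the gradients
        cam = F.relu((weights * act).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0]                             # (H, W), values in [0, 1]
    finally:
        fwd.remove()
        bwd.remove()
```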
In this part, we conduct several ablation studies to evaluate the effectiveness of the anatomy-aware formulation, the association between the transformer and the CNN, and different transformer configurations.

1) Effectiveness of Anatomical Prior Knowledge: We conduct this experiment to evaluate the effectiveness of the anatomy-aware formulation. Three variants are introduced. 1) The H branch with horizontal strip embedding and the V branch with vertical strip embedding are removed, leaving only the two main branches with patch embedding; this variant is named Model_p. 2) We remove the H branch and retain the other three branches, named Model_{p+v}. 3) We remove the V branch and retain the other three branches, named Model_{p+h}. The upper part of Table II shows the five-fold cross-validation results. It can be observed that without the HoVer design (Model_p), all the metrics decrease by around 1%-3%. When only the H branch or the V branch is removed, the quantitative results do not improve, due to the asymmetry of the models. Fig. 7 (b)-(d) show the heatmap visualizations of the three model variants, and Fig. 7 (g) shows our results. In Fig. 7 (b), we can observe that associating the transformer with the CNN yields visually more convincing attention maps than the transformer models alone, shown in Fig. 5 (e)&(f). However, the model's focus is still imperfect due to the lack of the H and V branches. When equipped with the H or V branch, as in Fig. 7 (c)&(d), the model can focus on the anomalous regions horizontally or vertically, guided by the anatomical prior. Thanks to the complete HoVer design shown in Fig. 7 (g), our proposed model achieves the best visualization results, with the most accurate lesion localization and attention.

2) Effectiveness of Conv Block: In this study, we utilize the convolutional block to connect two adjacent HoVer-Trans stages in order to introduce multi-scale representation and inductive bias. To evaluate the advantage of associating the transformer with the CNN, we compare our model with two variants. 1) We remove the Conv block and replace it with an average pooling layer for feature dimension reduction (w/o Conv). 2) We replace the Conv block with a single 1 × 1 convolutional layer. The quantitative results are shown in the middle part of Table II, and the attention maps are shown in Fig. 7 (e)-(g). Without the convolutional layers, we observe decreased quantitative performance, and the attention maps completely lose focus. Introducing a 1 × 1 convolutional layer slightly improves the classification performance, but the attention maps are still unsatisfactory (Fig. 7 (f)). With the interaction of the horizontal and vertical anatomy-aware formulation, the proposed HoVer-Trans model achieves the best classification results and the most interpretable attention maps.

3) Sizes of Different Embedding Ways: Since we introduce three embedding schemes to formulate the anatomical structure, in this ablation study we further explore how their sizes affect the proposed model, as shown in Table II. p = 2 means that the visual tokens of the patch embedding have a resolution of 2 × 2. h&v = 2 means that the tokens of the horizontal strip embedding and the vertical strip embedding have resolutions of 2 × width and height × 2, respectively. In this experiment, we let p ∈ {2, 4, 8} and h&v ∈ {1, 2, 4} and test all combinations. It can be observed that the classification performance of all the configurations is close.
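To make the token geometry of this ablation concrete, the short sketch below counts the tokens each embedding would produce for a 256 × 256 input; it assumes, for illustration only, that the embeddings operate on the H/4 × W/4 stem output (a 64 × 64 feature map), which is not stated explicitly in the text here.

```python
feat = 256 // 4  # assumed 64 x 64 feature map after the convolutional stem
for p in (2, 4, 8):
    print(f"p={p}:   {(feat // p) ** 2} patch tokens")
for s in (1, 2, 4):
    print(f"h&v={s}: {feat // s} horizontal + {feat // s} vertical strip tokens")
```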
According to the quantitative results in Table II, we select the proper token size of the horizontal and vertical strip embeddings h&v as follows. When h&v = 1, each token only contains the horizontal or vertical information of a strip with a width or height of only one pixel, which is too limited. When h&v = 4, the token size is too large, which may occupy more computational resources. Therefore, we let h&v = 2 for the horizontal and vertical strip embeddings. For the token size of the patch embedding, we choose p = 2 so that the patch tokens match the token size of the horizontal and vertical strip embeddings. Experimental results confirm that the configuration with p = 2 and h&v = 2 achieves the best classification performance.

In this paper, we propose a novel HoVer-Trans model, which associates the transformer with the CNN, for breast cancer diagnosis in breast ultrasound images. An anatomy-aware HoVer-Trans block is designed to formulate the anatomical prior knowledge in BUS images. To achieve this, we incorporate three embedding schemes, patch embedding, horizontal strip embedding and vertical strip embedding, to explore the spatial correlations of the inter-layer and intra-layer visual words. The above technical designs have several advantages. 1) The proposed model is ROI-free and does not require a pre-defined lesion ROI; this property greatly improves the model's flexibility in clinical practice. 2) The proposed model can provide interpretable attention maps to support its predictions, which is what most sonographers care about when using AI algorithms to assist decision-making. 3) The proposed model achieves the best classification performance against several SOTA models in both quantitative evaluations and heatmap visualizations. Due to the model complexity, the proposed model shows poor classification performance when trained on a smaller dataset, such as UDIAT. That is also why we construct and release a larger dataset, GDPH&GYFYY, for breast cancer diagnosis in BUS images. We are also planning to construct a much larger multi-center dataset to further explore the capacity and the generalizability of the proposed model.
References
[1] Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries.
[2] Impact of screening on breast cancer mortality: the UK program 20 years on.
[3] Breast cancer screening and diagnosis, version 3.2018, NCCN clinical practice guidelines in oncology.
[4] Prospective assessment of breast cancer risk from multimodal multiview ultrasound images via clinically applicable deep learning.
[5] Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams.
[6] Multi-scale gradational-order fusion framework for breast lesions classification using ultrasound images.
[7] A combined ultrasonic B-mode and color Doppler system for the classification of breast masses using neural network.
[8] Dataset of breast ultrasound images.
[9] Automated breast ultrasound lesions detection using convolutional neural networks.
[10] Methods for the segmentation and classification of breast ultrasound images: a review.
[11] Deep-learning-based computer-aided systems for breast cancer imaging: a critical review.
[12] A survey of visual transformers.
[13] Radiomics: images are more than pictures, they are data.
[14] Deep learning.
[15] Deep learning radiomics can predict axillary lymph node status in early-stage breast cancer.
[16] Lymph node metastasis prediction from primary breast cancer US images using deep learning.
[17] Deep learning radiomics of ultrasonography: identifying the risk of axillary non-sentinel lymph node involvement in primary breast cancer.
[18] Preoperative ultrasound-based radiomics score can improve the accuracy of the Memorial Sloan Kettering Cancer Center nomogram for predicting sentinel lymph node metastasis in breast cancer.
[19] Predicting HER2 status in breast cancer on ultrasound images using deep learning method.
[20] Deep learning with convolutional neural network in the assessment of breast cancer molecular subtypes based on US images: a multicenter retrospective study.
[21] Improving classification performance of breast lesions on ultrasonography.
[22] Breast mass classification in sonography with transfer learning using a deep convolutional neural network and color conversion.
[23] Convolutional neural networks based classification of breast ultrasonography images by hybrid method with respect to benign, malignant, and normal using mRMR.
[24] Computer-aided diagnosis of breast ultrasound images using ensemble learning from convolutional neural networks.
[25] Breast ultrasound lesion classification based on image decomposition and transfer learning.
[26] FMRNet: A fused network of multiple tumoral regions for breast tumor classification with ultrasound images.
[27] Saliency map-guided hierarchical dense feature aggregation framework for breast lesion classification using ultrasound image.
[28] Automated diagnosis of breast ultrasonography images using deep neural networks.
[29] Attention is all you need.
[30] A survey of transformers.
[31] An image is worth 16x16 words: Transformers for image recognition at scale.
[32] Swin transformer: Hierarchical vision transformer using shifted windows.
[33] Transformers in medical imaging: A survey.
[34] COVID-ViT: Classification of COVID-19 from CT chest images based on vision transformer models.
[35] TransMIL: Transformer based correlated multiple instance learning for whole slide image classification.
[36] MIL-VT: Multiple instance learning enhanced vision transformer for fundus image classification.
[37] Early convolutions help transformers see better.
[38] Deep residual learning for image recognition.
[39] Very deep convolutional networks for large-scale image recognition.
[40] Adam: A method for stochastic optimization.
[41] Transformer in transformer.
[42] Using BI-RADS stratifications as auxiliary information for breast masses classification in ultrasound images.