key: cord-0331456-6erzk6zu authors: Li, Q.; Zhu, Y.; Chen, M.; Guo, R.; Hu, Q.; Deng, Z.; Deng, S.; Wen, H.; Gao, R.; Nie, Y.; Li, H.; Zhang, T.; Chen, J.; Shi, G.; Shen, J.; Cheung, W. W.; Guo, Y.; Chen, Y. title: Automatic detection of pituitary microadenoma from magnetic resonance imaging using deep learning algorithms date: 2021-03-05 journal: nan DOI: 10.1101/2021.03.02.21252010 sha: 9d018b97d84e1f2e029c47789efc76a3ac9c1128 doc_id: 331456 cord_uid: 6erzk6zu Pituitary microadenoma (PM) is often difficult to detect by MR imaging alone. We employed a computer-aided PM diagnosis (PM-CAD) system based on deep learning to assist radiologists in clinical workflow. We enrolled 1,228 participants and stratified into 3 non-overlapping cohorts for training, validation and testing purposes. Our PM-CAD system outperformed 6 existing established convolutional neural network models for detection of PM. In test dataset, diagnostic accuracy of PM-CAD system was comparable to radiologists with > 10 years of professional expertise (94% versus 95%). The diagnostic accuracy in internal and external dataset was 94% and 90%, respectively. Importantly, PM-CAD system detected the presence of PM that had been previously misdiagnosed by radiologists. This is the first report showing that PM-CAD system is a viable tool for detecting PM. Our results suggest that PM-CAD system is applicable to radiology departments, especially in primary health care institutions. Prevalence of pituitary adenomas (PA) ranges from 1 in 865 to 1 in 2688 adults. PA may hypersecrete hormones or cause mass effects, which result in various clinical symptoms, including infertility, diabetes insipidus and hypopituitarism [1] . The end of the one-child policy and social shifts in China have resulted in the increasing demand for fertility treatments. Pituitary tumors, although usually benign, can inhibit the production of follicle-stimulating hormone (FSH) or luteinizing hormone (LH) and cause infertility. 
Functional PA have been found in many infertile patients, the most common is prolactinoma. Approximately 50% of all pituitary adenomas and 90% of prolactinomas are microadenomas, that rarely increase in size [2, 3] . Timely diagnosis and follow-up of pituitary microadenoma (PM), especially functional PM, is particularly important [4] . Due to its relatively small in size and variable anatomical structure among individuals, the diagnosis of PM is not easy by applying the technique of magnetic resonance imaging (MRI) alone [1, 5] . Manual analysis of MRI data is usually difficult, biased and time-consuming, and the diagnostic accuracy is closely related to the radiologist's experience. Therefore, MRI-based diagnosis for PM needs to be improved [6, 7] . The increase in the demand for radiologist is an inevitable trend, but the supply of radiologist has not increased proportionally [8] . A shortage of radiologists restricts the continuity in radiology services and causes a delay in diagnosis, compromising the overall quality of service to patients [8, 9] . Recently, we have encountered several cases of misdiagnosed PM in our hospital (Fig 1) . Deep learning has dominated various computer vision areas since 2012 due to its effective representation capability [10] . The development of convolutional neural network (CNN) has significantly improved the performance of image classification and object detection [11] . Deep learning has the potential to revolutionize disease diagnosis and management by improving the diagnostic accuracy and reducing the workload of clinicians. Specifically, CNN has achieved great progresses in the diagnosis of breast cancer [12, 13] , diabetes retinopathy [14, 15] , fibrotic lung disease [16] and COVID-19 [17] . Furthermore, reports demonstrated that computer-aided diagnosis (CAD) system can accurately diagnose patients with PA from MR images [18] [19] [20] . However, the information for CAD-based diagnosis of PM is limited [5, 6] . 
In this work, we have constructed a computer-aided PM diagnosis (PM-CAD) system based on deep learning and aimed to provide an accurate and timely diagnosis of PM from pituitary MR images. Our current research in applying deep learning algorithms to aid in the diagnosis of PM was prompted by several misdiagnosed cases in our hospital. A female patient in her 40's suffered from diabetes, osteoporosis and recurrent cellulitis. Laboratory examination and functional test were consistent with the presentation of Cushing's disease (Tab S1). The patient underwent three times of MRI scanning over a period of 20 months. However, the radiologists did not detect any pituitary adenoma (Fig 1a) . Nevertheless, this patient was given bromocriptine and cyproheptadine for 2 years, but the magnitude of the disease was not under effective control. Subsequently, the inferior petrosal sinus sampling (IPSS) was performed on this patients and a functional microadenoma was finally detected in her right pituitary gland. We performed another MRI examination on this patient and were able to detect 2 pituitary microadenomas (with diameters of 5 mm and 3 mm, respectively) (Fig 1a) . One week later, we performed transsphenoidal PM resectioning on this patient. Shortly after the completion of the operation, the serum cases of PM previously misdiagnosed by radiologist. We utilized the MRI dataset from 3 clinically misdiagnosed PM cases. Specifically, these 3 patients underwent surgical treatment (2 cases of Cushing's disease and 1 case of TSH secreting adenoma). The presence of PM on those patients was confirmed by a subsequent pathological evaluation of the extracted tissue samples as well as other relevant clinical information (Fig 3b and Tab S1). To our delight, PM-CAD system was able to detect the presence of PM in all 3 previous misdiagnosed cases. We used both internal and external dataset to test the robust generalization performance of our PM-CAD system. 
The performance of our system was further tested in additional 150 patients from 3 different hospitals. Six general radiologists from each hospital were recruited (2 radiologists with < 5 years professional experience, 2 radiologists with 5 -10 years professional experience while additional 2 radiologists with > 10 years professional experience). We observed that the accuracy of respective PM diagnosis was positively correlated to the radiologist's professional experience as well as the time allocated for each image reading (Fig 4a, b) . The diagnostic accuracy achieved by radiologists with professional experience >10 years was above 90%. Meanwhile, the diagnostic accuracy was higher than 88% when the MRI reading time for each patient was over 45 seconds. In contrast, the diagnosis accuracies achieved by PM-CAD system in internal hospital was 94 % and in external hospitals were 90 % and 88 %, respectively (Fig 4a) . The results in the external datasets are very encouraging, we showed that the diagnosis performance of our PM-CAD system is comparable to radiologists with 5-10 years of professional experience and has negligible of time cost. Deep learning has been used to diagnose various diseases [15] [16] [17] . However, existing CNN based pituitary diagnosis models mainly focus on pituitary macroadenoma [18] [19] [20] . Our work provides the first evidence in applying CNN model for pituitary microadenoma diagnosis. We showed that our PM-CAD system outperforms 6 existing established CNNs-based models (i.e., GoogLeNet [21] , 3D-CNN [22] , VGG [23] , ResNet [24] , DenseNet [25] and ResNeXt [26] ) in diagnosis of PM, achieves satisfactory predictions on the validation dataset, with an accuracy of 94.36%, a sensitivity of 96.97%, and a specificity of 93.02%. Diagnosis of PM requires its fine-grained feature in MR images. However, the aggressive downsampling of most modern 2D CNN models harms the forward propagation of fine-grained feature [27] . 
Furthermore, a large training dataset is required to train neural network parameters [28]. To address the issues faced by fine-grained feature learning and overfitting, we introduced an improved backbone and an attention module (further information is listed in Supplementary information for Materials & Methods, section B, microadenoma diagnosis model) to our PM-CAD system. We showed that our PM-CAD system is more suitable for microadenoma diagnosis than 6 existing CNN models. We also showed that the PM-CAD system outperformed 5 general radiologists in PM diagnosis. The weighted error achieved by PM-CAD was 10.00%, which was much lower than that of the radiologists (26.67%). The AUC achieved by PM-CAD system was 95.56% and was comparable to that of our radiologist with > 10 years of professional expertise (Fig 2a). By comparing the negative and positive likelihood ratios, we showed that the PM-CAD system achieved better diagnosis performance, with a higher positive diagnosis rate and a lower false-negative rate than radiologists. To further confirm the diagnosis performance of PM-CAD system, three double positive cases (verified by PM-CAD system and radiologists) and three false-negative cases (misdiagnosed by radiologists) were selected. The diagnosis accuracy of PM-CAD system on false-negative cases was 100%, which was confirmed by a subsequent pathological examination. The three false-negative cases were hormone-producing microadenomas with small size and irregular morphology, which were difficult to detect by using MRI alone.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. This version posted March 5, 2021. medRxiv preprint doi: https://doi.org/10.1101/2021.03.02.21252010
The current MRI diagnostic sensitivity for hormone-producing primary and recurrent pituitary microadenomas was only 47% and 39% respectively [5] . Our PM-CAD system demonstrates the excellent diagnosis performance in detecting those hormone-producing PM. For this study, the MR images were collected from different MRI machines. It is important that our PM-CAD system achieves a robust generalization performance on images collected from different MRI machines. Images from 3 additional hospitals (1 internal and 2 external datasets) have been used to test the generalization diagnostic performance of PM-CAD system. The diagnostic accuracy achieved by the PM-CAD system in 3 hospitals were 94%, 90%, and 88%, respectively (Fig 4a) . The internal test dataset and the training dataset have the same MR images sources, whereas the images in the external test dataset were generated with different MRI machines from different hospitals. Previous results suggest that CNN model performs better in the internal dataset than external dataset [29, 30] . Similar to those previous observations, we showed that PM-CAD system performs better on our internal test dataset than external dataset. Nevertheless, the diagnostic accuracy of PM-CAD system on external datasets was 90% and 88%, comparable to the diagnostic accuracy of radiologists with 5-10 years professional experience. We showed that our PM-CAD system could achieve robust generalization performance. The techniques presented in this study may potentially be applied to other hospitals in assisting for the diagnosis of PM. MRI techniques for detection and depiction of pituitary adenomas have witnessed rapid evolution ranging from the non-contrast MRI scans to thin section contrast MRI scans [7, 31] . Modern MRI device has led to a rise in detection of PM as it allows evaluation of sella and perisellar lesions with high soft tissue contrast and excellent anatomical resolution [32, 33] . 
However, it is worth noting that accurate diagnosis of PM still relies largely on the radiologist's professional experience. As shown in Figure 4, the radiologists' diagnostic accuracy for PM increases with professional experience as well as with the reading time allocated per image. In this respect, PM-CAD system has a distinct advantage in reading time and stability. This work provides the first deep learning system for the diagnosis of PM. Our PM-CAD system provides a diagnostic accuracy comparable to experienced radiologists with marginal deployment costs. We acknowledge that there are several limitations for the current investigation. First, this is a retrospective multi-center study. Further validation with larger, prospective studies is needed for clinical applications. Second, although a total of 1,228 participants have been used for training and testing of our deep learning system, the dataset is still insufficient. The robustness and accuracy of deep learning models can be further increased with more training data. In conclusion, results from this investigation have highlighted the potential applications of deep learning in the diagnosis of patients with PM. With the rapid development of computing power, deep learning algorithms may surpass the current gold standard for diagnosis. Machine learning for the diagnosis of PM will serve as an important component in improving patient care and outcomes. These data were randomly divided into training, validation or test sets for further analysis. In the training and validation datasets, all images presenting PM were selected by 4 general radiologists (> 5 years of professional experience) and reviewed by 2 neuroradiologists. All images of the coronal dynamic enhancement T1WI sequence were used for the test A, B and C datasets without additional human intervention.
MRI was performed with a 1.5 or 3.0 T MRI unit (GE or Philips) in the head-first supine position, with a repetition time/echo time of 380 ms/12.5 ms and 1 or 3 mm-thick sections. We excluded MR images without a complete pituitary scan or with too many MRI artifacts. Six endocrinology fellows were involved in collecting patient clinical information, and the dataset was reviewed by 2 endocrinologists. Among PM patients, there were 391 cases of non-functional pituitary microadenomas and 152 cases of functional pituitary microadenomas (Fig S3). The MR images and relevant clinical information were retrieved from the digital databases of the respective hospitals. All participants were split into 5 parts (the training dataset, the validation dataset, and the test A, B and C datasets). The workflow diagram for the overall experimental design is illustrated in Fig 5. The training dataset was used to train the deep learning models. It consists of 780 participants, including 274 PM patients and 506 normal controls, from the Third Affiliated Hospital of Sun Yat-Sen University. The validation dataset was used to tune the hyper-parameters (e.g., learning rate, number of training epochs) of the models and select the best model. It consists of 195 participants from the same hospital as the training dataset, including 66 PM patients and 129 normal controls. The test dataset was divided into three parts, namely Test A, B and C, for the diagnosis performance comparison between radiologists and PM-CAD system. Test A consists of 100 participants (50 PM patients and 50 control subjects) from hospital 1 and has been used to compare the PM diagnosis performance of PM-CAD system and general radiologists.
There is no overlap among images in the training and validation datasets. Six radiologists were recruited for this study. Radiologists 1 and 2 have < 5 years of professional experience, Radiologists 3 and 4 have 5 to 10 years, and Radiologists 5 and 6 have > 10 years. Each radiologist read the MR images of 100 participants independently in 50 minutes (about 30 seconds for each MRI). In test B, we tested the diagnosis performance of our PM-CAD system on 3 misdiagnosed cases of PM. Test C has been used to evaluate the generalization ability and stability of the PM-CAD system on MRI scans from three hospitals (hospital 1 was used to form the internal dataset, hospitals 2 and 3 were used to form the external dataset), and each hospital provided 50 cases of PM images. Six general radiologists were also recruited to read MR images, and the diagnostic accuracy of those radiologists was evaluated. The pipeline of our PM-CAD system is shown in Figure 6. Specifically, each section of an MR image was first diagnosed by our CAD system. The MRI scan was then determined as normal if all sections of this MR image were diagnosed as non-microadenoma. In the pituitary detection model, the Faster R-CNN framework [34] was employed to locate the pituitary region in MR images. The input MR image was processed by this model to generate classification and regression maps, which were further used to extract the pituitary bounding box in the MR image. Table 1. Comparison of the performance of PM-CAD system versus 6 established CNN models.
We evaluate the performance of different models by using a series of metrics including the area under the receiver operating characteristic curve (AUC) score (which is independent of the probability threshold), F1-score, accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). Further information for each specific metric is listed in Supplementary information for Materials and Methods, B.3. Experiment - Evaluation metric. Although the pituitary occupies a small region of MR images, radiologists can diagnose microadenoma based on the appearance of such a small region. On the other hand, directly feeding the whole MR images into a deep learning based diagnosis method imposes a huge challenge since non-pituitary regions will hinder microadenoma diagnosis significantly. Therefore, we employ a detection model to locate the pituitary region before diagnosis. The design of the detection model potentially benefits the subsequent task in two aspects: (1) it enhances the pituitary microadenoma (PM) feature by discarding irrelevant regions, and promotes the microadenoma diagnosis performance since the designed detection model can help the microadenoma diagnosis process focus on the pituitary region; (2) it reduces the overfitting problem of our microadenoma diagnosis model with a limited amount of data. A.1. Architecture. Given several consecutive MR images at the same anatomical section of the brain from the coronal dynamic enhancement T1-weighted imaging (T1WI) sequence of an MRI scan, our pituitary detection model aims at locating the pituitary in each image. The pituitary detection model is built upon the Faster R-CNN [1] framework, and mainly consists of three parts: ResNet-50 FPN [2] extracts the multi-scale features from each image, Region Proposals & Detection produces the bounding box of the pituitary from the multi-scale features, and Post-processing refines the detection results (see Fig 1). We use ResNet-50 FPN as our backbone to efficiently extract multi-scale features from each MR image. Specifically, ResNet-50 [3] is employed to process the input MR image, and four feature maps from conv2_3, conv3_4, conv4_6 and conv5_3 are further used as the input for the feature pyramid network (FPN). These four feature maps have 4×, 8×, 16× and 32× strided resolution compared to the input image. FPN aggregates the above four feature maps, and produces four feature maps at different scales (i.e., feature_x where x ∈ {1, 2, 3, 4}). The four feature maps are first processed by a 1×1 convolutional layer so that they have the same number of channels. We denote the processed four feature maps as P_1, P_2, P_3 and P_4. The feature_4 is obtained by upsampling P_4 with nearest interpolation. Each of the remaining feature maps feature_x is obtained by upsampling the combination of feature_(x-1) and the lateral feature map P_x. Therefore, four multi-scale feature maps are used as the input for the following part. In this part, we process each feature map from ResNet-50 FPN independently, and several candidate bounding boxes indicating the pituitary are produced. Specifically, given a feature map feature_x, two branch layers (each with one convolutional layer) are used to generate two feature maps. One branch is used to classify whether the anchor contains an object, and the other branch is used to regress the size (i.e., height, width) and position (i.e., x, y) from the default anchor. We obtain the bounding box proposals from these two feature maps, using the non-maximum suppression (NMS) [1] algorithm to discard highly overlapping bounding boxes. Then, RoI pooling [1] is performed to aggregate the features of feature_x within the bounding box proposals. The aggregated feature is further used to refine the classification and regression results, and produce several candidate bounding boxes of the pituitary with confidence scores. Each MR image has at most one pituitary.
Therefore, from all candidate bounding boxes, we select the bounding box which has the highest confidence score. In order to reduce the false-positive rate of pituitary detection, we filter out the bounding box if its confidence score is smaller than a threshold fixed in our experiments. Among all MR images, the above procedure might fail to locate the pituitary in some images. To handle this problem, we refine the pituitary detection result by post-processing. Specifically, we average all the positions and sizes of the detected bounding boxes, which can be considered as a mean bounding box. Because the position and size of the pituitary do not change much across MR images, we assign the mean bounding box to any MR image in which detection was missed. We collect 666 MRI scans (performed on GE and Philips machines) from the Third Affiliated Hospital of Sun Yat-Sen University. 2718 MR images with pituitaries from the coronal dynamic enhancement T1WI sequence of MRI are selected. A neuroradiologist annotates each image with a bounding box indicating the pituitary region. We randomly split all MRI scans into a pituitary detection training set with 532 scans (2167 images) and a pituitary detection validation set with 134 scans (551 images). As the evaluation metrics, we use AP@0.50 and AP@0.75 from the COCO Detection Benchmark [4]. AP@0.50 and AP@0.75 stand for the Average Precision (AP) with intersection over union (IOU) thresholds of 0.5 and 0.75, respectively. The pituitary detection model is implemented with PyTorch, and trained on a computer with an NVIDIA TITAN RTX GPU. We use the SGD optimizer with a momentum of 0.9 for training.
The initial learning rate is set to 0.005. A warm-up strategy is adopted, which linearly increases the learning rate from 0.001 to the initial value over the iterations of the first epoch. Then, the learning rate is decayed by a factor of 0.1 every 3 epochs. The weight decay is set to 0.0005. As preprocessing, we resize the input images to 256×256 and normalize them into (0,1) based on the window level (WL) and window width (WW) set by the neuroradiologist. Other configurations (e.g., loss function, default anchor setting) are the same as in [1]. We train the pituitary detection model for 20 epochs with a batch size of 16. Table 1 shows the pituitary detection performance on MRI achieved by our method. On the training set, our method achieves 0.9884 and 0.9078 in terms of AP@0.50 and AP@0.75, respectively. The performance of pituitary detection on the validation set is close to that on the training set. That is, our model achieves a good generalization capability on unseen data. Moreover, our detection model can accurately locate the pituitary region. When 0.5 is used as the IOU threshold to determine the success of pituitary detection, our method achieves an average precision of 97.83% over different levels of recall. Figure 2 shows the prediction results of our pituitary detection model and the ground-truth labels of several MR images on the validation set. The results demonstrate that our method can produce accurate prediction bounding boxes with high overlaps with the ground-truth. Figure 2. Some examples of pituitary detection on the validation set. Our method can produce accurate prediction bounding boxes (represented as green boxes) with high overlaps with the ground-truth (represented as purple boxes). The confidence scores produced by our pituitary detection model are also shown (i.e., the numbers near the green boxes).
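The post-processing step of the pituitary detection model (keep the highest-scoring box per image, drop it if its confidence is below a threshold, and fill missed images with the mean bounding box) can be illustrated with a minimal sketch. This is not the authors' code: the function name `postprocess_detections`, the `(x, y, w, h, score)` tuple layout, and the default `score_threshold=0.5` are assumptions made for illustration, since the threshold value used in the experiments is not stated here.

```python
from statistics import mean


def postprocess_detections(candidates_per_image, score_threshold=0.5):
    """Pick one pituitary box per MR image and fill in missed detections.

    candidates_per_image: one list per MR image, each holding candidate
    boxes as (x, y, w, h, score) tuples (hypothetical layout).
    score_threshold: hypothetical value; the source does not state it.
    Returns one (x, y, w, h) box per image, or None where no box exists
    and no other image had a confident detection to average.
    """
    selected = []
    for candidates in candidates_per_image:
        # Each MR image has at most one pituitary: keep the top-scoring box.
        best = max(candidates, key=lambda box: box[4], default=None)
        # Discard low-confidence boxes to reduce false positives.
        if best is not None and best[4] < score_threshold:
            best = None
        selected.append(None if best is None else best[:4])

    detected = [box for box in selected if box is not None]
    if not detected:
        return selected  # nothing to average over

    # Mean bounding box: average position and size over all detections.
    mean_box = tuple(mean(box[i] for box in detected) for i in range(4))
    # Assign the mean box to every image where detection was missed.
    return [box if box is not None else mean_box for box in selected]
```

In this sketch the averaging is done over the images passed in one call, so the caller would group the consecutive MR images that are expected to share a similar pituitary position.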
Automatic diagnosis of PM with deep learning is highly challenging for three main reasons: (1) Low inter-class variance. Microadenoma occupies a very small part of the pituitary, and the difference between microadenoma and normal tissues mainly lies in fine-grained textures. Normal pituitaries and pituitaries with microadenoma are very similar in most parts, so it is very hard to identify microadenoma even for a radiologist. Consequently, the proposed model should be powerful enough to distinguish these minor differences. (2) High intra-class variance. PMs vary significantly in terms of location, size, and shape. Consequently, the proposed model should be sufficiently representative to adapt to these variations. (3) Limited training data. Deep learning techniques usually require a large amount of training data to avoid overfitting. However, the scarcity of PM MRI scans poses new challenges to the design of the architecture and the training techniques of neural networks. To handle these challenges, we propose a novel CNN (namely, PM-CAD) for PM diagnosis. In PM-CAD, we modify the ResNet architecture to preserve fine-grained features during forward propagation. Furthermore, an attention module is used to further improve the discriminativeness of the feature representation. To handle the overfitting problem, histogram matching normalization, intensity shift data augmentation and label-smoothing loss are used. Given several consecutive MR images at the same anatomical position of the brain from the coronal dynamic enhancement T1WI sequence of an MRI scan, our microadenoma diagnosis model aims at classifying them into normal or microadenoma.
In particular, all MR images are processed using intensity shift as data augmentation and histogram matching normalization, then patches which only contain the pituitary are cropped from each MR image based on the detection results from the pituitary detection model. Then, all patches are resized and stacked as an image with multi-channels. Finally, our proposed CNN-based model processes these stacked pituitary patches and classifies them into normal or microadenoma. In the following, we will first describe our CNN-based model with the improved backbone and an attention module, then highlight the intensity shift data augmentation and histogram matching normalization in pre-processing. Finally, we introduce the loss function used to train our CNN-based model. Most modern CNN models have two drawbacks when employing microadenoma diagnosis. (1) Aggressive downsampling. They use several max-pooling layers or strided convolutions to downsample the feature maps. For example, the ResNet series use a large kernel convolution with a stride of 2 followed by a max-pooling layer at the beginning, and 4 strided convolutional layers in the following. This strategy can efficiently reduce the computational cost and save the memory. However, fine-grained features which are important in PM classification might be lost in aggressive downsampling. (2) Large amount of trainable parameters. Modern CNN models tend to contain a large amount of trainable parameters. Therefore, a large training dataset is required to achieve a sufficient performance. However, the limited data of microadenoma MRI scan impose the over-fitting problem. A lightweight model with fewer parameters might be more suitable for microadenoma diagnosis. Therefore, we extend ResNet-18 from two aspects. First, we replace the large kernel convolution with 3×3 convolution at the beginning of the network. 
Second, we remove the max-pooling layer at the beginning and the last stages of convolution to preserve fine-grained features. Therefore, the conv4_2 feature map is obtained with an 8× strided resolution. Then a global average pooling (GAP) and a multi-layer perceptron (MLP) with softmax are used to produce the probability of PM. Experimental results show that this simple modification of ResNet-18 can effectively improve the performance of PM diagnosis. The vector at each position of the conv4_2 feature map is considered as the feature of a small patch from the input. Since the number of normal patches is larger than that of microadenoma patches, GAP performed directly on this feature map might cause the microadenoma feature to be overwhelmed by the normal feature. Therefore, we introduce an attention module to augment the microadenoma feature before GAP. Given a feature map $F$ of conv4_2, we employ learnable convolution layers to produce a soft mask $M = \phi_2(\phi_1(F))$, where $\phi_1$ and $\phi_2$ represent the 1×1 convolutional layers with 256 and 1 channels, respectively. This soft mask is further used to automatically select the prominent spatial areas of the feature map by pixel-wise multiplication as follows: $F' = M \odot F$. During training, this mask can learn to highlight the areas which have the discriminative features of microadenoma and suppress the irrelevant areas. Therefore, the augmented feature map $F'$ contains the areas which are helpful for distinguishing microadenoma from normal tissue. GAP is then applied to the augmented feature map to aggregate spatial features. Then, fully connected (FC) layers with softmax are used to produce the presence probability of microadenoma. Given several consecutive MR images with pituitary detection results, we first normalize them into (0,1) based on the window level (WL) and window width (WW). Then, data augmentation techniques such as Gaussian noise, translation, scaling, and the proposed intensity shift are performed during training.
Furthermore, the proposed histogram matching normalization is used to align the histograms of all MR images. Finally, the patches that contain only the pituitary are cropped from the MR images based on the detection result, and stacked as an image with multiple channels. In the following, we detail our intensity shift data augmentation and histogram matching normalization. To incorporate domain knowledge from neuroradiologists, we normalize each MR image using WL and WW. This domain knowledge can help the network to focus on discriminative regions by assigning irrelevant pixels as background, and makes the model converge faster. However, the normalization process decreases the generalization capacity across hospitals. Different machines usually have different default values for WL and WW. Since WL and WW are adjusted by neuroradiologists based on their personal experience, the MR images normalized by different neuroradiologists from different machines in different hospitals have different distributions. Therefore, a model trained with the data from one hospital is difficult to generalize to other hospitals. To handle this limitation, we propose an intensity shift data augmentation approach to make the model insensitive to different WL and WW settings. In particular, we randomly shift the intensity of non-background pixels in MR images by adding a value within (0,0.1). Histogram matching normalization (HM). The MR images are acquired from the same anatomical section of the brain. Therefore, the histograms of the MR images are similar. In order to make the features of microadenoma and normal tissue more discriminative, we align the histograms using the histogram matching algorithm. Specifically, we select the histogram from one of the MR images as the reference. Then, histogram matching is performed on the histograms of the other MR images to match the reference.
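The two pre-processing ideas described above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: images are plain nested lists of floats already normalized into (0,1), background pixels are assumed to have the value 0, one shift value is assumed to be drawn per image, and the function names (`intensity_shift`, `match_histogram`, `align_to_reference`) are hypothetical. Histogram matching is implemented here as a rank-based mapping between the empirical distributions of the two images, which is one common way to realize it.

```python
import bisect
import random


def intensity_shift(image, max_shift=0.1, background=0.0, rng=random):
    """Add one random offset in [0, max_shift] to all non-background pixels.

    Assumption: a single offset per image and background == 0.0.
    """
    delta = rng.uniform(0.0, max_shift)
    return [[v + delta if v > background else v for v in row] for row in image]


def match_histogram(source, reference):
    """Map each source pixel to the reference value at the same quantile."""
    src_sorted = sorted(v for row in source for v in row)
    ref_sorted = sorted(v for row in reference for v in row)
    n_src, n_ref = len(src_sorted), len(ref_sorted)
    matched = []
    for row in source:
        new_row = []
        for v in row:
            # Rank of v in the source image (number of pixels <= v).
            rank = bisect.bisect_right(src_sorted, v)
            # Index of the same quantile in the reference (integer math).
            idx = (rank * n_ref - 1) // n_src
            new_row.append(ref_sorted[idx])
        matched.append(new_row)
    return matched


def align_to_reference(images, ref_index):
    """Match every image in a group to the histogram of images[ref_index]."""
    reference = images[ref_index]
    return [
        img if i == ref_index else match_histogram(img, reference)
        for i, img in enumerate(images)
    ]
```

The reference index is left as a parameter, so the caller decides which image of a group of consecutive MR images serves as the reference histogram.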
In our experiments, we select the histogram of the 3rd of the five consecutive MR images as the reference.
The cross-entropy loss is commonly used in classification. Specifically, the cross-entropy loss in binary classification can be formulated as $L_{CE} = -[y\log p + (1-y)\log(1-p)]$, where $y$ and $p$ are the ground-truth label and the probability produced by the model. Usually, the probability is obtained by normalizing (e.g., with softmax) the logits from the last layer of the model. The cross-entropy loss drives the logit toward an infinite value. That is, it pushes the distance between the learned features of microadenoma and normal tissue to be large, which potentially leads to over-fitting on the training data. Motivated by [6], we use the label-smoothing loss $L_{LS} = -[\tilde{y}\log p + (1-\tilde{y})\log(1-p)]$ with $\tilde{y} = (1-\epsilon)y + \epsilon/2$, where $\epsilon$ is a small number; we set $\epsilon = 0.1$ in our experiments. The label-smoothing loss attains its optimum at a finite logit value, which encourages the learned features of microadenoma and normal tissue to be more compact.
In this subsection, we describe how we adapt our method to a full MRI scan. Usually, a coronal dynamic enhancement T1WI sequence of the MRI scan (DICOM) is used to diagnose PM. The images in a sequence can be divided into groups, and different groups focus on different anatomical sections of the pituitary. Each group contains several MR images, which are acquired at different enhancement times. Two neuroradiologists and one neurosurgeon examine all MR images of each pituitary section and diagnose whether a microadenoma is present in that section. The MRI scan is considered normal if no microadenoma is found in any pituitary section. Following the same idea, we adapt our PM-CAD to MRI scans to achieve end-to-end diagnosis without any human intervention.
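The scan-level rule just described (a scan is normal only if no section contains a microadenoma) amounts to a logical OR over per-section decisions. A minimal sketch follows; the function name `diagnose_scan` and the default threshold of 0.5 are placeholders chosen for illustration, not values from the source.

```python
def diagnose_scan(section_probabilities, threshold=0.5):
    """Return True if any pituitary section is classified as microadenoma.

    `section_probabilities` holds one model output per anatomical section;
    the scan is normal only when every section falls below the threshold.
    """
    if not section_probabilities:
        raise ValueError("at least one pituitary section is required")
    return any(p >= threshold for p in section_probabilities)


scan_with_pm = diagnose_scan([0.05, 0.82, 0.10])  # True: one section is positive
normal_scan = diagnose_scan([0.05, 0.12, 0.10])   # False: all sections are normal
```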
Specifically, the proposed method first inspects each section of the pituitary gland based on the learned features, and then performs an 'OR' operation over the classification results of all sections. That is, the MRI scan is considered normal if and only if all sections of the pituitary are diagnosed as normal by our PM-CAD. Among the MR images of each section, we select several consecutive images from the middle phase of the enhancement scan as the input for PM-CAD.
Dataset
A total of 1,228 participants with MRI scans were selected from retrospective cohorts of three hospitals from January 2014 to December 2019. The coronal dynamic enhancement T1WI sequences of the MRI scans were used to diagnose PM in our experiments. All participants were split into 5 parts:
The Training Dataset: used to train the deep learning models.
The Validation Dataset: used to tune the hyper-parameters (e.g., learning rate, number of training epochs) of the models and to select the best model during training.
Test A: used to compare the performance of radiologists and our PM-CAD system in PM diagnosis.
Test B: used to test the diagnostic performance of our PM-CAD system on cases misdiagnosed by radiologists.
Test C: used to evaluate the generalization ability of our proposed PM-CAD system on MRI scans from different hospitals.
To compare the performance of PM diagnosis between different models, we use the area under the receiver operating characteristic curve (AUC) score, which is independent of the probability threshold. In addition, F1-score, accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), error, positive likelihood ratio (PLR) and negative likelihood ratio (NLR) are used as metrics for performance evaluation. The 95% confidence interval (CI) is also calculated for each metric. In particular, the CI for AUC is calculated using bootstrap confidence intervals [10] (resampling 50,000 times with replacement).
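To make the AUC confidence interval concrete, the following sketch computes AUC as a pairwise ranking statistic and wraps it in a bootstrap. It assumes the percentile bootstrap, which the text does not specify, and uses a much smaller default number of resamples than the 50,000 reported so that the example runs quickly. The function names and the toy data are hypothetical.

```python
import random


def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (the percentile method is an assumption)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample cases with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined for a single-class resample; redraw
        stats.append(auc_score(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper


# Toy data for illustration only.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = auc_score(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
ci_low, ci_high = bootstrap_auc_ci(labels, scores, n_resamples=500)
```

Passing `n_resamples=50000` would mirror the resampling count stated in the text, at a correspondingly higher runtime.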
CIs for F1-score, accuracy, sensitivity, specificity, PPV and NPV are calculated using the Clopper-Pearson interval [11], a common method for computing binomial confidence intervals. CIs for PLR and NLR are calculated using the "Log method" [7].
We implement the microadenoma diagnosis model in PyTorch and train our PM-CAD on NVIDIA TITAN RTX GPUs. We use the SGD optimizer with a momentum of 0.9. The initial learning rate is set to 0.02 and decays by a factor of 0.99 per epoch. The weight decay is set to 0.0005. We train our PM-CAD for 500 epochs with a batch size of 16. We set the MLP in the backbone to have two layers with 256 and 1 channels, respectively. The $\epsilon$ in our label-smoothing loss is set to 0.1. We fine-tune our PM-CAD model from weights pre-trained on ImageNet [8].
The probability threshold of the model is selected based on the Youden Index [5] of the receiver operating characteristic (ROC) curve. Specifically, given the probabilities produced by the model and their corresponding ground-truth labels, we first calculate the true positive rate (TPR) and the false positive rate (FPR) at different threshold levels $t$. Then, the Youden Index at threshold $t$ is defined as $\mathrm{YoudenIndex}(t) = \mathrm{TPR}(t) - \mathrm{FPR}(t)$. Finally, the threshold producing the largest Youden Index is taken as the optimal probability threshold: $\hat{t} = \arg\max_t \mathrm{YoudenIndex}(t)$.
We compare our PM-CAD with several 2D and 3D baselines in PM diagnosis. In particular, five 2D CNN-based models (i.e., GoogLeNet, VGG-16, ResNet-50, DenseNet-169 and ResNeXt-50 32×4d) are selected for the following reasons: these models achieve state-of-the-art performance on the largest image classification benchmark (i.e., ImageNet [8]), and their superiority has been demonstrated in many other medical image analysis tasks. All these models are implemented in PyTorch. Since 5 MR images are stacked to form a multi-channel image, the numbers of input channels and output categories are set to 5 and 2, respectively.
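The Youden Index threshold selection described earlier in this section can be sketched as a direct search over candidate thresholds. The snippet assumes that a case is called positive when its probability is greater than or equal to the threshold and that candidate thresholds are the observed probabilities; both conventions, the function name `youden_threshold`, and the toy data are assumptions made for illustration.

```python
def youden_threshold(labels, probabilities):
    """Return (threshold, J) maximizing J(t) = TPR(t) - FPR(t).

    Assumption: a case is predicted positive when probability >= t, and the
    candidate thresholds are the distinct observed probabilities.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    for t in sorted(set(probabilities)):
        tp = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p >= t)
        j = tp / n_pos - fp / n_neg  # Youden Index at threshold t
        if j > best_j:  # keep the lowest threshold in case of exact ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy data for illustration only: 4 normal cases and 5 microadenoma cases.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.7, 0.8, 0.9]
threshold, j_value = youden_threshold(labels, probs)
# threshold == 0.4 with TPR = 1.0 and FPR = 0.25, so J = 0.75
```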
To leverage the advantage of transfer learning, we also fine-tune these models from weights pre-trained on ImageNet. We train these models with the same configuration as PM-CAD using the cross-entropy loss. On the other hand, five consecutive MR images can be considered as a 3D volume. We employ a 3D CNN to diagnose PM from this volume. We choose the 3D CNN from Models Genesis [9], which is the first work serving as a primary source of transfer learning for 3D medical imaging applications. We fine-tune the 3D CNN model from weights pre-trained on a large number of unlabeled medical images. We train it with the same configuration as PM-CAD using the cross-entropy loss.
We train each model using the training dataset and select the best one, i.e., the one with the highest AUC score on the validation dataset during training. In the manuscript, Table 1, Fig S1 and Fig S2 show the results of our PM-CAD and existing models on the validation dataset.
We provide an ablation study on the intensity shift data augmentation (IS) and the histogram matching normalization (HM). We train our PM-CAD with and without IS and HM using the training dataset, and evaluate it on Test C. Table 2 shows that accuracy improves by 4% and 2% on the data from hospital 3 and hospital 2, respectively, when HM and IS are used. HM and IS do not improve the performance on the data from hospital 1. This is because the training dataset and the test data from hospital 1 come from the same source, the Third Affiliated Hospital of Sun Yat-Sen University, resulting in consistent normalization by WL and WW.
The result shows that the proposed HM and IS can eliminate the effect of inconsistent normalization across hospitals. With HM and IS, our PM-CAD demonstrates good generalization capacity, achieving above 88% accuracy across the different hospitals.
Diagnosis and Treatment of Pituitary Adenomas: A Review
Prolactinomas in pregnancy: considerations before conception and during pregnancy
Do nothing but observe microprolactinomas: when and how to replace sex hormones? Pituitary
Results of a single-center observational 10-year survey study on recurrence of hyperprolactinemia after pregnancy and lactation
PET/MRI in the Diagnosis of Hormone-Producing Pituitary Microadenoma: A Prospective Pilot Study
Advances in the Imaging of Pituitary Tumors
Modern Imaging of Pituitary Adenomas
Radiologist shortage leaves patient care at risk, warns royal college
BMA urges more career flexibility and better occupational support to fight workforce crisis
ImageNet Classification with Deep Convolutional Neural Networks.
International Conference on Neural Information Processing Systems
Medical Image Analysis using Convolutional Neural Networks: A Review
Lymph Node Metastasis Prediction from Primary Breast Cancer US Images Using Deep Learning
Deep learning radiomics can predict axillary lymph node status in early-stage breast cancer
Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning
Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning
Deep learning for classifying fibrotic lung disease on high-resolution computed tomography: a case-cohort study
Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
A novel diagnostic method for pituitary adenoma based on magnetic resonance imaging using a convolutional neural network
Preoperative prediction of cavernous sinus invasion by pituitary adenomas using a radiomics method based on magnetic resonance images
Prediction of high proliferative index in pituitary macroadenomas using MRI-based radiomics and machine learning
Going deeper with convolutions
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Feature Pyramid Networks for Object Detection
Deep Residual Learning for Image Recognition
Microsoft COCO: Common Objects in Context. Computer Vision-ECCV
Index for rating diagnostic tests
Rethinking the Inception Architecture for Computer Vision
ImageNet Large Scale Visual Recognition Challenge
Models Genesis: Generic Autodidactic Models for 3D Medical Image Analysis. Medical Image Computing and Computer Assisted Intervention
Generalised Clopper-Pearson confidence intervals for the binomial proportion
Radiologist 1 & 2: < 5 years professional experience
Supplementary Fig 2.
Calibration curves of different PM diagnosis models. Calibration curves for predicted versus observed risk in the overall validation cohort. PM-CAD shows optimal diagnostic performance.
Supplementary Fig 3. Data distribution diagram. The distribution of functional and nonfunctional pituitary microadenomas in the training, validation and test datasets.