authors: Chen, Xuxin; Wang, Ximin; Zhang, Ke; Fung, Kar-Ming; Thai, Theresa C.; Moore, Kathleen; Mannel, Robert S.; Liu, Hong; Zheng, Bin; Qiu, Yuchen
title: Recent advances and clinical applications of deep learning in medical image analysis
date: 2021-05-27
journal: Medical Image Analysis
DOI: 10.1016/j.media.2022.102444

Deep learning has received extensive research interest in developing new medical image processing algorithms, and deep learning based models have been remarkably successful in a variety of medical imaging tasks supporting disease detection and diagnosis. Despite this success, further improvement of deep learning models in medical image analysis is largely bottlenecked by the lack of large-sized and well-annotated datasets. In the past five years, many studies have focused on addressing this challenge. In this paper, we review and summarize these recent studies to provide a comprehensive overview of applying deep learning methods to various medical image analysis tasks. In particular, we emphasize the latest progress and contributions of state-of-the-art unsupervised and semi-supervised deep learning in medical image analysis, which we summarize by application scenario, including classification, segmentation, detection, and image registration. We also discuss the major technical challenges and suggest possible solutions for future research efforts.

In current clinical practice, the accuracy of detecting and diagnosing cancers and/or many other diseases depends on the expertise of individual clinicians (e.g., radiologists, pathologists) (Kruger et al., 1972), which results in large inter-reader variability in reading and interpreting medical images. To address and overcome this clinical challenge, many computer-aided detection and diagnosis (CAD) schemes have been developed and tested, aiming to help clinicians read medical images more efficiently and make diagnostic decisions in a more accurate and objective manner. The scientific rationale of this approach is that computer-aided quantitative image feature analysis can help overcome many negative factors in clinical practice, including wide variations in clinicians' expertise, potential fatigue of human experts, and the lack of sufficient medical resources. Although early CAD schemes were developed in the 1970s (Meyers et al., 1964; Kruger et al., 1972; Sezaki and Ukena, 1973), progress on CAD schemes has accelerated since the mid-1990s (Doi et al., 1999), due to the development and integration of more advanced machine learning methods and models into CAD schemes. For conventional CAD schemes, a common development procedure consists of three steps: target

Depending on whether labels of the training dataset are present, deep learning can be roughly divided into supervised, unsupervised, and semi-supervised learning. In supervised learning, all training images are labeled, and the model is optimized using the image-label pairs. For each testing image, the optimized model generates a likelihood score to predict its class label (LeCun et al., 2015). In unsupervised learning, the model analyzes and learns the underlying patterns or hidden data structures without labels.
If only a small portion of the training data is labeled, the model learns the input-output relationship from the labeled data, and the model is then strengthened by learning semantic and fine-grained features from the unlabeled data. This type of learning approach is defined as semi-supervised learning (van Engelen and Hoos, 2020). In this section, we briefly mention supervised learning at the beginning, and then mainly review the recent advances of unsupervised learning and semi-supervised learning, which can facilitate performing medical image tasks with limited annotated data. Popular frameworks for these two types of learning paradigms are introduced accordingly. At the end, we summarize three general strategies that can be combined with different learning paradigms for better performance in medical image analysis: attention mechanisms, domain knowledge, and uncertainty estimation.

Convolutional neural networks (CNNs) are a widely used deep learning architecture in medical image analysis (Anwar et al., 2018). CNNs are mainly composed of convolutional layers and pooling layers. Figure 2 shows a simple CNN in the context of a medical image classification task. The CNN directly takes an image as input, transforms it via convolutional layers, pooling layers, and fully connected layers, and finally outputs a class-based likelihood for that image. At each convolutional layer $l$, a set of kernels $W_l = \{W_l^1, \dots, W_l^K\}$ is used to extract features from the input $x_l$, and biases $b_l = \{b_l^1, \dots, b_l^K\}$ are added, generating new feature maps $W_l * x_l + b_l$. Then a nonlinear transform, an activation function $\sigma(\cdot)$, is applied, resulting in $x_{l+1} = \sigma(W_l * x_l + b_l)$ as the input of the next layer. After the convolutional layer, a pooling layer is incorporated to reduce the dimension of the feature maps, thus reducing the number of parameters. Average pooling and maximum pooling are two common pooling operations. The above process is repeated for the remaining layers. At the end of the network, fully connected layers are usually employed to produce the probability distribution over classes via a sigmoid or softmax function. The predicted probability distribution gives a label $\hat{y}$ for each input instance so that a loss function $L(\hat{y}, y)$ can be calculated, where $y$ is the real label. The parameters of the network are iteratively optimized by minimizing the loss function.

In the past few years, unsupervised representation learning has achieved huge success in natural language processing (NLP), where massive unlabeled data are available for pre-training models (e.g., BERT; Kenton and Toutanova, 2019) and learning useful feature representations. The feature representations are then fine-tuned on downstream tasks such as question answering, natural language inference, and text summarization. In computer vision, researchers have explored a similar pipeline: models are first trained to learn rich and meaningful feature representations from raw unlabeled image data in an unsupervised manner, and then the feature representations are fine-tuned on a wide variety of downstream tasks with labeled data, such as classification, object detection, instance segmentation, etc. However, this practice was not as successful as in NLP for quite a long time, and supervised pre-training instead remained the dominant strategy. Interestingly, this situation has been moving in the opposite direction over the past two years, as more and more studies report higher performance from self-supervised pre-training than from supervised pre-training.
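To make the CNN description above concrete, the following is a minimal sketch (assuming PyTorch; the layer sizes, class count, and input resolution are illustrative, not taken from the paper) of a small convolutional classifier trained with a cross-entropy loss $L(\hat{y}, y)$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Minimal CNN: conv -> activation -> pool blocks followed by a fully connected classifier."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # kernels W_1 and biases b_1
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # kernels W_2 and biases b_2
        self.pool = nn.MaxPool2d(2)                                # maximum pooling
        self.fc = nn.Linear(32 * 56 * 56, num_classes)             # fully connected layer

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # x_{l+1} = sigma(W_l * x_l + b_l), then pooling
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        return self.fc(x)                      # class logits; softmax is folded into the loss

model = SimpleCNN(num_classes=2)
images = torch.randn(4, 1, 224, 224)           # a toy batch of grayscale 224x224 images
labels = torch.tensor([0, 1, 1, 0])
loss = F.cross_entropy(model(images), labels)  # L(y_hat, y), minimized during training
loss.backward()
```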
In the recent literature, the term self-supervised learning is used interchangeably with unsupervised learning; more precisely, self-supervised learning refers to a form of deep unsupervised learning in which inputs and labels are created from the unlabeled data itself, without external supervision. One important motivation behind this technology is to avoid supervised tasks that are often expensive and time-consuming, due to the need to establish new labeled datasets or acquire high-quality annotations in certain fields such as medicine. Despite the scarcity and high cost of labeled data, there usually exist large amounts of cheap unlabeled data that remain unexploited in many fields. The unlabeled data are likely to contain valuable information that is either weak or absent in the labeled data. Self-supervised learning can leverage the power of unlabeled data to improve both the performance and the efficiency of supervised tasks. Since self-supervised learning draws on far more data than supervised learning, features learnt in a self-supervised manner can potentially generalize better in the real world. Self-supervision can be created in two ways: pretext task based methods and contrastive learning based methods. Since the contrastive learning based methods have received broader attention in very recent years, we highlight more works in this direction.

A pretext task is designed to learn representative features for downstream tasks, but the pretext itself is not of true interest (He et al., 2020). Pretext tasks learn representations by hiding certain information (e.g., channels, patches, etc.) from each input image, and then predicting the missing information from the image's remaining parts. Examples include image inpainting (Pathak et al., 2016), colorization, relative patch prediction (Doersch et al., 2015), jigsaw puzzles (Noroozi and Favaro, 2016), rotation (Gidaris et al., 2018), etc. However, the generalizability of the learnt representations is heavily dependent on the quality of the hand-crafted pretext tasks.

Contrastive learning relies on the so-called contrastive loss, which dates back to at least (Hadsell et al., 2006; Chopra et al., 2005a). Later, a number of variants of this contrastive loss were used (Oord et al., 2018; Chen et al., 2020a; Chaitanya et al., 2020). In essence, the original loss and its later versions all enforce a similarity metric to be maximized for positive (similar) pairs and minimized for negative (dissimilar) pairs, so that the model can learn discriminative features. In the following we introduce two representative frameworks for contrastive learning, namely Momentum Contrast (MoCo) (He et al., 2020) and SimCLR.

MoCo formulates contrastive learning as a dictionary look-up problem, which requires an encoded query to be similar to its matching key. As shown in Figure 3(a), given an image $x^{query}$, an encoder encodes the image into a feature vector, which is used as a query $q$. Likewise, with another encoder, the dictionary can be built up from features $\{k_0, k_1, k_2, \dots\}$, also known as keys, computed from a large set of image samples $\{x_0, x_1, x_2, \dots\}$. In MoCo, the encoded query and a key are considered similar if they come from different crops of the same image. Suppose there exists a single dictionary key $k_+$ that matches $q$; these two items are then regarded as a positive pair, whereas the remaining keys in the dictionary are considered negative.
The authors compute the loss function of a positive pair using InfoNCE (Oord et al., 2018) as follows:

$$L_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},$$

where $\tau$ is a temperature hyper-parameter and the sum runs over one positive key and $K$ negative keys.

Established from a sampled subset of all images, a large dictionary is important for good accuracy. To make the dictionary large, the authors maintain the feature representations from previous image batches as a queue: new keys are enqueued while the oldest keys are dequeued. Therefore, the dictionary consists of encoded representations from the current and previous batches. This, however, could lead to a rapidly updated key encoder, rendering the dictionary keys inconsistent, i.e., their comparisons to the encoded query would not be consistent. The authors thus propose a momentum update on the key encoder to avoid rapid changes. This key encoder is referred to as the momentum encoder.

SimCLR is another popular framework for contrastive learning. In this framework, two augmented images are considered a positive pair if they derive from the same example; if not, they are a negative pair. The agreement of the feature representations of positive image pairs is maximized. As shown in Figure 3(b), SimCLR consists of four components: (1) stochastic image augmentation; (2) encoder networks (f(.)) extracting feature representations from the augmented images; (3) a small neural network (a multilayer perceptron (MLP) projection head) (g(.)) that maps the feature representations to a lower-dimensional space; and (4) contrastive loss computation. The third component distinguishes SimCLR from its predecessors. Previous frameworks such as MoCo compute the contrastive loss on the feature representations directly rather than first mapping them to a lower-dimensional space. This component has further proven important for achieving satisfactory results, as demonstrated in MoCo v2.

Note that since self-supervised contrastive learning is very new, wide applications of recent advances such as MoCo and SimCLR in the medical image analysis field had yet to be established at the time of writing. Nonetheless, considering the promising results of self-supervised learning reported in the existing literature, we anticipate that studies applying this new technology to analyze medical images will proliferate soon. Also, self-supervised pre-training has great potential to become a strong alternative to supervised pre-training.

Different from unsupervised learning, which can learn meaningful representations from unlabeled data alone, semi-supervised learning (SSL) combines labeled and unlabeled data during model training. In particular, SSL applies to the scenario where limited labeled data and large-scale but unlabeled data are available. These two types of data should be relevant to each other, so that the additional information carried by the unlabeled data can be useful in compensating for the limited labeled data. It is reasonable to expect that unlabeled data would lead to an average performance boost, probably the more the better, for tasks with only limited labeled data. In fact, this goal has been explored for several decades, and the 1990s already witnessed rising interest in applying SSL methods to text classification. The Semi-Supervised Learning book (Chapelle et al., 2009) is a good source for readers to grasp the connection of SSL to classic machine learning algorithms. Interestingly, despite its potential positive value, the authors present empirical findings that unlabeled data sometimes deteriorate performance.
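Returning to the contrastive frameworks above, the following is a compact sketch of the two MoCo ingredients just described: the InfoNCE loss computed against a queue of keys, and the momentum update of the key encoder. It assumes PyTorch and torchvision; the encoder choice, queue size, and hyper-parameter values are illustrative rather than those of the original work.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

dim, queue_size, tau, m = 128, 1024, 0.07, 0.999
encoder_q = resnet18(num_classes=dim)                      # query encoder (trained by backprop)
encoder_k = resnet18(num_classes=dim)                      # key (momentum) encoder
encoder_k.load_state_dict(encoder_q.state_dict())
queue = F.normalize(torch.randn(queue_size, dim), dim=1)   # dictionary of negative keys

def moco_step(x_q, x_k):
    """One training step on two crops of the same images (x_q, x_k form positive pairs)."""
    global queue
    q = F.normalize(encoder_q(x_q), dim=1)                 # queries
    with torch.no_grad():
        # momentum update: theta_k <- m * theta_k + (1 - m) * theta_q
        for p_k, p_q in zip(encoder_k.parameters(), encoder_q.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
        k = F.normalize(encoder_k(x_k), dim=1)             # positive keys (no gradient)
    l_pos = (q * k).sum(dim=1, keepdim=True)               # q . k_plus
    l_neg = q @ queue.t()                                  # q . k_i for queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)      # positive is at index 0 -> InfoNCE
    loss = F.cross_entropy(logits, labels)
    queue = torch.cat([k, queue])[:queue_size]             # enqueue new keys, dequeue the oldest
    return loss

x1, x2 = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
moco_step(x1, x2).backward()
```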
However, this empirical finding that unlabeled data can sometimes hurt performance appears to be changing in the recent deep learning literature: an increasing number of works, mostly from the computer vision field, report that deep semi-supervised approaches generally perform better than high-quality supervised baselines (Ouali et al., 2020). Even when varying the amount of labeled and unlabeled data, a consistent performance improvement can still be observed. At the same time, deep semi-supervised learning has been successfully applied in the medical image analysis field to reduce annotation cost and achieve better performance. We divide popular SSL methods into three groups: (1) consistency regularization based approaches; (2) pseudo labeling based approaches; and (3) generative model based approaches.

Methods in the first category share the same idea that the prediction for an unlabeled example should not change significantly when some perturbations (e.g., adding noise, data augmentation) are applied. The loss function of such an SSL model generally consists of two parts. More concretely, given an unlabeled data example $x$ and its perturbed version $\tilde{x}$, the SSL model outputs logits $f(x)$ and $f(\tilde{x})$. On the unlabeled data, the objective is to give consistent predictions by minimizing the mean squared error $\mathrm{MSE}(f(x), f(\tilde{x}))$, and this leads to the consistency (unsupervised) loss on the unlabeled data. On the labeled data, a cross-entropy supervised loss is computed. Example SSL models regularized by consistency constraints include Ladder Networks (Rasmus et al., 2015), the Π-Model (Laine and Aila, 2017), and Temporal Ensembling (Laine and Aila, 2017). A more recent example is the Mean Teacher paradigm (Tarvainen and Valpola, 2017), composed of a teacher model and a student model (Figure 4). The student model is optimized by minimizing the consistency loss on unlabeled data and the supervised loss on labeled data; as an Exponential Moving Average (EMA) of the student model, the teacher model is used to guide the student model for consistency training. Most recently, several works such as unsupervised data augmentation (UDA) and MixMatch (Berthelot et al., 2019) have brought the performance of SSL to a new level.

In pseudo labeling (Lee, 2013), the SSL model itself generates pseudo annotations for unlabeled examples; the pseudo-labeled examples are then used jointly with labeled examples to train the SSL model. This process is iterated several times, during which the quality of the pseudo labels and the model's performance are both enhanced. The naïve pseudo-labeling process can be combined with Mixup augmentation (Zhang et al., 2018a) to further improve the SSL model's performance (Arazo et al., 2020). Pseudo labeling also works well with multiview co-training (Qiao et al., 2018). For each view of the labeled examples, co-training learns a separate classifier, and the classifier is then used to generate pseudo labels for the unlabeled data; co-training maximizes the agreement of the assigned pseudo annotations across the views of the unlabeled examples.

For methods in the third category, semi-supervised generative models such as GANs and VAEs put more focus on solving target tasks (e.g., classification) than on merely generating high-fidelity samples. Here we illustrate the mechanism of the semi-supervised GAN for brevity. One simple way to adapt a GAN to semi-supervised settings is to modify the discriminator to perform additional tasks. For example, in the task of image classification, Salimans et al. (2016) and Odena (2016) changed the discriminator in DCGAN by forcing it to serve as a classifier.
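As an illustration of the consistency-regularization idea described above, the following is a minimal Mean Teacher sketch (assuming PyTorch; the backbone, noise model, and loss weight are illustrative): the student is trained with a supervised loss on labeled data plus a consistency MSE against an EMA teacher on unlabeled data.

```python
import copy
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

student = resnet18(num_classes=2)
teacher = copy.deepcopy(student)                 # teacher = EMA copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

def mean_teacher_loss(x_l, y_l, x_u, noise_std=0.1, w_cons=1.0):
    """Supervised loss on labeled data + consistency (MSE) loss on unlabeled data."""
    sup = F.cross_entropy(student(x_l), y_l)
    student_out = student(x_u + noise_std * torch.randn_like(x_u))       # perturbed input
    with torch.no_grad():
        teacher_out = teacher(x_u + noise_std * torch.randn_like(x_u))   # a different perturbation
    cons = F.mse_loss(torch.softmax(student_out, 1), torch.softmax(teacher_out, 1))
    return sup + w_cons * cons

def ema_update(alpha=0.99):
    """theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(alpha).add_(s.data, alpha=1 - alpha)

x_l, y_l = torch.randn(2, 3, 224, 224), torch.tensor([0, 1])   # small labeled batch
x_u = torch.randn(4, 3, 224, 224)                               # larger unlabeled batch
mean_teacher_loss(x_l, y_l, x_u).backward()
ema_update()
```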
In such a semi-supervised GAN, the discriminator functions for an unlabeled image as in the vanilla GAN, providing a probability of the input image being real; for a labeled image, the discriminator predicts its class in addition to generating a realness probability. However, Li et al. (2017) demonstrated that the optimal performance of the two tasks may not be achieved at the same time by a single discriminator. Thus, they introduced an additional classifier that is independent of the generator and discriminator. This new architecture, composed of three components, is called Triple-GAN.

Attention originates from primates' visual processing mechanism, which selects a subset of pertinent sensory information rather than using all available information for complex scene analysis (Itti et al., 1998). Inspired by this idea of focusing on specific parts of the inputs, deep learning researchers have integrated attention into developing advanced models in different fields. Attention-based models have achieved huge success in fields related to natural language processing (NLP), such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017) and image captioning (Xu et al., 2015; You et al., 2016; Anderson et al., 2018). One prominent example is the Transformer architecture, which relies solely on self-attention to capture global dependencies between input and output, without requiring sequential computation (Vaswani et al., 2017). Attention mechanisms have also become popular in computer vision tasks, such as natural image classification (Woo et al., 2018; Jetley et al., 2018), segmentation (Ren and Zemel, 2017), etc. When processing images, attention modules can adaptively learn "what" and "where" to attend, so that model predictions are conditioned on the most relevant image regions and features. Based on how the attended locations in an image are selected, attention mechanisms can be roughly divided into two categories, namely soft and hard attention. The former deterministically learns a weighted average of features at all locations, whereas the latter stochastically samples one subset of feature locations to attend to. Since hard attention is not differentiable, soft attention, despite being more computationally expensive, has received more research effort. Following this differentiable mechanism, different types of attention have been further developed, such as (1) spatial attention (Jaderberg et al., 2015), (2) …

Most well-established deep learning models, originally designed to analyze natural images, are likely to produce only suboptimal results when directly applied to medical image tasks (Zhang et al., 2020a). This is because natural and medical images are very different in nature. First, medical images usually exhibit high inter-class similarity, so one major challenge lies in extracting fine-grained visual features to capture the subtle differences that are important for making correct predictions. Second, typical medical image datasets are much smaller than benchmark natural image datasets, which contain from tens of thousands to millions of images. This hinders high-complexity computer vision models from being directly applied in the medical domain. Therefore, how to customize models for medical image analysis remains an important issue.
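As a concrete illustration of the soft attention discussed above, the sketch below shows a generic differentiable spatial-attention gate that reweights a CNN feature map; it is not the specific module of any cited work, and all names and sizes are illustrative (assuming PyTorch).

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Generic soft spatial attention: predict a weight in [0, 1] per location and
    reweight the feature map, so the network focuses on the most relevant regions."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)    # 1x1 conv -> one score per pixel

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.score(features))          # soft (differentiable) attention map
        return features * weights                              # attended features

features = torch.randn(2, 64, 28, 28)                          # e.g., a CNN feature map
attended = SoftSpatialAttention(64)(features)
print(attended.shape)                                          # torch.Size([2, 64, 28, 28])
```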
One possible solution to this customization problem is to integrate proper domain knowledge or task-specific properties, which has proven beneficial in facilitating the learning of useful feature representations and reducing model complexity in the medical imaging context. In this review paper, we will mention a variety of domain knowledge, such as anatomical information in MRI and CT images (Zhou et al.).

Reliability is of critical concern when it comes to clinical settings with high safety requirements (e.g., cancer diagnosis). Model predictions are easily affected by factors such as data noise and inference errors, so it is desirable to quantify uncertainty and make the results trustworthy (Abdar et al., 2021). Commonly used techniques for uncertainty estimation include Bayesian approximation (Gal and Ghahramani, 2016) and model ensembles (Lakshminarayanan et al., 2017). Bayesian approaches such as Monte Carlo dropout (MC-dropout) (Gal and Ghahramani, 2016) revolve around approximating the posterior distribution over a neural network's parameters. Ensemble techniques combine multiple models to measure uncertainty. Readers interested in uncertainty estimation are referred to the comprehensive review by Abdar et al. (2021).

Medical image classification is the goal of computer-aided diagnosis (CADx), which aims at either distinguishing malignant lesions from benign ones or identifying certain diseases from input images (Shen et al., 2017; van Ginneken et al., 2011). Deep learning based CADx schemes have achieved huge success over the last decade. However, deep neural networks generally depend on sufficient annotated images to ensure good performance, and this requirement may not be easily satisfied by many medical image datasets. To alleviate the lack of large annotated datasets, many techniques have been used, and transfer learning has stood out indisputably as the most dominant paradigm. Beyond transfer learning, several other learning paradigms, including unsupervised image synthesis, self-supervised learning, and semi-supervised learning, have demonstrated great potential for performance enhancement given limited annotated data. We introduce these learning paradigms' applications in medical image classification in the following subsections.

Starting from AlexNet (Krizhevsky et al., 2012), a variety of end-to-end models with increasingly deeper networks and larger representation capacity have been developed for image classification, such as VGG (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), ResNet, and DenseNet. These models have yielded superior results, making deep learning mainstream not only in developing high-performing CADx schemes but also in other subfields of medical image processing. Nonetheless, the performance of deep learning models highly depends on the size of the training dataset and the quality of the image annotations. In many medical image analysis tasks, especially 3D scenarios, it can be challenging to establish a sufficiently large and high-quality training dataset because of difficulties in data acquisition and annotation (Tajbakhsh et al.).

Within the paradigm of supervised classification, different types of attention modules have been used for performance boosts and better model interpretability (Zhou et al., 2019b). Guan et al. (2018) introduced an attention-guided CNN based on ResNet-50. The attention heatmaps from the global X-ray image were used to suppress large irrelevant areas and highlight local regions that contain discriminative cues for thorax disease.
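A minimal sketch of the MC-dropout uncertainty estimation mentioned above follows (assuming PyTorch; the classifier architecture and the entropy-based uncertainty measure are illustrative choices): dropout is kept active at inference time, and the spread of several stochastic forward passes is used as an uncertainty signal.

```python
import torch
import torch.nn as nn

# Monte Carlo dropout: keep dropout active at test time and average several
# stochastic forward passes; the spread of the predictions measures uncertainty.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(),
    nn.Dropout(p=0.5),                        # remains stochastic during MC sampling
    nn.Linear(128, 2),
)

def mc_dropout_predict(x: torch.Tensor, n_samples: int = 20):
    model.train()                             # keep dropout turned on
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)                                   # predictive probability
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(dim=1)  # predictive entropy as uncertainty
    return mean, entropy

x = torch.randn(4, 1, 64, 64)                 # toy batch of 64x64 images
mean, uncertainty = mc_dropout_predict(x)
print(mean.shape, uncertainty.shape)          # torch.Size([4, 2]) torch.Size([4])
```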
This attention-guided model effectively fused the global and local information and achieved good classification performance. In another study, Schlemper et al. (2019) incorporated attention modules into a variant of VGG (Baumgartner et al., 2017) and into U-Net (Ronneberger et al., 2015) for 2D fetal ultrasound image plane classification and 3D CT pancreas segmentation, respectively. Each attention module was trained to focus on a subset of local structures in the input images, and these local structures contain salient features useful for the target task.

Classic data augmentation (e.g., rotation, scaling, flipping, translation, etc.) is simple but effective in creating more training instances to achieve better performance (Krizhevsky et al., 2012). However, it cannot bring much new information to the existing training examples. Given their ability to learn the hidden data distribution and generate realistic images, GANs have been used as a more sophisticated approach for data augmentation in the medical domain. Frid-Adar et al. (2018b) exploited DCGAN to synthesize high-quality examples to improve liver lesion classification on a limited dataset. The dataset contains only 182 liver lesions, including cysts, metastases, and hemangiomas. Since training a GAN typically requires a large number of examples, the authors applied classic data augmentation (e.g., rotation, flip, translation, scale) to create nearly 90,000 examples. The GAN-based synthetic data augmentation significantly improved the classification performance, with sensitivity and specificity increasing from 78.6% and 88.4% to 85.7% and 92.4%, respectively. In their later work (Frid-Adar et al., 2018a), the authors further extended lesion synthesis from the unconditional setting (DCGAN) to a conditional setting (ACGAN). The generator of ACGAN was conditioned on side information (the lesion classes), and the discriminator predicted lesion classes in addition to judging realness. However, it was found that ACGAN-based synthetic augmentation delivered weaker classification performance than its unconditional counterpart.

To alleviate data scarcity, and especially the lack of positive cancer cases, Wu et al. (2018a) adopted a conditional structure (cGAN) to generate realistic lesions for mammogram classification. Traditional data augmentation was also used to create enough examples for training the GAN. The generator, conditioned on malignant/non-malignant labels, can control the process of generating a specific type of lesion. For each non-malignant patch image, a malignant lesion was synthesized onto it using the segmentation mask of another malignant lesion; for each malignant image, its lesion was removed and a non-malignant patch was synthesized. Although the GAN-based augmentation achieved better classification performance than traditional data augmentation, the improvement was relatively small, less than 1%.

Recent self-supervised learning approaches have shown great potential in improving the performance of medical tasks lacking sufficient annotations (Bai et al.). This approach is suited to the scenario where large amounts of medical images are available, but only a small percentage are labeled. Accordingly, the model optimization is divided into two steps, namely self-supervised pre-training and supervised fine-tuning. The model is initially optimized using unlabeled images to effectively learn good features that are representative of the image semantics (Azizi et al., 2021).
The models pre-trained via self-supervision are then fine-tuned in a supervised manner to achieve faster and better performance in subsequent classification tasks (Chen et al., 2020c). In practice, self-supervision can be created either through pretext tasks (Misra and Maaten, 2020) or through contrastive learning (Jing and Tian, 2020), as follows.

Self-supervised pretext task based classification utilizes common pretext tasks, such as rotation prediction (Tajbakhsh et al.), colorization (Larsson et al., 2017), and WGAN-based patch reconstruction, to pre-train models for classification tasks. After pre-training, the models were trained using labeled examples. It was shown that pretext task based pre-training in the medical domain was more effective than random initialization and transfer learning (ImageNet pre-training) for diabetic retinopathy classification.

For self-supervised contrastive classification, Azizi et al. (2021) adopted the self-supervised learning framework SimCLR to train models (wider versions of ResNet-50 and ResNet-152) for dermatology condition classification and chest X-ray classification. They pre-trained the models first using unlabeled natural images and then using unlabeled dermatology images and chest X-rays. Feature representations were learned by maximizing agreement between positive image pairs, which are either two augmented examples of the same image or multiple images from the same patient. The pre-trained models were fine-tuned using much fewer labeled dermatology images and chest X-rays. These models outperformed their counterparts pre-trained on ImageNet by 1.1% in mean AUC for chest X-ray classification and by 6.7% in top-1 accuracy for dermatology condition classification. MoCo (He et al., 2020; Chen et al., 2020b) is another popular self-supervised learning framework for pre-training models for medical classification tasks, such as COVID-19 diagnosis from CT images and pleural effusion identification in chest X-rays (Sowrirajan et al., 2021). Furthermore, it has been shown that self-supervised contrastive pre-training can greatly benefit from the incorporation of domain knowledge. For example, Vu et al. (2021) harnessed patient metadata (patient number, image laterality, and study number) to construct and select positive pairs from multiple chest X-ray images for MoCo pre-training. With only 1% of the labeled data for pleural effusion classification, the proposed approach improved mean AUC by 3.4% and 14.4% compared to the previous contrastive learning method (Sowrirajan et al., 2021) and to ImageNet pre-training, respectively.

Unlike self-supervised approaches, which can learn useful feature representations from unlabeled data alone, semi-supervised learning needs to integrate unlabeled data with labeled data in different ways to train models for better performance. Madani et al. (2018a) employed a GAN trained in a semi-supervised manner (Kingma et al., 2014) for cardiac disease classification in chest X-rays where labeled data were limited. Unlike the vanilla GAN (Goodfellow et al., 2014), this semi-supervised GAN was trained using both unlabeled and labeled data. Its discriminator was modified to predict not only the realness of input images but also the image class (normal/abnormal) for real data. When increasing the number of labeled examples, the semi-supervised GAN based classifier consistently performed better than a supervised CNN.
Semi-supervised GANs have also been shown to be useful in other data-limited classification tasks, such as CT lung nodule classification, where a semi-supervised GAN architecture was likewise employed to address the scarcity of labeled data.

Medical image segmentation, identifying the set of pixels or voxels of lesions, organs, and other substructures against background regions, is another challenging task in medical image analysis (Litjens et al., 2017). Among all common image analysis tasks, such as classification and detection, segmentation needs the strongest supervision (large amounts of high-quality annotations) (Tajbakhsh et al., 2020). Since its introduction in 2015, U-Net (Ronneberger et al., 2015) has become probably the most well-known architecture for segmenting medical images; afterwards, different variants of U-Net have been proposed to further improve segmentation performance. From the very recent literature, we observe that the combination of U-Net and Transformers from NLP (Chen et al., 2021b) has contributed to state-of-the-art performance. In addition, a number of semi-supervised and self-supervised learning based approaches have also been proposed to alleviate the need for large annotated datasets. Accordingly, in this section we will (1) review the original U-Net and its important variants and summarize useful performance-enhancing strategies; (2) introduce the combination of U-Net and Transformers, as well as Mask RCNN (He et al., 2017); and (3) cover self-supervised and semi-supervised approaches for segmentation. Since recent studies focus on applying Transformers to segment medical images in a supervised manner, we purposely position the introduction of Transformers-based architectures in the supervised segmentation section. However, it should be noted that this categorization does not mean Transformers-based architectures cannot be used in semi-supervised or unsupervised settings.

In a convolutional network, the high-level coarse-grained features learned by higher layers capture semantics beneficial to whole-image classification; in contrast, the low-level fine-grained features learned by lower layers contain useful details for precise localization (i.e., assigning a class label to each pixel) (Hariharan et al., 2015), which is important for image segmentation. U-Net is built on the fully convolutional network (Long et al., 2015); the key innovation of U-Net is the so-called skip connections between opposing convolutional and deconvolutional layers, which concatenate features learned at different levels to improve segmentation performance. Meanwhile, the skip connections are also helpful in recovering the network's output to the same spatial resolution as the input. U-Net takes 2D images as input, and it generates several segmentation maps, each of which corresponds to one respective pixel class. The Dense V-network was later proposed, modifying V-Net's loss function for binary segmentation to support multi-organ segmentation of abdominal CT images. Although the authors followed the V-Net architecture, they replaced its relatively shallow down-sampling network with a sequence of three dense feature stacks. The combination of densely linked layers and the shallow V-Net architecture demonstrated its importance in improving segmentation accuracy, and the proposed model yielded significantly higher Dice scores for all organs compared to multi-atlas label fusion (MALF) methods.
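As a minimal illustration of the skip connections described above, the sketch below shows a single-level U-Net-style network in which the encoder feature map is concatenated with the upsampled decoder feature map before the per-pixel prediction (assuming PyTorch; channel counts and depth are illustrative, far smaller than the original U-Net).

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """One-level U-Net: the encoder feature map is concatenated with the upsampled
    decoder feature map (the skip connection) before the final per-pixel prediction."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.enc = conv_block(1, 16)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = conv_block(32, 16)                    # 32 = 16 (skip) + 16 (upsampled)
        self.head = nn.Conv2d(16, num_classes, 1)        # one map per pixel class

    def forward(self, x):
        e = self.enc(x)                                  # high-resolution, fine-grained features
        b = self.bottleneck(self.down(e))                # low-resolution, coarse-grained features
        d = self.dec(torch.cat([self.up(b), e], dim=1))  # skip connection: concatenate levels
        return self.head(d)

logits = TinyUNet()(torch.randn(1, 1, 128, 128))
print(logits.shape)                                      # torch.Size([1, 2, 128, 128])
```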
In a related study, the authors' first network (RU-Net) replaced U-Net's forward convolutional units with RCNN's recurrent convolutional layers (RCL) (Figure 5(b)), which can help accumulate useful features to improve segmentation results. In their second network (R2U-Net), the authors further modified the RCL using ResNet's residual units (Figure 5(d)), which learn a residual function by using identity mappings for shortcut connections, thus allowing very deep networks to be trained. Both models achieved better segmentation performance than U-Net and residual U-Net. Dense convolutional blocks have also demonstrated their superiority in enhancing segmentation performance on liver and tumor CT volumes. Besides redesigned skip connections and modified architectures, U-Net based segmentation approaches also benefit from adversarial training.

Transformers are a group of encoder-decoder network architectures used for sequence-to-sequence processing in NLP (Chaudhari et al., 2021). One critical sub-module is known as multi-head self-attention (MSA), where multiple parallel self-attention layers are used to simultaneously generate multiple attention vectors for each input. Different from the convolution based U-Net and its variants, Transformers rely on self-attention mechanisms, which possess the advantage of learning complex, long-range dependencies from input images. There are two ways to adapt Transformers in the context of medical image segmentation: hybrid and Transformer-only. The hybrid approach combines CNNs and Transformers, while the latter approach does not involve any convolution operations.

Chen et al. (2021b) present TransUNet, the first Transformers-based framework for medical image segmentation. This architecture combines a CNN and a Transformer in a cascaded manner, where one's advantages are used to compensate for the other's limitations. As introduced previously, U-Net and its variants based on convolution operations have achieved satisfactory results. Because of the skip connections, low-level/high-resolution CNN features from the encoder, which contain precise localization information, are utilized by the decoder to enable better performance. However, due to the intrinsic locality of convolutions, these models are generally weak at modeling long-range relations. On the other hand, although Transformers based on self-attention mechanisms can easily capture long-range dependencies, the authors found that using a Transformer alone cannot provide satisfactory results. This is because it exclusively concentrates on learning global context and ignores the low-level details that contain important localization information. Therefore, the authors propose to combine low-level spatial information from CNN features with global context from the Transformer. As shown in Figure 6(b), TransUNet has an encoder-decoder design with skip connections. The encoder is composed of a CNN and several Transformer layers. The input image first needs to be split into patches and tokenized. Then the CNN is used to generate feature maps for the input patches. CNN features at different resolution levels are passed to the decoder through skip connections, so that spatial localization information can be retained. Next, patch embeddings and positional embeddings are applied to the sequence of feature maps. The embedded sequence is sent into a series of Transformer layers to learn global relations.
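A small usage sketch of multi-head self-attention over a sequence of patch embeddings, the core operation inside each Transformer layer mentioned above, follows (assuming PyTorch; the embedding dimension, head count, and sequence length are illustrative).

```python
import torch
import torch.nn as nn

# Multi-head self-attention (MSA) over a sequence of patch embeddings:
# several attention heads attend to all positions in parallel, which is what
# lets Transformer layers model long-range dependencies across an image.
embed_dim, num_heads, num_patches = 256, 8, 196        # e.g., a 14 x 14 grid of patches
msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tokens = torch.randn(2, num_patches, embed_dim)        # (batch, sequence length, embedding dim)
attended, attn_weights = msa(tokens, tokens, tokens)   # query = key = value -> self-attention
print(attended.shape, attn_weights.shape)              # (2, 196, 256) and (2, 196, 196)
```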
Each Transformer layer consists of an MSA block (Vaswani et al., 2017; Dosovitskiy et al., 2020) and a multi-layer perceptron (MLP) block (Figure 6(a)). The hidden feature representations produced by the last Transformer layer are reshaped and gradually upsampled by the decoder, which outputs the final segmentation mask. TransUNet demonstrates superior performance on the CT multi-organ segmentation task over other competing methods such as attention U-Net.

In another study, Zhang et al. (2021) adopt a different approach to combining CNN and Transformer. Instead of first using a CNN to extract low-level features and then passing the features through Transformer layers, the proposed model, TransFuse, combines a CNN and a Transformer as two parallel branches. The Transformer branch, consisting of several layers, takes as input a sequence of embedded image patches to capture global context information. The output of its last layer is reshaped into 2D feature maps. To recover finer local details, these maps are upsampled to higher resolutions at three different scales. Correspondingly, the CNN branch uses three ResNet-based blocks to extract features from local to global at three different scales. Features at the same resolution scale from both branches are selectively fused using an independent module. The fused features can capture both low-level spatial details and high-level global context. In the end, the multi-level fused features are used to generate the final segmentation mask. TransFuse achieved good performance in prostate MRI segmentation.

In addition to 2D image segmentation, the hybrid approach is also useful in 3D scenarios. Hatamizadeh et al. (2022) propose a UNet-based architecture to perform volumetric segmentation of brain tumors in MRI and of the spleen in CT. Similar to the 2D case, 3D images are first split into volumes. Then linear embeddings and positional embeddings are applied to the sequence of input image volumes before they are fed to the encoder. The encoder, composed of multiple Transformer layers, extracts multi-scale global feature representations from the embedded sequence. The extracted features at different scales are all upsampled to higher resolutions and later merged with multi-scale features from the decoder via skip connections. In another study, Xie et al. (2021b) investigate reducing Transformers' computational and spatial complexity in the 3D multi-organ segmentation task. To achieve this goal, they replace the original MSA module in the vanilla Transformer with the deformable self-attention module (Zhu et al., 2021a). This attention module attends over a small set of key positions instead of treating all positions equally, thus resulting in much lower complexity. Besides, their proposed architecture, CoTr, is in the same spirit as TransUNet: a CNN generates feature maps, which are used as the inputs to the Transformer. The difference lies in that, instead of extracting only single-scale features, the CNN in CoTr extracts feature maps at multiple scales.

For the Transformer-only paradigm, Cao et al. (2021) present Swin-Unet, the first UNet-like pure Transformer architecture for medical image segmentation. Swin-Unet has a symmetric encoder-decoder structure without any convolutional operations. The major components of the encoder and decoder are (1) Swin Transformer blocks and (2) patch merging or patch expanding layers. Enabled by a shifted windowing scheme, the Swin Transformer block exhibits better modeling power as well as lower complexity in computing self-attention.
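A brief sketch of the patch-embedding-plus-positional-embedding step that the hybrid and pure-Transformer models above apply before their Transformer layers follows (assuming PyTorch; the patch size and embedding dimension are illustrative).

```python
import torch
import torch.nn as nn

# Tokenizing an image for a Transformer: split it into patches, linearly embed each
# patch, and add learnable positional embeddings so the sequence retains layout information.
patch, dim, img = 16, 256, 224
num_patches = (img // patch) ** 2                            # 14 * 14 = 196 tokens

to_patches = nn.Unfold(kernel_size=patch, stride=patch)      # extracts non-overlapping patches
embed = nn.Linear(3 * patch * patch, dim)                    # linear patch embedding
pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))   # learnable positional embedding

x = torch.randn(2, 3, img, img)
patches = to_patches(x).transpose(1, 2)                      # (2, 196, 3*16*16)
tokens = embed(patches) + pos_embed                          # sequence fed to Transformer layers
print(tokens.shape)                                          # torch.Size([2, 196, 256])
```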
Therefore, the authors use the Swin Transformer block to extract feature representations from the input sequence of image patch embeddings. The subsequent patch merging layer down-samples the feature representations/maps to lower resolutions. These down-sampled maps are further passed through several other Transformer blocks and patch merging layers. Likewise, the decoder also uses Transformer blocks for feature extraction, but its patch expanding layers upsample the feature maps to higher resolutions. Similar to U-Net, the upsampled feature maps are fused with the down-sampled feature maps from the encoder via skip connections. Finally, the decoder outputs pixel-level segmentation predictions. The proposed framework achieved satisfactory results on multi-organ CT and cardiac MRI segmentation tasks. Note that, to ensure good performance and reduce training time, most of the Transformers-based segmentation models introduced so far are pre-trained on a large external dataset (e.g., ImageNet). Interestingly, it has been shown that Transformers can also produce good results without pre-training by utilizing computationally efficient self-attention modules.

Aside from the above UNet and Transformers-based approaches, another architecture, Mask RCNN (He et al., 2017), which was originally developed for pixelwise instance segmentation, has achieved good results in medical tasks. Since it is closely related to Faster RCNN (Ren et al., 2015), a region-based CNN for object detection, details of Mask RCNN and its relation to the detection architectures will be elaborated later. To sum up briefly, Mask RCNN has (1) a region proposal network (RPN), as in Faster RCNN, to produce high-quality region proposals (i.e., regions likely to contain objects), (2) the RoIAlign layer to preserve spatial correspondence between RoIs and their feature maps, and (3) an extra branch that outputs a binary object mask for each region in addition to the class and bounding box branches. Zhou et al. (2019c) combined UNet++ and Mask RCNN, leading to Mask RCNN++. As mentioned earlier, UNet++ demonstrates better segmentation results using its redesigned nested and dense skip connections, so the authors used them to replace the plain skip connections of the FPN inside Mask RCNN. A large performance boost was observed using the proposed model.

For medical image segmentation, to alleviate the need for a large amount of annotated training data, researchers have adopted generative models for image synthesis to increase the number of training examples (Zhao et al., 2019a). Meanwhile, exploiting the power of unlabeled medical images appears to be a much more popular choice. In contrast to difficult and expensive high-quality annotated datasets, unlabeled medical images are often available, usually in large numbers. Given a small medical image dataset with limited ground truth annotations and a related but unlabeled large dataset, researchers have explored self-supervised and semi-supervised learning approaches to learn useful and transferable feature representations from the unlabeled dataset, which will be discussed in this and the next section, respectively.

Self-supervised pretext tasks: Since self-supervision via pretext tasks and contrastive learning can learn rich semantic representations from unlabeled datasets, self-supervised learning is often used to pre-train the model and enable solving downstream tasks (e.g., medical image segmentation) more accurately and efficiently when limited annotated examples are available (Taleb et al., 2020).
The pretext tasks can either be designed based on the application scenario or chosen from the traditional ones used in the computer vision field. For the former type, Bai et al. (2019) designed a novel pretext task, predicting anatomical positions, for cardiac MR image segmentation. The features self-learnt via the pretext task were transferred to tackle a more challenging task, accurate ventricle segmentation. The proposed method achieved much higher segmentation accuracy than a standard U-Net trained from scratch, especially when only limited annotations were available. For the latter type, Taleb et al. (2020) extended pretext tasks from 2D to 3D scenarios, and they investigated the effectiveness of several pretext tasks (e.g., rotation prediction, jigsaw puzzles, relative patch location) in 3D medical image segmentation. For brain tumor segmentation, they adopted the U-Net architecture, and the pretext tasks were performed on a large unlabeled dataset (about 22,000 MRI scans) to pre-train the models; the learned feature representations were then fine-tuned on a much smaller labeled dataset (285 MRI scans). The 3D pretext tasks performed better than their 2D counterparts; more importantly, the proposed methods sometimes outperformed supervised pre-training, suggesting good generalization ability of the self-learnt features. The performance of self-supervised pre-training can also be improved by adding other types of information. Hu et al. (2020) implemented a context encoder (Pathak et al., 2016) performing semantic inpainting as the pretext task, and they incorporated DICOM metadata from ultrasound images as weak labels to boost the quality of the pre-trained features toward facilitating two different segmentation tasks.

Self-supervised contrastive learning: Although the global contrastive loss (Chen et al., 2020a) is suitable for learning image-level (global) feature representations, it does not guarantee learning the distinctive local representations that are important for per-pixel segmentation. One group of authors therefore proposed a local contrastive loss to capture local features that provide complementary information to boost segmentation performance. Meanwhile, to the best of our knowledge, when computing the global contrastive loss, these authors are the first to utilize the domain knowledge that there is structural similarity in volumetric medical images (e.g., CT and MRI). In MR image segmentation with few annotations, the proposed method substantially outperformed other semi-supervised and self-supervised methods. In addition, it was shown that the proposed method could further benefit from data augmentation techniques like Mixup (Zhang et al., 2018a).

Semi-supervised consistency regularization: The mean teacher model is commonly used. Based on the mean teacher framework, Yu et al. (2019) introduced uncertainty estimation (Kendall and Gal, 2017) for better segmentation of the 3D left atrium from MR images. They argued that on an unlabeled dataset, the output of the teacher model can be noisy and unreliable; therefore, besides generating target outputs, the teacher model was modified to estimate these outputs' uncertainty. The uncertainty-aware teacher model can produce more reliable guidance for the student model, and the student model can in turn improve the teacher model. The mean teacher model can also be improved by the transformation-consistent strategy. In one study, Wang et al. (2020b) proposed a semi-supervised framework to segment COVID-19 pneumonia lesions from CT scans with noisy labels.
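The study just mentioned (continued below) pairs the mean teacher framework with a generalized Dice loss to handle noisy labels and foreground-background imbalance. As background, here is a minimal sketch of the standard soft Dice loss for binary segmentation (assuming PyTorch; this is not the generalized variant developed in that work):

```python
import torch

def soft_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for binary segmentation: 1 - 2|P ∩ G| / (|P| + |G|).
    logits: (N, 1, H, W) raw predictions; target: (N, 1, H, W) binary ground truth."""
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()

logits = torch.randn(2, 1, 64, 64, requires_grad=True)
target = (torch.rand(2, 1, 64, 64) > 0.8).float()   # sparse foreground, as in many lesion masks
soft_dice_loss(logits, target).backward()
```

Unlike the pixel-wise cross-entropy, this overlap-based objective is less dominated by the abundant background pixels, which is why Dice-style losses are common in medical segmentation.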
The framework of Wang et al. (2020b) is also based on the Mean Teacher model; instead of updating the teacher model with a predefined value, they adaptively updated the teacher model using a dynamic threshold on the student model's segmentation loss. Similarly, the student model was also adaptively updated by the teacher model. To simultaneously deal with noisy labels and the foreground-background imbalance, the authors developed a generalized version of the Dice loss.

Semi-supervised generative models: In one of the earliest works extending generative models to the semi-supervised segmentation task, Sedai et al. (2017) utilized two VAEs to segment the optic cup from retinal fundus images. The first VAE was employed to learn feature embeddings from a large number of unlabeled images by performing image reconstruction; the second VAE, trained on a smaller number of labeled images, mapped input images to segmentation masks. In other words, the authors used the first VAE to perform an auxiliary task (image reconstruction) on unlabeled data, which helps the second VAE better achieve the target objective (image segmentation) using labeled data. To leverage the feature embeddings learned by the first VAE, the second VAE simultaneously reconstructed segmentation masks and the latent representations of the first VAE. The utilization of additional information from unlabeled images improved segmentation accuracy. In another study, Chen et al. (2019c) adopted a similar idea of introducing an auxiliary task on unlabeled data to facilitate image segmentation with limited labeled data. Specifically, the authors proposed a semi-supervised segmentation framework consisting of a UNet-like network for segmentation (the target objective) and an autoencoder for reconstruction (the auxiliary task). Unlike the previous study, which trained two VAEs separately, the segmentation network and reconstruction network in this framework share the same encoder. Another difference is that the foreground and background parts of the input image were reconstructed/generated separately, and the respective segmentation labels were obtained via an attention mechanism, so that the auxiliary reconstruction task and the target segmentation task could be better linked. This semi-supervised segmentation framework outperformed its counterparts (e.g., fully supervised CNNs) under different labeled/unlabeled data splits. In addition to the aforementioned approaches, researchers have also explored incorporating domain-specific prior knowledge to tailor semi-supervised frameworks for better segmentation performance. The prior knowledge varies widely, for example the anatomical prior.

A natural image may contain objects belonging to different categories, and each object category may contain several instances. In the computer vision field, object detection algorithms are applied to detect and identify whether any instance(s) from certain object categories are present in an image. In this section, we will first briefly review several recent milestone detection frameworks, including one-stage and two-stage detectors. It should be noted that, since these detection frameworks are often used in supervised and semi-supervised settings, we introduce them under these learning paradigms. Then we will cover these frameworks' applications in the detection of specific types of lesions and in universal lesion detection.
In the end, we will introduce unsupervised lesion detection based on GANs and VAEs.

The RCNN framework (Girshick et al., 2014) is a multi-stage pipeline. Despite its impressive results in object detection, RCNN has some drawbacks: the multi-stage pipeline makes training slow and difficult to optimize, and separately extracting features for each region proposal makes training expensive in disk space and time and also slows down testing (Girshick, 2015). These drawbacks have inspired several recent milestone detectors, which can be categorized into two groups (Liu et al.).

Two-stage detectors: Unlike RCNN, the Fast RCNN framework (Girshick, 2015) is an end-to-end detection pipeline employing a multi-task loss to jointly classify region proposals and regress bounding boxes. Region proposals in Fast RCNN are generated on a shared convolutional feature map rather than on the original image, to speed up computation. A Region of Interest (RoI) pooling layer is then applied to warp all region proposals to the same size. These adjustments resulted in better and faster detection, but the speed of Fast RCNN is still bottlenecked by the inefficient process of computing region proposals. The Faster RCNN framework addresses this bottleneck with a region proposal network (RPN) that shares convolutional features with the detection network. Mask RCNN is closely related to Faster RCNN, but it was originally designed for pixelwise object instance segmentation. Mask RCNN also has an RPN to propose candidate object bounding boxes; this newer framework extends Faster RCNN by adding an extra branch that outputs a binary object mask alongside the existing branch that predicts classes and bounding box offsets. Mask RCNN uses a Feature Pyramid Network (FPN) (Lin et al., 2017a) as its backbone to extract features at various resolution scales. Besides instance segmentation, Mask RCNN can be used for object detection, achieving excellent accuracy and speed.

One-stage detectors: Redmon et al. (2016) proposed the single-stage framework YOLO; instead of using a separate network to generate region proposals, they treated object detection as a simple regression problem. A single network is used to directly predict object classes and bounding box coordinates. YOLO also differs from region proposal based frameworks (e.g., Faster RCNN) in that it learns features globally from the entire image rather than from local regions. Despite being faster and simpler, YOLO has more localization errors and lower detection accuracy than Faster RCNN. Later, the authors proposed YOLOv2 and YOLO9000 (Redmon and Farhadi, 2017).

To deliver good detection performance in the medical domain, these frameworks need to be adjusted through different methods, such as incorporating domain-specific characteristics, uncertainty estimation, or semi-supervised learning strategies, which are presented as follows. Incorporating domain-specific characteristics has been a popular choice in both the radiology and histopathology domains. In the radiology domain, the intrinsic 3D spatial context information in volumetric images (e.g., CT scans) has been utilized in many studies (Roth et al.). One representative universal lesion detection model, MULAN, has three head branches: the detection branch predicts whether each proposed region is a lesion and regresses bounding boxes; the tagging branch predicts 185 tags (e.g., body part, lesion type, intensity, shape, etc.) for each lesion proposal; and the segmentation branch outputs a binary mask (lesion/non-lesion) for each proposed region. MULAN significantly surpassed previous lesion detection models such as ULDor and 3DCE (Yan et al., 2018a).
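As a usage sketch for the two-stage detectors reviewed above, the snippet below runs torchvision's Faster R-CNN implementation on a toy one-lesion target. It assumes a recent torchvision release (the `weights` argument); the class count, image size, and boxes are illustrative, not taken from any cited study.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two-stage detector (Faster R-CNN with a ResNet-50 FPN backbone) on a toy
# "lesion" dataset with 2 classes (background + lesion).
model = fasterrcnn_resnet50_fpn(weights=None, num_classes=2)

images = [torch.rand(3, 512, 512)]                           # list of images (C, H, W)
targets = [{
    "boxes": torch.tensor([[100.0, 120.0, 180.0, 200.0]]),   # one ground-truth box (x1, y1, x2, y2)
    "labels": torch.tensor([1]),                              # 1 = lesion
}]

model.train()
losses = model(images, targets)      # dict of RPN and detection-head losses
total_loss = sum(losses.values())
total_loss.backward()

model.eval()
with torch.no_grad():
    detections = model(images)       # list of dicts with "boxes", "labels", "scores"
```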
In addition, Yan et al. (2020) have recently shown that learning from heterogeneous lesion datasets and partial labels can also boost detection performance. Beyond the above strategies, attention mechanisms are another useful way to improve lesion detection. One example is the work of Tao et al.: in clinical practice, it is common for radiologists to inspect multiple CT windows for an accurate lesion diagnosis. The authors first employed three FPNs to generate feature maps from three frequently inspected windows; then an attention module (Woo et al., 2018) was used to reweight the feature maps from the different windows. Prior knowledge of lesion positions was also incorporated to further improve performance.

We observe that, whether in the detection of specific types of lesions or of universal lesions, two-stage detectors are still quite prevalent for their high performance and robustness; however, separately generating region proposals might hinder the development of streamlined CADe schemes. Several very recent studies have demonstrated that good detection performance can also be obtained by one-stage detectors (Pisov et al., 2020; Lung et al., 2021; Zhu et al., 2021b). We predict that advanced anchor-free one-stage detectors (e.g., CenterNet (Duan et al., 2019)), if adjusted properly to accommodate the uniqueness of medical images, will attract much more attention and may even become a better choice than two-stage detectors for developing new CADe schemes in the long run.

As mentioned in the above subsections, whether for specific-type or universal lesion detection, certain amounts of supervision are necessary to train one-stage or two-stage detectors. To establish this supervision, the types of lesions need to be prespecified before training the detectors. Once trained, the detectors cannot detect lesion types not contained in the training dataset. By contrast, unsupervised lesion detection does not require ground-truth annotations, so the lesion types do not need to be prespecified beforehand. Unsupervised detection has the potential to detect arbitrary types of lesions (Baur et al., 2021), but its performance is not comparable to that of fully-supervised/semi-supervised methods. Despite that, it can be used to establish a rough detection of suspicious areas and provide imaging biomarker candidates. To avoid potential confusion, we make the following two clarifications. First, the methods introduced in this subsection originate from "unsupervised anomaly detection", since it is natural to consider lesions like brain tumors as one type of anomaly in medical images. The term "anomaly detection" will therefore be used frequently in this context. Second, it should be noted that "anomaly detection" often appears alongside another term, "anomaly segmentation", in the literature (Baur et al., 2021). This is because they are essentially two closely connected tasks: once anomalous regions are detected in an image, the segmentation map can be obtained by applying a binarization threshold to the detection map. In other words, approaches applicable to one direction are usually suitable for the other, so readers will also see the term "anomaly segmentation".

The core assumption of unsupervised anomaly detection is that the underlying distribution of the normal parts (e.g., healthy tissues and anatomy) of the images can be captured by unsupervised models, whereas abnormal parts such as tumors deviate from this normative distribution, so these anomalies can be detected.
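Relating to the multi-window reading practice exploited above, the following sketch applies several intensity windows to a CT slice expressed in Hounsfield units; the window centers and widths are typical illustrative values, not those used in the cited work (assuming NumPy).

```python
import numpy as np

def apply_window(hu_image: np.ndarray, center: float, width: float) -> np.ndarray:
    """Clip a CT image (in Hounsfield units) to one intensity window and rescale to [0, 1]."""
    low, high = center - width / 2, center + width / 2
    return (np.clip(hu_image, low, high) - low) / (high - low)

ct_slice = np.random.randint(-1000, 1000, size=(512, 512)).astype(np.float32)  # toy HU slice

# Three commonly inspected windows (center, width); exact settings vary by protocol.
windows = {"lung": (-600, 1500), "soft_tissue": (40, 400), "bone": (400, 1800)}
multi_window = np.stack([apply_window(ct_slice, c, w) for c, w in windows.values()])
print(multi_window.shape)   # (3, 512, 512): one channel per window, e.g. one per FPN input
```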
Commonly used models for estimating the normative distribution mainly stem from the concepts of VAEs and GANs, and the success of these unsupervised models has mostly been seen in MRI. Notably, Baur et al. (2021) review a variety of autoencoder-based anomaly segmentation methods in brain MR images. The authors conduct a thorough comparison of these models and present many interesting insights into successful applications. One important conclusion of that paper is that restoration-based approaches generally perform better than reconstruction-based ones when runtime is not a concern. In contrast to this comprehensive review, we briefly introduce reconstruction-based approaches and then narrow our focus to recent works on restoration-based detection. In the reconstruction-based paradigm, an AE- or VAE-based model projects an image into a low-dimensional latent space and then reconstructs the original image from its latent representation. Only healthy images are used for training, and the model is optimized to produce a low pixel-wise reconstruction error. When unhealthy images pass through the model, the reconstruction error is expected to remain low in normal regions but to be high in anomalous areas. Uzunova et al. (2019) employed a CVAE to learn latent representations of healthy image patches. Besides the reconstruction error, they further assumed a large distance between the latent representations of healthy and unhealthy patches. Combining these two distances, the CVAE-based model delivered reasonable segmentation results on MRIs with tumors. It is worth noting that the authors integrated local context into the CVAE by using the relative positions of patches as the condition; this location-related condition provides additional prior information about healthy and unhealthy tissues to improve performance. In the restoration-based paradigm, the target to be restored is either (1) an optimal latent representation or (2) the healthy counterpart of the input anomalous image. Both GAN-based and VAE-based methods have been applied, but GANs are generally used for latent representation restoration of the first type. Although the generator of a GAN can easily map latent vectors back to images, it lacks the capability to perform the inverse mapping (i.e., from images to the latent space), which is important for calculating an anomaly score. This is the key issue tackled by many works adapting GANs for anomaly detection. As a pioneering work, Schlegl et al. (2017) proposed the so-called AnoGAN. To obtain the inverse mapping, the authors first pre-trained a GAN (a generator and a discriminator) on healthy images to learn the normative distribution and kept this model's weights fixed. Then, given an input image (either normal or anomalous), gradient descent in the latent space (with respect to the latent variable) is performed to restore the corresponding optimal latent representation. More concretely, the optimization is guided by two combined losses, namely a residual loss and a discrimination loss. The residual loss, like the previously mentioned reconstruction error, measures the pixel-wise dissimilarity between the real input image and the image generated by the generator from the latent variable. Meanwhile, both images are fed into the discriminator network, and one intermediate layer is used to extract features from them; the difference between these intermediate feature representations constitutes the discrimination loss.
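The following is a hedged sketch of this AnoGAN-style latent restoration: with a pre-trained generator G and discriminator D frozen, a latent code z is optimized by gradient descent under the combined residual and discrimination losses. `G`, `D.features`, `G.latent_dim`, and the weighting `lam` are placeholders for illustration, not the authors' implementation.

```python
import torch

def restore_latent(x, G, D, steps=500, lam=0.1, lr=1e-2):
    """Iteratively restore the latent code of image x under a frozen GAN."""
    z = torch.randn(1, G.latent_dim, requires_grad=True)      # assumed attribute
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_hat = G(z)                                           # generated image
        residual = (x - x_hat).abs().mean()                    # residual loss
        f_real = D.features(x)                                 # intermediate-layer
        f_fake = D.features(x_hat)                             # features (assumed API)
        discrimination = (f_real - f_fake).abs().mean()        # discrimination loss
        loss = (1 - lam) * residual + lam * discrimination
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach(), loss.item()   # the combined loss doubles as an anomaly score
```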
Last, after optimizing the latent variable, the authors use both losses to calculate an anomaly score indicating whether the input image contains anomalous regions. AnoGAN delivers good performance, but the iterative optimization is time-consuming. In their follow-up work, Schlegl et al. (2019) proposed a more efficient model, f-AnoGAN, by introducing an additional encoder that performs fast inverse mapping from image space to latent space. Similar to the development of AnoGAN, they first pre-trained a WGAN on healthy images and again kept the model's weights fixed. The generator with fixed weights was then employed as the decoder of an AE without further training, whereas the AE's encoder was trained using a combination of the two loss functions introduced in AnoGAN. Once fully trained, the encoder network can efficiently map images to the latent space in a single forward pass. Slightly earlier than f-AnoGAN, Baur et al. (2018) proposed the so-called AnoVAEGAN, which combines a VAE and a GAN for fast inverse mapping. In this framework, the GAN's generator and the VAE's decoder are the same network, and the VAE's encoder is employed to learn the inverse mapping; therefore, three components, namely the encoder, decoder, and discriminator, need to be trained. The loss function differs from that of AnoGAN and f-AnoGAN, but it still contains the reconstruction error. Also, in contrast to these two patch-based models, AnoVAEGAN directly takes entire MR images as input and can thereby capture and utilize global context potentially valuable for anomaly segmentation. For the second type, restoring a healthy counterpart of the input image means that, if the input contains abnormal regions, they are expected to be removed in the restored version while the remaining normal areas are retained. A pixel-wise dissimilarity map between the input and the restored image can then be computed, and anomalies can be detected. Successful restoration typically relies on maximum a posteriori (MAP) estimation. Specifically, the posterior being maximized is composed of a normative distribution of healthy images and a data consistency term (Chen et al., 2020d). The normative distribution can be modeled with a VAE or its variants, and its training is guided by the ELBO, an approximation of the VAE's original objective function (Kingma and Welling, 2013). The data consistency term controls to what extent the restored image should resemble the input. In the task of detecting brain tumors from MR images, You et al. (2019) first employ a GMVAE to capture the distribution of lesion-free MR images and adopt the total variation norm for data consistency regularization. These two elements together steer the optimization in MAP estimation so that the healthy counterpart of an anomalous input is iteratively restored. In their more recent work, Chen et al. (2021c) argue that the ELBO may not be a good approximation of the VAE's original loss function; this inaccurate loss can lead to learning an inaccurate normative distribution, making the gradient computation in the iterative optimization deviate from the true direction. To solve this issue, the authors propose using the derivatives of local Gaussian distributions to replace the gradients of the ELBO. When detecting glioblastomas and gliomas in MR images, the proposed approach demonstrates higher accuracy at low false positive rates than other methods. Also, unlike most previous works that rely on 2D MR slices, the authors incorporate 3D information into the VAE's training to further improve performance.
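Below is a hedged sketch of the restoration-based MAP formulation in the spirit of You et al. (2019): the restored image is pushed to stay likely under a VAE trained on lesion-free images while a total-variation term keeps it consistent with the input. `vae_elbo` (a callable returning the ELBO of an image) and `weight` are placeholders, not the authors' code.

```python
import torch

def tv_norm(d):
    """Anisotropic total variation of a difference image (N, C, H, W)."""
    return (d[..., 1:, :] - d[..., :-1, :]).abs().sum() + \
           (d[..., :, 1:] - d[..., :, :-1]).abs().sum()

def restore_healthy(x, vae_elbo, steps=300, weight=1.0, lr=1e-2):
    """MAP-style restoration: maximize ELBO(y) subject to data consistency with x."""
    y = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        loss = -vae_elbo(y) + weight * tv_norm(y - x)   # negated MAP objective
        opt.zero_grad(); loss.backward(); opt.step()
    return y.detach()   # anomaly map: |x - y| after restoration
```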
Registration, the process of aligning two or more images into one coordinate system with matched contents, is also an important step in many (semi-)automatic medical image analysis tasks. Image registration can be categorized into two groups: rigid and deformable (non-rigid). In rigid registration, all image pixels uniformly undergo a simple transform (e.g., rotation), while deformable registration aims to establish a non-uniform mapping between images. In recent years, there have been more applications of deep learning in this research area, especially for deformable image registration. Similar to the organization of the review article by Haskins et al. (2020), the deep learning-based medical image registration approaches in our survey are categorized into three groups: (1) deep iterative registration; (2) supervised registration; and (3) unsupervised registration. Interested readers can refer to several other excellent review papers (Fu et al., 2020; Ma et al., 2021b) for a more comprehensive set of registration methods. In deep iterative registration, deep learning models learn a metric that quantifies the similarity between a target/moving image and a reference/fixed image; the learned similarity metric is then used in conjunction with traditional optimizers to iteratively update the registration parameters of classical (i.e., non-learning-based) transformation frameworks. For example, Simonovsky et al. (2016) used a 5-layer CNN to learn a metric that evaluates the similarity between aligned 3D brain MRI T1-T2 image pairs, and then incorporated the learnt metric into a continuous optimization framework to perform deformable registration. This deep learning based metric outperformed manually defined similarity metrics such as mutual information for multimodal registration (Simonovsky et al., 2016). In essence, this work is most closely related to an earlier approach by Cheng et al. (2018) that estimates the similarity of 2D CT-MR patch pairs using an FCN pre-trained with a stacked denoising autoencoder; the major differences between these two works lie in the network architecture (CNN vs. FCN), the application scenario (3D vs. 2D), and the training strategy (from scratch vs. pre-training). For T1-T2 weighted MR images and CT-MR images, Haskins et al. (2019) claimed that it is relatively easy to learn a good similarity metric because these multimodal images share largely similar views or simple intensity mappings. They extended the deep similarity metric to a more challenging scenario, 3D MR-TRUS prostate image registration, where a large appearance difference exists between the two imaging modalities. In summary, "deep similarity", which avoids manually defining similarity metrics, is useful for establishing pixel-to-pixel and voxel-to-voxel correspondences. Deep similarity remains an important research track, and it is often mentioned interchangeably with several other terms such as "metric learning" and "descriptor learning" (Ma et al., 2021b). Note that reinforcement learning methods can also be used to implicitly quantify image similarity, but we do not expand on this topic since reinforcement learning is beyond the scope of this review paper.
Instead, more advanced deep similarity based approaches (e.g., adversarial similarity) will be reviewed in the unsupervised registration subsection. Despite the success of deep iterative registration, the process of learning a similarity metric followed by iterative optimization within classic registration frameworks is too slow for real-time registration. In comparison, some supervised registration methods directly predict deformation fields/transformations in a single step, bypassing the need for iterative optimization. These methods typically require ground truth warp/deformation fields, which can be synthesized/simulated (Uzunova et al., 2017), manually annotated, or obtained via classical registration frameworks. For 3D deformable image registration, Sokooti et al. (2017) developed a multi-scale CNN-based model to directly predict displacement vector fields (DVFs) between image pairs. To make their training dataset larger and more diversified, they first artificially generated DVFs with varying spatial frequency and amplitude and then applied data augmentation on the generated DVFs, resulting in approximately 1 million training examples. After training, deformed images were registered in one shot, and the method demonstrated performance close to that of a conventional B-spline registration. Besides supervision from ground truth deformation fields, image similarity metrics are sometimes incorporated to provide additional guidance for more accurate registration; such a combination is referred to as "dual supervision". In a recent study, Fan et al. (2019a) developed a dual-supervised training strategy with dual guidance for brain MR image registration. With the ground truth guidance, the difference between the ground truth field and the predicted deformation field was calculated; with the image similarity guidance, the authors computed the difference between the template image and the subject image warped using the predicted deformation field. The former guidance enabled the network to converge quickly, while the latter further refined training and yielded more accurate registration results. A representative unsupervised framework is VoxelMorph (Figure 7), which does not need supervised information (e.g., true registration fields or anatomical landmarks). The model has two components: a convolutional U-Net and a spatial transformer network (STN). The authors formulated 3D MR brain volume registration as a parametric function, which was modeled using the U-Net architecture. The encoder's input is the concatenation of a moving image and a fixed image, and the decoder outputs a registration field. The spatial transformer network (Jaderberg et al., 2015) was applied to warp the moving image with the learned registration field, resulting in a reconstructed version of the fixed image. By minimizing the difference between the reconstructed image and the fixed image, VoxelMorph can update its parameters to generate the desired deformation fields. This unsupervised registration framework operated orders of magnitude faster while achieving performance competitive with Symmetric Normalization (SyN) (Avants et al., 2008), a classic registration algorithm. In a later paper, the authors extended VoxelMorph to leverage auxiliary segmentation information (anatomical segmentation maps), and the extended model demonstrated improved registration accuracy. Prior to this, several works had shown that, when there is no ground truth for voxel-level transformations, solely using auxiliary anatomical information can achieve accurate cross-modality registration (Hu et al., 2018c; Hu et al., 2018d). Note that the inclusion of segmentation information from corresponding anatomical structures is often referred to as "weakly supervised registration".
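A hedged, 2D sketch of the VoxelMorph-style unsupervised objective is given below: the predicted displacement field warps the moving image through a spatial-transformer-style sampler, and the loss combines an image similarity term with a smoothness regularizer. The U-Net that predicts `flow`, the (dx, dy) channel convention, and the MSE similarity (local cross-correlation is also common) are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp images (N,1,H,W) with a displacement field (N,2,H,W) in pixels;
    channel 0 is assumed to hold dx and channel 1 dy."""
    _, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(moving.device)   # (H, W, 2)
    new = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)               # sampling locations
    gx = 2.0 * new[..., 0] / (w - 1) - 1.0                           # normalize to [-1, 1]
    gy = 2.0 * new[..., 1] / (h - 1) - 1.0
    return F.grid_sample(moving, torch.stack((gx, gy), dim=-1), align_corners=True)

def unsupervised_loss(fixed, warped, flow, lam=0.01):
    similarity = F.mse_loss(warped, fixed)                           # image similarity
    dx = (flow[..., :, 1:] - flow[..., :, :-1]).pow(2).mean()        # smoothness of the
    dy = (flow[..., 1:, :] - flow[..., :-1, :]).pow(2).mean()        # displacement field
    return similarity + lam * (dx + dy)
```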
DLIR is another well-known unsupervised registration framework (de Vos et al., 2019), which extends the authors' previous work (de Vos et al.). DLIR has four stages that progressively perform image registration: the first stage is designed for affine image registration (AIR), and the remaining three stages are for deformable image registration (DIR). In the AIR stage, a CNN takes pairs of fixed and moving images as input and outputs predictions of the affine transformation parameters, so that affinely aligned image pairs are obtained. In the subsequent DIR stage, these aligned image pairs are the input of a new CNN, whose output is a B-spline displacement vector field serving as the deformation field. With this field, deformably registered image pairs are obtained, and the registration results are further refined through the remaining two DIR stages. The unsupervised registration frameworks described above all utilize manually defined similarity metrics and certain regularization terms in their loss functions. For instance, the loss function of VoxelMorph consists of a similarity metric (mean squared error or cross-correlation (Avants et al., 2008)) that quantifies the voxel correspondence between the warped image and the fixed image, and a regularization term that controls the spatial smoothness of the warped image. Despite the effectiveness of classical similarity measures in mono-modal registration, they have been less successful than deep similarity metrics in most multi-modal cases. To this end, advanced deep similarity metrics learned in unsupervised regimes have been proposed to achieve superior results for multi-modal registration. One notable example is the adversarial similarity proposed by Fan et al. (2019b). Specifically, the authors proposed an unsupervised adversarial network with a U-Net-based generator and a CNN-based discriminator. The generator takes two input image volumes (a moving image and a fixed image) and outputs a deformation field, whereas the discriminator determines whether a negative pair of images (the fixed image and the moving image warped using the predicted field) is well registered by comparing its similarity with a positive pair of images (the fixed image and a reference image). Using the feedback from the discriminator, the generator is trained to produce deformations accurate enough to fool the discriminator. This unsupervised adversarial similarity network yielded promising results for mono-modal brain MRI registration and multi-modal pelvic image registration. Two aspects of this work are noteworthy: (1) the discriminator of the GAN is used to implicitly learn an adversarial similarity that determines the voxel-to-voxel correspondence; and (2) the proposed framework applies to both mono-modal and multi-modal registration. The progress of medical image analysis using deep learning follows a lagging but similar timeline to computer vision. However, due to the differences between medical images and natural images, directly using methods from computer vision may not yield satisfactory results. In order to achieve good performance, challenges unique to medical imaging tasks need to be addressed.
For the classification task, the key to success lies in extracting highly discriminative features with respect to certain classes. This is relatively easy in domains with large inter-class variance (e.g., accuracies on many public chest X-ray datasets often exceed 90%), but it can be difficult in domains with high inter-class similarity. For example, the overall performance of mammogram classification is not as good (e.g., accuracies of 70-80% are commonly seen on private datasets), since discriminative features of breast tumors are difficult to capture in the presence of overlapping, heterogeneous fibroglandular tissues (Geras et al., 2019). The notion of fine-grained visual classification (FGVC), which aims at identifying subtle differences between visually similar objects, might be suited to learning distinctive features under high inter-class similarity. Note, however, that benchmark FGVC datasets were purposely collected so that all image samples uniformly exhibit high inter-class similarity. As a result, approaches developed and evaluated on these datasets may not be readily applicable to medical datasets, where only a certain fraction, rather than all, of the images exhibit high inter-class similarity. Nonetheless, we believe FGVC methods, if modified appropriately, will be valuable for learning feature representations with high discriminative power in medical image classification. Other possible ways to enhance the discriminative power of features include the use of attention modules, local and global features, domain knowledge, etc. Medical object detection is more complicated than classification, as can be seen from the process of bounding box prediction. Naturally, detection faces the challenges inherent to classification. Meanwhile, there are additional challenges, especially the detection of small-scale objects (e.g., small lung nodules) and class imbalance. One-stage detectors typically perform comparably to two-stage detectors in detecting large objects but struggle more with small objects. Existing studies show that using multi-scale features can greatly alleviate this issue in both one-stage and two-stage detectors. A simple yet effective approach is featurized image pyramids, where features are extracted independently from multiple scales of the same image. This method can help enlarge small objects to achieve better performance but is computationally expensive and slow; nonetheless, it is suitable for medical detection tasks with no requirement for fast speed. Another useful but much faster approach is feature pyramids, which utilize multi-scale feature maps from different convolutional layers. Although there are various ways to build feature pyramids, a rule of thumb is that it is necessary to fuse strong, high-level semantics with high-resolution feature maps. This plays an important role in detecting small objects, as shown by FPN (Lin et al., 2017a). Class imbalance arises from the fact that detectors need to evaluate a huge number of candidate regions, but only a few contain objects of interest. In other words, the class distribution is severely skewed toward negative examples (e.g., background regions), most of which are easy negatives. The presence of large amounts of easy negatives can overwhelm the training process, leading to poor detection results. Two-stage detectors handle this class imbalance much better than one-stage detectors, because most negative proposals are filtered out at the region proposal stage. For one-stage detectors, recent studies show that abandoning the dominant use of anchor boxes can largely alleviate class imbalance; however, most approaches adopted in medical object detection are still anchor-based. In the near future, we expect to see more explorations of anchor-free, one-stage detectors in medical object detection.
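One widely used remedy for the easy-negative domination just described, particularly in one-stage detectors, is the focal loss, which down-weights well-classified (mostly background) candidates so that training focuses on hard examples. A minimal binary sketch follows; the alpha/gamma values are common defaults rather than task-specific choices.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits and targets share the same shape; targets are 0/1 floats."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # easy examples are damped
```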
Medical image segmentation combines challenges from classification and detection. As in detection, class imbalance is a common issue across 2D and 3D medical segmentation tasks. Another similar challenge is the segmentation of small lesions (e.g., multiple sclerosis lesions in MRI) and small organs (e.g., the pancreas in abdominal CT scans), and these two challenges often appear intertwined. These issues have been largely alleviated by adapting the metrics/losses used to evaluate segmentation performance, such as the Dice coefficient (Milletari et al., 2016), generalized Dice (Sudre et al., 2017), the integration of the focal loss (Abraham and Khan, 2019), etc. However, these metrics are region-based (i.e., segmentation errors are computed in a pixel-wise manner), which can lead to a loss of valuable information regarding structures, shapes, and contours that are important for diagnosis/prognosis at later stages. Therefore, we believe it is necessary to develop non-region-based metrics that provide information complementary to region-based metrics for better segmentation performance. Currently only a few studies exist in this direction (Kervadec et al., 2019); we expect to see more in the future. In addition, strategies such as incorporating local and global context, attention mechanisms, multi-scale features, and anatomical cues are generally beneficial for increasing segmentation accuracy for both large and small objects. Here we want to emphasize the great potential of Transformers due to their strong capability of modeling long-range dependencies. Although long-range dependencies are helpful for achieving accurate segmentation, the majority of CNN-based methods do not explicitly focus on this aspect. There are roughly two types of dependencies, namely intra-slice dependency (pixel relationships within a CT or MRI slice) and inter-slice dependency (pixel relationships between CT or MRI slices) (Li et al., 2020e). Recent studies show that Transformer-based approaches are powerful in both cases (Chen et al., 2021b; Valanarasu et al., 2021). Applications of Transformers to medical image segmentation, especially in 3D, are still at an early stage, and more works in this direction are likely to emerge soon.
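For reference, the sketch below shows the single-head scaled dot-product self-attention that underlies the Transformer-based approaches discussed above; real architectures add multiple heads, learned per-head projections, residual connections, and normalization.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (N, L, d) token embeddings; w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (N, L, L) pairwise scores
    return F.softmax(scores, dim=-1) @ v                      # attention-weighted values
```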
Medical image registration differs significantly from the previous tasks because its purpose is to find the pixel-wise or voxel-wise correspondence between two images. One unique challenge is the difficulty of acquiring reliable ground truth registrations, which are either synthetically generated or produced by conventional registration algorithms. Unsupervised methods have shown great promise in solving this issue. However, many unsupervised registration frameworks (e.g., de Vos et al., 2019) are composed of multiple stages that register images in a coarse-to-fine manner. Despite their good performance, multi-stage frameworks increase computational complexity and make training difficult. It would be desirable to develop registration frameworks that have as few stages as possible and can be trained end to end. Although deep learning has brought about huge successes across different tasks in radiological image analysis, further performance improvement is largely hindered by the requirement for large amounts of annotated data. Supervised transfer learning can greatly alleviate this issue by initializing the model's weights (for the target task) with the weights of a model pre-trained on related or unrelated datasets (e.g., ImageNet). Besides widely used transfer learning, there are two possible directions: (1) utilizing GAN models to enlarge the labeled dataset; and (2) utilizing self-supervised and semi-supervised learning models to exploit the information underlying vast amounts of unlabeled medical images. GANs have shown great promise in medical image synthesis and semi-supervised learning, but one challenge is how to build a strong connection between the GAN's generator and the target task (e.g., classifier, detector, segmentor). The lack of such a connection may yield only a marginal performance boost compared to conventional data augmentation (e.g., rotation, rescaling, and flipping) (Wu et al., 2018a). The connection between the generator and the classifier can be strengthened by utilizing a semi-supervised GAN, in which the discriminator is modified to serve as a classifier (Salimans et al., 2016). Several training strategies can also be employed: identifying a "bad" generator that can significantly contribute to good semi-supervised classification (Dai et al., 2017), or jointly optimizing the three components of a generator, a discriminator, and a classifier. It is meaningful to explore new ways of effectively setting up connections between the generator and a specific medical image task for better performance. Additionally, GANs usually need at least thousands of training examples to converge, which limits their applicability to small medical datasets. This challenge can be partially addressed by using classic data augmentation for adversarial learning (Frid-Adar et al., 2018a; Frid-Adar et al., 2018b). Further, if there exist relatively large amounts of medical images that share structural, textural, and semantic similarities with the target dataset, pre-training generators and/or discriminators may facilitate faster convergence and better performance (Rubin et al., 2019). Meanwhile, some recent augmentation mechanisms, such as differentiable augmentation (Zhao et al., 2020) and adaptive discriminator augmentation (Karras et al., 2020), have enabled GANs to generate high-fidelity images effectively under data-limited conditions, but they have not yet been applied to medical image analysis tasks. We anticipate that these new methods will also demonstrate promising performance in future studies in the medical image analysis field.
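A hedged sketch of the semi-supervised GAN idea noted above (Salimans et al., 2016): the discriminator predicts K real classes plus one "fake" class, so labeled, unlabeled, and generated images can all contribute to training. The `backbone`, feature dimension, and the loss terms in the comments are placeholders for illustration.

```python
import torch.nn as nn

class SemiSupervisedDiscriminator(nn.Module):
    """Discriminator doubling as a classifier with K + 1 outputs."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                          # any feature extractor
        self.head = nn.Linear(feat_dim, num_classes + 1)  # K classes + "fake"

    def forward(self, x):
        return self.head(self.backbone(x))                # (N, K + 1) logits

# Loss terms (illustrative):
#   labeled images:   cross-entropy over the first K classes
#   unlabeled images: encourage "real", i.e. low probability of the fake class
#   generated images: cross-entropy toward the fake class
```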
Self-supervision can be constructed through either pretext tasks or contrastive learning, but the latter appears to be the more promising research direction. On the one hand, directly using pretext tasks (e.g., jigsaw puzzles) from computer vision is typically not adequate to ensure learning robust feature representations for radiological images; on the other hand, designing novel pretext tasks can be difficult and demands delicate manipulation. Instead of designing various pretext tasks, self-supervised contrastive learning trains the network to capture meaningful feature representations by forcing them to be invariant to different augmented views, which can potentially outperform supervised transfer learning on different downstream tasks such as medical image classification and segmentation. Despite the encouraging performance of self-supervised contrastive learning, its applications in radiological image analysis are still at an exploratory stage, and how to make appropriate use of this new learning paradigm is a difficult problem. To unleash its potential, we provide suggestions from the following three aspects. (1) Harness the benefits of both contrastive learning and supervised learning. From the existing studies, we find that a majority adopt two separate steps for medical image analysis: contrastive pre-training on unlabeled data and supervised fine-tuning on labeled data. At the pre-training stage, most studies rely on relatively large unlabeled datasets to ensure learning high-quality, transferrable features, which can yield superior performance after being tuned with limited labeled data. However, the reliance on large unlabeled datasets can be problematic in tasks lacking large amounts of unlabeled data. To expand the application scope, learning high-quality feature representations with less unlabeled data would be desirable. One possible approach is to unify the two separate steps into one so that label information can be leveraged in contrastive learning. This is somewhat reminiscent of semi-supervised learning, which simultaneously utilizes unlabeled and labeled data to achieve better performance. More concretely, class labels can be used to guide the construction of positive and negative pairs in a more compact manner by pushing images from the same class to be more closely aligned in the lower-dimensional representation space (Khosla et al., 2020). Features learned in this way should require less unlabeled data and be less redundant than features learned solely through self-supervised learning (i.e., without any class labels). (2) Take into account certain properties of contrastive learning for better performance. For example, one study proves that contrastive learning benefits more from large blocks of similar points than from pairs (Saunshi et al., 2019). This heuristic may be well suited to learning transferrable features from 3D CT and MRI volumes exhibiting consecutive anatomical similarity. (3) Customize data augmentation strategies for downstream tasks that are sensitive to augmentation. The composition of different data augmentation strategies has proved critical to learning representative features in most existing contrastive learning frameworks. For instance, SimCLR applies three types of transformations to unlabeled images, namely random cropping, color distortion, and Gaussian blur. However, some commonly used augmentation techniques may not be applicable to medical images. In radiology, where most images are in grayscale, the color distortion strategy is likely not suitable. Also, in cases where fine-grained details of unlabeled medical images carry important information, applying Gaussian blur may destroy the detailed information and degrade the quality of the feature representations learned at the pre-training stage. Therefore, it is important to choose appropriate data augmentation strategies to ensure satisfactory downstream performance.
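Following point (3), the sketch below adapts a SimCLR-style two-view augmentation pipeline to grayscale radiological images by dropping color distortion and omitting blur to preserve fine-grained detail; the specific transforms and magnitudes are assumptions to be validated per dataset and task.

```python
import torchvision.transforms as T

def two_views(image):
    """Return two stochastically augmented views of the same (PIL) image."""
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.6, 1.0)),  # cropping retained
        T.RandomHorizontalFlip(),
        T.RandomRotation(degrees=10),                # mild geometric jitter
        T.ToTensor(),                                # no color distortion / blur
    ])
    return augment(image), augment(image)
```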
In addition, self-supervised contrastive pre-training is currently impeded by the high computational cost of large models (e.g., ResNet-50 (4×), ResNet-152 (2×)), which require a large group of multi-core TPUs. Therefore, developing novel models or training strategies that enhance computational efficiency is an important direction. For example, Reed et al. (2022) proposed a hierarchical pre-training strategy that makes the self-supervised pre-training process converge up to 80× faster with improved accuracy across different tasks. Like self-supervised contrastive learning, recent semi-supervised methods such as FixMatch (Sohn et al., 2020) heavily rely on advanced data augmentation strategies to achieve good performance. To facilitate the application of semi-supervised learning in medical image analysis, it is necessary to develop appropriate augmentation policies in a dataset-driven and/or task-driven manner. Being "dataset-driven" means finding the optimal augmentation policy for a specific dataset of interest. In the past, this was not easy to achieve due to the extremely large parameter search space (e.g., about 10^34 possible augmentation policies, as shown by Cubuk et al. (2020)). Recently, automated data augmentation strategies such as RandAugment have been proposed to significantly reduce the search space; however, the concept of automated augmentation remains largely unexplored in medical image analysis. Being "task-driven" means finding suitable augmentation strategies for a specific task (e.g., MRI prostate segmentation) for which several datasets exist. This can be regarded as an extension of dataset-driven augmentation and is thus more challenging, but it can help algorithms developed on one dataset generalize better to other dataset(s) of the same task. Another issue is the potential performance degradation caused by violating the underlying assumption of semi-supervised learning, namely that labeled and unlabeled data come from the same distribution. Indeed, distribution mismatch is a common problem when semi-supervised methods are applied to medical image analysis. Consider the following example: in the task of segmenting COVID-19 lung infections from CT slices, suppose the labeled CT volumes contain a relatively balanced number of infected and non-infected slices, while the available unlabeled CT volumes contain no or only a few infected slices; or the unlabeled CT images contain not only COVID-19 infections but also some other disease class(es) (e.g., tuberculosis) absent from the labeled images. What happens if the distribution of the unlabeled data mismatches that of the labeled data? Existing studies suggest that this causes the performance of semi-supervised methods to degrade drastically, sometimes to below that of a simple supervised baseline (Oliver et al., 2018; Guo et al., 2020). Therefore, it is necessary to adapt semi-supervised algorithms to be tolerant of the distribution mismatch between labeled and unlabeled medical data. As a related field, "domain adaptation" may provide insights for achieving this goal.
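For reference, a minimal sketch of the confidence-thresholded pseudo-labeling at the core of FixMatch-style training is shown below; the threshold also acts as a crude filter against unlabeled samples the model finds unfamiliar, although it does not by itself solve the distribution-mismatch problem discussed above. The weak and strong augmentation pipelines are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_weak, x_strong, threshold=0.95):
    """Consistency loss on unlabeled data with confidence-masked pseudo-labels."""
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=1)      # predictions on the weak view
        conf, pseudo = probs.max(dim=1)              # pseudo-labels and confidences
        mask = (conf >= threshold).float()           # keep only confident samples
    loss = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    return (mask * loss).mean()
```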
The continuing success of deep learning in medical image analysis originates not only from different learning paradigms (unsupervised, semi-supervised) but also, perhaps to a larger extent, from the architectures/models proposed over time. Looking back, we find that non-trivial improvements are closely related to the progress of network architectures. "Given this progression history, it is certainly possible that a better neural architecture can by itself overcome many of the current limitations", as pointed out by Yuille and Liu (2021). We discuss two aspects that may be helpful for finding better architectures. First, biologically and cognitively inspired mechanisms will continue to play an important role in architecture design. Deep neural networks were originally inspired by the architecture of the cerebral cortex. In recent years the concept of attention, inspired by primates' visual attention mechanisms, has been successfully used in NLP and computer vision to make models focus on important parts of the input data, leading to superior performance. A preeminent example is the family of Transformers based on self-attention (Vaswani et al., 2017). Transformer-based architectures are better at capturing global/long-range dependencies between input and output sequences than mainstream CNN-based models, and the inductive biases inherent to CNNs (e.g., translation equivariance and locality) are much weaker in Transformers (Dosovitskiy et al., 2020). Aside from attention mechanisms, many other biological or cognitive mechanisms, such as dynamic hierarchies in human language and one-shot learning of new objects and concepts without gradient descent (Marblestone et al., 2016), may provide inspiration for designing more powerful architectures. Second, automatic architecture engineering may shed light on developing better architectures. Currently employed architectures mostly come from human experts, and the design process is iterative and prone to errors. Partially for this reason, models used for medical image analysis are primarily adapted from models developed in computer vision. To avoid the need for manual design, researchers have proposed to automate architecture engineering, and one related field is neural architecture search (NAS) (Zoph and Le, 2016). However, most existing NAS studies are confined to image classification (Elsken et al., 2019), and truly revolutionary models that bring fundamental changes have not come out of this process (Yuille and Liu, 2021). Nonetheless, NAS is still a direction worthy of exploration. At a broader level, pipelines with automated configuration capabilities would be desirable. Although architecture engineering still faces many difficulties, developing automatic pipelines capable of automatically configuring their subcomponents (e.g., choosing and adapting an appropriate architecture among the existing ones) to achieve better performance will be beneficial to radiological image analysis. At present, deep learning based pipelines typically involve several interdependent subcomponents, such as image pre- and post-processing, adapting and training a network architecture, and selecting appropriate losses and data augmentation methods. The design choices are often too numerous for experimenters to manually identify an optimal pipeline. Moreover, a high-performing pipeline configured for one dataset (e.g., CT images from one hospital) of a specific task may perform badly on another dataset (e.g., CT images from a different hospital) of the same task. Therefore, pipelines that can automatically configure their subcomponents are needed to speed up empirical design.
Examples falling within this scope include NiftyNet (Gibson et al., 2018b), a modular pipeline for different medical applications, and nnU-Net (Isensee et al., 2021), designed specifically for medical image segmentation. We expect more research to come out of this track. Domain knowledge, an important but sometimes overlooked aspect, can provide insights for developing high-performing deep learning algorithms in medical image analysis. As mentioned previously, most models used in medical vision are adapted from models developed for natural images; however, medical images are generally more difficult to handle due to unique challenges (e.g., high inter-class similarity, limited labeled data, label noise). Domain knowledge, if used appropriately, helps alleviate these issues with less time and computational cost. It is relatively easy for researchers with a strong deep learning background to utilize weak domain knowledge, such as anatomical information in MRI and CT images (Zhou et al., 2021; Zhou et al., 2019a), multi-instance data from the same patient (Azizi et al., 2021), patient metadata (Vu et al., 2021), radiomic features, and text reports accompanying images (Zhang et al., 2020a). On the other hand, we observe that it can be more difficult to effectively incorporate the strong domain knowledge that radiologists are familiar with. One example is breast cancer identification from mammograms. For each patient, four mammograms are available, including two cranio-caudal (CC) and two medio-lateral oblique (MLO) views of the left (L) and right (R) breasts. In clinical practice, the bilateral difference (e.g., LCC vs. RCC) and unilateral correspondence (e.g., LCC and LMLO) serve as important cues for radiologists to detect suspicious regions and determine malignancy. Currently, few methods can reliably and accurately utilize this expert knowledge; therefore, more research efforts are needed to maximize the use of strong domain knowledge. Deep learning, despite being intensively used for analyzing medical images in academia and industrial research institutions, has not made as significant an impact in clinical practice as expected. This is clearly reflected in the early stages of the fight against COVID-19, the first global pandemic of the deep learning era. Due to its widespread medical, social, and economic consequences, this pandemic can, to a large extent, be regarded as a big test of the current status of deep learning algorithms in clinical translation. Soon after the outbreak, researchers around the world applied deep learning techniques to analyze mainly chest X-rays and CT images from patients with suspected infection, aiming at accurate and efficient diagnosis/prognosis of the disease. To this end, numerous deep learning and machine learning based approaches were developed. However, after systematically reviewing over 200 prediction models from 169 studies published up to 1 July 2020, Wynants et al. (2020) concluded that all of these models were at high or unclear risk of bias and thus none of them were suitable for clinical use: each model reported moderate or excellent performance, but the optimistic results were highly biased due to model overfitting, inappropriate evaluation, use of improper data sources, etc.
A similar conclusion was drawn in another review paper (Roberts et al., 2021): after reviewing 62 studies selected from 415, the authors concluded that, because of methodological flaws and/or underlying biases, none of the deep learning and machine learning models identified were clinically applicable to the diagnosis/prognosis of COVID-19. Going beyond the example of COVID-19, the high risk of bias in deep learning approaches is a recurring concern across different medical image analysis tasks and applications (Nagendran et al., 2020), which has severely limited deep learning's potential in clinical radiology. Although quantifying the underlying bias is difficult, it can be reduced if handled appropriately. In the following we summarize three major aspects that could lead to biased results and provide our recommendations. Data forms the basis of deep learning. In medical vision, image datasets of increasingly large size (e.g., usually at least several hundred images) have been or are being developed to facilitate training and testing new algorithms. One notable example is the yearly MICCAI challenges, where benchmark datasets for different diseases (e.g., cancer) are released, greatly promoting the progress of medical vision. However, we need to be cautious about the potential biases caused by relying on a single public dataset: as the whole community strives for state-of-the-art performance, community-wide overfitting is likely to occur on that dataset (Roberts et al., 2021). This problem has been recognized by many researchers, so it is common to see several public datasets and/or private dataset(s) used to test a new algorithm's performance more comprehensively. In this way the community-wide bias is reduced, but not to the extent required for large-scale clinical applications. The community-wide bias can be further lowered by incorporating additional data to train and test models. One direct way to introduce new data is, of course, data curation, i.e., continually creating large, diverse datasets through collective work with experts. Apart from this track, we recommend a less direct but effective way: integrating scattered private datasets as ethical and legal regulations permit. The medical image analysis community may have the overall impression that large, representative, labeled data is always lacking. This is only partially true, though. Due to time and cost constraints, many established public datasets indeed have limited size and variety. On the other hand, rich medical image sources (labeled and unlabeled) of different sizes and difficulty levels already exist, but inconveniently "in the form of isolated islands". Because of factors such as privacy protection and political intricacy, most existing data sources are kept private and scattered across institutions in different countries. Thus, it would be desirable to exploit the unified potential of private datasets and even personal data without compromising patients' privacy. A promising approach to achieving this goal is federated learning (Li et al., 2020f), which allows models to securely access sensitive data: federated learning can train deep learning algorithms collaboratively on multi-institutional data without exchanging data among the participating institutions (Rieke et al., 2020).
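A hedged sketch of federated averaging (FedAvg), one common aggregation rule in federated learning: each institution trains locally, and only model weights and sample counts are shared and averaged. This is a generic illustration, not the protocol of any specific framework cited above.

```python
import copy

def federated_average(local_state_dicts, num_samples):
    """Weighted average of client model weights, proportional to local data size."""
    total = float(sum(num_samples))
    avg = copy.deepcopy(local_state_dicts[0])
    for key in avg:
        # Note: integer buffers (e.g., batch-norm counters) may need special handling.
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(local_state_dicts, num_samples))
    return avg   # load into the global model via model.load_state_dict(avg)
```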
Although this technology is accompanied by new challenges, it facilitates learning less biased, more generalizable, more robust, and better-performing algorithms that would better meet the needs of clinical applications. Most research papers in medical image analysis report model performance via commonly used metrics, for example, accuracy and AUC for classification tasks and the Dice coefficient for segmentation tasks. While these metrics can easily quantify the technical performance of the presented approaches, they often fail to reflect clinical applicability. Ultimately, clinicians care about whether the use of an algorithm would bring about a beneficial change in patient care, rather than the performance gains reported in papers (Kelly et al., 2019). Therefore, aside from applying the necessary metrics, we believe it is important for research teams to collaborate with clinicians on algorithm appraisal. We briefly mention two possible directions for establishing such collaborative evaluation. First, involve clinicians in sharing viewpoints on open clinical questions, in paper writing, and even in the peer review process of conferences and journals. For example, the Machine Learning for Healthcare (MLHC) conference provides a research track and a clinical track for members of these separate communities to exchange insights. Second, measure whether the performance and/or efficiency of clinicians can be improved with the assistance of deep learning algorithms. Utilizing model outputs as a "second opinion" to facilitate clinicians' final interpretation has been explored in several studies. For instance, in the task of predicting breast cancer from mammograms, McKinney et al. (2020) evaluated the complementary role of a deep learning model. They found that the model could correctly identify many cancer cases missed by radiologists. Furthermore, in the "double-reading process" (standard practice in the UK), the model significantly reduced the second reader's workload while maintaining performance comparable to the consensus opinion. The rapid progress of computer vision is closely related to a research culture that advocates reproducibility. In medical image analysis, more and more researchers choose to make their code publicly available, and this greatly helps avoid duplication of effort. More importantly, good reproducibility can help deep learning algorithms gain more trust and confidence from a wide population (e.g., researchers, clinicians), which is beneficial to large-scale clinical applications. To make results more reproducible, we suggest paying extra attention to describing data selection in papers. It is not uncommon for different studies to select different subsets of samples from the same public dataset, which increases the difficulty of reproducing the results stated in a paper. In a case study on lung nodule classification, Baltatzis et al. (2021) demonstrated that specific choices of data can turn out to be favorable for proving a proposed model's superiority; advanced models with bells and whistles may underperform simple baselines if the data samples are changed. Therefore, it is necessary to clearly state the data selection process to make the results more reproducible and convincing. In conclusion, deep learning is a fast-developing technology and has shown promising potential across broad medical image analysis tasks, including disease classification, segmentation, detection, and image registration.
Despite significant research progress, we still face many technical challenges or pitfalls (Roberts et al., 2021) in developing deep learning based CAD schemes that achieve high scientific rigor. Therefore, more research efforts are needed to overcome these pitfalls before deep learning based CAD schemes can be widely accepted by clinicians.
Network Accurate Pulmonary Nodule Detection in Computed Tomography Images Using Deep Convolutional Neural Networks Very deep convolutional neural network based image classification using small training sample size Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification Dual path networks Learning to detect lymphocytes in immunohistochemistry with deep learning Renal Cell Carcinoma Detection and Subtyping with Minimal Point-Based Annotation in Whole-Slide Images Knowledge-guided Pretext Learning for Utero-placental Interface Detection. Medical image computing and computer-assisted intervention : MICCAI FocalMix: Semi-Supervised Learning for 3D Medical Image Detection Propagating uncertainty in multi-stage bayesian convolutional neural networks with application to pulmonary nodule detection MULAN: Multitask Universal Lesion Analysis Network for Joint Lesion Detection, Tagging, and Segmentation Learning from Multiple Datasets with Heterogeneous and Partial Labels for Universal Lesion Detection in CT Deep Volumetric Universal Lesion Detection Using Light-Weight Pseudo 3D Convolution and Surface Point Regression Medical Image Computing and Computer Assisted Intervention -MICCAI 2020 Deep Lesion Graphs in the Wild: Relationship Learning and Organization of Significant Radiology Image Findings in a Diverse Large-Scale Lesion Database DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1) Joint learning for pulmonary nodule segmentation, attributes and malignancy prediction MVP-Net: Multi-view FPN with Position-Aware Attention for Deep Universal Lesion Detection Keypoints Localization for Joint Vertebra Detection and Fracture Severity Quantification ROSNet: Robust one-stage network for CT lesion detection You Only Learn Once: Universal Anatomical Landmark Detection Autoencoders for unsupervised anomaly segmentation in brain MR images: A comparative study Unsupervised pathology detection in medical images using conditional variational autoencoders Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, International conference on information processing in medical imaging f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks Deep autoencoding models for unsupervised anomaly segmentation in brain MR images, International MICCAI Brainlesion Workshop Unsupervised lesion detection via image restoration with a normative prior Unsupervised lesion detection via image restoration with a normative prior Normative ascent with local gaussians for unsupervised lesion detection Cross-View Relation Networks for Mammogram Mass Detection Relation networks for object detection Cross-view Correspondence Reasoning based on Bipartite Graph Convolutional Network for Mammogram Mass Detection Fast ScanNet: Fast and Dense Analysis of Multi-Gigapixel Whole-Slide Images for Cancer Metastasis Detection Deep learning in medical image registration: a survey. 
Machine Vision and Applications 31 Deep learning in medical image registration: a review Image matching from handcrafted to deep features: A survey Deep similarity learning for multimodal medical images Learning deep similarity metric for 3D MR-TRUS image registration Training CNNs for image registration from few samples with model-based data augmentation BIRNet: Brain image registration using dual-supervised fully convolutional networks Unsupervised 3D end-to-end medical image registration with volume tweening network Unsupervised deformable image registration using cycle-consistent cnn Scalable High-Performance Image Registration Framework by Unsupervised Deep Feature Representations Learning Unsupervised learning of hierarchical representations with convolutional deep belief networks Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain VoxelMorph: A Learning Framework for Deformable Medical Image Registration Weakly-supervised convolutional neural networks for multimodal image registration Label-driven weakly-supervised learning for multimodal deformable image registration A deep learning framework for unsupervised affine and deformable image registration End-to-end unsupervised deformable image registration with a convolutional neural network, Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support Adversarial learning for mono-or multi-modal registration Quicksilver: Fast predictive image registration -A deep learning approach Unpaired image-to-image translation using cycle-consistent adversarial networks Artificial intelligence for mammography and digital breast tomosynthesis: current concepts and future perspectives Learning to navigate for fine-grained classification Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations, Deep learning in medical image analysis and multimodal learning for clinical decision support A novel focal tversky loss function with improved attention u-net for lesion segmentation Boundary loss for highly unbalanced segmentation, International conference on medical imaging with deep learning SACNN: Self-attention convolutional neural network for lowdose CT denoising with self-supervised perceptual loss network Good semi-supervised learning that requires a bad GAN TOP-GAN: Stain-free cancer cell classification using deep learning with a small training set Differentiable Augmentation for Data-Efficient GAN Training Training Generative Adversarial Networks with Limited Data Supervised Contrastive Learning A theoretical analysis of contrastive unsupervised representation learning Self-supervised pretraining improves self-supervised pretraining FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence Randaugment: Practical automated data augmentation with a reduced search space Realistic evaluation of deep semisupervised learning algorithms Safe deep semi-supervised learning for unseenclass unlabeled data Deep Nets: What have They Ever Done for Vision? 
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The authors gratefully acknowledge the following research support: Grant P20GM135009 from the National Institute of General Medical Sciences, National Institutes of Health; Stephenson Cancer Center Team Grant funded by the National Cancer Institute Cancer Center Support Grant P30CA225520 awarded to the University of Oklahoma Stephenson Cancer Center.