Masked Image Modeling Advances 3D Medical Image Analysis
Zekai Chen, Devansh Agarwal, Kshitij Aggarwal, Wiem Safta, Mariann Micsinai Balan, Venkat Sethuraman, Kevin Brown
2022-04-25

Recently, masked image modeling (MIM) has gained considerable attention due to its capacity to learn from vast amounts of unlabeled data, and it has been demonstrated to be effective on a wide variety of vision tasks involving natural images. Meanwhile, the potential of self-supervised learning for modeling 3D medical images is anticipated to be immense, given the large quantities of unlabeled images and the expense and difficulty of obtaining quality labels. However, MIM's applicability to medical images remains uncertain. In this paper, we demonstrate that masked image modeling approaches can also advance 3D medical image analysis, in addition to natural images. We study how masked image modeling strategies improve performance, using 3D medical image segmentation as a representative downstream task: i) compared to naive contrastive learning, masked image modeling approaches accelerate the convergence of supervised training (up to 1.40×) and ultimately produce a higher dice score; ii) predicting raw voxel values with a high masking ratio and a relatively small patch size is a non-trivial self-supervised pretext task for medical image modeling; iii) a lightweight decoder or projection head design for reconstruction is powerful for masked image modeling on 3D medical images, which speeds up training and reduces cost; iv) finally, we also investigate the effectiveness of MIM methods under different practical scenarios in which different image resolutions and labeled data ratios are applied.

The demand for deep neural networks that conduct analysis tasks on 3D medical image data has expanded dramatically in recent years as a result of advances in deep learning and hardware compute capabilities. 3D volumetric medical images show great potential in healthcare, where they can help increase the speed and accuracy of diagnosing patient conditions. For instance, properly and swiftly discovering and measuring tumor lesions from MRI/CT scans is critical to disease prevention, early detection, and treatment plan optimization, and would also spur the development of more successful clinical applications that ultimately improve patients' lives [6]. However, the high expense of expert annotation frequently stymies attempts to translate deep learning advances into improved clinical outcomes. Annotations of 3D medical images at scale by radiologists are limited, expensive, and time-consuming to produce. Another barrier in 3D medical imaging is data volume, driven by the increased dimensionality and resolution of 3D images, which results in significant processing complexity. As a consequence, training deep learning models on 3D medical images from random initialization requires burdensome compute and data resources. As a viable alternative, self-supervised learning [26] obtains supervisory signals from the data itself, and has recently been shown to successfully address the appetite for data and to be capable of learning generalizable dense representations of the input.
Among contemporary approaches, masked signal modeling is one such learning task: masking a subset of input signals and attempting to predict the masked signals. This paradigm has been extremely successful in NLP, since self-supervised learning algorithms based on the masked language modeling task have largely revolutionized the discipline [7, 13, 39, 40], demonstrating that giant models such as BERT [13] and GPT [7, 39, 40] can be learned on unlabeled text data and adapted to a wide variety of applications. More importantly, with the introduction of Vision Transformers (ViT) [15, 50], the architecture gap, where it was not intuitive to apply mask tokens [13, 50] using convolutions [27], is no longer an obstacle. Following this philosophy, the latest approaches based on masked image modeling (MIM) have demonstrated their efficacy in the development of scalable vision models [2, 23, 55]. Despite these accomplishments, masked image modeling based algorithms have received little attention in medical imaging, and their applicability has not been thoroughly investigated. Naturally, we wonder whether masked image modeling can advance 3D medical imaging analysis as well. In this work, we aim to address this question through the following attempts:
• Contrastive learning [5, 10, 19] has been shown in a few studies to be capable of learning generic representations of medical images that benefit downstream tasks such as 3D segmentation and classification [1, 46, 47]. It is worthwhile to compare masked image modeling to contrastive learning approaches (see Fig. 1a for illustration) on medical images.
• Natural images are raw, low-level signals with a significant degree of spatial redundancy; restoring missing patches can be accomplished by directly copying surrounding patches with little high-level understanding of the objects and scenery [23]. Particularly for CT/MRI scans with solid tumors, the majority of background tissues are comparable, making it even more difficult for the model to learn useful features about the lesion regions. As a result, we assess several masking strategies (masked patch size and masking ratio) in order to determine the most efficient configuration, one that promotes holistic comprehension beyond low-level statistics while avoiding excessive attention to features such as texture and materials.
• In practice, medical image analysis is applied in a variety of contexts with varying amounts of annotated data, accessible unlabeled data, and even image resolutions. As a result, it is also vital to extensively analyze how these elements affect pretraining as well as performance on downstream tasks.
This paper investigates how masked image modeling based self-supervised learning can be utilized to improve 3D medical image analysis. It does so by conducting extensive experiments on two real-world benchmark datasets: multi-organ segmentation (BTCV, https://www.synapse.org/#!Synapse:syn3193805/wiki/89480) and brain tumor segmentation [44]. Our experimental results demonstrate that masked image modeling is advantageous for modeling 3D medical images, significantly speeding up training convergence (e.g., at most 1.4× training cost saving to reach the same dice score) and ultimately improving downstream performance (e.g., over 5% improvement on both segmentation tasks without any hyperparameter tuning).
Transfer Learning in Medical Image Analysis.
Transfer learning from natural images is extensively utilized in medical image analysis [31, 34], regardless of disparities in image statistics, scale, and task-relevant characteristics. Raghu et al. [41] and [1] showed that transfer learning from ImageNet can accelerate convergence on medical images, which is especially useful when medical training data are limited. Transfer learning using domain-specific data can also help resolve the domain disparity issue. For instance, [9, 29] report improved performance following pretraining on labeled data from the same domain. However, this strategy is frequently impractical for medical scenarios in which labeled data are costly and time-consuming to gather. Recent improvements in self-supervised learning offer a viable alternative, allowing for the utilization of unlabeled medical data, which is massive and commonly more accessible.
Masked Image Modeling.
Masked image modeling is a self-supervised learning method that learns representations from masking-corrupted images. It evolved in line with the MLM task in NLP but remained out of the mainstream for a long period. DAE [51] is a pioneering work in this domain, presenting masking as a noise type. The context encoder [37] predicts missing pixels by inpainting a large rectangular area of the source images. Recent techniques [4, 8, 15] based on Transformers [50] are motivated by the success of NLP. iGPT [8] groups pixel values into clusters and classifies unknown pixels. The ViT study [15] investigates masked patch prediction for self-supervised learning by predicting the mean color of images. BEiT [4] recently used a dVAE network to tokenize and forecast pixel values as discrete numbers [42, 49]. More importantly, MAE [23] adheres to the spirit of raw pixel restoration, demonstrating for the first time that masking a high proportion of the input image can yield a non-trivial and meaningful self-supervisory task. It adopts an autoencoder design with a lightweight decoder, which reduces training costs even further. SimMIM [55] takes this a step further and substitutes the entire decoder with a single linear projection layer, achieving comparable results. Very recent approaches such as data2vec [2] and CAE [11] make predictions in the latent representation space from the visible patches to the masked patches, attempting to make MIM a more universal framework for self-supervised learning. Nonetheless, the techniques described above have only been shown to be useful for natural image modeling. In this work, we aim to investigate whether MIM approaches can also advance 3D medical image analysis.
Self-Supervised Learning.
Early work in self-supervised learning focuses on learning representations from unlabeled data so that a low-capacity classifier can achieve high accuracy using these embeddings [14, 16, 35, 36, 52, 56]. For years, contrastive learning [5, 10, 19, 24, 48, 53] has received much interest as one of the most popular and widespread self-supervised learning strategies.
It models image similarity and dissimilarity (or solely similarity [12, 17]) between two or more views, with data augmentation being crucial for contrastive and related approaches. Self-supervised learning has also been applied in the medical field in several prior studies. Domain-specific pretext tasks [3, 45, 58, 59], for example, have been studied, while other work [25, 28, 30, 57] focuses on tailoring contrastive learning to medical data. Taleb et al. [46], in particular, examine a range of self-supervised learning strategies for 3D medical imaging in depth. MICLe [1] demonstrates that a model pretrained on ImageNet can also advance dermatology image classification. Tang et al. [47] further combine inpainting [37] with contrastive learning for medical segmentation. Despite the fact that all of these methods have shown promise in medical imaging, masked image modeling based methods have yet to be substantially investigated in this discipline.
Masked image modeling approaches, in general, mask out a portion of the input images or encoded image tokens and encourage the model to recreate the masked area. Many existing MIM models employ an encoder-decoder design followed by a projection head, such as BEiT [4] and MAE [23]. The encoder models latent feature representations, while the decoder maps the latent vectors back to the original images. The encoded or decoded embeddings are subsequently aligned with the original signals at the masked area by a projection head. Notably, the decoder component has been suggested to be designed in a lightweight manner in order to minimize training time. A lightweight decoder, in our experience, not only reduces computing complexity but also increases the encoder's ability to learn more generalizable representations that the decoder can easily grasp, translate, and convey. As a result, since the encoder is the more important component (only the encoder is inherited for finetuning), methods like SimMIM [55] simplify the architecture even further by replacing the entire decoder with a single projection layer. In this work, we thoroughly investigate the effectiveness of different MIM models on 3D medical imaging data. The following components provide more details.
Following ViT [15], an image is divided into regular non-overlapping patches (e.g., a 96×96×96 3D volume is divided into 216 patches of 16×16×16 smaller volumes), which are often considered the basic processing units of vision Transformers. Multiple random masking methods have been proposed in the previous literature: 1) InPainting [37] introduced a central region masking strategy; 2) BEiT [4] proposed a more complex block-wise masking strategy; 3) the most recent approaches such as MAE [23] and SimMIM [55] follow a more straightforward uniformly random masking method at the patch level while investigating different masked patch sizes and masking ratios (see Fig. 1b and Fig. 1c, respectively). Many random masking schemes are patch-based, since it is more convenient to operate masking on a patch-by-patch basis, where a patch is either fully visible or fully masked. As demonstrated by these works, uniformly random sampling with a high masking ratio effectively eliminates redundancy, resulting in a self-supervisory task that cannot be easily solved by extrapolation from visible neighboring patches. Meanwhile, a potential center bias (i.e., more masked patches near the image center) is avoided by the uniform distribution.
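As a concrete illustration, the following is a minimal sketch of this uniform random patch-level masking for a single 3D window, assuming a PyTorch-style implementation; function names such as make_patch_mask and apply_mask_to_volume are hypothetical and not taken from any released code.

```python
# Minimal sketch of uniform random patch masking for a 3D volume (PyTorch).
import torch


def make_patch_mask(grid_size=(6, 6, 6), mask_ratio=0.75, device="cpu"):
    """Uniformly sample which patches of a patch grid are masked.

    For a 96x96x96 window split into 16x16x16 patches, the grid is
    6x6x6 = 216 patches; with mask_ratio=0.75, 162 of them are hidden.
    Returns a boolean tensor of shape grid_size (True = masked).
    """
    num_patches = grid_size[0] * grid_size[1] * grid_size[2]
    num_masked = int(round(mask_ratio * num_patches))
    # Uniform random permutation of patch indices -> no center bias.
    perm = torch.randperm(num_patches, device=device)
    mask_flat = torch.zeros(num_patches, dtype=torch.bool, device=device)
    mask_flat[perm[:num_masked]] = True
    return mask_flat.view(*grid_size)


def apply_mask_to_volume(volume, patch_mask, patch_size=16):
    """Zero out masked patches of a (C, D, H, W) volume.

    A real MIM model would typically drop the masked patches or substitute
    a learnable mask token; zeroing is used here only to visualize the mask.
    """
    # Upsample the patch-level mask to voxel resolution.
    voxel_mask = patch_mask.repeat_interleave(patch_size, dim=0) \
                           .repeat_interleave(patch_size, dim=1) \
                           .repeat_interleave(patch_size, dim=2)
    return volume * (~voxel_mask).unsqueeze(0), voxel_mask


if __name__ == "__main__":
    vol = torch.randn(1, 96, 96, 96)           # single-channel CT window
    mask = make_patch_mask()                   # 6x6x6 boolean patch mask
    masked_vol, voxel_mask = apply_mask_to_volume(vol, mask)
    print(mask.float().mean().item())          # ~0.75 of patches are masked
```

In an actual MIM model, the masked patches would be removed from the encoder input (MAE-style) or replaced by a learnable mask token (SimMIM-style) rather than simply zeroed out.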
Finally, the sparse input allows for the development of an efficient encoder, which will be discussed next. In this work, we also use the random patch masking approach for simplicity and efficacy. Encoders are responsible for modeling latent feature representations of the masked input, which are then utilized to predict the original signals in the masked area. The learned encoder should be capable of adapting to a wide range of vision tasks. We consider a variety of architectures in this paper, including two fundamental vision Transformer architectures, the vanilla ViT [15, 50] and SwinTransformer [32], as well as the visual attention network VAN [18], which inherits the attention mechanism to derive hierarchical representations similar to SwinTransformer but using pure convolutions. All models are reimplemented as 3D versions in order to accommodate 3D volume data. We simply refer to these models as ViT3D, SwinTransformer3D, and VAN3D.
For methods that follow an autoencoder design to reconstruct the image, the decoder takes the entire collection of encoded tokens, including 1) encoded visible patches and 2) mask tokens. Each randomly initialized mask token is a learnable vector that is jointly optimized to reveal the masked patches. Absolute positional embeddings [50] or relative positional embeddings [32] are also applied to these mask tokens, depending on the backbone architecture. Additionally, all the masked patches are invisible to the encoder, and only the decoder sees all tokens. As shown in [23], this saves computation and memory while not interfering with training. Meanwhile, the decoder backbones are independent of the encoder backbones, which are likewise optional (see Fig. 1b). By default, we follow [23] and use another series of Transformer blocks for decoding.
Raw voxel value prediction. For 3D medical images, reconstructing the inputs by estimating the raw voxel values for each mask token is simple and intuitive. The distance between recovered and original images in voxel space can be computed using either an l1 or an l2 loss. Furthermore, the loss is only computed on masked patches, preventing the model from engaging in self-reconstruction, which might potentially dominate the learning process and ultimately impede knowledge learning. Notably, most vision Transformer topologies downsample the original image resolution. For 3D medical images, a 96³ volume resolution will be downsampled to 9³ (i.e., 1×9×9×9 ≈ 768 using ViT-Base) and to 3³ using SwinTransformer or VAN. Therefore, for the vanilla ViT, we apply a single linear projection layer to transform the latent embeddings back to the original voxel space; for SwinTransformer and VAN, we apply a two-layer transposed convolution to upsample the compressed embeddings to the original resolution. See Fig. 2 and Fig. 3 for the reconstruction of 3D lung CT scans from TCIA-COVID19 using SimMIM [55] and MAE [23], respectively: in Fig. 2, a ViT-Base backbone is applied for the encoder, the masked patch size is 16 (for all dimensions), and the masking ratio is 75%, following [55]; in Fig. 3, a ViT-Large is applied as the encoder backbone with the same masked patch size and masking ratio, following [23].
Other predictions. Many earlier studies transform masked signals to clusters or classes rather than raw pixel values. For example, iGPT [8] uses k-means to divide the RGB values into 512 clusters and encourages the model to predict which cluster each pixel belongs to. BEiT [4] employs a discrete VAE (dVAE) to convert image patches to discrete tokens; the prediction objective is then based on the token identity. Medical images, on the other hand, are often sparse, and voxel values are not scale-intensive.
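To make the raw voxel value prediction above concrete, the following is a minimal sketch of a masked-patch l1 reconstruction loss with a single linear projection head; it assumes a PyTorch-style setting, uses a toy stand-in for the encoder, and all names are illustrative rather than taken from the paper's code.

```python
# Sketch of the raw-voxel reconstruction objective, computed only on masked patches.
import torch
import torch.nn as nn

patch, grid, dim = 16, 6, 768                  # 16^3 patches on a 96^3 window
num_patches = grid ** 3                        # 216 tokens per window
voxels_per_patch = patch ** 3                  # 4096 voxels per patch

encoder = nn.Sequential(                       # stand-in for a 3D ViT encoder
    nn.Linear(voxels_per_patch, dim), nn.GELU(), nn.Linear(dim, dim))
to_voxels = nn.Linear(dim, voxels_per_patch)   # single linear projection head
mask_token = nn.Parameter(torch.zeros(1, 1, voxels_per_patch))


def mim_loss(volume_patches, mask):
    """volume_patches: (B, num_patches, voxels_per_patch) flattened patches.
    mask: (B, num_patches) boolean, True = masked patch."""
    # SimMIM-style: masked patches are replaced by a learnable token and the
    # full sequence is encoded (MAE would instead drop them before encoding).
    corrupted = torch.where(mask.unsqueeze(-1), mask_token, volume_patches)
    pred = to_voxels(encoder(corrupted))
    # l1 distance in voxel space, averaged over masked patches only.
    per_patch = (pred - volume_patches).abs().mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


x = torch.randn(2, num_patches, voxels_per_patch)
m = torch.rand(2, num_patches) < 0.75          # ~75% of patches masked
print(mim_loss(x, m).item())
```

An MAE-style variant would feed only the visible patches to the encoder and let a lightweight Transformer decoder handle the mask tokens, as described above.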
The fine-grained texture or material information may be lost by replacing the original signals with a discrete class target. As a result, we concentrate on predicting raw voxel values in this work for the sake of simplicity and robustness.
We evaluate masked image modeling methods on two separate 3D segmentation tasks that involve both CT and MRI imaging modalities.
BTCV/CT. This dataset consists of abdominal CT scans with multi-organ annotations produced at Vanderbilt University Medical Center under the supervision of clinical radiologists. Each CT scan was performed in the portal venous phase with contrast enhancement and comprises 80 to 225 slices of 512×512 pixels with a slice thickness of 1 to 6 mm. Each volume was pre-processed separately, with intensities in the range of [-175, 200] HU normalized to [0, 1]. During pre-processing, all images are resampled to 1.5 or 2.0 mm isotropic voxel spacing (different resolutions for the ablation study). The multi-organ segmentation problem is formulated as a 13-class segmentation task with 1-channel input. We use the first 24 volumes for training and report on 6 validation images.
BraTS/MRI. Brain tumor segmentation is another important task. The BraTS dataset [44] contains a training set of 387 multi-modal, multi-site MRI scans (FLAIR, T1w, T1gd, T2w) with ground-truth glioma segmentation labels (necrotic/active tumor and edema). The voxel spacing of MRI images in this task is 1.0×1.0×1.0 mm³. The voxel intensities are pre-processed with z-score normalization. The problem of brain tumor segmentation is formulated as a 3-class segmentation task with 4-channel input. We report on 97 validation images.
TCIA-COVID19/CT. This is a public dataset [20] consisting of unenhanced chest CTs of patients with COVID19 infections. There are 771 volumes collected from 661 patients in total. All images are unannotated. We utilize this dataset as an extra unlabeled dataset for self-supervised learning in the ablation study. All models in Tab. 1 are pretrained using a combination of this dataset and BTCV. In the ablation study, we also compare performance when pretraining with and without this dataset.
Table 1. Main results on the multi-organ segmentation task. All models are pretrained on a combination of the BTCV and TCIA-COVID19 [20] datasets. The BTCV validation set is utilized for validation consistently.
Supervised Baselines. UNETR [22] is a U-shaped encoder-decoder architecture for medical segmentation that employs a ViT as the encoder backbone and a convolutional upsampling decoder following the U-Net [43] design. It is one of the SOTA models in the domain of medical image segmentation that incorporates vision Transformers as the backbone. UNETR-Base indicates that a ViT-Base [15] is applied as the encoder backbone. We adopt UNETR-B as the default supervised baseline in our ablation study. For other backbones (SwinTransformer and VAN) that produce hierarchical features, we adopt UPerNet [54] as the decode head by default for downstream segmentation.
We use the Dice score to evaluate the accuracy of segmentation in our experiments. For a given semantic class, let G_i and P_i denote the ground truth and prediction values for voxel i. The Dice score is defined as Dice = 2 Σ_i (G_i P_i) / (Σ_i G_i + Σ_i P_i).
Table 2. Main results on brain tumor segmentation. All models are pretrained on the BraTS [44] training set without any extra data source.
Training for multi-organ segmentation is conducted on a single NVIDIA A10G instance for a total of 3000 epochs. For brain tumor segmentation, the batch size is set to 8 and training is conducted on 4 NVIDIA A10G GPUs for a total of 1000 epochs.
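As a concrete reference for the evaluation metric used throughout the experiments, the following is a minimal sketch of the per-class Dice computation defined above, assuming integer label maps; the smoothing term and the exclusion of the background class are implementation assumptions, not details taken from the paper.

```python
# Minimal sketch of the per-class Dice score (PyTorch).
import torch


def dice_score(pred_labels, gt_labels, num_classes, eps=1e-6):
    """Compute Dice = 2*sum(G_i*P_i) / (sum(G_i) + sum(P_i)) per class.

    pred_labels, gt_labels: integer tensors of shape (D, H, W).
    Returns a list with one Dice value per foreground class.
    """
    scores = []
    for c in range(1, num_classes):            # skip background class 0
        p = (pred_labels == c).float()
        g = (gt_labels == c).float()
        intersection = (p * g).sum()
        denom = p.sum() + g.sum()
        scores.append(((2.0 * intersection + eps) / (denom + eps)).item())
    return scores


if __name__ == "__main__":
    pred = torch.randint(0, 14, (96, 96, 96))  # e.g. 13 organ classes + background
    gt = torch.randint(0, 14, (96, 96, 96))
    per_class = dice_score(pred, gt, num_classes=14)
    print(sum(per_class) / len(per_class))     # average Dice over classes
```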
For brain tumor segmentation, we use a 100-epoch linear warmup, and the optimizer settings otherwise match those used for organ segmentation. Additional details are provided in the Appendix (Sec. 6).
We begin by evaluating 1) how masked image modeling methods compare to contrastive learning approaches and 2) how different masked image modeling approaches perform in comparison to one another, using MAE [23] and SimMIM [55] alongside a conventional contrastive learning method, SimCLR [10]. We evaluate a range of encoder backbones with varying network sizes, including the pure vision Transformer [15], SwinTransformer [32], and the visual attention network (VAN) [18]. For MAE, we use an 8-layer Transformer decoder with a width of 512; for SimMIM, we use a single linear layer as the projection head. We use a two-layer transposed convolution as the projection head for pretraining and UPerNet [54] for segmentation in both SwinTransformer3D and VAN3D. All other hyperparameters were set identically in this investigation. Additionally, because the full 3D image volume is typically difficult to load directly into the GPU (memory explosion), we employ a sliding window training strategy [21, 22, 38] in which the original image is divided into several small (96×96×96) 3D windows. For all ViTs, a patch size of 16 is utilized by default.
Tab. 1 demonstrates that masked image modeling approaches outperform contrastive learning methods in general, as both MAE [23] and SimMIM [55] achieve an average dice score of around 0.752∼0.758, while SimCLR achieves an average dice score of around 0.723, which is 4.5% lower than the best approach. The segmentation results for BraTS in Tab. 2 follow a similar pattern. The average dice score for masked image modeling approaches is somewhat greater than 0.80, whereas SimCLR [10] obtains a dice score of 0.7739, which is 4.37% lower than the best approach, a gap comparable to that in Tab. 1. Another note is that, despite the similarity of the two MIM techniques, SimMIM [55] achieves slightly better performance than MAE [23], as demonstrated by both Tab. 1 and Tab. 2. One explanation for this phenomenon is that a capable decoder (even a lightweight one) may be able to reconstruct the original image even if the encoder does not acquire generalizable representations, which in turn weakens the incentive for the encoder to learn more effective representations. The ultimate goal of self-supervised learning is always to learn effective and generalizable representations of the data, rather than merely to converge on the pretext task. In comparison, SimMIM [55] employs an even lighter design by omitting the decoder entirely, which pushes the encoder to take on more of the reconstruction and learning work. Additionally, masked image modeling approaches dramatically increase training speed and reduce cost, as seen in Fig. 4. SimMIM based architectures obtain a 1.76× higher dice score at the 1.3k training step. Moreover, MIM based approaches can reach a dice score of 0.7 with 1.4× less training time than the supervised baseline requires.
Additionally, we investigate the effect of different masked patch sizes and masking ratios on self-supervised learning performance. The performance of several MIM techniques after finetuning for segmentation is summarized in Tab. 3 and Tab. 4. i) Consistent with the original MAE literature [23], we conclude that a higher masking ratio yields a non-trivial self-supervised learning task that continually drives the model to build generalizable representations that transfer effectively to downstream tasks.
For example, the best dice scores on the multi-organ and brain tumor segmentation tasks are obtained with a masking ratio of 0.75 across multiple patch sizes (e.g., 0.7183 for patch size 16 in Tab. 3, and 0.8041 for patch sizes 24 and 32 in Tab. 4). ii) A high masking ratio combined with a small patch size likewise results in relatively good performance for SimMIM [55], similar to MAE [23]. As demonstrated by Tab. 3 and Tab. 4, when the patch size is 16, the models perform best, with dice scores of 0.7249 and 0.8077, respectively. iii) However, as the patch size increases, the SimMIM [55] method appears to become less sensitive to the masking ratio. For instance, when the patch size is 32, models achieve the highest dice score with a masking ratio of 0.15, the smallest masking ratio considered. One hypothesis is that medical images are typically raw, low-level signals with a large degree of spatial redundancy; recovering missing patches can be accomplished by directly copying nearby patches with little comprehensive knowledge of the objects and surroundings. A single small masked patch cannot adequately mask complicated and intersecting structures or regions, whereas a large patch may hide more significant signals on its own. As a result, a high masking ratio is more critical for small patch sizes than for large patch sizes.
In this section, we analyze the results to address the following three questions: i) Does increasing the amount of pretraining data improve downstream performance? ii) How do different pretraining resolutions affect downstream knowledge transfer? And iii) how do masked image modeling approaches improve performance when varying amounts of labeled data are used? All pretraining in Tab. 5 is based on the MAE [23] architecture, which utilizes a ViT-Base/16 backbone with a masking ratio of 75%, as motivated by Tab. 3 and Tab. 4. Different labeled ratios indicate that we employ a varying percentage of annotated BTCV CT scans (e.g., 50% = 12 images, 100% = 24 images) for downstream finetuning, whereas the validation set of 6 images is kept fixed.
In the majority of supervised learning cases, more training data results in improved performance. Given that the majority of medical images share similar low-level structure, we ask whether this holds true for self-supervised learning, and in particular, how much benefit can be gained from the size of the pretraining data when utilizing MIM for 3D medical analysis. We adopt multi-organ segmentation as the example downstream task and create two distinct training scenarios: one that uses both the COVID19 and BTCV datasets and another that uses only BTCV. Tab. 5 demonstrates the consistent tendency that models pretrained on more plentiful data outperform models pretrained on less data (e.g., 0.7183→0.7534, a 4.9% improvement; 0.7018→0.7338, a 4.6% improvement). This advantage is even more pronounced at lower image resolutions, as 0.6919 is 5.6% higher than 0.6552 when only half of the labeled data is used. In Tab. 5, we also explore how different pretraining image resolutions affect downstream task performance. Intuitively, a higher pretraining resolution should result in better segmentation results [1], as the images contain more granular information. Here, we use different downsampling ratios to represent the degree to which the original signals are compressed along all dimensions of each volume.
Specifically, a bilinear interpolation function is used in conjunction with MONAI's Spacingd transform. As can be observed from Tab. 5, pretrained models with higher resolutions (1.5×1.5×2.0 mm voxel spacing) generally perform better than pretrained models with lower resolutions (2.0×2.0×2.0 mm). For instance, a dice score of 0.7338 is 2.7% lower than that of a model pretrained on the same data source and labeled ratio but at a higher resolution. In practical situations, the majority of medical images, such as CT/MRI scans, are left unannotated due to the high cost of labeling, while public unlabeled data is abundant and freely available. The aforementioned results illustrate once again that pretraining on large datasets followed by finetuning on small labeled samples is feasible. They also demonstrate that masked image modeling can significantly improve downstream task performance in a variety of contexts.
This paper demonstrates how masked image modeling approaches for self-supervised learning benefit 3D medical image modeling, through extensive experiments on two sample segmentation tasks. We show that masked image modeling outperforms traditional contrastive learning by speeding up convergence and greatly improving downstream task performance. We also show how masked image modeling approaches can be utilized to advance 3D medical image modeling in a variety of situations. However, the fact that almost all medical images are weakly labeled (e.g., with as little as a few lines of descriptive text) rather than entirely unannotated is an open question we would like to investigate further in the future. We are then interested in comparing self-supervised learning to supervised learning with limited supervisory signals. Finally, we remain curious to see how self-supervised learning can be integrated into other, more challenging downstream tasks.
6. Appendix
As can be seen from the reconstructed volumes, the large model has more restoration power than the small model, which supports the previous conclusion. The flattened dimensionality of 3D medical images is frequently very high, and a small model would unavoidably compress the original voxel space into a much smaller space, thereby losing a lot of information.
config: value
optimizer: AdamW [33]
base learning rate: 3e-4
weight decay: 0.005
optimizer momentum: beta1, beta2 = 0.9, 0.999
batch size: 4
learning rate schedule: linear warmup, cosine annealing
warmup epochs: 300
total epochs: 3000
augmentation: RangeScaleIntensity
Table 9. Pretraining setting on MRI 3D volumes. In this case, a ViT-Base backbone is applied for the encoder, the masked patch size is 16 (for all dimensions), and the masking ratio is 75%, following [55].
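For illustration, the pretraining optimization settings listed above can be realized with a standard PyTorch optimizer and scheduler composition along the following lines; this is an assumed sketch (the warmup/cosine composition and the placeholder model are not taken from the paper's code).

```python
# Sketch of the AdamW + linear-warmup + cosine-annealing setup from the table above.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(768, 768)              # placeholder for the MIM model

warmup_epochs, total_epochs = 300, 3000
optimizer = AdamW(model.parameters(), lr=3e-4,
                  betas=(0.9, 0.999), weight_decay=0.005)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... one pretraining epoch: mask patches, reconstruct, backpropagate ...
    scheduler.step()                            # per-epoch schedule update
```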
References
Vivek Natarajan, and Mohammad Norouzi. Big self-supervised models advance medical image classification
data2vec: A general framework for self-supervised learning in speech, vision and language
Self-supervised learning for cardiac mr image segmentation by anatomical position prediction. MICCAI
Beit: Bert pre-training of image transformers
Self-organizing neural network that discovers surfaces in random-dot stereograms
Predicting cancer outcomes with radiomics and artificial intelligence in radiology
Language models are few-shot learners
Generative pretraining from pixels
Med3d: Transfer learning for 3d medical image analysis. ArXiv
A simple framework for contrastive learning of visual representations
Context autoencoder for self-supervised representation learning
Exploring simple siamese representation learning
Bert: Pre-training of deep bidirectional transformers for language understanding
Unsupervised visual representation learning by context prediction
Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR
Unsupervised representation learning by predicting image rotations. ICLR
Bootstrap your own latent: A new approach to self-supervised learning
Dimensionality reduction by learning an invariant mapping
Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets
Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images
Unetr: Transformers for 3d medical image segmentation. WACV
Masked autoencoders are scalable vision learners. ArXiv
Momentum contrast for unsupervised visual representation learning
Sample-efficient deep learning for covid-19 diagnosis based on ct scans. medRxiv
A survey on contrastive self-supervised learning
Backpropagation applied to handwritten zip code recognition
Imbalance-aware self-supervised learning for 3d radiomic representations
A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Computer methods and programs in biomedicine
Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision
A deep learning system for differential diagnosis of skin diseases
Swin transformer: Hierarchical vision transformer using shifted windows
Decoupled weight decay regularization
International evaluation of an ai system for breast cancer screening
Unsupervised learning of visual representations by solving jigsaw puzzles
Learning features by watching objects move
Context encoders: Feature learning by inpainting
A volumetric transformer for accurate 3d tumor segmentation
Improving language understanding by generative pre-training
Language models are unsupervised multitask learners
Transfusion: Understanding transfer learning for medical imaging
Zero-shot text-to-image generation
U-net: Convolutional networks for biomedical image segmentation
A large annotated medical image dataset for the development and evaluation of segmentation algorithms
Improving cytoarchitectonic segmentation of human brain areas with self-supervised siamese networks
3d self-supervised methods for medical imaging
Self-supervised pre-training of swin transformers for 3d medical image analysis
Representation learning with contrastive predictive coding
Neural discrete representation learning
Attention is all you need
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
Unsupervised learning of visual representations using videos
Unsupervised feature learning via non-parametric instance discrimination
Unified perceptual parsing for scene understanding
Simmim: A simple framework for masked image modeling. ArXiv, abs/2111.09886
Colorful image colorization
Comparing to learn: Surpassing imagenet pretraining on radiographs by comparing image representations
Rubik's cube+: A self-supervised feature learning framework for 3d medical image analysis
Self-supervised feature learning for 3d medical images by playing a rubik's cube. MICCAI