key: cord-0494238-vud3e82h authors: Ciga, Ozan; Xu, Tony; Martel, Anne L. title: Resource and data efficient self supervised learning date: 2021-09-03 journal: nan DOI: nan sha: 9d6a7306dd239230b8451d0c760397c1bbb160e1 doc_id: 494238 cord_uid: vud3e82h

* These authors contributed equally.

We investigate the utility of pretraining by contrastive self-supervised learning on both natural-scene and medical imaging datasets when the unlabeled dataset size is small, or when the diversity within the unlabeled set does not lead to better representations. We use a two-step approach which is analogous to supervised training with ImageNet initialization, where we pretrain networks that are already pretrained on the ImageNet dataset to improve downstream task performance on the domain of interest. To improve the speed of convergence and the overall performance, we propose weight scaling and filter selection methods prior to the second step of pretraining. We demonstrate the utility of this approach on three popular contrastive techniques, namely SimCLR, SwaV, and BYOL. Benefits of double pretraining include better performance, faster convergence, and the ability to train with smaller batch sizes and smaller image dimensions with negligible differences in performance. We hope our work helps democratize self-supervision by enabling researchers to fine-tune models without requiring large clusters or long training times.

The ability to infer meaning and identify patterns from unstructured data, or unsupervised learning, has been a goal of machine learning researchers that predates the advances in deep learning (Xu and Wunsch, 2005). A generalized, data-independent framework for learning features and patterns from visual input became possible with the advent of self-supervised techniques. Early examples of these techniques commonly employ a supervised learning objective and generate the supervision signal from the raw input. More recently, self-supervised methods based on contrastive learning have consistently outperformed their predecessors in various vision tasks and achieved state-of-the-art results on the popular ImageNet classification benchmark (Tian et al., 2019; He et al., 2020; Chen et al., 2020a). These methods are exhaustively pretrained on large amounts of data and then deployed for downstream tasks, similar to transfer learning based on supervised training (e.g., ImageNet pretraining). While current techniques are not overly complex in nature, they require hardware that is not readily available to most researchers and practitioners, such as multiple GPUs or TPUs, for large-batch training over a large number of epochs. Furthermore, in most domains, acquiring large amounts of unlabeled data is not straightforward and may be subject to regulation (e.g., medical images) (Srinidhi et al., 2020). Finally, the quality of learned representations depends on identifying visual differences between contrasted samples. This may be challenging in complex tasks where most images do not exhibit high diversity, such as cancer detection from large-resolution biopsy images, where a single tumor cell may change the decision outcome. In such cases, learning directly from raw data generally encodes noisy features which are not superior to a randomly initialized network in downstream tasks (Ciga et al., 2020). This paper proposes a simple transfer learning approach based on pretraining an already pretrained network on a new domain to improve performance on downstream tasks.
We demonstrate the utility of this approach on three contrastive self-supervised methods. Using each method, we first pretrain a Resnet50 using natural-scene images from the ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015). The learned representations are then fine-tuned by a second stage of pretraining on data from a second domain. In this work, we select the second-stage data according to the downstream task. For example, if the task is to identify different brands of cars from images, the second-stage dataset is a set of unlabeled car images. We verify the efficacy of this procedure compared to both self-supervised and ImageNet initializations in multiple domains, regardless of the similarity between the first- and second-stage datasets (e.g., natural-scene images to medical images). This approach improves downstream task performance even when there is limited data for pretraining, allows for shorter training times, achieves better downstream task performance than pretraining from scratch for the same training budget, and works with smaller batch sizes and smaller images, mitigating the hardware requirements.

Self-supervised learning for images has become more popular in recent years due to its promise of alleviating the requirements for labeled data. Its aim is to learn, without human annotations, latent-space representations that can be used in downstream supervised tasks such as classification. Early context-based methods exploit the spatial regularity in images by altering an image in whole or in part and training a network to recover the proper arrangement, alignment, or orientation from the modified image (Doersch et al., 2015; Noroozi and Favaro, 2016; Gidaris et al., 2018). While these techniques are superior to their contemporary unsupervised counterparts, they are based on handcrafted pretext tasks, which can bias and limit the learning. Contrastive self-supervised methods replace such heuristically determined pretext tasks by comparing multiple images to each other and assigning each pair to so-called positive and negative classes. These methods treat two augmented versions of the same image as a positive pair, whereas any pair of distinct samples is considered negative. The learning objective is then to bring the latent-space representations of positive pairs closer in some metric space and to push the negative pairs' representations apart. Negative instances can be stored in a dynamic memory bank to avoid recomputing feature vectors for each training instance (Wu et al., 2018; Bachman et al., 2019; He et al., 2020; Chen et al., 2020b). Augmentations such as scaling, affine transformations, or adjusting the color properties of images are widely used in contrastive learning to avoid trivial solutions and to improve the robustness of learned features (Tian et al., 2019; Henaff, 2020; Misra and Maaten, 2020; Chen et al., 2020a). A recent class of methods relies on large-minibatch training, where each sample is a negative for the other samples in the batch except for its own augmented view. These methods apply different architectural and design choices, such as temperature-based loss functions, exponential moving averages of the model weights, and comparing prototype vectors instead of raw latent representations, to improve performance on downstream tasks (Chen et al., 2020a; Grill et al., 2020; Caron et al., 2020).
Transfer learning, the ability to utilize information encoded in a network to improve performance on another task (Pan and Yang, 2009; Weiss et al., 2016), is widely used in a variety of machine learning tasks, including computer vision (He et al., 2016), natural language processing (Radford et al., 2018; Devlin et al., 2018), and speech recognition (Kunze et al., 2017). In a broad sense, domain adaptation, where learning is transferred from a source domain that shares the label space of the target, can be considered a form of transfer learning (Ganin and Lempitsky, 2015; Shu et al., 2018). In addition to architectural, conceptual, and methodological modifications (Shimodaira, 2000; Ganin and Lempitsky, 2015; Maicas et al., 2018; Saito et al., 2017; Shu et al., 2018), weight regularization has also been employed to achieve transfer learning. Approaches such as hypothesis transfer learning (Kuzborskij and Orabona, 2013, 2017) and multi-model knowledge transfer (Tommasi et al., 2013) aim to transfer knowledge in a fast and stable manner by limiting weight drift with a penalty term on the difference between the target and source domain weights, or by introducing a penalty term that resembles a moving average of the weights (e.g., a term based on αβ + (1 − α)β̂, where β is the target weights, β̂ is the source weights, and α is a nonzero momentum parameter). Recently, Takada and Fujisawa (2020) showed that ℓ1 regularization can be used for transfer learning, simultaneously enforcing sparsity while dampening all the weights.

Figure 1: Comparison of filter weight distributions for different methods of pretraining. "conv" weights are summarized by taking the Frobenius norm for each layer, whereas the other quantities represent raw numbers. A Resnet50 trained with supervision on the ImageNet dataset (Sup) has only 0.5% of its filters with a Frobenius norm above 1, whereas it is 96% for both SimCLR and SwaV, and 68% for BYOL.

In the context of current machine learning techniques, a network can encode useful information by pretraining, which can be either supervised or unsupervised. The resulting pretrained network is then used as initialization for further training in downstream tasks. Recently, Gururangan et al. (2020) proposed a two-stage pretraining approach for natural language processing tasks. In the first stage, a general language model is trained, which is fine-tuned in a subsequent stage for domain-specific tasks. The authors showed that this two-tiered approach can be used for transfer learning in downstream tasks where the domain of interest is similar to, or the same as, the data used for the second stage of training.

Summary Our method is analogous to supervised training with ImageNet initialization, where we use a Resnet50 pretrained on images from ILSVRC-2012 as initialization. We use the same pretraining method used to obtain the initialization weights and pretrain a second time with images from the domain of interest. We found that this straightforward approach sometimes requires a longer second stage of pretraining, and we propose an adjustment to the originally pretrained weights, as described below. The modifications described below do not lead to any performance gains for pretraining from scratch, as most networks initialized with Xavier (Glorot and Bengio, 2010) or He initialization (He et al., 2015) by default eschew problems such as dead filters or exploding weights.
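For illustration, the summary behind Figure 1 can be reproduced with a few lines of PyTorch. The sketch below computes the Frobenius norm of every convolutional filter of a Resnet50 and reports the fraction above 1; the torchvision ImageNet-supervised weights stand in for "Sup", and a self-supervised checkpoint (SimCLR, SwaV, or BYOL) would be loaded into the same architecture for comparison. The per-filter (rather than per-layer) summary is our reading of the figure, and the snippet is a sketch rather than the authors' analysis code.

```python
# Sketch of the Figure 1 diagnostic: summarize each convolutional filter of a
# Resnet50 by its Frobenius norm and report how many exceed 1. The torchvision
# ImageNet-supervised weights are used here as a stand-in for "Sup".
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V1")

filter_norms = []
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        # One norm per output filter: flatten each (in_ch, kH, kW) kernel.
        w = module.weight.detach().flatten(start_dim=1)
        filter_norms.append(w.norm(dim=1))
filter_norms = torch.cat(filter_norms)

frac_large = (filter_norms > 1.0).float().mean().item()
print(f"{filter_norms.numel()} filters, {100 * frac_large:.1f}% with Frobenius norm > 1")
```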
Exploding weights We compare the convolutional filter weights and the batch normalization parameters and statistics of supervised and self-supervised networks trained on ImageNet. We find that the magnitudes of the parameters of the supervised network are significantly smaller (Fig. 1), likely due to the smaller weight decay employed in contrastive techniques compared to supervised training (Chen et al., 2020a; Caron et al., 2020). We argue this may lead to optimization issues for further pretraining, and we therefore scale the weights and batch normalization parameters by the Frobenius norm of each corresponding layer if the norm is greater than 1. We illustrate the scaling operation using the batch normalization equation. Assume the input x is convolved using the function f_c(·) at layer i; the batch normalization output is then (f_c(x) − μ_rm) / √(σ²_rv + ε) · γ + β, where μ_rm and σ²_rv are the running mean and variance, γ and β are the affine parameters for batch normalization at i, and ε is a small number for numerical stability. Prior to the second pretraining step, we calculate the Frobenius (matrix) norm s for the layer. We scale each component by a function of s, which modifies the previous equation so that the output of batch normalization is unchanged while the numerical values of each parameter are reduced by a factor of √s or s². We use the square root to be able to scale the convolutional filters, the running mean, and the affine batch normalization parameters simultaneously. We scale the convolutional filter weights instead of x (the input), as the convolution is linear. Prior to using Frobenius norm scaling, we also experimented with a single universal scaling term that was used to scale all weights by the same amount; however, this did not improve the two-step pretraining. Similarly, using a larger weight decay to dampen the weights also led to suboptimal performance. We found that scaling led to faster convergence: when we pretrained for only 100 epochs at the second stage, scaling yielded improvements of 0.7% to 5.5% over straightforward training without weight scaling in our validation experiments. The difference between scaled and non-scaled initializations diminished as the networks were pretrained for more epochs, reaching virtually the same validation accuracy (see Fig. 2).

Dead filters We also found that some convolutional filters at each layer of the Resnet50 network have Frobenius norms below 0.1 (∼1%, 0.2%, and 1% for SimCLR, SwaV, and BYOL, respectively), which leads to a near-zero response for those filters. Similar to the dying ReLU problem, this may cause the gradient updates to be zero for those filters, leading to underuse of the network's capacity. Replacing these filters with randomly initialized filters ameliorates the problem; however, we heuristically found that replacing each dead filter with a randomly chosen filter from the same layer whose Frobenius norm is above 0.1 works better. The experiments comparing different filter selection schemes for the SimCLR pretraining method, validated on five datasets (see Section 4), are shown in Table 1.
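To make the two adjustments concrete, a minimal sketch is given below under our reading of the description above: the convolutional weights, running mean, and affine γ are divided by √s and the running variance by s², which leaves the batch normalization output unchanged up to the ε term, and dead filters are overwritten with copies of randomly chosen healthy filters from the same layer. The conv-to-BN pairing relies on the standard torchvision Resnet50 naming; this is an illustration, not the authors' released code.

```python
# One possible implementation of the adjustments described above; the scaling
# scheme (divide conv weights, running mean, and gamma by sqrt(s), and the
# running variance by s^2) is our reading of the text and preserves the BN
# output up to the epsilon term. Not the authors' released code.
import math
import torch
import torchvision


def paired_conv_bn(model):
    """Yield (conv, bn) pairs following the standard torchvision ResNet naming."""
    modules = dict(model.named_modules())
    for name, module in modules.items():
        if isinstance(module, torch.nn.Conv2d):
            # conv1 -> bn1, conv2 -> bn2, ..., downsample.0 -> downsample.1
            bn_name = name[:-1] + "1" if name.endswith("downsample.0") else name.replace("conv", "bn")
            bn = modules.get(bn_name)
            if isinstance(bn, torch.nn.BatchNorm2d):
                yield module, bn


@torch.no_grad()
def rescale_weights(model):
    """Shrink parameter magnitudes while (approximately) preserving the BN output."""
    for conv, bn in paired_conv_bn(model):
        s = conv.weight.norm().item()  # Frobenius norm of the whole conv layer
        if s <= 1.0:
            continue
        root = math.sqrt(s)
        conv.weight.div_(root)
        bn.running_mean.div_(root)
        bn.weight.div_(root)        # affine gamma
        bn.running_var.div_(s * s)  # keeps gamma / sqrt(var) at its original scale


@torch.no_grad()
def replace_dead_filters(model, threshold=0.1):
    """Overwrite near-zero filters with copies of random healthy filters from the same layer."""
    for module in model.modules():
        if not isinstance(module, torch.nn.Conv2d):
            continue
        norms = module.weight.flatten(start_dim=1).norm(dim=1)
        dead = (norms < threshold).nonzero(as_tuple=True)[0]
        healthy = (norms >= threshold).nonzero(as_tuple=True)[0]
        if len(dead) == 0 or len(healthy) == 0:
            continue
        donors = healthy[torch.randint(len(healthy), (len(dead),))]
        module.weight[dead] = module.weight[donors].clone()


# Example: adjust a (randomly initialized) Resnet50 before second-stage pretraining.
model = torchvision.models.resnet50(weights=None)
rescale_weights(model)
replace_dead_filters(model)
```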
In this section, we give a brief overview of the three contrastive self-supervised methods that are used for the two-step pretraining. As the methods share similarities, we use a common notation and define terms only once unless significant differences exist between methods.

SimCLR Chen et al. (2020a) proposes a technique to maximize the agreement between representations of two stochastically augmented views of the same image. Initially, the image i is given two views using the stochastic augmentation function f_aug(·). The set of augmentations includes rotations, flips, color jittering, and cropping an image and resizing it to its original size. The encoder network f_θ(·) (a Resnet50 in this work) with parameters θ and an auxiliary projection layer p_θ̂(·) with parameters θ̂ are used to obtain the projected representations z_m, z_n = p_θ̂(f_θ(f_aug(i))), where z_m and z_n are different vectors (∈ R^128) due to the stochastic augmentation function f_aug(·). The difference between the ℓ2-normalized feature representations of the two views of an image (z_m and z_n) is minimized, while the difference to the representations of the other images in the same training batch is maximized, through a contrastive loss function called NT-Xent (the normalized temperature-scaled cross-entropy loss), defined as

ℓ(m, n) = −log [ exp(similarity(z_m, z_n)/τ) / Σ_{k=1}^{2N} 1_{[k≠m]} exp(similarity(z_m, z_k)/τ) ],

where τ is the temperature parameter, N is the batch size during pretraining, 1 is the indicator function, and the similarity function is the cosine similarity, defined as similarity(u, v) = uᵀv / (‖u‖ ‖v‖). Chen et al. (2020a) shows NT-Xent performs better when used in downstream tasks than alternative contrastive loss functions such as the margin (Schroff et al., 2015) or logistic (Mikolov et al., 2013) losses. p_θ̂(·) is a single-hidden-layer MLP that projects the pre-activation layer output into a lower-dimensional embedding space. Specifically, given its input x = f_θ(f_aug(i)), the MLP applies the function p_θ̂(x) = W^(2) σ(W^(1) x), where, for Resnet50, W^(1) ∈ R^(2048×2048), W^(2) ∈ R^(2048×128), and σ(·) is the rectified linear unit activation (Nair and Hinton, 2010). Comparing z_m and z_n was found to be more effective for learning representations than directly comparing the pre-activation layer outputs. Once pretraining is complete, the projection layer is discarded and the encoder is used for the downstream task.
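For reference, a compact, self-contained implementation of the NT-Xent loss defined above is sketched below. It is an illustrative re-implementation rather than the authors' code, and the temperature and batch size in the toy example are arbitrary.

```python
# NT-Xent: each of the 2N projections is contrasted against the other 2N - 1,
# with its own augmented view as the single positive.
import torch
import torch.nn.functional as F


def nt_xent(z_a, z_b, tau=0.5):
    """z_a, z_b: (N, d) projections of the two augmented views of N images."""
    n = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)  # l2-normalize so dot products are cosine similarities
    sim = z @ z.t() / tau                                  # (2N, 2N) temperature-scaled similarity matrix

    # Exclude self-similarity so each denominator sums over the other 2N - 1 samples.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))

    # The positive for sample k is its other augmented view: k <-> k + N (mod 2N).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)

    # Cross-entropy gives -log softmax at the positive index, averaged over all 2N samples.
    return F.cross_entropy(sim, targets)


# Toy usage with random projections (batch of 8, 128-dimensional embeddings):
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z_a, z_b).item())
```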
SwaV Caron et al. (2020) proposes a modification to the SimCLR framework that compares the cluster assignments of the projection vectors rather than the projection vectors themselves, which removes the requirement of pairwise comparisons. Each auxiliary projection is clustered using an online clustering algorithm (Cuturi, 2013) into a fixed number of clusters while maintaining consistency between the cluster assignments of the multiple views of the same image. Unlike SimCLR, this technique uses a multi-crop strategy: in addition to contrasting two same-size views (in this work, 224 × 224 pixels), one enforces consistency between smaller crops of the same image (96 × 96 pixels). The authors find this approach does not add substantial memory or computational overhead while improving the downstream performance. While the iterative Sinkhorn-Knopp algorithm used for clustering is computationally demanding, this method can work with smaller batch sizes and achieve similar downstream performance with fewer epochs than the aforementioned method.

BYOL Grill et al. (2020) proposes a method to learn representations from raw data without relying on negative samples. Two encoders with identical architectures (a non-trainable target and a trainable online network) are used in a feedback loop to prevent learning a collapsed solution. As in the previous techniques, two augmented views of the same image are generated, and one is passed through the online network (resulting in b_m), while the other is passed through the target (resulting in b_n). Unlike the other techniques, the online network output is transformed a second time using a mapping function q(·), and the distance between q(b_m) and b_n is minimized. The target network parameters are then updated using an exponential moving average of the previous online network parameters until the desired number of iterations has been reached.

Datasets We experiment with natural-scene, satellite, and medical images from different modalities. For pretraining, we use a collage of 40,000 histopathology images extracted from 60 publicly available datasets, denoted as "Histo". For validation, we use datasets from several domains (Zhang et al., 2017), chosen to differ from their pretraining counterparts to verify that the target dataset does not have to come from the same distribution as the source to learn useful representations. All other networks are validated on the same dataset they were pretrained on, where only the training split was used for pretraining. We report the multi-class classification accuracy on each dataset.

For validation, we compare baselines pretrained on ImageNet using SimCLR, SwaV, and BYOL, as well as networks pretrained with the P1X (single-step pretraining from scratch) and P2X (two-step pretraining) approaches, where each network is fine-tuned with the validation training set samples. For reference, we report downstream task performances with randomly initialized networks (no pretraining) as well as with supervised (Sup) initialization, where a Resnet50 is trained on the ImageNet dataset with supervision for 1000-class classification. In other words, "Sup" refers to weights obtained by supervised training on the ImageNet dataset. In contrast, SimCLR, SwaV, and BYOL refer to the weights obtained using self-supervision, which are fine-tuned on the validation datasets without a second stage of pretraining. We use the Adam optimizer with a learning rate of 3e-4 and a weight decay of 1e-4 for each experiment. The results are shown in Table 2. We train each setting (P1X and P2X) for 1000 epochs with a tile size of 224 × 224 pixels. Each method is compared within itself, i.e., a method pretrained on the ImageNet dataset is compared with pretraining from scratch (P1X) using the same method and with pretraining from already pretrained weights (P2X).

An important and attractive property of self-supervised techniques is their ability to perform significantly better under low-data regimes compared to randomly initialized networks (Henaff, 2020; Grill et al., 2020; Caron et al., 2020). We use 10% of each validation training set and compare the different approaches to understand whether further benefits are achievable with the second stage of pretraining on the domain of interest. Each experiment is run three times with a different 10% of the same dataset to avoid selecting a portion that favors any specific method. We report the average accuracy over the three runs. The results are shown in Table 3.

In supervised learning, researchers have previously reported that using pretrained initialization leads to faster convergence, but with a trade-off of lower final accuracy (Liu et al., 2017). We examine whether using pretrained networks in the two-step pretraining approach exhibits the same problem. We randomly select three medical and two natural-scene image validation datasets and compare the validation accuracy of each method when one pretrains for only 100 epochs versus 1000 epochs. The results are shown in Fig. 3.
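For concreteness, the downstream fine-tuning protocol used in these comparisons can be sketched as follows: a Resnet50 backbone initialized from the pretrained weights, a fresh classification head, and Adam with a learning rate of 3e-4 and a weight decay of 1e-4. The checkpoint format, data pipeline, and number of fine-tuning epochs below are placeholders, not the authors' exact setup.

```python
# Minimal sketch of the downstream fine-tuning setup: pretrained Resnet50
# backbone, new linear head, Adam (lr 3e-4, weight decay 1e-4), cross-entropy.
# `encoder_state_dict`, `train_loader`, and `epochs` are placeholders.
import torch
import torchvision


def finetune(encoder_state_dict, train_loader, num_classes, epochs=100):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = torchvision.models.resnet50(weights=None)
    model.load_state_dict(encoder_state_dict, strict=False)        # pretrained backbone
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # fresh classification head
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```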
We found that two-step pretraining can improve classification performance for multiple domains, shorten the pretraining duration, and allow for pretraining with smaller batch sizes and smaller images. Furthermore, we found that for some datasets the two-step approach significantly outperforms pretraining from scratch regardless of the number of epochs the network is pretrained for, indicating either limitations of the current contrastive techniques or that certain datasets are not conducive to learning a rich set of features that can be used for the downstream tasks. Furthermore, we find two-step pretraining is especially useful when only a fraction of the training data is used. One may achieve near-peak accuracy at only 100 epochs when the P2X procedure is applied; therefore, two-step pretraining can be considered an inexpensive strategy for boosting performance that surpasses both self-supervision and ImageNet pretraining (denoted as "Sup" in our experiments). We found that pretraining converges faster for P2X, which is reflected in the results in Fig. 3, where, for most datasets and unlike P1X, pretraining for 1000 epochs rather than 100 led to only a minor improvement. We also found that the performance gains reported by the self-supervised methods investigated in this paper do not necessarily hold for smaller datasets from different domains. For instance, while SwaV performs better than SimCLR when pretrained on ImageNet, it was not superior to SimCLR in our single-step pretraining (P1X) experiments. Overall, we find transferring pretrained features to be an effective strategy to bypass the factors limiting the widespread adoption of contrastive self-supervised learning in low-resource settings. Our results highlight that while using larger batches may be necessary for pretraining from scratch, fine-tuning these features for the domain of interest does not require large-batch training. Moreover, starting from an already pretrained baseline leads to significantly faster convergence with better final downstream task performance.

In this work, we proposed a simple transfer learning approach based on fine-tuning an already pretrained network to improve downstream task performance. The method we present assumes that a network pretrained on a larger dataset with rich visual diversity (e.g., the ImageNet dataset) can be used to improve task performance on a domain of interest that may not exhibit such diversity. In particular, this method aims to ameliorate the limitations of the contrastive objective when the contrasted images lead to learning representations that may not be optimal for the downstream task. As most classification tasks only involve images from a narrow subset of the visual domain (e.g., medical image analysis on a specific modality or brand identification from car images), using an already pretrained network, which may have encoded more generalized features, as initialization can perform better on the downstream task than pretraining with images only from the domain of interest. While most current methods are pretrained on the ImageNet dataset, which is limited to natural-scene images, it is possible to pretrain these methods on much larger datasets that cover a wider range of domains (e.g., natural-scene images beyond what is represented in ImageNet, medical images, or moving images such as movie snapshots). Once these pretrained networks are available, their weights can be used as initialization for further pretraining with the procedure described in this paper.
Moreover, it is reasonable to assume this procedure can be applied to future techniques that improve the state of the art. We believe that the transfer learning approach can also help democratize the efforts in unsupervised learning. While most practitioners do not have access to expensive hardware such as multiple GPUs or TPUs, the method presented here allows for pretraining on GPUs with smaller memory and with shorter training times. Furthermore, we have shown that even with very small datasets (e.g., the ultrasound dataset with 780 images), two-step pretraining can improve the downstream task performance. In cases where even unlabeled images are scarce (e.g., medical images that are subject to regulations and ethics board approvals), data efficiency becomes an attractive property of the proposed method. Finally, the mitigation of labeled data requirements is an important and necessary milestone for the adoption of machine learning into practice, especially when labeling efforts cannot be easily crowdsourced for applications that require expert annotations, such as medical imaging. In such applications, even incremental improvements (e.g., a few percentage points for classification) without additional data can help expedite clinical adoption. We have shown that pretrained networks can be used to achieve various degrees of gains over pretraining from scratch. Furthermore, we have shown that the benefits of two-step pretraining are more pronounced compared to single-step pretraining when only a fraction of the labeled data is used.

Time and memory requirements for pretraining with images of size 224 × 224 pixels are significantly higher than for images of size 96 × 96 pixels. For the former, one may fit 128 images per batch into 32 GB of GPU memory when pretraining a Resnet50 with the SimCLR method, whereas 350 images can be fit for the latter. We found that pretraining with the larger images takes three times longer for the three methods we investigated in this paper when Tesla V100 GPUs are used. Using smaller images allows for larger batch sizes as well as shorter training times; however, this is only beneficial if the difference in downstream task accuracy is negligible. Interestingly, we found that the downstream accuracy did not deteriorate when we pretrained with smaller images; these results are presented in Table 5, as we believe these findings may be of interest to researchers due to the possible memory and time savings when either method is used.

This work was funded by the Canadian Cancer Society (grant #705772) and NSERC. It was also enabled in part by support provided by Compute Canada (www.computecanada.ca).

In this section, we conduct two additional experiments to show the efficacy of our approach under low-resource settings. The results presented in Tables 4 and 5 are relative figures compared to the results presented in Table 2, Section 4. Most contrastive techniques require large-batch training or memory banks to learn representations.
This section examines whether using pretrained weights as initialization for self-supervision can mitigate these requirements. In contrast to our original experiments, where the batch size for training was set to 1024, we use a batch size of 32 and pretrain with the P1X and P2X approaches as described before. We chose a batch size of 32 as it was the largest power of two (2^x, where x is an integer) that could fit on a GPU with 8 GB of memory for the most demanding method investigated in this work (SwaV). The results shown in Table 4 indicate the difference in accuracy compared to the corresponding entries in Table 2. For instance, −4.0 for the SimCLR ∆ P1X Histo entry indicates the accuracy drop when the models are pretrained using a batch size of 32; since the corresponding value in Table 2 is 79.5, the accuracy for this entry is 75.4. Conversely, if the accuracy has improved, the difference is positive. When comparing P1X vs. P2X, the more positive value is considered better, regardless of the initial accuracy when one pretrains with a batch size of 1024.

We found that using a smaller batch size has a more pronounced negative impact on P1X compared to P2X for most settings (Table 4). In the few settings where P1X achieves a slightly better result compared to P2X (e.g., the MRI and Aircraft datasets for SimCLR, or the CelebA dataset for SwaV), the differences are comparable (within 3 to 4%). In contrast, the degradation gap between the two schemes (P1X and P2X) is relatively high for others (e.g., the Cars dataset for SimCLR and SwaV, or the SOP and Aircraft datasets for SwaV). Therefore, we conclude that P2X is more stable in settings where pretraining is done with a small number of images per iteration. Finally, the differences between the two schemes are not as significant for BYOL for most validation datasets. BYOL does not require negative pairs for self-supervision; therefore, it is impacted less when fewer pretraining samples per iteration are used.