Semi-weakly Supervised Contrastive Representation Learning for Retinal Fundus Images

Boon Peng Yap and Beng Koon Ng
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
arXiv:2108.02122v1 [cs.CV] 4 Aug 2021

The computational work for this article was fully performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).

We explore the value of weak labels in learning transferable representations for medical images. Compared to hand-labeled datasets, weak or inexact labels can be acquired in large quantities at significantly lower cost and can provide useful training signals for data-hungry models such as deep neural networks. We consider weak labels in the form of pseudo-labels and propose a semi-weakly supervised contrastive learning (SWCL) framework for representation learning using semi-weakly annotated images. Specifically, we train a semi-supervised model to propagate labels from a small dataset consisting of diverse image-level annotations to a large unlabeled dataset. Using the propagated labels, we generate a patch-level dataset for pretraining and formulate a multi-label contrastive learning objective to capture position-specific features encoded in each patch. We empirically validate the transfer learning performance of SWCL on seven public retinal fundus datasets, covering three disease classification tasks and two anatomical structure segmentation tasks. Our experiment results suggest that, under a very low data regime, large-scale ImageNet pretraining on an improved architecture remains a very strong baseline, and recently proposed self-supervised methods falter in segmentation tasks, possibly due to the strong invariance constraint imposed. Our method surpasses all prior self-supervised methods and standard cross-entropy training, while closing the gaps with ImageNet pretraining.

Transfer learning via the pretraining and fine-tuning paradigm is one of the most popular approaches to training deep convolutional neural networks (CNN) for medical imaging tasks, e.g., diabetic retinopathy detection [1]. In a typical transfer learning setting, a CNN is first initialized with weights pretrained on a large generic dataset, after which the weights are fine-tuned to solve a specific task on a smaller dataset. Although there might be a domain mismatch between the pretraining dataset and the fine-tuning dataset, the pretrained weights can still provide a good starting point for fine-tuning; in most cases, models initialized with pretrained weights converge faster and require fewer annotated samples than their randomly initialized counterparts. Traditionally, pretraining is performed in a fully supervised manner on a large hand-labeled dataset such as ImageNet [2]. Recently, there has been a push for self-supervised pretraining methods, which seek to extract generalizable representations without requiring human-annotated samples. Notable examples from the natural image domain include relative position prediction [3], image inpainting [4], rotation angle prediction [5], contrastive learning [6], [7], and other consistency-based methods [8]-[10]. SimCLR [6] popularizes the contrastive learning objective for self-supervised pretraining on image data.
On several classification datasets, models fine-tuned on representations learned by SimCLR outperform or match the performance of supervised ImageNet pretraining. In the medical imaging domain, Model Genesis [11] introduces a reconstruction-based objective for 3D medical images and achieves superior performance on various 3D image segmentation tasks compared to 2D models pretrained on ImageNet. In between fully supervised and self-supervised learning, there is another training paradigm known as weakly-supervised learning, which involves training on weak labels obtained from potentially noisy sources. Examples of weak labels include labels extracted from clinical reports accompanying medical images [12], [13], and pseudo-labels generated by models pretrained on small hand-labeled datasets. In contrast to expert annotations, which are prohibitively expensive to acquire, weak labels can be obtained in large quantities with minimal manual effort.

In this work, we explore the efficacy of weakly-annotated labels as training signals for generic representation learning. In particular, we propose a Semi-Weakly supervised Contrastive Learning (SWCL) framework for learning transferable visual representations from a semi-weakly annotated fundus patch dataset. We first propagate labels from OIA-ODIR [14], a relatively small dataset consisting of image-level annotations for a wide variety of retinal diseases, to a large set of fundus images from the Kaggle-EyePACS [15] dataset through semi-supervised learning. Kaggle-EyePACS is one of the largest public datasets for diabetic retinopathy (DR) grading. Despite being labeled for DR, the images in this dataset contain other types of diseases that have not been explicitly labeled by the dataset provider [16]. Therefore, we treat the entire Kaggle-EyePACS dataset as unlabeled and use it along with the labeled OIA-ODIR dataset to train a semi-supervised pseudo-labeler. After the pseudo-labeler is trained, we extract the class activation maps (CAMs) [17] for every image in both the labeled and unlabeled datasets to obtain a set of image-CAM pairs, which are then used to construct a patch-level dataset for pretraining.

To learn spatially consistent representations with position-specific lesion information, we construct a Semi-Weakly Annotated Patch dataset for retinal images (retinal-SWAP) from the image-CAM pairs. As shown in Figure 1, we crop each image and its CAM into five fixed patches and assign each patch a lesion score and a (relative) position label. The patch-wise lesion scores are computed from the crops of the CAMs, while the position labels are the relative positions of the crops with respect to the whole image. For normal (healthy) images from OIA-ODIR, we infer the lesion scores directly from the ground-truth labels, i.e., setting them to zero; the scores of the remaining patches depend on the activation values of the CAMs extracted from the pseudo-labeler, hence the constructed dataset is semi-weakly annotated. Using retinal-SWAP as the pretraining dataset, we propose a new training objective based on the supervised contrastive learning objective [18]. The training objective encourages the model to learn representations such that patches with the same labels are pulled together in a projected vector space, while patches with mismatched labels are pushed away from each other. Each patch in the pretraining dataset contains retinal structures specific to each relative position.
For example, the optic disc and macula are typically concentrated at the center, while blood vessel endings are found towards the edge of the fundus images. We hypothesize that learning to identify lesion regions from similar background anatomical structures via contrastive learning would benefit both downstream classification and segmentation tasks. We empirically verify this hypothesis on seven public retinal fundus datasets and compare the transfer learning performance with self-supervised pretraining, standard in-domain cross-entropy pretraining, and large-scale supervised pretraining methods. We release the code and pretrained models at https://github.com/BPYap/SWCL.

Our contributions in this paper are three-fold: 1) We demonstrate how to construct a semi-weakly annotated patch dataset for pretraining by propagating image-level annotations from a small dataset to a large unlabeled dataset in the form of class activation maps. 2) We formulate a multi-label contrastive learning objective to enforce consistency between patches with similar lesion and position information. 3) We conduct extensive transfer learning and ablation experiments to assess the effectiveness of our approach. We perform benchmarking on seven retinal fundus datasets that cover a wide range of tasks, including joint coarse- and fine-grained diabetic retinopathy classification, glaucoma classification, retinal vessel segmentation, and joint optic disc and optic cup segmentation. Our approach outperforms state-of-the-art self-supervised methods and the standard cross-entropy objective while being competitive with large-scale ImageNet pretraining.

In this section, we review previous work on two common paradigms for pretraining deep neural networks: supervised and self-supervised pretraining. Supervised pretraining involves training a model on large generic hand-labeled datasets, with the anticipation that the model learns useful representations that generalize to other tasks. Models pretrained on the ImageNet dataset [2], usually on the smaller ILSVRC-2012 split [19], have been extensively utilized in the medical imaging literature. Some applications include detection of COVID-19 [20] and bone age estimation [21] on X-rays, lung disease classification [22] and organ segmentation [23] on CT scans, breast cancer detection [24], [25] on mammograms, melanoma classification [26], [27] on dermoscopic images, and DR [1] and glaucoma [28] classification on retinal fundus images. Prior empirical observations show that weights pretrained on ImageNet provide a good starting point for solving downstream tasks; they greatly speed up model convergence and can even match or surpass human-level performance after fine-tuning on small hand-labeled datasets. Following recent advances in network architecture design and training algorithms, Kolesnikov, Beyer et al. [29] revisit supervised pretraining on ImageNet and propose BiT, a series of models pretrained using a modified version of the ResNet [30] architecture. Our work in this paper is based on this modified ResNet architecture, and we benchmark against the public release of the pretrained BiT models in the experiment section. On the other hand, Khosla et al. [18] extend the self-supervised contrastive learning objective (discussed in the next section) to the supervised setting. The supervised contrastive learning objective was experimentally shown to outperform the standard cross-entropy objective in terms of classification accuracy and robustness to noise.
Research in self-supervised learning has gained a lot of popularity in recent years. This approach aims to learn representations from a vast number of unlabeled samples using labels automatically inferred from the data itself. Earlier approaches to self-supervised learning for images rely on manually designed pretext tasks, such as predicting the relative position of image patches [3], inpainting missing parts of images [4], and predicting the angle of rotation applied to images [5]. More recent approaches seek to directly impose consistency constraints on the representations. Contrastive learning [6], [7] learns to minimize the distance between two similar representations (typically two augmented views of the same image) in a projected vector space, while maximizing the distance between two dissimilar representations. BYOL [8] discards the notion of dissimilar representations by training a student network to predict the representations generated by an exponential moving average version of the student network, otherwise known as the teacher network. Barlow Twins [9] applies the redundancy-reduction principle [31] to self-supervised learning by forcing the cross-correlation matrix computed from two batches of representations to be close to the identity matrix. DINO [10] performs self-distillation on two encoded views of the same image using a teacher-student setup similar to BYOL. These methods have largely closed the performance gap between self-supervised learning and supervised learning. However, our experiment results on retinal fundus datasets reveal that models pretrained with these consistency-based methods suffer a performance drop when transferring to segmentation tasks with limited annotations. This might be partly attributed to the invariance property imposed by strong data augmentations, which can cause the learned representations to have poor visual grounding [32]. Several works have been proposed to improve the localization ability of self-supervised learning, including aligning feature maps of image crops [33] and using unsupervised saliency maps [32] or an off-the-shelf region proposal tool [34]. These unsupervised saliency methods are not directly applicable to the medical imaging setting due to the subtle appearance and large variation of lesions typically encountered in medical images, which prompts us to explore the use of weak labels to inject information about lesions into image patches.

In the medical imaging domain, Model Genesis [11] outperforms ImageNet initialization in several 3D segmentation tasks with an image reconstruction-based objective designed specifically for 3D images. ConVIRT [13] explores contrastive pretraining on image-text pairs and observes improvements in classification and zero-shot retrieval tasks. For MRI images, Chaitanya et al. [35] extend the contrastive learning objective to 3D images by dividing each image volume into several partitions and forming positive pairs using partitions from the same location across different volumes. The y-Aware InfoNCE loss [36] incorporates continuous image meta-data, such as age information, into contrastive pretraining, based on the intuition that images of similarly-aged patients should have similar representations. For retinal fundus images, Li et al. endow learned representations with modality-invariant [37] and rotation-invariant [38] properties through a cross-modal image synthesis task and a rotation angle prediction task, respectively.
Briefly, our semi-weakly supervised pretraining framework consists of three main steps: 1) training a semi-supervised pseudo-labeler to propagate labels from a hand-labeled dataset to a large unlabeled dataset; 2) generating a semi-weakly annotated patch dataset for pretraining using CAMs from the pseudo-labeler; 3) pretraining on the generated dataset with a multi-label contrastive learning objective. Below, we describe each step in more detail.

We train a semi-supervised binary CNN classifier using OIA-ODIR [14] as the labeled dataset and images from Kaggle-EyePACS [15] as the unlabeled dataset. OIA-ODIR and Kaggle-EyePACS consist of 10,000 and 88,702 fundus images from the left and right eyes of 5,000 and 44,351 patients, respectively.

Dataset preprocessing. The left-right fundus pairs from OIA-ODIR were initially annotated with multi-hot labels from eight categories: normal, diabetic retinopathy, glaucoma, cataract, age-related macular degeneration, hypertension, myopia, and others. In addition, each individual fundus image was also annotated with diagnostic keywords in text form. To obtain a single label for each fundus image, we convert the diagnostic keywords into a binary label by treating keywords associated with a normal diagnosis as "normal" and other keywords as "abnormal". Through this step, we convert the multi-label dataset into a binary classification dataset. Next, we discard the labels in Kaggle-EyePACS and treat the images as unlabeled because the original labels only describe the severity of DR and do not account for other disease types [16]. Lastly, we resize all images from both the labeled and unlabeled datasets to 448 pixels along the shorter side using bilinear sampling.

Semi-supervised learning. We adopt the self-supervised semi-supervised learning (S4L) algorithm [39] as the learning algorithm. S4L is a simple and effective algorithm with a self-supervised regularization term for unlabeled data. In particular, the learning objective of S4L is given as

\min_\theta \; \mathcal{L}_\ell(\mathcal{D}_\ell, \theta) + w \, \mathcal{L}_u(\mathcal{D}_u, \theta),

where \mathcal{L}_\ell is a standard cross-entropy loss on the labeled dataset \mathcal{D}_\ell, \mathcal{L}_u is a loss defined for the unlabeled dataset \mathcal{D}_u, w is a non-negative scalar weight, and \theta denotes the parameters of the model. We set w to 1 and implement \mathcal{L}_u as the hard triplet loss [40] with a soft margin of 0.5.

Training details. Using the S4L algorithm, we train a ResNet-34 [30] classifier (the "pseudo-labeler") from scratch on the preprocessed datasets. To obtain CAMs at a higher resolution (required for the next step), we reduce the stride in the last down-sampling block of ResNet-34 to 1 to prevent the feature maps from shrinking too much. At each optimization step, we generate four augmented views from each image and apply the loss function to each view. The data augmentation scheme includes random cropping to the size of 384×384, random horizontal flipping and random color jittering. Following Zhai et al. [39], we also apply \mathcal{L}_u to the labeled images. We use the "off-site" split of OIA-ODIR as a hold-out validation set for hyperparameter tuning. The model is optimized using stochastic gradient descent (SGD) with Nesterov momentum, with a batch size of 256, a momentum of 0.9 and a weight decay of 0.001. The initial learning rate is set to 0.03 and decayed by a factor of 10 at epochs 140, 160, and 180. After training for 200 epochs, the classifier achieves an AUC-ROC score of 72% on the hold-out validation set.
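To make the combined objective concrete, the following PyTorch sketch shows one way the S4L loss above could be assembled. It is a minimal illustration under our own assumptions rather than the released implementation: the batch-hard, soft-margin triplet formulation, the helper names, and the assumption that `model` returns both classification logits and an embedding are ours.

```python
# Minimal sketch of the S4L-style objective: cross-entropy on labeled images
# plus a weighted triplet-style regularizer on augmented views. All helper
# names and the (logits, embedding) model interface are illustrative assumptions.
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, image_ids, margin=0.5):
    """Soft-margin triplet loss where views of the same image act as positives."""
    dist = torch.cdist(embeddings, embeddings, p=2)              # pairwise L2 distances
    same = image_ids.unsqueeze(0) == image_ids.unsqueeze(1)
    diag = torch.eye(len(image_ids), dtype=torch.bool, device=dist.device)
    hardest_pos = dist.masked_fill(~same | diag, float("-inf")).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.softplus(hardest_pos - hardest_neg + margin).mean()

def s4l_loss(model, labeled_batch, unlabeled_batch, w=1.0):
    """L = L_l (cross-entropy, labeled) + w * L_u (triplet, unlabeled views)."""
    x_l, y_l = labeled_batch          # labeled images, binary normal/abnormal targets
    x_u, ids_u = unlabeled_batch      # augmented views and their source-image ids
    logits_l, _ = model(x_l)          # model assumed to return (logits, embedding)
    _, emb_u = model(x_u)
    # As in the paper, L_u can also be applied to the labeled views by
    # concatenating them with the unlabeled batch before the triplet term.
    return F.cross_entropy(logits_l, y_l) + w * batch_hard_triplet_loss(emb_u, ids_u)
```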
After the pseudo-labeler is trained, we pass the 448×448 center crop of each image from the concatenation of OIA-ODIR and Kaggle-EyePACS to the pseudo-labeler and extract the class activation maps (CAMs) for both the normal and abnormal classes. CAMs can be interpreted as heatmaps that highlight the most discriminative regions for a particular class. Following Zhou et al. [17], we compute CAMs as the weighted sum of the feature maps of the last convolutional layer. Specifically, the value of a CAM at spatial location (i, j) for a class c, denoted as M_c(i, j), is computed by

M_c(i, j) = \sum_{k=1}^{K} w^c_k f_k(i, j),

where K is the number of feature maps (channels) in the last convolutional layer, w^c_k is the weight of the fully connected layer connecting the k-th feature map to the output neuron of class c, and f_k(i, j) is the value of the k-th feature map at spatial location (i, j). For each image, we first compute the CAMs for the abnormal class, M_a, and the normal class, M_n, before computing the normalized CAM for the abnormal class, \hat{M}_a, via a softmax operation:

\hat{M}_a(i, j) = \frac{\exp(M_a(i, j))}{\exp(M_a(i, j)) + \exp(M_n(i, j))}.

To construct the patch dataset for pretraining, the images along with their normalized CAMs are cropped into five patches of equal size at five positions: top-left, top-right, center, bottom-left and bottom-right, as illustrated in Figure 1. To keep the retinal structures aligned across patches, we flip the right-eye images and their CAMs horizontally before applying the crops. Each patch is then assigned two labels: a position label corresponding to the relative position of the crop and a lesion score label computed by averaging the values of the cropped \hat{M}_a. For patches from OIA-ODIR images whose ground-truth label is normal, we manually set the lesion scores to 0. A total of 493,470 semi-weakly annotated patches were generated through this annotation process.
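The sketch below illustrates the CAM computation and patch-wise lesion scoring described above. The layout of the classifier weights (a 2×K fully connected matrix with the normal class in row 0 and the abnormal class in row 1) and the half-size crop fraction are assumptions made for illustration; the actual patch geometry follows Figure 1.

```python
# Sketch of normalized-CAM extraction and five-crop lesion scoring.
# feature_maps (K, H, W): last convolutional feature maps of the pseudo-labeler.
# fc_weight (2, K): final fully connected weights (row 0: normal, row 1: abnormal).
import torch

def normalized_abnormal_cam(feature_maps, fc_weight):
    # M_c(i, j) = sum_k w^c_k * f_k(i, j) for c in {normal, abnormal}
    cams = torch.einsum("ck,khw->chw", fc_weight, feature_maps)
    # softmax over the two classes at every spatial location -> normalized abnormal CAM
    return torch.softmax(cams, dim=0)[1]

def five_crop_lesion_scores(cam, crop_frac=0.5):
    """Average the normalized CAM over five fixed crops (four corners + center)."""
    h, w = cam.shape
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    crops = {
        "top-left": cam[:ch, :cw],
        "top-right": cam[:ch, -cw:],
        "center": cam[(h - ch) // 2:(h + ch) // 2, (w - cw) // 2:(w + cw) // 2],
        "bottom-left": cam[-ch:, :cw],
        "bottom-right": cam[-ch:, -cw:],
    }
    return {position: patch.mean().item() for position, patch in crops.items()}
```

For a right-eye image, the CAM would be flipped horizontally (e.g., with `torch.flip(cam, dims=[1])`) before cropping, mirroring the alignment step described above.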
The goal of contrastive visual representation learning is to learn generalizable representations by discriminating positive image pairs against negative pairs in a representation space. In the self-supervised setting, positive pairs are typically defined as two differently augmented views of the same image, and negative pairs are defined as any pairs of views that do not originate from the same image. Without knowledge of image labels, this definition can result in false negative pairs, where different images belonging to the same underlying class are treated as negative pairs. This causes their representations to be pushed away from each other in the representation space, leading to worse representations. To mitigate this issue, supervised contrastive learning [18] adapts the self-supervised contrastive learning objective [6] to the fully supervised setting by defining positive pairs as views from the same visual class.

We further extend this supervised objective to the multi-label setting. Given N randomly sampled examples from a multi-label dataset with L labels, \{x_k, y_{k,1}, \ldots, y_{k,L}\}_{k=1 \ldots N}, where x_k is the k-th image and y_{k,\ell} is the \ell-th label of the k-th image, we first generate two augmented views for each image, such that each minibatch contains 2N samples, \{\tilde{x}_k, \tilde{y}_{k,1}, \ldots, \tilde{y}_{k,L}\}_{k=1 \ldots 2N}, where \tilde{x}_{2k} and \tilde{x}_{2k-1} are two different views of x_k and \tilde{y}_{2k,\ell} = \tilde{y}_{2k-1,\ell} = y_{k,\ell} for \ell \in [1, L]. To prevent clutter, we use \tilde{y}_k to denote the set of L labels for \tilde{x}_k. Within each minibatch, we minimize the following loss function:

\mathcal{L} = \sum_{i=1}^{2N} \frac{-1}{N_{\tilde{y}_i}} \sum_{j=1}^{2N} \mathbb{1}_{i \neq j} \cdot \mathbb{1}_{\tilde{y}_i = \tilde{y}_j} \cdot \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{i \neq k} \exp(z_i \cdot z_k / \tau)},

where N_{\tilde{y}_i} is the number of instances in the minibatch that have the same labels as the anchor i, \mathbb{1}_{condition} is an indicator function that returns 1 if condition evaluates to true and 0 if false, and \tau is a temperature parameter. We use the default value of 0.1 for \tau throughout all experiments. Vector z_i is the projection vector of \tilde{x}_i, computed by a small projection network attached to the output of an encoder network (more details are provided in Section IV-B). The loss has the same form as the supervised contrastive learning objective, except that the class label y is an L-dimensional vector instead of a scalar value, and y_i = y_j if and only if each element of y_i matches the corresponding element of y_j, i.e., y_{i,\ell} = y_{j,\ell} for all \ell \in [1, L].

The multi-label formulation makes use of the different types of labels present in each image. As shown in Figure 2, each patch of the retinal fundus image carries two labels: a position label and a lesion score label assigned during the semi-weak annotation process. We convert the lesion scores into binary abnormality labels by applying a threshold, t, to the scores, in which scores greater than or equal to t are assigned the abnormal label and scores less than t are assigned the normal label. We define positive pairs as patches with the same position and abnormality labels. The intuition behind this definition is based on the observation that different positions of the retina are usually associated with certain types of lesions, e.g., exudates surrounding the macula and hemorrhages near the blood vessels. Learning to encode positive pairs defined this way allows the encoder to capture features of position-specific lesions from the background structures, leading to better representation quality.

Practical consideration for patch alignment. Although we have applied horizontal flipping to the right-eye images to make sure they are roughly aligned with their left counterparts, there is still considerable misalignment among the images from Kaggle-EyePACS, which contains images from multiple screening sites captured under vastly different conditions, e.g., different lighting and zoom levels. From our ablation study (see Section IV-D), we found that simply matching the position and abnormality labels across all patches would result in lower-quality representations. Therefore, without resorting to complex image registration methods, we mitigate the misalignment issue by adding a patient-wise constraint to the definition of positive patches. Specifically, we treat two patches as a positive pair only if they originate from the same patient while having matching position and abnormality labels. We implement this constraint by adding a patient identifier as the third label of each patch. Furthermore, to ensure there are sufficient inter-image positives in each minibatch, we sample the minibatch by pairing each left-eye patch with the corresponding right-eye patch of the same patient until there are N patches in the minibatch.
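A compact sketch of the multi-label contrastive objective above is given below. It assumes the per-patch labels have already been collected into an integer matrix (position index, thresholded abnormality label, patient identifier); this vectorized form is our own illustration, not the released code.

```python
# Sketch of the multi-label supervised contrastive loss.
# features: (2N, d) projection vectors z; labels: (2N, L) integer label matrix,
# e.g., columns = (position index, binary abnormality, patient id).
import torch
import torch.nn.functional as F

def multilabel_supcon_loss(features, labels, temperature=0.1):
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # positives: other views whose entire label vector matches the anchor's
    pos_mask = (labels.unsqueeze(1) == labels.unsqueeze(0)).all(dim=-1) & ~self_mask

    # log-probability of each candidate; the anchor is excluded from the denominator
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    n_pos = pos_mask.sum(dim=1).clamp(min=1)        # guard against anchors with no positives
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)) / n_pos
    return loss.mean()
```

With retinal-SWAP, the label matrix would hold the position index, the thresholded abnormality label and the patient identifier, so only aligned patches of the same patient with matching abnormality status are pulled together.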
To evaluate the quality of the learned representations, we fine-tune the models in an end-to-end fashion on seven retinal fundus datasets with very limited annotations. The details of each dataset are summarized in Table I. We resize each image from all classification datasets until the shorter side is 350 pixels. For the segmentation datasets, we only resize the images from REFUGE-seg to 514×514 pixels.

Baselines. We compare our method (SWCL) against other weight initialization methods, including:
• random initialization (no pretraining) - random
• self-supervised baselines (pretrained on the concatenation of OIA-ODIR and Kaggle-EyePACS without labels) - SimCLR [6], BYOL [8], DINO [10]
• supervised baselines (pretrained on hand-labeled datasets) - OIA-ODIR, Kaggle-EyePACS, BiT-S [29], BiT-M [29]
• multitask cross-entropy baseline (pretrained on retinal-SWAP) - CE-multitask
• image-level SWCL baseline (pretrained on retinal-SWAP, but with the patch-level abnormality labels replaced by image-level labels) - SWCL-image

We obtain the pretrained weights for BiT-S and BiT-M from the official BiT repository. Both models were pretrained on large-scale image classification tasks: BiT-S was trained on the ILSVRC-2012 [19] dataset consisting of 1.3M images, while BiT-M was trained on the ImageNet-21k [2] dataset consisting of 14M images. We use both of these pretrained models as upper-bound baselines for large-scale supervised pretraining.

Model architecture. For the main experiments, we use the BiT [29] variant of ResNet-50x1 as the encoder backbone. This architecture is an improved version of the ResNet-v2 architecture [48] with all its Batch Normalization [49] layers replaced with Group Normalization [50] and Weight Standardization [51]. For the projection network, we use a 2-layer multi-layer perceptron (MLP) with an output dimension of 128 (the same as the one used in SimCLR [6]). When fine-tuning on downstream tasks, the projection network is discarded and replaced with either task-specific linear classification layers (for classification tasks) or a segmentation network (for segmentation tasks). We use the DeepLabv3+ [52] architecture as the segmentation network.

Data augmentation scheme. The pretraining stage follows the same data augmentation scheme used in SimCLR [6] and BYOL [8], which includes random cropping, random horizontal flipping, random color jittering, random grayscale, Gaussian blurring and solarization. In the fine-tuning experiments, we use random cropping (224×224 for classification tasks, 384×384 for segmentation tasks), random horizontal flipping and random grayscale in the training stage. During inference, we resize the images to 256 pixels along the shorter side before taking the 224×224 center crops as inputs for the classification tasks; for segmentation tasks, we stitch together patches of 384×384 segmentation masks obtained in a sliding-window fashion. After applying data augmentations, the color channels of all images are normalized by the mean and standard deviation of colors computed on the Kaggle-EyePACS dataset.

Pretraining details. All models are pretrained from randomly initialized encoder-projector networks using the AdamW [53] optimizer. For our main method (SWCL), we set the base learning rate to 0.001 and train the model for 40 epochs with a batch size of 496 and a weight decay of 1e-4. The learning rate is linearly increased to the base value in the first 5 epochs before gradually decaying towards 0 using a cosine scheduler. For the baseline methods, we separately tune the values of the base learning rate and weight decay. Since retinal-SWAP contains five times more training examples than the image-level datasets, we train the self-supervised baselines for 200 epochs such that the number of optimization steps is similar.
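The encoder-projector wiring and the head swap at fine-tuning time can be sketched as follows. torchvision's standard ResNet-50 stands in for the BiT ResNet-50x1 backbone (so the normalization layers differ from the paper's actual architecture), and the 2048-d hidden width of the projector is an assumption borrowed from SimCLR.

```python
# Sketch of the pretraining model (encoder + 2-layer MLP projector, 128-d output)
# and of replacing the projector with a task head for fine-tuning.
import torch.nn as nn
from torchvision.models import resnet50

class PretrainModel(nn.Module):
    def __init__(self, proj_dim=128, hidden_dim=2048):
        super().__init__()
        self.encoder = resnet50()                       # stand-in for BiT ResNet-50x1
        self.feat_dim = self.encoder.fc.in_features     # 2048 for ResNet-50
        self.encoder.fc = nn.Identity()                 # keep only the backbone features
        self.projector = nn.Sequential(
            nn.Linear(self.feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, proj_dim),
        )

    def forward(self, x):
        return self.projector(self.encoder(x))          # projection vectors z

def build_finetune_classifier(pretrained: PretrainModel, num_classes: int):
    """Discard the projector and attach a task-specific linear head
    (a DeepLabv3+ decoder would replace the head for segmentation tasks)."""
    return nn.Sequential(pretrained.encoder, nn.Linear(pretrained.feat_dim, num_classes))
```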
TABLE I: Details of the datasets used in the fine-tuning experiments. † Following Sánchez et al. [41], we treat Messidor as a coarse-grained classification dataset by grouping DR grades 0 and 1 as the non-referable class and DR grades 2 and 3 as the referable class. ‡ Following the conventional train-test split [42]. * Accuracy when all predictions within a test sample are correct. (Abbreviations: DR - diabetic retinopathy; DME - diabetic macular edema; OD - optic disc; OC - optic cup)

| Dataset | Task | # classes | # train | # validation | # test | Performance measure |
|---|---|---|---|---|---|---|
| Messidor [43] | Joint DR & DME classification | DR: 2†, DME: 3 | 1196 | - | - | Joint accuracy* |
| IDRiD [44] | Joint DR & DME classification | DR: 5, DME: 3 | 413 | - | 103 | Joint accuracy* |
| REFUGE-cls [28] | Glaucoma classification | 2 | 400 | 400 | 400 | AUC-ROC |
| DRIVE [45] | Retinal vessel segmentation | 2 | 20 | - | 20‡ | F1-score |
| STARE [46] | Retinal vessel segmentation | 2 | 10 | - | 10‡ | F1-score |
| CHASE_DB1 [47] | Retinal vessel segmentation | 2 | 20 | - | 8‡ | F1-score |
| REFUGE-seg [28] | Joint OD & OC segmentation | OD: 2, OC: 2 | 400 | 400 | 400 | Average F1-score |

Fine-tuning details. For datasets without a validation split (i.e., IDRiD, DRIVE, STARE, CHASE_DB1), we randomly select 20% of the samples from the training split as the validation split; for the dataset without both validation and test splits (i.e., Messidor), we report results from 5-fold cross-validation. During fine-tuning, we initialize each neural network with the parameters of a pretrained network and optimize the loss using SGD with Nesterov momentum, with a momentum of 0.9. Models in classification tasks are trained with a batch size of 64 for up to 120 epochs, while models in segmentation tasks are trained with a batch size of 8 for up to 300 epochs. For each task and initialization method, we perform a grid search to select the best learning rate and weight decay parameters. After selecting the best-performing hyperparameters, we train a new model for each task using the selected parameters on the train split, with early stopping applied on the validation split. We repeat each experiment five times using five different random seeds and report the performance measure on the test split.
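The fine-tuning protocol (a grid search over learning rate and weight decay with early stopping on the validation split) can be summarized by the sketch below. The candidate grids, the patience value and the `build_model` / `train_one_epoch` / `evaluate` callables are hypothetical placeholders, not the settings used in the paper.

```python
# Illustrative grid search with early stopping for fine-tuning (SGD + Nesterov momentum).
import itertools
import torch

def finetune_with_grid_search(build_model, train_one_epoch, evaluate,
                              lrs=(0.03, 0.01, 0.003), wds=(1e-4, 1e-5),
                              max_epochs=120, patience=10):
    best_score, best_cfg = float("-inf"), None
    for lr, wd in itertools.product(lrs, wds):
        model = build_model()                      # fresh copy initialized from pretrained weights
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                              weight_decay=wd, nesterov=True)
        best_val, bad_epochs = float("-inf"), 0
        for _ in range(max_epochs):
            train_one_epoch(model, opt)
            val_score = evaluate(model)            # e.g., joint accuracy, AUC-ROC or F1
            if val_score > best_val:
                best_val, bad_epochs = val_score, 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:         # early stopping on the validation split
                    break
        if best_val > best_score:
            best_score, best_cfg = best_val, (lr, wd)
    return best_cfg, best_score
```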
The transfer learning performance measured on the test sets is reported in Table II. Below, we highlight several key observations from the table.

All pretraining methods outperform training from scratch in classification tasks. Under a very low data regime, models initialized from pretrained weights consistently outperform a randomly initialized model on all three disease classification tasks, with improvements as large as 22.14% joint accuracy on the IDRiD benchmark with BiT-S initialization. This suggests that pretraining on either labeled or unlabeled data can, in general, provide a great starting initialization for downstream classification tasks. The number of in-domain training examples matters as well: the Kaggle-EyePACS baseline, which was trained on a larger set of labeled retinal fundus images, achieves comparably better downstream performance than the OIA-ODIR baseline. On the contrary, this is not the case for ImageNet pretraining, where BiT-S, which is pretrained on the smaller subset of ImageNet, outperforms BiT-M on two benchmarks (i.e., IDRiD, DRIVE).

SWCL improves segmentation performance. Despite obtaining good transfer performance in classification tasks, models pretrained using self-supervised methods have poor transfer performance when it comes to dense prediction tasks. For example, BYOL and DINO achieve slightly worse or comparable segmentation performance relative to the random baseline on the DRIVE, STARE and REFUGE-seg benchmarks. The drops in segmentation performance can be partly attributed to the invariance properties imposed at the image level [33]. During pretraining, self-supervised objectives force different augmented patches of the same image to have consistent representations regardless of the relative position of the patches. This reduces the amount of position-related information captured by the encoder, which in turn causes a performance drop in tasks that depend on such information, i.e., dense prediction tasks. In contrast, with the guidance of position-specific lesion information and patch alignment, SWCL is able to learn representations that benefit both classification and segmentation tasks under the same data augmentation scheme.

SWCL outperforms in-domain cross-entropy objectives. Compared to pretraining with standard cross-entropy objectives on single-task hand-labeled datasets (OIA-ODIR, Kaggle-EyePACS) and the generated multi-task retinal-SWAP dataset (CE-multitask), SWCL consistently achieves better performance across almost all benchmark datasets. Notably, SWCL improves upon the next best method (Kaggle-EyePACS) by 4.51% and 4.85% joint accuracy on the Messidor and IDRiD benchmarks, respectively, and by 4.06% AUC-ROC on the REFUGE-cls benchmark. Interestingly, CE-

Replacing semi-weakly annotated patch labels with image-level labels. We also experiment with another way of assigning patch-wise abnormality labels by replacing the semi-weakly annotated labels with image-level labels from the original datasets. Similar to SWCL, models initialized with weights trained on the replaced abnormality labels (SWCL-image) surpass both self-supervised and in-domain cross-entropy pretraining, which shows that the addition of position information and contrastive learning plays a crucial role in representation learning for retinal fundus images. By injecting patch-specific lesion information extracted from the CAMs of a semi-supervised pseudo-labeler, our method (SWCL) achieves even better performance on the classification benchmarks while maintaining competitive segmentation performance.

SWCL closes the gap with large-scale pretraining. With improved architectural design (i.e., group normalization [50], weight standardization [51]) and large hand-labeled datasets, BiT-S and BiT-M achieve remarkable transfer learning performance on a wide variety of natural image classification tasks [29]. Our experiment results show that the dominant transfer learning performance of BiT-S and BiT-M also carries over to retinal fundus images. In five out of seven benchmarks (i.e., Messidor, IDRiD, DRIVE, STARE, REFUGE-seg), models initialized with weights from BiT-S or BiT-M achieve the best overall performance. By training on a relatively small semi-weakly annotated dataset, SWCL closes the performance gap in these five benchmarks and outperforms the large-scale pretraining methods in the remaining two benchmarks (i.e., REFUGE-cls, CHASE_DB1).

We conduct ablation studies for different values of t and different labeling schemes in the multi-label contrastive objective. For each configuration, we pretrain a ResNet-18 model from scratch and report the fine-tuning performance on the validation split of REFUGE-cls. Each model is trained for 40 epochs with a batch size of 512. The rest of the hyperparameters follow the values described in Section IV-B.
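The label-matching schemes compared in this ablation can be expressed compactly: each patch carries a tuple of labels, the lesion score is binarized at threshold t, and a scheme simply selects which fields must match for two patches to count as positives. The field names and the default threshold below are illustrative choices, not identifiers from the released code.

```python
# Sketch of the ablated label-matching schemes for the contrastive objective.
def patch_labels(position, lesion_score, patient_id, t=0.4):
    abnormal = int(lesion_score >= t)              # binarize the lesion score at threshold t
    return {"position": position, "abnormal": abnormal, "patient": patient_id}

SCHEMES = {
    "full": ("position", "abnormal", "patient"),
    "no position": ("abnormal", "patient"),
    "no patient": ("position", "abnormal"),
    "no abnormality": ("position", "patient"),
    "abnormality only": ("abnormal",),
}

def is_positive(a, b, scheme="full"):
    """Two patches are positives if they agree on every field required by the scheme."""
    return all(a[field] == b[field] for field in SCHEMES[scheme])
```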
The ablation results are given in Table III.

Optimal threshold. Under the full multi-label scheme for contrastive representation learning (i.e., with position, abnormality, and patient labels), the AUC-ROC score improves by 1.49% and peaks at 96.52% as t is increased from 0.3 to 0.4. When t is increased to 0.5, the classification performance drops to 92.91% (a decrease of 3.61%), due to the increase in false negatives as more patches are labeled as normal.

Label matching scheme. We gradually remove labels from the multi-label contrastive learning objective to assess how each label impacts the representation quality. When only the position or patient label is removed, the performance drop is less than 1%. In contrast, when the abnormality label is removed, the performance drop increases to 3.52%, demonstrating that generic lesion information plays an important role in learning representations that can generalize to other diseases. However, if we remove all labels except the abnormality label, the AUC-ROC score drops below the performance of an ILSVRC-2012 baseline (88.18%) to 87.51%. Coupled with the performance of the in-domain pretraining baselines discussed in Section IV-C, these results suggest that abnormality information alone is not sufficient for generic visual representation learning, indicating the importance of position alignment.

Class activation maps. We visualize the CAMs extracted for the abnormal class from the pseudo-labeler for randomly selected images in Figure 3. From the figure, we can see that the pseudo-labeler assigns high activation values to regions with noticeable exudates (e.g., towards the center of the second image in the second row), as well as to the swollen optic disc appearing in the center of the fourth image in the last row. Compared to image-level labels, the patch-level lesion scores derived from CAM patches prevent normal patches from being classified as abnormal when the image-level label is abnormal. These position-specific labels can provide more informative training signals for the contrastive learning objective, as the model has to discriminate between normal and abnormal patches within whole images.

Lesion score distribution. Figure 4 plots the distribution of the lesion scores computed by the pseudo-labeler on the concatenation of OIA-ODIR and Kaggle-EyePACS. The lesion scores among the unlabeled images form a bell-shaped curve with a peak around the interval between 0.2 and 0.3. The optimal score threshold, i.e., t = 0.4, produces 142,957 abnormal patches, which accounts for around 29% (out of 493,470 patches) of the patches used in the pretraining stage. The majority of the patches from OIA-ODIR have scores between 0 and 0.1 because we assume that normal patches (whose image-level label is normal) contain no lesion regions and explicitly set the score of each normal patch to 0.

Fig. 4: Distribution of lesion scores in the semi-weakly annotated patch dataset.

We propose a semi-weakly supervised representation learning framework for retinal fundus images. Through semi-supervised training, we propagate existing image-level annotations from a small hand-labeled dataset to another large unlabeled dataset in the form of class activation maps, which are then used to generate a large patch-level pretraining dataset with semi-weak annotations.
Together with the multi-label contrastive learning objective, our framework is competitive with large-scale supervised ImageNet pretraining and can produce highly transferable representations for a wide variety of downstream tasks such as joint disease classification and anatomical structure segmentation. Our pretraining framework can also be easily extended to other imaging modalities with high structural similarities among images, e.g., CT scans and radiographic images.

References
[1] Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs
[2] ImageNet: A large-scale hierarchical image database
[3] Unsupervised visual representation learning by context prediction
[4] Context encoders: Feature learning by inpainting
[5] Unsupervised representation learning by predicting image rotations
[6] A simple framework for contrastive learning of visual representations
[7] Momentum contrast for unsupervised visual representation learning
[8] Bootstrap your own latent: A new approach to self-supervised learning
[9] Barlow twins: Self-supervised learning via redundancy reduction
[10] Emerging properties in self-supervised vision transformers
[11] Models Genesis
[12] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
[13] Contrastive learning of medical visual representations from paired images and text
[14] A benchmark of ocular disease intelligent recognition: One shot for multi-disease detection
[15] EyePACS: An adaptable telemedicine system for diabetic retinopathy screening
[16] Retinal abnormalities recognition using regional multitask learning
[17] Learning deep features for discriminative localization
[18] Supervised contrastive learning
[19] ImageNet large scale visual recognition challenge
[20] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[21] Fully automated deep learning system for bone age assessment
[22] Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning
[23] Deep Q learning driven CT pancreas segmentation with geometry-aware U-Net
[24] Automated analysis of unregistered multi-view mammograms with deep learning
[25] Deep neural networks improve radiologists' performance in breast cancer screening
[26] A convolutional neural network trained with dermoscopic images performed on par with 145 dermatologists in a clinical melanoma image classification task
[27] A deep learning system for differential diagnosis of skin diseases
[28] REFUGE challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs
[29] Big Transfer (BiT): General visual representation learning
[30] Deep residual learning for image recognition
[31] Possible principles underlying the transformation of sensory messages
[32] Casting your model: Learning to localize improves self-supervised representations
[33] Spatially consistent representation learning
[34] DETReg: Unsupervised pretraining with region priors for object detection
[35] Contrastive learning of global and local features for medical image segmentation with limited annotations
[36] Contrastive learning with continuous proxy meta-data for 3D MRI classification
[37] Self-supervised feature learning via exploiting multi-modal data for retinal disease diagnosis
[38] Rotation-oriented collaborative self-supervised learning for retinal disease diagnosis
[39] S4L: Self-supervised semi-supervised learning
[40] In defense of the triplet loss for person re-identification
[41] Evaluation of a computer-aided diagnosis system for diabetic retinopathy screening on public data
[42] Unsupervised ensemble strategy for retinal vessel segmentation
[43] Feedback on a publicly distributed database: The Messidor database
[44] Indian diabetic retinopathy image dataset (IDRiD)
[45] Ridge-based vessel segmentation in color images of the retina
[46] Locating blood vessels in retinal images by piece-wise threshold probing of a matched filter response
[47] An ensemble classification-based approach applied to retinal blood vessel segmentation
[48] Identity mappings in deep residual networks
[49] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[50] Group normalization
[51] Weight standardization
[52] Encoder-decoder with atrous separable convolution for semantic image segmentation
[53] Decoupled weight decay regularization