key: cord-0135751-efi1mqky authors: Ravula, Sriram; Smyrnis, Georgios; Jordan, Matt; Dimakis, Alexandros G. title: Inverse Problems Leveraging Pre-trained Contrastive Representations date: 2021-10-14 journal: nan DOI: nan sha: 5781773f46d2a9012734c2d31d74008b6add0a0e doc_id: 135751 cord_uid: efi1mqky

We study a new family of inverse problems for recovering representations of corrupted data. We assume access to a pre-trained representation learning network R(x) that operates on clean images, like CLIP. The problem is to recover the representation of an image R(x), if we are only given a corrupted version A(x), for some known forward operator A. We propose a supervised inversion method that uses a contrastive objective to obtain excellent representations for highly corrupted images. Using a linear probe on our robust representations, we achieve higher accuracy than end-to-end supervised baselines when classifying images with various types of distortions, including blurring, additive noise, and random pixel masking. We evaluate on a subset of ImageNet and observe that our method is robust to varying levels of distortion. Our method outperforms end-to-end baselines even with a fraction of the labeled data, across a wide range of forward operators.

Modern representation learning networks like CLIP [35] are showing incredible performance for image classification, even for zero-shot problems with labels not seen during training. Training these encoders comes at a staggering cost and requires datasets and computing resources available to only a few organizations. In this paper we show how to leverage this pretrained power for a new family of problems in the presence of image corruptions or other types of measurements.

Inverse problems involve reconstructing an unknown vector x from measurements y = A(x). Typically, the forward operator A corrupts the unknown vector x and reduces its dimension, i.e. the observations y live in a lower-dimensional space compared to x. In the special case of linear inverse problems, the forward operator is simply a matrix and the measurements are of the form y = Ax + noise. Special cases of linear inverse problems include image denoising, inpainting, super-resolution, and the compressed sensing used in medical tomography and seismic geological imaging, among many others; see e.g. [31] for a recent overview.

Figure 1: We observe highly corrupted images and use a simple linear probe to classify into ImageNet-100 labels. We present the top 3 classes from our models as well as those from the end-to-end supervised baselines trained with the same amount of labeled data, for select images. For three different types of forward operators (90 percent missing pixels, strong Gaussian noise, and heavy blurring), our robust encoders classify correctly and also produce reasonable top 3 alternatives. In contrast, the supervised baselines completely fail even though they were fine-tuned on exactly this task of classifying corrupted images, starting from a powerful ImageNet-pretrained ResNet-101. We also expect that most humans would fail to classify such highly corrupted images; more examples are included in the Appendix.

In this paper, we introduce the study of a new family of inverse problems: reconstructing the representation of an image given a corrupted or measured input. Formally, if a (clean) image is x and its CLIP representation is R(x), we would like to obtain that representation by only observing a highly corrupted input A(x).
This is impossible if the forward process A removes information needed to obtain the representation. Surprisingly, we show that we can recover representations that are useful for downstream tasks, even from extremely corrupted versions of the image. We introduce a robust encoder S that is trained to imitate the behavior of the pretrained CLIP encoder acting on clean images x. However, the input to the robust encoder is only corrupted images A(x) that are created by applying the forward operator to x. Our approach is illustrated in Figure 2. The teacher encoder is the pretrained CLIP, and the student encoder is our robust encoder operating on corrupted images. Formally, the robust encoder S(A(x)) is trained to approximate R(x) using a contrastive loss.

Many applications, such as object recognition with low-cost cameras, remote sensing, and aerial imaging, rely on noisy or blurry data and can face occlusions or sensor corruptions. As we demonstrate in our experiments in the Appendix, normal CLIP fails on highly corrupted images. Our procedure allows us to transfer the power of CLIP to heavily corrupted images in downstream tasks, with relatively little extra training. We show that our method is able to obtain useful representations even under extreme corruptions such as removing 90% of the pixels, as shown in the top panel of Figure 1. The highly corrupted images enter our robust encoder and the obtained representation is used by a linear classifier to produce ImageNet-100 labels. Our main result is that our method outperforms a pretrained ResNet (of the same size as our robust encoder) fine-tuned end-to-end on labeled distorted images.

Using less labeled data. For some corruption levels, we are able to outperform end-to-end fine-tuned ResNets using as little as 10% of the labeled samples. This holds even when the fine-tuned baseline uses 100% of the ImageNet-100 labels for training. The primary advantage of our model is that our robust encoder observes representations from the pretrained CLIP encoder, which was trained on a much bigger dataset than ImageNet. Still, the fact that this implicit advantage in the representations, combined with only 10% of the labeled data, is sufficient to outperform a supervised ResNet fine-tuned with ten times more labeled data is very surprising and illustrates the power and versatility of pretrained representation learners.

Robustness to noise and data shifts. Our method is very robust to changes in the forward operators, data statistics, and label shifts. We experiment with three classes of forward operators: random pixel masking, additive Gaussian noise, and Gaussian blurring. For each, we train and test on a wide range of distortion levels, from slight to severe corruption. We show that our robust encoder produces useful representations even when the level of corruption is outside its training domain. We show that our representations are useful for a wide range of tasks, without requiring knowledge of the task when the robust encoder is trained. We illustrate excellent classification accuracy across five datasets, frequently outperforming end-to-end supervised baselines trained with knowledge of the target task. Our experiments include a chest X-ray COVID pneumonia task, which has very different morphology compared to ImageNet. Surprisingly, the same universal representations, combined with a custom linear probe, are very successful across all tasks.
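The following minimal PyTorch sketch illustrates the teacher-student setup described above (cf. Figure 2): a frozen pretrained encoder provides targets from clean images, while the student only ever sees corrupted inputs, and a linear probe is fit on the frozen student in a later stage. This is a schematic rather than the exact implementation; the 512-dimensional embedding, the ResNet-101 trunk, and all helper names are illustrative assumptions.

```python
# Schematic sketch of the teacher-student setup: a frozen pretrained encoder R provides
# targets from clean images, while the student S only ever sees corrupted images A(x).
# The 512-dim embedding, the ResNet-101 trunk, and all names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class RobustEncoder(nn.Module):
    def __init__(self, teacher: nn.Module, embed_dim: int = 512, num_classes: int = 100):
        super().__init__()
        self.teacher = teacher.eval()                    # e.g. a CLIP image encoder, kept frozen
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        trunk = torchvision.models.resnet101(weights=None)
        trunk.fc = nn.Linear(trunk.fc.in_features, embed_dim)
        self.student = trunk                             # robust encoder S, trained on A(x)
        self.linear_probe = nn.Linear(embed_dim, num_classes)  # fit in a second, supervised stage

    @torch.no_grad()
    def teacher_targets(self, clean_images: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.teacher(clean_images), dim=-1)      # R(x)

    def forward(self, corrupted_images: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.student(corrupted_images), dim=-1)  # S(A(x))

    def classify(self, corrupted_images: torch.Tensor) -> torch.Tensor:
        return self.linear_probe(self.forward(corrupted_images))    # logits from the linear probe
```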
Contrastive versus MSE training. We formulate the contrastive student training method as a regularization of the simple mean squared error loss between student embeddings of a distorted image and teacher embeddings of the clean version of that image. We analyze the effects of this regularization on the training dynamics. Our results empirically show that simple MSE is worse in most cases, further strengthening our argument for the usefulness of contrastive learning in this setting.

Robust Image Recognition. It has been shown that classifiers that perform well on clean image tasks are not robust to common image distortions [21]. Several datasets have been proposed specifically to benchmark the generalization of classifier performance to natural distortions [18, 38]. Related to our work, [45] and [12] fine-tune pretrained classifiers on distorted data, which seems to yield better performance than end-to-end training. However, even under these training processes, modern classifiers exhibit inferior performance to human vision on distorted data [12].

Inverse problems. There is significant recent literature on solving inverse problems including denoising, inpainting, and deconvolution for deblurring. While classical techniques rely on sparsity-based priors, e.g. [30, 37], recent techniques include data-driven deep-learning methods [1, 10, 13, 33] as well as combinations of sparsity and generative methods [11] and untrained deep nets [16]. Our work focuses on recovering the representation of an image, as opposed to the image itself, so it is also related to task-aware sensing [23]. However, our approach is fundamentally different from all previous inverse problems since instead of trying to reconstruct the image x, we aim to reconstruct a representation R(x), which lives in a different space. The other important distinction is that even though our corruption processes are linear in the pixel space, they are non-linear with respect to the representation vector we try to recover. In effect, we are solving a non-linear inverse problem in a supervised way using a contrastive loss. In the Appendix, we compare to methods that attempt to solve the inverse problem in pixel space and then apply a classifier on the recovered image.

Contrastive Representation Learning. Self-supervised representation learning has recently exploded in popularity, largely due to the success of contrastive losses in learning representations from unlabeled data [4, 6, 7, 15, 17, 43]. These techniques are able to generate highly general embeddings of images that are effective for many types of downstream tasks, even on domains that were not explicitly considered in training [27]. Contrastive losses generally operate based on a simple push-pull principle: images desired to be close in embedding space are pulled together, while unrelated images are pushed apart. One particularly popular choice of contrastive loss is the InfoNCE loss, derived from techniques for noise contrastive estimation [14] and popularized in the self-supervised setting by [32]. Several works have considered adversarial training frameworks to yield representations that are more robust to adversarial attacks [20, 22, 25]. However, these works only consider adversarial robustness and not robustness to common corruptions. Our approach is similar to that of [24], which employs a variant of the InfoNCE loss in a supervised setting; our work differs in that we exclusively focus on robustness to natural corruptions.
Moreover, our contrastive training step can be performed without any task-specific labeled data. This is made possible by the use of the powerful embeddings provided by CLIP, and allows the embeddings to be used for downstream tasks on multiple datasets.

Knowledge Transfer Methods. Our work is closely related to prior works which aim to distill, reduce, or transfer the knowledge from one network to another for a specific task [3, 19, 29, 44]. Of note is [40], where the authors use a contrastive objective to transfer representations from a teacher network to a student network. Our work diverges in that we do not transfer from a larger, more powerful teacher to a smaller student, but rather transfer between a teacher and a student of the same architecture initialized from the same weights. In addition, although the authors test on cross-modal transfer tasks such as transferring between color channels, transferring representations between clean and distorted images is a different task: we try to extract the same high-level information from less data, as opposed to different but related data.

Our problem is to recover the embeddings of clean images when we only have access to highly corrupted versions of the images. The encoder R(·) is assumed to yield high-quality representations for a variety of domains. The distortion process A(·) is assumed to be a known forward operator that greatly distorts images. We assume A(·) is sufficiently severe as to inhibit the performance of the encoder R(·), but not so severe that recovery is impossible. From a collection of input images {x_i}_{i=1}^N, we are only given access to the distorted images {A(x_i)}_{i=1}^N and the representations of the clean images {R(x_i)}_{i=1}^N. Our approach is to learn a student function S so that S(A(·)) is equally useful as the teacher representation R(·). We measure the utility of a representation by the performance on an unspecified downstream supervised learning task.

Least squares loss. One potential approach for the task of recovering the teacher's embeddings from the corrupted inputs is to minimize the expected ℓ2 distance in embedding space between the clean teacher embeddings R(·) and the predicted student representations S(A(·)). If both R(·) and S(·) are constrained to have ℓ2-normalized outputs, then the empirical least squares loss becomes

\hat{L}_{MSE}(S) = -\frac{1}{N}\sum_{i=1}^{N} \langle S(A(x_i)), R(x_i) \rangle, \qquad (1)

where we have expanded the squared distance and subtracted out constant terms. However, this process may not yield the most effective embedding of the corrupted data. Due to either limitations in the training process, the severity of the distortion process, or the generalization properties of the approximate minimizers of L_MSE, the learned embeddings S(A(·)) may not be as useful for downstream tasks as embeddings learned using the other losses we consider.

Contrastive loss. Inspired by recent advances in self-supervised representation learning, we learn S by minimizing a contrastive loss. We consider the following variant of the popular InfoNCE loss:

\hat{L}_{contr}(S) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(K(i,i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(K(i,j)/\tau\big)}, \qquad (2)

where K(i, j) := ⟨S(A(x_i)), R(x_j)⟩ measures the similarity between the learned embedding of A(x_i) and the clean embedding of x_j, and τ is a temperature hyperparameter. We follow [5, 42] and rewrite L_contr in terms of explicit 'pull' and 'push' terms as

\hat{L}_{contr}(S) = \frac{1}{\tau}\hat{L}_{MSE}(S) + \hat{L}_{unif}(S), \qquad (3)

where the second term, referred to as the uniformity term, is defined as

\hat{L}_{unif}(S) = \frac{1}{N}\sum_{i=1}^{N} \log \sum_{j=1}^{N} \exp\big(K(i,j)/\tau\big). \qquad (4)

The first term of L_contr is simply L_MSE (scaled by 1/τ), which encourages alignment of S(A(x_i)) with R(x_i), and the L_unif term encourages the learned representations for corrupted data to be dissimilar from all other representations of clean data.
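For concreteness, the short PyTorch sketch below computes the losses in Eqs. (1)-(3), assuming batches of ℓ2-normalized student and teacher embeddings. It is a minimal illustration rather than the training code; the tensor shapes and the temperature default are assumptions.

```python
# A short PyTorch sketch of Eqs. (1)-(3), assuming L2-normalized student embeddings
# S(A(x_i)) and teacher embeddings R(x_i), each of shape (N, d). Illustrative only.
import torch
import torch.nn.functional as F

def mse_alignment_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Eq. (1): negative inner product, i.e. the squared L2 distance up to constants.
    return -(student * teacher).sum(dim=-1).mean()

def contrastive_loss(student: torch.Tensor, teacher: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # K(i, j) = <S(A(x_i)), R(x_j)>; the diagonal entries are the positive pairs.
    K = student @ teacher.t()                                    # (N, N) similarity matrix
    targets = torch.arange(K.size(0), device=K.device)
    # Row-wise -log softmax gives -K(i,i)/tau + log sum_j exp(K(i,j)/tau),
    # i.e. the alignment ("pull") term of Eq. (3) plus the uniformity ("push") term of Eq. (4).
    return F.cross_entropy(K / tau, targets)
```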
The primary difference between Equation 3 and the InfoNCE loss commonly used in self-supervised learning is the choice of the similarity measure K(·, ·). Without access to specified target embeddings, K(i, j) is typically chosen to be the inner product between projections of the embeddings of x_i and x_j (or between x_i and a positive example in the case of K(i, i)). In that setting, the uniformity term is necessary to prevent representation collapse, whereas in our setting L_MSE alone suffices to prevent this degenerate case. We note that alternative choices for K(·, ·) in L_unif may be employed. For example, we could also contrast the student embeddings S(A(·)) across two different images. We ablate against other choices in the experimental section and find that our choice of K(·, ·) is one of several that exhibit comparable performance.

Effect of uniformity term. To examine the effect of the uniformity term of the loss on the training dynamics, we consider the gradients of L_contr with respect to the parameters of the encoder S. We first decompose L_unif to consider the contributions of each individual data point:

\hat{L}_{unif}(S) = \frac{1}{N}\sum_{i=1}^{N} \hat{L}_{unif}^{i}(S), \qquad \hat{L}_{unif}^{i}(S) := \log \sum_{j=1}^{N} \exp\big(K(i,j)/\tau\big).

Then the gradient of L_contr with respect to the parameters of S may be written as

\nabla_S \hat{L}_{contr} = \frac{1}{N\tau}\sum_{i=1}^{N} \Big[ -\nabla_S K(i,i) + \sum_{j=1}^{N} w_i(j)\, \nabla_S K(i,j) \Big], \qquad w_i(j) := \frac{\exp\big(K(i,j)/\tau\big)}{\sum_{k=1}^{N}\exp\big(K(i,k)/\tau\big)}.

This can be interpreted as follows. The weighting of the first term by τ^{-1} balances the gradients ∇_S K(i, i) and Σ_j ∇_S K(i, j), as is common with other choices of contrastive losses (c.f. Theorem 2 in [41]). Noting that, for each i, the weights w_i(j) sum to 1, the gradient of each individual uniformity term L_unif^i is, up to the common factor τ^{-1}, a convex combination of the gradients of the similarity terms K(i, ·). The individual similarity terms are weighted in proportion to the exponential of K(i, j)/τ, i.e., to each term's contribution to the total uniformity loss. As τ decreases, the weights place greater emphasis on the terms that are most similar. As τ is commonly chosen to be less than 1, the dynamics automatically reflect the influence of the 'hardest' negative examples.

When is perfect recovery possible? Finally, we describe conditions which allow for perfect recovery of the clean embeddings from the corrupted images in the training set. The first condition is that the corruption process should not be too destructive: it suffices to assume that the implication

A(x_i) = A(x_j) \;\Longrightarrow\; R(x_i) = R(x_j)

holds for all x_i, x_j in the training set, i.e., there exists a function which attains exact recovery. In this case, any S* that minimizes L_MSE has S*(A(x_i)) = R(x_i) for all x_i in the training set. To argue for the same recovery guarantees when optimizing L_contr, we need to assume that the teacher R(·) provides a well-separated embedding of the training data.

Proposition 1. If R is a minimizer of the uniformity term, then any encoder S* ∈ arg min_S L_contr(S; τ, R, A) exactly recovers the target embedding, S*(A(x_i)) = R(x_i) for all x_i in the training set.

Training procedure. Our final method consists of (1) the contrastive step, in which the student learns representations for the distorted images, and (2) a fine-tuning step, where we train a linear classifier on top of the learned representations. During the second step, the student encoder is kept frozen.
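A minimal PyTorch sketch of this two-step procedure is given below, using the hyperparameters reported in the Appendix (25 contrastive epochs at learning rate 3e-4, a 10-epoch linear probe at learning rate 1e-3, and τ = 0.1). The data loader, the distort() operator implementing A, and the embedding dimension are placeholders, and the sketch omits details such as the learning-rate schedule and weight decay.

```python
# Illustrative sketch of the two-step procedure (loader, distort() and embed_dim are placeholders;
# the learning-rate schedule and weight decay reported in the Appendix are omitted for brevity).
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_robust_encoder(student, teacher, loader, distort, epochs=25, tau=0.1, lr=3e-4):
    """Step (1): contrastive step; `distort` applies the forward operator A to a batch."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                                   # labels are not used in this step
            with torch.no_grad():
                r = F.normalize(teacher(x), dim=-1)           # R(x) from the frozen teacher
            s = F.normalize(student(distort(x)), dim=-1)      # S(A(x))
            logits = s @ r.t() / tau                          # K(i, j) / tau
            loss = F.cross_entropy(logits, torch.arange(len(x), device=logits.device))
            opt.zero_grad(); loss.backward(); opt.step()
    return student

def train_linear_probe(student, loader, distort, embed_dim=512, num_classes=100,
                       epochs=10, lr=1e-3):
    """Step (2): fit a linear classifier on top of the frozen student."""
    student.eval()
    probe = nn.Linear(embed_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                s = student(distort(x))                       # frozen robust representations
            loss = F.cross_entropy(probe(s), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return probe
```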
We show that the proposed training process recovers useful representations from corrupted inputs for a variety of forward operators, evaluating several different classification tasks. We start by evaluating the representation quality when the distortion process and data distribution are the same during both training and inference, examining the label-efficiency of our approach. Then we evaluate our approach when the severity of the distortions changes at test time, when the test data distribution differs from the one used during training, and when the labels are shifted. Finally, we ablate against alternative formulations of our loss function.

For all experiments, we perform contrastive training for the robust encoder using a 100-class subset of ImageNet [36, 39], which we refer to as ImageNet-100, to reduce the computational cost. Our target representations are obtained from the CLIP ResNet-101. We find robust representations with contrastive learning and evaluate by training a linear classifier on top of the frozen representations. In all experiments, the distortions are applied randomly to each batch of images, and independently for each image in the batch. Where applicable, we report our results over 10 different instances of random corruptions applied to the evaluation images. Our baselines are built on a ResNet-101 initialized with weights from supervised training on the full ImageNet dataset. The final fully-connected layer is replaced with a 100-dimensional output. We fine-tune the whole model in a supervised fashion using distorted inputs and their correct labels from ImageNet-100. The baseline is trained for 25 epochs with a batch size of 64. Our robust encoder is trained for 25 epochs with a batch size of 256, and the linear probe on top of it is trained for 10 epochs. Additional experiments, as well as further details on training and hyperparameter choices, are discussed in the appendix and in our provided code.

In this section, we evaluate the quality of the learned robust representations for classifying images from the validation set of ImageNet-100, using the same distortions during training and inference. These experiments demonstrate the usefulness of our method for vision inverse problems where the data distribution and forward operator are both known at training time. We also demonstrate the label-efficiency of our approach by evaluating our learned representation when only a few labeled samples are available to train the linear classifier.

Setup. We train the robust contrastive model and the baseline as described above. Eight different distortion processes are examined: Gaussian blur with (kernel size, standard deviation) of (21, 5) and (37, 9); additive Gaussian noise with standard deviations 0.1, 0.3, and 0.5; and random pixel masking with 50%, 75%, and 90% of the pixels missing. We evaluate our method against the baseline in top-1 accuracy on the validation set of ImageNet-100, under the same distortion used to train each model. To demonstrate the label-efficiency of our method, we also train a linear probe with only 10% of the labeled data, and for two of our models we train linear probes using various amounts of labeled data.

In Table 1, we see that training a linear probe on top of the representations learned by our procedure greatly improves accuracy compared to the supervised baseline. This further supports our original motivation. Furthermore, we can see that using only 10% of the labeled samples is sufficient for our model to outperform the baseline in most cases. More fine-grained label-efficiency results can be seen in Figure 3, where we show that in two of these cases, using just 5% of the labeled data allows our model to outperform or remain competitive with the baseline trained using all the labels.
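The three forward operators in the Setup above can be implemented in a few lines; the sketch below shows one possible implementation rather than the exact code, with default parameter values drawn from the distortion levels listed in the text.

```python
# One possible implementation of the three forward operators A(.) from the Setup above
# (random pixel masking, additive Gaussian noise, Gaussian blur). Parameter defaults follow
# the distortion levels listed in the text; the implementation itself is illustrative.
import torch
from torchvision.transforms import GaussianBlur

def random_pixel_mask(x: torch.Tensor, missing_frac: float = 0.9) -> torch.Tensor:
    """Zero out a random fraction of pixels, shared across color channels. x: (B, C, H, W)."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h, w, device=x.device) >= missing_frac).to(x.dtype)
    return x * keep

def additive_gaussian_noise(x: torch.Tensor, std: float = 0.3) -> torch.Tensor:
    return x + std * torch.randn_like(x)

def gaussian_blur(x: torch.Tensor, kernel_size: int = 21, sigma: float = 5.0) -> torch.Tensor:
    return GaussianBlur(kernel_size=kernel_size, sigma=sigma)(x)
```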
In many settings, the data distribution and the type of distortion will be known at training time, but the severity of the distortion will be unknown. We consider the case where the severity of the distortion at test time is greater than what was seen during training.

Setup. First, we train a model using training images with between 50% and 95% of pixels randomly masked. In addition, we train a model using images with additive Gaussian noise with random standard deviation between 0.1 and 0.3. Once fully trained, we fit a linear classifier on top of the learned representations for each model using the same distortions. We also train two supervised baselines end-to-end with the same distortions. We evaluate the models trained with pixel masking on images with a fixed level of 96% to 99% missing pixels, and the networks trained with noise on images with additive Gaussian noise using a fixed standard deviation between 0.35 and 0.5.

Figure 4: Accuracies for images with varying corruption levels, using models trained on a range of levels. In the left panel, we compare our robust model with a baseline, both trained on images with 50% to 95% random pixel masking. In the right panel, each model is trained on images with additive Gaussian noise with random standard deviation from 0.1 to 0.3. We evaluate the models on images with more severe corruptions than applied during training. Results are averaged over 10 random instantiations of corruptions on the ImageNet-100 validation dataset. We omit error bars as the standard error is insignificant.

Results. In Figure 4, we see that the accuracy of both models decreases as the corruption level increases. However, the linear probe trained on the entire dataset achieves better results than the baseline end-to-end supervised model. This shows that our model is more robust, even when the distortions are greater than those expected during training.

In this section we evaluate how well ImageNet pretraining with distortions allows the learned representations to transfer to different datasets. We use the same forward operator during training and inference to isolate the quality of the embeddings learned during the contrastive step, even in the presence of distortion.

Setup. Five datasets are chosen to evaluate the transferability of robust representations: (1) CIFAR-10 [28], (2) CIFAR-100 [28], (3) STL-10 [8], (4) the COVID-19 Chest X-ray dataset [9], and (5) another random 100-class subset of ImageNet [36], generated from the remaining 900 classes we did not use for ImageNet-100, which we refer to as ImageNet-100B. We train the same models under the same distortions as outlined in Section 4.2, then fit linear classifiers for the new datasets on top of the fixed representations. Preprocessing details for each dataset may be found in the Appendix. We calculate the top-1 accuracy of each model for classifying distorted images from the validation or test set of each of the new datasets. The images are distorted using the same forward operators used to train the networks.

Results. Table 2 shows that our approach can achieve good results on a variety of datasets. On CIFAR-10 and CIFAR-100, we get results comparable to the baseline. On STL-10, we have greater top-1 accuracy for both distortion settings. For the COVID X-ray dataset, we get mixed results: we beat the baseline with the random masking model but lose with the additive Gaussian noise model.
The most surprising result is the vast increase in accuracy on the alternative ImageNet-100B dataset. This suggests that, due to the supervised training of the baseline, some information from the labels leaks into the representation, leading to worse performance on a related but ultimately different dataset.

An important factor in determining the robustness of a model is how gracefully it fails in the presence of unseen data modalities at inference time. For instance, if a network trained to distinguish dogs from cars is shown an image of a cat, the network should produce an embedding which is more likely to be classified as a dog than a car. This is even more important for inverse problems: a single distorted image may result from the same forward operator being applied to any number of original images. If a distorted image from a class outside of the training classes makes a network output an unexpected or uninformative representation, then the network is likely also brittle to shifts in the data distribution within the classes it knows at inference time. We evaluate the quality of representations produced by our method for distorted images from classes outside of the training dataset.

Setup. We use the same models described in Section 4.2. To simulate images from unseen classes, we identify 5 classes from the validation set of the full ImageNet that were not used in our training data, but are similar to classes within our training data. We replace the labels of these new images with the labels of the similar classes in our training data, as seen in Table 3. Evaluation is done on images from these replacement classes, under fixed levels of distortion, with 50% to 99% missing pixels for the random masking model and standard deviations of 0.05 to 0.5 for the Gaussian noise model.

In Figure 5, we can see that, for the random pixel masking case, we consistently outperform the baseline across all distortion levels. For the Gaussian noise case, our model has slightly lower top-1 accuracy at low noise levels. This can be explained by the fact that small distortions do not drastically alter the image, so the baseline is bolstered by its ImageNet pretraining. However, as the noise level increases, the baseline model degrades much more quickly.

Figure 6: Ablation study. We evaluate the model trained on 50% to 95% random masking of pixels, using several variants of the uniformity term in the contrastive loss. Results are averaged over 10 random instantiations of corruptions on the ImageNet-100 validation dataset. We omit error bars as the standard error is insignificant.

Finally, we ablate against various choices for the uniformity term in the contrastive loss. We consider variants of L_unif where we compare representations of i) the student and the teacher, ii) the student and the student, iii) a sum of both of the above. We also consider L_MSE, i.e. no uniformity term, as well as one variant where every representation is contrasted with every other representation (irrespective of student or teacher), which we note is the same formulation as the NT-Xent loss used in SimCLR [6].

Setup. We use the same random masking model described in Section 4.2, with a similar evaluation on fixed masking levels of 90% to 99%. Explicit formulations of each of the losses are given in the appendix.

We draw two main conclusions from Figure 6. First, the MSE loss performs worst, indicating the benefit of the uniformity term in the loss.
Even though we have access to the pretrained representations, it is not simple for the model to exploit the encoded information; if it were, MSE would be as effective as our contrastive loss. Second, we see that all student comparisons perform roughly equivalently, but are clearly more effective than MSE and NT-Xent.

In this work, we propose a method for training image representation networks which are robust to various distortions of the input data. Our method has the potential to improve the practical applications of powerful pre-trained models. Indeed, images in real-world settings are rarely pristine: every stage of the imaging process, from capture to storage to transmission and display, can introduce noise or distortions in the images. Moreover, our process helps reduce the cost of training such models, both with respect to computation, since we can add robustness to a pre-trained model instead of training one from scratch, and with respect to label efficiency, since our method mainly relies on large amounts of inexpensive unlabeled data.

Several significant open problems emerge from this work. As we can see in Appendix D.8, our method requires some prior knowledge of the type of distortions that will occur to the image. This is due to the fact that our method relies on training the student to match the representations of the teacher, which is significantly more difficult when the type of corruption (and thus, the representation itself) changes drastically. As future work, we aim to extend our method beyond this limitation, possibly by fine-tuning the teacher as well as the student, to produce representations that are more easily matched under different types of distortions. In addition, another interesting research direction is how well this method performs under adversarially-chosen inputs.

[42] T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.

As previously mentioned, the goal of our work is to create an image classification system which is robust to various distortions of the input data, which has potential applications in improving the practical use of neural network architectures. However, we have to point out some ethical considerations for the use of our work. On the one hand, allowing more applications to access the power of deep learning may not necessarily have a positive impact, since this largely depends on the goals of the application itself. On the other hand, care should be taken when choosing the dataset on which to train the student encoder, as well as the model to serve as the teacher encoder. Since our proposed process aims to distill information from the teacher, as well as use a specific dataset to perform classification, the presence of biases in either may carry over to our student encoder, which is undesirable. It is paramount for the ethical application of our work to distill information from models without any such biases.

B Proof of Proposition 1

Proposition 1 (restated). If R is a minimizer of the uniformity term, then any encoder S* ∈ arg min_S L_contr(S; τ, R, A) exactly recovers the target embedding, S*(A(x_i)) = R(x_i) for all x_i in the training set.

Proof. We are interested in the minimizers of the function L_contr. To simplify notation, we denote S_i := S(A(x_i)) and R_i := R(x_i), and write

\hat{L}_{contr}(S) = \frac{1}{N}\sum_{i=1}^{N} F_i(S_i), \qquad F_i(S_i) := -\frac{1}{\tau}\langle S_i, R_i\rangle + \log\sum_{j=1}^{N}\exp\big(\langle S_i, R_j\rangle/\tau\big).

Since each S_i only appears in one term of the above sum, it suffices to show that F_i is uniquely minimized by the argument R_i.
The only assumption we make on the R_i follows from [42], which shows that if R minimizes the uniformity term, then Σ_j R_j = 0. We prove the claim directly by showing that F_i(S_i) − F_i(R_i) > 0 for any S_i with ‖S_i‖ = 1 and S_i ≠ R_i. Fix some S_i with norm 1 and suppose that ⟨S_i, R_i⟩ = 1 − δ. Then, from an argument by cosine distance, we see that replacing R_i with S_i cannot alter the dot product ⟨R_i, R_j⟩ by more than δ for any j: |⟨R_i − S_i, R_j⟩| ≤ δ for all R_j. Using the optimality of R, we claim there exists some j such that this bound is strict on one side, i.e., there must exist a j such that

⟨R_j, S_i⟩ > ⟨R_j, R_i⟩ − δ,

because if not, then ⟨R_j, S_i⟩ = ⟨R_j, R_i⟩ − δ for all j. Summing both sides over j and using the fact that Σ_j R_j = 0 generates the contradiction that 0 = −Nδ. Consequently,

\exp\big(\langle S_i, R_j\rangle/\tau\big) \ge \exp(-\delta/\tau)\,\exp\big(\langle R_i, R_j\rangle/\tau\big) \quad \text{for all } j,

where the inequality follows from the monotonicity of exp(·), and where it is strict for at least one j. We can now directly compare the two values:

F_i(S_i) - F_i(R_i) = \frac{\delta}{\tau} + \log\frac{\sum_j \exp\big(\langle S_i, R_j\rangle/\tau\big)}{\sum_j \exp\big(\langle R_i, R_j\rangle/\tau\big)} > \frac{\delta}{\tau} + \log\exp(-\delta/\tau) = 0.

Hence, F_i has a unique global minimizer at R_i.

As mentioned in the main text, in Section 3 we consider variations on the uniformity term in L_contr. We explicitly define these variations here. Recall that

\hat{L}_{unif}(S) = \frac{1}{N}\sum_{i=1}^{N} \log\sum_{j=1}^{N} \exp\big(K(i,j)/\tau\big).

In the main text, we use the 'student vs. teacher' uniformity loss. We explicitly define this variant and all others considered in the following list. We abuse notation slightly and use the function K_τ(·, ·) defined as

K_\tau(u, v) := \exp\big(\langle u, v\rangle/\tau\big),

noting that in the main text the arguments to K were indices, while here they are the embedding vectors themselves. As above, we let S(A(x_i)) and R(x_i) be denoted by S_i and R_i, respectively.

• Student vs. Teacher: This loss compares the noisy student embeddings to the clean teacher embeddings, denoted as

\hat{L}_{unif}^{ST}(S) = \frac{1}{N}\sum_{i=1}^{N} \log\sum_{j=1}^{N} K_\tau(S_i, R_j).

• Student vs. Student: This loss compares pairs of noisy student embeddings:

\hat{L}_{unif}^{SS}(S) = \frac{1}{N}\sum_{i=1}^{N} \log\sum_{j \neq i} K_\tau(S_i, S_j).

• Student vs. Both: This loss combines the above two losses, where the combination occurs inside the logarithm:

\hat{L}_{unif}^{SB}(S) = \frac{1}{N}\sum_{i=1}^{N} \log\Big(\sum_{j=1}^{N} K_\tau(S_i, R_j) + \sum_{j \neq i} K_\tau(S_i, S_j)\Big).

Datasets. For all experiments, we pre-train the robust encoder as well as the baseline using a randomly chosen 100-class subset of the ImageNet dataset [36]. ImageNet consists of 1,000 classes of objects, with 1.2M training images and 50K validation images. The portion we use is the same subset of ImageNet used by [39], and contains 126,689 training and 5,000 validation images. We refer to this dataset as ImageNet-100. For the transfer learning task, we make use of five datasets: (1) CIFAR-10 [28], (2) CIFAR-100 [28], and (3) STL-10 [8]. (4) The COVID-19 Chest X-ray dataset [9], which consists of X-rays taken of patients with healthy lungs as well as lungs affected by pneumonia resulting from COVID-19; the dataset contains 5,286 training images and 624 test images. (5) We generate another random 100-class subset of ImageNet [36] from the remaining 900 classes we did not use for ImageNet-100, which we refer to as ImageNet-100B. This subset contains 128,987 training images and 5,000 validation images.

Image Pre-Processing. During contrastive training, we randomly crop the image and resize it to a height and width of 224 pixels, then apply a random horizontal flip. The resulting image is normalized to have a mean of 0 and standard deviation 1 in each color channel before being fed to the input of the teacher. A copy of the randomly-cropped and flipped image is distorted with a given forward operator and normalized to feed to the student input. We pre-process the images for training the baseline by applying a random crop, resizing to 224 × 224 pixels, applying a given distortion, and normalizing the pixel values.
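A minimal torchvision sketch of this paired contrastive-training preprocessing is shown below; the per-channel normalization statistics (the standard ImageNet values here) and the distort() operator are assumptions for illustration, not necessarily the exact values used.

```python
# Minimal torchvision sketch of the paired contrastive-training preprocessing described above:
# the teacher receives the clean crop and the student receives a distorted copy of the same crop.
# The per-channel normalization statistics (standard ImageNet values here) and distort() are
# assumptions for illustration.
from torchvision import transforms

crop_and_flip = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def make_training_pair(pil_image, distort):
    """Return (teacher_input, student_input) built from one shared random crop and flip."""
    view = crop_and_flip(pil_image)              # shared random crop + horizontal flip
    teacher_input = normalize(view)              # clean view, fed to the teacher R
    student_input = normalize(distort(view))     # distorted view A(x), fed to the student S
    return teacher_input, student_input
```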
During validation and testing, we resize each image to a height and width of 256 pixels, take a center crop of 224 × 224 pixels, apply a distortion, and finally normalize the image. We vary the train and test distortions depending on the setting we evaluate. For images from CIFAR-10, CIFAR-100, and STL-10, we pre-process the training images by resizing to 224 × 224 pixels, applying a random horizontal flip, distorting with a given forward operator, and finally normalizing the images. For validation images from these datasets, we perform the same process without the random flip. For training images from the COVID-19 X-ray dataset and ImageNet-100B, we apply a random crop, resize to 224 × 224 pixels, apply a given distortion, and normalize the pixel values. For validation images from these datasets, instead of a random crop and resize, we resize to 256 × 256 pixels and take a 224 × 224 center crop of the image, then apply a distortion and normalize.

We apply random pixel masking, Gaussian blur, and additive Gaussian noise as our distortions in the various settings we evaluate. Random pixel masking sets the intensity of a randomly-chosen set of pixels in an image to 0. Gaussian blur convolves the image with an isotropic Gaussian kernel, and additive Gaussian noise applies additive white Gaussian noise to each pixel in the image.

Training Hyperparameters. We train our method and the supervised baseline using the Adam optimizer with default values for β and a cosine learning rate schedule [26]. For the baseline, we use a learning rate of 0.001 and a batch size of 64. For our method, we use a learning rate of 0.0003, weight decay of 0.0001, and a batch size of 256, and we set the temperature τ from Eq. (3) to be 0.1. We train both the baseline and our method for 25 epochs. The baseline is optimized using the cross-entropy loss.

We train a linear classifier on top of the learned representations for our method and the baseline for each set of experiments. The linear classification layer is optimized with the Adam optimizer, using a learning rate of 0.001 and a batch size of 128, and is trained for 10 epochs. For label-efficiency experiments, we lower the batch size to 8 to compensate for the decrease in the data used. We freeze the learned backbone network during training of the classifier. The classifier is trained using images that have the same distortions as were used to train the backbone network. During transfer learning with the baseline, we remove the classification layer used for ImageNet-100 pre-training and replace it with a randomly-initialized layer of the appropriate dimension.

We choose the hyperparameters for each model using a linear search over several values for each hyperparameter, based on the highest mean validation accuracy we achieve on ImageNet-100 after one epoch of training on each distortion type. Due to computational limitations, we did not use a separate validation set for hyperparameter tuning, but rather extrapolated from these results over a single epoch of training. The values searched over are as follows:
• The learning rate for our method was searched in the range [10^-4, 10^-2]. For the baseline, the search was over [10^-5, 5 · 10^-1].
• The weight decay was searched in the range [10^-4, 10^-3].
• The temperature parameter τ was searched in the range [0.1, 1].
The batch size was set as high as possible with our computing hardware. For our method, this was done since this form of contrastive loss benefits from a greater batch size.
For the baseline, this choice improved performance. All experiments were performed on a system with 4 Nvidia Quadro RTX5000 GPUs, 2 Intel Xeon E5-2620 v4 CPUs, and 128GB of RAM. Experiments using CLIP networks were run using 16-bit floating point models and data. We use the PyTorch implementation of the supervised ImageNet-trained ResNet-101 [34].

For the sake of completeness, we provide the full metrics for our experiments, which also include top-5 accuracies for the experiments performed. The conclusions we can draw from the top-1 accuracy results do not change when we look at the top-5 accuracies. For the transfer learning task on COVID X-ray data, we present the area under the receiver operating characteristic curve (AUC). This dataset presents a binary classification problem, so top-5 metrics do not apply.

Figure: In the left panel, we compare our robust model with a baseline, both trained on images with 50% to 95% random pixel masking. In the right panel, each model is trained on images with additive Gaussian noise with random standard deviation from 0.1 to 0.3. We evaluate the models on images with more severe corruptions than applied during training. Results are averaged over 10 random instantiations of corruptions on the ImageNet-100 validation dataset. We omit error bars as the standard error is insignificant.

An alternative way of solving the inverse problem presented in the paper is to use methods which operate directly on the pixel space of the image. Instead of training a classifier which operates on distorted images, one could try to recover the original images from the distorted versions. The recovered images can then be given to a classifier trained on clean images. We compare our method with two such baselines which operate in pixel space:
1. We apply Non-Local Means (NLM) [2] denoising to images corrupted by additive Gaussian noise.
2. We use Deep Decoder [16] with default parameters and 5000 optimization steps to perform inpainting on images with random missing pixels.
Deep Decoder is a method which randomly initializes an under-parameterized generative network with upsampling and 1 × 1 convolution layers, then optimizes over the weights of the network to fit a single distorted image. Since the network is under-parameterized, it cannot fit noise very well, while its upsampling and convolution layers bias it to produce natural-looking images. The result is that the network produces a reconstructed version of the original image, despite never having been trained on any other data.

For evaluation, we corrupt images from the ImageNet-100 validation set with Gaussian noise and random pixel masks, then apply NLM and Deep Decoder, respectively. We feed the recovered images as input to a classification model trained on clean image data. The classification model we use is an ImageNet pre-trained ResNet-101 backbone with a linear classifier trained on clean ImageNet-100 images. We compare the performance of these pixel-space inverse methods with that of our method. The results for denoising are seen in Table 7 and the results for inpainting in Table 8. We can see that the inverse methods acting in pixel space are not reliable at reconstructing images for classification. In the case of denoising, the accuracy degrades quickly with increasing noise, and for inpainting, the accuracy is poor for all masking levels.

To demonstrate the ability of our method to retrieve good representations from the teacher, we perform the following experiment.
We train a robust encoder on distorted images using the method we propose. We then train a linear classifier on top of the pre-trained, non-robust CLIP backbone using clean images. Finally, we transfer this linear classifier for clean images to the robust encoder. The results can be seen in Table 9. We see that our technique achieves good results, even without fine-tuning the linear classifier on distorted images. This means that the representations learned by the student for distorted images are sufficiently close to those of the teacher for clean images.

Extending the results from Section 4.2, we evaluate both the baseline and our robust encoder in the setting where the noise levels seen during testing are lower than those seen during training. The results can be seen in Table 10, and they are in accord with the rest of our observations: accuracy for both models is higher (due to the lower noise), and our method still outperforms the baseline.

We examine a variant of our baseline model where, instead of using a ResNet pretrained on the full ImageNet dataset, we train our ResNet on ImageNet-100 to get a good classifier on the clean images, and then use that as the starting point for our baseline model. More specifically, the chosen architecture is again ResNet-101, trained in a supervised fashion on clean images for 90 epochs, with an SGD optimizer, a learning rate of 0.1, momentum of 0.9, and a batch size of 256. These results can be seen in Tables 11 and 12. We can see that these results are comparable to (and in most cases worse than) our original baseline. This is to be expected, since the model which is initialized with full ImageNet weights was trained on roughly 10 times more data than this new model. In any case, the results are still worse than our method, which means that the latter is capable of outperforming the stronger of the two baselines.

Table 11: Comparison of initial weights for the supervised baseline, for the fixed noise experiment. Training of the supervised baseline can be done starting from a ResNet-101 which was trained on the full ImageNet, or from one which was trained on only ImageNet-100. We can see that the first choice, which is the one used in the rest of this paper, is the better of the two. In any case, both baselines provide worse results than our method (compare to Table 1).

To understand how much we can gain from performing contrastive training to make image representations more robust, we would like to see how well the original CLIP network performs for classifying distorted images. We train a linear classifier on top of the pre-trained, non-robust CLIP backbone using distorted images (i.e. the network has not been trained with a contrastive step). We also train a linear classifier on top of the ImageNet pre-trained ResNet-101 backbone using distorted images. We present the results in Table 13. Clearly, both CLIP and the ImageNet pre-trained ResNet-101 produce poor representations for distorted images, resulting in low classification accuracy. These results highlight the need for a training procedure to make these pre-trained representations more robust.

To train our model and the baseline, we use a randomly-chosen 100-class subset of the original ImageNet dataset. This subset is the same as used in [39].
We present the wnid of each of the classes in the subset in Table 15.

D.10 ImageNet-100B Classes

For the transfer learning experiments, we select a random subset of 100 classes from ImageNet which are mutually exclusive with the classes found in ImageNet-100. We term this subset ImageNet-100B, and present the wnid of each of the classes used in this subset in Table 16.

Table 15: List of ImageNet-100 Classes. We present the wnid of each of the classes used in ImageNet-100. These classes are randomly sampled from the original ImageNet dataset and are the same classes used in [39].
ImageNet-100 Classes: n02869837 n02086910 n03785016 n02483362 n03837869 n01749939 n02859443 n03764736 n04127249 n03494278 n02488291 n13040303 n03775546 n02089973 n04136333 n02107142 n03594734 n02087046 n03017168 n03794056 n13037406 n02085620 n07836838 n02093428 n03492542 n02091831 n02099849 n04099969 n02804414 n02018207 n04517823 n01558993 n04592741 n02396427 n04067472 n04589890 n04493381 n03891251 n04418357 n03930630 n03062245 n02109047 n02701002 n02172182 n03584829 n01773797 n04111531 n03379051 n01729322 n02123045 n01735189 n02877765 n02259212 n02113978 n04229816 n07831146 n04429376 n07715103 n03787032 n02100583 n07753275 n02009229 n03947888 n02089867 n03642806 n03085013 n01978455 n04026417 n02119022 n04336792 n04485082 n02106550 n02326432 n03777754 n03259280 n02105505 n01820546 n03637318 n04238763 n02116738 n01983481 n01692333 n01980166 n02231487 n02108089 n02788148 n07714571 n02113799 n03032252 n03424325 n03530642 n02974003 n02086240 n02138441 n01855672 n04435653 n02114855 n03903868 n02104029 n02090622

Table 16: List of ImageNet-100B Classes. We present the wnid of each of the classes used in ImageNet-100B. These classes are randomly sampled from the original ImageNet dataset and are mutually exclusive with the classes in ImageNet-100.
ImageNet-100B Classes: n02088364 n02840245 n04258138 n03670208 n02013706 n03000134 n01688243 n02280649 n03483316 n02797295 n03544143 n03920288 n02492660 n02777292 n04366367 n03388043 n02488702 n03782006 n03602883 n03857828 n02165105 n03884397 n03495258 n03982430 n04243546 n02321529 n07745940 n04254120 n02808440 n03891332 n01819313 n01484850 n02391049 n03207743 n03796401 n03187595 n04147183 n04254777 n02096177 n03314780 n01667114 n04356056 n07716906 n01742172 n04039381 n02097130 n01644900 n03888605 n03792782 n01498041 n02104365 n02132136 n01843065 n01795545 n01990800 n02279972 n02999410 n02643566 n03534580 n03976657 n12985857 n02457408 n04515003 n03814906 n02107683 n01773549 n04540053 n03125729 n02342885 n02229544 n12057211 n02776631 n04179913 n03692522 n03063599 n02791270 n02107574 n03788365 n03272010 n03127747 n01491361 n02930766 n02098286 n09468604 n03179701 n02169497 n04263257 n02109525 n03720891 n03016953 n02793495 n03724870 n02966193 n03717622 n09193705 n02281787 n04081281 n03929660 n02177972 n04033901

In this section, we present several examples of the fixed distortions we apply during training and testing. All images are taken from the ImageNet-100 validation set.

Figure 11: Visualization of the various fixed distortions we test.

Table 12: Similar to Table 11, we compare the two choices for the baseline for the experiment with varying noise levels. Again, training with the full ImageNet dataset provides a stronger baseline in most cases.

D.8 Experiments on ImageNet-100C

As a final benchmark, we compare our methods on a subset of ImageNet-C [18] with the same classes as those of ImageNet-100, henceforth referred to as ImageNet-100C. We compare two models:
• The first is a baseline ResNet-101.
• The second is a version of our student encoder, which is initialized from CLIP and is trained on distorted images.
If the type of noise is altered for the student, then it is difficult for it to match the teacher representations, which are fixed. Indeed, altering the type of noise on an image is expected to greatly affect its representation. Thus, at its present iteration, our technique relies on some prior knowledge about the type of distortion encountered.

References (titles only):
• Invertible generative models for inverse problems: mitigating representation error and dataset bias
• Non-local means denoising
• Model compression
• Unsupervised learning of visual features by contrasting cluster assignments
• Intriguing properties of contrastive losses. CoRR, abs
• A simple framework for contrastive learning of visual representations
• Big self-supervised models are strong semi-supervised learners
• An analysis of single-layer networks in unsupervised feature learning
• Covid-19 image data collection
• Measuring robustness in deep learning based compressive sensing
• Modeling sparse deviations for compressed sensing using generative models
• A study and comparison of human and deep learning recognition performance under visual distortions
• Fast and provable ADMM for learning with generative priors
• Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
• Momentum contrast for unsupervised visual representation learning
• Deep decoder: Concise image representations from untrained nonconvolutional networks
• Data-efficient image recognition with contrastive predictive coding
• Benchmarking neural network robustness to common corruptions and perturbations
• Distilling the knowledge in a neural network
• Contrastive learning with adversarial examples
• Google's cloud vision API is not robust to noise
• Robust pre-training by adversarial contrastive learning
• Task-aware compressed sensing with generative adversarial networks
• Adversarial self-supervised contrastive learning
• Adam: A method for stochastic optimization. CoRR, abs/1412
• Contrasting contrastive self-supervised representation learning models
• Learning multiple layers of features from tiny images
• Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
• Sparse MRI: The application of compressed sensing for rapid MR imaging
• Deep learning techniques for inverse problems in imaging
• Representation learning with contrastive predictive coding
• Inference with deep generative priors in high dimensions
• PyTorch: An imperative style, high-performance deep learning library
• Learning transferable visual models from natural language supervision
• ImageNet large scale visual recognition challenge
• Limits on support recovery with probabilistic models: An information-theoretic framework
• CURE-OR: Challenging unreal and real environments for object recognition
• Contrastive multiview coding
• Contrastive representation distillation
• Understanding self-supervised learning with dual deep networks. CoRR, abs