title: Core Risk Minimization using Salient ImageNet
authors: Singla, Sahil; Moayeri, Mazda; Feizi, Soheil
date: 2022-03-28

Deep neural networks can be unreliable in the real world, especially when they heavily use spurious features for their predictions. Recently, Singla & Feizi (2022) introduced the Salient Imagenet dataset by annotating and localizing core and spurious features of ~52k samples from 232 classes of Imagenet. While this dataset is useful for evaluating the reliance of pretrained models on spurious features, its small size limits its usefulness for training models. In this work, we first introduce the Salient Imagenet-1M dataset with more than 1 million soft masks localizing core and spurious features for all 1000 Imagenet classes. Using this dataset, we evaluate the reliance of several Imagenet pretrained models (42 total) on spurious features and observe that: (i) transformers are more sensitive to spurious features compared to Convnets, (ii) zero-shot CLIP transformers are highly susceptible to spurious features. Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features. We evaluate different computational approaches for solving CoRM and achieve significantly higher (+12%) core accuracy (accuracy when non-core regions are corrupted using noise) with no drop in clean accuracy compared to models trained via Empirical Risk Minimization.

Decision making in high-stakes applications such as medicine, finance, autonomous driving, law enforcement and criminal justice is increasingly driven by deep learning models, thereby raising concerns about the trustworthiness and reliability of these systems in the real world. A root cause for the lack of reliability of deep models is their heavy reliance on spurious input features (i.e., features that are not essential to the true label) in their inferences.

Figure 1. (Bottom) Images perturbed to visually amplify the feature. The feature (likely mountain) is spurious for 3 Imagenet classes: Alpine Ibex, Ski, Marmot. Surprisingly, the feature is also core (essential) for 3 different classes: Mountains, Valley, Volcano.

For example, DeGrave et al. (2021) discovered that a convolutional neural network (CNN) trained to detect COVID-19 from chest radiographs uses spurious text-markers for its predictions. Similarly, Zech et al. (2018) observed that a CNN trained to detect pneumonia from chest X-rays had unexpectedly learned to identify particular hospital systems with near-perfect accuracy (e.g., by detecting a hospital-specific metal token on the scan) with poor generalization to novel hospital systems. The list of such examples goes on (Beery et al., 2018; de Haan et al., 2019; Bissoto et al., 2020). To highlight the complexity of this issue, in Figure 1, we show an example of a spurious feature that is common across 3 classes. Surprisingly, this feature is also core (essential) for 3 other classes, showing that while deep models excel at pattern recognition, they can struggle to discern which patterns are core for a class, at times incorrectly making use of spurious patterns recognized elsewhere. The standard Empirical Risk Minimization (ERM) paradigm for training deep neural networks is brittle when the test distribution differs from the training distribution because of spurious features. Recently, Arjovsky et al.
(2020) proposed a framework called Invariant Risk Minimization (IRM) to address this problem. IRM and its variants (Krueger et al., 2021; Xie et al., 2020; Mahajan et al., 2021) posit the existence of a feature embedder such that the optimal classifier on top of these features is the same for every environment from which data can be drawn. However, Rosenfeld et al. (2021) show that IRM can fail catastrophically unless the test distribution is sufficiently similar to the training distribution. In such cases, however, IRM would no longer be required; we would expect ERM to perform just as well. The above methods use single-label supervision (i.e. an image is labeled only by class index). One can argue that such limited annotations may restrict the model's ability to learn from meaningful features in its predictions since the model is not given the information regarding which features are essential/core and which ones are redundant/spurious. Much of the prior work on discovering spurious features (Nushi et al., 2018; Zhang et al., 2018; Chung et al., 2019; Xiao et al., 2021) require expensive human-guided labeling of visual attributes, which is not scalable for datasets with a large number of classes and images such as Imagenet. However, Singla & Feizi (2022) recently introduced an approach for discovering spurious features at scale using the neurons of robust models as visual attribute detectors. An application of their approach on a subset of Imagenet (Deng et al., 2009) resulted in a dataset called the Salient Imagenet whose samples, in addition to class labels, are annotated by two sets of masks: core masks that highlight core/essential attributes (with respect to the true class) and spurious masks that highlight attributes co-occurring with the object but not a part of it. While valuable for evalation, the size of the Salient ImageNet dataset (232 classes, ∼52k images) limits its utility for training models. In this work, we significantly expand the size of the Salient Imagenet dataset in two steps (Section 4). First, for each class i ∈ Y − T 1 (the set of remaining 1000 − 232 = 768 classes), we identify the top-5 penultimate layer neurons of a robust model highly predictive of i. We then conduct a Mechanical Turk (MTurk) study for each of these (class, neuron) pairs to determine whether the neuron is core or spurious for the class, resulting in new annotations for 768 × 5 = 3840 pairs, for a total of 5000 core/spurious annotations (4370 core and 630 spurious), when combined with those of Singla & Feizi (2022) . Second, for each (class=i, feature=j) pair, we conduct another MTurk study to validate that the neural activation maps (NAMs) for these neurons highlight the same visual attribute for a large number of images. For images with label i, we select the subset with the top-260 values of feature j. Next, we ask workers to validate whether the NAMs of 15 images (randomly selected from 260) focus on the same visual attribute. We validate that for 95.26% of pairs, NAMs indeed focus on the same visual attribute. The resulting dataset, called Salient Imagenet-1M, contains more than 1 million core/spurious mask annotations. Using our test set, we study a diverse set of pretrained Imagenet models and training paradigms (42 models total) in Section 6. 
We find that: (i) transformers are more sensitive to spurious features compared to Convnets, (ii) adversarial training makes Resnets more sensitive to spurious features, (iii) zero-shot CLIP transformers are highly susceptible to spurious features, (iv) models with the same clean accuracy can have vastly different core accuracy (i.e., accuracy when the non-core regions are corrupted using noise). We next aim to train models that mainly use core features in their predictions. (Throughout, Y denotes the set of all 1000 Imagenet classes and T the set of classes analyzed by Singla & Feizi (2022).) To this end, we propose a training paradigm called Core Risk Minimization (CoRM) in Section 7. We note that in some cases, spurious features can be useful. Consider the class "matchstick": the brightness of a flame (spurious) may obscure core features beyond recognition, but the flame itself provides evidence for the presence of a matchstick (see Appendix B). Thus, when core features are absent or not known to be in the image, we want our objective to reduce to standard ERM. Based on this desideratum, we formulate our CoRM objective as follows:

min_θ E_{(x, c, y)} E_{z ∼ N(0, σ²I)} [ ℓ(f_θ(x*), y) ],   where   x* = x + (1 − c) ⊙ z.   (1)

Here, c denotes the core mask (i.e., c_{i,j} = 1 iff x_{i,j} is a core pixel), y is the ground truth label for x, f_θ(x*) denotes the logits, ℓ is the cross-entropy loss and θ denotes the model parameters. Note that x*_{i,j} = x_{i,j} for all i, j such that c_{i,j} = 1, which ensures that x* has the same core features as x. However, the non-core regions are corrupted using Gaussian noise with variance σ². (We could instead corrupt the spurious mask s rather than the non-core mask 1 − c in our formulation; we choose non-core masks because the Salient Imagenet dataset contains a significantly larger number of core masks than spurious masks.) Note that one can easily modify our CoRM formulation to use corruptions from other noise distributions or even adversarial corruptions on non-core regions. We use Gaussian noise because of its simplicity and because it easily allows us to control the degree of contextual information from non-core regions in the image using the parameter σ. When the core masks are unknown, we can set c_{i,j} = 1 (for all i, j) so that x* = x. This allows us to recover the standard ERM objective. Finally, note that our CoRM objective can also be used to train models when the core masks c are soft (i.e., 0 ≤ c_{i,j} ≤ 1 rather than c_{i,j} ∈ {0, 1}).

In our CoRM objective (1), we need to compute the inner expectation over the Gaussian distribution in a high-dimensional space, which can be difficult. Thus, to train models using the Salient Imagenet-1M dataset, we evaluate different variations for training using CoRM: (i) randomly adding Gaussian noise to the non-core regions during training, (ii) saliency regularization that penalizes the gradient norm in the non-core (i.e., 1 − c) regions. We show that by combining these two techniques, we achieve significantly higher (+12%) core accuracy, while improving the clean accuracy compared to ERM trained models. In summary, we make the following contributions:

• We introduce the Salient Imagenet-1M dataset with core and spurious masks for more than a million images in all Imagenet classes.

• We comprehensively study the reliance on spurious features for 42 pretrained Imagenet models and training procedures, discovering interesting trends.
• We introduce Core Risk Minimization (CoRM), a new learning paradigm to train models that mainly rely on core features in their predictions, leading to strong empirical results (+12% core, +0.58% clean accuracy in Table 1 ). Interpretability: Most of the existing works on post-hoc interpretability techniques focus on inspecting the decisions for a single image (Zeiler & Fergus, 2014; Mahendran & Vedaldi, 2016; Dosovitskiy & Brox, 2016; Yosinski et al., 2016; Nguyen et al., 2016; Adebayo et al., 2018; Zhou et al., 2018; Chang et al., 2019; Olah et al., 2018; Yeh et al., 2019; Carter et al., 2019; O'Shaughnessy et al., 2019; Sturmfels et al., 2020; Verma et al., 2020) . These include saliency maps (Simonyan et al., 2014; Sundararajan et al., 2017; Smilkov et al., 2017; Singla et al., 2019) , class activation maps (Zhou et al., 2016; Selvaraju et al., 2019; Bau et al., 2020; Ismail et al., 2019; 2020) , surrogate models to interpret local decision boundaries such as LIME ( can only analyze the failures of robust models which achieve lower accuracy than standard (non-robust) models. Barlow (Singla et al., 2021) can analyze the failures of any model (standard/robust) but is not useful for highly accurate models. The framework of Singla & Feizi (2022) addresses these limitations and is discussed in Section 4.1. Domain generalization: In this setting, we aim to learn predictors which generalize to test distributions different from training data. The frameworks studied either make assumptions about covariate/label shifts (Widmer & Kubát, 2004; Bickel et al., 2009; Lipton et al., 2018) , or that test distribution is in some set around training data (Bagnell, 2005; Rahimian & Mehrotra, 2019) , or that training data is sampled from distinct distributions (Blanchard et al., 2011; Muandet et al., 2013; Sagawa* et al., 2020) . Many works provide formal guarantees by assuming invariance in the causal structure of the data (Tian & Pearl, 2001; Didelez et al., 2006; Peters et al., 2016; Heinze-Deml et al., 2018; Heinze-Deml & Meinshausen, 2021; Christiansen et al., 2021) . IRM (Arjovsky et al., 2020) was also designed for this setting but lacked strong theoretical guarantees. Since 3 then, there have been many works on improving the IRM objective (Xie et al., 2020; Chang et al., 2020; Ahuja et al., 2020; Krueger et al., 2021; Mahajan et al., 2021) and comparing ERM and IRM from theoretical (Ahuja et al., 2021; Rosenfeld et al., 2021; 2022) and empirical perspectives (Gulrajani & Lopez-Paz, 2021) . The activation vector in the penultimate layer (after global average pooling) of a trained neural network is called the neural feature vector. Each element of this vector is called a neural feature. For an image x and neural feature j, we can obtain the Neural Activation Map or NAM (similar to CAM by Zhou et al. (2016) ) that provides a soft mask for the highly activating pixels in x for the feature j. The corresponding heatmap can be obtained by overlaying the NAM on top of x so that the red region highlights the highly activating pixels. The feature attack (Engstrom et al., 2019) is generated by optimizing the image x to increase the value of feature j. These methods are discussed in more detail in Appendix D. For each class i ∈ Y, we define core features (denoted by C(i)) as the set of features that are always a part of the object i, spurious features (denoted by S(i)) are the ones that are likely to co-occur with i, but not a part of it. 
Example visualizations for core and spurious features (using heatmaps, feature attack) are in Figures 2 and 3 respectively. The original Salient Imagenet dataset introduced by Singla & Feizi (2022) is limited to 232 classes (denoted by T ) with a total size of 52, 521 images (≈ 226 images per class). Each instance in the dataset is of the form (x, y, M c , M s ) where y is the ground truth label and M c /M s denote the set of core/spurious masks for the image x respectively. By adding noise to the core/spurious regions using these masks and observing the drop in the accuracy, this dataset can be used to test the sensitivity of any pretrained model to different visual features. While useful for evaluating models, the relatively small size of this dataset limits its usefulness for training large models. Thus, our first goal is to significantly expand the size of the Salient Imagenet dataset so that we can use it for training deep models that mainly rely on core features for their inferences. For each class i ∈ T , using an adversarially trained (robust) model, Singla & Feizi (2022) first identified the 5 neural features that are most predictive of i using the Neural Feature Importance scores (details in Appendix C) resulting in 232 × 5 = 1160 (class, feature) pairs. Next for each class i ∈ T , they annotated each of these 5 features as core or spurious using a Mechanical Turk (MTurk) study. To obtain the core/spurious annotation for each (class=i, feature=j) pair, they showed the MTurk workers two panels: one describes the class i while the other visualizes the feature j. To describe the class i, they showed the object names (Miller, 1995) , object supercategory (Tsipras et al., 2020) , object definition, wikipedia links and 3 images with label i from the Imagenet validation set. The feature j is visualized using 5 images with predicted class i that maximally activate the feature j, their heatmaps and feature attack visualizations (Examples in Figures 2 and 3 ). Next, they asked the workers to determine whether the visual attribute (inferred by visualizing j) is a part of the main object (i.e. class i), some separate objects, or the background. They also required the workers to provide reasons for their answers and rate their confidence on a likert scale from 1 to 5. The design for this study is shown in Appendix Figure 11 . Each of these Human Intelligence Tasks (or HITs) were evaluated by 5 workers. The HITs for which majority of the workers (i.e. ≥ 3) voted for either separate object or background were deemed to be spurious and the ones with main object as the majority vote were deemed to be core. Discovering core/spurious features for 768 classes: We used the same procedure discussed in Section 4.1 to obtain core/spurious annotations for the remaining 768 classes (denoted by Y − T ). Out of the total 768 × 5 = 3840 (class, feature) pairs that we evaluated, 3, 372 are deemed to be core and 468 to be spurious by the workers. On merging the annotations from Singla & Feizi (2022), we obtain such annotations for all 1, 000 classes of Imagenet. In total, we obtain 4, 370 core and 630 spurious class, feature pairs. For 357 classes, we discover at least 1 spurious feature. For 15 classes, all 5 features were found to be spurious (shown in Figure 4 ). We visualize several spurious features in Appendix K.1 (background) and K.2 (foreground). 
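The Neural Feature Importance ranking used to select these candidate features (detailed in Appendix C) amounts to scoring each penultimate-layer feature by its mean contribution to a class logit. A minimal sketch follows (PyTorch-style; the function and variable names are our own illustration and not part of a released API):

    import torch

    @torch.no_grad()
    def top_candidate_features(features, preds, weight, class_idx, k=5):
        """Rank penultimate-layer ("neural") features by importance for one class.

        features:  (N, D) neural feature vectors of training images (robust model)
        preds:     (N,) predicted class index for each image
        weight:    (num_classes, D) weight matrix of the last linear layer
        class_idx: the class i whose top-k features we want to annotate
        """
        r_i = features[preds == class_idx].mean(dim=0)   # mean feature vector r(i)
        importance = r_i * weight[class_idx]             # (r(i) ⊙ w_{i,:})_j for every feature j
        return importance.topk(k).indices                # the k most predictive features for class i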
Discovering large sets of images containing core/spurious features: To further expand the Salient Imagenet dataset, we validate that for the (class=i, feature=j) pair, the visual attribute inferred by visualizing the feature j (using top-5 images with prediction i) is also highlighted by the NAMs for the images with ground truth label i and top-k (k 5) values of feature j. This is expected because in the standard ERM paradigm for training deep models, a model will learn to associate a visual attribute with the class i only if the dataset contains a sufficiently large number of images with ground truth label i containing the same attribute. To validate that NAMs indeed focus on the desired attributes, we conducted another MTurk study. For the (class=i, feature=j) pair, we first obtain the training images with label i (≈ 1300 images/label in Imagenet) and top-260 activations (20% of 1300) of the feature j. From this set of 260 images, we selected 5 images with the lowest activations of feature j and randomly selected 10 images from the remaining set (excluding the already selected images). We show the workers three panels. The first panel shows images and heatmaps with the highest 5 activations, the second with next 5 highest activations and the third with lowest 5 activations. For each heatmap, workers were asked to determine if the highlighted attribute looked different from at least 3 other heatmaps in the same panel. Next, they were asked to determine if the heatmaps in the 3 different panels focused on the same visual attribute, different attributes or if the visualization in any of the panels was unclear. The design of the study is shown in Appendix Figure 12 . For all (class, feature) pairs (1000 × 5 = 5000 total), we obtained answers from 5 workers each. For 4763 pairs (i.e. 95.26%), majority of workers selected same as the answer to both questions. For core features, we observe significantly higher validation rate of 96.48% (4216/4370) than for spurious features 86.83% (547/630). These results indicate the quality of our annotated masks (specially the core ones) is high. In Section 4.2, for each class i ∈ Y, we obtain a set of core and spurious features denoted by C(i) and S(i), respectively. We also validated that the NAMs highlight the same visual attribute for a large number of images in 95.26% of all (class=i, feature=j) pairs. These results enable us to significantly expand the size of the Salient Imagenet dataset and use it for training reliable deep models. For the Imagenet dataset, the test set was constructed by selecting 50 images per class resulting in the test set of 50 × 1000 = 50, 000 images. However, such a test set may not be adequate for testing the sensitivity of a trained model to spurious features because for each class i ∈ Y, we want our test set to include: (i) a large number of images per spurious feature that the class i is vulnerable to, and (ii) masks for core/spurious regions in these images so that by adding noise to these regions, we can test the sensitivity of the model to these features for making its predictions. Test set. To construct the test set, for each i ∈ Y, j ∈ C(i) ∪ S(i), we first define D(i, j) as the set of images with label i and top-65 activations of j, and their NAMs. The NAMs for these images act as the soft masks that highlight the visual attribute encoded in j. If j ∈ C(i), these are called core masks and if j ∈ S(i) then spurious masks. 
By taking the union of these sets, i.e., ∪ j∈C(i)∪S(i) D(i, j), we obtain the desired test set for class i. In total, the test set contains 226, 946 images across all 1, 000 classes. Training set. To construct the training set for class i ∈ Y, we follow the same procedure as above. However, to keep the training and test sets disjoint, we only select images with label i that have not already been included in the test set for class i. This results in 1, 054, 221 training images for 1000 classes. We plot the number of images in the training set for classes with at least 3 spurious features in Figure 4 . We note that the NAM validation procedure discussed in Section 4.2 has been performed for top-260 images per (class,feature) pair and remaining masks in the training set may not have the same level of quality. However, one can easily specify a constant k to select masks for each (class=i, feature=j) pair, only for the images with label i and top-k values of j. The union of training and test sets is the Salient Imagenet-1M dataset. We use 1M because the validated dataset contains more than 1M mask annotations (see Appendix G). In this section, we use Salient Imagenet-1M's test set to measure the sensitivity of several pretrained models to core/spurious features by computing the degradation in model perfor-5 mance due to Gaussian noise in core/spurious image regions. The premise here is that if a model does not use the content of a region, then adding noise to the region should have no effect. Contrapositively, if adding noise to a region degrades performance, then the model does make use of the region. Singla & Feizi (2022) conducted a similar analysis, introducing the concepts of core accuracy and spurious accuracy, where core/spurious accuracy is (informally) defined to be the model accuracy on images with added noise in the spurious/core regions, respectively. However, their core and spurious accuracy are evaluated only on images that contain the required mask (i.e. only images with spurious masks were included for measuring the core accuracy). In their dataset (as well as our test set), a sample x contains a mask for feature j only if its activation of j is among the top-65 for images from its class, resulting in significantly different data over which core and spurious accuracy were computed. This results in the following problems: (i) unequal datasets where core/spurious accuracies are computed (357 classes have at least 1 spurious feature while 985 have at least 1 core feature), resulting in incomparable numbers, (ii) spurious masks tend to generalize worse than core masks (Section 4.2), thus the computed core accuracy may not be reliable (iii) core and spurious masks can overlap since they are computed using NAM as the soft segmentation masks. To address these limitations, we evaluate each metric only on images with at least 1 core mask, and compute the spurious mask as the complement of (i.e. 1−) the mask used for core regions. Furthermore, we employ a new metric, the Relative Core Sensitivity, that combines core and spurious accuracy to quantify model reliance on core features, while controlling for general noise robustness. Lastly, our analysis is significantly larger than that of Singla & Feizi (2022), both in the number of classes and models considered. Each image in the Salient Imagenet-1M test set may have up to five NAMs for core features. 
Similar to Singla & Feizi (2022), for each image, we take the elementwise maximum of the NAMs for its core features to come up with a single consolidated core mask per image (referred to as c). We also observe that in practice, the core masks often do not cover the entirety of the core region (Figure 5, top left). To ameliorate this, we apply a dilation transform that iteratively replaces each pixel value with the maximum pixel value within a small square kernel (Figure 5, second column).

Definition 6.1. (Dilated Core Mask) For a mask m, one iteration of dilation with a square kernel of side 2k + 1 is defined as:

dilate(m)_{i,j} = max_{|p| ≤ k, |q| ≤ k} m_{i+p, j+q}.

The dilated core mask, denoted as c̃, is obtained by applying 15 iterations of dilation using k = 2 on the core mask c.

Figure 5. Left column: masks used in Singla & Feizi (2022) to compute spurious and core accuracies; second column: masks used in this work; third column: noise (σ = 0.25) applied.

In Figure 5, we visualize the difference in core/spurious mask computation procedures between our work and Singla & Feizi (2022). In the left column, we see that using spurious masks obtained in the same manner as core masks (i.e., by taking the max over NAMs of spurious features) may introduce an incongruity in what is considered core and spurious. Specifically, while the spurious mask in the bottom left focuses on the background, it also covers much of the core region. However, the 1 − c̃ mask (middle column) by design has low overlap with the dilated core mask c̃. Using these dilated core masks for each sample in the Salient Imagenet-1M test set, we can now define our revised versions of core and spurious accuracy as follows:

Definition 6.2. (Core and Spurious Accuracy) The Core Accuracy acc^(C) for a model h is defined as:

acc^(C) = E_{(x, c̃, y)} E_{z ∼ N(0, σ²I)} [ 1{ h(x + (1 − c̃) ⊙ z) = y } ],

i.e., the accuracy of h when Gaussian noise is added to the non-core region 1 − c̃. The Spurious Accuracy acc^(S) is defined analogously, with noise added to the core region (i.e., using x + c̃ ⊙ z). We use σ = 0.25 for all experiments (Figure 5, last column).

A limitation of using noise to measure model sensitivity to different image regions is that models extremely robust to noise corruptions will have high core and spurious accuracy and thus less gap between the two (regardless of their use of spurious features). Thus, we introduce a new metric, Relative Core Sensitivity, that quantifies the model reliance on core features while controlling for general noise robustness (adapted from a similar metric in Moayeri et al. (2022)).

Definition 6.3. (Relative Core Sensitivity) We define the Relative Core Sensitivity or RCS as:

RCS = (acc^(C) − acc^(S)) / (2 · min(ā, 1 − ā)),   where   ā = (acc^(C) + acc^(S)) / 2.

Here, ā acts as a proxy for the general noise robustness of the model under inspection. We can show that for all models with noise robustness ā, 2 min(ā, 1 − ā) is the maximum possible gap between the core and spurious accuracy. Thus, RCS normalizes the gap of the current model by the total possible gap. In Figure 6, higher RCS corresponds to lying higher above the diagonal (high core and low spurious accuracy). We derive this metric in more detail in Appendix H.

We study a large and diverse set of pretrained Imagenet models and training paradigms (42 models in total), namely ConvNets, vision transformers (ViTs), robust ResNets, CLIP models, and self-supervised models (see Appendix A for details). Figure 6 shows the core and spurious accuracy for all model categories evaluated. We observe that transformer models (squares) lie closer to the diagonal than convolutional models (triangles), suggesting they rely more on spurious features. We hypothesize that the lack of a proper inductive bias in transformers may lead to this phenomenon. We show the average RCS values in Figure 7 (high RCS implies high core and low spurious accuracy). We again validate that the transformer models have significantly lower RCS (0.23 compared to 0.41 for convolutional models).
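Concretely, the metrics used in these comparisons (Definitions 6.1–6.3) can be computed as in the following sketch (PyTorch-style; the function names, variable names, and loader interface are our own assumptions, not a released API, and images are taken to lie in [0, 1]):

    import torch
    import torch.nn.functional as F

    def dilate(mask, k=2, iters=15):
        """Definition 6.1: each iteration replaces every pixel by the maximum over a
        (2k+1) x (2k+1) window, implemented as a stride-1 max-pool on the soft mask."""
        for _ in range(iters):
            mask = F.max_pool2d(mask, kernel_size=2 * k + 1, stride=1, padding=k)
        return mask

    @torch.no_grad()
    def core_spurious_accuracy(model, loader, sigma=0.25):
        """Definition 6.2: accuracy with Gaussian noise in the non-core region (core
        accuracy) or in the core region (spurious accuracy), using the dilated mask."""
        core_correct, spur_correct, total = 0, 0, 0
        for x, y, core_mask in loader:               # core_mask: elementwise max of core NAMs
            c = dilate(core_mask)
            z = sigma * torch.randn_like(x)
            x_core = (x + (1 - c) * z).clamp(0, 1)   # noise outside the dilated core region
            x_spur = (x + c * z).clamp(0, 1)         # noise inside the core region
            core_correct += (model(x_core).argmax(1) == y).sum().item()
            spur_correct += (model(x_spur).argmax(1) == y).sum().item()
            total += y.numel()
        return core_correct / total, spur_correct / total

    def relative_core_sensitivity(acc_core, acc_spur):
        """Definition 6.3: gap between core and spurious accuracy, normalized by the
        largest gap achievable at the model's general noise robustness a_bar."""
        a_bar = (acc_core + acc_spur) / 2
        return (acc_core - acc_spur) / (2 * min(a_bar, 1 - a_bar) + 1e-12)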
In Appendix Table 2, we observe that zero-shot CLIP ViTs yield the lowest RCS value. We conjecture that the use of text tokens in zero-shot CLIP models may introduce an additional source of spurious vulnerabilities. We also observe that adversarial training in ResNets decreases RCS from 0.44 to 0.38. A similar result was also observed by Moayeri et al. (2022) in a different setup (details in Appendix I). Moreover, our analysis indicates that standard accuracy is not sufficient to fully characterize model quality; i.e., different models may have different core accuracy even with the same standard accuracy. For example, EfficientNet-B4 and Inception-V4 (Appendix Table 2) have almost the same clean accuracy (0.37% gap) but vastly different core accuracy (9.23% gap).

ERM yields classifiers that achieve impressive accuracy, but it cannot guide models to learn that certain image regions should inform the class label more than others. Models that use spurious features can give a false sense of performance, as accuracy can drop dramatically in a new domain where correlations between class labels and spurious features are broken. A model that faithfully learns concepts should rely more on core features than on spurious ones. We formalize this notion in Core Risk Minimization, defined in optimization (1). CoRM seeks to minimize the expected loss over samples with Gaussian noise added in non-core regions. When all image regions are deemed to be core, CoRM reduces to ERM. However, when informative core masks are available, CoRM requires that the optimal classifier remains accurate in spite of corruption in the spurious regions. The cost of data collection previously inhibited the pursuit of CoRM-like learning. Salient Imagenet-1M's rich core/spurious annotations have the potential to enable training of models that make predictions while avoiding spurious shortcuts.

We outline our relaxations to the CoRM objective that lead to significant increases in core accuracy and RCS. First, we approximate the inner expectation of CoRM with a single sample. That is, for each input (x, c) with label y, we draw a random noise vector z ∼ N(0, σ²I) and use the single corrupted sample x* = x + (1 − c) ⊙ z in place of the inner expectation. Next, to obtain c for any sample x from class i in Salient Imagenet, we use the NAMs for all core features C(i), regardless of the activation of x on the feature. This differs from the test-set computation of c and introduces noisier core masks, but facilitates a massive increase in training set size. We note that c can be dilated (or eroded) to any degree, introducing a hyperparameter allowing the practitioner to choose the amount of surrounding context the model can use without penalty. While σ alters the magnitude of corruption, dilation alters the region of corruption, introducing a spatial bias in favor of spurious features that are near the core ones. In our experiments, we train on masks c with no dilation applied.

We explore two efficient approaches to perform CoRM and ultimately reduce the reliance on spurious features: (i) random 1 − c noising and (ii) saliency regularization. Following directly from the relaxed formulation of CoRM, 1 − c noising minimizes risk on samples augmented with additive Gaussian noise scaled by 1 − c. In practice, we find that deep networks overfit to noise in the non-core regions, leading to degraded clean accuracy. However, withholding the additive noise for randomly selected batches (e.g., with probability p = 0.5) during training leads to a better trade-off between clean and core accuracies.
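A minimal sketch of this relaxed training step with random 1 − c noising follows (PyTorch-style; `model`, `optimizer`, and the mask convention are our own placeholders, and images are assumed to lie in [0, 1]):

    import torch
    import torch.nn.functional as F

    def corm_noising_step(model, optimizer, x, y, c, sigma=0.25, p=0.5):
        """One relaxed CoRM step with random 1 - c noising.

        x: (B, 3, H, W) images in [0, 1]; c: (B, 1, H, W) soft core masks; y: labels.
        With probability p the additive noise is withheld, so the step reduces to ERM;
        otherwise a single Gaussian sample corrupts the non-core regions (x* keeps every
        core pixel intact, approximating the inner expectation in objective (1)).
        """
        if torch.rand(1).item() >= p:                  # withhold noise with probability p
            z = sigma * torch.randn_like(x)
            x = (x + (1 - c) * z).clamp(0.0, 1.0)      # x* = x + (1 - c) * z
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()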
A second approach utilizes gradient information to perform saliency regularization. Such regularization has been shown to improve generalization (Simpson et al., 2019), robustness to distribution shift in non-core regions (Chang et al., 2021), and model interpretability (Ismail et al., 2021). Formally, for a sample x with core region c and label y, saliency regularization introduces the following loss:

L_sal(x, y, c; θ) = ℓ(f_θ(x), y) + λ ‖(1 − c) ⊙ ∇_x ℓ(f_θ(x), y)‖,

where ℓ is the classification loss and λ controls the strength of the penalty on input gradients in the non-core regions. We compute the saliency penalty after a full forward and backward pass, as the input gradients are then readily available. Model parameters θ are then updated to minimize L_sal. Notice that because saliency regularization and spurious noising affect opposite ends of the training pipeline (pre-forward pass vs. post-backward pass), they can be combined easily.

We train Resnet-50 models on Salient Imagenet-1M, employing the two aforementioned methods and their combination, as well as a baseline model that uses the standard paradigm, ERM. We seek to demonstrate the feasibility of methods toward achieving CoRM's objective, not to obtain the highest possible accuracies. Thus, we do not perform data augmentation. We provide details about how data augmentation can be used with Salient Imagenet-1M masks in Appendix J. In addition to clean accuracy, we present core and spurious accuracy, as well as RCS, each computed over the test set of Salient Imagenet-1M, following the evaluation protocol of Section 6. Table 1 summarizes the results. We find that random 1 − c noising increases core accuracy by 7.21% relative to baseline. Saliency regularization yields a more modest improvement in core accuracy, but improves clean accuracy. Combining the two methods yields an 11.84% increase in core accuracy, while also marginally improving clean accuracy. Moreover, spurious accuracy decreases significantly, causing a large improvement of 0.13 in RCS. These preliminary results suggest that the goals of achieving high clean accuracy while also maintaining high core accuracy and having low spurious accuracy can be made feasible through Salient Imagenet-1M. We hope that our introduced dataset and training methods will lead to the development of deep models that mainly rely on core features for their inferences.

Table 2. Complete results for evaluation of several pretrained Imagenet models. We present core accuracy, spurious accuracy, and RCS, as described in Section 6. All metrics are computed over the test set of Salient Imagenet-1M.

In Section 6, we present a framework for evaluating the reliance on spurious features of any model pretrained on ImageNet, and present results for a breadth of models. We now share greater detail on the models studied and the results obtained in Table 2. For consistency, we obtain nearly all pretrained weights from the timm framework (Wightman, 2019), with the exception of self-supervised and CLIP model weights, which were obtained directly from the original sources, and adversarially trained networks, obtained from Salman et al. (2020). We study a large set of pretrained ImageNet models (42 total), spanning various architectures and training paradigms. Convolutional models: these include, among others, MobileNet-v2 (Sandler et al., 2018). We refer to this group as ConvNets. Vision Transformer-based models: ViT (Dosovitskiy et al., 2020), DeiT (Touvron et al., 2021), ConViT (d'Ascoli et al., 2021), and Swin Transformers (Liu et al., 2021). We refer to this group as ViTs. Robust ResNets: Adversarially trained Resnets with ℓ2 projected gradient descent (Salman et al., 2020).
We refer to this group as Robust ResNets. For zero-shot CLIP models, we follow the evaluation procedure of comparing the dot product of an image encoding to the average encoded vector of eight template text captions per class. Model size has small and inconsistent effects on RCS, which allows for comparing averages across categories with varying model sizes. The primary trend of transformers having lower RCS than ConvNets is validated in the smaller cohort of CLIP models, where the average RCS decreases from 0.25 for CLIP ResNets to 0.13 for CLIP ViTs. However, for the self-supervised models MoCo-v3 and DINO, transformers and ResNets yield similar sensitivities. Interestingly, for ResNet backbones, self-supervised training decreases RCS (-0.15), while the reverse is true for transformers (+0.10). Surprisingly, we observe that adversarial training in ResNets decreases RCS from 0.44 to 0.38, despite its objective of making models more reliable by increasing adversarial robustness. Further, the attack budget ε used during training (where a higher budget yields greater adversarial robustness) corroborates this trend: the gap in RCS between Robust Resnets and Resnets is larger for robust models trained with ε = 3 than for those trained with ε = 1. Lastly, we find that increasing patch size in some ViTs leads to large drops in RCS. Specifically, changing patch size from 16 to 32 reduces RCS from 0.14 to 0.12 for CLIP ViTs, and from 0.29 to 0.19 for the original ViT of Dosovitskiy et al. (2020). We stress that detailed experiments are necessary to rigorously make further claims regarding our observations. However, we are intrigued by the low RCS in transformers. Further, the conflicting effect of self-supervision across architectures suggests there may be more factors at play. Vision transformers are emerging rapidly, and most transformer-based models (including DeiT, ConViT, and Swin) follow the training procedure of Touvron et al. (2021), where data efficiency was achieved via heavy augmentation. Augmentation is also used in self-supervised learning to create multiple views of the same image. Seeing as many modern augmentations potentially corrupt core regions, we ponder whether the increased gain in test accuracy may come at the cost of raising sensitivity to spurious regions. Further experimentation in this direction may be of interest.

In this section, we motivate the study of spurious correlations through an error analysis. Importantly, this analysis makes no use of additive noise to gauge sensitivity, offering a distinct and complementary perspective to our other findings. Specifically, we use the annotations of neural nodes in the Robust Resnet-50 used to generate NAMs, directly inspecting feature activations. We first outline the intuition and key takeaways, before delving into a detailed discussion of methods and results.

B.1. The Matchstick Example: Spurious features can help or hurt, depending on class

In Figure 8, we display two samples where the spurious feature value of the predicted class is much higher, but core feature values are significantly lower, when compared to the average for correctly classified images of the class. Thus, the prediction is made due to the activations of the class's spurious features. For the matchstick, recognizing the spurious feature of the flame compensated for the fact that most of the matchstick lies outside of the image region, leading to a correct classification. However, the fireboat is misclassified as an airship, due to the spurious feature of what seems like a cloudy sky.
The analysis of this section inspects core and spurious feature values for correct and incorrect classifications on a classwise basis. We find that spurious features hurt (i.e., activate higher for misclassified samples, even when core features correctly activate lower) much more often than they help (i.e., activate higher for true instances of the class, even when core features incorrectly activate lower). Specifically, for classes where the differences in spurious and core feature values between misclassified and correctly classified samples have opposite signs (101 classes), the spurious features hurt 85% of the time. This offers quantitative evidence of how spurious features lead to misclassification in a model trained with single-label supervision. However, we also demonstrate how the role of spurious correlations varies, even from class to class. In summary, spurious features can be harmful, but true understanding of their roles requires careful analysis. We hope Salient Imagenet-1M opens the door to inquiries of this kind.

For an image x from class i with core features C(i) and spurious features S(i), denote the representation vector (i.e., the activations of neurons in the penultimate layer of the Robust Resnet-50) for x as r(x). For a set X, we measure the average activation of core features as follows:

Definition B.1. (Core Feature Value) We define the core feature value over a set of inputs X, for both a class-feature pair (i, j) and a single class i (denoted CFV_{i,j} and CFV_i respectively), as:

CFV_{i,j}(X) = (1 / |X|) Σ_{x ∈ X} r(x)_j,   CFV_i(X) = (1 / |C(i)|) Σ_{j ∈ C(i)} CFV_{i,j}(X).

We can analogously define the spurious feature value (SFV) for a class i and set X by replacing C(i) with S(i) in B.1. We now define groups CC_i, MC_i, corresponding to correctly and incorrectly classified samples assigned to class i (i.e., by prediction). That is, with h denoting the Robust Resnet-50 and y(x) the ground truth label of x,

CC_i = {x : h(x) = i and y(x) = i},   MC_i = {x : h(x) = i and y(x) ≠ i}.

Using the above notation, we obtain the metrics presented in Figure 8. Specifically, the Relative CFV Difference refers to:

(CFV_i(MC_i) − CFV_i(CC_i)) / (CFV_i(MC_i) + CFV_i(CC_i)).

This metric lies between −1 and 1. When the relative CFV difference of a class i is positive, that entails that misclassified samples activate the core features for the class more than correctly classified samples. The Relative SFV Difference is defined analogously by replacing CFV with SFV. For classes with at least one spurious and one core feature (324 total), we evaluate the relative CFV difference and the relative SFV difference, using images in the test set of Salient Imagenet-1M (though the analysis does not require core and spurious masks). We display the kernel density estimate of the distribution of (relative SFV difference, relative CFV difference) pairs in Figure 8. For 84 classes, the relative SFV difference is positive while the relative CFV difference is negative (red quadrant). This suggests that the incorrect prediction of misclassified samples is due to high activation of spurious features, and not core features. Conversely, for 17 classes, the reverse is true (green quadrant in Figure 8), suggesting that core features erroneously activate higher for misclassifications. However, the spurious feature activations are higher for instances of the class, correcting the mistakes of the core features. Thus, spurious features can both help and hurt classifiers, and their role varies significantly based on class. We highlight this result, as it reflects the complicated nature of spurious features and their roles in deep models.
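The quantities above can be computed directly from stored penultimate-layer activations; a minimal sketch (NumPy, with our own variable names; `reps` is assumed to hold r(x) for a set of images):

    import numpy as np

    def class_feature_value(reps, feature_ids):
        """Definition B.1: average activation of the given (core or spurious)
        features over a set of representations.

        reps:        (N, D) array of penultimate-layer activations r(x) for the set X
        feature_ids: indices of the class's core features C(i) (or spurious features S(i))
        """
        return float(reps[:, feature_ids].mean())

    def relative_difference(value_mc, value_cc):
        """Relative (CFV or SFV) difference between misclassified (MC_i) and
        correctly classified (CC_i) samples; lies in [-1, 1] for nonnegative values."""
        return (value_mc - value_cc) / (value_mc + value_cc + 1e-12)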
We believe the complete removal of spurious features may have unintended consequences, and suggest careful analysis, with respect to class, when addressing spurious features.

The neural feature vector (i.e., the vector of penultimate-layer neurons) can have a large size (2048 for the Robust Resnet-50 used in this work and in the prior work of Singla & Feizi (2022)), and visualizing all of these features for a particular class (say class i) to determine whether they are core or spurious for i can be difficult. Thus, we select a small subset of these features that are highly predictive of class i and annotate them as core or spurious for i using the same method as the prior work of Singla & Feizi (2022). We first select a subset of images (from the training set) on which the robust model predicts the class i. We compute the mean of the neural feature vectors across all images in this subset, denoted by r(i). From the weight matrix w of the last linear layer of the robust model, we extract the i-th row w_{i,:} that maps the neural feature vector to the logit for the class i. Next, we compute the Hadamard product r(i) ⊙ w_{i,:}. Intuitively, the j-th element of this vector, (r(i) ⊙ w_{i,:})_j, captures the mean contribution of neural feature j for predicting the class i. This procedure leads to the following definition:

Definition C.1. The Neural Feature Importance of feature j for class i is defined as: IV_{i,j} = (r(i) ⊙ w_{i,:})_j. For class i, the neural feature with the k-th highest IV is said to have feature rank k.

We then select the neural features with the top-5 importance values (defined above) per class.

The heatmap is generated by first converting the neural activation map (which is grayscale) to an RGB image (using the jet colormap). This is followed by overlaying the jet colormap on top of the original image using the following lines of code:

    import cv2
    import numpy as np

    def compute_heatmap(img, nam):
        # Map the grayscale NAM (values in [0, 1]) to the jet colormap,
        # then overlay it on the original image and renormalize.
        hm = cv2.applyColorMap(np.uint8(255 * nam), cv2.COLORMAP_JET)
        hm = np.float32(hm) / 255
        hm = hm + img
        hm = hm / np.max(hm)
        return hm

Figure 9. Figure describing the Neural Activation Map generation procedure. To obtain the neural activation map for feature j, we select the feature map from the output tensor of the previous layer (i.e., the layer before the global average pooling operation). Next, we simply normalize the feature map between 0 and 1 and resize it to match the image size, giving the neural activation map. This figure is from Singla & Feizi (2022) and included here for completeness.

Figure 10. Figure illustrating the feature attack procedure. We select the feature we are interested in and simply optimize the image to maximize its value to generate the visualization. ρ is a hyperparameter used to control the amount of change allowed in the image. For optimization, we use gradient ascent with step size = 40, number of iterations = 25 and ρ = 500. This figure is from Singla & Feizi (2022) and included here for completeness.

The design for the Mechanical Turk study is shown in Figure 11. The left panel visualizing the neuron is shown in Figure 11a. The right panel describing the object class is shown in Figure 11b. The questionnaire is shown in Figure 11c. We ask the workers to determine whether they think the visual attribute (given on the left) is a part of the main object (given on the right), some separate object, or the background of the main object.
We also ask the workers to provide reasons for their answers and rate their confidence on a Likert scale from 1 to 5. The visualizations for which the majority of workers selected either background or separate object as the answer were deemed to be spurious. Workers were paid $0.10 per HIT, with an average salary of $8 per hour. In total, we had 137 unique workers, each completing 140.15 tasks on average.

The design for the Mechanical Turk study is shown in Figure 12. The three panels showing heatmaps for different images from a class are shown in Figure 12a. The questionnaire is shown in Figure 12b. For each heatmap, workers were asked to determine if the highlighted attribute looked different from at least 3 other heatmaps in the same panel. We also ask the workers to determine whether they think the focus of the heatmap is on the same object (in the three panels), different objects, or whether they think the visualization in any of the panels is unclear. As in the previous study (Section E), we ask the workers to provide reasons for their answers and rate their confidence on a Likert scale from 1 to 5. The visualizations for which at least 4 workers selected "same" as the answer, and for which at least 4 workers did not select "different" as the answer for all 15 heatmaps, were deemed to be validated; i.e., for this subset of 260 images, we assume that the neural activation maps focus on the same visual attribute.

Using the Mechanical Turk study in Section 4.2, for 4216 (out of 4370) core pairs, i.e., (class=i, feature=j) pairs where j is core for i, we validate that the NAMs for the top-260 images indeed focus on the desired visual attribute. This directly results in 4216 × 260 = 1,096,160 validated core masks. Similarly, by taking the union of images in the sets D(i, j), again where j is core for i and (class=i, feature=j) is among the 4216 validated pairs, we obtain 565,950 unique images.

In this work, we use a novel metric to facilitate comparisons of core and spurious accuracies across a diverse set of models. Specifically, relative core sensitivity (RCS) is designed to address the potential lurking variable of general noise robustness. For example, a model that is generally very robust to noise will see small degradation due to noise anywhere. Thus, the absolute difference between core and spurious accuracy will be small, regardless of the relative model sensitivity to either region. To normalize against this limitation, we scale the absolute gap by the total possible gap (made precise below). We note that the metric Relative Core Sensitivity (RCS) adapted here is the same as the metric called Relative Foreground Sensitivity (RFS) in Moayeri et al. (2022). We include two detailed derivations here (one inspired by the original geometric derivation in Moayeri et al. (2022), and a new algebraic derivation) that were omitted from the main text for brevity.

Consider a model with core and spurious accuracies acc^(C), acc^(S) respectively. We define ā = (acc^(C) + acc^(S)) / 2, and use ā as a proxy for general noise robustness. The gap between acc^(C) and acc^(S) reflects sensitivity to noise in the non-core regions relative to core regions. We seek to compute the maximum gap between acc^(C) and acc^(S) for a fixed general noise robustness ā. The arguments maximizing the linear objective acc^(C) − acc^(S) will occur at the boundary of the feasible region. There are two non-trivial cases to compare (the gap is obviously not maximized if acc^(C) ≤ acc^(S), so we ignore these cases):

• The maximum gap occurs when acc^(C) = 1.
Thus, acc^(S) = 2ā − 1, yielding a gap of acc^(C) − acc^(S) = 2(1 − ā).

• The maximum gap occurs when acc^(S) = 0. Thus, acc^(C) = 2ā, yielding a gap of acc^(C) − acc^(S) = 2ā.

However, notice that the feasibility of the above cases is contingent on ā. Specifically, acc^(C) can only be 1 if ā ≥ 0.5, and acc^(S) can only be 0 if ā ≤ 0.5. Thus, as a piecewise function with respect to ā, the maximum gap is 2ā for ā ≤ 0.5 and 2(1 − ā) for ā ≥ 0.5. Now, observe that this piecewise definition can be consolidated as 2 min(ā, 1 − ā). Hence, defining RCS to be the ratio of the absolute gap between core and spurious accuracy to the total possible gap for any model with general noise robustness ā yields the original formula:

RCS = (acc^(C) − acc^(S)) / (2 min(ā, 1 − ā)).

RCS can also be viewed geometrically as the ratio of the distance of the point (acc^(S), acc^(C)) above the diagonal to the maximum distance from the diagonal for models with fixed general noise robustness ā. First, observe that the distance of (acc^(S), acc^(C)) from the diagonal is given by the distance between the point and (ā, ā), yielding:

Distance to diagonal = √2 (ā − acc^(S)) = (acc^(C) − acc^(S)) / √2,

assuming that acc^(C) > acc^(S) (though otherwise the sign would simply be flipped). The maximum distance from the diagonal is constrained by the fact that 0 ≤ acc^(C), acc^(S) ≤ 1. Because ā is fixed, we necessarily lie on the line acc^(C) = 2ā − acc^(S). Notice that when ā ≤ 0.5, we intersect the boundary on the y-axis at the point (0, 2ā). When ā ≥ 0.5, we intersect the boundary defined by y = 1 at the point (2ā − 1, 1). The corresponding maximum distances are then:

Max distance to diagonal = √2 · ā for ā ≤ 0.5, and √2 · (1 − ā) for ā ≥ 0.5.

As in the algebraic derivation, the piecewise formula can be resolved as √2 · min(ā, 1 − ā). Therefore, in the final ratio of the distance of (acc^(S), acc^(C)) to the diagonal over the maximum distance to the diagonal for fixed ā, the √2 terms cancel, yielding the formula for RCS. For a pictographic geometric derivation, we refer readers to Moayeri et al. (2022).

Certain aspects of our analysis are similar to previous work. Namely, the RCS metric is adapted from the relative foreground sensitivity metric of Moayeri et al. (2022), and both Moayeri et al. (2022) and Singla & Feizi (2022) conduct noise-based analyses to discern model sensitivity to image regions. Further, some of our observations on pretrained models were also noted in Moayeri et al. (2022) (i.e., lower sensitivity to core/foreground regions in transformers and adversarially trained Resnets, relative to Resnets). We acknowledge the inspiration taken from these efforts, and highlight two key distinguishing aspects of our work that we believe significantly add to the prior findings. The first is scale, in both data and models: our evaluation includes 226k images from 985 classes, compared to 5k from 20 classes (organized into ten subsets called RIVAL10) in Moayeri et al. (2022). The second is that evaluating RCS on the test set of Salient Imagenet-1M can be done without making any changes to pretrained models, whereas Moayeri et al. (2022) require models to be finetuned on the ten-class subset of images they consider. While finetuning is a standard procedure, it changes the weights of the neural features used to perform classification, which may introduce biases. Moreover, certain core features may be discarded when the classification task is simplified to the much coarser labels of RIVAL10.
The evaluation in Singla & Feizi (2022) does not attempt to control for varying noise robustness, as models are not compared to one another directly. Thus, we believe the findings of our experiments, due to their scale and the fact that they require no modification of pretrained models, may be empirically stronger than those of Moayeri et al. (2022). Nonetheless, we find it encouraging that we corroborate the findings of Moayeri et al. (2022) on a separate, larger dataset.

J. Using data augmentation with Salient Imagenet-1M

We follow the fast training procedure for the baseline in Wong et al. (2020). Cyclic learning rates are applied from 0.1 to 0.004 for an SGD optimizer over 15 epochs. The only augmentation is to resize and center crop images to 224 × 224. We discuss this choice below. Random cropping, a common augmentation technique, was not directly possible for the Salient ImageNet-1M version used in this work, as the masks were obtained for images after undergoing the standard ImageNet test transformations of resizing and taking a square center crop. However, there are multiple approaches for incorporating augmentation going forward. First, one can generate the core masks by computing NAMs on the fly, using the same robust model (used in this work for generating the NAMs for Salient Imagenet-1M) during training. That is, after performing any augmentation on the original image, one can compute NAMs for the relevant features on the augmented image and use these directly as done in this work. Second, one can precompute NAMs for all original images. This would require computing NAMs using training images that have been resized such that the shorter side is 224. Then, during training, the same random cropping transformation can be applied to the image and the mask to obtain the masks for the relevant core/non-core regions. Thus, while augmentation was not used in this work, it is certainly feasible for Salient Imagenet-1M and will be explored in future works.

K. Examples of spurious features

For each (class=i, feature=j) pair where j is known to be spurious for the class i, we analyze the sensitivity of various standard (non-robust) trained models (Resnet-50, Efficientnet-B7, CLIP VIT-B32, VIT-B32) to different spurious features. We first compute the clean accuracy on the set D(i, j) for each model (called initial in the figure captions below). Next, for each image and spurious mask (x, s) ∈ D(i, j), we add Gaussian noise to the spurious region highlighted by s. We then compute the accuracy of each model on these noisy images and the drop in model accuracy (called accuracy drop in the figure captions below).

Figure caption: For Resnet-50, accuracy drop: -3.077% (initial: 98.462%). For Efficientnet-B7, accuracy drop: -1.539% (initial: 98.462%). For CLIP VIT-B32, accuracy drop: -24.615% (initial: 87.692%). For VIT-B32, accuracy drop: -9.231% (initial: 96.923%).

Figure caption (class loggerhead, class index: 33): For Resnet-50, accuracy drop: -73.846% (initial: 96.923%).
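A sketch of this per-pair sensitivity computation (PyTorch-style; the function and variable names are our own and purely illustrative, with images assumed to lie in [0, 1]):

    import torch

    @torch.no_grad()
    def spurious_accuracy_drop(model, pairs, label, sigma=0.25):
        """Clean accuracy vs. accuracy when the spurious region is noised,
        for one (class, feature) pair.

        pairs: iterable of (image, spurious_mask) drawn from D(i, j); image is a
               (3, H, W) tensor in [0, 1], spurious_mask a (1, H, W) soft mask
        label: the class index i shared by all images in D(i, j)
        """
        clean_correct, noisy_correct, total = 0, 0, 0
        for x, s in pairs:
            z = sigma * torch.randn_like(x)
            x_noisy = (x + s * z).clamp(0.0, 1.0)       # corrupt only the spurious region
            clean_correct += int(model(x.unsqueeze(0)).argmax(1).item() == label)
            noisy_correct += int(model(x_noisy.unsqueeze(0)).argmax(1).item() == label)
            total += 1
        initial = clean_correct / total
        drop = (noisy_correct - clean_correct) / total   # negative when accuracy falls
        return initial, drop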