Title: Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning
Authors: Amit Dhurandhar, Karthikeyan Ramamurthy, Kartik Ahuja, Vijay Arya
Date: 2022-01-28
Abstract: The locally interpretable model-agnostic explanations (LIME) method is one of the most popular methods used to explain black-box models at a per-example level. Although many variants have been proposed, few provide a simple way to produce high fidelity explanations that are also stable and intuitive. In this work, we provide a novel perspective by proposing a model-agnostic local explanation method inspired by the invariant risk minimization (IRM) principle -- originally proposed for (global) out-of-distribution generalization -- to provide such high fidelity explanations that are also stable and unidirectional across nearby examples. Our method is based on a game-theoretic formulation where we theoretically show that our approach has a strong tendency to eliminate features where the gradient of the black-box function abruptly changes sign in the locality of the example we want to explain, while in other cases it is more careful and will choose a more conservative (feature) attribution, a behavior which can be highly desirable for recourse. Empirically, we show on tabular, image and text data that the quality of our explanations with neighborhoods formed using random perturbations is much better than LIME and in some cases even comparable to other methods that use realistic neighbors sampled from the data manifold. This is desirable given that learning a manifold to either create realistic neighbors or to project explanations is typically expensive or may even be impossible. Moreover, our algorithm is simple and efficient to train, and can ascertain stable input features for local decisions of a black-box without access to side information such as a (partial) causal graph, as has been seen in some recent works.
Deployment and usage of neural black-box models have grown significantly in industry over the last few years, creating the need for new tools to help users understand and trust models [21]. Even well-studied application domains such as image recognition require some form of prediction understanding in order for the user to incorporate the model into any important decisions [41, 29]. An example of this could be a doctor who is given a cancer diagnosis based on an image scan. Since the doctor holds responsibility for the final diagnosis, the model must provide sufficient reason for its prediction. Even new text categorization tasks [18] are becoming important with the growing need for social media companies to provide better monitoring of public content. Twitter recently began monitoring tweets related to COVID-19 in order to label tweets containing misleading information, disputed claims, or unverified claims [37]. Laws will likely emerge requiring explanations for why red flags were or were not raised in many such cases. In fact, the General Data Protection Regulation (GDPR) [46] passed in Europe already requires automated systems that make decisions affecting humans to be able to explain them. Given this acute need, a number of methods have been proposed to explain local decisions (i.e. example-specific decisions) of classifiers [36, 30, 41, 29, 13].
Locally interpretable model-agnostic explanations (LIME) is arguably the most well-known local explanation method that requires only query (or black-box) access to the model. Although LIME is a popular method, it is known to be sensitive to certain design choices such as i) the (random) sampling used to create the (perturbation) neighborhood, ii) the size of this neighborhood (number of samples) and iii) the (local) fitting procedure used to learn the explanation model [32, 49]. The first and most serious issue can lead to nearby examples having drastically different explanations, making effective recourse a challenge. One possible mitigation is to increase the neighborhood size; however, one cannot do so arbitrarily, as it not only leads to higher computational cost, but in today's cloud-computing-driven world it could have direct monetary implications where every query to a black-box model has an associated cost [14]. There have been variants suggested to overcome some of these limitations [8, 40, 34], primarily through mechanisms that create realistic neighborhoods or through adversarial training [28]; however, their efficacy is restricted to certain settings and modalities based on their assumptions and training strategies.
Figure 1 (Coefficient inconsistency for LINEX): We visualize, for the IRIS dataset, the Coefficient Inconsistency (CI) (see Section 5 for the exact definition and setup details) between the explanation (top two features) for an example and its nearest neighbor in the dataset. Each circle denotes an example and a rainbow colormap depicts the degree of inconsistency w.r.t. its nearest neighbor, where red implies the least inconsistency and violet the most. As can be seen, LINEX explanations are much more consistent than LIME's.
In this paper we introduce a new method called Locally INvariant EXplanations (LINEX), inspired by the invariant risk minimization (IRM) principle [7], that produces feature-based explanations that are robust to neighborhood sampling and can recover faithful (i.e. mimic black-box behavior), stable (i.e. similar for close-by examples) and unidirectional (i.e. same-sign attributions for close-by examples, see Section 4.1) explanations across tabular, image, and text modalities. In particular, we show that our method performs better than the competitors for random as well as realistic neighborhood generation, where in some cases even with the former strategy our explanation quality is close to methods that employ the latter. Qualitatively, our method highlights (local) features as important that, in the particular locality, i) have consistently high gradient with respect to (w.r.t.) the black-box function and ii) whose gradient does not change significantly, especially in sign. Such stable behavior for LINEX is visualized in Figure 1, where we get similar explanations for nearby examples in the IRIS dataset. The (in)fidelity of LINEX is still similar to LIME (see Table 3), but our explanations are much more stable.
Posthoc explanations can typically be partitioned into two broad categories: global and local. Global explainability refers to trying to understand a black-box model at a holistic level, where the typical tactic is knowledge transfer [25, 16, 15], in which (soft/hard) labels of the black-box model are used to train an interpretable model such as a decision tree or rule list [38]. Local explanations, on the other hand, refer to understanding individual decisions. These explanations typically come in two forms: exemplar based or feature based.
For exemplar-based explanations, as the name suggests, similar but diverse examples [27, 22] are provided as explanations for the input in question, while for feature-based explanations [36, 30, 13, 29, 50], which are the focus of this work, the features deemed important for the decision made on the input are returned. There are some methods that do both [34]. Moreover, there are methods which provide explanations that are local, global as well as at a group level [35]. All of these methods, though, may still not provide stable and robust local feature-based explanations, which can be desirable in practice [20]. Given this, there have been more recent works that try to learn either robust or even causal explanations. In [28] the authors try to learn local explanations that are robust and stable relative to distribution shifts and adversarial attacks. However, the distribution shifts they consider are linear shifts and adversarial training is performed, which can be slow and sometimes unstable [48]. Moreover, the method seems to be applicable primarily to tabular data. Works on causal explanations [19, 24] mainly modify SHAP and assume access to a partial causal graph. Some others [43] assume white-box access. In this work we do not assume availability of such additional information. There are also works which show that creating realistic neighborhoods by learning the data manifold for LIME [8, 40] can lead to better quality explanations, and one particular work [6] suggests that projecting the explanations themselves onto the manifold can also make them more robust. The need for stability in an exemplar neighborhood for LIME-like methods has been highlighted in [49], with the general desire for stable explanations also being expressed in [47, 44].
Invariant Risk Minimization: Consider a collection of training datasets D = {D_e}_{e∈E_tr} gathered from a set of environments E_tr, where D_e = {(x_e^i, y_e^i)}_{i=1}^{n_e} is the dataset gathered from environment e ∈ E_tr and n_e is the number of points in environment e. The feature value for data point i is x_e^i ∈ X and the corresponding label in environment e is drawn i.i.d. from a distribution P_e. Define a predictor f : X → R. The goal of IRM is to use this collection of datasets D to construct a predictor f that performs well across many unseen environments E_all, where E_all ⊇ E_tr. Define the risk achieved by f in environment e as R_e(f) = E_e[ℓ(f(X_e), Y_e)], where ℓ is the square loss, f(X_e) is the predicted value, Y_e is the corresponding label, (X_e, Y_e) ∼ P_e, and the expectation E_e is taken w.r.t. the distribution of points in environment e. An invariant predictor is composed of two parts: a representation Φ ∈ R^{d×n} and a predictor (with the constant term) w ∈ R^{d×1}. We say that a data representation Φ elicits an invariant predictor w^T Φ across the set of environments E_tr if there is a predictor w that achieves the minimum risk for all the environments, i.e., w ∈ argmin_{w̄ ∈ R^{d×1}} R_e(w̄^T Φ), ∀e ∈ E_tr. IRM may be phrased as the following constrained optimization problem [7]:

min_{Φ, w} Σ_{e ∈ E_tr} R_e(w^T Φ)   subject to   w ∈ argmin_{w̄ ∈ R^{d×1}} R_e(w̄^T Φ), ∀e ∈ E_tr.

If w^T Φ solves the above, then it is an invariant predictor across the training environments E_tr.
Local explainability setup vs IRM setup: There are multiple differences between the IRM setup and the setup we have in this paper. First, IRM is meant to learn global models directly from the dataset.
This also applies to the game-theoretic version proposed in [4], while for local explainability we are trying to learn local models that explain (per example) a given black-box model. Second, given that we want to explain a black-box model, which typically is a function of the input features, we want to highlight features in our explanation that may be spurious from the domain perspective but that the black-box model nonetheless uses to make decisions. Third, the IRM model does not have to be interpretable, as its learned representation Φ can be arbitrary. Fourth, the IRM learning procedure is quite inefficient, as it tries to solve a bi-level optimization problem with a highly non-convex constraint. Fifth, IRM assumes that one has access to multiple environments (viz. multiple data sources), but in our case we have to figure out how to appropriately create them for each example that we want to explain. Last, since a black-box model is trained on the input features, all or a subset of them are causal to its prediction, unlike the standard IRM setup where a predictor is expected to model the true underlying mechanism. Hence in our case it is sufficient to limit ourselves to some interpretable representation, which in many cases is just the input feature space (i.e. Φ = X). Thus IRM cannot be directly applied to our problem, and we propose a novel game-theoretic approach that is suitable for local explainability and consistent with the above points. With our approach it is interesting that we are able to obtain theoretical results for our setting in a similar style to those seen in out-of-distribution generalization works [3], thus justifying our choice of leveraging such frameworks.
Nash Equilibrium (NE): To understand how certain key aspects of our method function, let us revisit the notion of Nash equilibrium [17]. A standard normal-form game is written as a tuple Ω = (N, {u_i}_{i∈N}, {S_i}_{i∈N}), where N is a finite set of players. Player i ∈ N takes actions from a strategy set S_i. The utility of player i is u_i : S → R, where we write the joint set of actions of all the players as S = Π_{i∈N} S_i. The joint strategy of all the players is given as s ∈ S, the strategy of player i is s_i, and the strategy of the rest of the players is s_{-i}. A joint strategy s is a (pure) Nash equilibrium if no player can improve its utility by unilaterally deviating, i.e., u_i(s_i, s_{-i}) ≥ u_i(s'_i, s_{-i}) for every player i ∈ N and every s'_i ∈ S_i. NE thus identifies a state where each player is using the best possible strategy in response to the rest of the players, leaving no incentive for any player to alter their strategy. In seminal work by [11] it was shown that for a special class of games called concave games such a pure NE always exists. This is relevant because the game implied by Algorithm 1 falls in this category.
We first define certain desirable properties we would like our explanation method to have. The first three have been seen in previous works, while the last, unidirectionality, is something new we propose. We then describe our method, where the goal is to explain a black-box model f : X → R for individual inputs x based on predictors w by looking at their corresponding components, also termed feature attributions. We now discuss certain properties we would like our explainability method to have in order to provide robust explanations that could potentially be used for recourse.
Fidelity: This is the most standard property against which all proxy-model-based explanation methods are evaluated [36, 30, 28], as it measures how well the proxy model simulates the behavior of the black-box it is attempting to explain.
Stability: This is also a popular notion [23, 35, 47] used to evaluate the robustness of explanations. Broadly, stability can be measured at three levels. One is prediction stability, which measures how much the predictions of an explanation model change for the same example subject to different randomizations within the method, or across close-by examples. The second is the variance in the feature attributions, again for the same or close-by examples. It is good for a method to showcase stability w.r.t. both, even though in many cases the latter might imply the former. An interesting third notion of stability is the correlation between the feature attributions of an explanation model and the average feature values of examples belonging to a particular class. This measures how much the explanation method picks features that are important for the class, rather than spurious ones that seem important only for the example of interest.
Black-box Invariance: This is the same as implementation invariance defined in [42]. Essentially, if two models have exactly the same behavior on all inputs then their explanations should also be the same. Since our method is model agnostic with only query access to the model, it is easy to see that it satisfies this property if the same environments are created.
Unidirectionality: This is a new property, but as we argue, it is a natural one to have. Loosely speaking, unidirectionality measures how consistently the sign of the predictor for a feature is maintained for the same or close-by examples by an explanation method. This is a natural metric [31], which from an algorithmic recourse [26] perspective is also highly desirable. For instance, recommending that a person increase their salary to get a loan and then recommending to another person with a very similar profile that they decrease their salary for the same outcome makes little sense. We define the unidirectionality Υ as a measure of how consistent the sign of the attribution for a particular feature in a local explanation is when varying neighborhoods for the same example or when considering different close-by examples. In particular, given m attributions for each of d features, denoted by w_{ij} for realization i and feature j, the unidirectionality metric for an example is:

Υ = (1/d) Σ_{j=1}^{d} | (1/m) Σ_{i=1}^{m} sign(w_{ij}) |,   (2)

where |.| stands for absolute value. Clearly, the more consistent the signs of the attribution for a particular feature across the m realizations/explanations, the higher the value, where the maximum value is one. If, for every feature, an equal number of attributions have opposite signs, then Υ will be zero, the lowest possible value. This property thus measures how intuitively consistent (ignoring magnitude) the explanations are and complements the other properties mentioned above. Given its sole focus on the sign of the attributions, it also complements attributional robustness metrics [9, 39].
In Algorithm 1, we show the steps of our method LINEX. Its inputs are: the example to explain x, the black-box predictor f(.), the number of environments to be created k, an (l_∞) threshold γ > 0, an (l_1) threshold t > 0 and a convergence threshold ε > 0. The algorithm initializes w̃_i = 0 for all i ∈ {1, ..., k} and ∆ = 0, and uses k environment creation functions ξ_1(.), ..., ξ_k(.) whose creation is described in Section 4.2.2.
In the algorithm we iteratively learn a constrained least squares predictor for each environment, where the final (local) linear predictor is the sum of these individual predictors. In each iteration, when computing the contribution of environment e_i to the final summed predictor, the most recent contributions of the other predictors are summed and the residual is optimized subject to the constraints. The first constraint is a standard lasso-type constraint which tries to keep the final predictor sparse, as is also seen in LIME.
Why the l_∞ constraint? The second constraint is more unique and is an l_∞ constraint on the predictor of just the current environment. This constraint, as we prove in Section 4.3, is essential for obtaining robust predictors. To intuitively understand why this is the case, consider that we have two environments. In this case, if the optimal predictors for a feature in each environment have opposite signs, then the Nash equilibrium (NE) is when each predictor takes the value +γ or −γ as they try to force the sum to have the same sign as them. In other words, features where there is a disagreement in even the direction of their impact are eliminated by our method. LIME-type methods, on the other hand, would simply choose some form of average value of the predictors, which may be a risky choice, especially for actionability/recourse, given that the directions change so abruptly. On the other hand, if the optimal predictors for a feature in the two environments have the same sign, the lower absolute valued predictor would be chosen (assuming γ is greater), making it a careful choice. The reasoning for this and a discussion involving more than two environments are given in Section 4.3. The overall algorithm resembles a (simultaneous) game where each environment is a player trying to find the best predictor for its environment given all the other predictors and constraints. Also note that the optimization problem solved by each player is convex, as norms are convex.
In the standard IRM framework environments are assumed to be given; however, in our case of local explainability we have to decide how to produce them. We offer a few options for the environment creation functions ξ_i, ∀i ∈ {1, ..., k}, in Algorithm 1.
Random Perturbation: This is possibly the simplest approach and is similar to what LIME employs. We could perturb the input example by adding zero-mean Gaussian noise to create the base environment (used by LIME) and then perform bootstrap sampling to create the k different environments. This will efficiently create neighbors in each environment, although they may be unrealistic in the sense that they could correspond to low probability points w.r.t. the underlying distribution.
Realistic Generation/Selection: One could also create neighbors using data generators, as done in MeLIME [8], or select neighboring examples from the training set, as done in MAPLE [34], to create the base environment, following which bootstrap sampling could be done to form the k different environments. This approach may provide more realistic neighbors than the previous one, but may be much more computationally expensive. Other than bootstrapping, one could also oversample and try to find the optimal hard/soft partition through various clustering-type objectives [2, 10].
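To make the structure of Algorithm 1 concrete, the following is a minimal sketch (not the authors' implementation) of the two pieces just described: random-perturbation environments built by bootstrapping a single Gaussian base neighborhood, and iterated best responses in which each environment refits its own box-constrained least squares predictor against the residual left by the others. For brevity it omits the intercept, LIME's kernel weighting and the l_1 (sparsity) constraint on the summed predictor (the paper's analysis in Section 4.3 focuses on the l_∞ constraint under Assumption 2), and it solves each player's problem with a few projected gradient steps; all names and parameters here (make_environments, linex_sketch, sigma, n_rounds) are illustrative assumptions.

```python
import numpy as np

def make_environments(x, predict_fn, k=2, n_samples=100, sigma=0.3, seed=0):
    """Random-perturbation environments: a Gaussian neighborhood around x,
    then bootstrap resampling of that base neighborhood into k environments."""
    rng = np.random.default_rng(seed)
    base = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    y = predict_fn(base)  # black-box queries; shared by all environments
    envs = []
    for _ in range(k):
        idx = rng.integers(0, n_samples, n_samples)  # bootstrap sample
        envs.append((base[idx], y[idx]))
    return envs

def linex_sketch(envs, gamma, n_rounds=200):
    """Best-response dynamics: each environment updates its own predictor via
    projected gradient steps on the least squares loss of (sum of others + own),
    with its own coefficients clipped to the l_inf ball of radius gamma."""
    d = envs[0][0].shape[1]
    W = np.zeros((len(envs), d))
    for _ in range(n_rounds):
        for i, (Z, y) in enumerate(envs):
            others = W.sum(axis=0) - W[i]
            step = 1.0 / (np.linalg.norm(Z, 2) ** 2 + 1e-12)  # 1/L for this quadratic
            w = W[i].copy()
            for _ in range(25):
                grad = Z.T @ (Z @ (others + w) - y)
                w = np.clip(w - step * grad, -gamma, gamma)  # l_inf projection
            W[i] = w
    return W.sum(axis=0)  # final local explanation (feature attributions)
```

Given a black-box predict_fn and an example x, one would call envs = make_environments(x, predict_fn) followed by linex_sketch(envs, gamma) to obtain the attribution vector.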
In this section, we analyze the output of Algorithm 1 when there are two environments. The extension to multiple environments is discussed following this result, where the general intuition is still maintained but some special cases arise depending on whether there is an even or odd number of environments. To prove our main result we make two assumptions.
Assumption 1 concerns the environment generation distribution and is satisfied by the most standard way of creating neighborhoods/environments, where random Gaussian noise is used to create them as described in Section 4.2.2.
Assumption 2: t ≥ γd, where d is the dimensionality of the feature vector. Making this assumption ensures that we closely analyze the role of the l_∞ penalty, which is one of the main novelties in our method.
Definition 2: Let the explanation that each environment ξ_i arrives at for an example x based on unconstrained least squares minimization be w*_i, i.e.,

w*_i ∈ argmin_w E_{ξ_i}[(f(z) − w^T z)^2],   (3)

where the expectation is taken w.r.t. the environment generation distribution.
Theorem 1. Under Assumptions 1 and 2 and the definition in equation (3), the output of Algorithm 1 is given by

w̃_1 + w̃_2 = sign(w*_1) ⊙ min(|w*_1|, |w*_2|) ⊙ 1[sign(w*_1) = sign(w*_2)],

where ⊙ is the element-wise product, min, |.| and sign are applied element-wise, and 1 is the indicator function.
Proof Sketch. The above expression describes the NE of the game played between the two local environments, each trying to move the summed predictor towards its own least squares optimal solution. Given Assumptions 1 and 2, we witness the following behavior of our method. Let the i-th feature of the predictors w̃_1 and w̃_2 from Algorithm 1 be w̃_{1i} and w̃_{2i} respectively, and let the corresponding least squares optimal predictors for the i-th feature satisfy w*_{1i} > w*_{2i} and |w*_{1i}| > |w*_{2i}|. Then the two environments will push the ensemble predictor, w̃_{1i} + w̃_{2i}, in opposite directions during their turns, with the first environment increasing its weight, w̃_{1i}, and the second environment decreasing its weight, w̃_{2i}. Eventually, the environment with the higher absolute value (ξ_1, since |w*_{1i}| > |w*_{2i}|) reaches the boundary (w̃_{1i} = γ) and cannot move any further due to the l_∞ constraint. The other environment ξ_2 best responds, where it either hits the other end of the boundary (w̃_{2i} = −γ), in which case the weight of the ensemble for component i is zero, a case which occurs if w*_{1i} and w*_{2i} have opposite signs; or gets close to the other boundary while staying in the interior (w̃_{2i} = w*_{2i} − γ), in which case the weight of the ensemble for feature i is w*_{2i}, a situation which occurs if w*_{1i} and w*_{2i} have the same sign.
Implications of Theorem 1: The following are the main takeaways from Theorem 1: (1) If the signs of the unconstrained least squares explanations for the two environments differ for some feature, then the algorithm outputs zero as the attribution for that feature. (2) If the signs of the explanations for the two environments are the same, then the algorithm outputs the lesser magnitude of the two. These two properties are highly desirable from an algorithmic recourse or actionability perspective: the first biases us to not rely on features where the black-box function changes direction rapidly (unidirectionality), while the second provides a reserved estimate so that we do not incorrectly over-rely on a particular feature (stability).
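The rule described in these takeaways can be written in a few lines; the snippet below is an illustrative sketch with hypothetical per-environment least squares solutions, not the authors' implementation.

```python
import numpy as np

def theorem1_attribution(w1_star, w2_star):
    """Two-environment NE as described in the takeaways above: zero where the
    per-environment signs disagree, otherwise the smaller-magnitude coefficient
    with the shared sign."""
    agree = np.sign(w1_star) == np.sign(w2_star)
    return np.sign(w1_star) * np.minimum(np.abs(w1_star), np.abs(w2_star)) * agree

# hypothetical per-environment least squares explanations for 4 features
w1 = np.array([0.8, -0.5, 0.3, -0.1])
w2 = np.array([0.6,  0.4, 0.9, -0.7])
print(theorem1_attribution(w1, w2))  # -> [ 0.6  0.   0.3 -0.1]
```

The second feature (a sign disagreement) is zeroed out, while the other features take the more conservative of the two magnitudes, matching takeaways (1) and (2).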
Behavior for More than Two Environments: Given Assumptions 1 and 2, we now discuss the behavior of our method for more than two environments. If the number of environments is odd, then using logic similar to that in the proof sketch, one can see that the feature attribution would be equal to the median of the feature attributions across all the environments. Essentially, all environments with optimal least squares attributions above the median would be at +γ, while those below it would be at −γ. The one at the median would remain there, with no incentive for any environment to alter its attribution, making it a NE. This is a stable choice that is also likely to be faithful, as we have no more information to decide otherwise. On the other hand, if we have an even number of environments, the final attribution depends on the middle two environments in the same manner as the two-environment case proved in Theorem 1. Thus, if the optimal least squares attributions of the middle two environments have opposite signs, then the final attribution is zero; else it is the smaller of the two attributions in magnitude. This happens because the NE for the other environments is ±γ, depending on whether their optimal least squares attributions are above or below those of the middle two environments. This again is a stable and likely faithful choice, where unidirectionality is also preferred.
We test our method on four real world datasets covering all three modalities: IRIS (Tabular) [12], Medical Expenditure Panel Survey (Tabular) [1], Fashion MNIST (Image) [45], and Rotten Tomatoes reviews (Text) [33], with LIME-like random (rand) neighborhood generation, MeLIME-like realistic neighborhood generation (real), or MAPLE-like realistic neighborhood selection (mpl). A summary of the black-box classifier accuracies and the type of realistic perturbation used for each dataset is provided in Table 1. In all cases except FMNIST, which comes with its own test partition, we randomly split the datasets into an 80/20% train/test partition and average results for the local explanations over this test partition. We chose FMNIST over other higher resolution image datasets because MeLIME performs best on it, making it an acid test for our method. For LINEX we produce two environments, which are formed by performing bootstrap sampling on the base environment that is created either by rand, real or mpl type neighborhood generation as described above. Thus in all cases the union of the environments is the same as a single neighborhood used to produce explanations.
Figure 2: We observe that LINEX explanations capture important artifacts and thus exhibit significantly higher correlation with the original images for the same level of sparsity, where in aggregate too the correlations are high w.r.t. images belonging to a particular class, thus showcasing higher stability (i.e. high CAC), as seen in Table 3. More examples are shown in Appendix E for the competitors, making it a fair comparison.
The behavior of LINEX with more environments is in Appendix C. Given the neighborhood generation schemes, we compare LINEX with LIME, Smoothed LIME (S-LIME), MeLIME and MAPLE, where for S-LIME we average the explanations of LIME across the LINEX environments. SHAP's results are in Appendix F, since it is not a natural fit here.
Metrics: We evaluate using five simple metrics: Infidelity (INFD), Generalized Infidelity (GI), Coefficient Inconsistency (CI), Class Attribution Consistency (CAC) and Unidirectionality (Υ). The first two evaluate faithfulness, the next two stability, and the last goodness for recourse. Let D_t denote a (test) dataset with examples (x, y), where y_b(x) is the black-box model's prediction on x and y_e^x(x') is the prediction on a point x' (∈ X) made by the explanation model learned at x.
The feature attributions (or coefficients) of the explanation model at x are denoted by c_e^x, and N_x denotes the exemplar neighborhood of x, with |.|_card denoting cardinality. We now define the above metrics.
Infidelity (INFD): This is the most commonly used metric to validate the faithfulness of explanation models [36]. Here we define it as the mean absolute error (MAE) between the black-box and explanation model predictions across all the test points: INFD = (1/|D_t|_card) Σ_{(x,y)∈D_t} |y_b(x) − y_e^x(x)|.
Generalized Infidelity (GI): This metric has also been used in previous works [35] to measure the generalizability of local explanations to neighboring test points. It is defined as: GI = (1/|D_t|_card) Σ_{(x,y)∈D_t} (1/|N_x|_card) Σ_{x'∈N_x} |y_b(x') − y_e^x(x')|.
Coefficient Inconsistency (CI): This notion has been used before [23] to measure an explanation method's robustness. It can be defined as the MAE between the attributions of the test points and those of their respective neighbors: CI = (1/|D_t|_card) Σ_{(x,y)∈D_t} (1/|N_x|_card) Σ_{x'∈N_x} (1/d) ||c_e^x − c_e^{x'}||_1.
Class Attribution Consistency (CAC): For local explanations of classification black-boxes, we expect certain important features to be highlighted across most of the explanations of a class. This is codified by this metric, which is defined as follows: CAC = (1/|Y|_card) Σ_{y∈Y} r(µ_e^y, µ^y), where Y denotes the set of class labels in the dataset, µ^y the mean (vector) of all inputs in class y ∈ Y, µ_e^y the mean explanation for class y, and r the Pearson correlation coefficient. This metric quantifies the consistency between the important features for a class and the attributions provided by the explanations.
Coefficient Unidirectionality (Υ): This is motivated and defined in equation 2 in Section 4.1.
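For concreteness, here is a minimal sketch of how these metrics could be computed from per-example explanation coefficients. It assumes linear explanation models without intercepts, simple array containers, and illustrative function names (infd, gi, ci, cac, upsilon); it is not the authors' evaluation code.

```python
import numpy as np

def infd(preds_bb, preds_expl):
    """Infidelity: MAE between black-box and explanation-model predictions at test points."""
    return float(np.mean(np.abs(preds_bb - preds_expl)))

def gi(bb_fn, coefs, neighborhoods):
    """Generalized infidelity: MAE between black-box and explanation-model predictions
    over each test point's exemplar neighborhood (coefs[i] explains test point i)."""
    errs = [np.mean(np.abs(bb_fn(N) - N @ c)) for c, N in zip(coefs, neighborhoods)]
    return float(np.mean(errs))

def ci(coefs, neighbor_coefs):
    """Coefficient inconsistency: MAE between a test point's attributions and its neighbors'."""
    errs = [np.mean(np.abs(c[None, :] - C_nbrs)) for c, C_nbrs in zip(coefs, neighbor_coefs)]
    return float(np.mean(errs))

def cac(coefs, inputs, labels):
    """Class attribution consistency: Pearson correlation between the mean explanation
    and the mean input of each class, averaged over classes."""
    scores = []
    for y in np.unique(labels):
        mask = labels == y
        scores.append(np.corrcoef(coefs[mask].mean(0), inputs[mask].mean(0))[0, 1])
    return float(np.mean(scores))

def upsilon(coef_sets):
    """Unidirectionality (eq. 2): average over features of |mean sign| across m explanations,
    then averaged over examples (each element of coef_sets is an m x d array)."""
    return float(np.mean([np.abs(np.mean(np.sign(W), axis=0)).mean() for W in coef_sets]))
```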
We report the above metrics in Table 3. Each result in Table 3 is the mean ± standard error of the mean over five kernel sizes, generally of the form τ√d, where τ ∈ {0.05, 0.1, 0.25, 0.5, 0.75}. The test (exemplar) neighborhoods used for GI, CI and Υ are described in Appendix A. The results were generated on Linux machines with 56 cores and 242 GB RAM. More details regarding the exact perturbation schemes for LIME/MeLIME/MAPLE, the perturbation neighborhood sizes and the time taken by the different methods are in Appendices A and B.
Observations: We see that in terms of CAC, LINEX is better than the baselines in all cases, which indicates that on average the local explanations from LINEX highlight the important features characterizing the entire class, making them more stable. This is further verified by looking at the Υ and CI metrics, where LINEX is similar to or better than the others. For the GI and INFD metrics, the results are more evenly spread, which implies that LINEX's main advantage is obtaining stable and unidirectional explanations that are faithful to a similar degree. We show an example MeLIME and LINEX/realistic explanation on FMNIST in Figure 2, where we see that LINEX explanations are more coherent and also delineate the outline of the image more clearly compared to MeLIME explanations, leading to better CAC. Even on the text data we see more reasonable attributions in Table 2, where "masterpiece", "moving" and "audacious" are highlighted as the most important words indicative of positive sentiment in the three examples. An interesting observation is that when it comes to the stability metrics (CI and CAC) and unidirectionality, LINEX with even a random perturbation model is better than MeLIME in some cases. This is very promising, as it means LINEX could potentially be trusted without the need to generate realistic perturbations, which may be computationally expensive or not even possible. We also performed qualitative error analysis on the most challenging case, FMNIST, which is described in Appendix G. We see that even where LINEX has high infidelity, it invariably still focuses on salient features, ignoring superfluous features that may not be critical for correct identification but focusing on which may result in lower infidelity for the specific example. The goodness of the features identified by LINEX can be further verified by looking at other metrics such as GI, CAC, CI and Υ in Table 3, where it is either comparable to or better than MeLIME.
In this paper we have provided a method based on a game-theoretic formulation and inspired by the invariant risk minimization principle to provide faithful, stable and unidirectional explanations. We have defined the latter property and argued that it is somewhat of a necessity (though perhaps not sufficient) for recourse. We have theoretically shown that our method has a strong tendency to be stable and unidirectional, as it will mostly eliminate features where the black-box model's gradient changes abruptly and in other cases choose a conservative value. Empirically, we have verified this, outperforming competitors in the majority of cases on these metrics. An interesting observation is also that in some cases our method provides more stable and unidirectional explanations with just a random perturbation model relative to more expensive methods that use realistic neighbors. In the future, it would be worth experimenting with more varied strategies to form environments and, if possible, finding the optimal ones [10], which may lead to picking even more relevant features that are "causal" to the local decision. It is important to note that the query complexity (i.e. the number of times we query the black box to obtain an explanation) of LINEX is the same as that of LIME, since the union of the environments is the same as a LIME perturbation neighborhood. This is important in today's cloud-driven world where models may exist on different cloud platforms and posthoc explanations are an independent service in which each call to the model has an associated cost. In terms of running time for two environments, convergence was fast and the running time was approximately 2.5 times that of LIME (LINEX took 2.5 seconds on IRIS for 30 examples as opposed to 1 second for LIME, and 47 seconds on MEPS for 500 examples as opposed to 18 seconds for LIME), which is very similar to Smoothed LIME (S-LIME) (2.3 seconds on IRIS and 40 seconds on MEPS), which we still outperform in the majority of cases. Realistic neighborhood generation can be time consuming, especially for MeLIME, since generators have to be trained, which may take up to an hour using a single GPU for datasets such as FMNIST. After the generator is trained and the neighborhood sampled, MeLIME takes the same amount of time as LIME since the model fitting procedure is the same. MAPLE took 1.
We describe the datasets and the hyperparameters used for each. We set perturbation neighborhood sizes of 10 (IRIS), 500 (MEPS), 100 (FMNIST-random), 500 (FMNIST-realistic) and 100 (Rotten Tomatoes) for generating local explanations. We also use 3, 10, 10, 5 as exemplar neighborhood sizes to compute the GI, CI and Υ metrics for the four datasets respectively. We use 5-sparse explanations for all cases except FMNIST with realistic perturbations, where we follow MeLIME and generate a dense explanation using a ridge penalty with a penalty multiplier value of 0.001. The l_∞ bound γ in Algorithm 1 is set as the maximum absolute value of the linear coefficients computed by running LIME/MeLIME in the two individual environments.
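As a small illustration of this last choice (and only of that), the snippet below computes γ from per-environment coefficient vectors; obtaining those vectors from LIME/MeLIME fits, and the helper name choose_gamma, are assumptions here.

```python
import numpy as np

def choose_gamma(env_coefficients):
    """gamma = largest absolute linear coefficient across the per-environment
    LIME/MeLIME fits (env_coefficients: one coefficient vector per environment)."""
    return float(np.max(np.abs(np.vstack(env_coefficients))))

# e.g., gamma = choose_gamma([lime_env1_coefs, lime_env2_coefs])
```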
Please look at the IRIS dataset first, since it contains some of the common details used across the others.
IRIS (Tabular): This dataset has 150 instances with four numerical features representing the sepal and petal width and length in centimeters. The task is to classify instances of Iris flowers into three species: setosa, versicolor, and virginica. A random forest classifier was trained with a train/test split of 0.8/0.2 and yielded a test accuracy of 93%. We provide local explanations for the prediction probabilities for the class setosa. For both random and realistic perturbations, we use a perturbation neighborhood size of n. For random perturbations, we used the same approach followed by LIME and sample from a Gaussian around each data point. Realistic perturbations (with the same number n) were generated using KDEGen [8], a kernel density estimator (KDE) with a Gaussian kernel fitted on the training dataset to sample data around a point. For both random and realistic perturbations, we weight the neighborhood using a Gaussian kernel of width τ√d.
Rotten Tomatoes (Text): This dataset contains 10662 movie reviews from the Rotten Tomatoes website along with their sentiment polarity, and the task is to classify the sentiment of the reviews as positive or negative. The review sentences were vectorized using CountVectorizer and TfidfTransformer, and an sklearn Naive Bayes classifier was fitted on the training dataset, which yielded a test accuracy of 75%. Explanations are generated for the prediction probabilities corresponding to the predicted class for each example. Realistic perturbations were generated using Word2VecGen [8], wherein word2vec embeddings are first trained using the training corpus and new sentences are generated by randomly replacing a sentence word with one whose distance in the embedding space lies within the radius of the neighbourhood. For both random and realistic perturbations, n was chosen from {25, 50, 75, 100}. The kernel sizes were {0.42, 1.06, 2.12, 3.18} for random perturbations (kernel size 0.21 resulted in numerical issues), and {0.21, 0.42, 1.06, 2.12, 3.18} for realistic perturbations. We use k = {2, 3, 4, 5} for this dataset.
We present results with all hyperparameter combinations for random and realistic perturbations. Results for LIME with random perturbations (LIME), smoothed LIME (S-LIME), LINEX with random perturbations (LINEX/rand), MeLIME (MeLIME), LINEX with MeLIME-like realistic neighborhoods (LINEX/real), MAPLE (MAPLE), and LINEX with MAPLE-like realistic neighborhoods (LINEX/mpl) are presented in Figures 4-18. The legend for these figures is given in Figure 3. For the four datasets, we perform ablations by varying one of: perturbation neighborhood size (Figures 4-8), number of environments (Figures 9-13), and kernel width (Figures 14-18). Each point in these figures is averaged over all possible values of the two parameters that are not ablated. For example, each point in Figure 4 is averaged over all possible values of kernel width and number of environments for a given perturbation neighborhood size. Standard errors of the mean are also plotted in the same color with lower opacity. Lower values of Infidelity (INFD), Generalized Infidelity (GI) and Coefficient Inconsistency (CI) are better, whereas for Unidirectionality (Υ) and Class Attribution Consistency (CAC) higher values are better. Figures 4-8 show ablations with respect to perturbation neighborhood sizes.
Considering all datasets, the stability/recourse metrics (CI, Υ, CAC) are clearly better for LINEX compared to its counterparts. For the LINEX methods (LINEX/rand, LINEX/real, LINEX/mpl), the metrics generally get better or stay approximately the same as the perturbation neighborhood size increases, in keeping with the intuition that larger perturbation neighborhood sizes should produce explanations that are more stable in the exemplar neighborhood. Υ for FMNIST is somewhat of an exception, which could be because of the high dimensionality as well as the quality of MeLIME perturbations. Turning to the fidelity metrics (INFD and GI) on the tabular datasets, we see that the results still favor LINEX, but less heavily compared to the stability/recourse metrics. This is in line with what we observe in Table 3. On IRIS and MEPS, LINEX is close to or outperforms the corresponding baselines in the GI measure (except for LINEX/mpl on MEPS). This gap closes a bit with INFD, but we note that GI is a better measure since it estimates how faithful explanations are in an exemplar neighborhood. With the text dataset, LINEX variants are slightly more favored, whereas with the image dataset the baselines have an edge. Considering Figures 9-13, we see that the variations with respect to the number of environments are overall less stark for the LINEX variants. Note that, except for S-LIME, the other baselines do not use multiple environments and hence stay constant. The slight variations in MAPLE are due to the effect of random seeds. In the stability/recourse metrics, LINEX variants again emerge as the clear winner across datasets. With the faithfulness metrics (GI and INFD), LINEX variants generally perform better on the text dataset, whereas the baselines perform better on the image dataset. Finally, we study the variation of the performance measures with respect to kernel width in Figures 14-18. We see that the stability/recourse metrics flatten out in all cases at large kernel widths. This behaviour holds for the faithfulness metrics (GI and INFD) as well, except in some cases. The GI and INFD measures also increase before they flatten out, since the fit becomes poorer at larger kernel widths. The stability/recourse metrics become better or remain approximately the same, since explanations generally improve or preserve their stability properties as kernel widths increase. Note that very small kernel widths can lead to unexpected behavior that does not fit the trend, as seen with the tabular datasets, since explanations can become hyper-local. MAPLE and LINEX/mpl stay the same at different kernel widths since they use a different weighting scheme. As with the other ablations, we see that LINEX variants are similar or better on the stability/recourse metrics overall, while with the faithfulness metrics the results are more mixed.
We show feature attributions for individual example images with MeLIME and LINEX with MeLIME perturbations in Figure 19. In Figure 20 we show class-wise mean feature attributions along with the mean images. Clearly, LINEX explanations provide more meaningful feature attributions.
We perform error analysis for LINEX to gain a better understanding of the method. We choose the FMNIST dataset for this since: (a) it is the highest dimensional dataset (784 dimensions) that we have; (b) LINEX/real underperforms MeLIME in terms of the INFD measure here (see Table 3) more heavily than on the other datasets; (c) it is easier to visualize the explanations for this dataset.
We start by observing that even though LINEX/real underperforms on the INFD metric, the gap is not so great on the GI metric, which suggests that MeLIME may be overfitting its explanations here. We also note that in terms of the CI, Υ, and CAC metrics, LINEX/real clearly outperforms MeLIME. We now choose a sample of images from the dataset where LINEX/real has the highest instance-level infidelity numbers and display them in Figure 21. Just looking at the explanations and the corresponding original images visually, it is evident that LINEX/real highlights prominent features like the sleeves and collar of a shirt, the handles of bags, and the outlines of boots/shoes, even though the infidelity values are high. However, MeLIME misses out on some of these prominent features and focuses only on optimizing the local fit. The fact that LINEX zeroes in on important features also provides additional evidence for the closeness of the GI metrics between the two methods, and for the better performance of LINEX/real on the CI, Υ, and CAC metrics. This conclusion is also verified when we look at the performance of LINEX at a class level. In Figure 22, we see two classes: one where the infidelity of LINEX is low (the Trousers class) and the other where its infidelity is high (the Shirt class). Since the Trousers class has examples with fewer superfluous features (viz. varied designs), focusing on which might reduce infidelity but is not critical for determining the class, LINEX does better in terms of infidelity on the former. However, although infidelity is higher for the Shirt class, LINEX does much better on the other metrics such as GI, CAC, CI and Υ, indicating that it truly focuses on robust features.
Figure 21: We show the Pearson correlation coefficient between the feature attributions and the mean of the original images from the respective classes (r), along with instance-level infidelity (INFD) measures. LINEX highlights important features like the stripes on a t-shirt, the handles of bags, and the outlines of boots/shoes more prominently, while MeLIME seems to overfit to the data while missing out on highlighting some of these key features prominently.
Figure 22: Infidelity is lower for the Trousers class for LINEX as compared with the Shirt class. A reason for this is that the trousers are plainer, with fewer superfluous features such as the varied designs seen on shirts. Since LINEX focuses on robust features, focusing excessively on the designs is not critical for it to determine a shirt, although focusing on these designs might reduce infidelity. The advantage of relying on robust features is, however, apparent when we look at other metrics such as GI, CAC, CI and Υ, as seen in Table 3, where LINEX is much closer to or superior to MeLIME.
References:
Data clustering: Algorithms and applications
Linear regression games: Convergence guarantees to approximate out-of-distribution solutions
Invariant risk minimization game
To trust or not to trust an explanation: Using LEAF to evaluate local linear XAI methods
Fairwashing explanations with off-manifold detergent
Invariant risk minimization
MeLIME: Meaningful local explanation for machine learning models
Environment inference for invariant learning
A social equilibrium existence theorem
UCI machine learning repository
Explanations based on the missing: Towards contrastive explanations with pertinent negatives
Model agnostic contrastive explanations for structured data
Enhancing simple models by exploiting what they already know
Improving simple models with confidence profiles
Strategies and games: Theory and practice
Pathologies of neural models make interpretations difficult
Asymmetric Shapley values: Incorporating causal knowledge into model-agnostic explainability
Interpretation of neural networks is fragile
Explainable artificial intelligence (XAI)
Efficient data representation by selecting prototypes with importance weights
Robustness in machine learning explanations: Does it matter?
Causal Shapley values: Exploiting causal knowledge to explain individual predictions of complex models
Distilling the knowledge in a neural network
Algorithmic recourse: From counterfactual explanations to interventions
Examples are not enough, learn to criticize! Criticism for interpretability
Robust and stable black box explanations. ICML
The LRP toolbox for artificial neural networks
A unified approach to interpreting model predictions
Explanation in artificial intelligence: Insights from the social sciences
Interpretable machine learning
Thumbs up? Sentiment classification using machine learning techniques
Model agnostic supervised local explanations
Model agnostic multilevel explanations
"Why should I trust you?": Explaining the predictions of any classifier
Updating our approach to misleading information
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
Enhanced regularizers for attributional robustness
Constraint-driven explanations of black-box ML models
Deep inside convolutional networks: Visualising image classification models and saliency maps
Axiomatic attribution for deep networks
Causal mediation analysis for interpreting neural NLP: The case of gender bias
Statistical stability indices for LIME: Obtaining reliable explanations for machine learning models
Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms
Analysis: Article 29 working party guidelines on automated decision making under GDPR
On the (in)fidelity and sensitivity of explanations
The limitations of adversarial training and the blind-spot attack
Why should you trust my explanation? ICML-AI for Social Good
BayLIME: Bayesian local interpretable model-agnostic explanations