Kernel-based ANOVA decomposition and Shapley effects -- Application to global sensitivity analysis
Sébastien da Veiga
2021-01-14

Global sensitivity analysis is the main quantitative technique for identifying the most influential input variables in a numerical simulation model. In particular, when the inputs are independent, Sobol' sensitivity indices attribute a portion of the variance of the output of interest to each input and to all possible interactions in the model, thanks to a functional ANOVA decomposition. On the other hand, moment-independent sensitivity indices focus on the impact of input variables on the whole output distribution instead of the variance only, thus providing complementary insight into the inputs/output relationship. Unfortunately they do not enjoy the nice decomposition property of Sobol' indices and are consequently harder to analyze. In this paper, we introduce two moment-independent indices based on kernel embeddings of probability distributions and show that the RKHS framework used for their definition makes it possible to exhibit a kernel-based ANOVA decomposition. This is the first time such a desirable property is proved for sensitivity indices apart from Sobol' ones. When the inputs are dependent, we also use these new sensitivity indices as building blocks to design kernel-embedding Shapley effects which generalize the traditional variance-based ones used in sensitivity analysis. Several estimation procedures are discussed and illustrated on test cases with various output types such as categorical variables and probability distributions. All these examples show their potential for enhancing traditional sensitivity analysis with a kernel point of view.

In the computer experiments community, global sensitivity analysis (GSA) has now emerged as a central tool for exploring the inputs/outputs relationship of a numerical simulation model. Starting from the pioneering work of Sobol (Sobol', 1993) and Saltelli (Saltelli et al., 1999) on the interpretation and estimation of Sobol' indices, the last two decades have been a fertile ground for the development of advanced statistical methodologies and extensions of the original Sobol' indices: new estimation procedures (Da Veiga et al. (2009), Da Veiga and Gamboa (2013), Solís (2019), Gamboa et al. (2020)), multivariate outputs with aggregation and dimensionality reduction (Lamboni et al., 2011), goal-oriented sensitivity analysis (Fort et al., 2016) or moment-independent sensitivity measures (Borgonovo (2007), Da Veiga (2015)), among others. At the heart of the popularity of Sobol' indices is the fundamental functional analysis of variance (ANOVA) decomposition, which opens the path for their interpretation as parts of the output variance and makes it possible to pull apart the input main effects and all their potential interactions, up to their whole influence measured by total Sobol' indices. This decomposition however has two drawbacks. First, it is only valid when the inputs are independent, although some generalizations have been investigated (Chastaing et al., 2012). Secondly, it only concerns the original Sobol' indices, meaning that it is not possible to split the input effects with goal-oriented or moment-independent sensitivity analysis in general.
When the inputs are dependent, total Sobol' indices can still be used to discriminate them when the objective is to build a surrogate model of the system, and other Sobol'-related indices have also been proposed for interpretability (Mara et al., 2015) . But the major breakthrough happened when Shapley effects have been defined for GSA by Owen (Owen, 2014) . Indeed due to their mathematical foundations from game theory, Shapley effects do not require the independence assumption to enjoy nice properties: each input is assigned a Shapley effect lying between 0 and 1, while the sum of all effects is equal to 1. For a given input all interactions and correlations with other ones are mixed up, but the interpretation as parts of the output variance is kept and input rankings are still sensible. For these reasons Shapley effects are now commonly thought as central importance measures in GSA for dealing with dependence, and their estimation has been thoroughly investigated recently (Song et al., 2016; Iooss and Prieur, 2019; Broto et al., 2020; Plischke et al., 2020) . From an interpretability perspective, other importance measures introduced in the context of goal-oriented and moment-independent sensitivity analysis have proven useful to gain additional insights on a given model. For example quantile-oriented (Fort et al., 2016; Maume-Deschamps and Niang, 2018) or reliability-based measures (Ditlevsen and Madsen, 1996) can help understand which inputs lead to the failure of the system, while optimization-related indices enable dimension reduction for optimization problems (Spagnol et al., 2019) . On the other hand, moment-independent sensitivity indices, which quantify the input impact on the whole output distribution instead of the variance only, are powerful complementary tools to grasp further types of input influence. Among them are the f-divergence indices (Da Veiga (2015) , Rahman (2016) ) with particular cases corresponding to the sensitivity index introduced by Borgonovo (Borgonovo, 2007) and the class of kernel-based sensitivity indices, which rely on the embedding of probability distributions in reproducing kernel Hilbert spaces (RKHS) (Da Veiga, 2015 . Unfortunately an ANOVA-like decomposition is not available yet for any of these indices even in the independent setting: as a consequence this limits the interpretation of their formulation for interactions since without ANOVA it is not possible to remove the main effects, and at the same time the natural normalization constant (equivalent to the total output variance for Sobol' indices) is not known. In this paper we focus on a general RKHS framework for GSA and prove that an ANOVA decomposition actually exists for two previously introduced kernel-based sensitivity indices in the independent setting. To the best of our knowledge this is the first time such a decomposition is available for other sensitivity indices other than the original Sobol' ones. Not only this makes it possible to properly define higher-order indices, but this further gives access to their natural normalization constant. We also demonstrate that these measures are generalizations of Sobol' indices, in the sense that they are recovered with specific kernels. But the RKHS point of view additionally comes with a large body of work on several kernels specifically designed for particular target applications, such as multivariate, functional, categorical or time-series case studies, thus defining a unified framework for many real GSA test cases. 
When inputs are not independent, we finally introduce a kernel-based version of Shapley effects similar to the ones proposed by Owen. The paper is organized as follows. Section 2 first briefly introduces the traditional functional ANOVA decomposition with Sobol' indices and moment-independent indices. In Section 3 we then discuss the elementary tools from RKHS theory needed to build kernel-based sensitivity indices which are at the core of this work. We further investigate these indices and prove they also arise from an ANOVA decomposition. In addition we define Shapley effects with kernels and the benefits of the RKHS framework for GSA are studied through several examples. Several estimation procedures are then discussed in Section 4, where we generalize some of the recent estimators for Sobol' indices. Finally, Section 5 illustrates the potential of these sensitivity indices with various numerical experiments corresponding to typical GSA applications. Let η : X 1 × . . . × X d → Y denote the numerical simulation model, which is a function of d input variables X l ∈ X l , l = 1, . . . , d, and Y ∈ Y the model output given by Y = η(X 1 , . . . , X d ). In standard GSA the inputs X l are further assumed to be independent with known probability distributions P X l , meaning that the vector of inputs X = (X 1 , . . . , X d ) is distributed as P X = P X 1 ⊗ . . . ⊗ P X d . For any subset A = {l 1 , . . . , l |A| } ∈ P d of indices taken from {1, . . . , d} we denote X A = X l 1 , . . . , X l |A| ∈ X A = X l 1 × . . . × X l |A| the vector of inputs with indices in A and X −A the complementary vector with indices not in A. In this setting, the main objective of global sensitivity analysis is to quantify the impact of any group of input variables X A on the model output Y . In this section we first recall the functional ANOVA decomposition and the definition of Sobol' indices, which fall into the category of variance-based indices. Sensitivity indices that account for the whole output distribution, referred to as moment-independent indices, are then discussed. Note that in the following, we adopt the notation S for a properly normalized sensitivity index, while S will stand for an unnormalized index, where normalization is to be understood as an end result from an ANOVA-like decomposition. Here we first assume that Y ∈ Y ⊂ R is a square integrable scalar output. If the inputs are independent, the function η can then be decomposed according to the ANOVA decomposition: Theorem 1 (ANOVA decomposition (Hoeffding, 1948; Antoniadis, 1984) ). Assume that η : with η A depending only on the variables X A and satisfying Furthermore, (b) implies that all the terms η A in the decomposition are mutually orthogonal. As a consequence, the output variance can be decomposed as When this decomposition holds, it is then straightforward to quantify the influence of any subset of inputs X A on the output variance by normalizing each term with Var Y . Definition 1 (Sobol' indices (Sobol', 1993) ). Under the same assumptions of Theorem 1, the Sobol' sensitivity index associated to a subset A of input variables is defined as while the total Sobol' index associated to A is In particular, the first-order Sobol' index of an input X l writes and its total Sobol' index is given by Finally, the ANOVA decomposition (1) readily provides an interpretation of Sobol' indices as a percentage of explained output variance, i.e. 
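For reference, the decomposition and indices referred to just above can be written explicitly as follows; this is a standard restatement in the notation of this section, not a verbatim excerpt of the original manuscript. The ANOVA decomposition of Theorem 1 reads

$$\eta(X) = \sum_{A \subseteq \{1,\dots,d\}} \eta_A(X_A), \qquad \eta_\emptyset = \mathbb{E}[Y], \qquad \mathbb{E}\big[\eta_A(X_A) \,\big|\, X_B\big] = 0 \ \text{ for all } B \subsetneq A,$$

$$\operatorname{Var} Y = \sum_{\emptyset \neq A \subseteq \{1,\dots,d\}} V_A, \qquad V_A = \operatorname{Var}\big(\eta_A(X_A)\big),$$

and the Sobol' indices of Definition 1 are

$$S_A = \frac{V_A}{\operatorname{Var} Y}, \qquad S_A^{\mathrm{tot}} = \sum_{B:\, B \cap A \neq \emptyset} S_B, \qquad S_l = \frac{\operatorname{Var}\big(\mathbb{E}[Y \mid X_l]\big)}{\operatorname{Var} Y}, \qquad S_l^{\mathrm{tot}} = \frac{\mathbb{E}\big[\operatorname{Var}(Y \mid X_{-l})\big]}{\operatorname{Var} Y},$$

so that $\sum_{\emptyset \neq A \subseteq \{1,\dots,d\}} S_A = 1$, which is the summation property referred to as (5) below.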
With these definitions, the impact of each input variable can be quantitatively assessed: the first-order Sobol' index measures the main effect of an input, while the total Sobol' index aggregates all its potential interactions with other inputs. As an illustration, an input variable with low total Sobol' index is thus unimportant and one can freeze it at a default value. When for a given input both first-order and total Sobol' indices are close, this means that this input does not have interactions, while a large gap indicates strong interactions in the model. Furthermore, due to the summation property (5), the interpretation of Sobol' indices as shares of the output variance is an efficient tool for practitioners who aim at understanding precisely the impact and interactions of the inputs of a model on the output. For example the interaction of two inputs X l and X l writes Note that to compute this interaction one subtracts the first-order indices S l and S l from the sensitivity index of the subset (X l , X l ) in order to remove the main effects and highlight the interaction only. Despite their appealing properties, Sobol' indices rank the input variables according to their impact on the output variance only. In a parallel line of work, several authors proposed to investigate instead how inputs influence the whole output distribution, thus introducing a different insight on the inputs/outputs relationship. The starting point (Baucells and Borgonovo (2013) , Da Veiga (2015)) is to consider that a given input X l is important in the model if the probability distribution P Y of the output changes when X l is fixed, i.e. if the conditional probability distribution P Y |X l is different from P Y . More precisely, if d(·, ·) denotes a dissimilarity measure between probability distributions, one can define a sensitivity index for variable X l given by Such a general formulation is flexible, in the sense that many choices for d(·, ·) are available. As an illustration, it is straightforward to show that the unnormalized first-order Sobol' index is retrieved with the naive dissimilarity measure d(P, Q) = (E ξ∼P (ξ) − E ξ∼Q (ξ)) 2 , which compares probability distributions only through their means. A large class of dissimilarity measures is also given by the so-called f-divergence family: assuming that (X l , Y ) has an absolute continuous distribution with respect to the Lebesgue measure on R 2 , the f-divergence between P Y and P Y |X l is where f is a convex function such that f (1) = 0 and p Y and p Y |X l are the probability distribution functions of Y and Y |X l , respectively. The corresponding sensitivity index is then with p X l and p X l ,Y the probability distribution functions of X l and (X l , Y ), respectively. This index has been studied for example in Da Veiga (2015) and Rahman (2016) . A notable special case is obtained with the total-variation distance corresponding to f (t) = |t − 1|, leading to the sensitivity index proposed by Borgonovo (Borgonovo, 2007) : Obviously, definition (7) can be easily extended to measure the influence of any subset of inputs . But in this case, since there is no ANOVA-like decomposition, there is no longer the guarantee that an interaction index defined following (6): really measures the pure interaction between X l and X l . Therefore the interpretation of higherorder moment-independent sensitivity indices is cumbersome. 
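For readability, the displayed formulas of this passage (the Sobol' interaction of Eq. (6) and the f-divergence definitions) can be restated as follows; these are standard expressions from the cited literature, not verbatim copies of the manuscript. The interaction of two inputs $X_l$ and $X_{l'}$ is

$$S_{l,l'} = \frac{\operatorname{Var}\big(\mathbb{E}[Y\mid X_l, X_{l'}]\big)}{\operatorname{Var} Y} - S_l - S_{l'},$$

while one common form of the f-divergence sensitivity index is

$$\mathcal{S}_l^{f} = \mathbb{E}_{X_l}\!\left[\int_{\mathcal{Y}} f\!\left(\frac{p_Y(y)}{p_{Y\mid X_l}(y)}\right) p_{Y\mid X_l}(y)\, dy\right] = \int\!\!\int f\!\left(\frac{p_Y(y)\, p_{X_l}(x)}{p_{X_l,Y}(x,y)}\right) p_{X_l,Y}(x,y)\, dx\, dy,$$

which, for the total-variation choice $f(t) = |t-1|$, recovers (up to the conventional factor $1/2$) Borgonovo's index

$$\delta_l = \frac{1}{2}\, \mathbb{E}_{X_l}\!\left[\int_{\mathcal{Y}} \big| p_Y(y) - p_{Y\mid X_l}(y) \big| \, dy\right].$$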
On the other hand, even if normalization constants have been proposed through general inequalities on f-divergences (Borgonovo (2007) , Rahman (2016) ), the lack of an ANOVA decomposition once again impedes the definition of a natural normalization constant equivalent to the output variance for Sobol' indices. Recently, new moment-independent indices built upon the framework of RKHS embedding of probability distributions have also been investigated (Da Veiga, 2015 . Though originally introduced as an alternative to reduce the curse of dimensionality and make the most of the vast kernel literature, we will see in what follows that they actually exhibit an ANOVA-like decomposition and can therefore be seen as a general kernelized version of Sobol' indices. Before introducing the kernel-based sensitivity indices, we first review some elements of the RKHS embedding of probability distributions , which will serve as a building block for their definition. We first introduce a RKHS H of functions X → R with kernel k X and dot product ·, · H . The kernel mean embedding µ P ∈ H of a probability distribution P on X is defined as Smola et al. (2007) . The representation µ P is appealing because, if the kernel k X is characteristic, the map P → µ P is injective (Sriperumbudur et al., 2009 (Sriperumbudur et al., , 2010 . Consequently, the kernel mean embedding can be used in lieu of the probability distribution for several comparisons and manipulations of probability measures but using only inner products or distances in the RKHS. For example, a distance between two probability measures P 1 and P 2 on X can simply be obtained by computing the distance between their representations in H, i.e. MMD(P 1 , P 2 ) = µ P 1 − µ P 2 H , which is a distance if the kernel k X is characteristic (Sriperumbudur et al., 2009 (Sriperumbudur et al., , 2010 . This distance is called the maximum mean discrepancy (MMD) and it has been recently used in many applications (Muandet et al., 2012; Szabó et al., 2016) . Indeed, using the reproducing property of a RKHS one may show (Song, 2008) that where ξ, ξ ∼ P 1 and ζ, ζ ∼ P 2 with ξ, ξ , ζ, ζ independent, this notation being used throughout the rest of the paper. This means that the MMD can be computed with expectations of kernels only, unlike other distances between probability distributions which will typically require density estimation. Another significant application of kernel embeddings concerns the problem of measuring the dependence between random variables. Given a pair of random vectors (U, V) ∈ X × Y with probability distribution P UV , we define the product RKHS H . A measure of the dependence between U and V can then be defined as the distance between the mean embedding of P UV and P U ⊗ P V , the joint distribution with independent marginals P U and P V : This measure is the so-called Hilbert-Schmidt independence criterion (HSIC, see Gretton et al. (2005a,b) ) and can be expanded as where (U , V ) is an independent copy of (U, V). Once again, the reproducing property implies that HSIC can be expressed as expectations of kernels, which facilitates its estimation when compared to other dependence measures such as the mutual information. The RKHS framework introduced above can readily be used to define kernel-based sensitivity indices. The first approach relies on the MMD, while the second one builds upon HSIC. We discuss them below and show that in particular both of them admit an ANOVA-like decomposition. 
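Since both MMD and HSIC only involve expectations of kernels, their plug-in estimation from samples is immediate. The following minimal R sketch computes the biased (V-statistic) versions for scalar variables with Gaussian kernels; all names are illustrative and this is not the paper's supplementary code.

```r
gauss_kernel <- function(u, v, sigma) {
  # Gram matrix k(u_i, v_j) = exp(-(u_i - v_j)^2 / (2 sigma^2)) for scalar samples
  exp(-outer(u, v, "-")^2 / (2 * sigma^2))
}

mmd2_hat <- function(x, z, sigma = 1) {
  # Plug-in estimate of MMD^2(P, Q) from samples x ~ P and z ~ Q
  mean(gauss_kernel(x, x, sigma)) + mean(gauss_kernel(z, z, sigma)) -
    2 * mean(gauss_kernel(x, z, sigma))
}

hsic_hat <- function(u, y, sigma_u = 1, sigma_y = 1) {
  # Biased V-statistic estimate of HSIC(U, Y) via doubly-centered Gram matrices
  n <- length(u)
  K <- gauss_kernel(u, u, sigma_u)
  L <- gauss_kernel(y, y, sigma_y)
  H <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  sum((H %*% K %*% H) * L) / n^2
}
```

Unbiased U-statistic counterparts simply exclude the diagonal terms; both versions only require expectations of kernels, in contrast with density-based dissimilarity measures.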
The first natural idea is to come back to the general formulation for moment-independent indices (7) and use the MMD as the dissimilarity measure to compare P Y and P Y |X l as proposed in Da Veiga (2016): where we have defined a RKHS G of functions Y → R with kernel k Y . More generally, we can also consider the unnormalized MMD-based sensitivity index for a group of variables X A given by , provided the following assumption holds: First note that if we focus on the scalar output case Y ⊂ R with the linear kernel k Y (y, y ) = yy , we have that is, we recover the unnormalized Sobol' index for X A . S MMD A can thus be seen as a kernelized version of Sobol' indices since the latter can be retrieved with a specific kernel. However it is obvious that since the linear kernel is not characteristic, the MMD in this case is not a distance, which means that S MMD A is no longer a moment-independent index. To make another connection with Sobol' indices, we now recall Mercer's theorem, a notable representation theorem for kernels. Theorem 2 (Mercer, see Aubin (2000) ). Suppose k Y is a continuous symmetric positive definite kernel on a compact set Y and consider the integral operator T k Y : Then there is an orthonormal basis {e r } of L 2 (Y) consisting of eigenfunctions of T k Y such that the corresponding sequence of eigenvalues {λ r } are non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on Y and k Y has the following representation where the convergence is absolute and uniform. Assume now that the output Y ∈ Y with Y a compact set, meaning that Mercer's theorem holds. Then k Y admits a representation where φ r (y) = √ λ r e r (y) are orthogonal functions in L 2 (Y). In this setting we can write Now, since the convergence of the series is absolute, we can interchange the expectations and the summations to get In other words, the MMD-based sensitivity index S MMD A generalizes the Sobol' one in the sense that it measures the impact of the inputs not only on the conditional expectation of the output, but on a possibly infinite number of transformations φ r of the output, given by the eigenfunctions of the kernel. We can now state the main theorem of this section on the ANOVA-like decomposition for S MMD A . Recall that the variance decomposition (1) states that the variance of the output can be decomposed as Var Y = A⊆P d V A where each term is given by The MMD-based equivalent is obtained with the following theorem. Theorem 3 (ANOVA decomposition for MMD). Under the same assumptions of Theorem 1 (in particular, the random vector X has independent components) and with Assumption 1, denote where Y is an independent copy of Y . Then the total MMD can be decomposed as where each term is given by The proof is given in Appendix A.1. Theorem 3 is very similar to the ANOVA one given in (1): one can note that the total variance of the output is replaced by a generalized variance MMD 2 tot defined by the kernel, and that each subset effect is obtained by removing lesser order ones in the MMD distance of the conditional distributions (instead of the variance of the conditional expectations in the ANOVA). The following corollary states that these two decompositions coincide when the kernel is chosen as the linear one. Corollary 1. When Y ∈ Y ⊂ R and k Y (y, y ) = yy in Theorem 3, the decomposition is identical to the decomposition (1), which means that Thanks to Theorem 3 we can now define properly normalized MMD-based indices. 
Definition 2 (MMD-based sensitivity indices). In the frame of Theorem 3, let A ⊆ P d . The normalized MMD-based sensitivity index associated to a subset A of input variables is defined as while the total MMD-based index associated to A is From Theorem 3, we have the fundamental identity providing the interpretation of MMD-based indices as percentage of the explained generalized variance MMD 2 tot : Finally, we exhibit a generalized law of total variance for MMD 2 tot which will yield another formulation for the total MMD-based index. Proposition 1 (Generalized law of total variance). Assuming Assumption 1 holds, we have The proof is to be found in Appendix A.2. This is a generalization in the sense that the total variance is replaced by MMD 2 tot , the conditional variance by In particular, all these terms reduce to the ones in the classical law of total variance if one uses the linear kernel k Y (y, y ) = yy in the scalar case. This gives the following corollary. Corollary 2 (Other formulation of total MMD-based index). In the frame of Theorem 3, we have Another approach for combining kernel embeddings with sensitivity analysis consists in directly using HSIC as a sensitivity index. For example Da Veiga (2015) considers the unnormalized index and provided the following assumption holds: In Da Veiga (2015) an empirical normalization inspired by the definition of the distance correlation criterion (Székely et al., 2007) was also proposed. But similarly to the MMD decomposition above, it is actually possible to exhibit an ANOVA-like decomposition for HSIC, thus providing a natural normalization constant. The main ingredient is an assumption on the kernel k X associated to the input variables. where for each l = 1, . . . , d, k l (·, ·) is the reproducing kernel of a RKHS F l of real functions depending only on variable x l and such that 1 / ∈ F l . In addition, for all l = 1, . . . , d and ∀x l ∈ X l , we have The first part (10) of Assumption 3 may seem stringent, however it can be easily fulfilled by using univariate Gaussian kernels since they define a RKHS which does not contain constant functions (Steinwart et al., 2006) . On the contrary, the second assumption (11) is more subtle. It requires using kernels defining a so-called RKHS of zero-mean functions (Wahba et al., 1995) . A prominent example of such RKHS is obtained if (a) all input variables are uniformly distributed on [0, 1] and (b) the univariate kernels are chosen among the Sobolev kernels with smoothness parameter r ≥ 1: where B j is the Bernoulli polynomial of degree j. Even though applying a preliminary transformation on the inputs in order to get uniform variables is conceivable (with e.g. the probability integral transform), a more general and elegant procedure has been proposed by Durrande et al. (2012) . Starting from an arbitrary univariate k(·, ·), they build a zero-mean kernel k D 0 (·, ·) given by where k D 0 (·, ·) satisfies ∀x, k D 0 (x, t)dP(t) = 0. Interestingly, they also show that the RKHS H 0 associated to k 0 (·, ·) is orthogonal to the constant functions, thus satisfying directly the requirements for the product kernel (10). 
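To make the construction of Durrande et al. (2012) recalled above concrete, the sketch below builds a zero-mean kernel from an arbitrary base kernel by approximating the required integrals with Monte Carlo draws from the input distribution. The base kernel, sample sizes and names are purely illustrative.

```r
# Zero-mean kernel construction (Durrande et al., 2012):
# k0(x, y) = k(x, y) - E_T[k(x, T)] * E_T[k(y, T)] / E_{S,T}[k(S, T)],
# with expectations under the input distribution approximated by Monte Carlo.

make_zero_mean_kernel <- function(k, t_sample) {
  denom <- mean(outer(t_sample, t_sample, Vectorize(k)))              # E_{S,T} k(S, T)
  m <- function(x) mean(vapply(t_sample, function(t) k(x, t), numeric(1)))  # E_T k(x, T)
  function(x, y) k(x, y) - m(x) * m(y) / denom
}

# Example with a Gaussian base kernel and a uniform input on [0, 1]
k_base <- function(x, y) exp(-(x - y)^2 / 0.1)
t_mc   <- runif(500)
k0     <- make_zero_mean_kernel(k_base, t_mc)
# By construction E_X[k0(X, y)] is zero for any y, up to Monte Carlo error:
mean(vapply(t_mc, function(t) k0(t, 0.3), numeric(1)))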
More recently, several works made use of the Stein operator (Stein et al., 1972) to define the Stein discrepancy in a RKHS (Chwialkowski et al., 2016) which showed great potential for Monte-Carlo integration (Oates et al., 2017) or goodness-of-fit tests (Gorham and Mackey, 2015; Chwialkowski et al., 2016; Jitkrittum et al., 2017) when the target distribution is either impossible to sample or is known up to a normalization constant. More precisely, given a RKHS H with kernel k(·, ·) of functions in R d and a (target) probability distribution with density p(·), they define a new RKHS H 0 with kernel k S 0 (·, ·) which writes can still be defined when p(·) is known up to a constant: this property may find interesting applications in GSA when the input distributions are obtained via a preliminary Bayesian data calibration, since it would no longer be required to perform a costly sampling step of their posterior distribution and one could easily use the unnormalized posterior distribution instead. With Assumption 3, we can now state a decomposition for HSIC-based sensitivity indices. Theorem 4 (ANOVA decomposition for HSIC). Under the same assumptions of Theorem 1 (in particular, the random vector X has independent components) and with Assumptions 2 and 3, the HSIC dependence measure between X = (X 1 , . . . , X d ) and Y can be decomposed as where each term is given by (10). The proof, which mainly relies on Mercer's theorem and on Theorem 4.1 from Kuo et al. (2010) , is given in Appendix A.3. Once again, this decomposition resembles the ANOVA decomposition (1), where the conditional variances are replaced with HSIC dependence measures between subsets of inputs and the output. Properly normalized HSIC-based indices can then be defined: Definition 3 (HSIC-based sensitivity indices). In the frame of Theorem 4, let A ⊆ P d . The normalized HSIC-based sensitivity index associated to a subset A of input variables is defined as while the total HSIC-based index associated to A is From Theorem 4, we have the fundamental identity providing the interpretation of HSIC-based indices as percentage of the explained HSIC dependence measure between X = (X 1 , . . . , X d ) and Y : Finally, a noteworthy asymptotic result yields a link between HSIC-based indices and MMDbased ones when the input kernel k X degenerates to a dirac kernel, as elaborated in the following proposition. The proof is given in Appendix A.4. As a particular case of Proposition 2, one can for example choose a (normalized) Gaussian kernel for k X with a standard deviation tending to 0, or the sinc kernel associated to the RKHS of band-limited continuous functions with a cutoff frequency tending to infinity. Obviously the result also holds if one uses different kernels K for each input X l ∈ X A in Eq. (13). Although they may seem trivial, Proposition 2 and Corollary 1 actually justify our claim that both the MMD-and the HSIC-based sensitivity indices are natural generalizations of Sobol' indices, in the sense that a degenerate HSIC-based index with a dirac kernel for the input variables gives the MMD-based index which, in turn, is equal to the Sobol' index when using the dot product kernel for the output. In this section, we now discuss how the previously indices can still be valuable in the case where the input variables are no longer independent. 
In this setting, the Shapley effects introduced in the context of GSA by Owen (Owen, 2014) and based on Shapley values (Shapley, 1953) from game theory have appealing properties, since they provide a proper allocation of the output variance to each input variable, without requiring they are independent. We recall their definition below. Definition 4 (Shapley effects (Shapley, 1953) ). For any l = 1 . . . , d, the Shapley effect of input X l is given by This definition corresponds to the Shapley value (Shapley, 1953) The only requirement is that the value function satisfies val : P d → R + such that val(∅) = 0. Combining this result with the kernel-based sensitivity indices is consequently straightforward, which leads to the definition of kernel-embedding Shapley effects: Definition 5 (Kernel-embedding Shapley effects). For any l = 1 . . . , d, we define (a) The MMD-Shapley effect provided Assumption 1 holds. provided Assumptions 2 and 3 hold. We further have the decompositions Just like in the independent setting, kernel-embedding Shapley effects (15) and (16) can be seen as general kernelized versions of Shapley effects, since Proposition 2 and Corollary 1 are still valid when the inputs are dependent. Remark 1. In the machine learning community dedicated to the interpretability of black-box models, an importance measure called Kernel-Shap has been recently introduced (Lundberg and Lee, 2017) . Although its naming resembles ours, they designate clearly separated approaches, since the Kernel-Shap measure is a local Shapley effect, and the "Kernel" denomination only refers to an estimation procedure without any links to RKHS. Beyond their theoretical interest in themselves, the kernel-ANOVA decompositions and the associated sensitivity indices also appear powerful from a practical point of view when one carefully examines the potential of using kernels. We give below some insights on how they could enhance traditional GSA studies in several settings. Categorical model outputs and target sensitivity analysis. In some applications, the model output Y is categorical, meaning that Y = {1, . . . , K} when the output can take K levels. A simple common instance involves two levels, corresponding to a failure/success situation. Similarly even if Y is not categorical, the objective may be to measure the impact of each input on the fact that the output reaches disjoint regions of interest R 1 , . . . , R K ⊂ Y, as for example in the case where one focuses on events {t i+1 > Y > t i } for thresholds t i , i = 1, . . . , K. Such an objective is called target sensitivity analysis (TSA, see Marrel and Chabridon (2020) ) and can be reformulated in a categorical framework by the change of variable A straightforward approach is to use Sobol' indices with a 0/1 output, yielding a first-order Sobol index equal to see Li et al. (2012) . But to the best of our knowledge, no systematic procedure is available when the number of levels is greater than two. Without resorting yet to our kernel-based indices, there are at least two roads, which actually lead to the same indices: (a) The one-versus-all approach, where we compute several Sobol' indices by repeatedly considering Z = 1 {Y =i} for all i = 1 . . . , K. We thus have a collection of indices , and we can aggregate them by normalizing each of them by its own variance, yielding . 
(b) The one-hot encoding approach, which consists in encoding the categorical output into a multivariate vector of 0/1 variables and use the aggregated Sobol' indices defined in Gamboa et al. (2013) on these transformed variables. More precisely if Y = {1, . . . , K}, Y is encoded as a K-dimensional vector (Z 1 = 1 Y =1 , . . . , Z K = 1 Y =K ). The aggregated Sobol' indices are then , which is exactly the index obtained with the one-versus-all approach. As for the kernel-based indices, the process is less cumbersome, since the only ingredient that requires attention is the choice of a kernel k Y (·, ·) adapted to categorical outputs, which has already been investigated in the kernel literature (Song et al., , 2012 . We focus here on the simple dirac kernel defined as k Y (y, y ) = δ(y, y ) for categorical values y, y ∈ {1, . . . , K}, and the corresponding kernel-based indices are then • The first-order MMD-based index: where we retrieve again the one-versus-all Sobol' index. • The first-order HSIC-based index: thus extending the result of Spagnol et al. (2019) to any number of levels. • The MMD-and HSIC-Shapley effects using one of the above indices as building block. Interestingly, it has been shown that Eq. (17) can also been written, up to a constant, as the Pearson χ 2 divergence between P X l |Y =1 and P X l (Perrin and Defaux, 2019; Spagnol, 2020) . This means that S TSA l = S MMD l and S HSIC l essentially have the same interpretation as weighted sums of distances between the initial input distributions and the conditional input distributions (when restricted to an output level), with the Pearson χ 2 divergence and the MMD distance, respectively. But we will see in Section 4 that the estimation of HSIC-based sensitivity indices is much less prone to the curse of dimensionality and does not require density estimation, as opposed to S TSA l (Perrin and Defaux, 2019) . Finally, note that another kernel for categorical variables has also been proposed (Song et al., , 2012 , but this is actually a normalized dirac kernel which would only modify the weights in the indices above. Beyond scalar model outputs. In many numerical simulation models, some of the outputs are curves representing the temporal evolution of physical quantities of the system such as temperatures, pressures, etc. One can also encounter industrial applications which involve spatial outputs (Marrel et al., 2008) ,. In such cases, the two main approaches in GSA are (a) the ubiquitous point of view, where one sensitivity index is computed for each time step or each spatial location (Terraz et al., 2017) and (b) the dimension reduction angle, in which one preliminary projects the output into a low-dimensional vector space and then calculates aggregated sensitivity indices for this new multivariate vector (Lamboni et al., 2011; Gamboa et al., 2013) . However, the kernel perspective for such structured outputs can bring new insights for GSA. Indeed the kernel literature has already proposed several ways to handle curves or images in regression or classification tasks. For instance the PCA-kernel (Ferraty and Vieu, 2006) can be used as an equivalent of (b), such as illustrated in Da Veiga (2015) . But more interestingly, kernels dedicated to times series were designed, such as the global alignment kernel (Cuturi, 2011) inspired by the dynamic time-warping kernel (Sakoe and Chiba, 1978) . 
Such kernel could be employed in industrial applications where one is interested by the impact of an input variable on the shape of the output curve. On the other hand, for dealing with spatial outputs similar to images such as in Marrel et al. (2008) , one may consider a kernel based on image classification (Harchaoui and Bach, 2007) which would be better suited to analyze the impact of inputs on the change of the shapes appearing inside the image output. Finally, numerical models involving graphs as inputs or outputs (e.g. electricity networks or molecules) may now be tractable with GSA by employing kernels specifically tailored for graphs Ramon and Gärtner, 2003) . Stochastic numerical models. On occasions one has to deal with stochastic simulators, where internally the numerical model relies on random draws to compute the output. Typical industrial applications include models dedicated to the optimization of maintenance costs, where random failures are simulated during the system life cycle, or molecular modeling to predict macroscopic properties based on statistical mechanics, where several microstates of the system are generated at random (Moutoussamy et al., 2015) . For fixed values of the input variables, the output is therefore a probability distribution, meaning that Y ⊂ M + 1 the set of probability measures. In this setting GSA aims at measuring how changes in the inputs modify the output probability distribution, which is clearly out of the traditional scope of GSA. Once again the kernel point of view makes it possible to easily recycle the MMD-and the HSICbased sensitivity indices in this context since they only require the definition of a kernel k Y (·, ·) on probability distributions. This can be achieved through one of the two following kernels: introduced in Song (2008) or k Y (P, Q) = σ 2 e −λW 2 2 (P,Q) discussed in Bachoc et al. (2017) where P, Q ∈ M + 1 , W 2 is the Wasserstein distance and σ 2 , λ > 0 are parameters. The properly normalized kernel-based sensitivity indices being defined above, we now discuss their estimation. The HSIC-based index is first examined as we only consider already proposed estimators. On the other hand, the MMD-based index is analyzed more thoroughly since several estimators can be envisioned given its close links with Sobol' indices. Finally we investigate the estimation of kernel-embedding Shapley effects. We start by observing that if Assumption 3 holds, for any subset A ⊆ P d we have E X A k A (X A , x A ) = 1 for all x A ∈ X A , which means that HSIC in Eq. (8) simplifies into: Given a sample x (i) , y (i) , i = 1, . . . , n and following Song et al. (2007) ; Gretton et al. (2008) two estimators HSIC u (X A , Y ) and HSIC b (X A , Y ) based on U-and V-statistics, respectively, can be introduced: where we assume that P X A is known and is used to compute analytically the zero-mean kernels in Eq. (3.2.2). The study of the version of the above estimators when the sample x (i) i=1,...,n also serves to estimate k A is left as future work. Both HSIC u (X A , Y ) and HSIC b (X A , Y ) converge in probability to HSIC(X A , Y ) with rate 1/ √ n, and one can show ) that if we assume that k A and k Y are bounded almost everywhere by 1 and are nonnegative, with probability at least 1 − δ we have |HSIC u (X A , Y ) − HSIC(X A , Y )| ≤ 8 log(2/δ)/n. The asymptotic distributions of HSIC u (X A , Y ) and HSIC b (X A , Y ) have also been studied in the case where X A and Y are dependent, see Song et al. (2007) and Gretton et al. (2008) . 
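As a concrete example of an input kernel compatible with Assumption 3, the sketch below implements the zero-mean Sobolev kernel of smoothness r = 1 for an input uniform on [0, 1] (written with Bernoulli polynomials), together with the V-statistic HSIC estimator evaluated from Gram matrices. This is an illustrative sketch under these assumptions, not the paper's supplementary code, and it uses the generic biased estimator rather than its U-statistic counterpart.

```r
# Zero-mean Sobolev kernel (r = 1) for an input uniform on [0, 1]: the integral of
# k(x, y) over x in [0, 1] is zero for every y, as required by Assumption 3.
sobolev_kernel <- function(x, y) {
  b1 <- function(t) t - 0.5               # Bernoulli polynomial B1
  b2 <- function(t) t^2 - t + 1 / 6       # Bernoulli polynomial B2
  outer(b1(x), b1(y)) + outer(x, y, function(u, v) b2(abs(u - v))) / 2
}

# Biased (V-statistic) HSIC estimator from precomputed Gram matrices K (input) and L (output)
hsic_v <- function(K, L) {
  n <- nrow(K)
  H <- diag(n) - matrix(1 / n, n, n)
  sum((H %*% K %*% H) * L) / n^2
}

# Toy usage on a single given sample
set.seed(1)
x <- runif(300)
y <- cos(2 * pi * x) + 0.1 * rnorm(300)
K <- sobolev_kernel(x, x)
L <- exp(-outer(y, y, "-")^2 / (2 * median(dist(y))^2))   # Gaussian output kernel
hsic_v(K, L)
```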
It is worth mentioning that here the number of model evaluations is n, which is independent from the input dimension, meaning that all HSIC-based sensitivity indices can be computed with only a given sample x (i) , y (i) , i = 1, . . . , n. MMD-based indices are close generalizations of Sobol' indices, since they involve computing the expectation of a conditional quantity (a MMD distance with a conditional probability for the former and a conditional variance for the latter). This is the reason why estimation procedures developed for Sobol' indices can be adapted to the MMD ones. The first two estimators discussed below assume that one can easily sample the computer model for any input values (to be determined by the estimation procedure), as opposed to the next two ones which can be defined with any given sample x (i) , y (i) , i = 1, . . . , n. The first naive estimator consists in systematically resampling the conditional distribution P Y |X A =x A for many values of x A , as detailed in Algorithm 1 below. For each MMD-based index of a subset of variables X A the total number of model evaluations is (n+1)m, which means for example that all first-order MMD-based sensitivity indices are computed at a cost of p(n + 1)m model evaluations. It is however possible to design better sampling strategies to compute first-order and total indices if the inputs are independent, as explained in the next section. We begin by recalling the definition of the pick-freeze estimators for Sobol' indices. Lemma 1 (Pick-freeze formulation of Sobol indices (Janon et al., 2014) ). Assume X and X are two independent copies of the input vector, the inputs being independent. For any subset A ⊆ P d define X ∼A the vector assembled from X and X such that In the particular case of A = {l}, the first-order and total indices S l and S T l can be estimated by collecting estimatorsV l ,V −l andV of Cov Y, Y ∼l , Cov Y, Y ∼−l and Var Y , respectively. Such estimators have been first studied in Homma and Saltelli (1996) , but we focus on the ones introduced by Saltelli et al. (2010) where x (i) and x (i) denote independent samples of X and x ∼l,(i) is a vector such that x The total number of model evaluations to estimate both S l and S T l is thus (p + 2)n, which is much less than the amount required by the previously introduced double-loop estimator. We now build upon these estimators to design equivalent ones for the first-order and total MMD-based sensitivity indices. The main ingredient is to state an equivalent of Lemma 1 for the MMD. Lemma 2 (Pick-freeze formulation of MMD-based indices). With the same notations and assumptions as in Lemma 1, we have Proof. Since Y and Y ∼A are conditionally independent on X A with the same distribution, we can write All these estimators can actually be recovered by using Mercer's theorem k Y (y, y ) = ∞ r=1 φ r (y)φ r (y ) and plugging the Sobol' estimators of Cov φ r (Y ), φ r (Y ∼l ) , Cov φ r (Y ), φ r (Y ∼−l ) and Var φ r (Y ) for all r > 1. Once again all first-order and total MMD-based sensitivity indices can be estimated with a total cost of (p + 2)n model evaluations, and by the strong law of large numbers it is straightforward to show that both MMD The two previous estimators, although simple, necessitate specific sampling schemes (double-loop Monte-Carlo or pick-freeze) which may not be amenable in practice. In addition first-order MMD indices estimation call for a number of model evaluations which increases with the number of input variables d. Recently, Gamboa et al. 
(2020) introduced new estimators of first-order Sobol' indices based on ranking and inspired by the work of Chatterjee (2020). In particular, for any pair of random variables (V, Y ) and measurable bounded functions f and g, they propose a universal estimation procedure for expectations of the form using only a given sample (v (i) , y (i) ) i=1,...,n and an estimator given by 1 n n i=1 f (y (i) )g(y (σn(i)) ) where σ n is a random permutation with no fixed point and measurable with respect to the σ-algebra generated by (v (1) , . . . , v (n) ). First-order Sobol' indices are then estimated using f (x) = g(x) = x and the permutation σ n = N defined as in Chatterjee (2020) : where π(i) is the rank of V (i) in the sample (V (1) , . . . , V (n) ). All first-order indices are finally obtained with a given sample by considering one after the other the pairs (X l , Y ) with their own permutation based on the sample ranks of X l . Interestingly, it is possible to generalize this result to the first-order MMD indices with the following proposition. Proposition 3 (Generalization of Proposition 3.2 from Gamboa et al. (2020) ). Let k(·, ·) be a measurable bounded kernel and (v (i) , y (i) ) i=1,...,n an iid sample from a pair of random variables (V, Y ). Consider a random permutation with no fixed point and measurable with respect to the σ-algebra generated by (v (1) , . . . , v (n) ) such that for any i = 1, . . . , n, v (σn(i)) → v (i) as n → ∞ with probability one. Then the estimator k(y (i) , y (σn(i)) ) converges almost surely to as n → ∞. The proof relies again on Mercer's theorem and is given in Appendix A.5. The estimators of E X l MMD 2 (P Y , P Y |X l ) and MMD 2 tot are finally given by for a sample x (i) , y (i) , i = 1, . . . , n and where σ l n is the permutation defined in Eq. (19) with a ranking performed on the sample x The ranking approach introduced above can actually be generalized to estimate higher-order sensitivity indices by replacing ranking (in dimension 1) by nearest-neighbors (in arbitrary dimension), since they define a permutation with the same properties as required in Proposition 3. This was proposed independently by Azadkia and Chatterjee (2019) in the context of a dependence measure and by Broto et al. (2020) for Shapley effects estimation. Here we adopt the formalism of Broto et al. (2020) , where they introduce j * A (i, m) the index such that the sample point x where s(j), j = 1, . . . , n A is a sample of uniformly distributed integers in {1, . . . , n}, with n A ≤ n. The choice of using a subsample s(j) is motivated by the authors so that their framework is general enough for the different aggregation procedures they propose for Shapley effects and for their consistency proofs. Several numerical experimentations not reported here also show that using all the samples instead of subsamples yield biased estimators, so we follow the procedure of Broto et al. (2020) . Once again this estimator can be generalized to MMD-based indices, where where we denote y (i) = η x (i) . The consistency of this estimator directly follows from the consistency ofV knn A from Broto et al. (2020) and Mercer's theorem. Since j * A (s(j), 1) = s(j), the estimator is identical to the ranking-based one in (20) where the permutation from rankings is simply replaced by the index of the nearest neightbor not including itself j * A (s(j), 2). The last estimation task concerns kernel-embedding Shapley effects set forth in Definition 5. 
Of course a straightforward approach consists in using any of the estimators discussed before in the general formulation of the MMD- or HSIC-Shapley effects. But a closer inspection actually reveals that, although this is easy for the HSIC-Shapley effects since both HSIC_u(X_A, Y) and HSIC_b(X_A, Y) can be computed for all subsets A ⊆ P_d with only one sample (x^(i), y^(i)), i = 1, . . . , n, the MMD-Shapley effects require estimators of E_{X_A} MMD^2(P_Y, P_{Y|X_A}) which do not involve too many calls to the numerical model. Among the estimators introduced in Section 4.2, only the one based on nearest neighbors has a computational cost independent of the number of input variables. This is exactly the framework proposed in Broto et al. (2020) for the variance-based Shapley effects. However, as pointed out in Song et al. (2016) in the case of variance-based Shapley effects, a double-loop Monte-Carlo estimator of the value function val(A) = Var E(Y|X_A) / Var Y can be heavily biased. They show that another value function val'(A) = E Var(Y|X_{-A}) / Var Y behaves better and gives rise to the exact same Shapley effects (Theorem 1 in Song et al. (2016)). This is why Broto et al. (2020) also introduced a nearest neighbor estimator of E Var(Y|X_{-A}), where this time n_I nearest neighbors are used. In a nutshell, the nearest neighbors are used as if they were independent samples from P_{Y|X_A = x^(s(j))}, which explains why their empirical variance is computed in that estimator. In order to follow the same road for the estimation of MMD-Shapley effects, we first need an equivalent of Theorem 1 from Song et al. (2016) for a new value function related to the MMD: the Shapley effects obtained with this alternative value function, normalized by MMD^2_tot, are exactly equal to the MMD-Shapley effects from Definition 5 with value function val(A) = E_{X_A} MMD^2(P_Y, P_{Y|X_A}) / MMD^2_tot. The proof is based on the generalization of the law of total variance for the generalized variance MMD^2_tot and is given in Appendix A.6. A nearest neighbor estimator of this alternative value function then follows the same pattern, and the MMD-Shapley effect estimator is obtained by plugging it into the Shapley formula, where MMD^2_tot is estimated as in Eq. (21). As a side note, when the number of input variables is large, the number of terms involved in Shapley effects severely increases and the computational cost to assemble all the terms (even if one uses estimators relying on a given sample only) becomes prohibitive. For such cases it is possible to use a formulation of Shapley effects involving a sum over permutations of {1, . . . , d} instead of a sum over subsets of P_d, which makes it possible to add another level of approximation by computing the sum on a random sample of permutations instead of on all of them (Castro et al., 2009). Obviously, since this trick does not depend on the value function used inside the Shapley values, it can also be used for our kernel-embedding Shapley effects.
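To make the aggregation step concrete, here is a minimal R sketch of the Shapley allocation for an arbitrary value function val (a function taking an integer vector of input indices and returning a nonnegative number, with val of the empty set equal to 0): first the exact sum over subsets, then the random-permutation approximation of Castro et al. (2009) mentioned above. All names are illustrative; in practice val(A) would be one of the normalized kernel-based estimators of this section, so that the effects sum to one.

```r
# Exact Shapley allocation by enumeration of all subsets (assumes d >= 2; only tractable
# for small d). val(integer(0)) must return 0.
shapley_exact <- function(val, d) {
  subsets <- c(list(integer(0)),
               unlist(lapply(1:(d - 1), function(s) combn(1:d, s, simplify = FALSE)),
                      recursive = FALSE))
  effects <- numeric(d)
  for (l in 1:d) {
    for (A in subsets) {
      if (l %in% A) next
      w <- factorial(length(A)) * factorial(d - length(A) - 1) / factorial(d)
      effects[l] <- effects[l] + w * (val(sort(c(A, l))) - val(A))
    }
  }
  effects
}

# Approximation by averaging marginal contributions over m random permutations
shapley_perm <- function(val, d, m = 500) {
  effects <- numeric(d)
  for (rep in seq_len(m)) {
    perm <- sample(d)
    A <- integer(0)
    prev <- 0                      # val(empty set) is assumed to be 0
    for (l in perm) {
      A <- c(A, l)
      cur <- val(sort(A))
      effects[l] <- effects[l] + cur - prev
      prev <- cur
    }
  }
  effects / m
}
```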
In this section we illustrate the behavior of the kernel-based sensitivity indices on several test cases representative of typical GSA industrial applications. In particular, we address the following numerical model categories: a standard scalar output model, a stochastic simulator, a model with a time-series output and a multi-class categorical output simulator with dependent inputs. All the results presented here are reproducible with the R code provided in the supplementary material.
To exemplify the additional insight provided by these indices we first consider a classical GSA test case, the Ishigami function (Ishigami and Homma, 1990), where the output Y is given by Y = sin(X_1) + 7 sin(X_2)^2 + 0.1 X_3^4 sin(X_1), with X_l ∼ U(−π, π) for l = 1, . . . , 4, meaning that we add a dummy input variable X_4 for analysis purposes. We start by computing the traditional Sobol' first-order and total sensitivity indices using a pick-freeze estimator as in Section 4.2.2 with a sample size n = 1000 and we repeat this calculation 50 times. For each replication the total number of calls to the numerical model is thus (p + 2)n = 6000. We then use the same pick-freeze procedure to estimate the MMD-based first-order and total indices with the exact same samples. For the output we use a Gaussian kernel k_Y(y, y') = exp(−(y − y')^2/(2σ^2)) where σ is chosen as the median of the pairwise distances between the output samples. Results are given in Figure 1. First note that, as is well known, the first-order Sobol' index of X_3 is zero, while its total index is around 0.25 due to its interaction with X_1. X_2 is also an important variable, which does not have any interaction since its total Sobol' index is equal to its first-order one. As expected, X_4 is correctly detected as non-important. The MMD-based indices however bring a different insight: from a probability distribution perspective, one can observe that interactions are much more present since there is a large gap between total and first-order indices for all inputs (except X_4 of course). In addition, this time X_3 is detected to have a main effect: indeed, even though it does not impact the output conditional mean, it influences the tails of the output conditional distribution when it is close to −π or π, as was already illustrated in Da Veiga (2016). This shows that MMD-based indices capture other types of input influence than Sobol' ones. To take a different view of the inputs/output relationship we also estimate HSIC-based first-order and total indices using the V-statistic of Section 4.1. Again for the output we use the same Gaussian kernel as above, while we use the Sobolev kernel from Eq. (12) for the inputs. Since they are uniform it is easy to renormalize them to satisfy the zero-mean kernel condition. We use only one sample of size n = 1000 and estimates obtained with 50 replications are reported in Figure 2. Interestingly, we observe first that with HSIC we no longer detect any interaction: our intuition is that first-order HSIC indices already aggregate a very large family of potential influences and thus interactions may only appear with highly complicated inputs/output link functions. This is supported by the fact that HSIC indices rank the inputs the exact same way as total Sobol' indices. Another appealing property is that to compute all HSIC indices we only need a given sample of moderate size, which is interesting from a screening perspective for GSA on very time-consuming numerical models.
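For completeness, the next sketch reproduces in R the core of this Ishigami experiment: the test function, the median-heuristic Gaussian output kernel and the pick-freeze estimator of Lemma 2 for the first-order MMD-based indices. Sample sizes, seeds and names are illustrative; this is a simplified sketch, not the paper's supplementary material.

```r
ishigami <- function(X) sin(X[, 1]) + 7 * sin(X[, 2])^2 + 0.1 * X[, 3]^4 * sin(X[, 1])

set.seed(42)
n <- 1000; d <- 4                                  # X4 is the dummy input
X  <- matrix(runif(n * d, -pi, pi), ncol = d)      # first pick-freeze sample
Xp <- matrix(runif(n * d, -pi, pi), ncol = d)      # second, independent sample
y  <- ishigami(X); yp <- ishigami(Xp)

sigma <- median(dist(y))                           # median heuristic bandwidth
k <- function(a, b) exp(-(a - b)^2 / (2 * sigma^2))

cross    <- mean(k(y, yp))                         # estimates E k(Y, Y')
mmd2_tot <- mean(k(y, y)) - cross                  # generalized variance MMD^2_tot

# First-order MMD-based index of each input via pick-freeze (Lemma 2)
S_mmd <- sapply(1:d, function(l) {
  Xl <- Xp; Xl[, l] <- X[, l]                      # freeze column l
  (mean(k(y, ishigami(Xl))) - cross) / mmd2_tot
})
round(S_mmd, 3)
```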
Our second illustration is a more original setting for GSA which consists of a stochastic simulator, where the numerical model outputs a probability distribution (or rather a sample from a probability distribution in practice) for a fixed value of the input variables. Here we use a test case proposed in Moutoussamy et al. (2015) which involves five input variables, where X_1, . . . , X_5 ∼ U(0, 1) are the input variables and U_1 ∼ U(0, 1), U_2 ∼ U(1, 2), N ∼ N(0, 1) and B ∼ Bernoulli(1/2) are additional random variables which are responsible for the simulator stochasticity. Note that we modify the constant in front of X_5 B to lessen the effect of X_5 as compared to Moutoussamy et al. (2015). An example of the output distribution for 20 random fixed values of the input variables, obtained each time with a sample of size 100 for the stochastic ones, is given in Figure 3. Leaving aside for now the whole output distribution, we first place ourselves in a standard deterministic GSA setting by analyzing the input influence on both the output mean and standard deviation (with respect to U_1, U_2, N and B). We thus compute Sobol' indices for these two outputs of interest with a pick-freeze estimator with a sample of size n = 1000 and perform 50 replications, see Figure 4. It shows that interactions are negligible, and that X_5 is clearly the most influential input by far: it alone explains 65% of the output mean variability and 75% of the output standard deviation variability. This is expected since X_5 is coupled with B, which creates the multi-modal feature of the output distribution. The output mean variability also depends on X_3 and X_4 to a lesser extent, and the output standard deviation variability on X_2. We now make use of the kernel framework to compute MMD- and HSIC-based sensitivity indices which can accommodate directly the output distribution thanks to the specific kernels discussed in Section 3.4. More precisely, we use the kernel of Eq. (18) with σ^2 = 1 and λ chosen as the median of the MMD^2 values computed on the preliminary sample used for visualization in Figure 3, with an inner kernel k(y, y') = exp(−(y − y')^2/(2τ^2)) and τ chosen as the median of the pairwise distances between the output samples. We only compute first-order indices here and use for illustration the rank estimator of the MMD index from Section 4.2.3, while for HSIC we use again the Sobolev kernel. Results with 50 replications and a sample of size n = 200 are given in Figure 5. Both indices coincide and identify X_5 as the most important input variable, as well as a small influence of X_3, X_2 and X_4 while X_1 is non-important: considering the whole output distribution variability via the specific kernel is comparable to an aggregation of the variability on the output mean and standard deviation (and other moments we did not compute above).
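The two ingredients used just above can be sketched compactly in R: a kernel between distribution-valued outputs of the form of Eq. (18), evaluated on samples through a plug-in MMD with a Gaussian inner kernel, and the rank estimator of Section 4.2.3 written directly on a Gram matrix so that it applies unchanged to such outputs. This is an illustrative sketch with assumed parameter values and conventions, not the paper's supplementary code.

```r
# Kernel between two distributions represented by samples p and q:
# k(P, Q) = sigma2 * exp(-lambda * MMD^2(P, Q)), with a Gaussian inner kernel of bandwidth tau.
inner_k <- function(a, b, tau) exp(-outer(a, b, "-")^2 / (2 * tau^2))
mmd2_samples <- function(p, q, tau) {
  mean(inner_k(p, p, tau)) + mean(inner_k(q, q, tau)) - 2 * mean(inner_k(p, q, tau))
}
k_dist <- function(p, q, tau, sigma2 = 1, lambda = 1) {
  sigma2 * exp(-lambda * mmd2_samples(p, q, tau))
}

# Rank estimator of the first-order MMD-based index of a scalar input x, written on a
# precomputed output Gram matrix K (K[i, j] = k_Y(y_i, y_j), whatever the output type).
rank_neighbor_perm <- function(x) {
  ord <- order(x)                        # pair each point with the next one in increasing x
  sigma <- integer(length(x))
  sigma[ord] <- ord[c(2:length(x), 1)]   # cyclic convention for the largest value (assumption)
  sigma
}
mmd_first_order_rank <- function(x, K) {
  n <- nrow(K)
  offdiag  <- (sum(K) - sum(diag(K))) / (n * (n - 1))      # estimates E k_Y(Y, Y')
  emmd2    <- mean(K[cbind(1:n, rank_neighbor_perm(x))]) - offdiag
  mmd2_tot <- mean(diag(K)) - offdiag
  emmd2 / mmd2_tot                                         # normalized first-order index
}
```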
Another commonly encountered industrial application is a physics-based numerical simulator involving functional outputs, such as curves representing the evolution over time of some system characteristics (e.g. pressure, temperature, ...). To illustrate how time-series kernels can easily handle GSA on such systems, we build a simplified compartmental epidemiological model inspired by previous works on COVID-19 (Magal and Webb, 2020; Charpentier et al., 2020; Di Domenico et al., 2020). Our model is a straightforward Susceptible-Infected-Recovered (SIR) model (Kermack and McKendrick, 1927) which is slightly modified, in the sense that it accounts for two different types of infectious people: the reported cases, which we assume are isolated and can no longer contaminate others, and the unreported cases who can infect others. A summary of this compartment model proposed by Magal and Webb (2020) is given in Figure 6 (caption: Functional simulator test case. The modified SIR model with 4 compartments following Magal and Webb (2020)). S consists of the susceptible individuals who are not yet infected. During the epidemic spread they are infected depending on the time-dependent transmission rate τ(t). Once infected they move to compartment I, which contains the asymptomatic infectious individuals. After a period of 1/η days they become symptomatic and a fraction f of them is detected and goes to compartment R, while the rest of them are undetected and go to compartment U. After a recovery period of 1/ν days, symptomatic people from R and U recover and go to the last compartment. Observe that this is a highly simplified representation of the epidemic where we do not account for hospitalizations, testing strategies or deaths: our goal here is not to be representative of COVID-19 but rather to exemplify how GSA can be applied to such models. The dynamics of the evolution of individuals from one compartment to another is modeled with a system of ordinary differential equations following the compartment structure of Figure 6 (an illustrative sketch in R is given at the end of this example). The transmission rate is chosen according to Magal and Webb (2020), where they propose a parametric form given by τ(t) = τ_0 exp(−µ max(t − N, 0)). The underlying assumption is that before the epidemic outbreak the transmission rate is constant, equal to τ_0, and it then decreases with an exponential decay with rate µ once social distancing and lockdown start to have an effect after N days. They further assume that the cumulative number of reported cases CR(t) approximately follows a parametric exponential form whose parameters χ_1 and χ_2 are to be estimated on data. From this assumption they get the values of the initial conditions I_0, U_0 and R_0, with in our case S_0 = 66.99 × 10^6 the initial susceptible population (here in France). From a GSA perspective we then assume that we have uncertainty on the following 6 input variables: τ_0, µ, N (transmission rate), η, ν (days until symptoms and recovery) and χ_2 (which impacts the initial conditions). f is assumed to be fixed at a fraction equal to 0.1. We assign uniform distributions to the input variables with ranges consistent with the values from Magal and Webb (2020), i.e. τ_0 ∼ U(5.9 × 10^{-9}, 6.1 × 10^{-9}), µ ∼ U(0.028, 0.036), N ∼ U(8, 15), 1/η ∼ U(5, 9), 1/ν ∼ U(5, 9) and χ_2 ∼ U(0.32, 0.4). An example of the dynamics of compartments I and R for 20 values of the inputs chosen at random according to these uniform distributions is given in Figure 7. For GSA we rely on the global-alignment kernel of Cuturi (2011) designed for time series, which searches for all their alignments and averages them, and use it inside our first-order HSIC indices for the output, whereas we still employ the Sobolev kernel for the inputs. The results obtained with 50 repetitions with a sample of size n = 200 and the V-statistic estimator are reported in Figure 8. For both compartments the most influential input is χ_2, as expected since it influences the initial conditions, and then N and µ, related to the transmission rate. ν also has an impact for compartment I but not η, which is coherent with the ordinary differential equations, and one can see on the contrary that η influences compartment R.
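As announced above, the sketch below implements a modified SIR simulator with the compartment structure of Magal and Webb (2020) using the deSolve package. The exact equations, the mapping of the symptom-onset and recovery rates to η and ν, the parameter values and the initial conditions are assumptions made for this illustration and may differ from the paper's implementation.

```r
library(deSolve)

siru_rhs <- function(t, state, p) {
  with(as.list(c(state, p)), {
    tau_t  <- tau0 * exp(-mu * max(t - N, 0))  # time-dependent transmission rate tau(t)
    newinf <- tau_t * S * (I + U)
    dS <- -newinf
    dI <- newinf - a * I                       # a: rate of symptom onset (leaving I)
    dR <- f * a * I - g * R                    # reported symptomatic cases
    dU <- (1 - f) * a * I - g * U              # unreported cases; g: recovery rate
    list(c(dS, dI, dR, dU))                    # recovered compartment left implicit
  })
}

pars  <- c(tau0 = 6e-9, mu = 0.032, N = 12, a = 1 / 7, g = 1 / 7, f = 0.1)
state <- c(S = 66.99e6, I = 1000, R = 100, U = 900)   # illustrative initial conditions
traj  <- ode(y = state, times = 0:100, func = siru_rhs, parms = pars)
head(traj)
```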
Finally, we investigate a numerical model with both a categorical output (to make use of the discussion from Section 3.4) and dependent inputs (to analyze the kernel-embedding Shapley effects from Section 3.3). We build upon the famous wine quality dataset (Cortez et al., 2009) of the UCI repository (Dua and Graff, 2017). This dataset consists of 4898 observations of wine quality (a categorical variable with levels 0 to 10 corresponding to a score) associated with 11 features obtained from physicochemical tests. In order to place ourselves in a standard computer experiments setting (i.e. a numerical simulator and uncertain inputs with a given probability distribution), we use this dataset to design a GSA scenario with the following steps:

1. We regroup the wine quality scores into only 3 categories: low (score of 5 or lower), medium (score equal to 6) and high (score of 7 or higher), in order to obtain a balanced dataset. We also use a small subsample of size 600 of white wines only from the initial 4898 observations, for faster estimation of the input dependence structure;

2. We fit a random forest model between the wine quality and the 11 inputs from this transformed dataset and compute a variable importance score for each input. The variable importance scores are used to select only 4 important features among the initial 11 (volatile acidity, chlorides, density and alcohol). This is by no means a mandatory step, but we choose to do so both for a faster computation of the Shapley effects and for an easier estimation of the input dependence structure. A new random forest model is finally built with these 4 input variables only, and its predictor serves as our numerical simulation model;

3. The samples of the 4 input variables identified above are used to estimate a vine copula structure which models their dependence (Czado, 2019). Once the vine copula is estimated, it is easy to generate as many new input samples as required.

MMD- and HSIC-Shapley effects are then computed with a sample of size $n = 1000$, with a Dirac categorical kernel for the output and a Sobolev kernel for the inputs in the HSIC case. For MMD we use the nearest-neighbor estimator of Section 4.3 and for HSIC the V-statistic estimator, and we repeat the estimation 50 times, see Figure 9. Both kernel-embedding Shapley effects identify alcohol as the most influential input, which was expected from the variable importance scores computed with the random forest. However, the MMD-Shapley effects do not discriminate the input variables as clearly as the HSIC ones. We suspect that some estimation bias remains in the nearest-neighbor estimators, which we plan to examine carefully in future work.
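For reference, the V-statistic plug-in estimator of HSIC used in these experiments can be sketched from two precomputed Gram matrices as follows. The Gaussian input kernel below is a simplification for illustration (the experiments use a Sobolev kernel on the inputs), and the output Gram matrix would instead come from the Dirac categorical kernel, the distributional kernel or the global-alignment kernel of Cuturi (2011), depending on the test case.

```python
import numpy as np

def hsic_v(K, L):
    """V-statistic (biased) estimator of HSIC from two precomputed Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def rbf_gram(x, lengthscale):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(2)
x = rng.random(200)                                            # one scalar input
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)     # toy output
K = rbf_gram(x, 0.2)                                           # Gaussian input kernel (illustrative)
L = rbf_gram(y, np.median(np.abs(y[:, None] - y[None, :])))    # output kernel, median heuristic
print(hsic_v(K, L))
```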
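Similarly, once a value function val(A) is available for every subset A of inputs (in the wine example it would be a normalized MMD- or HSIC-based index estimated from vine-copula samples pushed through the random-forest predictor), the Shapley effects follow from the usual subset weights. The sketch below uses a toy additive value function as a stand-in, not an estimator of the indices defined in this paper.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_effects(val, d):
    """Exact Shapley effects for d inputs given a value function val(subset) -> float,
    by enumerating all subsets (tractable here since d = 4)."""
    effects = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for A in combinations(others, size):
                w = factorial(len(A)) * factorial(d - len(A) - 1) / factorial(d)
                effects[i] += w * (val(frozenset(A) | {i}) - val(frozenset(A)))
    return effects

# Toy additive value function standing in for a normalized kernel-based index of a subset
toy_val = lambda A: sum({0: 0.1, 1: 0.05, 2: 0.15, 3: 0.5}[j] for j in A)
eff = shapley_effects(toy_val, 4)
print(eff, eff.sum())   # effects sum to val(full set) = 0.8 in this additive toy case
```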
In this paper we discussed two moment-independent sensitivity indices which generalize Sobol' indices by relying on the RKHS embedding of probability distributions. These MMD- and HSIC-based sensitivity indices are shown to admit an ANOVA decomposition, which makes it possible to properly define input interactions and their natural normalization constant. To the best of our knowledge, this is the first time such a result is proved for sensitivity indices other than Sobol' ones. We also defined kernel-embedding Shapley effects, which are built upon these indices for the case where the input variables are no longer independent. As discussed through several GSA applications with categorical outputs or stochastic simulators, this opens the path for new, powerful and general GSA approaches by means of kernels adapted to the task at hand. Finally, several estimators have been introduced, including new ones inspired by recent advances in the estimation of Sobol' indices and Shapley effects.

However, there is still room for improvement in the theoretical understanding of these indices. First, we extensively used Mercer's theorem and it would be interesting to extend our results when it no longer holds. We also assumed a product form for the kernel in the HSIC indices, whereas the theorem used in our proof allows for more general kernels. From an estimation perspective, we did not exhibit any central limit theorem, although this would be an important step towards statistically testing whether indices are zero or not. This is not an easy task, but it may be tackled via the functional delta method combined with Mercer's theorem. On the other hand, some bias can be observed in the nearest-neighbor estimators, which should be analyzed carefully in future work. Finally, further practical experiments should be performed to better understand the behavior of these new indices. We think in particular of the choice of the kernel hyperparameters, and of the investigation of invariant kernels for outputs given as curves or images.

A.1 Proof of Theorem 3

Proof. The theorem is proved in the case where Mercer's theorem holds, i.e., the output is assumed to be such that $Y \in \mathcal{Y}$ with $\mathcal{Y}$ a compact set and $k_Y$ has the representation $k_Y(y, y') = \sum_{r=1}^{\infty} \phi_r(y)\phi_r(y')$ as in Eq. (9). Consider now the random variable $W = \sum_{r=1}^{\infty} \eta_{[r]}(X)$ where $\eta_{[r]}(X) = \phi_r(Y) = \phi_r(\eta(X))$. To prove the theorem, two formulations of $\operatorname{Var} W$ are exhibited. First, since the functions $\phi_r$ are orthogonal in $L^2(\mathcal{Y})$ and using the absolute convergence of the series, we obtain a first expression for $\operatorname{Var} W$. On the other hand, applying the variance decomposition (1) to each $\eta_{[r]}(X) = \phi_r(\eta(X))$ yields a second expression, using again the absolute convergence and the expansion of $\mathbb{E}_{X_B}\,\mathrm{MMD}^2(P_Y, P_{Y|X_B})$ obtained in Eq. (9). The theorem follows by equating both formulations of $\operatorname{Var} W$.

Proof. We simply add and subtract $\mathbb{E}_{X_A}\,\mathbb{E}_{\xi, \xi' \sim P_{Y|X_A}}\, k_Y(\xi, \xi')$ and rearrange terms.

Proof. We first rewrite the HSIC between $X$ and $Y$ from Eq. (8) as a multivariate integral, assuming $P_{XY}$ is absolutely continuous with respect to the Lebesgue measure on $\mathcal{X} \times \mathcal{Y}$:
$$
\mathrm{HSIC}(X, Y) = \int_{\mathcal{X}^2 \times \mathcal{Y}^2} k_X(x, x')\, k_Y(y, y')\, \big(p_{XY}(x, y) - p_X(x) p_Y(y)\big)\big(p_{XY}(x', y') - p_X(x') p_Y(y')\big)\, dx\, dy\, dx'\, dy' \quad (22)
$$
where $p_{XY}$, $p_X$ and $p_Y$ are the probability density functions of $(X, Y)$, $X$ and $Y$, respectively. As in Theorem 3 we further assume that Mercer's theorem holds, which means that $k_Y(y, y') = \sum_{r=1}^{\infty} \phi_r(y)\phi_r(y')$. For each $r$, we then define the function
$$
g_{[r]}(x) = \int_{\mathcal{X}} \int_{\mathcal{Y}} k_X(x, x')\, \phi_r(y')\, \big(p_{XY}(x', y') - p_X(x') p_Y(y')\big)\, dx'\, dy',
$$
noting that $g_{[r]} \in \mathcal{F}$ from Assumption 2. It is then straightforward to show that
$$
\|g_{[r]}\|_{\mathcal{F}}^2 = \int_{\mathcal{X}} \int_{\mathcal{Y}} g_{[r]}(x)\, \phi_r(y)\, \big(p_{XY}(x, y) - p_X(x) p_Y(y)\big)\, dx\, dy,
$$
which means that
$$
\mathrm{HSIC}(X, Y) = \sum_{r=1}^{\infty} \|g_{[r]}\|_{\mathcal{F}}^2 \quad (23)
$$
since in Mercer's theorem we have the absolute convergence of the series. Now the idea is to write an orthogonal decomposition (in $\mathcal{F}$) of each function $g_{[r]}$, which will finally provide a decomposition of HSIC through Eq. (23). The orthogonal decomposition of $g_{[r]}$ is obtained with Theorem 4.1 from Kuo et al. (2010). First, recall that the first part of Assumption 3 provides the decomposition required by Eq. (4.1) in Kuo et al. (2010). We then introduce a set of commuting projections $\{P_l\}_{l=1}^{d}$ on $\mathcal{F}$ given by
$$
P_l(f) = \int_{\mathcal{X}_l} f(x_1, \ldots, x_{l-1}, t, x_{l+1}, \ldots, x_d)\, p_{X_l}(t)\, dt \quad (25)
$$
for all $f \in \mathcal{F}$. From the second part of Assumption 3, one has for every subset $A \subseteq \mathcal{P}_d$ and every $x_A \in \mathcal{X}_A$
$$
P_l\big(k_A(\cdot, x_A)\big) = \prod_{l' \in A,\, l' \neq l} k_{l'}(\cdot, x_{l'}) \int_{\mathcal{X}_l} k_l(t, x_l)\, p_{X_l}(t)\, dt = 0 \quad \text{for } l \in A,
$$
meaning that Eq. (4.5) from Kuo et al. (2010) is satisfied. From Theorem 4.1 in Kuo et al. (2010), we can now state that $g_{[r]}$ has a unique orthogonal decomposition into terms indexed by the subsets $B \subseteq \mathcal{P}_d$, each involving the projection $P_{-B} = \prod_{l \notin B} P_l$. Since the decomposition is orthogonal, the squared norm $\|g_{[r]}\|_{\mathcal{F}}^2$ splits accordingly. The last step is to expand $\|P_{-B}(g_{[r]})\|_{\mathcal{F}}^2$.
We first write the projection $P_{-B}(g_{[r]})$ explicitly and compute its norm; combining the resulting expressions over all subsets $B$ and all indices $r$ yields the announced decomposition, which concludes the proof.

Proof. We begin with the integral formulation of HSIC as in Eq. (22) and plug in the kernel defined in Eq. (13). We then use the change of variables $u_l = (x_l - x'_l)/h$, which leads to the chain of equalities (26)-(28): (26) is obtained by the absolute convergence of Mercer's series, (27) by applying the result invoked there to $f = g = \phi_r$ (which is bounded since $k$ is bounded) together with the absolute convergence of Mercer's series, and (28) with Eq. (9). The proof then relies on McDiarmid's concentration inequality, as given in Theorem A.1 in the proof of Proposition 3.2 from Gamboa et al.
Proof. We closely follow the proof of Theorem 1 from Song et al. (2016). We only need to prove that equality (29) holds for any subset $A \subseteq \mathcal{P}_d$ such that $l \notin A$, where $B = \mathcal{P}_d \setminus (A \cup \{l\})$. We first need the generalized law of total variance for $\mathrm{MMD}^2_{\mathrm{tot}}$ from Proposition 1. We then prove equality (29) for the two value functions defined in Lemma 3, except that here we work without the denominator $\mathrm{MMD}^2_{\mathrm{tot}}$ for better readability (this does not change the proof since the same constant appears in both value functions); the generalized law of total variance is used for the second equality of the corresponding chain of equalities. The rest of the proof is identical to that of Theorem 1 from Song et al. (2016).

References

Analysis of variance on function spaces
Applied Functional Analysis, 2nd edn
A simple measure of conditional dependence
A Gaussian process regression model for distribution inputs
A new uncertainty importance measure
Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution
Polynomial calculation of the Shapley value based on sampling
COVID-19 pandemic control: balancing detection policy and lockdown intervention under ICU sustainability
Generalized Hoeffding-Sobol decomposition for dependent variables - application to sensitivity analysis
A new coefficient of correlation
A kernel test of goodness of fit
Modeling wine preferences by data mining from physicochemical properties
Fast global alignment kernels
Analyzing dependent data with vine copulas
Global sensitivity analysis with dependence measures
New perspectives for sensitivity analysis
Efficient estimation of sensitivity indices
Local polynomial estimation for sensitivity analysis on models with correlated inputs
Expected impact of lockdown in Île-de-France and possible exit strategies
Structural reliability methods
UCI machine learning repository
ANOVA kernels and RKHS of zero mean functions for model-based sensitivity analysis
Nonparametric functional data analysis: theory and practice
New sensitivity analysis subordinated to a contrast
Global sensitivity analysis: a new generation of mighty estimators based on rank statistics
Sensitivity indices for multivariate outputs
On graph kernels: hardness results and efficient alternatives, in 'Learning theory and kernel machines'
Measuring sample quality with Stein's method
Measuring statistical dependence with Hilbert-Schmidt norms
A kernel statistical test of independence, in 'Advances in neural information processing systems'
Kernel methods for measuring independence
Image classification with segmentation graph kernels
A class of statistics with asymptotically normal distributions
Importance measures in global sensitivity analysis of nonlinear models
Shapley effects for sensitivity analysis with dependent inputs: comparisons with Sobol' indices, numerical estimation and applications
An importance quantification technique in uncertainty analysis for computer models
Asymptotic normality and efficiency of two Sobol index estimators
A linear-time kernel goodness-of-fit test
A contribution to the mathematical theory of epidemics
On decompositions of multivariate functions
Multivariate sensitivity analysis to measure global contribution of input factors in dynamic models
Moment-independent importance measure of basic variable and its state dependent parameter solution
A unified approach to interpreting model predictions, in 'Advances in neural information processing systems'
Predicting the number of reported and unreported cases for the COVID-19 epidemic in South Korea, Italy, France and Germany
Non-parametric methods for global sensitivity analysis of model output with dependent inputs
Statistical developments for target and conditional sensitivity analysis: application on safety studies for nuclear reactor
An efficient methodology for modeling complex computer codes with Gaussian processes
Estimation of quantile oriented sensitivity indices
Emulators for stochastic simulation codes
Learning from distributions via support measure machines
Control functionals for Monte Carlo integration
Sobol' indices and Shapley value
Efficient estimation of reliability-oriented sensitivity indices
Computing Shapley effects for sensitivity analysis
The f-sensitivity index
Expressivity versus efficiency of graph kernels
Dynamic programming algorithm optimization for spoken word recognition
Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index
A quantitative model-independent method for global sensitivity analysis of model output
Contributions to the theory of games II, Annals of mathematics studies
A Hilbert space embedding for distributions
Sensitivity estimates for non linear mathematical models
Non-parametric estimation of the first-order Sobol indices with bootstrap bandwidth
Shapley effects for global sensitivity analysis: theory and computation
Learning via Hilbert Space Embedding of Distributions
Feature selection via dependence maximization
Supervised feature selection via dependence estimation
Kernel-based sensitivity indices for high-dimensional optimization problems
Global sensitivity analysis for optimization with variable selection
Kernel choice and classifiability for RKHS embeddings of probability distributions, in 'Advances in neural information processing systems'
Hilbert space embeddings and metrics on probability measures
A bound for the error in the normal approximation to the distribution of a sum of dependent random variables
An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels
Learning theory for distribution regression
Measuring and testing dependence by correlation of distances
Large scale in transit global sensitivity analysis avoiding intermediate files
Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy: the 1994 Neyman memorial lecture