title: Explaining Algorithmic Fairness Through Fairness-Aware Causal Path Decomposition
authors: Pan, Weishen; Cui, Sen; Bian, Jiang; Zhang, Changshui; Wang, Fei
date: 2021-08-11

Abstract: Algorithmic fairness has attracted considerable interest in the data mining and machine learning communities recently. So far, existing research has mostly focused on developing quantitative metrics to measure algorithm disparities across different protected groups, and on approaches for adjusting the algorithm output to reduce such disparities. In this paper, we propose to study the problem of identifying the sources of model disparity. Unlike existing interpretation methods, which typically learn feature importance, we consider the causal relationships among feature variables and propose a novel framework to decompose the disparity into the sum of contributions from fairness-aware causal paths, which are paths on the graph linking the sensitive attribute and the final predictions. We also consider the scenario in which the directions of certain edges within those paths cannot be determined. Our framework is model agnostic and applicable to a variety of quantitative disparity measures. Empirical evaluations on both synthetic and real-world data sets show that our method can provide precise and comprehensive explanations of model disparities.

Machine learning algorithms have been widely applied in a variety of real-world applications, including high-stakes scenarios such as loan approvals, criminal justice, and healthcare. In these applications, fairness is receiving increasing attention, as machine learning algorithms may lead to discrimination against certain disadvantaged sub-populations. This has triggered research on algorithmic fairness, which focuses on whether members of specific unprivileged groups are more likely to receive unfavorable decisions from machine learning algorithms. One important line of research in the computational fairness community is to develop metrics for measuring group fairness, such as demographic parity [11], equalized opportunity [14], and accuracy parity [32], so that the discrepancy among the decisions made in different groups (a.k.a. disparity) is precisely quantified; this has further inspired the development of fair machine learning models that aim to minimize such disparities [3, 22, 32, 33]. Despite the great efforts on fairness quantification and fair model development, one critical issue that has not been studied extensively is the diagnostics of model fairness, i.e., what are the reasons that lead to the model disparity? This information is crucial for understanding the intrinsic model mechanism and provides insights on how to improve model fairness. As an example, under the current pandemic, researchers have found that racial and ethnic minority groups have been disproportionately affected by COVID-19. African Americans and Hispanics or Latinos are more likely to have positive tests [2, 23] and COVID-19 associated hospitalizations and deaths [26], compared with non-Hispanic Whites. In this case, it is crucial to figure out whether such disparity comes from genetic factors or from access to adequate healthcare services, which would imply completely different clinical management plans and public health policies for battling the pandemic.
In view of this need, researchers have recently leveraged Shapley value based methods [21] to attribute the model disparity to the sum of individual contributions from input features [6, 20], so that we can understand which feature contributes more or less to the model disparity. However, in real-world problems, the mechanism that causes model disparity could be much more complex. Considering the COVID-19 example above, it turns out that one main factor contributing to such disparity is disproportionate access to care for patients of different races and ethnicities, which can be impacted by both economic status [13] and food insecurity [30]. In practice, these correspond to two different causal paths leading to outcome disparity and imply different public health intervention policies.

In this paper, we propose FACTS (which stands for Fairness-Aware Causal paTh decompoSition), a novel framework for algorithmic fairness explanation. FACTS decomposes the model disparity as the sum of the contributions of a set of Fairness-Aware Causal paThs (FACTs) linking the sensitive attribute with the outcome variable. In this way, our approach can quantify the different causal mechanisms that can lead to the overall model disparity. Specifically, FACTS includes two steps:
• Step 1. FACTs identification, where we propose a method to identify all active paths that link the sensitive attribute and the final outcome without colliders, given a causal graph constructed on the feature variables (these paths are referred to as FACTs). The graph could be given according to domain knowledge or learned from the training data. One important consideration is that the causal directions for certain edges on the graph frequently cannot be determined [7, 25], which makes the graph a Partially Directed Acyclic Graph (PDAG). Our proposed algorithm can effectively identify active paths on PDAGs.
• Step 2. Disparity attribution through Shapley value decomposition, where we propose a Shapley value [21] based method to attribute the quantified model disparity value (e.g., according to demographic parity [11]) to the identified FACTs, so that the contribution of each FACT can be quantified.
In addition, with the derived attributed disparity on FACTs, we further develop a fair learning approach that selectively removes FACTs, based on their effects on disparity and accuracy, through data transformation. Our framework is model agnostic and can be applied to a broad set of popular group-fairness criteria including demographic parity, equalized opportunity, equalized odds, and accuracy parity. With experiments on both synthetic and real-world datasets, we show that FACTS can accurately quantify individual path contributions to the model disparity. With qualitative analysis on real-world datasets, we demonstrate how our approach can appropriately explain the sources of disparity and successfully make fair adjustments 1.

In this paper, we use capitalized/lower-case letters in italics to represent a variable/value of the variable. We use capitalized/lower-case letters in boldface to represent a variable set/values of the variable set. We use {...} to represent a set without an ordering relationship and [...] to represent a sequence. π is a function indicating the rank of specific elements in an ordered sequence; for example, for an ordered pair of features [X_1, X_2], we have π(1) < π(2).
Suppose we are given a dataset of samples characterized by a set of variables {A, X, Y}, where X = {X_1, ..., X_d} is the set of input feature variables, A ∈ {0, 1} is the sensitive attribute, and Y ∈ {0, 1} is the outcome. We set Y = 1 as the favored outcome. f is a trained model and Ŷ is the predicted outcome. x = (x_1, x_2, ..., x_d)^⊤ is a concrete data vector with X_1 = x_1, X_2 = x_2, ..., X_d = x_d.

We introduce the causal-model related concepts that will be used throughout the paper in this subsection. Our descriptions are based on the definitions in [25, 28].

Nodes and Edges. A graph G consists of a set of nodes (variables) {A, Y, Ŷ, X_1, ..., X_d} and a set of edges (variable relations). The edges can be either directed (→) or undirected (−). Two nodes are adjacent to each other if there is an edge linking them. The collection of nodes adjacent to V is denoted as Adj(V).

Paths. A path from V_1 to V_n in G is a sequence of nodes in which every pair of successive nodes is adjacent in G. A path from V_1 to V_n in which all edges are directed from V_1 towards V_n (V_1 → · · · → V_n) is a causal path from V_1 to V_n.

1 We upload our source code at https://github.com/weishenpan15/FACTS.

[Figure 1: An example of a causal graph with nodes A, X_1, ..., X_4 and Ŷ. Here the prediction Ŷ is obtained by a function f which takes X_1, ..., X_4 as input features.]

Causal Relationships. V_i is a parent of V_j if there is a directed edge from V_i to V_j, and V_j is a child of V_i. V_i is an ancestor of V_j if there is a causal path from V_i to V_j, and V_j is a descendant of V_i. Following prior research [5, 17], Ŷ is the child of all X_i, which means the predictor f maps the feature variables X to the predicted output Ŷ.

DAGs and PDAGs. A directed graph is a graph where all edges are directed. A directed graph without directed cycles is a directed acyclic graph (DAG), where a directed cycle is formed by a causal path from V_i to V_j plus a directed edge V_j → V_i. A partially directed graph is a graph whose edges can be either directed or undirected. Similarly, a partially directed acyclic graph (PDAG) is a partially directed graph without directed cycles.

Colliders, Active Nodes and Paths. Even though there are paths linking V_i and V_j in G, V_i and V_j are not guaranteed to be dependent. In the example shown in Figure 1, there are paths (e.g., A → X_3 ← X_4) linking A and X_4, but A and X_4 are independent due to the fact that they are linked by a collider X_3. To characterize the variable relationships, we introduce the following definitions. If a path p contains V_i → V_j ← V_k as a subpath, then V_j is a collider on p. For a given set of nodes C (referred to as the conditioning set), a node V is active relative to C on a path p if either: 1) V is not a collider on p and not in C; or 2) V is a collider, and V or any of its descendants is in C. A path is active relative to C only when every node on the path is active relative to C. When C = ∅, the definition of an active path degenerates to there being no collider on the path. When we say p is an active path in this paper, we mean p is an active path relative to ∅. As in Figure 1, considering the path A → X_3 → Ŷ when C = ∅, X_3 is an active node since it is not a collider; thus A → X_3 → Ŷ is an active path. Similarly, A → X_3 ← X_4 is not an active path because X_3 is a collider, but A → X_3 ← X_4 is an active path relative to {X_3} since the collider X_3 is in the conditioning set {X_3}.

Faithfulness. The distribution of the variables and G are faithful to each other if, for all V_i, V_j, C, V_i is conditionally independent of V_j given C if and only if there exists no active path from V_i to V_j relative to C.
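To make the active-path definition concrete, the following is a minimal sketch (ours, not from the paper) that encodes the Figure 1 graph as a DAG with networkx and checks whether a given path is active relative to a conditioning set C; the node names and the helper `is_active_path` are illustrative.

```python
import networkx as nx

# Figure 1 as a DAG: X1 -> A, A -> X2, A -> X3, X4 -> X3, and X1..X4 -> Yhat.
G = nx.DiGraph()
G.add_edges_from([("X1", "A"), ("A", "X2"), ("A", "X3"), ("X4", "X3"),
                  ("X1", "Yhat"), ("X2", "Yhat"), ("X3", "Yhat"), ("X4", "Yhat")])

def is_active_path(G, path, C=frozenset()):
    """Check whether `path` (a node sequence) is active relative to conditioning set C.
    Only interior nodes can be colliders; endpoints are treated as non-colliders."""
    C = set(C)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = G.has_edge(prev, node) and G.has_edge(nxt, node)
        if is_collider:
            # A collider is active only if it or one of its descendants is conditioned on.
            if not ({node} | nx.descendants(G, node)) & C:
                return False
        else:
            # A non-collider is active only if it is not conditioned on.
            if node in C:
                return False
    return True

print(is_active_path(G, ["A", "X3", "Yhat"]))        # True: no collider on the path
print(is_active_path(G, ["A", "X3", "X4"]))          # False: X3 is a collider
print(is_active_path(G, ["A", "X3", "X4"], {"X3"}))  # True: the collider X3 is in C
```

For the PDAG case, the paper's potential-active-path definition (Section 3) additionally considers the possible orientations of the undirected edges.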
Faithfulness is an important and common assumption in the causality literature, and it is also used in this paper.

Demographic Parity (DP) [11]. A prediction Ŷ satisfies demographic parity if P(Ŷ = 1 | A = 1) = P(Ŷ = 1 | A = 0).

Equalized Odds [14]. A prediction Ŷ satisfies equalized odds if P(Ŷ = 1 | Y = y, A = 1) = P(Ŷ = 1 | Y = y, A = 0), ∀y ∈ {0, 1}.

Equalized Opportunity (EO) [14]. A prediction Ŷ satisfies equalized opportunity if P(Ŷ = 1 | Y = 1, A = 1) = P(Ŷ = 1 | Y = 1, A = 0).

Accuracy Parity [32]. A prediction Ŷ satisfies accuracy parity if P(Ŷ = Y | A = 1) = P(Ŷ = Y | A = 0).

In practice, we can take the difference between the two sides of the equalities in the above definitions as a quantitative measure of disparity. For example,

Δ_DP = P(Ŷ = 1 | A = 1) − P(Ŷ = 1 | A = 0),   (1)
Δ_EO = P(Ŷ = 1 | Y = 1, A = 1) − P(Ŷ = 1 | Y = 1, A = 0)   (2)

are two popular algorithm disparity measures used in fairness learning algorithms [22, 27]. Here we follow the work of [6] and use the signed difference across groups to show which group is privileged. These fairness definitions are based on the (conditional) independence of Ŷ and A, which can be determined using the causal graph [5]. For example, according to the definitions in Section 2.2, Ŷ is independent of A if there is no active path between them, which implies demographic parity. In other words, any non-zero Δ_DP comes from the active paths linking A and Ŷ. In the example in Figure 1, the active paths between A and Ŷ are A ← X_1 → Ŷ, A → X_2 → Ŷ and A → X_3 → Ŷ.

The Shapley value is a popular concept that has been used for model interpretation in recent years [1, 12, 15, 21]. These methods typically decompose the prediction f(x) for a given x as

f(x) = φ^{f(x)}(0) + Σ_{i=1}^{d} φ^{f(x)}(i),   (3)

where φ^{f(x)}(i) is the contribution of feature X_i to f(x) and φ^{f(x)}(0) = E[f(X)] is the average prediction, with the expectation taken over the observed data distribution P(X). φ^{f(x)}(i) is referred to as the Shapley value of feature X_i for f(x). To calculate φ^{f(x)}(i), we first assume a sequential order π for all variables in X, such that π(i) corresponds to the rank of X_i. The Shapley value of X_i for f(x) with respect to π is

φ^{f(x)}_π(i) = ν_{f(x)}({j : π(j) ≤ π(i)}) − ν_{f(x)}({j : π(j) < π(i)}),   (4)

where ν_{f(x)}(S) is a value function defined on feature subsets, and the overall Shapley value is obtained by averaging over permutations with weights w(π), φ^{f(x)}(i) = Σ_{π∈Π} w(π) φ^{f(x)}_π(i). [12] further considered different choices of w(π); for example, one reasonable approach is to put non-zero weights only on those permutations that are consistent with known causal orderings.

There have been a few studies trying to derive explanations for model fairness, which can be categorized as either feature-based or path-specific explanations.

2.5.1 Feature-based Explanation. Lundberg [20] and Begley et al. [6] leveraged the Shapley values defined in Eq.(3) to attribute feature contributions to Δ_DP in Eq.(1). Specifically, they proposed to use the group difference of φ^{f(x)}(i) to quantify the contribution of X_i to Δ_DP:

Φ_DP(i) = E[φ^{f(X)}(i) | A = 1] − E[φ^{f(X)}(i) | A = 0].   (7)

Begley et al. [6] also extended this formulation to the disparity measured on equalized opportunity as in Eq.(2). However, decomposing model disparity into feature contributions ignores the causal structure of the features.

2.5.2 Path-specific Explanation. There has been existing research studying fairness through the causal effect on the outcome of flipping the sensitive attribute value. These works also studied the causal effects from particular paths, which are called path-specific effects [9, 18, 24, 31]. Intuitively, the path-specific effect can be viewed as a quantification of a path's contribution to model disparity. However, existing research has so far only focused on causal paths (paths in which all edges are directed outwards from A), which may miss other sources of disparity. For example, in Figure 1, A ← X_1 → Ŷ is an active path linking A and Ŷ and thus may contribute to model disparity, but it is not a causal path. Therefore, the sum of the path-specific effects considering only causal paths will not amount to the entire model disparity (e.g., measured by Eq.(1)), which leads to incomplete explanations.
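As a concrete illustration of the signed disparity measures above, here is a small sketch (our own, not the paper's code) that computes the demographic-parity, equalized-opportunity, and accuracy-parity differences from binary predictions; `y_hat`, `y`, and `a` are assumed to be binary numpy arrays.

```python
import numpy as np

def disparity_metrics(y_hat, y, a):
    """Signed group differences (group A=1 minus group A=0) for three fairness criteria."""
    g1, g0 = (a == 1), (a == 0)
    delta_dp = y_hat[g1].mean() - y_hat[g0].mean()              # demographic parity, cf. Eq.(1)
    pos = (y == 1)
    delta_eo = y_hat[g1 & pos].mean() - y_hat[g0 & pos].mean()  # equalized opportunity, cf. Eq.(2)
    acc = (y_hat == y).astype(float)
    delta_ap = acc[g1].mean() - acc[g0].mean()                  # accuracy parity
    return delta_dp, delta_eo, delta_ap

# Example with random predictions (illustration only).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
y = rng.integers(0, 2, 1000)
y_hat = rng.integers(0, 2, 1000)
print(disparity_metrics(y_hat, y, a))
```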
In this paper, we propose a novel algorithmic fairness explanation method called Fairness-Aware Causal paTh decompoSition (FACTS), which is introduced in the next section.

In this section, we introduce our framework with Δ_DP as the model disparity metric; generalizations of the framework to other disparity metrics are provided in the appendix. Our framework is based on a causal graph G, which can be obtained from domain knowledge or learned from the training data with existing causal discovery algorithms [28]. According to Section 2.3, active paths from A to Ŷ are the sources of Δ_DP: if there is no active path between A and Ŷ, then Δ_DP = 0. If G is a DAG, we can identify all active paths based on the definition above and analyze their contributions to the disparity. However, in reality G could be a PDAG, where the causal directions on some edges cannot be determined [7, 25], which makes the problem more challenging. In this section, we first propose a corresponding concept, the potential active path, to represent the correlation relations between A and Ŷ under a PDAG; each potential active path between A and Ŷ is referred to as a fairness-aware causal path (FACT) in this paper. Then we propose an algorithm to extract all potential active paths from A to Ŷ. Finally, we decompose Δ_DP as the sum of the contributions of these paths following the Shapley value strategy.

Potential active paths on a PDAG are defined as follows:

Potential Active Paths. A path p in G is a potential active path if one of the following properties is satisfied: (1) all edges on p are directed and p satisfies the definition of an active path in Section 2.2; (2) p contains undirected edges and, no matter which directions are assigned to the undirected edges (adjacent node pairs) on p, the resulting path is active; (3) p contains undirected edges and, considering all possible directions of the undirected edges in G, there exists at least one assignment yielding a DAG G' which satisfies: 1) the path corresponding to p in G' is active; 2) the conditional independence relationships among variables encoded in G' are consistent with those inferred from the observational data.

We illustrate the above definition with examples in Figure 2. The path A → X_1 → X_2 → Ŷ in G_1 is an active path by definition, so it is a potential active path as well. As for A − X_1 → X_2 → Ŷ in G_2, no matter which direction A − X_1 takes, the resulting path A → X_1 → X_2 → Ŷ or A ← X_1 → X_2 → Ŷ is active; thus A − X_1 → X_2 → Ŷ in G_2 is a potential active path. For A → X_1 − X_2 → Ŷ in G_3, consider the direction of the undirected edge to be X_1 → X_2: the corresponding path A → X_1 → X_2 → Ŷ in G_3 is active in this case, and the relationships among the variables are consistent with the conditional independence relations observed from the data; so A → X_1 − X_2 → Ŷ is a potential active path. When G is a DAG, a potential active path is equivalent to an active path.

Potential active paths satisfy the following property:

Proposition 3.1. If there is no potential active path between two variables on G, then the two variables are independent.

This proposition shows the importance of potential active paths between A and Ŷ when we consider Δ_DP: if there is no potential active path between A and Ŷ, then A ⊥⊥ Ŷ and Δ_DP = 0. In order to search potential active paths more efficiently, we have the following proposition:

Proposition 3.2. If p is a potential active path in G, then every subpath of p is also a potential active path.

The proofs of the propositions are provided in the appendix. Based on Proposition 3.2, we propose an algorithm to search all potential active paths from A to Ŷ, as demonstrated in Algorithm 1.
Algorithm 1: Search Potential Active Paths from A to Ŷ
Input: A PDAG G, dataset {A, X}
Output: A set of potential active paths from A to Ŷ: P; a set of features involved in P: X(P)
Initialization: P_{A→} = {paths directly connecting A and X_i : X_i ∈ Adj(A)}, P = ∅, X(P) = Adj(A) ⊲ Adj(A) is the set of nodes adjacent to A on G; P_{A→} is a temporary set storing the partial potential active paths from A during the search
while P_{A→} is not empty do
  pop a path p from P_{A→} and let V be its last node
  for each V' ∈ Adj(V) that is not already on p do
    extend p with V' to obtain p'
    if p' is a potential active path by definition then
      if V' is not Ŷ then
        add p' to P_{A→} and add V' to X(P)
      else
        add p' to P
return P, X(P)

Table 1 summarizes the notation needed in the following presentation. As an example, in G_1 of Figure 2 we have X̄(P) ⊥⊥ A.

Table 1: Notations.
P — the set of potential active paths from A to Ŷ
X(P) — the set of features involved in P
X̄(P) — the set of features not involved in P

Ordering Relationships with respect to A. For two nodes V_i and V_j in X(P), V_i is defined to be prior to V_j with respect to A if: (1) there exists at least one potential active path from A to Ŷ on which both V_i and V_j lie; (2) on all potential active paths from A to Ŷ containing both V_i and V_j, V_i precedes V_j. In the following, we write such a relationship as V_i ≻ V_j. We also define: ∀X_i, A ≻ X_i and X_i ≻ Ŷ. If V_i ≻ V_j and V_i is adjacent to V_j, we call V_j a direct successor of V_i with respect to A, and V_i a direct predecessor of V_j with respect to A. The set of all direct successors of V_i w.r.t. A is denoted by Ds_A(V_i), and the set of all direct predecessors of V_i w.r.t. A is denoted by Dp_A(V_i). For G_1 in Figure 2, we have, for example, Dp_A(X_1) = {A} and Dp_A(X_2) = {A, X_1}. We will write Dp_A(X_i) as Dp_i and Ds_A(X_i) as Ds_i for simplicity in the following.

Completely Ordered with respect to A. X(P) is defined to be completely ordered with respect to A on G if ∀V_i, V_j ∈ X(P) with V_i ∈ Adj(V_j), one of them must be the direct successor or predecessor of the other. This can also be written as: ∀V_i, V_j ∈ X(P) with V_i ∈ Adj(V_j), we have V_i ∈ Dp_j ∪ Ds_j. Considering the special case where G is a DAG, we have the following proposition.

Proposition 3.3. If G is a DAG and no other node is a parent of A, then X(P) is completely ordered with respect to A on G.

Here the condition is sufficient but not necessary for complete ordering w.r.t. A. For example, in G_2 of Figure 2, even though the edge X_1 − A is undirected, X(P) is still completely ordered with respect to A.

In this subsection, we propose an algorithm to quantitatively attribute the model disparity Δ_DP to individual FACTs. We first present the algorithm for the case when X(P) is completely ordered with respect to A on G; the extension to the scenario when the completely-ordered condition does not hold is introduced in Section 3.3. Specifically, our algorithm follows the Shapley value strategy in Section 2.4, and our goal is to decompose Δ_DP as the sum of contributions from the paths in P as Δ_DP = Σ_{p∈P} Φ_DP(p), where

Φ_DP(p) = E[φ^{f(X)}(p) | A = 1] − E[φ^{f(X)}(p) | A = 0],   (8)

and φ^{f(x)}(p) is the Shapley value of FACT p. Following Eq.(4), we can define φ^{f(x)}(p) as

φ^{f(x)}(p) = Σ_{σ∈Π} w(σ) [ν_{f(x)}({p' : σ(p') ≤ σ(p)}) − ν_{f(x)}({p' : σ(p') < σ(p)})],   (9)

where σ is a permutation function over all FACTs, Π is the collection of all permutations, and ν_{f(x)}(T) is a value function defined on a set of FACTs T ⊂ P. In order to appropriately define ν_{f(x)}(T), we need to leverage the causal structure encoded in P. In particular, P can be seen as a system that transfers the information of A downstream and finally affects the value of Ŷ. An intuitive idea is therefore to formulate this system as a calculation process that starts from A, passes through the FACTs, and finally produces the model prediction f(x). During this inference process, each X_i ∈ X(P) can be estimated as

X_i = g_i(Dp_i, X̄(P), ε_i),   (10)

where g_i is the regression link function, Dp_i is the set of direct predecessors of X_i in P, X̄(P) is the set of feature variables that are not in X(P), and ε_i is the random regression error.
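To complement the pseudocode of Algorithm 1, the following is a minimal Python sketch of the same breadth-first search idea, under the simplifying assumption that G is a fully directed DAG (so "potential active path" reduces to "active path"); it reuses the graph `G` and the helper `is_active_path` from the earlier sketch, and all names are illustrative rather than the paper's implementation.

```python
from collections import deque

def search_active_paths(G, start="A", target="Yhat"):
    """Enumerate all active (collider-free) simple paths from `start` to `target` on DAG G."""
    undirected = G.to_undirected()            # adjacency regardless of edge direction
    paths = []
    frontier = deque([[start, v] for v in undirected.neighbors(start)])
    while frontier:
        path = frontier.popleft()
        if not is_active_path(G, path):
            continue                          # Proposition 3.2: an inactive prefix cannot be extended
        if path[-1] == target:
            paths.append(path)
            continue
        for v in undirected.neighbors(path[-1]):
            if v not in path:                 # keep paths simple
                frontier.append(path + [v])
    features = {v for p in paths for v in p[1:-1]}   # X(P): features on the paths
    return paths, features

paths, X_P = search_active_paths(G)
print(paths)   # e.g. [['A', 'X1', 'Yhat'], ['A', 'X2', 'Yhat'], ['A', 'X3', 'Yhat']]
print(X_P)
```

Handling the PDAG case would additionally require testing the orientation conditions in the potential-active-path definition, which this sketch omits.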
We assume the {ε_i} are mutually independent and each ε_i is independent of Dp_i and X̄(P). Hyvärinen et al. [16] proved that we can always construct such {g_i} and {ε_i}. In the following we present the calculation process with a concrete example. Considering G_1 in Figure 2, we have X(P) = {X_1, X_2} and X̄(P) = {X_3}. According to Eq.(10), we also have X_1 = g_1(A, X_3, ε_1), Dp_2 = {A, X_1} and X_2 = g_2(A, X_1, X_3, ε_2).

For calculating ν_{f(x)}(T), we first consider two extreme cases.
• T = P. In this case, the actual value of A is visible to all FACTs, which makes ν_{f(x)}(T) = f(x).
• T = ∅. In this case, the actual value of A is visible to none of the FACTs. We can sample A' from its marginal distribution, A' ∼ P(A), recompute x under A = A' (denoted x(A')), and calculate ν_{f(x)}(T) as the expectation of f(x(A')) over A'.

The case of ∅ ⊂ T ⊂ P is more complicated. We denote by x(A'|T) the value of x computed with T: in this process, the values of A that pass through the paths in T are set to the actual value of A, while those passing through P \ T are set to A' ∼ P(A), and ν_{f(x)}(T) = E_{A'∼P(A)}[f(x(A'|T))]. Considering the situation T = {A → X_1 → X_2 → Ŷ} in the example, we need to transfer the information of A along the path A → X_1 → X_2 → Ŷ, but block this information and use a random sample A' along the other paths A → X_1 → Ŷ and A → X_2 → Ŷ. In this way, we can calculate φ^{f(x)}(p) as in Eq.(9). In practice, we can choose f(x) as the predicted probability that x is positive or as the binary decision obtained by thresholding this probability. In the latter case, the decomposed Shapley values of Δ_DP on FACTs satisfy the following properties (detailed proofs are provided in the appendix): (1) (Efficiency) Σ_{p∈P} Φ_DP(p) = Δ_DP; (2) (Linearity) the decomposition is linear in the prediction function; (3) (Nullity) if a FACT never changes the value function, its contribution is zero.

Our path explanations can also be aggregated to generate feature-level explanations: to obtain the contribution of feature X_i to Δ_DP, we can sum the path contributions over all paths ending with "... → X_i → Ŷ".

When X(P) is not completely ordered, some FACTs could be contradictory to each other. In G_3 of Figure 2, the undirected edge X_1 − X_2 means that X_1 precedes X_2 on some FACTs while X_2 precedes X_1 on others. One solution is to consider all orientation possibilities of the undirected edges. For example, the direction of X_1 − X_2 in G_3 could be either X_1 ← X_2 or X_1 → X_2; we could study each case separately and then summarize the results. However, this makes the exploration space potentially huge (with m undirected edges, there are up to 2^m orientation possibilities to explore). We instead propose to solve this problem by grouping the adjacent feature variables that cause the inconsistency. In the example of G_3 in Figure 2, if we group X_1 and X_2 as C_1 = {X_1, X_2} and treat it as a single variable, then G_3 satisfies the completely-ordered condition. Therefore, we first obtain a partition of X(P) as C(P) = {C_1, ..., C_K} so that C(P) is completely ordered with respect to A with these grouped variables {C_k}_{k=1}^{K}. Then we calculate the contributions of the paths with the group-level variables. The concrete implementation steps are as follows.
Step 1: Identify Potential Active Paths from A to Ŷ. Given G, we find all potential active paths from A to Ŷ as FACTs with Algorithm 1.
Step 2: Generate Groups Completely Ordered with Respect to A. We group the feature variables so that the grouped variables are completely ordered with respect to A, using Algorithm 2.
Step 3: Calculate Path Contributions to Δ_DP. For each group-level variable C_k, we learn a prediction link function g_k and obtain the error term E_k (both g_k and E_k are multi-dimensional, with each dimension corresponding to an individual feature variable in C_k). Finally, we calculate the path contributions at the group level with the procedure in Section 3.2 (a numerical sketch of this step is given below).
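To illustrate the value function and the path-level Shapley values numerically, below is a small end-to-end sketch for the G_1 example (three FACTs through X_1 and X_2). Everything here is our own construction under stated assumptions: the synthetic data-generating coefficients, the logistic-style predictor `f`, the linear link functions fitted with scikit-learn, the uniform average over path orderings, and the Monte Carlo estimate of the value function; none of these names or numbers come from the paper.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data following G_1: A -> X1, A -> X2, X1 -> X2, X3 independent of A.
n = 5000
A  = rng.integers(0, 2, n).astype(float)
X3 = rng.normal(size=n)
X1 = 1.0 * A + 0.5 * X3 + rng.normal(scale=0.3, size=n)              # g_1(A, X3, eps_1)
X2 = 0.8 * A + 0.7 * X1 + 0.2 * X3 + rng.normal(scale=0.3, size=n)   # g_2(A, X1, X3, eps_2)
f = lambda x1, x2, x3: 1 / (1 + np.exp(-(x1 + x2 + 0.3 * x3 - 2)))   # an assumed predictor f

# Fit the link functions g_1, g_2 and recover the residuals eps_i (cf. Eq. 10).
g1 = LinearRegression().fit(np.c_[A, X3], X1)
g2 = LinearRegression().fit(np.c_[A, X1, X3], X2)
e1 = X1 - g1.predict(np.c_[A, X3])
e2 = X2 - g2.predict(np.c_[A, X1, X3])

PATHS = ["A->X1->Yhat", "A->X2->Yhat", "A->X1->X2->Yhat"]

def value(T, n_mc=20):
    """Monte Carlo estimate of nu_{f(x)}(T): A flows as its true value along paths in T
    and as an independent draw A' along the remaining paths."""
    out = 0.0
    for _ in range(n_mc):
        Ap = rng.permutation(A)                                   # A' ~ P(A)
        a_p1 = A if "A->X1->Yhat" in T else Ap
        a_p2 = A if "A->X2->Yhat" in T else Ap
        a_p3 = A if "A->X1->X2->Yhat" in T else Ap
        x1_direct = g1.predict(np.c_[a_p1, X3]) + e1              # X1 value fed directly to Yhat
        x1_to_x2  = g1.predict(np.c_[a_p3, X3]) + e1              # X1 value fed to X2
        x2_direct = g2.predict(np.c_[a_p2, x1_to_x2, X3]) + e2    # X2 value fed to Yhat
        out += f(x1_direct, x2_direct, X3)
    return out / n_mc                                             # per-sample value estimates

def path_shapley():
    """Per-sample phi(p), averaged uniformly over all orderings of the FACTs (cf. Eq. 9)."""
    phi = {p: np.zeros(n) for p in PATHS}
    perms = list(itertools.permutations(PATHS))
    for perm in perms:
        seen = set()
        for p in perm:
            phi[p] += value(seen | {p}) - value(seen)
            seen.add(p)
    return {p: v / len(perms) for p, v in phi.items()}

phi = path_shapley()
for p, v in phi.items():
    # cf. Eq. (8): the path's contribution to Delta_DP is the group difference of its Shapley value.
    print(p, round(v[A == 1].mean() - v[A == 0].mean(), 4))
```

Because the value function is estimated by sampling A', the efficiency property holds only approximately in this sketch.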
Algorithm 2: Generate Groups Completely Ordered w.r.t. A
Input: A PDAG G, P, X(P)
Output: A partition of X(P) into groups C_1, ..., C_K; a set of group-level active paths P'
Initialization: Create groups {C_i}, each containing a single X_i ∈ X(P), and the path set P' obtained by replacing each X_i with its group C_i for all paths in P
while there exist adjacent groups whose ordering with respect to A is inconsistent across the paths in P' do
  merge the inconsistent groups and update P' accordingly
Remove repeated paths in P'
return {C_k}, P'

With the FACT-based model disparity decomposition approach, we can obtain the quantitative contribution of each FACT to the model disparity. If we also take model utility into account, we can then select the paths with high utility contributions and low disparity contributions when building the model. Without loss of generality, we assume the outcome variable Y ∈ {0, 1} and that the prediction model f(x) returns the predicted probability of x belonging to class 1; the utility of f can then be estimated as the expected probability assigned to the correct class. For a given data sample {x, y}, the sample-specific model utility is U(f(x)) = f_y(x), where f_y(x) = f(x) if y = 1 and f_y(x) = 1 − f(x) otherwise, and the overall utility is U(f) = E_{x,y}[f_y(x)]. We can decompose U(f(x)) into the contributions of the FACTs in P: similar to Eq.(9), we define φ^{f(x),y}(p) as the Shapley value of path p computed on f_y(x). In this way, the contribution of p to U(f) is Ψ(p) = E_{x,y}[φ^{f(x),y}(p)], and U(f) can thus be decomposed as a baseline term plus Σ_{p∈P} Ψ(p).

With the decompositions of U(f) and Δ_DP, we can construct an interpretable fair learning algorithm that trades off accuracy and fairness through FACT selection. Specifically, our goal is to select a set of paths T* ⊂ P by minimizing the objective

T* = argmin_{T⊂P} [ −Σ_{p∈T} Ψ(p) + λ |Σ_{p∈T} Φ_DP(p)| ],   (18)

where λ weights the fairness constraint. We propose a greedy algorithm, shown in Algorithm 3, to solve this problem by iteratively removing paths from P to obtain T*. After T* is obtained, ν_{f(x)}(T*) is used as the new prediction result (a sketch of one such greedy procedure is given below).

Synthetic Data: We created a dataset with 10 features under a DAG G. The feature variables and the sensitive attribute are randomly connected by directed edges. We generated the data in two different settings: 1) S_1, where the relation between features and outcome is linear; and 2) S_2, where the relation between features and outcome is non-linear. The detailed data generation process is described in the appendix.

Adult [19]: The Adult dataset consists of 48,842 samples. The task is to predict whether one's annual income is greater than 50K. We consider gender (male, female) as the sensitive attribute and age, nationality, marital status, level of education, working class and hours per week as feature variables, similar to [34]. We set Y = 1 for income ≥ 50K and A = 1 for male.

COMPAS [4]: The dataset contains 6,172 samples and the goal is to predict whether a defendant will recidivate within two years. We follow the data preprocessing procedures in [29]. Race (white, non-white) is selected as the sensitive attribute.

Nutrition: The dataset comes from the NHANES I Epidemiologic Followup Study [10], and the outcome is 15-year survival. Race (white, non-white) is the sensitive attribute; we set Y = 1 for 15-year survival and A = 1 for white people.

We evaluate FACTS from two aspects: 1) path explanations for Δ_DP; and 2) fair learning through FACT selection. The general settings for training the model and learning the FACTs are as follows.

Model Training: We train f using a 70%/30% train/test split. We randomly split the data with this ratio, run the experiments 5 times, and report the average result. The hyper-parameters of f are tuned by cross-validation. When calculating path explanations for Δ_DP, we implement f as a Multi-Layer Perceptron (MLP) or XGBoost [8] and report the results for each.

[Table 5: Path explanations on the Nutrition dataset. The first column shows the contribution of each path to Δ_DP (Φ_DP(p)); the second column shows the contribution to utility (Ψ(p)). For example, the path through C_5 has contributions (0.0099, 0.0061), and the path through C_6 = {White blood cells, Red blood cells} has contributions (−0.0077, −0.0082).]
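As one way to operationalize the FACT-selection objective in Eq.(18) (Algorithm 3 itself is not reproduced in this text), here is a small greedy-removal sketch; the dictionaries `Phi` and `Psi` and the numeric values are illustrative assumptions, not results from the paper.

```python
def greedy_fact_selection(Phi, Psi, lam):
    """Greedily drop FACTs while the objective -sum(Psi) + lam * |sum(Phi)| keeps improving."""
    selected = set(Phi)                                   # start from T = P (all FACTs kept)
    objective = lambda T: -sum(Psi[p] for p in T) + lam * abs(sum(Phi[p] for p in T))
    improved = True
    while improved and selected:
        improved = False
        best = min(selected, key=lambda p: objective(selected - {p}))
        if objective(selected - {best}) < objective(selected):
            selected.remove(best)                         # removing `best` lowers the objective
            improved = True
    return selected

# Illustrative contributions for the three G_1 FACTs (made-up numbers).
Phi = {"A->X1->Yhat": 0.06, "A->X2->Yhat": 0.02, "A->X1->X2->Yhat": 0.01}
Psi = {"A->X1->Yhat": 0.05, "A->X2->Yhat": 0.004, "A->X1->X2->Yhat": 0.002}
for lam in (0.0, 0.5, 5.0):
    print(lam, greedy_fact_selection(Phi, Psi, lam))
```

As in the paper's discussion, λ = 0 keeps every path (maximal utility), while a sufficiently large λ removes them all.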
Causal Discovery: For the synthetic dataset, we directly use the ground-truth causal graph. For the real-world datasets, ground-truth causal graphs are not available. For the Adult and Nutrition datasets, we use the causal graphs built in previous works [9, 24, 29]. For the COMPAS dataset, we construct the causal graph with the PC algorithm [28], with the directions of certain edges restricted and corrected according to domain knowledge. The PDAGs and the rules we use to determine the causal directions are shown in the appendix.

Since there is no existing method specifically designed to explain disparity by causal paths, we adopt the following explanation methods as baselines:
• Feature-based Explanation by Shapley Values: we use different Shapley values (Independent Shapley Values (ISV) [21] and Asymmetric Shapley Values (ASV) [12]) to calculate the feature-based contributions to disparity with Eq.(7).
• Path-specific Explanations (PSE): we calculate the path-specific effect of each potential active path as an estimate of its contribution to disparity, following the calculation in [9]. Since the calculation of path-specific effects requires that the causal relations among X(P) be identified in G, we only report quantitative results on the synthetic, Adult and COMPAS datasets, where this condition is satisfied.

As the path-specific effect (PSE) can also provide estimates of path contributions to disparity, we directly compare PSE and FACTS with the following evaluation metrics.
(1) Accuracy: For the synthetic dataset, where the ground-truth path contributions are available, we evaluate the methods with the normalized root-mean-square error between the estimated contributions Φ_DP(p) and the ground-truth contributions of each path p.
(2) Efficiency: Efficiency is the property that the sum of the contribution values from all FACTs exactly equals the total disparity (Property 1), i.e., Σ_{p∈P} Φ_DP(p) = Δ_DP. To show how close methods come to achieving efficiency, we compute the normalized absolute difference between the summation of path contributions and the disparity, |Σ_{p∈P} Φ_DP(p) − Δ_DP| / |Δ_DP|.

The accuracy results are shown in Table 2. FACTS outperforms PSE on the different datasets and prediction models, especially when f learns a non-linear relation. The results on efficiency are shown in Table 3: the summation of the path explanations obtained by FACTS is closer to the original value of the disparity than that of the baseline.

[Figure 4: Histograms of φ^{f(x)}(p) for two selected paths, A → C_1 → Ŷ and A → C_6 → Ŷ, for the White and Non-white groups.]

We choose Nutrition as the case study in this section; the results on the other datasets are shown in the appendix. We train a model f without any fairness constraint on the dataset. Figure 3(a) illustrates the causal graph containing the top paths contributing to Δ_DP. Table 4 shows the feature contributions to Δ_DP obtained by ISV and ASV, and the result of FACTS is shown in Table 5; the first column of that table gives the path contributions to Δ_DP, while the second column gives the path contributions to the utility. We can see that C_1, which includes the economic-status features {Poverty Idx, Food Program}, has an important effect on Δ_DP. Here Poverty Idx stands for the Poverty Index of a person and Food Program denotes whether he/she is qualified for food programs. Its contribution calculated from ASV is more dominant than that from ISV. This is because economic status, in addition to its direct impact, indirectly affects the output through sedimentation rate, as shown in Figure 3(a).
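For reference, here is a minimal sketch of how the two evaluation criteria can be computed; the normalization choices (the range of the ground truth for the NRMSE, and |Δ_DP| for the efficiency gap) are our assumptions rather than the paper's exact definitions.

```python
import numpy as np

def nrmse(phi_est, phi_true):
    """Normalized RMSE between estimated and ground-truth path contributions
    (normalized here by the range of the ground truth; one reasonable choice)."""
    est, true = np.asarray(phi_est), np.asarray(phi_true)
    rmse = np.sqrt(np.mean((est - true) ** 2))
    return rmse / (true.max() - true.min())

def efficiency_gap(phi_est, delta_dp):
    """Normalized absolute difference between the sum of path contributions and the disparity."""
    return abs(np.sum(phi_est) - delta_dp) / abs(delta_dp)

print(nrmse([0.05, 0.02, 0.01], [0.06, 0.02, 0.01]))
print(efficiency_gap([0.05, 0.02, 0.01], 0.09))
```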
While ISV underestimates the impact of economic status and ASV mixes together all of its impacts through different paths, FACTS provides a more comprehensive and detailed analysis. We also plot the histograms of φ^{f(x)}(p) for selected paths in Figure 4; we can see that φ^{f(x)}(p) is distributed differently across the racial groups.

We choose fair learning baselines as in [6]:
• UNFAIR: A baseline model without any fairness constraint.
• AFL [33]: AFL is a fair learning algorithm that trains the model by maximizing the predictor's ability to predict Y while minimizing an adversary's ability to predict A.
• RFL [3]: Agarwal et al. propose a reductions approach to fair classification, which optimizes a constrained objective to achieve fairness.
For a fair comparison, we use the same MLP structure for all methods. For our method, we first train an UNFAIR model f and apply our algorithm to f to get T*; we then finetune f and use ν_{f(x)}(T*) as the new prediction result. We follow the work of [3] to evaluate the fair learning algorithms: for each method, we run experiments over a range of weights of the fairness constraint (such as λ in Eq.(18)) and plot the accuracy-disparity curve.

We first compare the path contributions to utility and disparity from Table 5 and use Figure 3(b) to illustrate how our algorithm works. A → C_1 → Ŷ is the largest disparity contributor and also makes a major contribution to utility; the other paths also contribute to Δ_DP, but their contributions to utility are relatively low. Algorithm 3 determines the optimal sets of selected paths under different λ in Eq.(18). In Figure 3(b), we can see that when λ = 0, T* contains all FACTs to maintain maximal utility. If λ is large enough (λ = 1.0), all FACTs are removed to meet this strict fairness constraint. For λ = 0.1, T* contains only A → C_1 → Ŷ and A → C_6 → Ŷ: the ratio Ψ(p)/Φ_DP(p) of A → C_1 → Ŷ is greater than λ = 0.1, so it is selected, and A → C_6 → Ŷ is also included because the sign of its Φ_DP(p) is opposite to that of A → C_1 → Ŷ. It may seem odd to keep two paths with opposite signs of Φ_DP(p) in order to obtain a low absolute value of disparity; this reflects the limitation of treating disparity as a single metric of fairness.

We show the accuracy-fairness trade-off curves for demographic parity in Figure 5. Our fair learning algorithm with FACTS achieves results comparable to the other fair learning algorithms and outperforms the baselines on the COMPAS and Nutrition datasets. On the Adult dataset, most of Δ_DP comes from a single path (see the appendix); after this path is removed, both Δ_DP and accuracy drop noticeably. Our algorithm can be completely explained by which paths are removed, while the baselines can only produce black-box models.

In this work, we propose a novel framework to explain algorithmic fairness with the causal graph. In particular, we decompose the disparity into contributions from fairness-aware causal paths that link the sensitive attribute and the model outcome on a causal graph. We propose an algorithm to identify those paths and calculate their contributions to the disparity. With the path explanations, we can gain more insight into the inequality of a machine learning algorithm, and we propose an interpretable method to achieve better trade-offs between utility and fairness.

Proof (Proposition 3.1). Suppose there is no potential active path between two variables V_i and V_j on G, but they are dependent.
According to the faithfulness assumption, considering all directions of the undirected edges on G, there must exist a DAG G' such that the conditional independence relationships among variables encoded in G' are consistent with those inferred from the observational data. That is, there are active paths from V_i to V_j on G'. Suppose p' is one of these active paths and p is its corresponding path on G. By definition, p must be a potential active path, which contradicts the assumption. □

To prove Propositions 3.2 and 3.3, we first prove the following lemma.

Lemma A.1. In a DAG, if p is an active path, then every subpath of p is an active path.

Proof. Suppose p' is a subpath of p. Each node V on p' is also on p. Since p is an active path, V is a non-collider, and by the generality of V, every node on p' is a non-collider. So p' is also an active path. □

We then prove Proposition 3.2 with Lemma A.1.

Proof (Proposition 3.2). Suppose p is a potential active path and p' is any subpath of p. We consider the three conditions in the definition of a potential active path. If all edges on p are directed and p satisfies the definition of an active path, then all edges on p' are also directed and p' is active according to Lemma A.1, and thus a potential active path. Similarly, for the remaining two cases, if p satisfies the condition, then p' satisfies the same condition by Lemma A.1. So p' is a potential active path. □

Proof (Proposition 3.3). Since G is a DAG, a potential active path is equivalent to an active path according to the definition. Assume ∃V_i, V_j ∈ X(P) with V_i ∈ Adj(V_j), but neither of them is prior to the other with respect to A. According to the definition, at least one of the following conditions must be satisfied:
• There is no active path from A to Ŷ through both V_i and V_j. Without loss of generality, let V_i → V_j. Since V_i ∈ X(P), there exists an active path from A to Ŷ through V_i. Consider its subpath connecting A and V_i and call it q; q is an active path according to Lemma A.1. Consider the new path obtained by appending "→ V_j → Ŷ" to q: it is obviously an active path, since V_i and V_j must be non-colliders on it. This new path is an active path from A to Ŷ containing both V_i and V_j, so this condition does not hold.
• There exist at least two active paths from A to Ŷ containing both V_i and V_j, one of the form "A → · · · → V_i · · · → Ŷ" on which V_i precedes V_j, and the other of the form "A → · · · → V_j · · · → Ŷ" on which V_j precedes V_i. The paths point out from A because A is a source node in G. Without loss of generality, let V_i → V_j. Denote the latter path as r, and denote the subpath of r from V_j to V_i as r'; r' must be an active path according to Lemma A.1. There are three possibilities: r' is of the form V_j → · · · → V_i, in which case r' and the edge V_i → V_j form a directed cycle, which violates that G is a DAG; or r' is of the form V_j ← · · · ← V_i or V_j ← · · · ← V_k → · · · → V_i, in both of which cases V_j is a collider on r and r is not active. Since none of these situations can hold, this condition does not hold either.
In conclusion, neither of the conditions above can be satisfied, so the original assumption cannot hold, and X(P) is completely ordered with respect to A on G. □

Proof (Properties). Efficiency: From Eq.(9), we have Σ_{p∈P} φ^{f(x)}(p) = ν_{f(x)}(P) − ν_{f(x)}(∅). Since ν_{f(x)}(P) = f(x), we have

Σ_{p∈P} Φ_DP(p) = (E[f(X) | A = 1] − E[f(X) | A = 0]) − (E[ν_{f(X)}(∅) | A = 1] − E[ν_{f(X)}(∅) | A = 0]),

and when we calculate ν_{f(x)}(∅), we only use A' and X̄(P), which are independent of A, so the second term vanishes and Σ_{p∈P} Φ_DP(p) = Δ_DP.

Linearity: In the calculation of ν_{(f_1+f_2)(x)}(T), (f_1+f_2)(x) can be written as f_1(x) + f_2(x). Then ν_{(f_1+f_2)(x)}(T) = ν_{f_1(x)}(T) + ν_{f_2(x)}(T) for all possible T. With Eq.(9) and Eq.(8), we can prove linearity.
Nullity: If ν_{f(x)}(T ∪ {p}) = ν_{f(x)}(T) for all x and all T ⊂ P \ {p}, then obviously every term in the summation of Eq.(9) is 0, so φ^{f(x)}(p) and Φ_DP(p) are both 0. □

As discussed in previous work by Baer et al. [5], equalized odds and equalized opportunity are satisfied when Ŷ ⊥⊥ A | Y. In fact, Ŷ ⊥⊥ A | Y also induces accuracy parity. To study the conditional dependence between X and Ŷ given Y, we first introduce a new causality concept.

Spouse. If both V_i and V_j are parents of V_k, then V_i is a spouse of V_j and V_j is a spouse of V_i, with child V_k.

Potential Active Paths Relative to Y. The definition of a potential active path relative to Y is obtained by replacing the notion "active path" with "active path relative to Y" in the definition of a potential active path. To save space, we only write the first condition as an example: if all edges on a path are directed and the path satisfies the definition of an active path relative to Y in Section 2.2, then it is a potential active path relative to Y.

Potential active paths and potential active paths relative to Y have the following relations:

Proposition B.1. For any path on G which does not pass through Y, if it is a potential active path, then it is a potential active path relative to Y.

Proposition B.2. If there is no potential active path relative to Y between two variables (neither of them being Y) on G, then the two variables are conditionally independent given Y.

We denote the set of potential active paths from A to Ŷ relative to Y as P_Y. We use the example in Figure 6(a) to further illustrate the definition, where P_Y = {A → X_1 → Ŷ, A → X_1 → Y ← X_2 → Ŷ}. The path A → X_1 → Ŷ obviously satisfies the definition, while A → X_1 → Y ← X_2 → Ŷ is less intuitive: it is not an active path from A to Ŷ, but if we consider the conditioning set {Y}, then Y is a collider on this path that belongs to the conditioning set, while the other nodes {X_1, X_2} are non-colliders and not in {Y}; according to the definition, A → X_1 → Y ← X_2 → Ŷ is therefore an active path relative to Y.

We can propose a similar definition of ordering with respect to A conditioned on Y: V_i ≻ V_j | Y means V_i is prior to V_j with respect to A conditioned on Y. We further consider how the information of A passes through P_Y: it passes through spouses with a common child (Y ← X_2) as well as from direct predecessor to direct successor (A → X_1). We therefore propose new concepts beyond direct predecessor/successor. If the two following conditions are satisfied, we call V_j an informative successor of V_i with respect to A relative to Y: 1) V_i ≻ V_j | Y; 2) V_i is adjacent to V_j, or V_i and V_j are spouses with child Y. Correspondingly, V_i is called an informative predecessor of V_j with respect to A relative to Y. The set of all informative successors of V_i is denoted as Is_{A|Y}(V_i), and the set of all informative predecessors of V_i is denoted as Ip_{A|Y}(V_i). For Figure 6(a), we have Ip_{A|Y}(X_1) = {A}, and X_1 ∈ Ip_{A|Y}(X_2) through the common child Y. We will write Ip_{A|Y}(V_i) as Ip_i and Is_{A|Y}(V_i) as Is_i for simplicity in the following.

Completely Ordered with respect to A Relative to Y. X(P_Y) is defined to be completely ordered with respect to A relative to Y on G if, ∀V_i, V_j ∈ X(P_Y) that are adjacent or are spouses with child Y, one of them must be the informative successor or predecessor of the other.

After defining these concepts, we can directly replace the concept of Dp with Ip in the feature-grouping step. When we decompose f(x) into path contributions conditioned on Y, we need to learn two sets of link functions {g_{i|Y=1}} and {g_{i|Y=0}}, corresponding to the two values of Y, and infer X_i as X_i = g_{i|Y}(Ip_i, X̄(P_Y)). We thus obtain φ^{f(x)|Y}(p) by replacing the functions g_i with g_{i|Y} in the calculation of φ^{f(x)}(p).
For equalized opportunity, we compute the path contribution to Δ_EO as the group difference of φ^{f(x)|Y}(p) among the samples with Y = 1, i.e., Φ_EO(p) = E[φ^{f(X)|Y}(p) | A = 1, Y = 1] − E[φ^{f(X)|Y}(p) | A = 0, Y = 1]. For accuracy parity, the analogous group difference is computed with f replaced by f_Y, as in the utility decomposition. For equalized odds, we need to consider the disparity conditioned on Y = 1 and Y = 0, respectively.

Generation of Synthetic Data. In the efficiency experiments, we generate the data as follows: a feature X_i is randomly connected to X_j (with X_i pointing to X_j) with probability 0.2 if i > j, and otherwise 0; a feature is randomly connected to A (with A pointing to the feature) with probability 0.4. The generative function of each feature is linear. In the accuracy experiments, we constrain the structure and parameters to be symmetric so that the ground-truth contributions can be obtained. In S_1, the probability of Y = 1 is calculated by a linear function of the features (normalized to the range [0, 1]), while in S_2 the probability is calculated with an additional non-linear transformation.

Model Parameters. When we implement f as an MLP, we use a network with one hidden layer, whose dimension is 32/16/8/16 for Synthetic/Adult/COMPAS/Nutrition, respectively. The models are implemented and trained with the Python package scikit-learn, and we choose ADAM to optimize the model.

Causal Discovery Results on Real-world Datasets. Since there is no widely-used causal graph for the COMPAS dataset, we generate a PDAG by the following procedure: first we run the PC algorithm on the data, then we determine and correct the directions following the rules: {race, age, gender} → {numbers of prior crimes, numbers of juvenile felonies/juvenile misdemeanors/other juvenile convictions} → {prior crimes} → {original charge degree}.

Estimation of Φ_DP. Since FACTS and all baselines are expensive to compute exactly, we use a Monte Carlo approximation of Eq.(9). In particular, we conduct breadth-first search and sample 100 orderings from Π, averaging across those orderings. For mixed-type data, we binarize all categorical features and follow the recent work of Wang et al. [29] to formulate g_i and ε_i.

The results on the Adult and COMPAS datasets are shown in Figures 6(b) and 6(c). Due to limited space, we only display the top paths of P on the graphs; the corresponding path contributions are shown in Figure 7. One edge in Figure 6(c) is undirected. In the Adult dataset, marital status is the essential feature contributing to the disparity, while in the COMPAS dataset, age is the largest contributor.

REFERENCES
[1] Explaining individual predictions when features are dependent: More accurate approximations to Shapley values.
[2] Association of black race with outcomes in COVID-19 disease: a retrospective cohort study.
[3] A reductions approach to fair classification.
[4] Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks.
[5] Fairness criteria through the lens of directed acyclic graphical models.
[6] Explainability for fair machine learning.
[7] Incorporating causal prior knowledge as path-constraints in Bayesian networks and maximal ancestral graphs.
[8] XGBoost: A scalable tree boosting system.
[9] Path-specific counterfactual fairness.
[10] Plan and operation of the NHANES I Epidemiologic Followup Study, 1992. Number 35. National Ctr for Health Statistics.
[11] Fairness through awareness.
[12] Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability.
[13] Black workers face two of the most lethal preexisting conditions for coronavirus — racism and economic inequality.
[14] Equality of Opportunity in Supervised Learning.
[15] Causal Shapley Values: Exploiting Causal Knowledge to Explain Individual Predictions of Complex Models.
[16] Nonlinear independent component analysis: Existence and uniqueness results.
[17] Avoiding discrimination through causal reasoning.
[18] Counterfactual fairness.
[19] UCI machine learning repository.
[20] Explaining Quantitative Measures of Fairness.
[21] A unified approach to interpreting model predictions.
[22] Learning Adversarially Fair and Transferable Representations.
[23] SARS-CoV-2 positivity rate for Latinos in the ...
[24] Fair inference on outcomes.
[25] Identifying causal effects in maximally oriented partially directed acyclic graphs.
[26] Hospitalization and mortality among black patients and white patients with Covid-19.
[27] Learning controllable fair representations.
[28] Causation, prediction, and search.
[29] Shapley flow: A graph-based approach to interpreting model predictions.
[30] Food Insecurity and COVID-19: Disparities in Early Effects for US Adults.
[31] PC-fairness: A unified framework for measuring causality-based fairness.
[32] Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment.
[33] Mitigating unwanted biases with adversarial learning.
[34] Causal modeling-based discrimination discovery and removal: Criteria, bounds, and algorithms.