title: Rethinking the framework constructed by counterfactual functional model authors: Wang, Chao; Liu, Linfang; Sun, Shichao; Wang, Wei date: 2022-02-17 journal: Appl Intell (Dordr) DOI: 10.1007/s10489-022-03161-8 Causal inference, as represented by counterfactual inference technology, breathes new life into the current field of artificial intelligence. Although the fusion of causal inference and artificial intelligence performs well in many applications, some theoretical questions have not been well resolved. In this paper, we focus on two fundamental issues in causal inference: the probabilistic evaluation of counterfactual queries and the assumptions used to evaluate causal effects. Both issues are closely related to counterfactual inference tasks: counterfactual queries concern the outcome of the inference task, while the assumptions provide the preconditions for performing it. A counterfactual query asks what causal consequence would arise if we artificially imposed a condition contrary to the facts. In general, to obtain a unique solution, the evaluation of counterfactual queries requires the assistance of a functional model. We analyze the limitations of the original functional model when evaluating a specific query and find that the model arrives at ambiguous conclusions when the unique probability solution is 0. In the task of estimating causal effects, experiments are conducted under some strong assumptions, such as treatment-unit additivity. However, such assumptions are often unsatisfiable in real-world tasks, and the assumptions themselves also lack a scientific representation. We propose a mild version of the treatment-unit additivity assumption, coined M-TUA, based on the damped vibration equation in physics to alleviate this problem.
M-TUA reduces the strength of the constraints in the original assumption while retaining a reasonable formal expression. Originally, most studies on counterfactual inference (such as the query above) came from philosophy. Philosophers establish the form of a logical relationship constituting a logical world that is consistent with the counterfactual antecedent and must be the closest to the real world (for convenience of description, we call this the closest-world approach) [4]. Further, Ginsberg [5] applies similar counterfactual logic to problems in AI tasks, relying on logic based on the closest-world approach. However, the disadvantage of the closest-world approach is that it lacks constraints on closeness measures. Regarding this issue, Balke and Pearl [3] are committed to operationalizing the closest-world approach. Specifically, they suggest turning a CQ into a probability problem, namely the probabilistic evaluation of counterfactual queries (PECQs). In other words, PECQs focus on the probability of an event occurring in a specific CQ, rather than merely outputting "True" or "False" (or "Yes" or "No", etc.) for the query. PECQs motivate us to rethink counterfactual problems in many AI applications. For example, we know that COVID-19 has caused economic losses and increased unemployment in the United States [6]. An important reason is that the government did not deal with the epidemic promptly. Based on the facts that have already occurred, we may reflect on the following question, CQ1: If the government had issued effective policies in time to control the spread of COVID-19, would the unemployment rate in the United States still have risen? Note that CQ1 presupposes a clear causal relationship, namely that COVID-19 caused the unemployment rate in the United States to rise.
Therefore, in response to CQ1, an essential task is to evaluate the degree of belief in the counterfactual consequence (i.e., probability evaluation) after taking into account the facts that have already happened. In other words, it is equivalent to evaluating the probability of a potential (or counterfactual) outcome given the antecedent. Moreover, in CQ1, it is a fact that COVID-19 swept the world and caused the unemployment rate in the United States to rise. Hence, we should focus on the question: what is the probability that the unemployment rate in the United States would rise if there were no COVID-19? The answer undoubtedly influences government decision-making, so evaluating counterfactual queries like these has far-reaching significance for practical applications. With the widespread application of causal inference in the field of AI [7, 8], the currently popular method is to adopt the functional model (FM) [9] for inference. FM takes a CQ as input and outputs a probability evaluation of the CQ by combining prior knowledge and internal inference mechanisms. The evaluation of CQs has benefited many research fields and tasks, such as the determination of liability [10], marketing and economics [11], personalized policies [12], medical imaging analysis [13, 14], Bayesian networks [7], high-dimensional data analysis [15], abductive reasoning [16], interventions on tabular data [8], epidemiology [17], natural language processing (NLP) [18, 19], and graph neural networks (GNNs) [20, 21]. In particular, FM can provide powerful interpretability for machine-learning model decisions [22-25], which is one of the most pressing issues in the Artificial Intelligence (AI) community today. Judea Pearl discusses the limitations of current machine-learning theory and points out that current machine-learning models can hardly serve as the basis for strong AI [9].
An important reason is that the current machine-learning approach is almost entirely statistical, a "black box", which imposes serious theoretical limitations on its performance [26]. For example, it is difficult for current smart devices to make counterfactual inferences. A growing number of researchers are interested in combining counterfactual inference with AI [27, 28], for example to explain consumer behavior [29], to study viral pathogenesis [30], and to predict the risk of flight delays [31]. In addition, counterfactual inference has shown advantages in improving model robustness [32, 33] and in optimizing text-generation [34] and classification tasks [35]. Although counterfactual inference has set off a new upsurge in the field of machine learning, a deeper understanding of the existing models and methods is notably lacking. In our work, we focus on two basic aspects of the counterfactual inference task. The first concerns the counterfactual framework and relates to the inference results of the model. The second concerns the preconditions for counterfactual inference tasks. Specifically, the first aspect is based on a type of counterfactual approach in causal science (e.g., the functional model); we analyze the credibility of some results obtained by using this approach to evaluate CQs. The other aspect we are concerned with is the assumptions used in causal inference to estimate causal effects. Causal effects depend on potential outcomes, yet we cannot observe all the potential outcomes of an experimental individual simultaneously (unobservable outcomes are usually called counterfactual outcomes). Therefore, some assumptions are often needed when estimating a causal effect. We pay attention to a commonly used strong assumption, the Treatment-Unit Additivity (TUA) assumption, and weaken it using some mathematical methods.
Next, we specify the above two aspects as the following two issues (we use a real inference task, PECQs, as an example to explain the relationship between the two issues in Fig. 2). 1) The inference result of FM may be ambiguous. For example, in the task of evaluating the probability solution of CQs by FM, if the model predicts that the probability of a CQ is 0, the result may be ambiguous: although the probability value predicted by the model in this situation is 0, it is still possible that the event will happen. Intuitively, statistical uncertainty may cause ambiguity of the inference results. However, Dawid [36] proves that even if statistical uncertainty is eliminated, the inference may still produce ambiguity. Therefore, when ambiguity cannot be eliminated, we must consider what causes it and how to avoid the trouble it brings. 2) The assumptions used to estimate causal effects in the data are strong and are often violated in real-world applications. Some strong assumptions constrain individuals (e.g., individuals u in an experimental population U) in order to obtain an ideal experimental environment. This neglects the possibility of obtaining an equivalent form of the assumption directly at the abstract level (e.g., the experimental population U, the dataset itself). In some practical applications of causal inference, a challenging task requires researchers to make causal inferences in the absence of data. For example, in the Rubin Causal Model (RCM), the causal effect is described in terms of potential outcomes [37]. Owing to the fundamental problem of causal inference (FPCI), we can only impose additional assumptions on the data distribution to circumvent it.
Some typical assumptions are shown below:
- Stable Unit Treatment Value Assumption (SUTVA) [38], where each outcome O of a unit u is treated as an independent event;
- Assumption of Homogeneity (AOH) [39], which requires that for any individuals u_i and u_{j,j≠i} and any intervention method t, O_{t,u_i} = O_{t,u_{j,j≠i}} always holds;
- Treatment-Unit Additivity (TUA) [36], which some studies also call the assumption of constant effect (AOCE). The TUA assumption imposes the equivalence relationship that, under a given intervention method, the causal effect is the same for every individual, i.e., Δ(u_1) = Δ(u_2) = ... = Δ(u_{|U|}), where Δ(u_i) denotes the individual causal effect of u_i ∈ U and |U| is the cardinality of the set U.
Apparently, AOH is stronger than TUA. Therefore, in the second aspect, we focus on TUA, aiming to obtain a milder TUA assumption. To address the two issues mentioned above, our contributions in this paper are three-fold:
- We focus on a basic problem in the FM and primarily analyze the evaluation method of [3]. We find that FM sometimes produces ambiguous output results for some CQs, even if the final output result is unique. One important reason is that, when estimating the output probability, FM needs to compute the intersection between two sets to get the final result, and this intersection may be an empty set ∅ when estimating some special CQs.
- We provide a mild TUA assumption, called M-TUA, which incorporates the idea of the damped vibration equation.
(Fig. 2: The framework of the probabilistic evaluation of counterfactual queries. The two issues arise in the same inference task but are independent of each other; for a given counterfactual inference task, the plausibility of the output affects the user's confidence, and the strength of the assumed premises determines the scope of the task.)
- We prove theoretically that M-TUA can be applied to large datasets, and we give a reasonable and rigorous mathematical description of this theory (see Theorem 1). In particular, for some complex internal principles, we do not resort to a "black-box" method but instead use M-TUA to reveal the intricate relationship between certain parameters and assumptions and to give a reasonable description and explanation of it.
The rest of this paper is organized as follows: In Section 2, we give the mathematical notations and their descriptions. In Section 3, we visualize the FM inference mechanism and analyze the pitfalls of this mechanism through concrete examples. In Sections 4 and 5, we give a mild version of the TUA assumption (i.e., M-TUA), theoretically prove the equivalent representation of the TUA assumption in vector space, and analyze the rationality and limitations of M-TUA. The comparison between TUA and M-TUA is in Section 6. Section 7 summarizes this paper. In this section, the key mathematical notations and their descriptions are listed in Table 1. In this section, we first introduce the definition of PECQs [3], which is a probabilistic description of the counterfactual query. Second, we review the inference mechanism of FM in Fig. 3. Finally, we analyze the inference mechanism of FM in detail through some examples and find that, when the probabilistic evaluation of a CQ is 0, the result provides unreliable guidance for decision-making. Definition 1 (Probabilistic Evaluation of Counterfactual Queries, PECQs [3]) The core idea of PECQs is to transform a CQ into a probabilistic evaluation problem, which can be formalized as Pr(β_1 | α_1) | (α_0, β_0), (2) where "| (α_0, β_0)" represents the evidence (or observed data) we have observed in the real world, and the value of the evidence can be considered as a conditional probability (e.g., Pr(β_0 | α_0) = p_0). Pr(β_1 | α_1) is the counterfactual probability that we need to infer based on the evidence.
The probabilistic evaluation of (2) can be obtained through the inference mechanism of FM [3] (i.e., Fig. 3). Example 1 CQ1 can be translated into (2) for evaluation. Specifically, for (α_0, β_0), we observe that there is an ineffective policy (i.e., α_0) that causes the unemployment rate to rise (i.e., β_0); Pr(β_1 | α_1) indicates the probability that the unemployment rate falls (i.e., β_1) if we implement effective policies (i.e., α_1). The inference mechanism of FM is shown in Fig. 3 (Fig. 3: The inference mechanism of FM when evaluating CQ1). More detailed information on the inference mechanism of FM is elaborated upon in [3], and we will not repeat it here. Although FM can output a unique solution for a CQ, we find that the result is not credible when the probability estimate output by FM is 0. In other words, an output value of Pr(·) = 0 does not mean that the event will not occur. Next, we introduce some simple examples to reveal the untrustworthy guidance that this ambiguity may bring to decision-making. Example 2 CQ2 [36]: Patient P has a headache. Will it help if P takes aspirin? The information we observe is that the current patient has a headache (denoted as β_0) and is not taking aspirin (denoted as α_0). Therefore, Pr(β_1 | α_1) | (α_0, β_0) is equivalent to the probability evaluation of CQ2 (queries of this form are also called "effects of causes" [36]). However, consider a situation (denoted as the variant of CQ2, abbreviated V-CQ2) in which the patient still does not take aspirin. What is the probability of the headache disappearing? It is equivalent to evaluating Pr(β_1 | α_0) | (α_0, β_0).
If we still choose to use FM to estimate this query, we first determine the value of n_{(α_0,β_0)} ∈ {1, 2} (n_{(α_0,β_0)} refers to the value of n determined according to (α_0, β_0)), and then we determine the new value of n_{(α_1,β_1)} ∈ {3, 4} (n_{(α_1,β_1)} refers to the value of n determined according to (α_1, β_1)). Finally, the evaluation of Pr(β_1 | α_0) | (α_0, β_0) is the sum of Pr(c_{3,β}) | (α_0, β_0) and Pr(c_{4,β}) | (α_0, β_0), i.e., Pr(β_1 | α_0) | (α_0, β_0) = Pr(c_{3,β}) | (α_0, β_0) + Pr(c_{4,β}) | (α_0, β_0) = 0. (3) Why is the evaluation of V-CQ2 equal to 0, and what does this mean? 1) When using FM to estimate the results of CQ1 and V-CQ2, a key step is to calculate the intersection of N_{(α_0,β_0)}, which is determined by the observed evidence in the real world, and N_{(α_1,β_1)} in Fig. 3. Therefore, the probabilistic evaluation of CQ1 is uniquely determined by this intersection, and for V-CQ2 the intersection is empty, which causes the probability evaluation of V-CQ2 to be 0 (i.e., (3), because N_{(α_0,β_0)} ∩ N_{(α_1,β_1)} = ∅). This probability estimate is not completely credible: we cannot be sure whether the output result derives from a real predictive inference or from the inference mechanism's handling of some special counterfactual queries (e.g., V-CQ2). Therefore, when the probabilistic evaluation of a CQ is 0, a decision based on this result is not credible; that is, the result is ambiguous. 2) In addition, in V-CQ2, α_0 does not constitute a counterfactual condition; it still belongs to the real world, and in this case the outcome β_1 is also determined by known evidence in the real world. Hence, we have Pr(β_1 | α_0) | (α_0, β_0) = 1 − p_0, (4) which contradicts the result of (3). This shows that α_0 does not constitute an intervention that affects the outcome of the counterfactual world. Therefore, the estimated value of (3) obtained by FM violates the counterfactual consistency rule [40].
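To make the role of the intersection concrete, the following is a minimal sketch of an FM-style evaluation over hypothetical index sets and weights. The sets, the weights, and the optional fallback to the real-world prior 1 − p_0 are illustrative assumptions for this sketch, not the exact mechanism of [3].

```python
def evaluate_cq(n_evidence, n_query, weights, p0=None):
    """Sketch: the query probability is accumulated over the intersection of
    the index set fixed by the observed evidence and the set selected by the
    counterfactual query.  An empty intersection yields the ambiguous value 0;
    if a real-world prior p0 is supplied, fall back to 1 - p0 instead."""
    common = n_evidence & n_query
    if not common:
        return 0.0 if p0 is None else 1.0 - p0
    return sum(weights[n] for n in common)

# Hypothetical weights over functional-model states n = 1..4.
weights = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}

# V-CQ2-like case: the evidence fixes n in {1, 2}, the query selects n in {3, 4},
# so the intersection is empty and the naive output is the ambiguous 0.
print(evaluate_cq({1, 2}, {3, 4}, weights))          # 0.0 (ambiguous)
print(evaluate_cq({1, 2}, {3, 4}, weights, p0=0.9))  # falls back to 1 - p0
print(evaluate_cq({1, 2}, {2, 3}, weights))          # 0.3
```

The empty-intersection branch is exactly the situation analyzed above: the 0 is produced by the mechanism itself, not by a genuine prediction.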
Example 3 Suppose we predict a probability of 0.8 or 0.9 for an earthquake occurring at a certain location; the difference between these values matters little for decision-making. However, when the probability of an earthquake is estimated to be 0, and this estimate is unique, it is essential to verify its rationality, because the estimate may directly determine whether a corresponding deployment is needed. In other words, how confident are we that there will be no earthquake based on the prediction of FM? Therefore, the fact that there exist queries that cannot be answered using FM does not mean that the evaluation of these queries is meaningless. Example 4 Consider the assassination of Kennedy: if the assassination had failed, would Kennedy still be alive? Formally, if the shot hits the target (α_0), then with high probability (p_0) the hit target dies (β_0); we then estimate Pr(β_1 | α_0) | (α_0, β_0) = ? Using FM, we eventually get Pr(β_1 | α_0) | (α_0, β_0) = 0 (the prediction process is similar to that for V-CQ2). Obviously, if the assassination had failed (that is, the shot was successfully fired but did not cause the target to die) and Kennedy were still alive, this situation might affect the assassin's further decisions and deployment; for Kennedy's team, it might affect the deployment of security measures for similar activities. Therefore, when the estimated result of a CQ is 0, the result cannot provide credible and sufficient support for decision-making. A straightforward solution Through the above series of analyses, it is not difficult to see that when the probability of a CQ is evaluated as 0, further verification and analysis are indispensable, because the inference mechanism of FM itself inevitably introduces ambiguity into the evaluation result Pr(·) = 0. Since FM determines the final output solution through the intersection between two sets, there is a certain probability that the intersection is an empty set.
Specifically, if an empty set appears in the estimation process, we should stop using FM for estimation, because the above analysis shows that we cannot interpret the empty set as Pr(·) = 0. When this happens, we should instead estimate the output probability in the real world rather than in the counterfactual world, thereby avoiding ambiguous results. In this case, Pr(·) = 0 plays the role of prompting a replacement prediction strategy. Therefore, to comply with the counterfactual consistency rule, we must use the prior probability of (4) (i.e., 1 − p_0) to replace Pr(·) = 0. For the second reflection in Fig. 2, in this section we analyze the TUA assumption, which is often used as a strong prerequisite for estimating causal effects in data. We first review the potential outcome framework (Section 4.1), the individual causal effect (Section 4.2), and the definition of TUA (Section 4.3), and we provide an equivalent description of TUA via vectorization (Section 4.4). Second, based on the idea of the Damped Vibration Equation (DVE) [41], we propose a mild TUA assumption (called M-TUA) (Section 4.5). M-TUA not only weakens the original assumption but also has good mathematical properties and interpretability. Our main conclusion in this section rests on two lemmas, and the proof proceeds in two steps. First, we describe the relationship between TUA and the individual causal effect (ICE) in the counterfactual approach, and we explore the equivalence of the ICE and the residual causal effect (RCE) under the TUA assumption (i.e., Lemma 1). Second, we introduce the definitions of positive effects and negative effects and, on this basis, obtain the equivalent form of TUA in vector space by Lemma 2. According to the viewpoint of Rubin [42], causal inference involves an intervention: there is no cause and effect without intervention, and each intervention state corresponds to a potential outcome.
When an intervention state is realized, we can only observe the potential outcome in the realized state; that is, we cannot observe the potential outcomes in the counterfactual world (i.e., counterfactual outcomes, e.g., O_{c,u_2} in Example 6). This situation, in which all potential outcomes of a unit cannot be observed simultaneously, is the FPCI mentioned earlier. Formally, for a binary intervention variable, let d ∈ {t = 1, c = 0}; the observed outcome of unit u_i can then be expressed in terms of the potential outcomes as O_{u_i} = d · O_{t,u_i} + (1 − d) · O_{c,u_i}. For a more intuitive description, we focus on the following 2-dimensional Gaussian distribution model. Specifically, we introduce the following example [36] and use it as the background for the subsequent analysis. Example 5 Consider units u_i ∈ U with potential outcome pairs (O_{t,u_i}, O_{c,u_i}), each with a 2-dimensional Gaussian distribution with means (μ_t, μ_c), σ_c = σ_t = σ_o (for simplicity of calculation, we assume that the distribution has a common variance σ_o), and correlation ρ ∈ (0, 1). Furthermore, we use the mixed model to describe the specific structure, i.e., O_{d,u_i} = μ_d + τ_{u_i} + λ_{d,u_i}, (7) where μ_d indicates the treatment effect applicable to all units; τ_{u_i} represents the effect on unit u_i ∈ U, called the unit effect, and this effect applies to all units, i.e., τ_{u_i} = τ_{u_{j,j≠i}}; and λ_{d,u_i} stands for the effect between treatment and unit, called the unit-treatment interaction, an internal mechanism that reveals the change from one treatment to another for unit u_i. τ_{u_i} and λ_{d,u_i} are independent random variables. Dawid [36] adopts the model (7) to analyze the pros and cons of the counterfactual approach from a decision-making perspective and mentions an assumption often used in counterfactual analysis, called TUA (Definition 2). Because the TUA assumption places strong constraints on the data, it reduces the practicability and scope of use of TUA.
Hence, another goal of this paper is to design a mild TUA assumption that constrains the dataset itself, or the experimental population as a whole, rather than placing a strong constraint on each individual as the traditional TUA assumption does. In the rest of this section, we try to optimize TUA so that it has a broader scope of application in the context of large data. Specifically, we first analyze the individual and average causal effects based on (7). In an experimental study, the individual causal effect (ICE) is the basic object (or basic measure). It describes the differences among the potential outcomes of a given unit u_i ∈ U under all possible treatments d ∈ {t, c}. Generally, for one unit u_i ∈ U, the ICE can be represented as Δ(u_i) = O_{t,u_i} − O_{c,u_i}. (8) For different tasks, the ICE can take other forms, such as Δ(u_i) = log(O_{t,u_i}/O_{c,u_i}). Therefore, from a broader perspective, the subtraction in the definition of the ICE need not be subtraction in R. Note that, whichever form is used, only one potential outcome can be observed [43]. Researchers usually do not pay attention to the ICE directly but focus on the average causal effect over all units, the ACE, also known as the average treatment effect (ATE). The ACE can be expressed by the following formula: ACE = (1/|U|) Σ_{u_i ∈ U} Δ(u_i). (9) Apparently, in (7), ACE = μ_t − μ_c. Limitations of the counterfactual approach focused on the ICE We utilize Example 5 above for our analysis. Specifically, according to (7) and (8), we have Δ(u_i) = (μ_t − μ_c) + Δ(λ_{u_i}), (10) where Δ(λ_{u_i}) ≜ λ_{t,u_i} − λ_{c,u_i} is called the residual causal effect (RCE) [36]. It is easy to verify that Δ(λ_{u_i}) ∼ N(0, 2(1 − ρ)σ_o). Thus, according to (7)-(9), we obtain the distribution of the ICE as Δ(u_i) ∼ N(ACE, 2(1 − ρ)σ_o). (11) However, in (11), σ_λ ≜ 2(1 − ρ)σ_o ∈ (0, 2σ_o). (12) Equation (12) indicates that different values of ρ determine different variances of the distribution of Δ(u_i). We can only obtain a range for σ_λ, and a different ρ leads to a different σ_λ, which causes a variety of uncertain results for reasoning.
For example, we could use (11) to estimate the ICE of a new unit u_new, because inferring Δ(u_new) is equivalent to inferring ACE and 2(1 − ρ)σ_o under (11). Unfortunately, we cannot accurately determine the value of 2(1 − ρ)σ_o. Example 6 (Calculation of the causal effect parameters (i.e., ICE, ACE) in the ideal case). In Table 2, we construct a simple example to demonstrate the calculation of the causal effect parameters such as the ICE and ACE. Suppose a population contains four subjects, labeled u_1, u_2, u_3, and u_4. For each u_i, the potential outcomes in both intervention states are known (in reality, only one potential outcome can be observed). Subjects u_1 and u_2 are in the intervention group (the set of units receiving treatment t) and subjects u_3 and u_4 are in the control group (the set of units receiving treatment c). According to Table 2, we can obtain the ICE of each unit and the ACE. Meanwhile, based on the information in Table 2, we can further obtain two other causal effect parameters: the average treatment effect for the treated (ATT), given in (14), and the average treatment effect for the control (ATC), given in (15). Unfortunately, in the real world, the boldface numbers (e.g., O_{c,u_2}, O_{t,u_3}) in Table 2 are not observable to us. The reason is that the treatment received by subject u_2 is d = t, so we cannot simultaneously observe the potential outcome of u_2 under treatment d = c. Therefore, in the real world, the calculation and estimation of the causal effect parameters require additional constraints (e.g., the treatment-unit additivity assumption, Definition 2) to be imposed on the data. In summary, the potential outcome framework (POF) focuses on the inference of causal effects but does not explain the mechanism of influence between variables [44]. A computational bottleneck is the estimation of the parameter ρ from the marginal distributions.
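The ρ bottleneck can be illustrated numerically. The following Monte-Carlo sketch simulates the mixed model O_{d,u} = μ_d + τ_u + λ_{d,u} of (7); σ_o is taken as the common variance of λ_{t,u} and λ_{c,u}, ρ as their correlation, and all numeric values are invented for illustration. Different ρ produce different variances 2(1 − ρ)σ_o for the ICE distribution (11), even though the mean stays at the ACE.

```python
import random

random.seed(0)
mu_t, mu_c, sigma_o, q = 5.0, 2.0, 1.0, 100_000  # true ACE = mu_t - mu_c = 3
sd = sigma_o ** 0.5                               # standard deviation of lambda

def ice_samples(rho):
    """Draw q individual causal effects Delta(u) = O_{t,u} - O_{c,u}."""
    out = []
    for _ in range(q):
        tau = random.gauss(0, 1)                  # unit effect (cancels in Delta)
        lam_t = random.gauss(0, sd)
        # build lambda_c with correlation rho to lambda_t
        lam_c = rho * lam_t + (1 - rho ** 2) ** 0.5 * random.gauss(0, sd)
        out.append((mu_t + tau + lam_t) - (mu_c + tau + lam_c))
    return out

stats = {}
for rho in (0.2, 0.6, 0.9):
    xs = ice_samples(rho)
    m = sum(xs) / q
    v = sum((x - m) ** 2 for x in xs) / q
    stats[rho] = (m, v)
    print(rho, round(m, 2), round(v, 2))  # mean near 3, variance near 2*(1-rho)
```

With σ_o = 1, the sample variances come out close to 1.6, 0.8, and 0.2 for ρ = 0.2, 0.6, and 0.9, which is exactly why an unknown ρ leaves the ICE distribution undetermined.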
Therefore, in the task of using the causal model for inference, additional constraints (e.g., Example 7) are usually required so that the inference result is obtained under those constraints. Example 7 Under TUA, Δ(u_new) = ACE implies that ρ = 1. Definition 2 (Treatment-Unit Additivity (TUA) [36]). The TUA assumption deals with the non-uniformity of data through a strong requirement. Specifically, TUA requires that Δ(u_i) = Δ(u_{j,j≠i}) for all u_i, u_{j,j≠i} ∈ U. TUA can equivalently be regarded as the assumption of constant effect (AOCE). For example, we can set Δ(u_i) = Δ(u_{j,j≠i}) = a specific constant (e.g., the ACE). Generally speaking, AOCE uses the average effect in the sample to estimate the causal effect. Next, we give a simple example to demonstrate the relation between TUA and the ACE and the application of TUA. Example 8 Consider a fundamental problem of causal inference: let u_1 be a patient. We want to know whether a certain medication has a therapeutic effect on u_1. Suppose that the data about patient u_1 are as shown in Table 3. According to Table 3, we only know that O_{t,u_1} = 13. Owing to the FPCI, we cannot simultaneously observe the outcomes of u_1 taking and not taking the medication. Therefore, we rely on an additional constraint (i.e., TUA) to estimate the value of O_{c,u_1}. Suppose we also have additional data (as shown in Table 4); we can then use the TUA assumption to infer the values of O_{c,u_i} and O_{t,u_i} − O_{c,u_i} (i = 1, 2, 3, 4, 5) and obtain the complete prediction data (see Table 5). TUA assumes that the causal effect Δ(u_i) is the same for all units in U. Unfortunately, as a commonly used prerequisite, TUA is a strong assumption: it cannot be tested on observable data and lacks a more transparent explanation in the real world [36].
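Since Tables 3-5 are not reproduced here, the following sketch re-creates the Example 8 workflow on hypothetical observed data: only O_{t,u_1} = 13 is taken from the text, and every other number is invented for illustration. Each unit reveals a single potential outcome, as in the real world, and TUA is used to fill in the missing counterfactual outcomes.

```python
# Sketch of TUA-based imputation (Example 8 style).
observed = {                 # unit: (treatment received, observed outcome)
    "u1": ("t", 13),         # O_{t,u1} = 13, as in the text
    "u2": ("t", 12),
    "u3": ("c", 9),
    "u4": ("c", 7),
    "u5": ("c", 8),
}

t_obs = [o for d, o in observed.values() if d == "t"]
c_obs = [o for d, o in observed.values() if d == "c"]
# Under TUA/AOCE, every unit shares the same effect, estimated by the
# difference of the group averages.
delta = sum(t_obs) / len(t_obs) - sum(c_obs) / len(c_obs)

# Fill in the unobserved counterfactual outcome of each unit.
completed = {}
for u, (d, o) in observed.items():
    completed[u] = (o, o - delta) if d == "t" else (o + delta, o)

print(delta)            # 4.5 with these hypothetical numbers
print(completed["u1"])  # (13, 8.5): the imputed O_{c,u1}
```

The completed table satisfies the constant-effect constraint by construction, which is precisely the strength (and the untestability) of the assumption.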
This leads to some interesting questions worth exploring, such as: can TUA be weakened while keeping the causal effect estimable, and can the constraint be stated at the level of the population rather than of each unit? To address these issues, we first provide an equivalent form of the TUA assumption under the 2-dimensional Gaussian distribution (i.e., Lemma 1). Lemma 1 Under the setting of Example 5, the TUA assumption has the following equivalent form: Δ(λ_{d,u_i}) − Δ(λ_{d,u_{j,j≠i}}) → 0, (17) where u_i, u_{j,j≠i} ∈ U, i, j ∈ [1, ..., q], q = |U|, q is a sufficiently large positive integer, and Δ(λ_{d,u_i}) = λ_{t,u_i} − λ_{c,u_i}. Proof Given two units u_i and u_{j,j≠i}, according to (7) and (8), we have Δ(u_i) − Δ(u_{j,j≠i}) = Δ(λ_{d,u_i}) − Δ(λ_{d,u_{j,j≠i}}). (18) Hence, a reasonable idea based on (18) is to estimate the ACE from the observed group means. Suppose that q is a large positive integer and naturally let ACE ≈ (1/k) Σ_{j=1}^{k} O_{t,u_j} − (1/(q−k)) Σ_{j=k+1}^{q} O_{c,u_j}, where (1/k) Σ_{j=1}^{k} O_{t,u_j} represents the average of the responses of the k units receiving treatment t, and (1/(q−k)) Σ_{j=k+1}^{q} O_{c,u_j} is the average of the responses of the q − k units receiving treatment c. q, k, and q − k are all large numbers. Therefore, the estimate of the ACE is close to the true value. Next, we apply the TUA constraint to (18), which is equivalent to setting Δ(u_i) − Δ(u_{j,j≠i}) = 0. According to (18), it is unnecessary to constrain every λ_{d,u} to a fixed value if q is large enough. The alternative is to consider the difference between two Δ(λ_{d,u}) and formally characterize Δ(λ_{d,u_i}) − Δ(λ_{d,u_{j,j≠i}}) so that it gradually approaches 0 when q is large. Therefore, in terms of the RCE under consideration, we obtain the equivalent form of the TUA assumption, which proves the lemma. Further, we analyze the properties of TUA in 2-dimensional vector space. From the above analysis, it is not difficult to see that both TUA and the equivalent form given by Lemma 1 are only numerical constraints (e.g., on Δ(u_1) − Δ(u_2)). In other words, neither the TUA assumption itself nor Lemma 1 reflects the internal influence of these quantities on the data.
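The group-mean estimator in the proof of Lemma 1 can be sketched directly: the ACE is estimated as the mean of k treated responses minus the mean of q − k control responses, and the estimate tightens around the true value as q grows. The data-generating parameters below are invented; the true ACE is μ_t − μ_c = 3.

```python
import random

random.seed(1)
mu_t, mu_c = 5.0, 2.0    # true ACE = 3.0
estimates = {}
for q in (10, 1_000, 100_000):
    k = q // 2           # k treated units, q - k control units
    treated = [mu_t + random.gauss(0, 1) for _ in range(k)]
    control = [mu_c + random.gauss(0, 1) for _ in range(q - k)]
    # hat_ACE: difference of the observed group averages
    estimates[q] = sum(treated) / k - sum(control) / (q - k)

for q, est in estimates.items():
    print(q, round(est, 3))  # the estimate concentrates around 3 as q grows
```

This is the sense in which, for large q, the ACE is "estimable and close to the true value" even though no individual effect is ever observed twice.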
To explore the internal influence of TUA on the data, our core idea is to transform the original TUA constraints on values (i.e., scalars) into constraints on vectors. Specifically, we analyze the TUA assumption by vectorizing λ_{d,u_i} (i.e., Lemma 2) and by introducing a definition of the positive and negative effects of λ_{d,u_i} on the data (i.e., Definition 3). Lemma 2 For any λ_{d,u_i}, let Δ_+(λ_{d,u_i}) denote the positive effect of λ_{d,u_i} on the data, and Δ_−(λ_{d,u_{j,j≠i}}) denote the negative effect of λ_{d,u_{j,j≠i}} on the data. Then the TUA assumption has the following equivalent form in the vector space: Σ_{i=1}^{q_+} Δ_+(λ_{d,u_i}) + Σ_{j=1}^{q_−} Δ_−(λ_{d,u_j}) = 0, (21) where q_+ + q_− = q. Before proving Lemma 2, we need to introduce the definitions of the vectorization of λ_{d,u_i} and of positive and negative effects. Definition 3 (The vectorization of λ_{d,u_i}.) Let λ_{d,u_i} = L_{ao} represent the distance from a certain point a to the point o in the coordinate system (e.g., in Fig. 4a, L_{ao} represents λ_{d,u_i} and L_{bo} represents λ_{d,u_{j,j≠i}}). The vectorization of λ_{d,u_i} refers to assigning the characteristics of a vector to λ_{d,u_i} so as to describe the possible positive or negative effect of λ_{d,u_i} on the data, as shown in Fig. 4b. As shown in Fig. 4c, there is a one-to-one correspondence between positive effects and negative effects: if a positive effect "+" exists, there must be a negative effect "−" corresponding to it. Rationality analysis According to Definition 3, we transform the original TUA constraints on scalars into constraints on vectors. For example, some individuals insist on eating nuts in everyday life because nuts are good for their health (a positive effect), but other people are allergic to nuts, and eating them brings pain and can even be life-threatening (a negative effect). Therefore, we argue that it is necessary to consider the positive or negative effects of λ_{d,u}.
Definition 3 provides an intuitive representation of positive/negative effects in the vector space. According to the definition, we next give a proof of Lemma 2 as follows.

Proof For ease of understanding, we combine Fig. 4 with the proof. First, consider the representation of λ_{d,u_i} in a 2-dimensional plane. As shown in Fig. 4a, we represent λ_{d,u_i} as a Euclidean distance in the plane, i.e., λ_{d,u_i} = L_{ao}. According to Lemma 1, Δ(u_i) = Δ(u_{j,j≠i}) can be regarded as Δ(λ_{d,u_i}) = Δ(λ_{d,u_{j,j≠i}}).

Second, we consider the representation of the TUA in 2-dimensional vector space. According to Definition 3, we can vectorize λ_{d,u}. The meaning of vectorization is to give each Δ(λ_{d,u_i}) a measure that describes the positive or negative effect of Δ(λ_{d,u_i}) on the data. In order to maintain consistency with the original TUA assumption, we assume that |Φ^+(λ_{d,u_i})| = |Φ^-(λ_{d,u_{j,j≠i}})|. For instance, as shown in Fig. 4b, |Φ^+(λ_{d,u_i})| (resp. |Φ^-(λ_{d,u_{j,j≠i}})|) denotes the positive (resp. negative) effect of Δ(λ_{d,u_i}) on the data, although the two point in opposite directions.

Third, we consider extending Δ(λ_{d,u_i}) to the entire dataset. Since the background of our research is large datasets, we imply a condition here: over the entire data, the positive effects Φ^+(λ_{d,u_i}) and negative effects Φ^-(λ_{d,u_{j,j≠i}}) on data generation are essentially balanced. Furthermore, since |Φ^+(λ_{d,u_i})| = |Φ^-(λ_{d,u_{j,j≠i}})|, we can visualize the entire dataset as a circle in a 2-dimensional plane, where |Φ^+(λ_{d,u_i})| = |Φ^-(λ_{d,u_{j,j≠i}})| = r. Intuitively, under the TUA constraint, Φ^+(λ_{d,u_i}) + Φ^-(λ_{d,u_{j,j≠i}}) = 0 always holds. However, this cancellation does not have to rest on the strong constraint Δ(λ_{d,u_i}) = Δ(λ_{d,u_{j,j≠i}}). In other words, in Fig. 4b, it is sufficient that the red area equals the blue area. Therefore, we can relax the restriction on Δ(λ_{d,u_i}) by only assuming Σ_{i=1}^{q^+} Φ^+(λ_{d,u_i}) + Σ_{j=1}^{q^-} Φ^-(λ_{d,u_j}) = 0. In summary, we obtain this conclusion, i.e., (21), based on TUA, which proves the lemma. 
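The dataset-level relaxation in the proof above can be illustrated with a tiny numerical sketch (our own illustration; the specific magnitudes are arbitrary): the positive and negative effects below do not cancel pairwise, yet their aggregate sum is zero, which is all the vector-space form of TUA requires.

```python
# Positive and negative effects for ten units -- illustrative values only.
# No positive entry is matched by an equal-magnitude negative entry,
# so the strong per-pair TUA constraint fails ...
positives = [0.8, 1.2, 1.0, 0.7, 1.3]
negatives = [-0.9, -1.1, -1.4, -0.6, -1.0]

# ... but the dataset-level constraint (sum of all effects = 0) still holds.
total = sum(positives) + sum(negatives)
```

Only the aggregate balance matters: the red and blue "areas" are equal even though no individual pair cancels.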
The traditional TUA strongly constrains all λ_{d,u_i} (or Δ(u_i)) to be the same for u_i ∈ U, which undoubtedly ignores the effect of λ_{d,u_i} on the data and on the estimated ICE. However, ignoring this effect by applying TUA does not mean that the effect of λ_{d,u_i} on the data does not exist. Therefore, rather than directly ignoring this potential impact, we represent it by introducing the vectorization method (i.e., the positive and negative effects in Definition 3). In addition, Lemma 2 relaxes the constraint to the level of the entire dataset U rather than imposing a strong constraint on each unit u_i. Therefore, Lemma 2 can be considered an equivalent form of TUA at the abstract level.

Through the above analysis, we provide the equivalent form of the TUA, which is based on a 2-dimensional Gaussian distribution and a large dataset. By performing vectorization operations on λ_{d,u_i}, u_i ∈ U, we introduce the definitions of positive and negative effects, aiming to study the effect of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) on the data without requiring Δ(λ_{d,u_i}) = Δ(λ_{d,u_{j,j≠i}}). Although we assume that the effects of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) cancel in a large dataset, we hope that Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) have less and less impact on the data as q approaches q′. This concern is necessary because, if the sample size is not large enough, the positive and negative effects may not cancel each other out; for example, the positive effects may be greater than the negative effects or vice versa. Quantifying Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) requires rigorous and rational mathematical expressions. Therefore, a natural question is: how do we describe the convergence of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) as q approaches q′? We give the answer in Theorem 1. 
In classical physics, damping refers to the characteristic that the amplitude of vibration in an oscillating system gradually decreases, which may be caused by external influences or by the system itself [45]. We introduce this idea into the study of a descriptive equation for Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}).

Fig. 4 Vectorizing λ_{d,u_i}. (a) is the geometric description of the traditional TUA assumption in the coordinate system. According to Lemma 1, Δ(u_i) = Δ(u_{j,j≠i}) can be regarded as Δ(λ_{d,u_i}) = Δ(λ_{d,u_{j,j≠i}}). Hence, in the 2-dimensional plane, we can use the Euclidean distances L_{ao} = L_{bo} to describe Δ(λ_{d,u_i}) = Δ(λ_{d,u_{j,j≠i}}); (b) describes the vectorization of Δ(λ_{d,u_i}). According to the definitions of positive (red) and negative (blue) effects and the TUA assumption, we have |Φ^+(λ_{d,u_i})| = |Φ^-(λ_{d,u_{j,j≠i}})|; (c) also describes the vectorization of Δ(λ_{d,u_i}). It should be noted that the positive and negative effects of λ_{d,u_i} on the data are almost equal when the number of samples is large enough. Since |Φ^+(λ_{d,u_i})| = |Φ^-(λ_{d,u_{j,j≠i}})|, all the vectorized Δ(λ_{d,u_i}) can form a circle in a 2-dimensional plane; (d) reflects the expansion of the TUA assumption in the vector space. It can be regarded as a visualization of the TUA assumption at an abstract level (that is, constraints are applied to the dataset U rather than to each u_i). In other words, it is no longer necessary that each Φ^+(λ_{d,u_i}) be paired with an equal-magnitude Φ^-(λ_{d,u_{j,j≠i}}).

In this section, we provide a descriptive equation for Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) which guarantees that, as q approaches q′, Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) converge strictly to 0 (see Theorem 1).

Theorem 1 Given the positive effect Φ^+(λ_{d,u_i}) and negative effect Φ^-(λ_{d,u_{j,j≠i}}) of Δ(λ_{d,u_i}) on the data, Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) satisfy (or approximately satisfy) the following equations,

S(Φ^+(λ_{d,u_i}), q) = A^+ e^{−η^+·q} cos(n·η^+·q),
S(Φ^-(λ_{d,u_{j,j≠i}}), q) = A^- e^{−η^-·q} cos(n·η^-·q),   (26)

where n ∈ Z^+, and η^+ > 0, η^- > 0 are adjustment parameters. e^{−η^+·q} and e^{−η^-·q} are attenuation parameters. A^+ and A^- are the initial values of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}), respectively. 
Then Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) gradually converge to 0 as q approaches q′.

Proof Let us first analyze the first term of (26), i.e.,

S_1(Φ^+(λ_{d,u_i}), q) = A^+ e^{−η^+·q},  S_2(Φ^-(λ_{d,u_{j,j≠i}}), q) = A^- e^{−η^-·q},   (27)

where A^+ and A^- are the initial values of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}), respectively. Because η^+ > 0 and η^- > 0, the two terms e^{−η^+·q} and e^{−η^-·q} decay with the data size q. Unfortunately, if the equation only uses (27) to describe the exponential decay trend of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}), it cannot reflect their potential impact on the data. In other words, Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) do not necessarily converge along a strictly monotonically decreasing function (see Fig. 5). Therefore, we need to consider the volatility of the effect of Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) on the data. Since this influence may be volatile, we add the term cos(n·η^{+/-}·q) to (27) and rewrite it as (26), where n ∈ Z^+ and η^+ > 0, η^- > 0 are adjustment parameters, and e^{−η^+·q} and e^{−η^-·q} are attenuation parameters. The envelope A^{+/-} e^{−η^{+/-}·q} ensures that (26) still decays exponentially, while cos(n·η^{+/-}·q) superimposes the oscillation.

According to Fig. 5, we can intuitively understand the meaning of the parameters A^+ and η^+ in (27). The parameter A^+ determines the initial maximum value of the positive effect, and the parameter η^+ determines the convergence speed of the function S_1(Φ^+(λ_{d,u_i}), q). Although S_1(Φ^+(λ_{d,u_i}), q) describes a positive effect that converges to 0 quickly as the number of samples increases, it ignores the volatility of positive effects. The proof for S_2(Φ^-(λ_{d,u_{j,j≠i}}), q) is similar. Similarly, according to Fig. 6, we can intuitively understand the meaning of the parameters A^+ and η^+ in (26). 
The parameter A^+ determines the initial maximum value of the positive effect, the parameter η^+ determines the convergence speed of the function S(Φ^+(λ_{d,u_i}), q), and cos(n·η^+·q) reflects the volatility of the positive and negative effects. The purpose of introducing cos(n·η^+·q) is to reflect, as much as possible, the conversion between the positive effect and the negative effect. The conversion can go either way: a positive effect may become a negative effect or vice versa. However, no matter how it is converted, the effect eventually converges strictly to 0 under the envelope A^+ e^{−η^+·q}. The proof for S(Φ^-(λ_{d,u_{j,j≠i}}), q) is similar.

The rationality analysis of the equations S(Φ^+(λ_{d,u_i}), q) and S(Φ^-(λ_{d,u_{j,j≠i}}), q) mainly includes two aspects:

- One is the analysis of the visualization results of S(Φ^+(λ_{d,u_i}), q) and S(Φ^-(λ_{d,u_{j,j≠i}}), q).
- The other is the interpretability of S(Φ^+(λ_{d,u_i}), q) and S(Φ^-(λ_{d,u_{j,j≠i}}), q).

The function of cos(n·η^{+/-}·q) To simplify the presentation, we only analyze positive effects in this subsection; the analysis of negative effects is similar. As shown in Fig. 5, S_1(Φ^+(λ_{d,u_i}), q) only reflects the nature of exponential decay as q increases. Although S_1(Φ^+(λ_{d,u_i}), q) also eventually converges to 0, it does not reflect the potential impact on the data, because it describes the positive effect directly as a strictly monotonically decreasing function. However, a representation based on strict monotonic decrease ignores the internal complexities: the effect of Φ^+(λ_{d,u_i}) on the data may be volatile (the situation may be even more complex). Therefore, in order to describe the volatility of Φ^+(λ_{d,u_i}), we introduce the cos(·) function. Apparently, S(Φ^+(λ_{d,u_i}), q) presents a trend of exponential decay with volatility. Finally, as q increases, S(Φ^+(λ_{d,u_i}), q) strictly converges to zero. 
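The damped-vibration form in Theorem 1 can be checked numerically. The sketch below is our own illustration (the parameter values A = 1, η = 10^-3, n = 5 are arbitrary choices, not values from the paper): it evaluates S(q) = A·e^{−η·q}·cos(n·η·q) and exhibits both properties discussed above, i.e., the decay is not monotonic (the sign flips as the positive effect converts into a negative one), yet the exponential envelope forces strict convergence to 0 as q grows.

```python
import math

def damped_effect(q, A=1.0, eta=1e-3, n=5):
    # Theorem 1's form: exponential envelope times an oscillating cosine term.
    return A * math.exp(-eta * q) * math.cos(n * eta * q)

# Sample the effect at increasing dataset sizes q.
samples = [damped_effect(q) for q in (0, 500, 1000, 5000, 10000)]
# The sign flips along the way (volatility), while |S(q)| <= A * exp(-eta * q)
# guarantees convergence to 0 as q grows.
```

Increasing η speeds up the decay; increasing n makes the positive/negative conversions more frequent without changing the envelope.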
Attenuation parameters e^{−η^{+/-}·q} The purpose of introducing the attenuation parameter e^{−η^{+/-}·q} is to ensure that the positive effect and the negative effect exhibit exponential decay as q increases. Although we improve TUA by vectorization, we hope that S(Φ^+(λ_{d,u_i}), q) and S(Φ^-(λ_{d,u_{j,j≠i}}), q) will have minimal impact on the overall data. Therefore, even while acknowledging the existence of positive and negative effects, we hope that Φ^+(λ_{d,u_i}) and Φ^-(λ_{d,u_{j,j≠i}}) decay as quickly as possible, in an exponential manner.

In fact, through Lemma 1, Lemma 2, and Theorem 1, we provide a milder TUA assumption (referred to as M-TUA for short) via vectorization operations. In particular, (26) provides a formal description of positive and negative effects, which makes M-TUA interpretable. In summary, the above conclusions provide a mild form of TUA at the abstract level with an explicit (but not unique) mathematical description.

In this section, we compare the traditional TUA and M-TUA to illustrate the similarities and differences between them.

- Δ(u_i) and Δ(λ_{d,u_i}). TUA assumes that the value of the ICE is the same for all u_i ∈ U (|U| = q), e.g., Δ(u_i) = ACE, where i ∈ [1, ..., q]. M-TUA transfers this problem to a constraint on Δ(λ_{d,u_i}) via the vectorization operation, that is, Σ_{i=1}^{q^+} Φ^+(λ_{d,u_i}) + Σ_{j=1}^{q^-} Φ^-(λ_{d,u_j}) = 0, where q^+ + q^- = q.
- Vector Φ^{+/-}(λ_{d,u_{i/j}}) versus scalar Δ(λ_{d,u_i}). M-TUA provides a vector description of positive and negative effects for Δ(λ_{d,u_i}) (i.e., Φ^{+/-}(λ_{d,u_{i/j}})), which distinguishes M-TUA from traditional TUA. The vectorization operation allows differences between individuals to exist; that is, Δ(λ_{d,u_i}) ≠ Δ(λ_{d,u_{j,j≠i}}) is allowed under the premise Σ Φ^+(λ_{d,u_i}) + Σ Φ^-(λ_{d,u_{j,j≠i}}) = 0. Therefore, M-TUA achieves a weakening of TUA.
- Variance. For a randomized experiment, the TUA assumption implies that the variance is constant across all treatments. 
Constant variance is not a necessary condition for M-TUA; rather, M-TUA should be applied to data with small variance in order to constrain the dispersion of the population. For intuitiveness, we use a simple example to further illustrate how M-TUA weakens the TUA assumption.

Example 9 (Difference between data generated by TUA and M-TUA) TUA differs from M-TUA in a number of respects. The goal of this example is to compare the data generated under the different assumptions by estimating the unobserved potential outcomes from Table 6.

- Similar to Example 8, in Table 7 we construct a set of data (including 10 subjects u_i, i ∈ [1, 2, ..., 10]) that satisfies the TUA assumption, where Δ(u_i) is identical for all subjects.
- Tables 8 and 9 are constructed based on the M-TUA assumption.

As can be seen from Table 7, the data only follow two situations, i.e., O_{c,u_i} < O_{t,u_i} (i.e., ACE(u_i) > 0) or O_{c,u_i} > O_{t,u_i} (i.e., ACE(u_i) < 0). However, this strong assumption, which forces all subjects to have the same Δ(u_i), is often violated in the real world. M-TUA alleviates this problem and is more in line with the complex situations in real data (note that the values of Δ(λ_{d,u_i}) in Tables 8 and 9 are not unique). As shown in Tables 8 and 9, based on the M-TUA assumption (i.e., Σ_{i=1}^{10} Δ(λ_{d,u_i}) = 0), the data can better fit the assignment mechanism while the ACE value remains unchanged, thereby avoiding the restriction to either O_{c,u_i} < O_{t,u_i} (i.e., ACE(u_i) > 0) or O_{c,u_i} > O_{t,u_i} (i.e., ACE(u_i) < 0). For example, according to (10), since the estimated ACE(u_i) equals the true ACE(u_i), only Σ_{i=1}^{10} Δ(λ_{d,u_i}) = 0 needs to be satisfied in (32), and there are countless assignments of Δ(λ_{d,u_i}) that satisfy this constraint. 
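The constraint in Example 9 is easy to verify programmatically. In the sketch below, the ten deviation values for Δ(λ_{d,u_i}) are our own illustrative choices, not the entries of Tables 8 and 9: they sum to zero, so the per-unit effects differ from one another while the average causal effect stays fixed at the ACE, and their variance remains small, in line with M-TUA's dispersion requirement.

```python
ace = 1.0
# Illustrative deviations Delta(lambda_{d,u_i}) for 10 units -- our own values,
# chosen so that positives and negatives cancel: sum(deltas) == 0.
deltas = [0.5, 0.3, 0.4, 0.2, 0.1, 0.0, -0.4, -0.3, -0.5, -0.3]

# Heterogeneous per-unit effects, impossible under plain TUA:
unit_effects = [ace + d for d in deltas]

mean_effect = sum(unit_effects) / len(unit_effects)   # stays equal to ace
variance = sum(d ** 2 for d in deltas) / len(deltas)  # dispersion of the deltas
```

Any other assignment with zero-sum deltas works equally well, which is exactly the "countless assignments" point above; the variance check is what rules out assignments that are too dispersed.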
Example 9 shows that the data constructed based on the M-TUA assumption allow differences between the various u_i (e.g., Δ(λ_{d,u_i}) < 0 for u_7, u_8, u_9, u_10, Δ(λ_{d,u_i}) > 0 for u_1, ..., u_5, and Δ(λ_{d,u_6}) = 0), while ensuring that ACE(u_i) is constant (e.g., ACE(u_i) = 1), which is more in line with the diversity of experimental samples in real tasks. However, note that it is not sufficient simply to require that Σ_{i=1}^{10} Δ(λ_{d,u_i}) = 0 holds, since this constraint alone does not guarantee that the data keep a good dispersion. Therefore, an indispensable measure is to introduce the variance as a metric to constrain the data, so that the data constructed based on M-TUA maintain good dispersion. The reason is that the larger the population and the smaller the variance, the closer the estimated ACE is to the true ACE, regardless of which specific units are randomly assigned to the treatments. As mentioned above, for a randomized experiment, TUA implies that the variance is constant for all treatments, which means that constant variance is a necessary condition for TUA, while M-TUA only requires a small variance (e.g., the variance of Δ(λ_{d,u_i}) in Table 8 is less than 0.5, and the variance of Δ(λ_{d,u_i}) in Table 9 is close to 1).

Limitations Although M-TUA weakens TUA to a certain extent and expands the scope of use of the original TUA, M-TUA itself rests on some assumptions and only achieves equivalence with TUA in the large-sample case, i.e., q → q′. Therefore, M-TUA still has the following limitations.

- Dimensionality limitation of the vector space. We take the 2-dimensional Gaussian distribution as an example: based on Example 5, we analyze the equivalent form of TUA in 2-dimensional vector space. The vectorization operation in 2-dimensional space can easily be extended to 3-dimensional space. However, the equivalent form of the TUA for data in high-dimensional space has not been rigorously established. 
Table 9 Assignment mechanism based on the M-TUA assumption with ACE(u_i) = 1.

- Balance of Φ^{+/-}(λ_{d,u_{i/j,j≠i}}). As shown in Fig. 4d, M-TUA implies the premise Σ_{i=1}^{q^+} Φ^+(λ_{d,u_i}) + Σ_{j=1}^{q^-} Φ^-(λ_{d,u_j}) = 0, where q^+ + q^- = q. A large enough sample size is required to ensure that this equation holds with high probability, because the effect of any Δ(λ_{d,u_i}) may be positive or negative (this is similar to the classical coin-toss experiment: when the number of tosses is sufficient, the numbers of heads and tails are essentially equal).
- Decay rate. The factor e^{−η^{+/-}·q} in Theorem 1 ensures that (26) eventually converges to 0 with exponential decay. The purpose of choosing exponential decay is to make Φ^+(λ_{d,u_i}) or Φ^-(λ_{d,u_{j,j≠i}}) converge quickly, so that as the amount of sample data increases, their impact on the data becomes minimal (or as small as possible) and eventually reaches a negligible level.
- Ignorability. Since M-TUA is a constraint imposed on the task of making causal inferences in the POF, ignorability (i.e., (O_{t,u_i}, O_{c,u_i}) ⊥ d) still needs to hold.

In addition, we argue that estimating the variance of the data is still necessary (e.g., Example 9), because the larger the population and the smaller the variance, the closer the estimated ACE is to the true ACE, regardless of which specific units are randomly assigned to treatment. Since the TUA cannot be tested and verified on the observed data, many models are limited in use (e.g., the model of (7)) [36]. Therefore, it is necessary to obtain a milder and interpretable assumption. In general, M-TUA offers several advantages in terms of interpretability, as follows:

- Based on the idea of the DVE, we establish the relationship between TUA and the RCE and try to provide reasonable explanations for λ_{d,u}. 
- Through vectorization operations, we endow λ_{d,u} with the ability to describe positive and negative effects on the data, and we theoretically prove the rationality of M-TUA in the large-dataset setting.
- M-TUA not only weakens the strength of the original TUA assumption but also provides a geometric description of the TUA.
- In particular, M-TUA has an explicit mathematical expression that represents the meaning of the original TUA assumption at an abstract level through a set of interpretable parameters.

In this paper, we first use an example to illustrate the underlying problems of using the functional model to estimate the probability solution of counterfactual queries. We analyze the inference mechanism of the functional model and point out that ambiguous conclusions arise when the unique output probability solution is 0 under the functional model. In other words, when the probability solution obtained by the functional model is 0, it does not mean that the estimated event cannot occur. Secondly, for the TUA assumption commonly used in counterfactual models, we provide an equivalent description of the TUA in low-dimensional space. We weaken the TUA assumption by vectorizing the original TUA and finally obtain a milder TUA assumption, i.e., M-TUA. In addition, we give theoretical proofs and an exhaustive analysis of the rationality and limitations of M-TUA. As pointed out earlier, in M-TUA the constraints on the units are related to the dataset and the RCE, instead of being mandatory constraints on each unit. We argue this is very necessary, especially in the case of big data. A mild-version assumption (not just M-TUA) can be viewed as an abstraction from the micro world to the macro world [46]. An intuitive example: if we want to measure the water temperature of a swimming pool, it is impossible to measure every drop of water in the pool. However, we do not claim that the conclusion of this paper is the final form of the M-TUA. 
Therefore, we will focus on the following points in our future work.

Practicality Causal science has shown vigorous vitality in the fields of AI and public health [47]. However, a large number of tasks can only be carried out under the premise of satisfying strong assumptions, and the use of such assumptions is often not differentiated according to the task. Therefore, whether versions of these assumptions (including M-TUA) tailored to different AI task scenarios can be further developed is a topic worthy of further consideration.

Challenges posed by high-dimensional data As a theoretical exploration of weakening TUA, M-TUA presents the equivalent form of TUA in vector space through vectorization and gives it a certain degree of interpretability. However, with the explosion of data, AI practitioners are confronted with data that are very large in both volume and dimensionality. Although our theorem shows that M-TUA is applicable in the case of big data, high-dimensional data bring new challenges. Therefore, how to develop assumptions based on M-TUA that have theoretical guarantees and are applicable to high-dimensional data is also a focus of our future work.

References
- Counterfactual thinking about one's birth enhances well-being judgments
- Counterfactuals and causal inference
- Probabilistic evaluation of counterfactual queries
- Probabilities of conditionals and conditional probabilities
- Disentangling policy effects using proxy data: Which shutdown policies affected unemployment during the COVID-19 pandemic?
- Causal inference and Bayesian network structure learning from nominal data
- SISSOS: Intervention of tabular data and its applications
- The book of why: The new science of cause and effect
- What if? Counterfactual (hi)stories of international law
- Counterfactual analysis in macroeconometrics: An empirical investigation into the effects of quantitative easing
- Constructing effective personalized policies using counterfactual inference from biased data sets with many features
- Interpreting medical image classifiers by optimization based counterfactual impact analysis
- Causality matters in medical imaging
- Causal discovery on high dimensional data
- Backpropagation-based decoding for unsupervised counterfactual and abductive reasoning
- Counterfactual clinical prediction models could help to infer individualised treatment effects in randomised controlled trials: An illustration with the International Stroke Trial
- Counterfactual VQA: A cause-effect look at language bias
- Counterfactual vision and language learning
- Robust counterfactual explanations on graph neural networks
- Towards multi-modal causability with graph neural networks enabling information fusion for explainable AI
- Counterfactual explanations without opening the black box: Automated decisions and the GDPR
- Generating counterfactual explanations with natural language
- Actionable recourse in linear classification
- The hidden assumptions behind counterfactual explanations and principal reasons
- Theoretical impediments to machine learning with seven sparks from the causal revolution
- Telling cause from effect by local and global regression
- Specifying and computing causes for query answers in databases via database repairs and repair-programs
- Data, measurement, and causal inferences in machine learning: Opportunities and challenges for marketing
- Leveraging structured biological knowledge for counterfactual inference: A case study of viral pathogenesis
- Using causal machine learning for predicting the risk of flight delays in air transportation
- Data augmentation using pre-trained transformer models
- Conditional BERT contextual augmentation
- Counterfactual story reasoning and generation
- Counterfactual inference for text classification debiasing
- Causal inference without counterfactuals
- Statistics and causal inference
- Randomization analysis of experimental data: The Fisher randomization test comment
- Causal inference in statistics: A primer
- Dynamics of structures
- Estimating causal effects of treatments in randomized and nonrandomized studies
- Bayesian inference for causal effects in randomized experiments with noncompliance
- Building bridges between structural and program evaluation approaches to evaluating policy
- Mechanical vibrations: Theory and application to structural dynamics
- Approximate causal abstractions
- Convolutional neural networks and temporal CNNs for COVID-19 forecasting in France

Acknowledgements This work was supported by the National Key R&D Program of China under Grant 2018YFB1403200.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.