key: cord-0659123-vs991oku authors: Marmarelis, Myrl G.; Steeg, Greg Ver; Jahanshad, Neda; Galstyan, Aram title: Bounding the Effects of Continuous Treatments for Hidden Confounders date: 2022-04-24 journal: nan DOI: nan sha: 1d1192a3b0582eddba26b006f23a884b1ae6eb54 doc_id: 659123 cord_uid: vs991oku Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders. One aspect of machine learning is to build a good predictor. Another piece of the puzzle, crucial to any scientific endeavor, is to identify the causal drivers behind the predicted phenomenon. This entails accounting for as many covariates as possible, and then either physically (with a Randomized Controlled Trial, or RCT) or virtually matching individuals along their covariates between treatment cohorts, before drawing conclusions. Sometimes an RCT is prohibitive due to cost or ethics, e.g. forcing someone to smoke. In those cases, one must rely on observational studies and correct for confounding post hoc. For instance, one could attempt to learn the propensity of the treatment assignments as a function of different covariates and reweigh the sample to remove any uncovered biases. and an unobserved confounder that distorts the observations. The truly causal population-wide response curve flips to face upward. Blue dots mark possible observations with a non-uniform propensity of observed treatments. Refer to §D for details. To motivate our effort, we highlight the pitfalls of a failure to accommodate possible confounding. The widely noted J-shaped curves are common in physiology [Calabrese and Baldwin, 2001] . The characteristic dynamic occurs wherever a substance in moderate concentration exhibits vastly different effects than either extremes of high or low doses. A simulated example in Figure 1 demonstrates how a Jshaped treatment effect can appear flipped in observational data due to confounding. The phenomenon is an example of Simpson's paradox [Simpson, 1951 , Yule, 1903 . Attention on continuous treatments (or exposures, interventions) has grown in the fields of econometrics (e.g. Huang et al. [2021] , Tübbicke [2022] ), the health sciences (Vegetabile et al. [2021] ), and machine learning (Ghassami et al. [2021] , Colangelo and Lee [2021] , Kallus and Santacatterina [2019] ). 
Attempts to quantify a partially identified set of potential outcomes on inadequate covariates have so far focused on binary treatments, the most common setting [e.g. Rosenbaum and Rubin, 1983, Louizos et al., 2017, Lim et al., 2021]. A number of creative approaches were exhibited in the past few years to make strides in this binary setting. Most of them relied on a sensitivity model for bounding the extent of possible unobserved confounding, to which downstream tasks may be adapted by noting which treatment effects degrade the quickest. For binary treatments, the so-called Marginal Sensitivity Model (MSM) due to Tan [2006] continues to be studied extensively [Zhao et al., 2019, Veitch and Zaveri, 2020, Yin et al., 2021]. Variations thereof include Rosenbaum's earlier sensitivity model [2002], which enjoys ties to regression coefficients [Yadlowsky et al., 2020]. Other groups have borrowed strategies from deep learning [Wu and Fukumizu, 2022] rather than opting for the MSM. Another active line of work constructs bounds not from ignorance about confounding but through the lens of robustness [Guo et al., 2022, Makar et al., 2020, Johansson et al., 2020]. The MSM is highly interpretable with its single free parameter, and applicable to a wide swath of models. Many other approaches to handling unobserved confounders require a more complex data-generating process than that of observed outcome/treatment/covariates (Y, T, X). For instance, proximal causal learning [Tchetgen et al., 2020, Mastouri et al., 2021] loosens confounding requirements with additional proxy variables. Chen et al. [2022] rely on multiple large dataset partitions.

The MSM is incompatible with continuous treatments. Our first contribution is to propose a unique sensitivity model (§1.4) that extends the MSM to a treatment continuum. Next, we derive general formulas (§2) and specialize them to versatile closed forms (§2.2), culminating in Theorem 1. We devise an efficient algorithm (§3) to compute ignorance bounds over dose-response curves, following up with experiments on simulated (§4) and real (§5) datasets.

Causal inference is often cast in the nomenclature of potential outcomes, due to Rubin [1974]. The broad goal is to measure a treatment's effect on an individual, marked by a set of covariates, while accounting for all the confounding between the covariates and the treatment variable. Effects could manifest heterogeneously across individuals. In observational (i.e. not fully randomized) settings, observed covariates may not entirely overlap across treatment regimens.

[Figure 2. A simplified setting with three potential outcomes Y_1, Y_2, Y_3 arising from treatment assignments T = 1, 2, 3. Solid arrows represent the ground-truth situation. The red arrow shows what could happen when confounder Z is hidden and covariate X is inadequate to block Z's influence on T, leading Z to "leak" through T into Y_{1,2,3}. The main integrand in Equation 1, p(y_t | τ, x), diverges from p(y_t | x) due to this leakage.]

It is typical to estimate two models: (1) an outcome predictor, and (2) a model for the propensity of treatment conditioned on the covariates. The latter may help account for biases.
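As a minimal illustration of item (2), the nominal propensity model: in the experiments of §F, the propensity p(τ | x) over the scaled treatment is parametrized as a Beta distribution by small neural networks. The sketch below fits the same kind of Beta propensity with a simpler linear-softplus parametrization instead; all helper names are our own, and this is an illustrative stand-in rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def softplus(u):
    return np.logaddexp(0.0, u)  # smooth positivity constraint for the Beta parameters

def beta_nll(params, X, T, eps=1e-6):
    """Negative log-likelihood of T | X under Beta(alpha(x), beta(x))."""
    d = X.shape[1]
    wa, ba = params[:d], params[d]
    wb, bb = params[d + 1:2 * d + 1], params[2 * d + 1]
    alpha = softplus(X @ wa + ba) + eps
    beta = softplus(X @ wb + bb) + eps
    Tc = np.clip(T, eps, 1.0 - eps)  # treatments live on the open unit interval
    logpdf = ((alpha - 1.0) * np.log(Tc) + (beta - 1.0) * np.log1p(-Tc)
              - betaln(alpha, beta))
    return -np.mean(logpdf)

def fit_propensity(X, T):
    """Maximum-likelihood fit of the Beta propensity parameters."""
    init = np.zeros(2 * X.shape[1] + 2)
    result = minimize(beta_nll, init, args=(X, T), method="L-BFGS-B")
    return result.x

# toy usage: one covariate driving treatment assignment
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 1))
T = rng.beta(1.0 + 4.0 * X[:, 0], 5.0 - 4.0 * X[:, 0])
params = fit_propensity(X, T)
```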
The first two assumptions involved in Rubin's framework are that observations of outcome, assigned treatment, and covariates {Y^(i), T^(i), X^(i)} are i.i.d. draws from the same population, and that all treatments have a chance to occur for each covariate vector: p_{T|X}(t | x) > 0 (overlap/positivity) for all (t, x) ∈ [0, 1] × X, specifically in our context of continuous treatments. The third and most challenging of these fundamental assumptions is that of ignorability, or sufficiency. Our study is concerned with the scenarios where that assumption is violated: when there exists a dependency, not blocked by the covariates, between the assigned treatment and the true potential outcomes.

Let p(y_t | x) denote the probability density function of potential outcome Y_t = y_t from a treatment t ∈ [0, 1], given covariates X = x. Formally, a violation of ignorability means that p(y_t | τ, x) ≠ p(y_t | x) for some τ, t, x. Marginalizing over the treatment assignment,

    p(y_t | x) = ∫₀¹ p(y_t | τ, x) p(τ | x) dτ,    (1)

where p(y_t | τ, x) is the distribution of potential outcomes conditioned on an actual treatment assignment T = τ ∈ [0, 1] that may differ from the potential outcome's index t. It is only realistic to observe samples of Y | T = t, X = x, with density p(y_t | t, x); to express possible hidden confounding, we also require p(y_t | τ, x) for τ ≠ t. Throughout this study, y_t will indicate the value of the potential outcome at treatment t; to disambiguate from it, τ will be used for events where the assigned treatment is T = τ. For instance, we may care about the counterfactual of a smoker's (τ = 1) health outcome had they not smoked (y_{t=0}), where T = 0 signifies no smoking and T = 1 is "full" smoking. We aim to develop some intuition before introducing the novelties.

On notation. We will use the shorthand p(· · ·) with lowercase variables whenever working with probability densities of the corresponding stochastic variables in uppercase. In other words, p(y_t | τ, x) stands for p_{Y_t | T, X}(y_t | τ, x).

Interpretation. How would one interpret p(y_t | τ, x)? The potential-outcomes vector (Y_t)_{t∈[0,1]}, of infinite dimensionality, is intrinsic to each individual with true confounder Z, for which X is a noisy proxy. By "true" confounder we refer to any set of variables that suffice to block all backdoor paths between Y_t and T. The potential-outcomes vector would only change with knowledge of the assigned treatment T = τ if that assignment betrayed additional information about Z, absent in X, that further informed any Y_t. We may express p(y_t | τ, x) explicitly in terms of hypothetical true confounders as ∫ p(y_t | z) p(z | τ, x) dz, because z subsumes both x and τ. This way, p(y_t | z) is the true potential outcome and p(z | τ, x) acts as a filter for how parts of the true confounder mix together into the proxy x and the assigned treatment τ.

Propensities. The probability density p(τ | x) is termed the nominal propensity. A quantity often examined is the complete propensity, referring specifically to p(τ | y_t, x) in our realm. The complete propensity can differ from p(τ | x) because of hidden confounders: in that instance, conditioning on potential outcome y_t modulates the distribution. Similarly, by connection through Bayes' rule, conditioning the potential outcomes p(y_t | x) on assigned treatment τ modulates those distributions. Absent any unobserved confounding, p(y_t | τ, x) = p(y_t | x) and Equation 1 trivializes. See Figure 2 for a graphical illustration of the influence of τ.

Definition 1 (The Marginal Sensitivity Model). For binary treatment t ∈ {0, 1} and violation factor Γ ≥ 1, the odds ratio between the complete and nominal propensities is bounded:

    Γ⁻¹ ≤ [ p(t | y_t, x) / (1 − p(t | y_t, x)) ] / [ p(t | x) / (1 − p(t | x)) ] ≤ Γ.

Restricting ourselves to binary treatments affords a number of conveniences.
For instance, a single probability value is sufficient to describe the whole propensity landscape under a set of conditions: p(1 − t | · · ·) = 1 − p(t | · · ·). As we transfer to the separate context of t ∈ [0, 1], we must contend with infinitely many treatments and infinitely many potential outcomes. We require a constraint on the fundamentally unknowable quantity p(τ | y_t, x) for any treatment T = τ ∈ [0, 1] and potential outcome y_t, with possibly contrary treatment assignments τ ≠ t. As with the MSM, our target is to associate p(τ | y_t, x) to the knowable p(τ | x). In other words, we seek to constrain the knowledge conferred on the propensity by a single potential outcome y_t. It is not necessary for the functions pertaining to (y_t)_{t∈[0,1]} to exhibit any degree of smoothness in t; the potential-outcome variables are treated as entries in an infinitely long vector. However, we do impose that the propensity densities p(τ | . . .) are at least once differentiable in τ.

What sort of analogue exists for the notion of "odds" in the MSM? Contrast treatment τ versus τ + δ locally, for some infinitesimal δ, at any part of the curve. A translation of the MSM might appear as a bound on the ratio

    [ p(τ + δ | y_t, x) / p(τ | y_t, x) ]  against  [ p(τ + δ | x) / p(τ | x) ].

Let us peer into one of those ratios. In logarithms,

    log p(τ + δ | ·, x) − log p(τ | ·, x) ≈ δ ∂_τ log p(τ | ·, x).

Hence, we introduce the notion of an infinitesimal MSM (δMSM), tying ∂_τ log p(τ | y_t, x) to ∂_τ log p(τ | x).

Definition 2 (The Infinitesimal Marginal Sensitivity Model, δMSM). For continuous t ∈ [0, 1] and violation factor Γ ≥ 1, the following inequality holds everywhere:

    | ∂_τ log p(τ | y_t, x) − ∂_τ log p(τ | x) | ≤ log Γ.

We crafted the δMSM with the intention of functionally mirroring the MSM, locally, on a treatment continuum. Whereas Definition 2 is stated in logarithms, Definition 1 is not; the difference is merely cosmetic, and the hyperparameter Γ plays an equivalent role in both structures. Nevertheless, the emergent properties are vastly different.

We list the core assumptions surrounding our problem.

Assumption 1 (Bounded Hidden Confounding). Invoking Definition 2, the violation of ignorability is constrained by a δMSM with some Γ ≥ 1.

Assumption 2 (Observed Confounding at No Treatment). The utter lack of treatment is not informed by potential outcomes: p(τ = 0 | y_t, x) = p(τ = 0 | x) for all t and y_t.

Assumption 2 is necessary for our derivations, yet we also argue that it is commonly well founded. We frame the special cohort at T = 0 as a control group; in a practical sense, we expect a dramatically lessened vulnerability to hidden confounders for a control group that is well represented both in observed attributes and in unobserved ones. There is no additional constraint, besides the δMSM itself, on how much the complete propensity function may fluctuate around any T > 0, no matter how close to zero. We motivate this assumption in the empirical case study of §5 as well.
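To make Definition 2 operational, one can compare a candidate complete propensity against the nominal one on a grid: the δMSM requires the gap between their log-density slopes to stay within ±log Γ. The densities below are illustrative stand-ins (a Beta nominal propensity and an exponentially tilted version of it), not the paper's models, and the helper names are ours.

```python
import numpy as np
from scipy.stats import beta
from scipy.special import hyp1f1

def delta_msm_violation(complete_pdf, nominal_pdf, grid):
    """Largest pointwise gap between the two log-density slopes on the grid.
    The delta-MSM with factor Gamma requires this to be at most log(Gamma)."""
    slope_complete = np.gradient(np.log(complete_pdf(grid)), grid)
    slope_nominal = np.gradient(np.log(nominal_pdf(grid)), grid)
    return np.max(np.abs(slope_complete - slope_nominal))

gamma_factor = 2.0
grid = np.linspace(0.01, 0.99, 500)

nominal = lambda tau: beta.pdf(tau, 2.0, 3.0)

def complete(tau):
    # Tilt the nominal density by Gamma**tau; the normalizer is the Beta
    # moment-generating function E[Gamma**tau] = 1F1(2; 2+3; log Gamma).
    norm = hyp1f1(2.0, 5.0, np.log(gamma_factor))
    return beta.pdf(tau, 2.0, 3.0) * gamma_factor ** tau / norm

worst = delta_msm_violation(complete, nominal, grid)
print(worst, "<=", np.log(gamma_factor))  # the tilt saturates the bound, up to grid error
```

Note that this tilted density only illustrates the slope constraint; a genuine complete propensity would also have to respect Assumption 2 at τ = 0.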
Next, we proceed with derivations. The key to cracking open Equation 1 is to approximate p(y_t | τ, x) around τ = t, where p(y_t | t, x) = p(y | t, x) can be learned with any predictive model from the data on hand. However, even ∂_τ p(y_t | τ, x)|_{τ=t} is intractable. Suppose that p(y_t | τ, x) is twice differentiable almost everywhere in τ. We construct a Taylor expansion

    p(y_t | τ, x) ≈ Σ_{k=0}^{K} (1/k!) (τ − t)^k ∂_τ^k p(y_t | τ, x)|_{τ=t},   K ∈ {1, 2}.    (2)

Denote with p̃(y_t | τ, x) an approximation of first or second order as laid out above. We choose to quantify the reliability of this approximation by a set of weights 0 ≤ w_t(τ) ≤ 1, where typically (but not necessarily) w_t(t) = 1. Splitting Equation 1 accordingly,

    p(y_t | x) = ∫₀¹ w_t(τ) p(y_t | τ, x) p(τ | x) dτ  [part A]  +  ∫₀¹ (1 − w_t(τ)) p(y_t | τ, x) p(τ | x) dτ  [part B],    (3)

with the Taylor approximation p̃(y_t | τ, x) destined for part A. This separation into recoverable (A) and entirely unknown (B), demarcated by the weights, ensures that the inaccurate regimes of the approximation vanish (as w_t(τ) → 0 away from t) and are replaced with the ignorant quantity. We simplify part B of Equation 3 first. By Bayes' rule, p(y_t | τ, x) p(τ | x) = p(τ | y_t, x) p(y_t | x), so

    B = p(y_t | x) [ 1 − ∫₀¹ w_t(τ) p(τ | y_t, x) dτ ].

We witness already that p(y_t | x) shall take the form of a ratio; substituting the Taylor approximation into part A defines the approximator

    p̃(y_t | x) ≜ [ ∫₀¹ w_t(τ) p̃(y_t | τ, x) p(τ | x) dτ ] / [ ∫₀¹ w_t(τ) p(τ | y_t, x) dτ ].    (4)

To proceed further demands reflecting on Assumptions 1 & 2. Without loss of generality, consider the discrepancy

    γ(τ | y_t, x) ≜ ∂_τ log p(τ | y_t, x) − ∂_τ log p(τ | x),  with |γ(τ | y_t, x)| ≤ log Γ by the δMSM.

We may integrate both sides from 0 to τ;

    λ(τ | y_t, x) ≜ ∫₀^τ γ(s | y_t, x) ds = log [ p(τ | y_t, x) / p(τ | x) ],    (5)

where the second equality uses Assumption 2 to cancel the boundary terms at τ = 0. Clearly, |λ(τ | y_t, x)| ≤ τ log Γ, and subsequently

    Λ(τ | y_t, x) ≜ exp λ(τ | y_t, x) = p(τ | y_t, x) / p(τ | x)    (6)

is bounded by Γ^{±τ}. We are now equipped with the requisite tools to properly bound p(y_t | x), or an approximation thereof, erring on ignorance via the reliability weights w_t(τ). The full derivation may be found in §A.

Predicting potential outcomes. Of particular note is our recovery of a fully normalized probability density p̃(y_t | x) that may be approximated with Monte Carlo or solved in closed form with specific formulations for the weights and propensity. Generally, it takes on the form

    p̃(y_t | x) = p(y | t, x) / d(t | y_t, x),    (7)

where the ignorance denominator d(t | y_t, x) collects the expectations E_τ[Λ(τ | y_t, x)], E_τ[τ − t], and E_τ[(τ − t)²], and where said expectations, E_τ[·], are with respect to the implicit distribution q(τ | t, x) ∝ w_t(τ) p(τ | x). To make use of this formula, one first procures the set of admissible d(t | y_t, x) ∈ [d̲(t | y_t, x), d̄(t | y_t, x)] that violate ignorability up to a factor Γ according to the δMSM. Then, considering their reciprocals as importance weights [Tokdar and Kass, 2010], tight bounds on the partially identified expectations over p̃(y_t | x) may be optimized.

Bootstrapping over ensembles. One could exploit the parcelization of p̃(y_t | x) into a predictor p(y | t, x) in the numerator and a nominal propensity model p(τ | x) entering the denominator. Should models be learned in ensembles, as in our experiments, it would be appropriate to quantify empirical uncertainties [Jesson et al., 2020] alongside robustness to hidden confounding. One way to achieve that is by bootstrap resampling [Lo, 1987] the ensembles and marginalizing over them in each part of the fraction.

In addition to developing the general framework above, we derive analytical forms for a specific parametrization of the weighting function and propensity distribution. Here, we look to the Beta function and its associated probability density for viable solutions. Suppose that

    p(τ | x) = Beta(τ; α(x), β(x)),    (8)
    w_t(τ) = c_t τ^{a_t − 1} (1 − τ)^{b_t − 1}.    (9)

The reliability weights, which we have complete freedom to design, take on the form of an unnormalized Beta density in this case. Say that w_t(τ) should peak at τ = t, with w_t(t) = 1. We set the mode to t,

    (a_t − 1) / (a_t + b_t − 2) = t,

and eliminate the remaining degree of freedom by imposing a precision constraint a_t + b_t − 2 = r for some r > 0, so that a_t = rt + 1 and b_t = r(1 − t) + 1. (Constraining a more complex dispersion statistic like the variance would prove much more difficult.) Immediately we can solve for c_t = t^{−rt} (1 − t)^{−r(1−t)}, even though it will turn out irrelevant for our purposes. Here, the expectations found in Equation 7 are available in closed form, and can be bounded in terms of the predictive models and just two more free parameters, Γ and r. Guidance on setting the violation factor Γ is discussed elsewhere, e.g. §5; as for the class of weights, high r conveys poor trust in the Taylor-series approximation of Equation 2, and bears the potential to widen ignorance bounds.

[Figure: weighting schemes for r = 4, r = 16, r = 64.]

The findings. We pose a third and final assumption, which enables us to state Theorem 1. The main insight to unlocking those expectations is that each one of them involves an integral with the product w_t(τ) p(τ | x), which yields an unnormalized Beta distribution; we therefore use the moments of that Beta distribution.
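Concretely, with the Beta propensity of Equation 8 and the weights of Equation 9, the product w_t(τ) p(τ | x) is proportional to a Beta(α + rt, β + r(1 − t)) density, so the pseudo-moments of q(τ | t, x) follow from standard Beta moments, and bracketing E_τ[Λ] by E_τ[Γ^{±τ}] amounts to evaluating a Beta moment-generating function, i.e. Kummer's ₁F₁. The sketch below illustrates these pieces (our own helper names, not the paper's code); it is not the full closed form of Theorem 1.

```python
import numpy as np
from scipy.special import hyp1f1

def q_parameters(t, r, alpha, beta):
    """Beta parameters of q(tau | t, x) ∝ w_t(tau) p(tau | x),
    assuming p(tau | x) = Beta(alpha, beta) and the Equation-9 weights."""
    return alpha + r * t, beta + r * (1.0 - t)

def pseudo_moments(t, r, alpha, beta):
    """E_tau[tau - t] and E_tau[(tau - t)^2] under q, via Beta moments."""
    a, b = q_parameters(t, r, alpha, beta)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean - t, var + (mean - t) ** 2

def lambda_term_bounds(t, r, alpha, beta, gamma_factor):
    """Bracket E_tau[Lambda(tau | y_t, x)] using Gamma^{-tau} <= Lambda <= Gamma^{tau};
    E_tau[Gamma^{s tau}] is the Beta MGF, i.e. Kummer's 1F1(a; a + b; s log Gamma)."""
    a, b = q_parameters(t, r, alpha, beta)
    lo = hyp1f1(a, a + b, -np.log(gamma_factor))
    hi = hyp1f1(a, a + b, np.log(gamma_factor))
    return lo, hi

# example: a moderately informative propensity, moderate trust in the expansion
print(pseudo_moments(t=0.3, r=32.0, alpha=2.0, beta=5.0))
print(lambda_term_bounds(t=0.3, r=32.0, alpha=2.0, beta=5.0, gamma_factor=1.5))
```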
Assumption 3 (Second-order Simplification). The quantity γ̇(τ | y_t, x) ≜ ∂_τ γ(τ | y_t, x) cannot be characterized as-is. We grant that γ² dominates it, so that the second-order term of the expansion may be bounded in terms of γ² alone.

Theorem 1 (Beta Parametrizations). The formulations in Equations 8 & 9 admit analytical solutions to the ignorance denominator in Equation 7; the closed forms involve the operator E_τ as employed in Equation 7 and Kummer's confluent hypergeometric function ₁F₁ [Mathews Jr. et al., 2021], with α, β implicitly referring to α(x), β(x).

To take these findings one step further, we compare any weighting function w_t(τ) to a baseline, or phantom, Beta scheme with relatively low r.

Proposition 2 (Absolute Accuracy). Suppose that

1. ∂²_τ p(y_t | τ, x) is Q-Lipschitz in τ, and that
2. the compound propensity-weights w_t(τ) p(τ | x) and w_t(τ) p(τ | y_t, x) are bounded on both sides by a phantom Beta scheme w′_t(τ), of the form in Equation 9 with narrowness r′, up to a margin of c′ ≥ 1; more specifically,

    (1/c′) w′_t(τ) ≤ w_t(τ) p(τ | x) ≤ c′ w′_t(τ)   and   (1/c′) w′_t(τ) ≤ w_t(τ) p(τ | y_t, x) ≤ c′ w′_t(τ).

Then the outcome approximator p̃(y_t | x) of second order (Equation 7) has an absolute error stemming from the Taylor remainder on p(y_t | τ, x), stated in terms of these constants (the explicit bound is derived in §B). The unitless k(r′), pertaining to the numerator in Equation 4, was solved numerically for certain values of r′:

    r′                 0.5     1       2       4
    (r′ + 1) k(r′)     0.03    0.02    0.01    0.005

Pertinent to this result is the tendency for the compound propensity-weights w_t(τ) p(τ | x) to be narrower than the weights w_t(τ) themselves. This holds especially in the Beta family of propensities, per the discussion surrounding Equation 8. See §B for more details on the proposition.

We must characterize E[f(Y_t) | X = x] for any task-specific f(y). This is accomplished with a Monte Carlo importance sampler of n realizations y_i drawn from a proposal q(y):

    Ẽ[f(Y_t) | X = x] = [ Σ_{i=1}^n f(y_i) p̃(y_t = y_i | x) / q(y_i) ] / [ Σ_{i=1}^n p̃(y_t = y_i | x) / q(y_i) ].    (10)

Even though p̃(y_t | x) is a normalized probability density, it contains partially identified quantities. It is untenable to constrain a search along the assignments for d(t | y_t = y_i, x) to even approximately ensure ∫_Y p̃(y_t = y | x) dy = 1. For this reason, the bias of an estimator without the corrective denominator of Equation 10 would be uncontrollable [Tokdar and Kass, 2010]. A greedy algorithm may be deployed to maximize Ẽ[f(Y_t) | X = x] in this form by optimizing each importance weight within its admissible interval; the minimum may be achieved by a trivial extension.

[Algorithm 1. Input: values ordered by ascending f_i, with admissible weight intervals [w̲_i, w̄_i]. Output: max_w E[f(X)] estimated by importance sampling with n draws. Initialize w_i ← w̄_i for all i = 1, 2, . . . , n; for j = 1, 2, . . . , n do ...]
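A runnable sketch of the greedy routine summarized in Algorithm 1: it maximizes a self-normalized importance-sampling estimate when each importance weight is only known up to an interval, by sorting the function values and lowering weights from their upper to their lower bound while the pseudo-derivative Δ_j stays negative. The interface is ours; the paper's released implementation is in Julia (TreatmentCurves.jl).

```python
import numpy as np

def maximize_snis(f_values, w_lower, w_upper):
    """Greedily maximize sum(w * f) / sum(w) over w_i in [w_lower_i, w_upper_i].

    Mirrors Algorithm 1: sort by ascending f, start every weight at its upper
    bound, and lower the weight of the smallest remaining f while doing so
    still increases the estimate (i.e. while the pseudo-derivative is negative).
    """
    f_values = np.asarray(f_values, dtype=float)
    order = np.argsort(f_values)
    f = f_values[order]
    lo = np.asarray(w_lower, dtype=float)[order]
    hi = np.asarray(w_upper, dtype=float)[order]

    w = hi.copy()
    for j in range(len(f)):
        # Delta_j = sum_i w_i (f_j - f_i): the sign of d/dw_j of the estimate.
        if np.sum(w * (f[j] - f)) < 0.0:
            w[j] = lo[j]          # shrinking w_j raises the weighted average
        else:
            break                 # all later Delta_j are nonnegative; terminate
    return np.sum(w * f) / np.sum(w)

def minimize_snis(f_values, w_lower, w_upper):
    # the trivial extension mentioned in the text: negate f and maximize
    return -maximize_snis(-np.asarray(f_values, dtype=float), w_lower, w_upper)

# toy usage: importance weights only known up to a factor-of-two interval
rng = np.random.default_rng(2)
f = rng.normal(size=100)
w_nominal = rng.uniform(0.5, 1.5, size=100)
upper_bound = maximize_snis(f, w_nominal / 2.0, w_nominal * 2.0)
lower_bound = minimize_snis(f, w_nominal / 2.0, w_nominal * 2.0)
```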
We first studied an idealized scenario with one hidden confounder and no covariate shift. In our design we sought simplicity, while still inducing a complicated enough dose-response function in (T, X) and a true propensity T | Z that deliberately obeys Assumption 2. We constructed a piecewise uniform distribution for T | Z with variable categorical probabilities assigned to each "bucket," always residing in the probability simplex. The coefficients (see §E) were concocted prior to experimentation. For covariates, true confounder z_1 passed through entirely and z_2 remained hidden. Crucially, we placed an intentional relationship between the single covariate X ∈ [0, 1] and the amount of unobserved confounding. The model should suffer more close to X = 0 compared to X = 1. This challenged our δMSM to grow the ignorance interval more for lower X, if our sensitivity analysis were to be informed by the learned propensity. We pinpoint this differential growth in Figure 4. In reality, an unconfounded ground truth, like the black line in Figure 4, would stay unobserved. This instance, however, lends us special tools for model evaluation.

Uncertainty of an ignorance interval [y̲, ȳ] with respect to a true Bernoulli(y⋆) outcome was measured in an information-theoretic way. We calculated for every (t, x) the expression

    ∫_{y̲}^{ȳ} KL_B(y⋆ ∥ y) dy,

where KL_B is shorthand for the Kullback-Leibler divergence between two Bernoulli variables with the given probabilities, the true y⋆ and the inferred y. This way, we grounded the ignorance intervals on their average approximation cost.
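A direct transcription of that measure, assuming the integral form reconstructed above (dividing by the interval width would turn it into a per-unit average, a choice we leave open); helper names are ours.

```python
import numpy as np
from scipy.integrate import quad

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def interval_cost(y_true, y_low, y_high, eps=1e-6):
    """Integrate KL(Bernoulli(y_true) || Bernoulli(y)) for y over the
    ignorance interval [y_low, y_high], as described in Section 4.1."""
    y_low, y_high = max(y_low, eps), min(y_high, 1.0 - eps)
    value, _ = quad(lambda y: kl_bernoulli(y_true, y), y_low, y_high)
    return value

# a wide interval around the truth costs more than a tight one containing it
print(interval_cost(0.4, 0.35, 0.45))
print(interval_cost(0.4, 0.10, 0.90))
```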
The most pertinent application for the framework laid out above is an observational study with incomplete or noisy covariates and a continuous treatment variable. More concretely, the treatment variable should be transformed and scaled into the unit interval such that T = 0 signifies a control with a complete lack of treatment. Every kind of individual should be about as likely to fall in the T = 0 cohort (Assumption 2). As for shaping the T > 0 regime, the domain should inform whether a linear scale is employed, versus an empirical or parametric (e.g. standard normal) cumulative distribution function.

[Figure 4. Left: examples of ignorance intervals at Γ := 1, then lightly shaded Γ := 2.125, then lighter Γ := 3.25, with r = 128, and 95% empirical uncertainty from the ensemble. The black line is E[Y_t | X], from integrating the ground truth over the actual (but hidden) p(z_2). Bernoulli expectations are plotted under a logit transformation because the significance of a perturbation is proportional to y(1 − y). Right: the selective growth of ignorance intervals across X, computed using the divergence measure described in §4.1.]

The data. We conducted multiple experiments based on observational studies, showcased in Table 1.

The estimators. In these experiments, the predictor and propensity models were trained as ensembles of artificial neural networks with two inner residual layers. We chose the "swish" activation function for its well-behaved gradients [Ramachandran et al., 2017]. The ADAM optimizer was employed for stochastic gradient descent with a learning rate of 10⁻³ and no mini-batching. Our analysis involved bootstrapping the ensemble to garner empirical uncertainties. Outcomes were modeled as in the last column of Table 1. Test-set log-likelihoods do not degrade much between the full and censored datasets. One cannot look to predictive performance to judge how well the fully unconfounded dose-response curves are captured, since any hidden confounding would persist. Precise ground-truth curves can either be contrived synthetically on the basis of real covariates, or garnered from a large-scale fully randomized study. A synthetic potential outcome would always carry the risk of presenting an unrealistic scenario [Curth et al., 2021].

The objective. We chose to investigate the coverage of E[Y_t] from ignorance intervals on censored covariates. The notion of coverage [McCandless et al., 2007] is vital for downstream decision-making. The larger dataset, Brain, afforded us the ability to approximate a pseudo-population X̃ by reweighting X such that X̃ ⊥⊥ T, in a manner akin to inverse propensity weighting [Reiffel, 2020]. This allowed us to scrutinize the less overlapping parts of the population. We took an unweighted expectation for Vitamin and Energy. Our test relies on approximating an unconfounded model by collecting a large set of covariates, and then learning another model on a heavily censored version. Our reasoning is that the censored model would suffer from a greater degree of hidden confounding. The censored model could then be assessed along all potential-outcome predictions, by pretending that the full model represented the real dose-response curves. A pertinent metric would be how much a sensitivity model with Γ ≥ 1 swallows the "real" dose responses, as a trade-off against the sheer area of the ignorance intervals. These competing quantities can be viewed as a form of recall and (the opposite of) precision, respectively. Denote by (y̲_c, ȳ_c) the partially identified bounds of the censored model and by (y̲, ȳ) the full model's 95% confidence interval for Ẽ[Y_t], from percentile bootstrapping. The partial identification was bootstrapped as well. For some Γ := s and a grid over t ∈ [0, 1] (of length 17 in our case), ignorance and recall are computed across the grid.

[Figure: coverage of the full model's dose responses, judged in terms of bounds overlap (vert.), versus the relative area of the partial identification (horiz.). The response grid was partitioned into equal halves. The ignorance window was selected to cover the range of recall.]

The comparisons. We compared our δMSM with r = 32 throughout (solid in the figures on this page) to other sensitivity models: namely, (dotted) an analogue developed independently and in parallel to the present work, with just one free parameter [Jesson et al., 2022]; (dashed) the product of shoehorning a continuous model into the binary MSM by triggering a binary treatment at T > 0.5 and discretizing the propensity at that threshold; and (dot-dash) a baseline sensitivity model that emerges from Γ-scaling the Algorithm 1 weights without any propensity.

The utility of our framework is evident in the results showcased above. We demonstrated that, in the presence of a solid ground truth (§4), the procedure discriminates between more and less confounded outcomes by selectively growing the ignorance bounds (Figure 4). With the real datasets (§5), we analyzed how the censored model recovered the full model's continuum of potential outcomes. The novel δMSM consistently outperformed the other contenders on the Brain dataset, in Figure 6. We noticed via Figure 5 that the δMSM's behavior is markedly different from the alternatives, with clear improvement in the latter half of the response for Vitamin and for small ignorances in Energy. These findings suggest higher coverage efficiency for the δMSM in some realistic scenarios.

Future work. Our scope solely considered the univariate distributions p(y_t | x) for each potential outcome. As such, it was only necessary to grapple with the complete propensity of first order, p(τ | y_t, x). In theory there could exist more exhaustive "complete" propensities, such as p(τ | y_{t1}, y_{t2}, x), that we term of second order. These higher-order propensities only become relevant alongside joint distributions, like p(y_{t1}, y_{t2} | x). It may be fruitful to study these.

We successfully bridged a hidden-confounding sensitivity model to continuous treatments. Our extension is parsimonious in that it imposes minimal additional requirements on the data and the predictor model. Not only did we derive the analytical forms needed to work with Beta-parametrized propensities, but we also provided the ingredients for a more general sampling-based approach. We carried these findings into a complete pipeline for quantifying ignorance intervals on dose-response curves. Namely, we bounded the approximation error in Proposition 2 and detailed an efficient optimization algorithm. Our observational experimentation validated the method's core assumptions.

A FULL DERIVATION

The weighted part A of Equation 3, with the Taylor approximation substituted, expands as

    ∫₀¹ w_t(τ) p̃(y_t | τ, x) p(τ | x) dτ = Σ_{k=0}^{2} (1/k!) g_k(y_t | t, x) ∫₀¹ w_t(τ) (τ − t)^k p(τ | x) dτ,    (11)

where g_k(y_t | t, x) ≜ ∂_τ^k p(y_t | τ, x)|_{τ=t}.
Lightening the notation with a shorthand for the weighted expectations,

    ⟨ · ⟩_τ ≜ ∫₀¹ w_t(τ) ( · ) p(τ | x) dτ,

it becomes apparent that we must grapple with the pseudo-moments ⟨1⟩_τ, ⟨τ − t⟩_τ, and ⟨(τ − t)²⟩_τ. Note that t should not be mistaken for a "mean" value. Furthermore, we have yet to fully characterize g_k(y_t | t, x). Observe that, by Equation 6,

    p(y_t | τ, x) = p(y_t | x) Λ(τ | y_t, x);

the factor p(y_t | x) will be moved to the other side of the equation as needed. Expanding,

    g_1(y_t | t, x) = p(y_t | x) ∂_τ Λ(τ | y_t, x)|_{τ=t} = p(y_t | x) Λ(t | y_t, x) γ(t | y_t, x).

Appropriate bounds will be calculated for g_2(y_t | t, x) next, utilizing the finding above as their main ingredient. Let γ̇(τ | y_t, x) ≜ ∂_τ γ(τ | y_t, x). The second derivative may be calculated in terms of the ignorance quantities γ, Λ:

    g_2(y_t | t, x) = p(y_t | x) Λ(t | y_t, x) [ γ̇(t | y_t, x) + γ(t | y_t, x)² ].

And finally we address p̃(y_t | x). Carrying the components of Equation 11 into Equation 3 and solving for the density yields the ignorance denominator of Equation 7,

    d(t | y_t, x) = E_τ[Λ(τ | y_t, x)] − Λ(t | y_t, x) { γ(t | y_t, x) E_τ[τ − t] + ½ [γ̇(t | y_t, x) + γ(t | y_t, x)²] E_τ[(τ − t)²] },

where these expectations E_τ[·] are with respect to the implicit distribution q(τ | t, x) ∝ w_t(τ) p(τ | x). The first term in the denominator, E_τ[Λ(τ | y_t, x)], may be approximately bounded by the same Algorithm 1.

B DETAILS ON PROPOSITION 2

The numerics were accomplished via Wolfram Mathematica 12. Our steps for calculating the key quantities are listed below. Firstly, since the second derivative employed in the Taylor expansion is Q-Lipschitz, we have |∂³_τ p(y_t | τ, x)| ≤ Q almost everywhere. Let us denote the Taylor remainder in this scope as ρ(y_t | τ, x). By Taylor's theorem,

    |ρ(y_t | τ, x)| ≤ (1/6) Q |τ − t|³.

We carry the remainder over into the formulation of p̃(y_t | x) as in Equation 4. Acknowledging that an exact approximation requires ρ(y_t | τ, x) = 0, we express the absolute error of our imperfect approximator in terms of the weighted remainder. We analytically solved the definite integral

    ∫₀¹ c′ w′_t(τ) (1/6) Q |τ − t|³ dτ ≥ ∫₀¹ c′ w′_t(τ) |ρ(y_t | τ, x)| dτ,

bounding the numerator; we omit the full expansion for brevity. Then, for the specific values of r′ displayed in the table of Proposition 2, we maximized the quantity numerically over the support 0 ≤ t ≤ 1 in order to obtain k(r′). Next, to reduce the denominator, we employed the opposite side of our bound. The resultant fraction is symmetric about t = 1/2. We posit, for it was too unwieldy to prove, that the expression is concave in t ∈ [0, 1] for all r′ > 0. We validated this claim by testing the non-positivity of the second derivative at relevant values of r′ across the domain in t. It follows that the fraction achieves its minimum at the boundaries t = 0, 1. In either of those cases it may be evaluated by means of the identity B(r′ + 1, 1) = (r′ + 1)⁻¹. Hence, we assert that the fraction is no less than [c′ (r′ + 1)]⁻¹. Regardless of the validity of the above conjecture, the quantities displayed in the table of Proposition 2 are correct because the relation surely holds at those points in r′.

C CORRECTNESS OF ALGORITHM 1

The algorithm functions by incrementally reallocating relative mass (in the weights) to the right-hand side, from a cursor beginning on the left-hand side of the "tape."

Proof. Firstly we characterize the indicator quantity ∆_j. Differentiating the quantity to be maximized with respect to w_j yields, up to some positive factor,

    ∆_j = Σ_i w_i (f_j − f_i).

Hence, ∆_j captures the sign of the derivative. We shall proceed by induction. Begin with the first iteration, j = 1. No weights have been altered since initialization yet. Therefore we have ∆_1 = Σ_i w̄_i (f_1 − f_i). Since f_1 ≤ f_i for all i due to the prior sorting, ∆_1 is either negative or zero. If zero, trivially terminate the procedure, as all function values are identical. Now assume that by the time the algorithm reaches some j > 1, all w_k = w̲_k for 1 ≤ k < j. Per the algorithm, we would flip the weight w_j ← w̲_j only if ∆_j < 0.
In that case,

    Σ_{i<j} w̲_i (f_j − f_i) < Σ_{i>j} w̄_i (f_i − f_j),

where both sides are non-negative. Notice that the above is not affected by the current value of w_j. This update can only increase the current estimate, because the derivative remains negative and the weight at j is non-increasing. We must verify that the derivatives for the previous weights, indexed at k < j, remain negative. Otherwise, the procedure would need to backtrack to possibly flip some weights back up. More generally, with every decision for a weight assignment, we seek to ensure that the condition detailed above is not violated for any weights that have been finalized. That includes the weights before j, and those after j at the point of termination. Returning from this digression: at k < j, after updating w_j, the pseudo-derivative ∆_k must be re-examined. To glean its sign, we refer to a quantity that we know: ∆_j is unaffected by the update and remains negative, and since f_k ≤ f_j, we have ∆_k = ∆_j + (f_k − f_j) Σ_i w_i ≤ ∆_j < 0. The remaining fact to be demonstrated is that upon termination, when ∆_j ≥ 0, no other pseudo-derivative ∆_{j′} with j′ > j is negative. This must be the case simply because f_{j′} ≥ f_j, so that ∆_{j′} = ∆_j + (f_{j′} − f_j) Σ_i w_i ≥ ∆_j ≥ 0. ∎

D DETAILS ON FIGURE 1

The setting in Figure 1 lacks covariates. Its conditional expected outcome E[Y | Z = z, T = t] is defined as 3(t − 1/4)² − 4z². To produce the dashed line, observations are fully confounded by the treatment regime p_{T|Z}(t | z) = δ(t − z). The amalgamation of individual curves leads to a population-level average pointing in the wrong direction. Here, the overlap assumption for inverse propensity weighting is violated. Still, "true" potential outcomes are revealed by forcing independence between T and Z. The solid line is the result of taking an expectation over Z ∼ Uniform[0, 1]. The blue dots represent observations from a more realistic, partially confounded scenario where (T | Z = z) ∼ Beta(96z + 1, 96(1 − z) + 1), i.e. treatments are centered at z with a narrow spread.

E SIMULATION SETUP

Setup. 1,500 training and 500 testing examples were generated. We trained ensembles of eight neural networks with two hidden layers of 16 units each, for the Bernoulli predictor and the Beta propensity. In constructing the idealized world, the treatment was drawn piecewise uniformly given the hidden confounder:

    T | Z = z  ∼  U(0, 0.2)    w.p. 0.1
                  U(0.2, 0.5)  w.p. 0.4 − 0.2z²
                  U(0.5, 0.7)  w.p. 0.3 + 0.3z²
                  U(0.7, 0.8)  w.p. 0.1 − 0.1z²
                  U(0.8, 1)    w.p. 0.1
F EXPERIMENTAL DETAILS

Brain dataset. Data from the UK Biobank were accessed under application 11559. From the brain Magnetic Resonance Imaging (MRI) data we extracted the 74 fields corresponding to parcelized cortical volumes on each of the left and right hemispheres [Miller et al., 2016]. We then summed each left/right pair to arrive at 74 positively valued outcomes. Six groups of semantically related fields composed the long covariate vector:

• 6 basic details: age, weight, sex, standing height, seated height, and month of birth.
• 3 reported activity measurements: weekly minutes spent walking, engaged in moderate activity, and in vigorous activity.
• 27 environmental variables surveying the pollution and greenery surrounding the person's life.
• 42 blood measurements, from cell counts to calcium concentration.
• 15 cardiac measurements, including ECG and PWA modalities.
• 8 welfare indices for English citizens assessing the following: deprivation, income, employment, health, education, housing, crime, and living environment.

Listwise deletion was employed to handle any missing values. To censor the covariates, the four largest sectors (italicized) encompassing various confounding variables were omitted, one at a time. The treatment variable was the walking field taken from the triad of activity measurements, scaled to the unit interval such that any recording of at least two hours per day was set to T = 0 and any lesser amount had T increase up to 1 according to an empirical CDF.

Brain estimators. Both the predictor and the propensity model, censored and uncensored, were trained in ensembles of 32 artificial neural networks with four inner residual layers of 32 activation units each. A dropout of 0.05 was imposed on these layers. Additionally, an L2-regularization with weight 10⁻³ was applied to the inner layers. All predictors were trained for 10,000 epochs and propensities for 5,000 epochs. The outcome predictor parametrized a Gamma distribution, and the propensity model parametrized a Beta distribution. See Figure 7.

Vitamin dataset. We studied individuals from a recently published record of COVID-19 mortality from Israel, with a focus on pre-infection vitamin D levels. The data acquired from Dror et al. [2022] included 253 individuals, with 38 deaths attributed to COVID-19. The full set of recorded covariates comprised gender (two categories), age (three levels), body-mass index (two levels), religion (five categories), and six binary-labeled comorbidities. Vitamin D levels, the treatment variable, were scaled to the unit interval such that "healthy" levels of at least 40 ng/mL were allocated to T = 0 and T increased to 1 linearly as vitamin levels tended to zero. The censored dataset omitted the religion and comorbidity categories.

Vitamin estimators. The predictor and propensity models each had a single inner layer of four activation units, owing to the smaller set of covariates. The dropout was increased to 0.1. These smaller models were trained in ensembles of 64.

Energy dataset. Hourly time series of electricity generation in Spain were gathered for the period from April 1st, 2021 to November 1st, 2021 [ENTSO-E]. The outcome was the wholesale price of electricity. The market price of natural gas spiked in Europe during that time frame, especially within the testing set, which was ordered temporally after the training set rather than randomly assigned. Hence, we observed substantial non-stationarity between the model estimates and the out-of-sample covariates used in the ignorance/recall evaluation. The treatment variable was the carbon price dictated by the EU Emissions Trading System (ETS), linearly scaled between 0 and 1.

Energy estimators. The predictor models, censored and uncensored, each had one inner layer of eight activation units. The propensity models had four inner units instead. The dropout was increased to 0.2 here, and the models were trained in ensembles of 16.

For code, please visit https://github.com/marmarelis/TreatmentCurves.jl.
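For concreteness, the two treatment scalings described above can be sketched as follows. Only the thresholds (two hours per day, 40 ng/mL), the empirical-CDF scaling, and the linear scaling come from the text; the weekly-minutes unit conversion and the handling of ties are our own assumptions, and the helper names are ours.

```python
import numpy as np

def scale_walking_minutes(minutes_per_week):
    """Brain study: recordings of at least two hours per day map to T = 0;
    lesser amounts increase T toward 1 according to an empirical CDF."""
    minutes = np.asarray(minutes_per_week, dtype=float)
    cap = 2.0 * 60.0 * 7.0                       # two hours per day, in weekly minutes
    deficit = np.clip(cap - minutes, 0.0, None)  # zero whenever walking meets the cap
    positive = np.sort(deficit[deficit > 0.0])
    ranks = np.searchsorted(positive, deficit, side="right")
    return np.where(deficit > 0.0, ranks / max(len(positive), 1), 0.0)

def scale_vitamin_d(ng_per_ml):
    """Vitamin study: levels of at least 40 ng/mL map to T = 0 and T grows
    linearly to 1 as the level tends to zero."""
    level = np.asarray(ng_per_ml, dtype=float)
    return np.clip((40.0 - level) / 40.0, 0.0, 1.0)
```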
References

• A physicist's guide to the solution of Kummer's equation and confluent hypergeometric functions
• Bayesian sensitivity analysis for unmeasured confounding in observational studies
• Multimodal population brain imaging in the UK Biobank prospective epidemiological study
• Searching for activation functions
• Propensity score matching: the 'devil is in the details' where more may be hidden than you know
• Observational Studies
• Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome
• Estimating causal effects of treatments in randomized and nonrandomized studies
• The interpretation of interaction in contingency tables
• A distributional approach for causal inference using propensity scores
• An introduction to proximal causal learning
• Importance sampling: a review
• Entropy balancing for continuous treatments
• Nonparametric estimation of population average dose-response curves using entropy balancing weights for continuous exposures
• Sense and sensitivity analysis: simple post-hoc analysis of bias due to unobserved confounding
• β-intact-VAE: identifying and estimating causal effects under limited overlap
• Bounds on the conditional and average treatment effect with unobserved confounding factors
• Conformal sensitivity analysis for individual treatment effects
• Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap