title: Shall we count the living or the dead?
authors: Huitfeldt, Anders; Fox, Matthew P.; Daniel, Rhian M.; Hróbjartsson, Asbjørn; Murray, Eleanor J.
date: 2021-06-11

In the 1958 paper "Shall we count the living or the dead", Mindel C. Sheps proposed a principled solution to the familiar problem of the asymmetry of the relative risk. We provide causal models to clarify the scope and limitations of Sheps' line of reasoning, and show that her preferred variant of the relative risk will be stable between patient groups under certain biologically interpretable conditions. Such stability is useful when findings from an intervention study must be generalized to support clinical decisions in patients whose risk profile differs from that of the participants in the study. We show that Sheps' approach is consistent with a substantial body of psychological and philosophical research on how human reasoners carry causal information from one context to another, and that it can be implemented in practice using van der Laan et al's Switch Relative Risk, or equivalently, using Baker and Jackson's Generalized Relative Risk Reduction (GRRR).

When evaluating evidence in order to engage in shared decision-making about a medical intervention (such as initiating treatment with a particular drug), patients want to know the potential harms and benefits that they can expect to experience if they undergo the intervention, and if they do not [1]. However, personalized estimates of absolute risks are rarely published, as very few studies can be powered to estimate risk in each subgroup separately [2, 3]. In the absence of direct evidence, the personalized risk under intervention can instead be estimated from background information about the patient's risk profile if untreated, combined with a published measure of the magnitude of the effect.
This procedure usually depends on a strong assumption that the relative effect of the intervention is stable across groups, i.e. on the absence of effect modification on the risk ratio scale [4, 5]. While the risk ratio is a commonly used scale with attractive properties, stability of the risk ratio is not a universally held belief [6, 7, 8] and has unclear theoretical support. Stability of the effect measure is also a crucial consideration for randomization inference [9], and when choosing the link function in a generalized linear model or the summary parameter of a meta-analysis [10]. It is well established that the risk ratio is asymmetric, meaning that the predictions of a risk ratio model are not invariant to recoding of the exposure variable [11]. In this manuscript, we discuss a resolution to this limitation, based on an approach to modelling binary events which originated with the Canadian physician and biostatistician Mindel C. Sheps. In her 1958 paper "Shall we count the living or the dead?" [12], Sheps argued that in settings where an intervention reduces risk of an outcome, the standard risk ratio (which "counts the dead") is the most suitable effect measure; whereas in settings where the intervention increases risk, a variant of the relative risk which is based on the probability of not having the event (i.e. which "counts the living") is preferred. For consistency with the earlier literature, we will refer to the variant of the relative risk which "counts the living" (i.e. counts the complement of the outcome event) as the survival ratio, regardless of whether the outcome is death. While Sheps' ideas are rarely used in the applied literature, variations of her insights have been rediscovered independently multiple times [13, 14, 15, 16]. We expand upon Sheps' argument by attaching it to a causal model consistent both with standard toxicological mechanisms and with influential work on generalizability from the philosophy and psychology literature.
We then show that Sheps' recommendations can be implemented in practice by using van der Laan et al's switch relative risk model [17], or equivalently, using Baker and Jackson's generalized relative risk reduction (GRRR) parameter [18]. All variables considered in the paper are defined in Table 1 and all effect measures are defined in Table 2. We refer to the marginal risk of the outcome (in the population of interest) under the intervention as p_1 and to the corresponding "baseline" risk under a control condition as p_0. Because we are considering causal measures of effect, p_1 and p_0 are equated with the expectations of the potential outcomes, Y^{a=1} and Y^{a=0}, respectively. When any quantity is considered in a specific subgroup, this is denoted in the subscript, e.g. p_{0,v}. We use ¬ to denote the complement of an event. Each effect measure defines a scale for measuring the magnitude of the contrast between p_1 and p_0. We use upper-case Greek letters to refer to the set of all possible values of an effect measure (and to the effect measure when considered as an unspecified parameter), and lower-case Greek letters to refer to a specific value of an effect measure. We use Λ whenever we need to refer to an unspecified placeholder effect measure. We use the subscript i when we wish to refer to the value that a variable takes in individual i; this generally refers to the person about whom an evidence-based decision needs to be made. Any effect measure for a binary outcome can be represented as a characteristic effect function g, which takes a probability p as input and outputs a real number g(p), often another probability. In different contexts, p could be a marginal risk, e.g. p_0, or a conditional risk, e.g. p_{0,v}. Either way, the input risk relates to the control condition and the output risk relates to the corresponding risk under intervention.
In general, the effect function is obtained from the definition of the effect measure (considered as a function of p_0 and p_1), as the inverse function with respect to p_1. For example, the effect function for the odds ratio is g_γ(p) = (γ × p/(1−p)) / (1 + γ × p/(1−p)), and the (marginal) odds ratio model can be written as p_1 = g_γ(p_0). The effect function is useful because it governs many interesting features of an effect measure:

• If the effect function g_λ is closed on [0, 1] for all λ, models based on Λ will not produce invalid predictions
• If the effect function g_λ is an affine transformation for all λ, Λ is collapsible [19]
• Models based on effect measures Λ and Λ′ are prediction-equivalent (i.e. if both models are fit in the same data, they will lead to identical predictions) if for all valid parameter values λ in Λ, there exists a corresponding parameter value λ′ in Λ′ such that g_λ(p) = g_{λ′}(p) for all p

Rational clinical decisions depend primarily on beliefs about the following three inputs: the baseline risk in the group the patient belongs to (p_{0,l_i}), the risk under treatment in the group the patient belongs to (p_{1,l_i}), and the patient's utility function (or the collective social welfare function), i.e. a monotonically increasing function over whatever outcome the decision maker is optimizing for, reflecting their values and risk posture [20]. None of these inputs are knowable with certainty, but in almost all applications, it is assumed that the physician has access to a reasonably accurate estimate of the patient-specific baseline risk p_{0,l_i}, and that information about the utility function can be elicited by asking patients (for example using standard gambles [21]). Any attempt at individualizing the results from a study to a specific patient, or generalizing the results to a different population, necessarily involves invoking an explicit or implicit homogeneity assumption, possibly conditional on some set of effect modifiers.
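To make the role of the effect function concrete, the following sketch (our own illustration, not code from the paper; function names and the numerical inputs are our choices) implements the characteristic effect functions of four common effect measures and shows that transporting the same trial result to a new baseline risk is scale-specific:

```python
# Characteristic effect functions g_lambda: each maps a control-condition
# risk p to the predicted risk under intervention, given a parameter value.

def g_risk_difference(p, rd):
    return p + rd  # affine, hence collapsible; not closed on [0, 1]

def g_risk_ratio(p, rr):
    return p * rr  # "counts the dead"; not closed on [0, 1] when rr > 1

def g_survival_ratio(p, sr):
    return 1 - (1 - p) * sr  # "counts the living"

def g_odds_ratio(p, odds_ratio):
    odds = p / (1 - p) * odds_ratio
    return odds / (1 + odds)  # closed on [0, 1] for any positive odds ratio

# A hypothetical trial with p0 = 0.10, p1 = 0.05 is summarized on each scale:
p0, p1 = 0.10, 0.05
rd, rr = p1 - p0, p1 / p0
sr = (1 - p1) / (1 - p0)
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))

# Transporting each summary to a patient group with baseline risk 0.40
# yields four different predictions for the risk under intervention:
for g, lam in [(g_risk_difference, rd), (g_risk_ratio, rr),
               (g_survival_ratio, sr), (g_odds_ratio, odds_ratio)]:
    print(g.__name__, round(g(0.40, lam), 4))
```

Holding the trial data fixed, the four homogeneity assumptions predict risks of 0.35, 0.20, roughly 0.367 and 0.24 respectively, illustrating why the choice of scale cannot be treated as a matter of convenience.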
Typically, these homogeneity assumptions come in the form of a claim about conditional stability of an effect measure between patient groups. In most applications, homogeneity assumptions which rely on different effect measures will lead to different clinical predictions. For this reason, we give priority to conditional stability/homogeneity above all other considerations for choice of effect measure. In our view, effect measure stability must be evaluated in the particular context of each specific study. This cannot be based on empirical considerations alone, due to the extrapolator's circle [22]: in order to empirically establish stability of an effect measure between a study population and a target population, scientists would need sufficient data in the target population such that the effect could be determined without the need for extrapolation. Therefore, biological background knowledge is required. To illustrate how homogeneity assumptions are used in practice, consider the Cochrane Handbook, which suggests that authors of systematic reviews should provide individualized risk differences where "the risk in the intervention group (and its 95% confidence interval) is based on the assumed risk in the comparison group and the relative effect of the intervention (and its 95% CI)" [23]. In other words, background knowledge about the patient's baseline risk (p_{0,l_i}) is combined with a published estimate of the magnitude of the effect (λ), in order to produce an estimate of the risk under treatment (p_{1,l_i}): p_{1,l_i} is equated with g_λ(p_{0,l_i}). This procedure is inherently scale-specific [24, 25, 26, 27], and will result in different predictions for p_{1,l_i}, depending on the choice of effect measure (see interactive figure S1). Therefore, it is only justified if there is reason to believe that the effect of treatment is homogeneous when measured on the λ scale, i.e. that λ is stable between patient groups.
In order to account for effect heterogeneity, the procedure may be modified to condition on a measured set of baseline covariates M, by estimating λ_{m_i} in the study, and equating p_{1,l_i} with g_{λ_{m_i}}(p_{0,l_i}). However, this modification does not overcome the basic problem of scale-dependency, and only shifts the homogeneity assumption to the conditional stratum M = m_i. It will therefore only be justified if there is reason to believe M is a sufficient set of effect modifiers on the λ scale, i.e. that two groups from different settings (for example, S = s and S = t) that share the same value of the effect modifiers (for example, M = m) will have the same effect on a specific scale. For example, if λ depends only on gender, λ_m may be expected to be equal between groups of men from different countries, and between groups of women from different countries. A key advantage of the framework we outline in the remainder of this paper is that the procedure for choosing the conditioning set M is linked to the choice of effect measure via a model for the mechanism of action, thereby facilitating a meaningful evaluation of the biological plausibility of the conditional homogeneity assumption. It has been argued that interpretability is a key consideration for choice of effect measure, in part because this is central to explainability of the algorithms used for decision making. We recognize that interpretability is a desirable feature of an effect measure, but we note that decision makers have little use for an intuitive interpretation of the effect measure in the study population if the same interpretation does not also validly explain how the intervention will affect the patient. Therefore, in the absence of any argument for stability, interpretability cannot be a primary consideration. If the utility function is risk neutral with respect to the number of survivors (i.e.
if the second derivative of the utility function over the number of survivors is zero), the same decision will always be made in two groups between whom the baseline risks differ but the risk difference is equal. To illustrate, if we know that an intervention reduces risk by one percentage point in all groups, and the baseline risk in group m is 1% and the baseline risk in group m′ is 99%, then these two groups will get the "same benefit" in terms of how many additional people will survive if given the intervention. For this reason, if a clinician needs to make a decision about whether the benefits outweigh the harms of an intervention in a patient with neutral risk posture, the risk difference is in theory the only input to the decision problem that needs to be known. However, we believe it is rare that a decision maker will find themselves in a situation where this property can be utilized. A risk difference that is applicable to the patient will only be known if both p_{0,l_i} and p_{1,l_i} are known (in which case the risk difference is superfluous, as the decision maker can use the absolute risks, which contain strictly more information and allow arbitrary utility functions), or if the risk difference itself is known to be stable (which would be convenient, but such convenience is very poor justification for choosing to rely upon it for generalizability). We next proceed to outline a general class of biological mechanisms that result in stability of Sheps' preferred variant of the relative risk; this will enable understanding effect heterogeneity as biologically interpretable deviation from that class of mechanisms.
These mechanisms are consistent with Patricia Cheng's power-PC framework for causal generative and preventive power [13, 28], an approach which has considerable support in the psychology and philosophy literature [29, 30, 31], where it has been argued on both empirical and normative grounds that human reasoners use (and should use) these constructs to carry causal information from one context to another [32]. The mechanisms are also consistent with Bouckaert and Mouchart's Sure Outcomes of Random Events model [14, 33], and with the independent joint action model from toxicology [34, 35, 36], which has previously been discussed in the epidemiological literature by Weinberg [37, 38, 39, 40]. Suppose researchers want to quantify the risk of anaphylaxis following treatment with Penicillin. For illustrative purposes, and using a purely hypothetical example, suppose Penicillin causes anaphylaxis in the first two weeks of treatment in those individuals who have genetic or other susceptibility to Penicillin allergy, that such susceptibility is independent of the risk of anaphylaxis due to other allergens, and that it has a prevalence of 1% in all populations. We will also assume monotonicity, i.e. that Penicillin never prevents anaphylaxis in anyone who would have it if untreated. We note that these assumptions are strong; they will be discussed in more detail later. The researchers conduct an experiment, and the control arm shows that the baseline risk of non-Penicillin-related anaphylaxis in their study population is 0.5%.

Table 3: Hypothetical data from randomized trials in populations with different baseline risks, generated under the assumption that the intervention has no effect on the outcome in the absence of gene B, that the intervention is a sufficient cause of the outcome in the presence of gene B, that gene B has a prevalence of 1% in all populations, and that the gene is independent of the baseline risk.
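The hypothetical data-generating process just described can be checked numerically. The following sketch (our own; the set of baseline risks is an arbitrary illustration) computes the risks in populations that differ only in baseline risk, under the stated assumptions of independence and monotonicity:

```python
# Hypothetical mechanism: gene B (prevalence 1%) makes Penicillin a
# sufficient cause of anaphylaxis, B is independent of the baseline risk,
# and Penicillin never prevents anaphylaxis (monotonicity).
pr_b = 0.01

for p0 in [0.005, 0.05, 0.20, 0.50]:   # baseline risks in four populations
    p1 = p0 + (1 - p0) * pr_b          # risk under treatment: Pr(U or B)
    rr = p1 / p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    sr = (1 - p1) / (1 - p0)
    print(f"p0={p0:.3f}  p1={p1:.5f}  RR={rr:.3f}  OR={odds_ratio:.3f}  SR={sr:.3f}")

# The survival ratio equals 1 - pr_b = 0.99 in every population, while the
# risk ratio and the odds ratio vary with the baseline risk.
```

With p0 = 0.005 this reproduces the 1.495% risk under treatment discussed in the text, and the survival ratio is exactly 0.99 in every row, no matter the baseline risk.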
In the Penicillin arm, the 0.5% who have other allergies still have reactions, whereas among the 99.5% who do not have other allergies, those who have genetic susceptibility also have a reaction. Therefore, the overall risk of anaphylaxis in the active arm will be 0.005 + 0.995 × 0.01 ≈ 1.5%. Now suppose we rerun the experiment in several different populations with different baseline risks but identical prevalence of Penicillin allergy. Table 3 shows the outcome of these experiments. In this table, the risk ratio and the odds ratio differ substantially between the different populations, while the survival ratio is exactly equal and the risk difference is almost equal in every group. This is not a coincidence; the mechanism of action that we assumed imposes sufficient structure to make the effect of Penicillin stable across groups, but only on the survival ratio scale (when the outcome is rare, the risk difference is closely approximated by one minus the survival ratio, and is therefore also nearly stable). This demonstrates that structural knowledge about the distribution and function of the unmeasured covariates that turn treatment effects "on" or "off" can sometimes be sufficient to guarantee treatment effect stability on one specific scale, but not on others. In general, we will use the term "switches" to refer to variables which play a similar role to the genetic factors that cause Penicillin allergy. Table 4 shows four different types of switches; each switch is associated with a characteristic effect measure. For any type of switch, if response to treatment depends only on that type of switch and the prevalence of the switch is stable between two groups, the characteristic effect measure will also be stable.

Table 4: Four types of switches, and their characteristic effect measure. If treatment response is determined entirely by one class of switches, the characteristic effect measure associated with that switch will be stable across groups with the same prevalence of the switch.

In practice, treatment response rarely depends only on one type of switch, but in many cases, it will be useful to identify the switch type that is primarily responsible for the effect and use its characteristic effect measure as the default choice, such that effect heterogeneity can be understood as deviations from the "pure" mechanism that would have led to effect measure homogeneity. The central insight that we will expand upon with causal models in the remainder of this manuscript is that we sometimes have biological reasons for believing that one type of switch is predominantly responsible for the effect of the intervention.

Sufficient-component cause models ("causal pie models") [41, 42] consider several different combinations of factors that together comprise a sufficient cause of the outcome ("causal pies"). Each pie contains component causes ("slices of the pie"), such that if every slice of any pie is present, the outcome will occur. These models can be used to generalize the thought experiment from the previous section to settings where the switches are compound constructs determined by a large number of more primitive switches, and to facilitate reasoning about the biological plausibility of the assumed mechanism.

4.1 Shall we count the living or the dead?

Like relative risks, causal pie models are not invariant to whether they represent sufficient causes of the outcome (counting the dead), or sufficient causes of not having the outcome (counting the living). Any complete set of conditions which determine whether the outcome occurs can be represented in either form (though these models may differ in their complexity; we discuss this in the appendix). The generalized form of the argument in the previous section depends on using knowledge about biological mechanisms to choose between these two models.
As we will show, this choice may depend on whether the intervention increases or decreases risk of the outcome. Figure 1 shows a causal pie model for the outcome. The causal pies in this model can be partitioned into three broad classes: Class 1 contains those causal pies that do not depend on Penicillin. Class 2 contains those causal pies in which Penicillin (A) is a component. Class 3 contains those causal pies in which not taking Penicillin (¬A) is a component. The causal pies of class 1 are taken to generate the background risk of Y, i.e. the component of the risk that occurs regardless of whether the intervention is given. We allow the distribution of the component causes of pies of class 1 to vary arbitrarily between groups, resulting in different baseline risks. We call the event that all component causes of at least one causal pie of class 1 are met U. Penicillin will trigger the outcome in those people who have met every other component in at least one causal pie of class 2, and the absence of Penicillin will trigger the outcome in those who have met every other component of at least one causal pie of class 3. We will refer to the event that every non-A component of at least one causal pie of class 2 is present as B, and the event that every non-¬A component of at least one causal pie of class 3 is present as C. Under this model, risk under the intervention will be given by p_1 = Pr(U ∪ B), and risk in the control group is p_0 = Pr(U ∪ C). Figure 2 shows a causal pie model for not getting the outcome. Here, the causal pies can also be partitioned into three different classes: The outcome does not occur if every component cause of at least one causal pie of class 4 is met; we call this event W. If every non-intervention component cause of at least one causal pie of class 5 is met (D), the intervention ensures that the outcome does not occur.
If every non-intervention component of at least one causal pie of class 6 is met (E), not taking Penicillin ensures that the outcome does not occur. With this causal model, the risk of not having the outcome under the intervention will be given by 1 − p_1 = Pr(W ∪ D), and the risk of not having the outcome in the control group is 1 − p_0 = Pr(W ∪ E). As an example of an intervention that increases risk, we will consider the effect of treatment with Penicillin on the risk of anaphylaxis. Suppose first we relied upon the model in Figure 1 to reason about the choice of effect measure. If we are willing to assume positive monotonicity (Pr(C) = 0) and that B is independent of U, it follows that the survival ratio in any group or population is equal to the prevalence of not having B, and is therefore stable across groups with a similar prevalence of B:

SR = (1 − p_1)/(1 − p_0) = Pr(¬U)Pr(¬B)/Pr(¬U) = Pr(¬B)

Heterogeneity on the survival ratio scale can then be understood as resulting from deviations from these conditions, for example due to a different prevalence of B between groups, correlation between B and U, or non-monotonicity:

• Heterogeneity between groups due to differences in the distribution of B corresponds to the familiar concept of effect modification [43]. Predictors of B must therefore be accounted for in the analysis as effect modifiers. We note that after controlling for all predictors of B (i.e. for all predictors of the joint distribution of B_1, B_2 and B_3), there is a theoretical rationale for conditional effect measure stability between different settings S even if the baseline risk generated by U differs between groups, which satisfies at least one plausible interpretation of the possibly overloaded term "baseline risk independence".
• Correlation between B and U might occur, for example, if some people are particularly susceptible to anaphylactic reactions in general (in these people, general-factor susceptibility to allergic reactions is a component both of pies of class 1, and of pies of class 2).
In order to address the heterogeneity that results from this, investigators will be required to condition on markers for general-factor susceptibility to allergic reactions, or to use sensitivity analysis or partial identification methods to bound the effect.
• Deviations from monotonicity will occur if there is at least one complete set of C-components with prevalence greater than 0. Monotonicity is a strong assumption, which must be evaluated separately for each exposure-outcome relationship. However, even in the absence of monotonicity, there may be approximate stability as long as the drug predominantly works in one direction (i.e. if there are only very few people in whom the drug works in the opposite direction from the majority). In such settings, the effect measure in all strata can be bounded using partial identification methods, and these bounds may be quite informative [16, 44].

If instead we relied upon the model in Figure 2, a similar argument could be used to show that the risk ratio will be homogeneous, under the monotonicity condition Pr(D) = 0 and the independence condition E ⊥⊥ W, with analogous methodological implications:

RR = p_1/p_0 = Pr(¬W)/(Pr(¬W)Pr(¬E)) = 1/Pr(¬E)

Therefore, a fundamental question for the choice between the survival ratio and the risk ratio for safety outcomes will be whether our knowledge about biology guides us towards believing the mechanism of action is best described by the model in Figure 1 or by the model in Figure 2. This can be evaluated by reasoning about what each model says about the intervention's mechanism of action; i.e. about whether the effect of treatment depends on switches of type B, or on switches of type E. Under the model in Figure 1, the effect of the intervention might for instance occur in those individuals who have a gene that makes them highly susceptible to allergic reactions to Penicillin (B_1), or who have a different gene B_2 that also makes them susceptible to allergic reactions to Penicillin, but only when the cofactor B_3 is present.
This is a plausible mechanism, and determining the potential effect modifiers which predict the prevalence of B is sometimes a tractable task for a human reasoner. In contrast, under the model in Figure 2, an operative gene of type E would make not taking Penicillin a sufficient cause of not getting anaphylaxis. Such a gene would essentially eliminate the possibility of having allergies to anything, as long as the patient can just avoid taking Penicillin. While it will in theory always be possible to describe a very complicated causal pie of class 6, this would require us to incorporate the absence of every other cause of anaphylaxis as E-components in the pie, which will make the task of accounting for all predictors of the prevalence of E impossible. We therefore believe that when considering the anaphylaxis outcome, the biological mechanism of Penicillin is best described by the model in Figure 1. A similar argument can be used in many cases where an intervention increases risk (such as when considering adverse events); we therefore adopt Sheps' conclusion that the survival ratio is usually a more suitable scale for interventions that increase risk. To illustrate the setting where the intervention reduces risk of the outcome, we will consider the effect of Penicillin in patients with Streptococcal Pharyngitis, in particular its effect at reducing the risk of rheumatic fever.
If the mechanism of action is best approximated by the model in Figure 1, the survival ratio is determined by the prevalence of C, under the negative monotonicity condition Pr(B) = 0 and the independence condition C ⊥⊥ U:

SR = (1 − p_1)/(1 − p_0) = Pr(¬U)/(Pr(¬U)Pr(¬C)) = 1/Pr(¬C)

If alternatively the mechanism of action is best described by the model in Figure 2, the risk ratio is determined by the prevalence of D, under the negative monotonicity condition Pr(E) = 0 and the independence condition D ⊥⊥ W:

RR = p_1/p_0 = Pr(¬W)Pr(¬D)/Pr(¬W) = Pr(¬D)

Under the model in Figure 2, early treatment with Penicillin is a sufficient cause of not getting rheumatic fever in some patients with a β-lactam-susceptible strain of Streptococcal Pharyngitis infection (D_1), or with a different β-lactam-susceptible strain (D_2) and no abnormalities of drug metabolism (D_3). This is a plausible mechanism, and reasoners can plausibly determine the potential effect modifiers which predict the prevalence of D. In contrast, under the model in Figure 1, the cofactors C_1, C_2 and C_3 would combine to make not taking Penicillin a sufficient cause of rheumatic fever. This would mean that the patient's own immune system is irrelevant, and that there could be no way to prevent the outcome other than to initiate treatment with this specific drug. Specifying a causal pie of class 3 will therefore only be possible by incorporating the absence of every other potential way the body could clear the infection as C-components, making it almost impossible to use predictors of C to reason about effect modification. This leads to a preference for using the model from Figure 2 for this intervention. Similar logic will apply for many interventions that reduce risk. We therefore again adopt Sheps' view that the risk ratio is preferred in the case of interventions that decrease risk.
However, the scope of this conclusion is more limited than the corresponding argument for interventions that increase risk, and does not apply when the outcome is all-cause mortality, as it is generally not plausible to model an intervention as being a sufficient cause of all-cause survival. We discuss this in more detail in the appendix. So far, we have argued that in general, switches of type B are much more prevalent than switches of type E, and that switches of type D are more prevalent than switches of type C. This is a pattern that we believe matches most readers' intuition about how biological systems work. We now proceed to hint at one possible explanation for this asymmetry. For many potential interventions, our ancestors were either almost uniformly exposed or almost uniformly unexposed. For example, virtually no human ancestor was exposed to Penicillin. In such an environment, the presence of a gene of type B, which causes allergy when exposed to Penicillin, will not subject the organism to any particular kind of evolutionary pressure; whereas a gene of type E, which prevents all allergy in anyone who does not take Penicillin, would very quickly reach fixation (and will therefore not be plausible as a determinant of variation in treatment response). In a different but logically possible world, one in which Penicillin molecules were in the water supply, gene B would instead have been eliminated from the gene pool, and gene E would not subject its holder to any evolutionary pressure. However, we do not live in that world, and B is therefore more prevalent. A similar argument could be made for the protective effect of Penicillin, by reasoning about the evolutionary pressures on bacteria, leading to a preference for switches of type D over switches of type C. This argument can only be applied when considering interventions for which there was a "default" state in the evolutionary past.
It would for example not be possible to make this argument for an exposure variable such as sex, because all humans descend both from ancestors who were subjected to evolutionary pressure as men, and from ancestors who were subjected to evolutionary pressure as women. Therefore, this framework does not provide a reason to expect stability of the effect of sex (and similar variables) on any scale. This again corresponds to Sheps' conclusion that "in this example, there is no general basis for a preference among several possible denominators". In our view, this is not so much a shortcoming of Sheps' suggestion as a shortcoming of all effect measures: the framework provides a rationale for expecting stability of the effect of some interventions but not others, and the open problem of finding a stable scale for the effect of variables such as gender is left unsolved. Statistical modellers and clinical scientists often require an effect measure which can be specified before it is known whether the intervention increases or decreases the risk of the outcome. This motivates the switch relative risk [17], a composite effect parameter which selects a variant of the relative risk depending on whether risk of the event is higher or lower when the intervention is implemented. The switch relative risk is defined as being equal to the risk ratio if the intervention reduces risk of the outcome, and equal to the survival ratio if the intervention increases risk of the outcome. Baker and Jackson [18] proposed a notationally convenient representation of the switch relative risk, which they referred to as the "generalized relative risk reduction" (GRRR) and gave the symbol θ.
GRRR is prediction-equivalent to the switch relative risk in the sense defined in section 1.3, and is defined as being equal to one minus the survival ratio if the intervention increases risk of the outcome, equal to 0 if the intervention has no effect, and equal to the risk ratio minus one if the intervention reduces risk:

θ = 1 − (1 − p_1)/(1 − p_0) if p_1 > p_0;  θ = 0 if p_1 = p_0;  θ = p_1/p_0 − 1 if p_1 < p_0

The effect function of θ is its inverse with respect to p_1:

g_θ(p) = 1 − (1 − p)(1 − θ) if θ > 0;  g_θ(p) = p(1 + θ) if θ ≤ 0

To illustrate the calculation of θ from data, suppose an RCT shows that risk in the control group is 2% and risk in the intervention group is 1%. Then θ = 0.01/0.02 − 1 = −0.5. If instead risk in the control group is 2% and risk in the intervention group is 4%, θ = 1 − 0.96/0.98 ≈ 0.02. In general, the causal θ-parameter will lie in the range [−1, 1], and will be positive if treatment increases risk, negative if treatment reduces risk, closer to 0 if effects are small, and closer to 1 or −1 if effects are large. If we have information on the baseline risk in the group that our patient belongs to, and wish to combine this with a published estimate of θ in order to predict their risk of the outcome under the intervention, this can be calculated using the effect function. To illustrate, if a doctor believes that her patient belongs to a group whose baseline risk is 3%, and is told that θ = 0.02, she will predict that risk under the intervention will be 1 − (1 − 0.03) × (1 − 0.02) ≈ 5%. If she is instead told that θ = −0.5, she will predict that the patient's risk under the intervention is 0.03 × 0.5 = 1.5%. The effect function is closed on the interval [0, 1]; this procedure will therefore not result in predicting invalid probabilities. Fig. 3 illustrates the θ scale on a number line.
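The θ parameter and its effect function can be sketched in a few lines of code. This is our own minimal implementation of the definitions above (function names are our choices, not from Baker and Jackson), reproducing the worked examples in the text:

```python
def grrr(p0, p1):
    """Generalized relative risk reduction theta, from baseline and treated risks."""
    if p1 > p0:
        return 1 - (1 - p1) / (1 - p0)   # one minus the survival ratio
    if p1 < p0:
        return p1 / p0 - 1               # risk ratio minus one
    return 0.0                           # no effect

def predict_risk(p0, theta):
    """Effect function g_theta: predicted risk under intervention for baseline p0."""
    if theta > 0:
        return 1 - (1 - p0) * (1 - theta)
    return p0 * (1 + theta)

# Worked examples from the text:
assert abs(grrr(0.02, 0.01) + 0.5) < 1e-12        # theta = -0.5
theta_up = grrr(0.02, 0.04)                       # theta ≈ 0.02
risk_a = predict_risk(0.03, 0.02)                 # ≈ 0.0494, i.e. about 5%
risk_b = predict_risk(0.03, -0.5)                 # 0.015, i.e. 1.5%
```

Because `predict_risk` maps [0, 1] into [0, 1] for any θ in [−1, 1], the round trip `predict_risk(p0, grrr(p0, p1))` recovers p1, and no combination of valid inputs can produce an invalid probability.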
In somewhat of an oversimplification, if we assume that the intervention only works in one direction (monotonicity), a positive causal θ can be interpreted as the probability of "outcome changing" in response to treatment among those who would not experience the outcome if untreated, and the absolute value of a negative causal θ can be interpreted as the probability of outcome changing in response to treatment among those who would have experienced the outcome if untreated [16] . These probabilities are closely related to sufficiency scores, which differ only slightly in the counterfactual definition of the conditioning event, and which have recently been argued to improve upon state-of-the-art approaches to explainability of artificial intelligence [45] . We note that the switch relative risk is a disjunctive effect measure, and that it may therefore be challenging to give it a realist interpretation. A realist interpretation is however not necessary; as we have shown, stability of the switch relative risk is simply a useful mathematical consequence of certain underlying biological structures. If the switch relative risk is found to be too cumbersome for practical use, Sheps' recommendations can often be approximated with a careful choice from standard effect measures. If the risk of the outcome is low, the survival ratio can be closely approximated by one minus the risk difference. Therefore, if the survival ratio is stable, the risk difference will also be stable under a rare-disease assumption. This justifies individualizing treatment based on "relative benefits and absolute harms", as previously suggested by leading practitioners of evidence-based medicine [46, 47] . Sheps' recommendations are generally equivalent to the standard approach when considering the primary effectiveness outcome of an intervention, but would result in a clinically meaningful change in how empirical evidence is used to inform predictions about the risk of adverse events. 
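The rare-outcome approximation mentioned above can be verified numerically (a sketch of our own): for small risks the survival ratio is close to one minus the risk difference, and the agreement degrades as risks grow.

```python
def survival_ratio(p0, p1):
    """Ratio of the probabilities of NOT having the event."""
    return (1 - p1) / (1 - p0)

# Rare outcome: survival ratio ~ one minus the risk difference
p0, p1 = 0.010, 0.015
print(round(survival_ratio(p0, p1), 5))  # ~0.99495
print(1 - (p1 - p0))                     # 0.995

# Common outcome: the approximation breaks down
p0, p1 = 0.40, 0.60
print(round(survival_ratio(p0, p1), 4))  # ~0.6667
print(1 - (p1 - p0))                     # 0.8
```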
This can have substantial implications in settings where a clinician must determine whether the predicted benefits outweigh the predicted harms for patients whose risk profile differs from the typical participant in the study. To illustrate, consider the Pfizer BNT162b2 mRNA Covid-19 vaccine, which has been shown to have an effectiveness of 95 percent [48] at preventing Covid-19, corresponding to θ = −0.95 or RR = 0.05. The effectiveness of the vaccine was presented on a scale that is consistent with Sheps' recommendations; following her advice would therefore not alter any predictions about the benefits of vaccination. For questions about safety, this will not be the case. For example, a nationwide study in Israel has shown that the vaccine is associated with a small but possibly relevant elevated risk of myocarditis; Barda et al reported this in terms of a risk ratio of 3.2 [49]. Taking this result at face value, a clinician with a patient who has a baseline risk of myocarditis of 1% (significantly higher than the population average, perhaps reflecting a history of HIV infection or other prognostic factors for myocarditis) would conclude that the patient has a 3.2% risk of myocarditis if given the vaccine. Depending on the risk of infection if unvaccinated, and on the availability of other vaccines, this may well lead to a determination that the harms of vaccination outweigh the benefits for this particular patient. If the results from Barda et al had instead been presented in terms of the switch relative risk or the survival ratio (θ = 0.000027, SR = 0.999973), as Sheps would have recommended, the clinician could conclude that the risk of myocarditis changes from 1% to approximately 1.0027% when the patient is vaccinated.
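The two extrapolations in this example differ by more than a factor of three; a small calculation (our own, using the published point estimates) makes the contrast explicit:

```python
p0 = 0.01      # the patient's baseline myocarditis risk

# Extrapolating with the published risk ratio (Barda et al: RR = 3.2)
rr_based = 3.2 * p0
print(rr_based)                   # 0.032, i.e. 3.2%

# Extrapolating with the switch relative risk / survival ratio
th = 0.000027                     # theta as quoted in the text
sr_based = 1 - (1 - p0) * (1 - th)
print(round(sr_based, 6))         # ~0.010027, i.e. about 1.0027%
```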
We would argue that this approach leads to a much more realistic estimate, consistent with a biologically interpretable hypothesis that approximately 0.0027% of the population carry some form of a "switch" that makes them susceptible to myocarditis if vaccinated. This hypothesis may not perfectly describe the underlying biology; it is, for example, possible that the presence of this switch is correlated with baseline risk of myocarditis, in which case a sensitivity analysis is needed to explore the potential consequences of such correlation. We maintain that even the upper bounds of this sensitivity analysis are unlikely to produce risk estimates as high as what one would obtain if the analysis relied on homogeneity of the risk ratio or the odds ratio. In our view, Sheps' approach provides a starting point for reasoning about which factors must be accounted for in the analysis, in order to meaningfully summarize the risk of adverse effects on a numerical scale. In many situations, the extrapolator's objective is to choose the effect modifiers M by reasoning about the predictors of the individual-level determinants of treatment response Q (for example, the switches B, C, D and E), because this may provide justification for conditional stability of a measure of effect. Next, we show that this is impossible for the odds ratio: there cannot exist a set of individual-level covariates which determine treatment response, such that if two groups are equal in their conditional distribution of those background covariates, then they always have the same odds ratio, unless this also guarantees equality of all effect measures. Let s and t be two settings (values of S), for example representing countries. Suppose we are able to construct a set of effect modifiers M such that Q ⊥⊥ S | M, i.e. f(q|m, s) = f(q|m, t) = f(q|m, s ∪ t). If the odds ratio is equal between groups with the same distribution of Q, this can be shown to imply that either the conditional distribution of Q is degenerate, or that risk under treatment equals risk under control (i.e.
treatment has no effect). In both cases, it follows not just that the odds ratio is stable, but that every conditional effect measure is stable between the groups (m, s) and (m, t). In other words, conditional stability of the odds ratio due to equal conditional distribution of individual-level determinants of treatment response can only be obtained by controlling for enough variables to also obtain conditional stability of every other effect measure. This observation is closely related to non-collapsibility [50, 19], and a similar argument can be made for any non-collapsible effect parameter. While we caution against drawing overly general conclusions from this simple mathematical argument, it does demonstrate that a scientist who is selecting which effect modifiers to account for, aiming to obtain conditional homogeneity of the odds ratio, cannot be guided by biological beliefs about predictors of individual-level determinants of treatment response. We also note that Doi et al [51] have recently claimed that the odds ratio is independent of baseline risk. An immediate corollary of our result is that this alleged baseline risk independence of the odds ratio cannot be a consequence of equal conditional distribution of individual-level determinants of treatment response. We have presented a purely theoretical argument for approximate stability of a specific variant of the relative risk in some situations where the joint distribution of unmeasured determinants of treatment response can reasonably be expected to be approximately stable. Ideally, our argument would be supported by empirical evidence. Empirical evaluation of the relative stability of different measures of effect is not theoretically straightforward, as standard tests for homogeneity have different power for different measures of effect [6]. We note that the earlier literature contains convincing empirical evidence for stability of the risk ratio in settings where the intervention reduces risk [52].
Testing the empirical stability of the survival ratio for interventions that increase risk of the outcome should be a priority for future work. It is not always possible to convincingly establish stability of any effect measure. When this is the case, it may instead be necessary to aim for conditional stability of counterfactual distributions across populations, in order to allow generalizability of the conditional risks under intervention, p1(m), rather than of a conditional effect measure λ(m). This is a much more ambitious undertaking, and will require investigators to account for all causes of the outcome whose distribution may differ across populations [53, 54]. Under certain biologically interpretable assumptions about the distribution and function of switches that turn the effect of treatment "on" or "off", the survival ratio will be stable between different settings if the intervention increases risk of the outcome, and the risk ratio will be stable between settings if the intervention reduces risk of the outcome. This supports the recommendations from Sheps' landmark 1958 paper "Shall we count the living or the dead?" and motivates the switch relative risk, which becomes the survival ratio if the intervention increases risk of the outcome, and the risk ratio if the intervention reduces risk of the outcome. The models which justify these conclusions are consistent with Cheng's theory of generative and preventive causal power and with the independent action model from toxicology; the conditions which lead to stability of Sheps' preferred variant of the relative risk are thus better understood than for any other measure of effect.
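The stability claim can be verified arithmetically. In a toy model (ours), a risk-increasing treatment acts only through a "switch" carried by 5% of people, independently of baseline risk; the survival ratio is then identical across populations with very different baseline risks, while the risk ratio is not:

```python
Q = 0.05   # prevalence of the switch that turns the treatment effect "on"

def risks_under_switch_model(p0):
    """Treatment causes the outcome in switch carriers who would
    otherwise have avoided it: p1 = p0 + (1 - p0) * Q."""
    p1 = p0 + (1 - p0) * Q
    survival_ratio = (1 - p1) / (1 - p0)
    risk_ratio = p1 / p0
    return survival_ratio, risk_ratio

# Two populations with very different baseline risks, same switch prevalence
for p0 in (0.02, 0.30):
    sr, rr = risks_under_switch_model(p0)
    print(f"baseline {p0:.2f}: survival ratio {sr:.3f}, risk ratio {rr:.3f}")
# The survival ratio is 1 - Q = 0.95 in both populations,
# while the risk ratio swings from ~3.45 to ~1.12.
```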
While the model will rarely be a perfect description of reality, an advantage of linking the choice of effect measure to a causal mechanism is that effect measure heterogeneity can then be understood as biologically interpretable deviation from the mechanism, which may lead to clearer reasoning about how to account for potential effect measure modification when generalizing experimental findings to patients whose risk profile differs from the participants in the study.

Causal pie models for survival

Some readers may be troubled by doubts about whether the sufficient-component cause model can be applied to the absence of the outcome event. For example, the textbook Modern Epidemiology [55] argues that "Sheps (1958) once asked, 'Shall we count the living or the dead?'. Death is an event, but survival is not. Hence, to use the sufficient-component cause model, we must count the dead. This model restriction can have substantive implications". We do not accept this restriction. In principle, any complete list of causal pies leading to the event can be restated as a complete list of causal pies that are sufficient causes of not having the event. The two models are therefore different representations of the same underlying process; each is fully valid if used appropriately, and each may be useful for making a different methodological point. However, it will usually not be realistic to represent the intervention as a component in a sufficient cause of survival: this would mean that the intervention prevents even unrelated causes of death. We believe this observation accounts for at least some of the intuitive discomfort with considering survival as the outcome event. It does not necessarily mean that such models are unrealistic for more restricted non-event outcomes, such as non-incidence of a specific disease.
It is often entirely plausible to represent an intervention as a component of a sufficient cause of not experiencing a more restricted outcome, such as rheumatic fever. For these reasons, our examples intentionally relate to settings where the outcome of interest is short-term incidence of a specific disease rather than all-cause mortality. In many cases where this model is plausible, for example when the intervention is a sufficient cause of an absorbing state in which the patient is no longer at risk of the outcome, human language allows us to talk about the negative outcome event (in which the patient survived) as a positive event: for example, the patient was "cured" or "recovered from disease". We note that if a model states that E makes ¬A a sufficient cause of ¬Y, and the data imply a baseline risk that is higher than the prevalence of ¬E, then the model is inconsistent with the observations. This phenomenon is closely related to the fact that multiplicative models sometimes result in predictions outside the range of valid probabilities. In general, models based on switches of type B and D are consistent with any baseline risk but may be falsified by some values of risk under treatment, whereas models based on C and E are consistent with any risk under treatment but may be falsified by some values of baseline risk.

In this appendix, we show that an identical argument for stability of Sheps' preferred variant of the relative risk can be made using directed acyclic graphs rather than causal pie models. The reasoning outlined in the main manuscript depends on counterfactual independence relations of the type Y^{a=1} ⊥⊥ S | Y^{a=0}, V. On traditional causal graphs, such independence relations cannot immediately be inferred, as the graphs do not contain separate nodes for the counterfactuals Y^{a=1} and Y^{a=0}.
Recently, Cinelli and Pearl [44] introduced a graphical approach that enables reasoning about such independence conditions, by showing the counterfactuals on the graph. To illustrate a simplified version of this idea, we first consider a simple example of how such a graph might look. A naive attempt to draw a causal graph with nodes for counterfactuals is shown in Figure 5. Consider the nodes representing anaphylaxis under the intervention (Y^{a=1}) and anaphylaxis under the control condition (Y^{a=0}). In most settings, Y^{a=1} and Y^{a=0} share most of their causes. The true graph therefore almost certainly has dense connections between these two nodes, running via the node labelled U. Trying to measure sufficient covariates to block all paths that result in d-connection between Y^{a=1}, Y^{a=0} and S would normally be hopeless. But consider settings where we are additionally willing to impose constraints that arise from our background knowledge that Y^{a=1} and Y^{a=0} are very closely related constructs, for example such that Y^{a=1} is set equal to Y^{a=0} unless a specific covariate is present. This covariate then acts as a switch: it turns the effect of A on or off. If only one such type of switch is present, the graph in Figure 5 can be simplified. For example, if the effect of A on Y depends only on switches of type B, the entire assignment mechanism for Y^{a=1} can be specified with a graph in which its only parents are Y^{a=0} and the switch (see Figure 6). On such a graph, if we condition on sufficient variables V to block all paths between the switch and the population indicator S, we can read off the independence condition Y^{a=1} ⊥⊥ S | Y^{a=0}, V, which plays a key role in the analysis of stability of effect measures: in combination with a monotonicity assumption, it ensures that if the intervention increases risk, conditioning on V is sufficient for the survival ratio to be stable across populations S.
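This assignment mechanism can be illustrated with a small simulation (ours; it assumes the type-B switch is independent of Y^{a=0} and of the population indicator): draw Y^{a=0} and the switch independently, set Y^{a=1} = Y^{a=0} OR B, and the estimated survival ratio lands close to P(B = 0) = 0.95 in populations with very different baseline risks.

```python
import random

def simulate_survival_ratio(p0, q, n=200_000, seed=1):
    """Simulate individual-level counterfactuals under a type-B switch model."""
    rng = random.Random(seed)
    y0 = [rng.random() < p0 for _ in range(n)]   # outcome under control
    b = [rng.random() < q for _ in range(n)]     # type-B switch, drawn independently
    y1 = [a or s for a, s in zip(y0, b)]         # Y^{a=1} = Y^{a=0} OR B
    p0_hat = sum(y0) / n
    p1_hat = sum(y1) / n
    return (1 - p1_hat) / (1 - p0_hat)           # estimated survival ratio

# Two populations with different baseline risks, same switch prevalence
print(round(simulate_survival_ratio(0.02, 0.05), 3))  # ~0.95
print(round(simulate_survival_ratio(0.30, 0.05), 3))  # ~0.95
```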
In the absence of monotonicity, there may be approximate stability, within bounds that may be quite informative. Now consider a possible world where, instead of there only being switches of type B, there were only switches of type E. In such a world, the data generating mechanism would be described by the graph in Figure 7. The analysis is now reversed, and the risk ratio will be stable across groups. This raises an obvious question: why would an investigator assume that the true data generating mechanism is better described by Figure 6 than by Figure 7? We argue that the answer lies in the same background knowledge as discussed in the previous sections: switches of type B are often more plausible than switches of type E, leading to a preference for Figure 6. This analysis can be generalized to some settings where there are multiple types of switches. For example, if the effect of A on Y depends only on switches of type B and D, the assignment mechanism for Y^{a=1} can be represented as depending only on the node for Y^{a=0} and on B and D. However, if the assignment mechanism depends on both switches of type B and switches of type E, our specification leads to paradoxical circular assignment, with a bidirectional arrow between Y^{a=1} and Y^{a=0}, complicating any attempt to infer independences of relevance to effect measure stability. In general, switches of type B are coherent with switches of type D, and switches of type C are coherent with switches of type E. If background knowledge suggests that two incoherent types of switches play a significant role, these approaches will not be applicable.

References

Patients and investigators prefer measures of absolute risk in subgroups for pragmatic randomized trials
Subgroup analysis in clinical trials
Detecting moderator effects using subgroup analyses
Evidence-based medicine: How to practice and teach EBM. Churchill Livingstone
Can we individualize the 'number needed to treat'?
An empirical study of summary effect measures in meta-analyses
Is the Risk Difference Really a More Heterogeneous Measure?
Commentary: On Effect Measures, Heterogeneity, and the Laws of
Evaluating Public Health Interventions: 6. Modeling Ratios or Differences? Let the Data Tell Us
Controversies concerning randomization and additivity in clinical trials
Handbook for Systematic Reviews of Interventions Version
Analysis of Binary Data
Shall We Count the Living or the Dead?
From Covariation to Causation: A Causal Power Theory
Sure outcomes of random events: a model for clinical trials
On the measurement of susceptibility in epidemiologic studies
The choice of effect measure for binary outcomes: Introducing counterfactual outcome state transition parameters
Estimation of treatment effects in randomized trials with non-compliance and a dichotomous outcome
A new measure of treatment effect for random-effects meta-analysis of comparative binary outcome data
Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets
Notes On The Theory Of Choice. Underground classics in economics
Across the Boundaries: Extrapolation in Biology and Social Science
Chapter 14: Completing 'Summary of findings' tables and grading the certainty of the evidence
Explanation in causal inference: Methods for mediation and interaction
On effect-measure modification: Relationships among changes in the relative risk, odds ratio, and risk difference
The Interaction Continuum
Disagreement Concerning Effect-Measure Modification
Nature's Capacities and Their Measurement
Causal Powers. The British Journal for the Philosophy of Science
Causal mechanism and probability: A normative approach
The mind's arrows: Bayes nets and graphical causal models in psychology
When Is a Cause the "Same"?: Coherent Generalization Across Contexts
Pharmacological and residual effects in randomized placebo-controlled trials: A structural causal modelling approach
A Method of Computing the Effectiveness of an Insecticide
The Toxicity of
Contrasting Theories of Interaction in Epidemiology and Toxicology
Applicability of the simple independent action model to epidemiologic studies involving two factors and a dichotomous outcome
Can DAGs Clarify Effect Modification?
Inference from a multiplicative model of joint genetic effects for ovarian cancer risk
Interaction and Exposure Modification: Are We Asking the Right Questions?
The Cement of the Universe: A Study of Causation
Effect heterogeneity and variable selection for standardizing causal effects to a target population
Generalizing experimental results by leveraging knowledge of mechanisms
Explaining Black-Box Algorithms Using Probabilistic Contrastive Counterfactuals
An Evidence Based Approach to Individualising Treatment
Large trials with simple protocols: Indications and contraindications
Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine
Safety of the BNT162b2 mRNA Covid-19 Vaccine in a Nationwide Setting
On the collapsibility of measures of effect in the counterfactual causal framework
Questionable utility of the relative risk in clinical research: a call for change to practice
Issues in the Selection of a Summary Statistic for Meta-Analysis of Clinical Trials with Binary Outcomes
Confounding and Effect Modification: Distribution and Measure
External Validity: From Do-Calculus to Transportability Across Populations
Modern Epidemiology