key: cord-0570318-g0wv29fo
authors: Ribeiro, Andre F.
title: The External Validity of Combinatorial Samples and Populations
date: 2021-08-09
journal: nan
DOI: nan
sha: d3e6a5df6152bf8e168dfd5848793beadab98f07
doc_id: 570318
cord_uid: g0wv29fo

The widely used 'Counterfactual' definition of Causal Effects was derived for unbiasedness and accuracy, not generalizability. We propose a simple definition for the External Validity (EV) of Interventions, Counterfactual statements and Samples. We use the definition to discuss several issues that have baffled the counterfactual approach to effect estimation: out-of-sample validity, reliance on independence assumptions or estimation, concurrent estimation of many effects and full models, bias-variance tradeoffs, statistical power, omitted variables, and connections to supervised and explanation techniques. Methodologically, the definition also allows us to replace the parametric and generally ill-posed estimation problems that followed the counterfactual definition with combinatorial enumeration problems on non-experimental samples. We use over 20 contemporary methods and simulations to demonstrate that the approach leads to accuracy gains in standard out-of-sample prediction, intervention effect prediction and causal effect estimation tasks. The COVID-19 pandemic highlighted the need for learning solutions that provide general predictions in small samples, often with missing variables. We also demonstrate applications to this pressing problem.

Donald Rubin's seminal research [25, 32, 33] still provides the most broadly used and well-accepted definition of a causal effect. If $y$ is an outcome of interest and $a$ is a treatment indicator, then the causal effect of $a$ is the difference $y_i^{+a} - y_i^{-a}$ (1), where $y_i^{+a}$ is the outcome of individual $i$ under the treatment and $y_i^{-a}$ the outcome of the same individual without it. The central concept behind Eq.
(1) was inspired by experimental estimation: by fixing every factor other than the treatment, we can declare that the observed difference in outcome was certainly caused by the treatment, and by the treatment alone. The definition is an ideal, as it is impossible to observe outcomes for an individual, concurrently, in two different and totally fixed conditions. The theory goes that we may, instead, 'fix' factors in expectation, and across individuals. If the treated and non-treated subpopulations have the same expected values across all factors, then any difference between the groups is due to the treatment, given large enough samples. This can be paraphrased with an independence statement: the treatment must be conditionally independent of all other relevant factors. This train of thought led to the notion of Sample Balance in non-experimental estimation, and to the objective that most current causal estimators minimize.

We consider, instead, a definition over permutations, where $\Pi_S(m)$ is a set of permutations of a very large number of factors $m$, $y_<$ is the outcome of the set of elements before $a$ in the permutation order, and $y_\le$ is the outcome of the set including $a$. This is also an ideal, but it defines causal effects in a way that is almost opposite to Eq. (1). It calls for effects to be observed under large variation, as opposed to no variation. The most important element of this definition is $\Pi_S$, the number of permutations observed in a sample. EV is here the inverse of the variance (i.e., the precision) of effects under a large $\Pi_S$. Very importantly, these definitions can be extended to multiple causes, $a \in X$. We demonstrate that, due to their high number of permutations and equal subpopulation representation, causes defined this way are both predictive and free of sample biases. We thus look at non-experimental samples as random draws of squares.
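To make the permutation-based view concrete, the following is a minimal Python sketch, not the paper's estimator. It assumes that the effect of $a$ within a single permutation is the difference $y_\le - y_<$, enumerates every ordering of a small factor set, and reports EV as the inverse of the variance of those effects. The function names and the two toy outcome functions are hypothetical and chosen only for illustration.

```python
import itertools
import statistics


def permutation_effects(outcome, factors, a):
    """Effect of factor `a` in every ordering of `factors`.

    Assumption (illustrative): the effect in one permutation is
    outcome(elements up to and including a) - outcome(elements before a).
    """
    effects = []
    for order in itertools.permutations(factors):
        position = order.index(a)
        before = frozenset(order[:position])          # elements preceding a
        effects.append(outcome(before | {a}) - outcome(before))
    return effects


def external_validity(effects):
    """EV as precision: the inverse of the variance of the effects."""
    variance = statistics.pvariance(effects)
    return float("inf") if variance == 0 else 1.0 / variance


def additive_outcome(present):
    # Hypothetical outcome: a always adds 2, regardless of other factors.
    return 2.0 if "a" in present else 0.0


def interacting_outcome(present):
    # Hypothetical outcome: a adds 2, plus 1 more when b is also present.
    y = 0.0
    if "a" in present:
        y += 2.0
        if "b" in present:
            y += 1.0
    return y


if __name__ == "__main__":
    factors = ["a", "b", "c"]
    for name, fn in [("additive", additive_outcome),
                     ("interacting", interacting_outcome)]:
        effects = permutation_effects(fn, factors, "a")
        print(name, effects, external_validity(effects))
```

In the additive case the effect of `a` is identical in every ordering, so its variance is zero and the sketch reports unbounded precision. In the interacting case the effect depends on whether `b` precedes `a`, so the effects vary across permutations and the precision is finite. Full enumeration is only feasible for a handful of factors; the number of orderings grows factorially.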
This is in contrast to simply permuting sample observation orders [34]. A maximally accurate estimator (i.e., one with minimum variance) minimizes the covariance between the individuals, or individual states, it is comparing. Combinatorially, this can be seen as maximizing their intersecting factors (Fig. 1) ... intersecting factors) with the first. A partial permutation is a permutation with $d$ fixed points, d