key: cord-0636092-nh7v300j authors: Ziegler, Gabriel title: How many people are infected? A case study on SARS-CoV-2 prevalence in Austria date: 2020-12-22 journal: nan DOI: nan sha: 268c0d1a06618c28b0b85efc9fa85bb7a5baffc6 doc_id: 636092 cord_uid: nh7v300j

Using recent data from voluntary mass testing, I provide credible bounds on the prevalence of SARS-CoV-2 for Austrian counties in early December 2020. When estimating prevalence, a natural missing data problem arises: no test results are generated for non-tested people. In addition, tests are not perfectly predictive of the underlying infection. This is particularly relevant for mass SARS-CoV-2 testing, as it is conducted with rapid antigen tests, which are known to be somewhat imprecise. Using insights from the literature on partial identification, I propose a framework addressing both issues at once. I use the framework to study differing selection assumptions for the Austrian data. Whereas weak monotone selection assumptions provide limited identification power, reasonably stronger assumptions reduce the uncertainty on prevalence significantly.

An important measure for (health) policy during a pandemic is prevalence. As a quantification of current disease status, prevalence is the proportion of people in a given population having the disease (Rothman, 2012). Besides providing an evaluation of how widespread the disease is, prevalence is relevant for how accurate diagnostic tests are (Zhou et al., 2014). However, prevalence is not directly observable in many situations. Often prevalence is inferred from diagnostic test results, which are indicative of the disease. At least two problems arise in such a situation. First, tests are usually not perfectly indicative of the disease. A test might produce false positives, false negatives, or both. Ziegler (2020) discusses how this problem worsens when the test accuracy is evaluated with respect to an imperfect reference test.
In such cases, the test's information is ambiguous. Second, the tested population is usually not the whole population, and the composition of the testing pool matters for inference on overall prevalence. The latter problem is even more severe when the composition of the testing pool is unknown. For example, when testing is voluntary, it is not obvious whether disease-susceptible people are more or less likely to take the test. That is, people self-select into the testing pool.

Selection problems like the one arising with voluntary testing are ubiquitous in economics. Manski (1989) shows that, without strong assumptions on unobservable data, selection leads to an identification problem, so it might not be possible to assign a unique number to the relevant statistic.

In this note, I use data from voluntary COVID-19 mass testing in Austria in December 2020 to provide bounds on SARS-CoV-2 (point) prevalence at that time. Building on the work of Manski and Molinari (2021), Stoye (2020), and Ziegler (2020), I address both of the problems mentioned within one framework. This allows me to illustrate how much knowledge about prevalence can be obtained from the data alone (with minimal assumptions). Furthermore, the framework provides a simple method to assess the identifying power of varying (stronger) assumptions about the selection problem.

In the first half of December 2020 (4/12-15/12), every Austrian municipality provided voluntary SARS-CoV-2 tests for its population via rapid antigen tests. In many municipalities testing was available only for a few consecutive days. The goal of the policy was to identify otherwise undetected SARS-CoV-2 infected people. For this reason, people with typical symptoms, people who were already tested regularly at their workplace, quarantined people, and children below school starting age were explicitly asked not to attend the testing (Sozialministerium, 2020). This and anecdotal evidence suggest negative selection into testing. That is, tested people are less likely to be infected with SARS-CoV-2 than untested people. DerStandard (2020) provides data on test results and participation at the county level. The dataset covers only 7 of the 9 Austrian states (Bundesländer).

Let $c = 1$ denote a person who is infected with SARS-CoV-2 and $c = 0$ otherwise. Participation in the mass testing is indicated by $t = 1$ (again $t = 0$ otherwise). Only a tested person can obtain a positive test result, denoted by $a = 1$ (and $a = 0$ otherwise). The population (of a county) is a distribution $P(a, c, t)$, but the observed data are just $P(a \mid t = 1)$ and the participation rate $P(t = 1)$. In particular, note that $P(a = 1) = P(a = 1 \mid t = 1) P(t = 1)$ because a positive test can only be observed for tested people. The following notation is used:

$\gamma := P(a = 1 \mid t = 1)$ ... test yield
$\rho := P(c = 1)$ ... prevalence
$\tau := P(t = 1)$ ... proportion of tested people

The accuracy of the test is described by its sensitivity and specificity,

$$\sigma := P(a = 1 \mid c = 1, t = 1) = \frac{P(a = 1, c = 1 \mid t = 1)}{P(c = 1 \mid t = 1)}, \qquad \pi := P(a = 0 \mid c = 0, t = 1) = \frac{P(a = 0, c = 0 \mid t = 1)}{P(c = 0 \mid t = 1)}, \tag{1}$$

respectively. In line with Ziegler (2020), antigen tests correspond to ambiguous information, and therefore I assume that both sensitivity and specificity are only known to lie within an interval.

Assumption 1 (Ambiguous Information). The test satisfies $\sigma \in [\underline{\sigma}, \overline{\sigma}]$ and $\pi \in [\underline{\pi}, \overline{\pi}]$.

Assumption 1 alone provides sharp bounds on prevalence $\rho := P(c = 1)$:

Proposition 1. If Assumption 1 holds, then

$$\rho \in \left[ \tau \, \frac{\gamma + \underline{\pi} - 1}{\overline{\sigma} + \underline{\pi} - 1}, \;\; \tau \, \frac{\gamma + \overline{\pi} - 1}{\underline{\sigma} + \overline{\pi} - 1} + (1 - \tau) \right].$$

Proof. First, consider fixed $\sigma$ and $\pi$. Then by the law of total probability

$$\gamma = \sigma P(c = 1 \mid t = 1) + (1 - \pi) \bigl(1 - P(c = 1 \mid t = 1)\bigr) \quad \Longleftrightarrow \quad P(c = 1 \mid t = 1) = \frac{\gamma + \pi - 1}{\sigma + \pi - 1},$$

and $P(c = 1 \mid t = 0) \in [0, 1]$. Then, again by the law of total probability,

$$\rho = \tau P(c = 1 \mid t = 1) + (1 - \tau) P(c = 1 \mid t = 0) \in \left[ \tau \, \frac{\gamma + \pi - 1}{\sigma + \pi - 1}, \;\; \tau \, \frac{\gamma + \pi - 1}{\sigma + \pi - 1} + (1 - \tau) \right].$$

The fraction is increasing in $\pi$ and decreasing in $\sigma$. The result follows by evaluating at the respective extremes.

The prevalence bounds in Proposition 1 are pretty wide in applications, as will be seen later.
However, these bounds are not completely trivial: they do not merely state that prevalence lies between 0 and 1, even though they rely on minimal assumptions about the (untested) population. As can be seen in the proof of Proposition 1, $P(c = 1 \mid t = 0)$ is only trivially bounded without stronger assumptions, which leads to wide bounds on prevalence. As explained above, there is some indication of negative selection into the testing pool in the case of Austrian mass testing. This knowledge can be used to narrow the bounds on prevalence. This extraneous information is formalized in Assumption 2. [4]

[4] A potentially more satisfying way of modeling selection is bounding the odds ratio. Stoye (2020) uses such bounds in his analysis under the assumption of $\underline{\pi} = \overline{\pi} = 1$, i.e. the test does not produce false positives. Without this assumption, bounds on the odds ratio seem rather intractable. Furthermore, such an assumption is problematic in the application to antigen tests.

Assumption 2 (Selection). The population satisfies

$$\underline{\kappa} \, P(c = 1 \mid t = 1) \;\leq\; P(c = 1 \mid t = 0) \;\leq\; \overline{\kappa} \, P(c = 1 \mid t = 1)$$

with $\overline{\kappa} \geq \underline{\kappa} \geq 0$.

When $\underline{\kappa} \geq 1$, then $P(c = 1 \mid t = 0) \geq P(c = 1 \mid t = 1)$, so that tested people are less likely to be infected than untested people. This corresponds to the negative selection explained above. On the other hand, if $\overline{\kappa} \leq 1$, then there is positive selection, which seems more appropriate in the case of PCR testing. Indeed, Manski and Molinari (2021) use such an assumption in their study on the prevalence of SARS-CoV-2. Their assumption (called test-monotonicity) corresponds to $(\underline{\kappa}, \overline{\kappa}) = (0, 1)$.

Proposition 2. Suppose Assumption 1 and Assumption 2 hold. If $\overline{\kappa} \leq \frac{\underline{\sigma} + \overline{\pi} - 1}{\gamma + \overline{\pi} - 1}$, then

$$\rho \in \left[ \bigl(\tau + (1 - \tau) \underline{\kappa}\bigr) \frac{\gamma + \underline{\pi} - 1}{\overline{\sigma} + \underline{\pi} - 1}, \;\; \bigl(\tau + (1 - \tau) \overline{\kappa}\bigr) \frac{\gamma + \overline{\pi} - 1}{\underline{\sigma} + \overline{\pi} - 1} \right].$$

Otherwise the upper bound is given by Proposition 1.

Proof. As in the proof of Proposition 1, $P(c = 1 \mid t = 1) = \frac{\gamma + \pi - 1}{\sigma + \pi - 1}$, but now $P(c = 1 \mid t = 0) \in [\underline{\kappa} P(c = 1 \mid t = 1), \overline{\kappa} P(c = 1 \mid t = 1)]$, which is below one because of $\overline{\kappa} \leq \frac{\sigma + \pi - 1}{\gamma + \pi - 1}$. Then

$$\rho = \tau P(c = 1 \mid t = 1) + (1 - \tau) P(c = 1 \mid t = 0) \in \left[ \bigl(\tau + (1 - \tau) \underline{\kappa}\bigr) \frac{\gamma + \pi - 1}{\sigma + \pi - 1}, \;\; \bigl(\tau + (1 - \tau) \overline{\kappa}\bigr) \frac{\gamma + \pi - 1}{\sigma + \pi - 1} \right].$$

The fraction is increasing in $\pi$ and decreasing in $\sigma$. The result follows by evaluating at the respective extremes.

The dataset was already explained in Section 1. It remains to obtain data on the test's accuracy as formalized in Assumption 1. Rapid antigen tests were used in the mass testing in Austria. To the best of my knowledge, there is no publicly available data on which specific test was used by each municipality. However, I personally obtained [...]

[5] Many antigen tests currently on the market have very similar quality in terms of observed sensitivity and specificity relative to a PCR test. Therefore, the use of a different test would not change the results significantly.
[6] For all considered PCR tests, their 95% confidence intervals exclude perfect sensitivity.
[7] In these calculations, I use the point estimates of Kaiser et al. (2020).

It remains to specify the selection parameters $(\underline{\kappa}, \overline{\kappa})$. For this, I consider several cases corresponding to different assumptions about selection and explain the effects using the city of Graz (Graz-Stadt) as an example. Graz had a participation rate of slightly more than 20%, and of these participants 0.9% obtained a positive test result. The results for each county (together with participation $\tau$ and test yield $\gamma$) are shown in Table 1 and Table 2.

No Assumption. This case corresponds to Proposition 1. Here any kind of selection, negative or positive, is allowed for. Correspondingly, the bounds on prevalence are rather wide. For Graz the bounds are [0.10%, 79.70%], which excludes the possibility of 0% prevalence, in contrast to many other counties in Table 1.

No selection. Next, consider a scenario of no selection, which embodies a very strong assumption and one which might not be appropriate in the current context. No selection means that the testing decision was as if randomly assigned, and therefore the results from the tested people are representative for the entire population. Mathematically this means $P(c = 1 \mid t = 0) = P(c = 1 \mid t = 1)$ (or equivalently in the current framework, $\underline{\kappa} = \overline{\kappa} = 1$).
With this strong (and most likely unwarranted) assumption, the bounds for Graz reduce to [0.49%, 1.68%], which still leaves an interval width of more than one percentage point. This uncertainty stems from the imprecise testing technology.

Negative Selection. As explained before, this assumption is credible in the current context, but it is still very weak, as it just imposes $P(c = 1 \mid t = 0) \geq P(c = 1 \mid t = 1)$ (with $\underline{\kappa} = 1$ and $\overline{\kappa} = \infty$). However, with negative selection only, it is still possible that every untested person is infected, and therefore the upper bound is not reduced relative to the No Assumption scenario. For Graz the bounds are [0.49%, 79.70%]. Note that the lower bound is almost five times higher than in the No Assumption scenario, meaning that this assumption, although weak, has quite some power for the 'best-case' prevalence.

Restricted Negative Selection. Here, the selection assumption from before is maintained, but there is also an upper bound on selection. In particular, $P(c = 1 \mid t = 0) \leq 2 \times P(c = 1 \mid t = 1)$, i.e. non-tested people are at most twice as likely to be infected as tested ones. Formally, this case is obtained with $\underline{\kappa} = 1$ and $\overline{\kappa} = 2$. For Graz this additional assumption gives prevalence bounds [0.49%, 3.01%]. Here, and for all the other counties as well, the upper bound is reduced by a large amount relative to the Negative Selection scenario. Thus, this assumption of restricted negative selection has quite some identification power, although "twice as likely" does not appear to be unrealistic.

Small Ambiguous Selection. Finally, consider a case where no knowledge about the direction of selection is warranted, but there is evidence for small selection, meaning $P(c = 1 \mid t = 0)$ is close to $P(c = 1 \mid t = 1)$. In particular, here the assumption is that non-tested people have an infection probability between 95% and 105% of the tested people's infection probability (i.e. $\underline{\kappa} = 0.95$ and $\overline{\kappa} = 1.05$). In this scenario, bounds are pretty tight, and for Graz they are [0.47%, 1.75%].
Since the Small Ambiguous Selection scenario allows for positive selection, its lower bound is lower than in the three preceding scenarios (0.47% compared with 0.49% for Graz).

Sensitivity of Different RT-qPCR Solutions for SARS-CoV-2 Detection
Gemischte Beteiligung, wenige Positive: Die Bilanz der Massentests auf Bezirksebene
Validation Report: SARS-CoV-2 Antigen Rapid Diagnostic Test
Anatomy of the Selection Problem
Estimating the COVID-19 Infection Rate: Anatomy of an Inference Problem
Epidemiology: An Introduction
What Can We Learn about SARS-CoV-2 Prevalence from Testing and Hospital Data
FAQ: Bevölkerungsweite Testungen
Bounding Disease Prevalence by Bounding Selectivity and Accuracy of Tests: The Case of COVID-19
Statistical Methods in Diagnostic Medicine
Binary Classification Tests, Imperfect Standards, and Ambiguous Information