Binary Classification Tests, Imperfect Standards, and Ambiguous Information

Gabriel Ziegler

2020-12-21

Abstract. New binary classification tests are often evaluated relative to a pre-established test. For example, rapid Antigen tests for the detection of SARS-CoV-2 are assessed relative to more established PCR tests. In this paper, I argue that the new test can be described as producing ambiguous information when the pre-established test is imperfect. This allows for a phenomenon called dilation -- an extreme form of non-informativeness. As an example, I present hypothetical test data satisfying the WHO's minimum quality requirement for rapid Antigen tests which leads to dilation. The ambiguity in the information arises from a missing data problem due to imperfection of the established test: the joint distribution of true infection status and test results is not observed. Using results from copula theory, I construct the (usually non-singleton) set of all these possible joint distributions, which allows me to assess the new test's informativeness. This analysis leads to a simple sufficient condition ensuring that a new test is not a dilation. I illustrate my approach with applications to data from three COVID-19 related tests. Two rapid Antigen tests satisfy my sufficient condition easily and are therefore informative. However, less accurate procedures, like chest CT scans, may exhibit dilation.

1 Introduction

An important aspect of evaluating a new diagnostic test is to assess its accuracy. Intuitively, a sensible binary test should have test results highly correlated with the underlying health condition. In other words, a positive test result should be likely if and only if the tested person is indeed infected or sick. 1 However, establishing whether a person is truly infected is often costly or even impossible.
Therefore, a new test is analyzed relative to an established test. An established test is perfect when a positive test result occurs if and only if the person is truly infected. The medical literature calls these perfect tests a "gold standard" (Watson et al., 2020). In these situations the joint distribution of the new test's outcomes and the underlying true health condition is the same as the joint distribution of test results from both tests. Thus, this observed joint distribution can be used to evaluate the new test's accuracy. In practice, however, a perfect reference test does not exist. In such a case, the researcher would need the joint distribution of the health conditions and the outcomes of both tests. 2 This overall joint distribution is not observable (or perhaps only observable if the researcher incurs high costs to obtain the data). This missing data problem leads to two distinct problems: (i) the marginal distribution of the underlying health condition is missing and (ii) the correlation between the new test's outcome and health status is missing too. 3 The latter of these problems will introduce ambiguity in the information provided by the new test. The first of these problems, missing data about the underlying health condition, is well-known. Recently, Manski and Molinari (2021) use methods known from the literature on partial identification to provide bounds on prevalence-the fraction of infected people in the population. 4 Measuring prevalence is different from the usual inference problem because the tested population might not be representative of the overall population. Here the data are observed selectively, which corresponds to a selection problem as introduced by Manski (1989). Furthermore, Manski (2020) illustrates how this problem carries over to evaluating the accuracy of new tests in the context of COVID-19 Antibody tests, maintaining the assumption of a perfect reference test.
The second problem of missing data about the correlation is different in nature and avoided when a perfect reference test is available. Even if one assumed knowledge of prevalence, potentially multiple 'correlation structures' are consistent with the observed data. The reason for this multiplicity is well known from the theory of copulas in probability theory. Knowledge of prevalence provides the marginal distribution of the health condition, whereas the observed testing data provides a (bivariate) marginal distribution. In general, there are multiple (trivariate) joint distributions with these marginal distributions. Due to this multiplicity, a simple, unambiguous interpretation of the new test is not possible. Without knowledge of prevalence, the problem identified before carries over and therefore exacerbates the overall multiplicity. However, as discussed in more detail later, the ambiguous information stems only from the missing data on correlation and therefore occurs whether or not the researcher has knowledge about prevalence. In this paper, I provide a theoretical framework that combines insights from Manski and Molinari (2021) and Stoye (2020) about selective testing with the missing correlation data due to an imperfect reference test. Within this framework, it is possible to address the informativeness of both tests. First, Proposition 1 shows that the established test's negative predictive value 5 is usually not given by a unique number, but it is always informative nevertheless. This multiplicity arises because of problem (i) only. Then, I analyze the new test's informativeness for the test population only. The focus on the tested population simplifies the algebra and furthermore shuts down the ambiguity about prevalence (cf. problem (i)) and therefore allows me to study the essence of ambiguous information for the new test in isolation (problem (ii) only). Finally, I study the implications for informativeness when both effects are present.
Studying the informativeness of tests has a long tradition in probability theory, statistics, economics, and philosophy. Blackwell (1951, 1953) introduces a notion of "(more) informative" for (what is now called Blackwell) experiments. 6 An experiment is a mapping from states of the world to a distribution over signals. In the current setting, an experiment is a function that associates a distribution over test results to each of the possible health conditions, i.e. for being infected and for being healthy. In such a setting, the value of information is defined as the amount a Bayesian decision maker is willing to pay for the experiment.

5 A test's negative predictive value is the probability of being healthy conditional on obtaining a negative test result. Another important informativeness measure is the positive predictive value, which is the probability of being infected conditional on a positive test result. I will assume throughout that the established test has a perfect positive predictive value, in line with the application to SARS-CoV-2 testing.

Since every experiment is more informative than an uninformative experiment, 7 Blackwell's theorem shows that the value of information is (weakly) positive for every Bayesian decision maker. 8 Ideally a diagnostic test should satisfy Blackwell's definition of an experiment in order to ensure that it is always informative. However, this is typically only true for the established test in my framework. The new test fails to be a Blackwell experiment because it does not map each state to a unique distribution over test results. Rather, due to the multiplicity of joint distributions, there is a set of distributions over test results for a given health condition. 9 Therefore, Blackwell's informativeness notion does not apply to the new test. Furthermore, the value of information needs to be adjusted because a Bayesian analysis does not readily apply with sets of probabilities.
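The notion of a Blackwell experiment can be made concrete with a short numerical sketch: an experiment is a likelihood matrix mapping each health state to a distribution over test results, and a Bayesian updates on the observed signal. The likelihood values and prior below are illustrative and not taken from the paper.

```python
import numpy as np

# A (Blackwell) experiment: each state (row) maps to a distribution over
# signals (columns). Illustrative numbers for an established test.
experiment = np.array([[0.98, 0.02],   # x=0: P(y=0|x=0), P(y=1|x=0)
                       [0.10, 0.90]])  # x=1: P(y=0|x=1), P(y=1|x=1)

def posterior(prior, signal):
    """Bayesian update of P(x=1) after observing test result y = signal."""
    likelihoods = experiment[:, signal]
    joint = np.array([1 - prior, prior]) * likelihoods
    return joint[1] / joint.sum()

p_after_pos = posterior(prior=0.05, signal=1)  # P(x=1 | y=1)
p_after_neg = posterior(prior=0.05, signal=0)  # P(x=1 | y=0)
```

Because each state maps to a unique signal distribution, the posterior moves in the intuitive direction after each result; the new test in the paper fails exactly this property, since a whole set of likelihoods is consistent with the data.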
Such a situation is usually referred to as a situation of "ambiguity" and the literature has identified several extensions of Bayesian decision making to the realm of ambiguity. 10 Instead of defining the value of information for a specific decision criterion in such a situation, I adopt a very weak notion of informativeness: the diagnostic test is informative if and only if it is not a dilation. Seidenfeld and Wasserman (1993) introduce the notion of dilation for situations with sets of probabilities. In the current context, a dilation occurs if, no matter what test result is obtained, the set of probabilities conditional on this information contains the original set of probabilities. Figure 1 illustrates an example of a dilation. Here, the set of probabilities indicating the infection likelihood before the test (black set) lies within both sets after the test result (blue for a positive result, red for a negative result). Thus, in a sense, the decision maker is worse off after taking the test than before taking the test no matter what the test result is. For this reason, Seidenfeld and Wasserman call a dilation a "counterintuitive phenomenon" and Gul and Pesendorfer (2018) refer to it as "all news is bad news". My framework allows me to fully characterize when a new diagnostic test is a dilation (cf. Expression (3)). Since the definition of informativeness for tests is rather weak, any reasonable test should satisfy this criterion. The characterization provides a method to verify whether the new test is informative. The WHO (2020) recommends a minimum standard of accuracy for rapid Antigen tests. 11 Usually a PCR test is the established test used to evaluate these Antigen tests (Esbin et al., 2020). Table 1 presents hypothetical test data that satisfies these minimum requirements and yet constitutes a dilation. Dilation has been studied theoretically, for example by Shishkin and Ortoleva (2020). To the best of my knowledge, the possible occurrence of dilation with diagnostic tests (or SARS-CoV-2 tests more specifically) is the first observation of this phenomenon 'in the field.' 13
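The mechanism behind dilation, an unknown correlation between two variables with known marginals, can be sketched in a few lines. This is a stylized two-coin example in the spirit of Seidenfeld and Wasserman, not the data of the paper's Figure 1: the prior P(x=1) = 1/2 is a single point, yet after either signal the set of posteriors is the whole interval [0, 1].

```python
from fractions import Fraction

def frechet_interval(px, pz):
    """Frechet-Hoeffding bounds on the joint P(x=1, z=1)
    when only the marginals P(x=1) and P(z=1) are known."""
    return (max(Fraction(0), px + pz - 1), min(px, pz))

def posterior_set(px, pz):
    """Interval of possible values of P(x=1|z=1) and P(x=1|z=0)
    over all joints consistent with the marginals."""
    lo, hi = frechet_interval(px, pz)
    post_pos = (lo / pz, hi / pz)                            # P(x=1 | z=1)
    post_neg = ((px - hi) / (1 - pz), (px - lo) / (1 - pz))  # P(x=1 | z=0)
    return post_pos, post_neg

px = Fraction(1, 2)   # prior probability of infection (a single point)
pz = Fraction(1, 2)   # marginal probability of a positive signal
pos, neg = posterior_set(px, pz)
# Both posterior intervals are [0, 1] and strictly contain the prior {1/2}:
# the signal dilates, i.e. "all news is bad news".
```

The missing-correlation problem in the paper works the same way: the observed data pin down marginals, while the joint distribution, and hence the conditional probability of infection, is only set-identified.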
Of course, researchers studying diagnostic tests are well aware of the general issues addressed here. The problem of selection leading to unobserved prevalence is known as Verification Bias, whereas the problem arising from unobserved correlation due to an imperfect reference test is descriptively named Imperfect Gold Standard Bias (Zhou et al., 2014, Chapters 10-11). This paper is not the first to document that either of these problems leads to non-identified models; rather, the novelty of this paper comes in the approach. Diagnostic test research seeks to avoid non-identified models by introducing additional assumptions and then addresses resulting biases relative to a baseline assumption. Proposed methods include simply imputing missing data or considering more sophisticated correction methods. Moreover, the two problems are often addressed separately. By contrast, my framework requires minimal assumptions and addresses both problems simultaneously. 14

2 Main Analysis

I consider the following situation. Let x = 1 denote that a person is infected and x = 0 that the person is healthy. Initially, there is a binary test available where y = 1 indicates a positive test result and y = 0 a negative result. Finally, a new test is introduced which again can be either positive (z = 1) or negative (z = 0). Let P(x, y, z) denote the population distribution under consideration with p := P(x = 1) denoting prevalence. However, the population distribution is not directly observable. This is, of course, almost always the case, because a researcher usually only observes a sample from the population distribution. This leads to the usual inference problem. Throughout, I will abstract away from inference altogether. Instead, the data is given for people who were tested to obtain data on the new test.

13 Manski (2018) mentions that a dilation might occur in a different medical context, but does not address this issue further.

14 Reitsma et al.
(2009) provide a flowchart as guidance for applied researchers to address several problems arising when establishing the accuracy of diagnostic tests. The two problems addressed here are in two distinct branches of the flowchart.

For this, denote tested people by t = 1 and t = 0 otherwise. Then, the data are given by P(y, z|t = 1) and I assume that P(t = 1) > 0. 15, 16 Furthermore, since the established test is well-known, precise information about the sensitivity and specificity of this test is available as well. The following assumption ensures that both of these measures are well defined.

Assumption 1 (Non-trivial prevalence). The population satisfies p ∈ (0, 1).

With this assumption, sensitivity and specificity for the initial test are respectively defined as σ := P(y = 1|x = 1) and P(y = 0|x = 0). As discussed in Manski (2020), for decision making sensitivity and specificity are not the relevant measures. The relevant measures are the positive predictive value (PPV) and the negative predictive value (NPV). For the established test these measures can be obtained from specificity and sensitivity via Bayes' rule if prevalence p and P(y = 1) are known:

PPV_y := P(x = 1|y = 1) = (p / P(y = 1)) P(y = 1|x = 1) = (p / P(y = 1)) σ
NPV_y := P(x = 0|y = 0) = ((1 − p) / P(y = 0)) P(y = 0|x = 0).

Since the tested people are usually not representative of the overall population, 17 even for the established test these two measures are not point-identified (Manski and Molinari, 2021; Manski, 2020; Stoye, 2020). To simplify the analysis, and in line with the application to SARS-CoV-2 testing, I also consider the following three baseline assumptions.

Assumption 2. The population satisfies P(x = 0, y = 1) = 0.

15 Furthermore, the following logical implications of (not) being tested hold: (i) t = 0 =⇒ z = 0 and (ii) z = 1 =⇒ t = 1. Note that y = 1 is possible even if not tested, because the participation pool concerns only the new test.
16 Equivalently, the data is given by sensitivity and specificity of the new test relative to the established test together with the additional information about how many established or new tests had a positive result.

17 For example, supposedly infected people may be oversampled in order to get meaningful results.

Assumption 2 implies that the established test achieves a maximum specificity and PPV_y of 1. 18 Additionally, I will assume test-monotonicity as in Manski and Molinari (2021), meaning that conditional on being tested the probability of being infected is greater than if not being tested. 19

Assumption 3 (Test-monotonicity). The population satisfies P(x = 1|t = 1) ≥ P(x = 1|t = 0).

Lastly, I assume that the established test's sensitivity does depend on the underlying health status x, but not on whether the person is in the testing pool t = 1. 20

Assumption 4 (Health-sufficiency). The population satisfies P(y = 1|x = 1, t = 1) = P(y = 1|x = 1) = σ.

To reduce cumbersome notation, I will henceforth use the following shorthand:

γ := P(y = 1|t = 1) . . . test yield for the established test
ζ := P(z = 1|t = 1) . . . test yield for the new test
τ := P(t = 1) . . . measure of data representativeness.

To avoid trivial cases, assume that γ, ζ, τ > 0. Note that τ has a slightly different interpretation than in Manski and Molinari (2021) or Stoye (2020). Here, τ = 1 means the data P(y, z|t = 1) is perfectly representative of the overall population. In particular, such a parameter value implies no oversampling of infected participants. 21 Moreover, even if the participation pool is small (as is often the case), this does not mean that τ should be close to zero. 22 With this notation, we have P(z = 1) = τζ since the new test is positive only if the person was tested.
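The shorthand above can be read off directly from an observed table P(y, z|t = 1). The numbers below are purely illustrative, not one of the paper's tests, and τ is an assumed value since it is not identified from the table itself.

```python
import numpy as np

# Observed joint distribution P(y, z | t=1); rows index y, columns index z.
# Hypothetical numbers for illustration only.
P_yz = np.array([[0.60, 0.05],   # y=0: (z=0, z=1)
                 [0.10, 0.25]])  # y=1: (z=0, z=1)

gamma = P_yz[1, :].sum()   # test yield of the established test, P(y=1|t=1)
zeta = P_yz[:, 1].sum()    # test yield of the new test, P(z=1|t=1)
tau = 0.2                  # share of the population tested, P(t=1) (assumed)

# P(z=1) = tau * zeta, since a positive new test requires being tested.
p_z1 = tau * zeta
```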
Furthermore, Assumption 2 combined with Assumption 4 gives P(x = 1|t = 1) = γ/σ. Then, the Law of Total Probability together with Assumption 3 provides sharp bounds 23 on prevalence, p ∈ [τγ/σ, γ/σ] =: [χ̲, χ̄], because

p := P(x = 1) = P(x = 1|t = 1) P(t = 1) + P(x = 1|t = 0) P(t = 0) = τ γ/σ + (1 − τ) P(x = 1|t = 0).

In turn, bounds on the established test's overall positivity rate are implied by sensitivity σ and Assumption 2: P(y = 1) ∈ [τγ, γ]. Since we consider the non-trivial case of p ∈ (0, 1), consistency of the data with the maintained assumptions requires the established test's sensitivity to be sufficiently high, i.e. γ < σ ≤ 1. In turn, the assumptions imply P(y = 1) ∈ (0, 1). Assumption 2 implies a perfect positive predictive value for the established test (PPV_y = 1). However, the negative predictive value is only partially identified and Proposition 1 provides sharp bounds.

Proposition 1. The established test's negative predictive value is sharply bounded as follows: NPV_y ∈ [(1 − χ̄)/(1 − σχ̄), (1 − χ̲)/(1 − σχ̲)].

Proof.

18 This holds because P(x = 0, y = 0) = P(x = 0, y = 0) + P(x = 0, y = 1) = P(x = 0) = 1 − p and P(x = 1, y = 1) = P(x = 1, y = 1) + P(x = 0, y = 1) = P(y = 1).

19 This might not be true if there is voluntary enrollment into the testing pool. However, for establishing the accuracy of new tests this assumption seems to be applicable often. See Footnote 17.

20 Recall that the testing pool is obtained for the new test. This assumption might be violated if, for example, the medical staff performing the established test for the testing pool is extra careful. In this case, the established test might be more sensitive for the testing pool.

21 See Footnote 17 for why such an assumption might be problematic.

22 A small participation pool might worsen the statistical inference problem: suppose the participation pool is perfectly representative but small. In this case τ = 1, but inference usually relies on some sort of central limit theorem which would not be appropriate in this scenario. However, recall that I abstract away from inference problems as mentioned above.
Fix α = P(y = 1) ∈ [τγ, γ] and define prevalence as a function of α by p(α) := α/σ. With this in hand, the established test's informativeness can be analyzed.

Remark: The second row corresponds to PPV_y and the third row is given by 1 − NPV_y.

It is well known that knowledge of prevalence is needed in order to apply Bayes' rule to obtain the NPV. Since in most applications prevalence is not known, a common practice is to assume a given prevalence level. For example, the United States Food and Drug Administration (FDA, 2020b) assumes a prevalence of 5% to calculate PPV and NPV. If such an assumption (p = χ) is added to the maintained assumptions, then P(y = 1) = χσ and furthermore P(y = 1|t = 0) = (χσ − γτ)/(1 − τ). 26 This additional assumption allows one to exactly pin down the established test's NPV as (1 − χ)/(1 − χσ) and therefore P(x = 1|y = 0) = χ(1 − σ)/(1 − χσ). Thus, this additional assumption not only assumes away the ambiguity about prevalence, but also illustrates that the established test does not provide ambiguous information itself. 27 The apparent ambiguity reflected in the non-trivial interval for values of prevalence after a negative test result (cf. Proposition 1) thus arises because of problem (i) only.

26 Alternatively, one could drop the assumption that P(t = 1) = τ is known exactly. In this case (and allowing P(y = 1|t = 0) ∈ [0, γ] as in the general case) the assumed prevalence bounds τ. Calculations show that τ ∈ [0, χσ/γ]. Since the lower bound is always τ_min = 0, we do not find this case very interesting.

27 Technically, the established test is an experiment à la Blackwell (1951), where sensitivity and specificity can be seen as functions mapping (health) states to distributions over signals (i.e. test results). As mentioned in the introduction, this implies that the established test's value of information is (weakly) positive under these assumptions.

Next, the new test's informativeness is analyzed. First, I will discuss informativeness only based on the tested population.
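The calculation under an assumed prevalence can be sketched as follows. The value χ = 5% mimics the FDA convention mentioned above, and σ = 0.9 is an illustrative sensitivity, not a value from the paper.

```python
def npv_assumed_prevalence(chi, sigma):
    """Established test's NPV when prevalence is assumed to equal chi and
    specificity is perfect (Assumption 2), so that P(y=1) = chi * sigma:
    NPV = (1 - chi) / (1 - chi*sigma)."""
    npv = (1 - chi) / (1 - chi * sigma)
    p_inf_given_neg = chi * (1 - sigma) / (1 - chi * sigma)  # P(x=1 | y=0)
    return npv, p_inf_given_neg

# FDA-style assumed prevalence of 5% with an illustrative sigma = 0.9:
npv, miss = npv_assumed_prevalence(chi=0.05, sigma=0.9)
```

By construction the two conditional probabilities sum to one, which is a quick sanity check on the formulas.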
For this subpopulation the prevalence is given by χ̄ = γ/σ and therefore the ambiguity about prevalence is muted. Subsection 2.4 then extends the analysis to the informativeness of the new test for the overall population. For the test population, the relevant measures are again the positive predictive value (PPV) and the negative predictive value (NPV), but now they are also conditional on being tested. To obtain these measures, the distribution P(x, z|t = 1) is needed. For a fixed τ, I use a result from Joe (1997) that provides the set of all possible joint distributions P(x, y, z) compatible with the data P(y, z|t = 1) (cf. Appendix A). Setting τ = 1 in this construction gives the possible distributions P(x, y, z|t = 1). Finally, to simplify the algebraic expressions it will be useful to differentiate between four cases defined in Table 3. Fixing the established test's sensitivity σ, the test data P(y, z|t = 1) immediately reveals which case the test belongs to. Figure 2 illustrates this for three real tests considered later (StQ, BiN, CT) and three hypothetical tests (including the dilation test from Table 1). When σ → 1, all but the informative case (I) cease to be relevant. For SARS-CoV-2 detecting Antigen tests the WHO recommends a minimum specificity close to one. Tests close to the (top-right) frontier in Figure 2 satisfy this criterion. 28 Thus, for most applications, either the confirmatory case (if σ < 1) or the informative case (if σ ≈ 1) will be the relevant one. In contrast to the established test, the new test's PPV could be less than one and is, in general, only set-identified. The reason for set-identification is that P(x, z|t = 1) is not directly observed. As explained above, there are multiple distributions P(x, z|t = 1) consistent with the data and each distribution leads to a potentially different PPV. Proposition 2 establishes the sharp identified set for values of PPV.
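The logic behind the set-identification of PPV_z can be sketched as follows. The code uses only two facts from the text: perfect specificity implies y = 1 ⟹ x = 1, and the testing-pool prevalence equals γ/σ. It deliberately ignores the boundary cases of Proposition 2 (it assumes the extra infected mass fits into the y = 0 cells), and all numbers are illustrative.

```python
def ppv_bounds(P_yz, sigma):
    """Bounds on the new test's PPV within the testing pool. A sketch of
    the logic behind Proposition 2 without the full case distinction:
    perfect specificity pins down the y=1 rows, while the infected mass
    with y=0 (a false negative of the established test) can be placed
    on either z-column."""
    gamma = P_yz[1][0] + P_yz[1][1]           # P(y=1 | t=1)
    zeta = P_yz[0][1] + P_yz[1][1]            # P(z=1 | t=1)
    chi_bar = gamma / sigma                   # prevalence in the testing pool
    extra = chi_bar - gamma                   # infected mass with y=0
    lo = P_yz[1][1]                           # all extra infected have z=0
    hi = P_yz[1][1] + min(P_yz[0][1], extra)  # as many as possible have z=1
    return lo / zeta, hi / zeta

# Hypothetical table P(y, z | t=1): rows y=0, y=1; columns z=0, z=1.
lo, hi = ppv_bounds([[0.60, 0.05], [0.10, 0.25]], sigma=0.9)
```

The width of the resulting interval is exactly the ambiguity created by the missing correlation data; with σ = 1 the interval collapses to the point P(y = 1|z = 1, t = 1).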
Uni is a test corresponding to a uniform distribution P(y, z|t = 1) = 1/4. Anti always produces the opposite result of the established test: P(y = 1, z = 0|t = 1) = P(y = 0, z = 1|t = 1) = 1/2. StQ, BiN, and CT are real tests studied later in Section 3. Dil is the dilation test given by Table 1.

Proof. Conditioning on t = 1 is the same as setting τ = 1. Thus, from Table 20 and Table 21, we obtain the lower and upper bounds respectively, 29 and divide by ζ to obtain the PPV.

To avoid partially identified predictive values, these measures for new tests are often reported as if the reference test were perfect. In this case, the data P(y, z) alone delivers a unique predictive value:

Corollary 1 (Perfect Gold Standard - PPV). Suppose Assumption 1, Assumption 2, and Assumption 4 hold. If σ = 1, i.e. the established test has perfect sensitivity, then PPV_z = P(y = 1|z = 1, t = 1).

Proof. If σ = 1, then χ̄ = γ = P(y = 1|t = 1). Thus, the relevant case is (I). 30 Therefore the lower bound is P(y = 1|z = 1, t = 1) and the upper bound is the same, so PPV_z is point-identified.

We saw before that the established test always achieves a maximal PPV of one and therefore provides a lot of information in case it delivers a positive result. How informative is a positive result of the new test? To answer this question, note that since we condition on being tested, there is no prior uncertainty, as the prevalence in the testing pool is given by χ̄. Even without this prior ambiguity there remains ambiguity in the test result. For example, in the confirmatory case (C) the width of the interval of possible values for PPV_z is 1 − P(y = 1|z = 1, t = 1), which is usually small, but non-zero, in applications. Thus, the information obtained from a new test is ambiguous at least after a positive test result. 31

29 Here and in the following I will use P̲ to denote lower bound distributions and P̄ for upper bounds.

30 Technically, one could be on the border of case (C) or (U), but by continuity the resulting bounds do not change.
31 This observation alone implies the test is not an experiment à la Blackwell (1951).

Similarly to deriving the PPV, it is possible to derive the new test's sensitivity. This will, in general, be a set. The ambiguity arising from the new test allows for the occurrence of dilation. In the current setting a dilation occurs if χ̄ = P(x = 1|t = 1) is contained in the intersection of the two sets of possible values for P(x = 1|z = i, t = 1) for each test result i ∈ {0, 1}. Is it possible that after a positive test result the set of possible values for P(x = 1|z = 1, t = 1) contains χ̄ = P(x = 1|t = 1)? Corollary 2 provides a full characterization. The corresponding case after a negative test result will be discussed afterwards.

Proof. For the upper bound, note that a strict decrease can happen if and only if cases (I) or (X) occur (i.e. 1 − χ̄ > P(y = 0, z = 0|t = 1)) and P(y = 1|z = 1, t = 1) + χ̄(1 − σ)/ζ < χ̄. The first is equivalent to σ > γ/(1 − P(y = 0, z = 0|t = 1)) and the second to σ > (1 − ζ)/P(z = 0|y = 1, t = 1) = γ/P(y = 1|z = 0, t = 1). Since P(y = 0, z = 0|t = 1) ≤ P(y = 0|z = 0, t = 1), we also have P(y = 1|z = 0, t = 1) ≤ 1 − P(y = 0, z = 0|t = 1). Thus, a strict decrease happens if and only if σ > γ/P(y = 1|z = 0, t = 1). For the lower bound, note that 1 − (1 − χ̄)/ζ ≤ χ̄ always holds. For the other cases (C and I, i.e. P(y = 0, z = 0|t = 1) ≥ χ̄(1 − σ)), a strict increase means P(y = 1|z = 1, t = 1) > χ̄, which is equivalent to σ > γ/P(y = 1|z = 1, t = 1), whereas the case condition is equal to σ ≥ γ/(1 − P(y = 0, z = 1|t = 1)). Similar to above, P(y = 1|z = 1, t = 1) ≤ 1 − P(y = 0, z = 1|t = 1) and therefore a strict increase happens if and only if σ > γ/P(y = 1|z = 1, t = 1).

The inequality of Corollary 2 becomes non-trivial in case the testing data does not correspond to an independent distribution, which will be the case for most applications. In these cases, a dilation cannot occur when the non-trivial inequality of Corollary 2 is violated.
It remains to analyze the information contained in a negative test result.

Proof. Conditioning on t = 1 is the same as setting τ = 1. Thus, from Table 20 and Table 21, we obtain respectively:

P̲(x = 0, z = 0|t = 1) = max{0, P(y = 0, z = 0|t = 1) − χ̄(1 − σ)} and P̄(x = 0, z = 0|t = 1) = min{1 − χ̄, P(y = 0, z = 0|t = 1)}.

Division by P(z = 0|t = 1) = 1 − ζ gives the NPV.

The uninformative (U) and contradictory (X) cases seem problematic in light of Proposition 3. In both cases, the lower bound is zero and the width of the interval is rather large. This is another indication that any reasonable test should not fall into either of these two cases. However, even for the other cases-and as for the PPV-the NPV is generally only set-identified. Therefore a negative test result also produces ambiguous information. Avoiding this ambiguity can be achieved with a perfect reference test. Corollary 3 verifies that if the reference test is perfect, Proposition 3 reduces to the expression used in many applications, which can be calculated directly from the data P(y, z).

Corollary 3 (Perfect Gold Standard - NPV). Suppose Assumption 1, Assumption 2, and Assumption 4 hold. If σ = 1, i.e. the established test has perfect sensitivity, then NPV_z = P(y = 0|z = 0, t = 1).

Proof. If σ = 1, then γ(1 − σ)/((1 − ζ)σ) = 0 and, as in the proof of Corollary 1, the relevant case is (I).

If there is no perfect reference test available, a negative result of the new test leads to ambiguity. Similar to the case of a positive test result, this ambiguity allows for the occurrence of dilation. Using Proposition 3, Corollary 4 provides a characterization of when the set of possible values of P(x = 1|z = 0, t = 1) = 1 − NPV_z contains the prior information P(x = 1|t = 1) = χ̄.

Proof. For the upper bound, rearranging gives σ > ζ/P(z = 1|y = 1, t = 1) = γζ/P(y = 1, z = 1|t = 1). As in the proof of Corollary 2, the condition for being in case (C) or (I) is implied by this condition. For the lower bound, a decrease occurs as follows. First, χ̄ − ζ ≤ χ̄(1 − ζ) always holds as χ̄ ≤ 1. Second, rearranging P(y = 1, z = 0|t = 1) ≤ (1 − ζ)γ/σ provides the condition.
Corollary 2 combined with Corollary 4 provides an exact characterization of when the new test is a dilation. In fact, as the conditions are the same, a dilation occurs if and only if

σ ≤ min{γζ/P(y = 1, z = 1|t = 1), γ(1 − ζ)/P(y = 1, z = 0|t = 1)}. (3)

When evaluating a new test's accuracy it is important to make sure the data violates Expression (3). Otherwise, the test is uninformative in an extreme sense. In typical applications, the data often satisfies P(y = 1, z = 0|t = 1) ≤ γ(1 − ζ). 32 In these cases, a dilation can only occur if σ ≤ γζ/P(y = 1, z = 1|t = 1). The WHO (2020) recommends minimum quality requirements using only information directly provided by the data P(y, z|t = 1). In light of this analysis, an evaluation should also take σ, the established test's sensitivity, into account and thereby make sure that the test is not a dilation. The condition σ ≤ γζ/P(y = 1, z = 1|t = 1) combined with a given minimum standard provides an easy-to-verify sufficient condition to avoid dilation. For this, let Σ̲ be a minimum (apparent) sensitivity threshold below which a test is deemed not reliable and denote the new test's apparent sensitivity by Σ = P(z = 1|y = 1, t = 1), so that a test is reliable if Σ > Σ̲. 33 Then, the application-relevant case of Expression (3) to avoid a dilation can be expressed as σ > ζ/Σ or equivalently as Σ > ζ/σ. If Σ̲ ≥ ζ/σ, then any test meeting the minimum requirement cannot be a dilation. Thus, it suffices to make sure the new test's yield is not too high: 34

ζ := P(z = 1|t = 1) ≤ σ × Σ̲. (4)

If the established test is highly sensitive, i.e. σ ≈ 1, then Expression (4) essentially only requires the new test's yield not to exceed the minimum sensitivity threshold.

32 The test data of Cassaniti et al. (2020) satisfies this inequality. I thank Filip Obradovic for making me aware of this report.

33 Usually, the minimum requirements also include a threshold for specificity, but this does not matter here.
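Expression (3) and the sufficient condition in Expression (4) are easy to operationalize. The following sketch uses an illustrative data table, not one of the paper's tests, and an illustrative σ; the threshold 80% mirrors the WHO's minimum sensitivity requirement mentioned in the text.

```python
def dilation_cutoff(P_yz):
    """Cutoff sensitivity sigma* from Expression (3): the new test is a
    dilation if and only if the established test's sensitivity sigma is
    at or below this value. P_yz is P(y, z | t=1) with rows y, columns z."""
    gamma = P_yz[1][0] + P_yz[1][1]   # P(y=1 | t=1)
    zeta = P_yz[0][1] + P_yz[1][1]    # P(z=1 | t=1)
    return min(gamma * zeta / P_yz[1][1],
               gamma * (1 - zeta) / P_yz[1][0])

def passes_sufficient_condition(P_yz, sigma, Sigma_min):
    """Expression (4): if the new test's yield zeta is at most sigma times
    the minimum sensitivity threshold, any test meeting the threshold
    cannot be a dilation."""
    zeta = P_yz[0][1] + P_yz[1][1]
    return zeta <= sigma * Sigma_min

# Hypothetical table P(y, z | t=1): rows y=0, y=1; columns z=0, z=1.
P_yz = [[0.60, 0.05], [0.10, 0.25]]
sigma_star = dilation_cutoff(P_yz)
# WHO-style minimum sensitivity threshold of 80%:
ok = passes_sufficient_condition(P_yz, sigma=0.9, Sigma_min=0.80)
```

For this hypothetical table the cutoff σ* is 0.42, so any established test with sensitivity above 42% rules out dilation, and the yield condition (4) holds as well.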
34 It is worth recalling that this is a sufficient condition when, additionally, γζ/P(y = 1, z = 1|t = 1) ≤ γ(1 − ζ)/P(y = 1, z = 0|t = 1); in many applications this latter inequality becomes irrelevant for Expression (3) because its right-hand side is greater than one.

The next result shows that P(x = 1|y = 1, z = 1, t = 1) = P(x = 1|y = 1, z = 0, t = 1) = 1 and provides bounds on P(x = 0|y = 0, z = 0, t = 1) and P(x = 0|y = 0, z = 1, t = 1), the latter differing in case (X).

Proof. If y = 1, then the PPV has to be one independently of the new test's result because of Assumption 2. For the NPV, again start from Table 20 and Table 21 with τ = 1. First, consider the case of both tests matching, i.e. y = 0 = z:

P̲(x = 0, y = 0, z = 0|t = 1) = max{0, γ − χ̄ + P(y = 0, z = 0|t = 1)} = max{0, P(y = 0, z = 0|t = 1) − χ̄(1 − σ)} and P̄(x = 0, y = 0, z = 0|t = 1) = min{1 − χ̄, P(y = 0, z = 0|t = 1)}.

Now, divide by P(y = 0, z = 0|t = 1) to obtain P(x = 0|y = 0, z = 0, t = 1). In the case of differing test results, the relevant probabilities are obtained analogously. Note that P̄(x = 0, y = 0, z = 1|t = 1) ≥ P̲(x = 0, y = 0, z = 1|t = 1) in this case. Division by P(y = 0, z = 1|t = 1) gives P(x = 0|y = 0, z = 1, t = 1).

Proposition 1 bounds the established test's NPV for the overall population, not only for the tested population. The new test, on the other hand, has so far been analyzed for the testing pool only. The full characterization in Appendix A allows extending the analysis of the new test to an evaluation for the overall population. Since this involves more cumbersome notation, I only illustrate the resulting bounds for NPV = P(x = 0|z = 0). The analysis of the PPV would proceed in a similar manner.

Proof. From Table 20 and Table 21, we obtain the lower and upper bounds respectively. Now, the result follows from dividing by P(z = 0) = 1 − τζ.

For the new test's sensitivity and specificity in the whole population another complication arises. Conditional on the testing pool, both of these measures can be derived as in Subsection 2.2.
For example, for sensitivity one could use the proof of Proposition 2 and divide by P(x = 1|t = 1) = χ̄ instead of P(z = 1|t = 1) = ζ. The bounds for sensitivity are again determined by considering the extremes of Table 20 and Table 21. For the unconditional sensitivity, however, the numerator and the denominator are both set-identified because P(x = 1) ∈ [χ̲, χ̄]. Therefore, the lower bound might not be attained at either of the extreme distributions. This makes solving for a closed-form expression for sensitivity intractable. Nonetheless, the bounds can easily be obtained computationally by considering a fixed Γ := P(y = 1) ∈ [τγ, γ] with corresponding p = Γ/σ. For this Γ, sharp bounds on sensitivity, say [L_Γ, H_Γ], can be obtained by using Table 18 and Table 19. To find the overall bounds for sensitivity, two (non-linear) optimization problems across all values of Γ need to be performed to give [min_Γ L_Γ, max_Γ H_Γ].

3 Applications

In this section, the theoretical framework will be illustrated with several applications. First, I analyze the (hypothetical) dilation test presented in the introduction. Then, I examine two real SARS-CoV-2 detecting tests. Finally, I show that CT-scanning procedures to detect COVID-19 are prone to being dilations.

As argued before, the hypothetical test data in Table 1 corresponds to a dilation. Suppose the test data is derived for an Antigen test to detect SARS-CoV-2 and the reference test is a PCR test. 35 The test satisfies the WHO's (2020) minimum requirements with apparent sensitivity (Σ = 80.6%) and specificity (97.1%) above the specified thresholds of 80% and 97%, respectively. 36 For such a setting the current framework is applicable. In particular, Assumption 2 seems to be warranted because a PCR test is highly specific. However, it is known that a PCR test might lack high sensitivity. Alcoba-Florez et al. (2020) report sensitivity for several PCR tests with point estimates ranging from σ = 60.2% to σ = 97.9%.
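The grid procedure described above can be sketched generically. The two bound functions below are placeholders standing in for the omitted expressions of Tables 18 and 19, not the paper's actual bounds, and the parameter values are illustrative.

```python
import numpy as np

def overall_bounds(L, H, grid):
    """Computational approach from the text: for each candidate
    Gamma = P(y=1) on a grid, compute sharp bounds [L(Gamma), H(Gamma)]
    and combine them into overall bounds [min L, max H]."""
    lows = [L(g) for g in grid]
    highs = [H(g) for g in grid]
    return min(lows), max(highs)

# Illustrative parameters and placeholder bound functions (hypothetical,
# not the expressions of Tables 18-19):
tau, gamma = 0.2, 0.35
grid = np.linspace(tau * gamma, gamma, 1000)   # Gamma ranges over [tau*gamma, gamma]
lo, hi = overall_bounds(lambda g: 0.5 * g / gamma,
                        lambda g: min(1.0, 0.5 + g),
                        grid)
```

With monotone bound functions the extremes sit at the grid endpoints, but the generic min/max over the grid also handles the non-monotone case the text warns about.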
37 All of the 95% confidence intervals exclude perfect sensitivity, σ = 1.

Using the results from Subsection 2.2, Table 4 summarizes some key statistics for the dilation test. When the PCR sensitivity is close to one, the new (hypothetical) test produces relatively accurate measurements with PPV close to one and NPV above 75%. However, if the PCR test lacks high sensitivity, then we cannot be sure of the dilation test's quality. In the worst case for PCR sensitivity (σ = 0.6), the new test is indeed a dilation: before a test result was obtained, the probability of being infected (in the testing pool) was 98.8%; after obtaining either of the dilation test's results, the set of possible infection probabilities is at least the interval [97.7%, 100%]. In fact, potentially even more puzzling is that the lowest possible value after a negative test is strictly higher than after a positive result. Using Expression (3), σ* = 60.8% is the cutoff PCR sensitivity below which a dilation occurs.

35 Recall that for SARS-CoV-2 detection a PCR test is the established test used to evaluate other tests (Esbin et al., 2020).

36 These numbers are calculated as if the reference test were perfect. This is similar to Corollary 1 and Corollary 3.

Next, consider the Standard Q (StQ) COVID-19 Rapid Antigen Test of SD Biosensor/Roche for detection of SARS-CoV-2, as analyzed by Kaiser et al. (2020). They use results of PCR tests as comparison (see Footnote 35). The testing data are summarized in Table 5. When σ = 1, StQ's PPV and NPV are obtained as in the perfect-reference case. Focusing first on the testing pool only, Table 6 summarizes PPV and NPV for different values of PCR sensitivity (σ) using Proposition 2 and Proposition 3. Even if the PCR test lacks high sensitivity, StQ has a close to perfect positive predictive value (PPV ≈ 1). However, the values for NPV drop significantly as σ decreases. In the worst case, a negative StQ result becomes close to a fair coin flip.
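The dilation check behind these numbers can be sketched for the testing pool (τ = 1) under Assumption 2 (perfect specificity of the reference test). The counts-free structure follows the extreme distributions of Tables 20 and 21 specialized to τ = 1; the example distribution passed in at the end is hypothetical, not the data of Table 1:

```python
def posterior_bounds(p_yz, sigma):
    """Bounds on P(x=1 | z) in the testing pool (tau = 1).

    p_yz[(y, z)]: observed joint distribution P(y, z | t=1)
    sigma:        sensitivity of the established test (perfect specificity assumed)
    """
    gamma = p_yz[(1, 0)] + p_yz[(1, 1)]   # P(y=1 | t=1)
    prior = gamma / sigma                  # implied P(x=1 | t=1)
    extra = prior - gamma                  # infected mass the reference test missed
    bounds = {}
    for z in (0, 1):
        pz = p_yz[(0, z)] + p_yz[(1, z)]
        other = p_yz[(0, 1 - z)]
        # the 'extra' mass can sit in either y=0 cell, giving Frechet-style bounds
        lo = (p_yz[(1, z)] + max(0.0, extra - other)) / pz
        hi = (p_yz[(1, z)] + min(extra, p_yz[(0, z)])) / pz
        bounds[z] = (lo, hi)
    return prior, bounds

def is_dilation(p_yz, sigma):
    """Dilation: every test result's posterior interval strictly contains the prior."""
    prior, bounds = posterior_bounds(p_yz, sigma)
    return all(lo < prior < hi for lo, hi in bounds.values())

# hypothetical, well-aligned test data: not a dilation
example = {(1, 1): 0.45, (1, 0): 0.05, (0, 1): 0.05, (0, 0): 0.45}
```

Scanning `is_dilation` over a grid of σ values recovers the cutoff σ* below which a given data set turns into a dilation.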
However, the test is very informative overall, as can be seen from the low dilation threshold σ* = 36.3%. Kaiser et al. (2020, p. 1) state: "study participants were representative of the usual population seeking testing in our center (main testing center in Geneva). The majority were presenting with symptoms compatible with a SARS-CoV2 infection and a minority were asymptomatic but with a known positive contact or were asymptomatic healthcare workers." The current framework therefore allows using the obtained testing data to evaluate StQ's quality for the overall population (of Geneva), as analyzed in Subsection 2.4. Furthermore, this description supports Assumption 3. Table 7 shows bounds on prevalence using the baseline analysis in Section 2. For low values of τ, i.e. when the testing pool is highly non-representative of the overall population, the intervals are rather wide. However, even the lowest number is close to 2%, indicating a thorough spread of the virus in Geneva at the time of testing. 39 When testing becomes representative (τ → 1), the prevalence converges to the prevalence in the testing pool. Table 9 provides the numbers using Proposition 5. The lower bounds are significantly lower than for the PCR test, which also makes the intervals significantly wider. The widening reflects the combination of the two missing data problems inherent in a testing procedure without a perfect reference test: (i) unknown overall prevalence (which also affects the PCR test's NPV) and (ii) missing correlation data (which does not affect the PCR test's NPV). Table 10 and Table 11 show the implied accuracy measures. Table 12 shows these data. This is additional data a PCR test produces, which can be used to refine bounds on the predictive values of the Antigen test. However, an extension of the current setting is needed, because such additional information is not accounted for in the current setting with binary tests.
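The prevalence bounds of Table 7 follow a simple pattern. A minimal sketch, assuming (as in the paper's baseline analysis) that Γ := P(y=1) lies in [τγ, γ] under Assumption 3 and that true prevalence is P(x=1) = Γ/σ under perfect specificity; the input values are hypothetical:

```python
def prevalence_bounds(tau, gamma, sigma_min):
    """Bounds on true prevalence P(x=1) in the overall population.

    tau:       P(t=1), share of the population that was tested
    gamma:     P(y=1 | t=1), reference-test positivity rate in the pool
    sigma_min: worst-case sensitivity of the reference test, sigma in [sigma_min, 1]

    Lower bound: Gamma = tau*gamma and sigma = 1.
    Upper bound: Gamma = gamma and sigma = sigma_min.
    """
    return tau * gamma, gamma / sigma_min

# hypothetical: 10% tested, 20% pool positivity, PCR sensitivity at least 80%
lo, hi = prevalence_bounds(tau=0.1, gamma=0.2, sigma_min=0.8)
```

As τ → 1 the lower bound converges to γ, matching the observation above that prevalence converges to the prevalence in the testing pool when testing becomes representative.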
Subsection 4.2 discusses a possible extension of the current setting to allow for this additional data.

Ai et al. (2020) and Gietema et al. (2020) propose using chest CT scans for early identification of COVID-19 in patients. In Gietema et al.'s study, symptomatic patients of a single Dutch emergency department received a chest CT scan and a PCR test for detecting SARS-CoV-2. Their study design exactly fits the framework of the current paper: (i) non-representative sampling of the testing pool and (ii) missing correlation data due to the use of an imperfect reference test (with perfect specificity). The testing data are reproduced in Table 13. Compared to the previously studied Antigen tests, the data for CT scans seem less aligned with the PCR test results. This is an indication that such a CT test is less informative: the dilation threshold of σ* = 63.35% is higher than for the Antigen tests. 40 The Antigen tests' NPV can be quite low, but it is at least close to 50%, irrespective of τ. On the other hand, the CT scan has both a sizable interval and a low lower bound of possible NPVs. Since σ = 0.6 is below the dilation threshold, the CT scan is completely uninformative for the tested population (equivalently, for τ = 1). Furthermore, this remains true for the overall population if τ ≥ 1/2, as shown in Table 14.

40 Public Health England (2020) explains: "Cycle threshold (Ct) is a semi-quantitative value that can broadly categorise the concentration of viral genetic material in a patient sample following testing by RT PCR as low, medium or high - that is, it tells us approximately how much viral genetic material is in the sample. A low Ct indicates a high concentration of viral genetic material, which is typically associated with high risk of infectivity. A high Ct indicates a low concentration of viral genetic material, which is typically associated with a lower risk of infectivity."
For example, for a non-COVID-indicative CT scan, the set of possible infection probabilities P(x=1|z=0) increases relative to the prior information p. 41 Even more striking are the data of Ai et al. (2020) shown in

Evaluations of a new test sometimes have more data available than just P(y, z|t=1). For example, blood samples from before the existence of a virus can serve as true-negative samples. On the other hand, specific blood samples could be analyzed with more sophisticated (and usually much more expensive) methods than just using an established test as reference. These methods would lead to samples with true positives (or at least with very high probability). 43 Either of these methods would provide additional data and therefore would also reduce the missing data problem. In general, this supplementary knowledge leads to narrower bounds, but unless these extra methods are performed for the whole tested population, the missing correlation issue remains. Of course, these methods cannot be applied to the untested population. Therefore, the missing data on prevalence cannot be avoided with these extraneous data.

The current framework only allows for binary outcomes for both tests and also for the underlying health state. This seems to be the most common situation, but one might want to allow for more than two outcomes, or for even more detailed information, like the Cycle Threshold Count of a PCR test as mentioned in Subsection 3.3. In such situations, the present theoretical analysis does not provide the appropriate machinery. However, the crucial tool used to characterize the set of all joint distributions is a result in copula theory (Joe, 1997, Theorem 3.10), which does not rely on any dimension being binary. Indeed, the result even works for continuous outcomes on each dimension. Similarly, one could use other results in Joe (1997) to characterize the set of possible joint distributions if multiple tests are conducted simultaneously, as studied in Zhou et al. (2014, Chapter 9).
In this case, and as in the characterization of Appendix A, the testing data are higher-dimensional marginal distributions of an overall joint distribution with an additional dimension (the health state). When considering such an extension, caution has to be taken, because sharp bounds on the set of possible higher-dimensional distributions are sometimes not known.

Since testing has a potentially big impact on the economy, an accurate description of the available testing technology is crucial. From the microeconomic perspective, the testing technology affects how tests should be optimally allocated (see Ely et al. (2020), Lipnowski and Ravid (2020)) and also how much people engage in social distancing, as studied by Acemoglu et al. (2020). The macroeconomy is also highly affected by testing strategies, and an optimal choice might reduce the economic costs of pandemics considerably (Alvarez et al., 2020; Eichenbaum et al., 2020). Although these studies establish the importance of testing and also address varying testing technologies, all of them assume that a test corresponds to an experiment à la Blackwell (1951) and therefore is always informative (sometimes the assumption is even that the test itself provides perfect information). This paper demonstrates that the assumption of unambiguous information in test results is only applicable if a perfect reference is available when evaluating new tests. In particular, new Antigen tests for the detection of SARS-CoV-2 are evaluated relative to an imperfect PCR test and therefore, as shown in this paper, these Antigen tests produce ambiguous information. An optimal testing procedure should take this ambiguity into account. Similarly, practitioner guides (like Galeotti et al.

P(x, y) is given by Table 16.

Table 16: Joint distribution P(x, y)

         y = 0       | y = 1
x = 1:   Γ(1−σ)/σ    | Γ
x = 0:   1 − Γ/σ     | 0

1. P(x=1, y=1) = P(y=1) − P(x=0, y=1) = Γ, because P(x=0, y=1) = 0 by Assumption 2.
2. P(x=1, y=0) = P(x=1) − P(x=1, y=1) = Γ(1−σ)/σ.
3. P(x=0, y=0) = P(y=0) − P(x=1, y=0) = 1 − Γ/σ.
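The three steps constructing Table 16 translate directly into code. A minimal sketch, following the derivation above (the function name is illustrative):

```python
def joint_x_y(Gamma, sigma):
    """P(x, y) as in Table 16: perfect specificity (Assumption 2) plus
    sensitivity sigma pin down the joint distribution of true infection x
    and established-test result y, given Gamma := P(y=1)."""
    p = {}
    p[(0, 1)] = 0.0                          # no false positives by Assumption 2
    p[(1, 1)] = Gamma - p[(0, 1)]            # step 1: equals Gamma
    p[(1, 0)] = Gamma / sigma - p[(1, 1)]    # step 2: P(x=1) = Gamma/sigma, so Gamma(1-sigma)/sigma
    p[(0, 0)] = (1.0 - Gamma) - p[(1, 0)]    # step 3: equals 1 - Gamma/sigma
    return p

# hypothetical values: Gamma = 0.3, sigma = 0.6
table16 = joint_x_y(0.3, 0.6)
```

Note how imperfect sensitivity shifts probability mass into the false-negative cell (x=1, y=0), which is exactly the mass whose placement across z-cells is unidentified.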
Using the law of total probability and rearranging gives P(y=1|t=0) = (Γ − τγ)/(1 − τ). Furthermore, by the nature of testing, P(y=0, z=1|t=0) = P(y=1, z=1|t=0) = 0. Therefore, P(y, z) is given by Table 17.

Table 17: Joint distribution P(y, z)

          y = 0                          | y = 1                        | margin
z = 0:    1 − Γ − P(y=0,z=1|t=1)τ        | Γ − P(y=1,z=1|t=1)τ          | 1 − τζ
z = 1:    P(y=0,z=1|t=1)τ                | P(y=1,z=1|t=1)τ              | τζ
margin:   1 − Γ                          | Γ                            |

By Joe (1997, Theorem 3.10), the set of all distributions P(x, y, z) with marginals given by P(x, y) and P(y, z) is bounded by two extreme distributions: 44 F̲_Γ ≤ F ≤ F̄_Γ, where F is the CDF corresponding to P(x, y, z), F̲_Γ is given by Table 18, and F̄_Γ is given by Table 19.

44 The set of all these distributions, often called a Fréchet class, is a convex set. Hence, it suffices to consider the extreme points only.

Table 18: CDF F̲_Γ

        (x=1, z=1) | (x=1, z=0)                  | (x=0, z=1) | (x=0, z=0)
y = 1:  1          | 1 − τζ                      | 1 − Γ/σ    | max{0, 1 − Γ/σ − P(y=0,z=1|t=1)τ}
y = 0:  1 − Γ      | 1 − Γ − P(y=0,z=1|t=1)τ     | 1 − Γ/σ    | max{0, 1 − Γ/σ − P(y=0,z=1|t=1)τ}

Table 19: CDF F̄_Γ

        (x=1, z=1) | (x=1, z=0)                  | (x=0, z=1) | (x=0, z=0)
y = 1:  1          | 1 − τζ                      | 1 − Γ/σ    | min{1 − Γ/σ, 1 − Γ − P(y=0,z=1|t=1)τ}
y = 0:  1 − Γ      | 1 − Γ − P(y=0,z=1|t=1)τ     | 1 − Γ/σ    | min{1 − Γ/σ, 1 − Γ − P(y=0,z=1|t=1)τ}

Since F̲_Γ and F̄_Γ are both nonincreasing in Γ, sharp bounds for the CDF F across all Γ := P(y=1) ∈ [τγ, γ] are F̲ := F̲_γ ≤ F ≤ F̄_{τγ} =: F̄. For the lower bound, we have 1 − Γ/σ − P(y=0,z=1|t=1) = (1 − γ)(1 − τ) + τ. For the upper bound, note that 1 − τγ − P(y=0,z=1|t=1)τ = 1 − τ(1 − P(y=0,z=0|t=1)). The corresponding probability mass functions are given by Table 20 and Table 21.
Table 20: Lower bound PMF, with P01 := P(y=0, z=1|t=1)

        (x=1, z=1)                  | (x=1, z=0)                     | (x=0, z=1)             | (x=0, z=0)
y = 1:  P(y=1,z=1|t=1)τ             | γ − P(y=1,z=1|t=1)τ            | 0                      | 0
y = 0:  max{0, γ/σ + P01·τ − 1}     | min{γ(1−σ)/σ, 1 − γ − P01·τ}   | min{1 − γ/σ, P01·τ}    | max{0, 1 − γ/σ − P01·τ}

Table 21: Upper bound PMF, with P00 := P(y=0, z=0|t=1)

        (x=1, z=1)                  | (x=1, z=0)                     | (x=0, z=1)               | (x=0, z=0)
y = 1:  P(y=1,z=1|t=1)τ             | P(y=1,z=0|t=1)τ                | 0                        | 0
y = 0:  τ(min{γ/σ, 1 − P00} − γ)    | τ·max{P00 + γ/σ − 1, 0}        | τ·max{1 − γ/σ − P00, 0}  | 1 − τ·max{γ/σ, 1 − P00}

References

Testing, Voluntary Social Distancing and the Spread of an Infection
Correlation of Chest CT and RT-PCR Testing for Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases
Sensitivity of Different RT-qPCR Solutions for SARS-CoV-2 Detection
A Simple Planning Problem for COVID-19 Lockdown
Equivalent Comparisons of Experiments
Performance of VivaDiag COVID-19 IgM/IgG Rapid Test Is Inadequate for Diagnosis of COVID-19 in Acute Patients Referring to Emergency Room Department
Blackwell's Informativeness Theorem Using Diagrams
The Macroeconomics of Testing and Quarantining
Optimal Test Allocation
Overcoming the Bottleneck to Widespread Testing: A Rapid Review of Nucleic Acid Testing Approaches for COVID-19 Detection
BinaxNOW COVID-19 Ag Card Home Test: Healthcare Provider Instructions for Use
Merit of Test: Perspective of Information Economics
CT in Relation to RT-PCR in Diagnosing COVID-19 in The Netherlands: A Prospective Study
Evaluating Ambiguous Random Variables and Updating by Proxy
Multivariate Models and Multivariate Dependence Concepts
Validation Report: SARS-CoV-2 Antigen Rapid Diagnostic Test
Pooled Testing for Quarantine Decisions
Handbook of the Economics of Risk and Uncertainty
Credible Ecological Inference for Medical Decisions with Personalized Risk Assessment
Bounding the Accuracy of Diagnostic Tests, with Application to COVID-19 Antibody Tests
Estimating the COVID-19 Infection Rate: Anatomy of an Inference Problem
A Serology Strategy for Epidemiological Studies Based on the Comparison of the Performance of Seven Different Test Systems - The Representative COVID-19 Cohort Munich
Understanding Cycle Threshold (Ct) in SARS-CoV-2 RT-PCR
A Review of Solutions for Diagnostic Accuracy Studies with an Imperfect or Missing Reference Standard
What Can We Learn about SARS-CoV-2 Prevalence from Testing and Hospital Data