key: cord-132120-u5s5heqm authors: Dempsey, Walter title: The Hypothesis of Testing: Paradoxes arising out of reported coronavirus case-counts date: 2020-05-21 journal: nan DOI: nan sha: doc_id: 132120 cord_uid: u5s5heqm Many statisticians, epidemiologists, economists and data scientists have registered serious reservations regarding the reported coronavirus case-counts. Limited testing capacity across the country has been widely identified as a key driver of suppressed coronavirus case-counts. Calls to increase testing capacity are well-justified and have become a frequent point of discussion in the public sphere. While expanded testing is a laudable goal, selection bias will impact estimates of disease prevalence and the effective reproduction number until the entire population is sampled. Moreover, tests are imperfect, and false positive/negative rates interact in complex ways with selection bias. In this paper, we attempt to clarify this interaction. Through simple calculations, we demonstrate pitfalls and paradoxes that can arise when considering case-count data in the presence of selection bias and measurement error. The discussion guides several suggestions on how to improve current case-count reporting. The World Health Organization has declared the coronavirus disease 2019 (COVID-19) a public health emergency. As of April 27th, 2020, a total of 2,993,000 cases have been confirmed worldwide. The New York Times reports at least 965,214 people across the United States have tested positive for the virus, and at least 49,465 patients with the virus have died. Aggressive policies have been put in place across the US, with at least 50% of the US population officially urged as of late April to stay home via state-wide executive actions. Despite these steps, the data landscape for understanding COVID-19 remains limited. Public databases maintained by Johns Hopkins University [Dong et al., 2020] and the New York Times [Smith et al., 2020] provide incoming county-level information on confirmed cases and deaths. Statisticians, epidemiologists, economists, and data scientists have used this granular data to forecast COVID-19 case-counts, deaths, and hospitalizations [Song et al., 2020, Ray et al., 2020, IHME and Murray, 2020]. In many cases, variations on SIR models have been used to draw inferences about infection rates and intensity. Parametric inference often uses observed case-counts and/or observed COVID-19 related deaths to infer latent trajectories of the pandemic. The goal of this paper is to express reservations at the use of case-counts as a proxy for disease prevalence and intensity, as well as their use as direct input into the estimation of standard epidemiological models for inference and forecasting. The reason is straightforward: current models do not account simultaneously for selection bias and measurement error. Selection bias enters due to differences in testing strategies across countries and states. In the US, for example, limited testing capacity has caused local and state health departments to focus on testing only high-risk populations. Moreover, testing requires the individual to self-select into asking for a test and then receiving approval from their doctor. Demands for increased testing capacity, while laudable, ignore the issues of self-selection and measurement error. While increasing testing capacity increases data quantity, there is no guarantee of increased data quality.
We will show that aggressive pushes for ramped-up testing capacity that are tied to decreases in data quality may have a mitigated impact on our ability to estimate quantities of interest such as prevalence and the effective reproduction number. In this paper, we will demonstrate the complex interaction between these two fundamental concepts. In classical statistics, measurement error is often associated with parameter attenuation; we will show, however, that the combination with self-selection can cause the bias in parameter estimates to change sign and to increase or decrease in magnitude. Moreover, the bias may depend on the sampling fraction, prevalence, and population size. Without a complete understanding of these interactions, we are destined to misinterpret case-counts and come to erroneous conclusions. We will demonstrate why there is a strong need for more precision and care in presenting COVID-19 analyses to the broader research and non-academic communities. 1.1. Outline. This article discusses the relationship between three statistical concepts: selection bias, measurement error, and the too often forgotten population size. We clarify mathematically why the situation is much more complex than it first appears. Through simple mathematical arguments, we clarify five important issues related to COVID-19 case-count analysis. First, we show that unadjusted prevalence rates are (unsurprisingly) biased when tests are imperfect. What is surprising is that the direction and magnitude of the bias can vary substantially and interact with the sampling fraction. Next, we show that the error decomposition for adjusted prevalence rates includes an additional term that captures the interplay among measurement error, selection bias, and prevalence. Third, we show that daily trends in the rate of positive tests are equally problematic even if testing rates and selection protocols are held constant. We show that the rate of change in observed COVID-19 case-counts overestimates the true rate of change prior to the peak time and underestimates it immediately after. This implies a data scientist analyzing observed rates will be (on average) overly pessimistic in the early stages of the pandemic, and overly optimistic after the peak. We show that estimates of the effective reproduction number may be similarly impacted. Fourth, we show that cross-country comparisons are difficult at best, with population size, sampling fraction, and data quality all interacting to impact null hypothesis testing. Fifth, we discuss a sensitivity analysis method for estimating data quality and its use in understanding the potential biases in COVID-19 case-count analyses. We then end the paper with a discussion of the benefits of randomized testing and the use of auxiliary information to improve over simple random sampling. We start with some simple notation. Let N denote the population size. For a state-level analysis N is the state's total population, while for a country-level analysis N is the country's total population. At a fixed moment in time, let Y_j denote the COVID-19 status of the jth individual in the population, j = 1, ..., N. Here, as in survey methodology [Cochran, 1977], we treat COVID-19 status as a fixed but unknown quantity of interest. For simplicity, we start by ignoring the dynamic nature of the viral outbreak as well as the fact that individuals can recover from the disease, and assume either individual j is COVID-19 positive and Y_j = 1 or is COVID-19 negative and Y_j = 0.
We also let I_j ∈ {0, 1} be an indicator of whether individual j was selected for testing (I_j = 1) or not (I_j = 0). To start, we assume the primary questions of interest are estimates of the overall number of COVID-19 cases and/or disease prevalence. That is, we are interested in either the population total Y = Σ_{j=1}^N Y_j or the population average Ȳ = Y/N. Suppose that n tests are performed and we observe the values y_1, ..., y_n ∈ {0, 1}. Then a natural candidate for prevalence is the proportion of positive tests ȳ = (1/n) Σ_{i=1}^n y_i, and a natural candidate for overall cases is N × ȳ. Under simple random sampling (SRS) or any other epsem design (equal probability of selection method), the above are unbiased estimators of the population-level quantities of interest. Under SRS, the variance of the estimator can be expressed as [(1 − f)/f] × σ_Y² / (N − 1), where f = n/N is the sampling fraction. These selection mechanisms are random and independent of the outcome of interest. When this is not the case, selection effects may cause bias in the above estimates. To better understand this issue, Meng [2018] provided the following intuitive and powerful statistical decomposition of the error between ȳ and the true proportion Ȳ:

ȳ − Ȳ = ρ_{I,Y} × √((1 − f)/f) × σ_Y.

The first term represents data quality, the second data quantity, and the third problem difficulty. The term ρ_{I,Y} is the empirical correlation between the population values {Y_j}_{j=1}^N and the selection values {I_j}_{j=1}^N. Under simple random sampling, E_I[ρ_{I,Y}] = 0, where the expectation is with respect to the selection mechanism I, so there is no bias. The SRS variance formula above shows that E_I[ρ²_{I,Y}] = 1/(N − 1). The key issue with selective testing is that E_I[ρ_{I,Y}] ≠ 0. Meng identified this as the fundamental issue that can lead to paradoxes in the analysis of big data. Here we highlight two key insights from Meng [2018] that are relevant to the COVID-19 crisis. First, comparing the mean-squared error under selection mechanism I and under SRS, we see that

MSE_I(ȳ) / MSE_SRS(ȳ) = (N − 1) × E_I[ρ²_{I,Y}].

This points to a troubling and paradoxical situation: the error relative to SRS increases as a function of population size. Meng [2018] termed this the "Law of Large Populations" (LLP). This points to a critical issue in current media practices in the communication of case-count data: two countries with the same testing strategy (i.e., equal E_I[ρ_{I,Y}]) can yield wildly different estimates due to population size. Large countries like the US may have true prevalence rates similar to smaller countries like the UK. Even under similar testing strategies, the mean-squared error in the prevalence rate for the US will have substantially more variation in comparison to SRS. Comparing the US to the UK, for example, the MSE will increase by a factor of almost five. Thus conclusions drawn from observed case-count records may be not just wrong, but very wrong. Second, there have been calls for increased testing. While important, many conflate increased testing capacity with increased quality of testing. We will discuss this point further in Section 5. As of May 8th, the US has performed 8,412,095 tests in total. Figure 1 shows the trajectory of testing per day and positive cases per day in the US. The US population is roughly 328 million, meaning the fraction sampled is f = 0.026. After smoothing, the empirical prevalence is 9.1% on May 8th. Supposing COVID-19 positive individuals are 2 times more likely to get tested than those individuals who are COVID-19 negative, the question is "What is the sample size from a SRS that would yield equivalent MSE in the estimated prevalence?" A short simulation sketch below illustrates the decomposition numerically.
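The following is a minimal simulation sketch, not taken from the paper, that checks Meng's identity numerically: it draws a hypothetical population, applies a selection mechanism in which positives are twice as likely to be tested, and compares the realized error with the data quality x data quantity x problem difficulty product. The population size and true prevalence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: size and prevalence are illustrative assumptions.
N = 1_000_000          # population size
prev = 0.05            # assumed true prevalence Y-bar
Y = rng.binomial(1, prev, size=N)
Ybar, sigma_Y = Y.mean(), Y.std()

# Selective testing: COVID-positive individuals are twice as likely to be tested.
f = 0.026              # overall sampling fraction (as in the US example above)
M = 2.0                # relative sampling rate f1/f0
f0 = f / (1 + (M - 1) * Ybar)
I = rng.binomial(1, np.where(Y == 1, M * f0, f0))

ybar = Y[I == 1].mean()                     # observed positive-test fraction
rho = np.corrcoef(I, Y)[0, 1]               # realized data quality rho_{I,Y}
f_real = I.mean()                           # realized sampling fraction n/N

# Meng's identity: error = data quality x data quantity x problem difficulty.
lhs = ybar - Ybar
rhs = rho * np.sqrt((1 - f_real) / f_real) * sigma_Y
print(f"observed minus true prevalence:        {lhs:.4f}")
print(f"rho * sqrt((1-f)/f) * sigma_Y:         {rhs:.4f}")

# Law of Large Populations: relative MSE vs SRS scales like (N - 1) * rho^2.
print(f"(N - 1) * rho^2 (MSE inflation vs SRS): {(N - 1) * rho**2:,.0f}")
```

The identity holds exactly for each realized selection, so the two printed errors agree up to floating-point precision; the final line shows how the same data quality translates into a much larger MSE inflation as N grows.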
Using the above MSE result, Meng showed that the effective sample size satisfies n_eff ≤ f / [(1 − f) E_I[ρ²_{I,Y}]]. Under these parameters, the effective sample size is 14. Recent proposals [Siddarth and Weyl, 2020] have argued in favor of increasing overall testing capacity. With more tests available, the relative sampling rate may decrease. Supposing it drops to 1.2, the effective sample size increases to 299. See Appendix B for the mathematical details and calculations in additional scenarios. The above calculations point to a sad state of affairs: the effective sample size, even in best-case scenarios, is no better than a small random sample from the population. The remainder of this paper aims to build upon these fundamental insights by extending the decomposition in two directions: accounting for measurement error and accounting for the temporal nature of the pandemic. We then discuss effective sampling methods and their importance in the current pandemic. 2.1. Imperfect testing. Tests are imperfect. COVID-19 testing is no exception. Here we investigate the interplay between imperfect testing and selection bias. In discussions of inaccurate testing, the standard assumption is that measurement error leads to parameter attenuation. When paired with selection bias, however, the two sources of error become entangled, and the resulting errors can become magnified, muted, or even switch signs. First we require some additional notation. Let P_j be an indicator of measurement error, equal to 1 when we incorrectly measure the outcome and 0 when we observe the true outcome. We suppose this is a stochastic variable where pr(P_j = 1 | Y_j = 1) =: FN is the false-negative rate and pr(P_j = 1 | Y_j = 0) =: FP is the false-positive rate. If individual j is selected (i.e., I_j = 1) then the observed outcome can be written as Y*_j = Y_j (1 − P_j) + (1 − Y_j) P_j. Suppose disease prevalence is estimated as the fraction who tested positive, i.e., ȳ*_n, the average of the observed outcomes among those tested. We can again investigate the error compared to the true prevalence Ȳ in statistical terms, where Z = 1 − 2Y; see Appendix A for the derivation. The first term in the large brackets represents the perfect testing regime; to this end, we refer to ρ_{I,Y} as the true data quality. The second term represents the interaction between imperfect testing and selection bias. The variable PZ = 1[Y = 0, P = 1] − 1[Y = 1, P = 1] is non-zero only when P = 1 (i.e., the outcome is incorrectly reported); to this end, we refer to ρ_{I,PZ} as the observed data quality that accounts for both selection bias and measurement error. In the appendix, we show that the sign of ρ_{I,PZ} is the opposite of the sign of ρ_{I,Y}, which implies that the observed data quality adds error in the opposite direction from the true data quality. Finally, the third term represents the bias due to imperfect testing. We start by considering the first two terms and assess whether the sign of the bias can reverse due to the interaction of measurement error and selection bias. To do this, we define the sampling rate differential. Let f_1 := pr(I_j = 1 | Y_j = 1) and f_0 := pr(I_j = 1 | Y_j = 0) be the sampling rates. Then ∆ = f_1 − f_0 is the sampling rate differential. Using these terms, we can re-express the first two terms as a function of ∆; the final term in brackets is the measurement error adjustment to the data quality, which is a complex function of the sampling rate differential, the odds ratio, and the ratio of the measurement error's interaction with prevalence to the sampling rates' interaction with prevalence.
Note that sgn(∆) = sgn(ρ_{I,Y}) by equation (3) in the appendix, which implies the measurement error adjustment either shrinks the data quality measure toward zero or reverses its sign. While prior investigations have noted the interaction between measurement error and selection bias [Beesley et al., 2020, Beesley and Mukherjee, 2019, van Smeden et al., 2019], the interaction with the sample size relative to the population, i.e., f, has largely been ignored. The above statistical decomposition clarifies the importance of the quantity f. In particular, note that the statistical error also includes a bias term due to measurement error, and this term increases as the sampled fraction f increases. Therefore, how the first two terms interact with the final term depends on the fraction of the population sampled. This interaction is complex, but it implies that whether the estimate ȳ*_n is an overestimate or an underestimate is a complicated question due to the relation among these three pieces. Consider again the current COVID-19 pandemic. For now, we assume the ratio of conditional selection rates f_1/f_0 is equal to 1.5. In Section 3.4, we discuss recent research suggesting a false negative rate around 17.2% and a false positive rate around 0.5%. Under these rates and the current US prevalence rates, the ratio of the MSE to the MSE under no measurement error is 0.436; if we instead lower the false negative rate to 5% and increase the false positive rate to 5%, the relative MSE is 2.89. This is merely to demonstrate that in some cases we see a huge increase in MSE and in other settings a huge decrease. What drives this is the interaction of the false positive and negative rates with prevalence and sampling rates. Therefore, whether we are better or worse off with respect to the MSE is a very difficult question to answer. The attentive data analyst will recognize that the estimator ȳ*_n is biased even for simple random samples and, if sensitivity and specificity were known a priori, may suggest the alternative estimator ỹ_n = ȳ*_n − (1 − ȳ*_n) FP + FN ȳ*_n, which is (to first order) unbiased under simple random sampling. We again wish to express the error ỹ_n − Ȳ in statistical terms. In the appendix, we show that the error can be expressed as in equation (1): the first term is the same as before but scaled by (1 + FP + FN) to account for the additional uncertainty due to measurement error, and the second term is the interaction between selection bias and measurement error. Figure 2 presents the measurement error data quality adjustment as a function of the relative frequency and the odds ratio. We see that the adjustment can be both positive and negative and spans a range of magnitudes. We next investigate the impact of measurement error on the effective sample size calculation; the effective sample size can again be bounded as in Section 2, with the data quality term replaced by its measurement-error-adjusted counterpart. In Section 3.4, we discuss recent studies that estimate a COVID-19 false positive rate of 0.5% and a false negative rate of 17.2%. Supposing COVID-19 positive individuals are 2 times more likely to get tested than those individuals who are COVID-19 negative, the effective sample size drops from 14 to 10. If the relative sampling fraction drops to 1.2 then the effective sample size drops from 299 to 216. Figure 2. Measurement error data quality adjustment: relative frequency f_1/f_0 (x-axis) against odds ratio (y-axis) for FP = 0.005 and FN = 0.172. Color scaled so blue = −10, white = 0, and red = 10.
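The sketch below illustrates the interaction with a small Monte Carlo draw. The error rates (FP = 0.5%, FN = 17.2%) and the two-fold relative sampling rate follow the numbers quoted above; the population size and true prevalence are assumptions. It shows that the first-order correction removes the measurement-error bias that would remain under random sampling, but leaves the selection bias untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

N, prev = 1_000_000, 0.05          # illustrative population size and true prevalence
FP, FN = 0.005, 0.172              # false positive / false negative rates (Section 3.4)
f, M = 0.026, 2.0                  # sampling fraction and relative sampling rate f1/f0

Y = rng.binomial(1, prev, size=N)
f0 = f / (1 + (M - 1) * Y.mean())
I = rng.binomial(1, np.where(Y == 1, M * f0, f0))

# Imperfect test: flip the true status with probability FN (if positive) or FP (if negative).
P = rng.binomial(1, np.where(Y == 1, FN, FP))
Y_obs = Y * (1 - P) + (1 - Y) * P

ybar = Y_obs[I == 1].mean()                          # naive positive-test fraction
ytilde = ybar - (1 - ybar) * FP + FN * ybar          # first-order measurement-error correction

print(f"true prevalence:     {Y.mean():.4f}")
print(f"naive estimate:      {ybar:.4f}")
print(f"corrected estimate:  {ytilde:.4f}   (selection bias remains)")
```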
The above argument shows that the effective sample size is affected in a complex manner by the interaction between selection bias and measurement error. Here, we discuss the impact of this interaction on the effective sample size when testing capacity is tied to measurement error. In the current COVID-19 pandemic, there have been well-justified calls for increases in testing capacity. These increases, however, may come at the cost of increases in false-positive and false-negative rates. In terms of effective sample size, if the data quality term remains constant, then testing increases may not significantly improve the effective sample size. Consider the simplest case where f increases from 0.05 to 0.1; suppose this increase is associated with the false positive rate increasing from 0.5% to 5% and the false negative rate rising from 5% to 20%. Then the standard analysis would suggest a reduction in the MSE of 47%, but the actual decrease is 26%. This implies that doubling the sampling rate will only increase the effective sample size by a factor of 2.9 rather than the expected 4-fold increase. In a worst-case scenario, if the false negative rate climbs to 0.30 and the false positive rate to 0.10, then the effective sample size increases by only a factor of 2.3. 2.2. Regrettable rates: complex biases resulting from self-selection. The data analyst, now wary of estimating prevalence and total counts, pauses and thinks. They return shortly thereafter with a follow-up consideration: Ok, perhaps estimating prevalence is difficult. Certainly, however, we can estimate the rate of growth. All I want to know is when we hit the point at which the curve flattens and the number of deaths decreases. That can't be too hard, surely! Unfortunately, ratio estimators do not cancel errors, as now both the numerator and the denominator are uncertain. This means selection bias and measurement error can have paradoxical effects. Here we consider ratio estimators for the relative change in the prevalence rate. We let Ȳ_{t−1} and Ȳ_t denote the prevalence (across the entire population) at time steps t − 1 and t respectively (i.e., prevalence on two consecutive days). Then the ratio estimator is given by r = ȳ_t/ȳ_{t−1}. We assume both numerator and denominator are prevalence estimates adjusted for measurement error. If the sample size at each time were equal (i.e., n_1 = n_2 = n) then this analysis would be equivalent to comparing the increase in observed case-counts (i.e., ȳ_t/ȳ_{t−1} = y_t/y_{t−1}). We wish to express the error ȳ_t/ȳ_{t−1} − Ȳ_t/Ȳ_{t−1} in statistical terms. Using a Taylor series approximation, we show in the appendix that the error can be expressed approximately in terms of ρ_{I_j,Y_j}, the data quality; f_j, the sampling fraction; D_{M_j}, the measurement error adjustment; and CV(Y_j) = σ_{Y_j}/Ȳ_j, the coefficient of variation at time step j. The magnitude of the error depends on the population-level ratio Ȳ_t/Ȳ_{t−1}, so a large decrease in the prevalence rate will have a relatively small error compared to a large increase in the prevalence rate. The second term represents the "cancellation" that the naive analyst is hoping will occur. The analyst's intuition only holds when the difference is equal to zero. This occurs when data quality, sampling fraction, measurement error, and prevalence are all constant across time. Consider the state of New York. Figure 5a shows the number of tests and positive tests per day in New York state.
Figure 5b presents the fraction of tests that were positive per day in New York state. For cancellation, we would need the expected data qualities to be uncorrelated and equal. Testing appears relatively stable over a week, which suggests that assuming ∆_t (i.e., the differential in the selection rate) varies slowly over time may be plausible. Note, however, that constant ∆_t does not imply constant data quality (see Appendix B for details). Since Figure 5a shows the number of tests has remained relatively constant in the last few weeks, specifically relative to the total NY population, assuming f_t also varies slowly in time is reasonable. Finally, the measurement error adjustment is likely to be constant, as no new testing procedures have been reported. This would not be true in states like Michigan, where testing criteria have been expanding every few weeks. The final component is an adjustment to this differential error based on the statistical error at the first time point. Figure 3a shows the trajectory of the true ratio and the potentially biased estimators under an SIR model for the epidemic dynamics, with state evolution given by the following ODE:

dS_t/dt = −β S_t I_t / N,   dI_t/dt = β S_t I_t / N − γ I_t,   dR_t/dt = γ I_t,

where S_t is the number of susceptible individuals at time t, I_t is the number of infected individuals at time t, N is the total population size, and R_t is the number of removed (recovered or deceased) individuals at time t. Figure 4a presents the fraction of the population newly infected at each time step for β = 1.4 and γ = 0.2 (black line); a minimal simulation of these dynamics is sketched below. In each case, the rate of change in prevalence is likely to be overestimated prior to the peak and then underestimated afterwards. From a decision-making perspective, such biases have potential impact. First, the overestimation may give lawmakers and governors more leverage in proposing aggressive actions that reduce prevalence. Of course, the analysis supports the argument that estimates based on available data are overly pessimistic. What both sides miss is that the direction of the bias is non-constant over time. After we hit the peak, estimates will likely under-estimate prevalence. Such bias puts pressure on lawmakers and governors to prematurely relax social distancing measures. Moreover, the question is "Can we trust the observed data to let us know if we have reached the peak?" It appears that the peak time is the easiest quantity to estimate, in terms of having minimal error, using the available data. Note, however, that the standard errors on such estimates will be incredibly wide due to the low effective sample size as discussed in Section 2. 2.3. Estimation of effective reproduction number. Many well-respected epidemiologists argue that tracking the effective reproduction number R_t is the only way to manage through the crisis [Leung, 2020]. We show here how these estimates are also impacted by selection bias and measurement error. For simplicity, we again assume the estimates are adjusted for known false positive and false negative rates. Bettencourt and Ribeiro [2008] show that under a Poisson likelihood, a simple relation between the trajectory of new cases and the effective reproduction number can be derived. In particular, under an SIR model, the number of new cases on day t, K_t, can be related to K_{t−1}, the number of new cases on day t − 1, and γ, the serial interval, which is approximately 7 days for COVID-19 [Sanche et al., 2020].
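The following is a minimal sketch of the SIR dynamics described above with β = 1.4 and γ = 0.2; the Euler time step, population size, and initial conditions are assumptions chosen only to illustrate the newly-infected fraction K_t and its day-over-day ratio, not the paper's exact simulation.

```python
import numpy as np

# Minimal Euler integration of the SIR dynamics described above (beta = 1.4, gamma = 0.2).
# Population size and initial conditions are illustrative assumptions.
N = 1_000_000
beta, gamma = 1.4, 0.2
T = 60                                  # days to simulate
S, I, R = N - 100.0, 100.0, 0.0

new_cases = []                          # K_t: newly infected individuals on day t
for t in range(T):
    infections = beta * S * I / N       # new infections this step
    recoveries = gamma * I
    S, I, R = S - infections, I + infections - recoveries, R + recoveries
    new_cases.append(infections)

K = np.array(new_cases)
frac_new = K / N                        # fraction of the population newly infected (cf. Figure 4a)
growth_ratio = K[1:] / K[:-1]           # true day-over-day ratio K_t / K_{t-1}

peak_day = int(np.argmax(frac_new))
print(f"peak in new infections on day {peak_day}")
print("true growth ratio around the peak:",
      np.round(growth_ratio[peak_day - 2: peak_day + 3], 3))
```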
Using this relation, a simple estimator of the effective reproduction number at time t is R̂_t = 1 + (1/γ) log(K_t/K_{t−1}). Of course, we do not observe K_t and K_{t−1} but noisy proxies y_t and y_{t−1}, i.e., the observed case-counts on days t and t − 1 respectively. The hope is that y_t/y_{t−1} is a good proxy for K_t/K_{t−1}, where here y_t is conceptualized as the number of new cases on day t. However, if we think about estimating K_t under SRS of new cases among those susceptible on day t, the natural estimator is S_t ȳ_t. Unfortunately the number of individuals susceptible on day t is unknown. Here, we assume that R̂_t = 1 + (1/γ) log(ȳ_t/ȳ_{t−1}) is the estimated effective reproduction number on day t. This estimator adjusts for sample size variation across days but not for the varying population size. The estimator is close in spirit to the estimator derived under the assumption that the process is fully observed. As in Section 2.2, the proxy has error in both numerator and denominator that may cause issues. We again wish to express the statistical error R̂_t − R_t in useful terms. We can re-arrange the error decomposition from Section 2.2 to show that this error takes the same form as for the rates, but on the logarithmic scale. For small values of x, we have log(1 + x) ≈ x and so the biases are similar. However, the error is no longer scaled by the rate of change in prevalence Ȳ_t/Ȳ_{t−1}. Instead the error is scaled by the reciprocal of the serial interval, γ^{-1} = 1/7. Moreover, the error depends not on the aggregate prevalence Ȳ_t but on the fraction of new cases K̄_t. The final term is the error due to varying population sizes. Since S_t ≤ S_{t−1}, the final term −log(S_t/S_{t−1}) ≥ 0. Figure 3b displays the bias for the effective reproduction number and how it is impacted by the relative sampling fraction. Here, for simplicity, we assume the fraction of the population that is susceptible remains large and near the total population size. Under this assumption, the potential bias is similar to that for the ratio estimator, adjusted to be on the logarithmic scale. Changes to this assumption can impact the potential bias. The key difference between Figure 3a and Figure 3b is that prevalence depends on the fraction of the population infected at time t while R̂_t depends on the fraction of new cases in the population. This leads to differences in when the bias is most pronounced. Here we show how selection bias and measurement error can creep into clinical trial analysis. The key concern is whether clinical trials on COVID-19 recruit from the pool of individuals who have tested positive for COVID-19 or whether they sample randomly from the population, test, and then recruit from this subset of tested individuals. To see the issue, suppose we have an outcome Y, a treatment A, and an unobserved variable U. Suppose treatment is assigned at random, i.e., A = 1 with probability 50% and A = 0 with probability 50%. Suppose the conditional mean of the outcome satisfies E(Y | U, A) = β_0 + β_1 A + β_2 U + β_3 AU. Typically, we are interested in the causal effect of A on Y. In counterfactual language, the marginal effect is β_1 + β_3 E(U). Even in a randomized experiment, the question is what the correct value of E(U) is. For COVID-19, the value of interest is the expected value of U in the population of COVID-19 positive individuals. Due to selection bias, however, the estimated marginal treatment effect differs from the true marginal effect of interest whenever the effect modification term β_3 is nonzero and recruitment is related to U. A simulation sketch of this phenomenon follows below.
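The following sketch simulates the recruitment problem just described. The coefficient values and the enrollment mechanism (a logistic function of a severity index U) are assumptions, not quantities from the paper; the point is only to show how randomized treatment with self-selected recruitment targets a different population than the one of interest.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical outcome model E(Y | U, A) = b0 + b1*A + b2*U + b3*A*U with randomized A.
b0, b1, b2, b3 = 0.0, 1.0, 0.5, 0.8      # illustrative coefficients
N_pop = 200_000

U = rng.normal(0.0, 1.0, N_pop)          # e.g., a severity index in the target population
A = rng.binomial(1, 0.5, N_pop)          # treatment randomized within the trial
Y = b0 + b1 * A + b2 * U + b3 * A * U + rng.normal(0, 1, N_pop)

# Target estimand: marginal effect in the COVID-positive population, b1 + b3 * E(U).
true_marginal = b1 + b3 * U.mean()

# Self-selected recruitment: higher-severity individuals are more likely to enroll.
p_enroll = 1 / (1 + np.exp(-(U - 1.5)))          # assumed enrollment mechanism
S = rng.binomial(1, p_enroll).astype(bool)

est_marginal = Y[S & (A == 1)].mean() - Y[S & (A == 0)].mean()
print(f"target marginal effect:                         {true_marginal:.3f}")
print(f"trial estimate under self-selected recruitment: {est_marginal:.3f}")
# Randomization removes confounding inside the trial, but effect modification by U
# means the recruited sample answers a question about a different population.
```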
Directionality of the bias will depend on the relationship between U and Y. If U is a measure of disease severity, then we might expect β_3 > 0 and E_I[ρ_{I,U}] > 0. In such situations, the treatment effect estimates will tend to be overly optimistic, as most individuals recruited are potentially higher on the severity index than the population average. Not only that, but for fixed data quality, the error compared to SRS in terms of effect estimation scales with the population size (i.e., the law of large populations). Randomization of treatment assignment in clinical trials negates unobserved confounders. It does not, however, negate effect modifiers. Therefore, for marginal treatment effects to be interpretable, there must be a well-defined population. Most often, our main interest is in causal effects in the population of COVID-19 positive individuals. Randomization within the clinical trial design yields internal validity, but we also need external validity to estimate the correct marginal effect of interest [Keiding and Louis, 2016]. So far we have focused on understanding the limitations of using case-count data to understand population quantities of interest for a single population (i.e., the United States, Germany, France or Michigan, New York, London in isolation) at a single moment and over time. Many are interested in cross-population comparisons. Many pieces in the media, as well as academic articles, present plots of case-counts over time aligned by the time of the first known case. Others have claimed such comparisons are unfair due to unequal population sizes and instead plot case-counts per million. In this section, we focus on the statistical issues relevant to such comparisons. We do not make any comment on the multi-causal nature of success or failure in different countries. 3.1. Prevalence comparisons. At a fixed time, suppose prevalence estimates ȳ_1 and ȳ_2 are observed for two populations. We can express ȳ_1 − ȳ_2 in statistical terms, where the first term is the difference in population prevalences (i.e., the quantity of interest). If random sampling were performed in each population, E_I[ρ_{I_j,Y_j}] = 0 and so the estimate would be unbiased. However, the second term represents the complex error that results from selection bias. The particular test of interest is H_0: Ȳ_1 = Ȳ_2. Under the null, σ_{Y_1} = σ_{Y_2}. In classical statistics, we would compute a Z-score, where the usual simplifications hold under f_1 = f_2, i.e., the same sampling rate relative to population size. This tells us that the exact error in the comparison of sample means, as an estimate of the difference in population means, scales as a function of the difference in data quality ρ_{I_1,Y_1} − ρ_{I_2,Y_2} times the square root of a population adjustment. For highly unequal population sizes (e.g., N_1 >> N_2), the adjustment is approximated well by the square root of the smaller population size minus one (e.g., √(N_2 − 1)). Take, for example, a comparison of the US and Canada. The two population sizes are approximately 328 million and 38 million people. Then the population adjustment is 5.84 million. As of May 5th, Canada has performed a total of 24.92 tests per 1,000 people while the US has performed a total of 22.01 tests per 1,000 people, so the data quantities are approximately equal to f = 23/1000 = 0.023. Then, under a prevalence of 10%, the differential ∆_1 − ∆_2 would need to be much smaller than 11.68 × 10^{−6} in order for the selection bias not to impact the comparison. A short sketch below illustrates how a small selection differential translates into an apparent difference in observed prevalence.
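As a quick illustration of how fragile such comparisons are, the sketch below assumes two populations with identical true prevalence but slightly different relative sampling rates into testing; the specific values of M are illustrative assumptions, not estimates from any country.

```python
# Two hypothetical countries with identical true prevalence but slightly different
# selection into testing.  All numbers below are illustrative assumptions.
Ybar = 0.10                      # common true prevalence (as in the example above)
M1, M2 = 2.0, 1.8                # relative sampling rates f1/f0 in each country

def observed_positive_fraction(prev, M):
    # Expected fraction of tests that are positive when positives are M times
    # more likely to be tested than negatives.
    return M * prev / (1 + (M - 1) * prev)

p1 = observed_positive_fraction(Ybar, M1)
p2 = observed_positive_fraction(Ybar, M2)
print(f"country 1 observed prevalence: {p1:.3f}")
print(f"country 2 observed prevalence: {p2:.3f}")
print(f"apparent difference: {p1 - p2:.3f}   (true difference is exactly 0)")
```

Even a modest gap in selection intensity (2.0 versus 1.8) produces an apparent difference of about 1.5 percentage points where none exists.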
So only under the very strong assumptions of equal data quantity and quality will the comparisons be valid and the Z-score be treatable as in the classical setting. A simple alternative is to use the effective sample sizes to build a more appropriate Z-score. That is, replacing the usual variance term with 1/n_{1,eff} + 1/n_{2,eff} allows us to build Z-scores that account for the true (effective) sample sizes. While still not perfect, this yields Z-scores that are not overly confident due to selection bias and measurement error. See Section 3.4 for a discussion of how to estimate data quality. 3.2. Case count comparisons. Two alternative comparisons are common. First, the data analyst may directly compare the observed number of new cases on a given day. If y_1 and y_2 are the observed numbers of cases in populations j = 1, 2 respectively, then we can express the error of y_1 − y_2 in similar terms. Under simple random sampling, the first term is zero in expectation. However, the question is whether f_1 Ȳ_1 − f_2 Ȳ_2 is of scientific interest. We would argue that this is only of true interest when f_1 = f_2 and we are interested in comparing the total number of cases. In this setting, however, the difference in error terms is unlikely to be zero. To see this, assume that we have similar expected data quality, data quantity, problem difficulty, and measurement error. Then the error is a function of n_1 − n_2 = n_2 (N_1/N_2 − 1). So the error is a function of the relative difference in the population sizes. Alternatively, the data analyst may be wary of such concerns about population differences and scale counts by the respective population sizes, i.e., y_1/N_1 − y_2/N_2. We can again express the error in the same terms. Under simple random sampling, the first term is again zero in expectation. However, the question is whether f_1 Ȳ_1 − f_2 Ȳ_2 is of scientific interest. We would argue that this is only of true interest when f_1 = f_2 and we wish to compare disease prevalence. Importantly, the error in this comparison is less problematic. To see this, assume that we have similar expected data quality, data quantity, problem difficulty, and measurement error. Then the error is equal to zero. It is important to note that the standard law of large populations caveats apply to both comparisons. In both cases, the point is that we must ascertain what the population-level contrast is and whether it should be of genuine scientific interest. Otherwise, we are comparing apples and oranges. Not only this, but the two comparisons come with different levels of error induced by selection bias and measurement error. 3.3. Rate comparisons. An alternative is a comparison of the rates of case-count change over time. Here, for simplicity, we focus on comparing the estimated effective reproduction numbers. We assume the two time-series are aligned so that t = 0 is the time of the first case in each population. This negates alignment issues and is common in practice. While the above issues with Z-scores and effective sample sizes are important, here we highlight a separate issue: the interaction of the biases in the estimation of the two trajectories when the peak infection times differ slightly. We can write the error in the difference of the estimated rates as

(R̂_{t,1} − R̂_{t,2}) − (R_{t,1} − R_{t,2}) = (1/γ) log[(1 + e_{1t})/(1 + e_{2t})],

where e_{jt} is the error given in Section 2.3. Recall that the error allows for over-estimation prior to the peak and under-estimation post-peak. Here, these errors can mingle in interesting ways. Consider two countries (A and B) in which the peak occurs two weeks earlier for country A than for country B, but the trajectories are otherwise similar.
Figure 4 presents such a comparison, where each country's disease trajectory follows an SIR model as described in Section 2.2 with different parameters. Figure 4b shows how the errors interact in complex ways. First, the gap is correctly estimated; then the gap is over-estimated as country A sees a rapid rise in cases; then the gap is even more over-estimated as country A improves and its rate is under-estimated while country B sees a rapid increase in its case-count and its rate is over-estimated; then country A's rate is correctly estimated while country B's rate is under-estimated as it sees improvement in infection rates; finally, the gap disappears. We are not claiming this will always be the case; indeed, for small enough selection bias and/or large gaps in case-counts such complex behaviour will not occur. The point is simply that the observed information can tell a more complex story than the truth (i.e., country A recovers one month before country B). 3.4. Estimating data quality. In prior sections, we have discussed the impact of selection bias and measurement error on a range of outcomes of interest in COVID-19 research. Here, we try to address the empirical question of how to estimate the data quality E_I[ρ_{I,Y}] in the current context. In Meng [2018], estimation of the data quality relied on observing the true outcome. In election data, for example, we observe the vote totals after the fact and can use these as ground truth. We are not so lucky in the current crisis. Here, we propose a simple procedure that uses survey samples as noisy but unbiased estimates of the true prevalence. Our goal is less inferential and more to provide a sensitivity analysis that can aid in understanding the amount of information in observational COVID-19 case-count data. On April 23rd, Andrew Cuomo announced results from a study in New York state. It found that 13.9% of 3,000 people tested across the state had signs of the virus. The study did not report sensitivity and specificity; therefore, we take the reported measures from the Santa Clara study [Bendavid et al., 2020]. The reported sensitivity is 82.8% (95% CI: 76.0-88.4%) and specificity is 99.5% (95% CI: 99.2-99.7%). This corresponds to a false negative rate of 17.2% (11.6-24.0%) and a false positive rate of 0.5% (0.3-0.8%). Correcting for these rates, we have an estimated prevalence of 15.9%, with a range of 14.8% to 17.0% using the ranges of the false positive and negative rates. Figure 5a plots the number of new total and positive tests per day. Figure 5b plots the fraction of tests that were positive per day. In both cases, we applied exponential smoothing to the reported values because the spiky pattern is likely due to testing backlogs. Given the report was on the 23rd, we suppose for simplicity that the study was run several days prior, on the 20th of April. On the 20th, 19,654 tests were performed, and 617,555 tests had been performed up to and including the 19th. The population of New York state is 19,378,102. Subtracting off the number of individuals who had already been tested yields a sampling fraction of f ≈ 0.001. Here, we assume the New York survey is a simple random sample from the state's population. We recognize that the survey is not such a sample, but techniques like poststratification and raking can be applied to calibrate estimates. Then the error is 16.6%. Using (1), we can use this error to construct an estimate of the relative sampling rate; the range implied by the confidence intervals is (2.05, 2.54). A sketch of this back-calculation is given below.
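The sketch below walks through the arithmetic. The survey estimate, error rates, and New York testing counts are the numbers quoted above; the smoothed positive-test fraction on April 20 is a placeholder assumption, and the final step backs out M by comparing odds, which is one simple route rather than the paper's exact construction via (1).

```python
# Back-of-envelope sensitivity calculation for the New York example above.
# The first-order measurement-error correction follows Section 2.1; the observed
# positive-test fraction on April 20 is a placeholder, not a figure from the paper.

FP, FN = 0.005, 0.172                  # Santa Clara study error rates
survey_raw = 0.139                     # NY survey: fraction with signs of the virus

# Correct the survey estimate for imperfect testing (first-order correction).
prev_corrected = survey_raw - (1 - survey_raw) * FP + FN * survey_raw
print(f"corrected survey prevalence: {prev_corrected:.3f}")       # ~0.159

# Sampling fraction among those not previously tested (numbers quoted above).
tests_apr20, tested_before, ny_pop = 19_654, 617_555, 19_378_102
f = tests_apr20 / (ny_pop - tested_before)
print(f"sampling fraction f: {f:.5f}")

# One way to back out a relative sampling rate M = f1/f0: compare the odds of a
# positive test among those tested with the odds implied by the corrected prevalence.
p_obs = 0.30                           # placeholder: smoothed positive-test fraction
M = (p_obs / (1 - p_obs)) / (prev_corrected / (1 - prev_corrected))
print(f"implied relative sampling rate M: {M:.2f}")
```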
We can also perform a sensitivity analysis as a function of the false negative and positive rates. Under these measurement error ranges, we saw that the impact on the effective reproduction number estimator was much smaller. However, for FP = 0.05 and FN = 0.005 we have M = 4.31. While nothing above is definitive, it points to the interplay of selection bias and measurement error. This calculation gives us simple tools to understand how sensitivity and specificity impact our estimators and therefore how much we should trust conclusions based on these assumptions. Under the Santa Clara study estimates of these quantities, while we expect some impact, the bias is much smaller than under other estimates. This article argues that random sampling is a powerful tool and that data obtained through a self-selection process have serious limitations. As COVID-19 survey studies become more readily accessible, we encourage researchers to focus on building representative samples. One difficulty in the current crisis is the low prevalence relative to the population size in certain geographies. Take the recent Santa Clara study, for example, which estimated less than 2% prevalence in the area. The key issue with low prevalence is that if the test's false positive rate is not lower than the prevalence, then the observed data are consistent with zero recovered individuals. In such settings, simple random sampling is doomed to fail; however, there are potential solutions. In particular, one can stratify the population into risk categories (based on population density, occupation, age, and other important factors) and perform Neyman allocation [Cochran, 1977]. That is, sample individuals at random within stratum h with sample size proportional to P_h √(Ȳ_h (1 − Ȳ_h)), where P_h is the stratum's frequency in the population and Ȳ_h the prevalence in the stratum. Of course, P_h and Ȳ_h are not known a priori; however, if decent estimates can be obtained, we can see huge benefits in settings where the prevalence varies widely across strata. One could attempt to go further and incorporate contact network information. However, such data are difficult to obtain in practice. Indeed, geographic location, occupation, age, and basic demographic information are much easier to obtain and are a good proxy for the latent network connectivity. There is nothing routine about COVID-19, including the corresponding statistical questions. The goal of this paper was to point out questionable statistical routines. Indeed, case-count trajectories are being reported daily by governments and broadcast by the media. The precision in the reported numbers gives the illusion of accuracy, when what needs to be understood is the level of uncertainty. Selection bias not only means uncertainty; it also means analysts can feel certain about incorrect conclusions. We end this paper with a brief discussion of several important topics related to selection bias and measurement error. Decision-making: data versus information. Having read through the above technical discussions, one may come to the incorrect conclusion that, because we are concerned about implicit biases in observational data, we must wait for better data sources in order to act [Ioannidis, 2020]. This conclusion confuses data and information. Lack of high quality data means we should be skeptical regarding conclusions drawn from these data sources alone. Being skeptical of observed data does not imply governments and communities should wait for high quality data to act; this is especially true in the current high risk scenario.
Instead, we should aim to have high quality information, and in the current crisis we do have just that. Indeed, epidemiologists and public health officials have a strong understanding of the basic facts of pandemics and disease spread. Social distancing, contact tracing and mass testing are all important. Individuals and communities with low quality data and only basic information will hopefully make decisions that respect the inherent risks to their well-being as well as the high degree of uncertainty in the observed data. Of course, there are trade-offs, and long-term stay-at-home orders will have negative economic consequences. A full discussion of these topics is beyond the scope of the current article. High quality information on prevalence. Governments are pushing for increased testing capacity and robust contact tracing. Contact tracing is incredibly useful for identifying carriers early and preventing spread of the disease. We would argue that, after the current wave, understanding prevalence is also key. We know that the probability of an outbreak is a function of current prevalence and network connectivity. Epidemic critical thresholds have been derived for many infectious disease models [Pastor-Satorras and Vespignani, 2001, Newman, 2002, Parshani et al., 2010]. Knowing prevalence will help governments determine long-term community-level risk and allow for more targeted interventions, shutting down only certain locations when necessary. Data quantity versus quality. Governments implicitly argue that increased testing capacity will perform the above task. Without complete compliance, however, our understanding of future outbreaks may be plagued by self-selection bias, compounded by changing sensitivity and specificity rates. Random testing removes these effect modifiers, giving governments more information to fight the disease. This is not to say random testing should replace current testing practices; only that random testing should supplement them and will provide valuable additional information even with moderate sample sizes. Model-based solutions. One alternative to the approach presented in this article is a model-based extension of SIR models that tries to capture both selection bias and measurement error; however, without strong assumptions on the selection mechanism, the estimates are often not nonparametrically identifiable. When an issue "cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically" [Cox and Hinkley, 1974, p. 96]. In the absence of randomized experiments, the best route forward using any approach is the use of detailed sensitivity analyses and humility in the conclusions drawn from such data sources.

Appendix A. Imperfect testing: derivation. A.1. Mean estimator. We consider the mean estimator ȳ*_n. The first term can be re-written so that it agrees with Meng's [2018] decomposition. For the other term, we first define PZ = 1[Y = 0, P = 1] − 1[Y = 1, P = 1]; this term can then be re-expressed using the previous technique, where now the "data defect" and "problem difficulty" are with respect to PZ rather than Y. Combining the pieces yields the decomposition stated in Section 2.1. For the binary outcome Y, we have σ_Y = √(Ȳ(1 − Ȳ)). We can then re-write ρ_{I,Y} σ_Y + ρ_{I,PZ} σ_{PZ} in terms of the sampling rate differential: with M = f_1/f_0, we have ∆ = f_1 − f_0 = f_0(M − 1) and f_0(1 − Ȳ) + f_1 Ȳ = f_0((1 − Ȳ) + M Ȳ). A.2. Ratio estimator. Let u = (u_{t−1}, u_t) ∈ R² and g(u) = u_t/u_{t−1}, i.e., a differentiable function g: R² → R.
Centering a second-order Taylor series expansion around the coordinates (U_{t−1}, U_t) ∈ R² and plugging in (ȳ_{t−1}, ȳ_t) for (u_{t−1}, u_t) and (Ȳ_{t−1}, Ȳ_t) for (U_{t−1}, U_t) yields an expression for ȳ_t/ȳ_{t−1} − Ȳ_t/Ȳ_{t−1}, where the second equality is obtained by plugging in the statistical decomposition of the error at both time points and the coefficient of variation is defined as CV(Y) := σ_Y/Ȳ. Under measurement error, the extra terms D_{M_t} and D_{M_{t−1}} can be inserted in the corresponding locations. A.3. Estimation of effective reproduction number. The results from Section A.2 give the estimate of the number of new cases on day t; setting e_t = δ_t × [1 − ρ_{I_{t−1},K_{t−1}} D_{M_{t−1}} ...] yields the error expression used in Section 2.3.

Appendix B. Derivation of the effective sample size. Using this computation, we compute the effective sample size under a range of values of ȳ and M, with f = 0.026 (i.e., the current sampling fraction) and f = 1/7 (sampling individuals roughly every 7 days).

References.
Beesley, L. J., et al. [2020]. An analytic framework for exploring sampling and observation process biases in genome and phenome-wide association studies using electronic health records.
Beesley, L. J., and Mukherjee, B. [2019]. Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification. medRxiv.
Bendavid, E., et al. [2020]. COVID-19 antibody seroprevalence in Santa Clara County, California. medRxiv.
Bettencourt, L. M. A., and Ribeiro, R. M. [2008]. Real time Bayesian estimation of the epidemic potential of emerging infectious diseases.
Cochran, W. G. [1977]. Sampling Techniques.
Cox, D. R., and Hinkley, D. V. [1974]. Theoretical Statistics.
Dong, E., Du, H., and Gardner, L. [2020]. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infectious Diseases.
IHME COVID-19 health service utilization forecasting team and Murray, C. J. L. [2020]. Forecasting the impact of the first wave of the COVID-19 pandemic on hospital demand and deaths for the USA and European Economic Area countries. medRxiv.
Ioannidis, J. P. A. [2020]. A fiasco in the making? As the coronavirus pandemic takes hold, we are making decisions without reliable data. Stat News.
Keiding, N., and Louis, T. A. [2016]. Perils and potentials of self-selected entry to epidemiological studies and surveys.
Leung, G. [2020]. Lockdown can't last forever. Here's how to lift it.
Meng, X.-L. [2018]. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.
Newman, M. E. J. [2002]. Spread of epidemic disease on networks.
Parshani, R., et al. [2010]. Epidemic threshold for the SIS model on random networks.
Pastor-Satorras, R., and Vespignani, A. [2001]. Epidemic spreading in scale-free networks.
Ray, D., et al. [2020]. Predictions, role of interventions and effects of a historic national lockdown in India's response to the COVID-19 pandemic: data science call to arms. medRxiv.
Sanche, S., et al. [2020]. High contagiousness and rapid spread of severe acute respiratory syndrome coronavirus 2. Emerging Infectious Diseases.
Siddarth, D., and Weyl, E. G. [2020]. Why we must test millions a day. COVID-19 Rapid Response Impact Initiative.
Smith, M., et al. [2020]. Coronavirus in the U.S. The New York Times.
Song, P. X.-K., et al. [2020]. An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China. medRxiv.
van Smeden, M., et al. [2019].