key: cord-0753403-eyu6hcbc authors: Sahlu, Ida; Whittaker, Alexander B title: Obtaining Prevalence Estimates of COVID-19: A Model to Inform Decision-making date: 2021-04-08 journal: Am J Epidemiol DOI: 10.1093/aje/kwab079 sha: baef1d4685112f10fdbf563a7b690652494c9363 doc_id: 753403 cord_uid: eyu6hcbc We evaluate whether randomly sampling and testing a set number of individuals for coronavirus disease 2019 (COVID-19), while adjusting for misclassification error, captures the true prevalence. We also quantify the impact of misclassification error bias on publicly reported case data in Maryland. Using a stratified random sampling approach, 50,000 individuals were selected from a simulated Maryland population to estimate the prevalence of COVID-19. We examined situations in which the true prevalence was low (0.07%-2%), medium (2%-5%), and high (6%-10%). Bayesian models informed by published validity estimates were used to account for misclassification error when estimating COVID-19 prevalence. Adjustment for misclassification error captured the true prevalence 100% of the time, irrespective of the true prevalence level. Without adjustment for misclassification error, the results varied widely depending on the population's underlying true prevalence and the type of diagnostic test used. Generally, the unadjusted prevalence estimates worsened as the true prevalence level increased. Adjustment for misclassification error in publicly reported Maryland data led to a minimal, nonsignificant increase in the estimated average daily cases. Random sampling and testing for COVID-19, combined with adjustment for misclassification error, are needed to improve COVID-19 prevalence estimates.

Since not all COVID-19 cases are symptomatic, testing individuals who present with symptoms or who have known exposure to COVID-19, and relying on voluntary testing, likely underestimates the true prevalence of COVID-19 in communities (9, 10).
Indeed, an Icelandic study reported that only 57% of those confirmed with COVID-19 through random testing in the population had COVID-19 symptoms (9). Therefore, to address selection bias when obtaining population-level prevalence estimates of COVID-19, reported testing results should derive from a randomly selected population, as this helps identify potentially asymptomatic individuals. Identifying asymptomatic individuals in the population is especially important because presymptomatic or asymptomatic individuals may transmit disease (9, 11, 12). Additionally, current estimates of COVID-19 cases in the United States do not account for misclassification error that may arise from the imperfect nature of the tests. Studies have reported that testing for COVID-19 may lead to false positives and false negatives depending on the test and its associated validity estimates, defined by the sensitivity and specificity (4, 6). The sensitivity of a COVID-19 diagnostic test using reverse transcriptase-polymerase chain reaction (RT-PCR) may be as high as 89% and as low as 63% (4, 5). The specificity of this same test may be as high as 98.8% (5). A recently published study reported that a lab-free, point-of-care RT-PCR test may have a sensitivity of 94% (95% CI: 86-98%) and a specificity of 100% (99-100%) (13).

Ideally, the number of new cases would be sufficient for decision-making, but this parameter is difficult to estimate correctly in the presence of asymptomatic cases. An alternative is to regularly obtain repeated population-based prevalence estimates at specific points in time that account for selection bias and misclassification error. The primary aim of this study is to evaluate, with a simulated population, whether randomly sampling and testing a set number of individuals (i.e., the weekly average number of tests) for COVID-19 while adjusting for misclassification error captures the true prevalence of COVID-19.
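To make the consequences of imperfect validity concrete, the expected share of positive tests is p(+) = pi*Se + (1 - pi)*(1 - Sp), and inverting this relation gives the classical Rogan-Gladen correction. A minimal sketch, using the RT-PCR validity figures quoted above (the function names are ours, not from the paper):

```python
def apparent_positive_rate(prev, sens, spec):
    """Expected fraction of positive tests given true prevalence,
    sensitivity and specificity: true positives + false positives."""
    return prev * sens + (1.0 - prev) * (1.0 - spec)

def rogan_gladen(p_obs, sens, spec):
    """Invert the relation above to recover the true prevalence,
    clipped to [0, 1] since sampling noise can push the estimate outside."""
    est = (p_obs + spec - 1.0) / (sens + spec - 1.0)
    return min(max(est, 0.0), 1.0)

# RT-PCR validity values quoted in the text: Se up to 0.89, Sp up to 0.988.
p_obs = apparent_positive_rate(0.05, sens=0.89, spec=0.988)
print(round(p_obs, 4))                             # apparent rate under 5% true prevalence
print(round(rogan_gladen(p_obs, 0.89, 0.988), 4))  # recovers the 5% true prevalence
```

The correction breaks down when the observed rate falls below the false-positive rate (1 - Sp), which is why a fully Bayesian treatment, as used in the paper, is preferable at low prevalence.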
As a secondary aim, we quantify the impact of misclassification error on publicly reported case report data. We use the state of Maryland as a worked example, but the methods implemented can be applied to any setting. To create the simulated Maryland population from which to draw the study sample, we used the American Community Survey as our primary data source. The American Community Survey is a yearly, nationally representative, cross-sectional survey of the United States population conducted by the Census Bureau (17). The survey is sent to approximately 3.5 million residences and captures individuals' demographic and socioeconomic characteristics (18). We simulated the household, age and race distributions for each county in Maryland using the 2018 American Community Survey data for counties with more than 20,000 residents and the 2014-2018 American Community Survey data for counties with fewer than 20,000 residents (19, 20). One-year data were not reported for smaller counties.

Assigning true prevalence of COVID-19 to the simulated Maryland population

To allow generalization of our findings to different situations, we examined situations in which the total true prevalence was low (0.07%-2%), medium (2%-5%) and high (6%-10%) for Maryland. These prevalence values are referred to as the true prevalence. Within these three prevalence categories, we took 100 samples from the population and analyzed the data to capture the variability that occurs when a small proportion of a large population is sampled. We also varied the true COVID-19 prevalence at the county level. To do so, we first grouped Maryland's 24 counties into four groups based on the distribution of the cumulative number of cases reported by the Maryland Department of Health by June 10 (21). These county groupings aligned with the population distribution in the state, where more densely populated counties had a higher cumulative number of COVID-19 cases (21).
We then assigned to each group a true prevalence value range drawn from a uniform distribution. We also assigned symptoms to 57% of those with COVID-19, based on the Icelandic study (9), and assumed a 3% general rate of viral symptoms in the population not due to COVID-19 (22). To identify individuals for testing, we randomly sampled 50,000 individuals (approximately 1% of the total population) to test for COVID-19 from the simulated population using a stratified, probability-proportional-to-size random sampling approach. The sampling described below was done 100 times to capture the variability that occurs when a small proportion of a large population is sampled. This sample size corresponds approximately to the average weekly number of tests reported in Maryland in May 2020 (21). We used the counties of Maryland to partition the sampling frame into mutually exclusive, exhaustive strata. This approach ensured that individuals were selected from each county. To sample the 50,000 individuals, we first selected 50,000 households from the simulated population with representation from each stratum. A household's selection probability was proportional to the household population size in its county. We then randomly selected one person from each selected household (Figure 1). This sampling design is self-weighting, which allows the flexibility of not using weights in analyses: more individuals are selected from larger counties, but their selection probabilities are proportionally lower, resulting in an equal selection probability for each individual (23). We refer to this sample as the "study sample." This approach led to 100 study samples for each true prevalence level. In addition, we examined the situation where 20% of tests are reserved for those who are symptomatic. To implement this situation, we randomly selected 10,000 individuals with symptoms and then randomly selected 40,000 individuals from the remaining population.
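The stratified, probability-proportional-to-size design described above can be sketched as follows. This is a simplified illustration rather than the study's actual sampling code: the county household counts, the household sizes (1-5 members), and all names are made up for the example.

```python
import random

random.seed(1)

# Hypothetical strata: county -> number of households (illustrative only).
counties = {"A": 400_000, "B": 250_000, "C": 100_000, "D": 50_000}
TOTAL_SAMPLE = 50_000

total_hh = sum(counties.values())

# Allocate the household sample proportionally to each stratum's size, so
# every individual ends up with (approximately) equal selection probability:
# larger counties contribute more households, but each household there is
# proportionally less likely to be drawn -- the self-weighting property.
allocation = {c: round(TOTAL_SAMPLE * n / total_hh) for c, n in counties.items()}

def draw_sample(allocation, counties):
    """Sample household indices uniformly within each county stratum,
    then pick one member per household at random."""
    sample = []
    for county, k in allocation.items():
        households = random.sample(range(counties[county]), k)
        for hh in households:
            member = random.randrange(1, 6)  # one person from a 1-5 person household
            sample.append((county, hh, member))
    return sample

sample = draw_sample(allocation, counties)
print(len(sample))  # 50,000 individuals, with every stratum represented
```

Selecting one person per sampled household, rather than testing whole households, is what keeps the design self-weighting at the individual level.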
This approach allowed a comparison between symptomatic testing and testing irrespective of symptoms. We refer to this sample as the "symptomatic sample." Prior distributions were informed by published validity estimates (27); these means and 95% quantile intervals generated the prior distributions of sensitivity and specificity for the RT-PCR test and for the rapid antigen test. Model 2 estimated prevalence through a binomial likelihood, y ~ Binomial(n, p(+)), where p(+) is the probability of testing positive, pi is the population prevalence, n is the sample size, and y is the number of individuals with COVID-19. We calculated the posterior mean and the 95% BCI of pi. We calculated the posterior distribution using the No-U-Turn Sampler, a variant of Hamiltonian Monte Carlo (28). We ran four Markov chains with 4,000 warmup iterations and 9,000 total iterations. We used the default hyperparameters and adjusted only the maximum tree depth, which was set to 30. Four chains are the default for the algorithm; the tree depth was increased until the algorithm converged. We analyzed the 100 study and symptomatic samples for each range of true prevalence values (low, medium and high) and each type of test. We ran Models 1 and 2 for the study sample. Figure 2 illustrates the daily percent of positive COVID-19 cases before and after adjustment for misclassification error.

Overall, randomly sampling and testing individuals from the population for a set number of tests (i.e., the weekly average), combined with adjustment for misclassification error, captured the true prevalence of COVID-19 regardless of the underlying true prevalence level and the diagnostic test used. Without adjustment for misclassification error, the results varied widely depending on the population's underlying true prevalence and the type of diagnostic test used. Generally, the prevalence estimates without adjustment for misclassification error worsened as the true prevalence level increased.
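A minimal sketch of the misclassification-adjusted prevalence model described above: instead of running the No-U-Turn Sampler, this approximates the posterior of pi on a grid under a flat prior, and it fixes sensitivity and specificity at point values rather than using the full priors from the paper. The function name and scenario numbers are ours.

```python
import math

def posterior_prevalence(y, n, sens, spec, grid_size=2001):
    """Grid approximation to the posterior of prevalence pi under a flat
    prior, with y positives out of n tests and observed-positive
    probability p = pi*Se + (1 - pi)*(1 - Sp)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_lik = []
    for pi in grid:
        p = pi * sens + (1.0 - pi) * (1.0 - spec)
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard the log at the edges
        log_lik.append(y * math.log(p) + (n - y) * math.log(1.0 - p))
    m = max(log_lik)                           # subtract max for stability
    weights = [math.exp(l - m) for l in log_lik]
    total = sum(weights)
    post = [w / total for w in weights]        # normalized posterior mass
    mean = sum(pi * w for pi, w in zip(grid, post))
    return grid, post, mean

# 3,000 positives out of 50,000 tests with RT-PCR-like validity values:
# the raw positive rate is 6%, but the adjusted estimate is lower because
# false positives outnumber missed cases in this scenario.
_, _, mean = posterior_prevalence(y=3000, n=50_000, sens=0.89, spec=0.988)
print(round(mean, 4))
```

With a sample this large the grid posterior is tightly concentrated near the Rogan-Gladen point estimate (p_obs + Sp - 1)/(Se + Sp - 1); the paper's fully Bayesian version additionally propagates uncertainty in Se and Sp themselves.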
We found that when the true prevalence was low and adjustment for misclassification error was not done, the RT-PCR and antigen tests performed similarly and the true prevalence was captured almost every time. An important contributing factor is that both tests had very high specificity values, and prevalence estimates are more prone to false-positive than false-negative results when the prevalence is low (26, 29). Therefore, it is especially important to account for misclassification error analytically if the known prevalence is medium or high, to avoid underestimating the true prevalence. Improved COVID-19 prevalence estimates would also improve the COVID-19 case fatality rate, which uses the number of individuals with COVID-19 as the denominator (30).

Oversampling those who are symptomatic led to consistently missing the true prevalence. In our study, we reserved only 20% of tests for those who are symptomatic, yet this consistently led to an overestimate of the true prevalence, irrespective of the true prevalence level and type of diagnostic test used. These results demonstrate the importance of sampling independently of symptoms for a disease such as COVID-19, which has asymptomatic individuals.

Our methodology for the simulation portion of our study improved on the seroprevalence study of Bendavid et al., which accounted for misclassification error (6), in that we randomly selected individuals for testing. As a result, our estimates were not subject to selection bias and came from a directly representative sample of the population. Bendavid et al. did attempt to account for selection bias by using post-stratification weights (6). However, such adjustment is only as good as the variables included in the post-stratification weights and, therefore, it is difficult to assess whether their adjustment was successful.
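The low- versus high-prevalence asymmetry discussed above can be made concrete with expected error counts: even a highly specific test yields false positives that rival the number of true cases when prevalence is low, while imperfect sensitivity makes false negatives dominate when prevalence is high. A rough sketch, assuming a sensitivity of 70% (between the 63% and 89% bounds quoted earlier) and a specificity of 98.8%; the scenario values are illustrative:

```python
def expected_errors(prev, n, sens, spec):
    """Expected false positives and false negatives among n tested
    individuals at a given true prevalence."""
    cases = prev * n
    non_cases = n - cases
    false_neg = cases * (1.0 - sens)      # infected people the test misses
    false_pos = non_cases * (1.0 - spec)  # uninfected people flagged positive
    return false_pos, false_neg

for prev in (0.01, 0.08):  # "low" versus "high" true prevalence
    fp, fn = expected_errors(prev, n=50_000, sens=0.70, spec=0.988)
    print(f"prev={prev:.0%}: expected FP={fp:.0f}, FN={fn:.0f}")
```

At 1% prevalence the false positives far outnumber the false negatives, biasing the unadjusted estimate upward; at 8% the false negatives dominate, producing the underestimation the paper observes at medium and high prevalence.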
Finally, we employed a Bayesian approach that allows test sensitivity and specificity to vary, which is a more realistic assumption than the frequentist approach used by Bendavid et al. (6). False negatives are of greater concern than false positives when the prevalence is high (29, 31). This is consistent with our findings from the simulation portion of our study, where the estimated prevalence consistently underestimated the true prevalence when the true prevalence was medium or high. In this case, the daily percent positive reported by the Maryland Department of Health during that time period was greater than 10% and could be considered high. If the percent-positive values were lower, for example closer to 1%, then false positives would be more of a concern. Therefore, our findings confirm that misclassification error may present a problem for already collected data and that adjustment is needed to minimize bias.

There are some limitations to our study. First, our study used simulated data to randomly select individuals and estimate the prevalence of COVID-19 in Maryland while adjusting for misclassification error. As a result, our analysis does not reflect some additional biases that may arise in the field, such as non-response bias from individuals refusing to be tested. Since we examined current COVID-19 infection as opposed to past exposure to SARS-CoV-2, and our results were unsurprising, we do not believe this is an important concern for our estimates. An added benefit of implementing random testing in the community is that it helps with the identification and isolation of asymptomatic cases and bolsters contact tracing efforts.
If large-scale testing of individuals is not possible because of budget constraints, testing costs could be reduced by testing pooled samples (33).

References

Clinical Testing For Covid-19
Coronavirus Disease 2019 (COVID-19)
Modern epidemiology
Diagnostic Performance of CT and Reverse Transcriptase-Polymerase Chain Reaction for Coronavirus Disease 2019: A Meta-Analysis
The Appropriate Use of Testing for COVID-19
COVID-19 Antibody Seroprevalence
Antibody Tests in Detecting SARS-CoV-2 Infection: A Meta-Analysis
Spread of SARS-CoV-2 in the Icelandic Population
Universal screening for SARS-CoV-2 in women admitted for delivery
Presymptomatic SARS-CoV-2 Infections and Transmission in a Skilled Nursing Facility
Evidence Supporting Transmission of Severe Acute Respiratory Syndrome Coronavirus 2 While Presymptomatic or Asymptomatic
Assessing a novel, lab-free, point-of-care test for SARS-CoV-2 (CovidNudge): a diagnostic accuracy study
BD Veritor System for Rapid Detection of SARS-CoV-2 - Instructions for Use
Misclassification errors in prevalence estimation: Bayesian handling with care
Estimation of diagnostic test accuracy without full verification: a review of latent class methods
Medicaid expansion for adults had measurable 'welcome mat' effects on their children
United States Census Bureau. ACS and the American Community Survey
American Community Survey 1-Year Public Use
American Community Survey 5-Year Public Use
Maryland Department of Health
Rates of asymptomatic respiratory virus infection across age groups
Sampling with varying probabilities without replacement: rotating and non-rotating samples
Assistant Secretary for Public Affairs (ASPA). COVID-19 Rapid Point-Of-Care Test Distribution
NIH delivering new COVID-19 testing technologies to meet U.S. demand
Bayesian analysis of tests with unknown specificity and sensitivity
Towards reduction in bias in epidemic curves due to outcome misclassification through Bayesian analysis of time-series of laboratory test results: Case study of COVID-19 in Alberta, Canada and Philadelphia, USA
The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo
Probability and statistics in aerospace engineering
The many estimates of the COVID-19 case fatality rate
When laboratory tests can mislead even when they appear plausible
Information for Laboratories about Coronavirus (COVID-19)
Pooling RT-PCR or NGS samples has the potential to cost-effectively generate estimates of COVID-19 prevalence in resource limited environments
Poverty, inequality and COVID-19: the forgotten vulnerable
Population point prevalence of SARS-CoV-2 infection based on a statewide random sample-Indiana