key: cord-1018265-dwiwbbew
authors: Chiu, Weihsueh A.; Ndeffo-Mbah, Martial L.
title: Using test positivity and reported case rates to estimate state-level COVID-19 prevalence and seroprevalence in the United States
date: 2021-09-07
journal: PLoS Comput Biol
DOI: 10.1371/journal.pcbi.1009374
sha: b8d08f9a101bfa7fcee35fc2f19f7dc932257ea8
doc_id: 1018265
cord_uid: dwiwbbew

Accurate estimates of infection prevalence and seroprevalence are essential for evaluating and informing public health responses and vaccination coverage needed to address the ongoing spread of COVID-19 in each United States (U.S.) state. However, reliable, timely data based on representative population sampling are unavailable, and reported case and test positivity rates are highly biased. A simple data-driven Bayesian semi-empirical modeling framework was developed and used to evaluate state-level prevalence and seroprevalence of COVID-19 using daily reported cases and test positivity ratios. The model was calibrated to and validated using published state-wide seroprevalence data, and further compared against two independent data-driven mathematical models. The prevalence of undiagnosed COVID-19 infections is found to be well-approximated by a geometrically weighted average of the positivity rate and the reported case rate. Our model accurately fits state-level seroprevalence data from across the U.S. Prevalence estimates of our semi-empirical model compare favorably to those from two data-driven epidemiological models. As of December 31, 2020, we estimate nation-wide a prevalence of 1.4% [Credible Interval (CrI): 1.0%-1.9%] and a seroprevalence of 13.2% [CrI: 12.3%-14.2%], with state-level prevalence ranging from 0.2% [CrI: 0.1%-0.3%] in Hawaii to 2.8% [CrI: 1.8%-4.1%] in Tennessee, and seroprevalence from 1.5% [CrI: 1.2%-2.0%] in Vermont to 23% [CrI: 20%-28%] in New York. Cumulatively, reported cases correspond to only one third of actual infections. The use of this simple and easy-to-communicate approach to estimating COVID-19 prevalence and seroprevalence will improve the ability to make public health decisions that effectively respond to the ongoing COVID-19 pandemic.

Accurate and reliable estimates of the prevalence and seroprevalence of infection are essential for evaluating and informing public health responses and vaccination strategies to mitigate the ongoing COVID-19 pandemic. The gold standard method to empirically measure disease prevalence and seroprevalence is to conduct periodic large-scale surveillance testing via random sampling [1]. However, this approach may be time- and resource-intensive, and only a handful of such surveillance studies have been conducted so far in the United States (US) [2] [3] [4] [5] [6] [7]. Therefore, public health officials have relied on alternative metrics, such as test positivity, reported cases, fatality rates, hospitalization rates, and epidemiological models' predictions, to inform COVID-19 responses. Test positivity has, for instance, been commonly used to infer the level of COVID-19 transmission in a population and/or the adequacy of testing [8] [9] [10] [11] [12] [13] [14]. However, the justifications for use of this metric often reference a WHO recommendation intended to be applied only in a sentinel surveillance context [15], rather than in the more general context in which it has been frequently implemented. As measures of prevalence, test positivity and reported cases, although readily available and well-understood by public health officials, are very likely to provide biased estimates of disease transmission/prevalence and seroprevalence [1, 16, 17]. Hospitalization and death rates are also readily available, but tend to lag infections by several weeks and only reflect the most severe outcomes [1]. Finally, epidemiological models are generally complex mathematical, computational, or statistical models that require extensive data and information for model training, and are perceived as a "black box" by most public health practitioners and decision makers [18] [19] [20].

Here, we develop a simple semi-empirical model to estimate the undiagnosed prevalence and seroprevalence of COVID-19 at the US state level based only on reported cases, test positivity rate, and testing rate (Fig 1). Specifically, we hypothesized that passive case finding employed in the US leads to preferential diagnostic testing for individuals at higher risk of infection, and that the resulting bias in test positivity can be modeled as a convex function of the overall testing rate, reflecting the "diminishing return" from expanding general population testing (Fig 1B and 1C). We modeled this convexity using a negative power function, with power parameter n that is either fit to each state (random effects model) or fixed at ½ (geometric mean model). We also included seroprevalence in our simple semi-empirical modeling framework by adding an offset term $SP_o$ to account for missed infections during the early part of the pandemic before regular and large-scale testing was established. We calibrated and validated the power parameter and other model parameters by fitting our seroprevalence model to statewide seroprevalence data (Tables A and B in S1 Text), which have only recently become available across all U.S. states [2] [3] [4] [5] [6] [7] [21], using a Bayesian inference approach. We also compared our model predictions against two independent data-driven mechanistic models [18, 22, 23] and showed that our model's predictions of infection prevalence approximate those of more complex models. We found that the state-level prevalence of undiagnosed

Four independent Markov chain Monte Carlo chains were simulated and reached adequate convergence after 20,000 iterations per chain for the random effects model and 2,000 iterations per chain for the geometric mean model (PSRF ≤ 1.15 for all parameters; see Table 1 and Table C in S1 Text), with a multivariate PSRF ≤ 1.11. For inference, 2,000 samples were selected randomly from across the available iterations (80,000 for random effects and 8,000 for geometric mean).
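As an illustration of the convergence check, the following is a minimal Python sketch of the basic (univariate) potential scale reduction factor for a single parameter. It is not the authors' R code: the function name `psrf` and the simulated chains are hypothetical, and standard implementations may apply additional degrees-of-freedom corrections or compute a multivariate version.

```python
import numpy as np

def psrf(chains):
    """Basic Gelman-Rubin potential scale reduction factor for one parameter.

    chains: array of shape (m, n) holding m chains of n retained draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: between-chain variance
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return float(np.sqrt(pooled / within))

# Hypothetical draws: four chains sampling the same target, and four offset chains.
rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 2000))
stuck = mixed + np.arange(4)[:, None]
```

Chains that explore the same distribution give a value close to 1, while chains centered at different values give a value well above the 1.2 threshold used for adequacy.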

The 95% credible intervals for the power parameter n include ½ (corresponding to an unweighted geometric mean) both for the fixed effect and for all but three states' random effects (ME, NH, RI) (Table C and Fig A in S1 Text). For the seroprevalence offset $SP_o$, the posterior median initial condition was < 1% for most states, but three states had posterior medians > 5%. Specifically, for NY, PA, and LA, it was estimated that initial cases that were missed constituted 14% [95% CrI: 8.9%-18.6%], 5.2% [0.3%-8.1%], and 5.1% [2.6%-7.7%] of the population, respectively. For NY and LA, these values are consistent with these two states having large initial surges of cases when testing was highly limited, and therefore being likely to have missed a large number of cases. For PA, this value is consistent with its high death-to-case ratio observed in the initial phase of the pandemic, which indicates that a large number of cases were likely missed [24, 25]. The large variation in $SP_o$ values across states is consistent with the high heterogeneity that has been noted both in the size of their initial surge of infections and in their testing capacity and availability.

Comparisons of posterior estimates and observations by state are shown in Fig 2 and show the model to be consistent with the available seroprevalence calibration and validation data in terms of both level and trends. For four states (AK, IL, OH, WI), the model's validation predictions underestimated the empirical seroprevalence data, though the trends were correctly predicted (Fig 2). For calibration data, the residual standard error (RSE) was estimated to be 0.27

We compared our model estimates for the prevalence of active infections with those from two independent epidemiologic models of U.S. states. As shown in Fig 3, the posterior estimates of the semi-empirical model are largely consistent with posterior credible intervals from the epidemiologic models, with the most notable difference in NY, where the initial surge was underestimated. This is not unexpected because this surge includes the "missed" cases, which for seroprevalence were addressed by the seroprevalence offset $SP_o$, but which are not included in the prevalence estimates. Across all states in aggregate, the RSE of the difference between the posterior medians of the semi-empirical model and the extended SEIR model is 0.67 natural log units (see Fig D in S1 Text), corresponding to a CV of 75%, with an $R^2$ of 0.68. Similarly, the comparison with the Imperial model yields an RSE of 0.66, corresponding to a CV of 74%, and an $R^2$ of 0.68. These RSE values should be taken in the context of the posterior uncertainty in the epidemiologic models themselves, which have individual reported uncertainties corresponding to CVs of 45% and 23% for the extended SEIR and Imperial models, respectively, as well as the differences between the two models, which have a CV of 63%. Thus, the difference between the semi-empirical model and the epidemiologic models is not much greater than the difference between the two epidemiologic models themselves. Overall, the semi-empirical model estimate of infection prevalence is consistent with the results of the available epidemiologic models.

As of December 31, 2020, our calibrated and validated semi-empirical model estimates that in the US, total infection prevalence was 1.43% [CrI: 0.99%-1.86%], with more than half undiagnosed (0.83% [0.41%-1.25%]), and seroprevalence was 13.2% [CrI: 12.3%-14.2%] (Fig 4). Between April 1 and December 31, 2020, the test positivity rate bias b and the ratio between estimated seroprevalence and cumulative reported cases were shown to decrease over time (Fig H in S1 Text). In April, the median positivity rate bias across states was 65 and the cumulative cases underreporting bias ranged from 0.02 to 0.09 (i.e., only 2%-9% of cases were reported). By December, the median positivity rate bias had declined to 17 and the cumulative cases underreporting bias ranged from 0.14 to 0.69 (Fig H in S1 Text). Across the U.S. in aggregate, from April to December, the median positivity rate bias declined from 61 to 15, and the cumulative cases underreporting bias improved from 0.01 to 0.33 (Fig H in S1 Text). Results for the simpler geometric mean model were very similar.

The pitfalls of relying on reported cases or test positivity rate alone to estimate the course of the epidemic are illustrated for five states (MN, VA, WI, KY, and TN) where reported cases and positivity trends were in opposite directions in May or December (Fig I in S1 Text). Specifically, in May, reported cases were rising substantially in MN, VA, and WI at the same time that the test positivity rate was declining, the testing rate was increasing, and the model-predicted total prevalence (from either the primary random effects model or the simpler geometric mean model) was flat or decreasing (Fig I in S1 Text). By contrast, in December, KY and TN both showed declining reported case rates while positivity was increasing, and our model predicted that COVID-19 prevalence was actually flat or increasing during this time. In the first scenario, the increase in reported cases was due to expanding testing rates; in the second, the decrease in reported cases was due to declining testing rates.
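A small numerical sketch can make this divergence concrete. The Python snippet below uses made-up numbers (not data from any state) and the unweighted geometric mean (n = ½) as the prevalence index; the function name `geometric_mean_index` is hypothetical.

```python
import math

def geometric_mean_index(positivity, case_rate):
    """Unweighted geometric mean of positivity and per-capita case rate (n = 1/2)."""
    return math.sqrt(positivity * case_rate)

# Hypothetical scenario: testing doubles, so reported cases rise 40%
# while positivity falls from 10% to 7%.
before = geometric_mean_index(0.10, 20e-5)
after = geometric_mean_index(0.07, 28e-5)
change = after / before - 1.0   # relative change in the prevalence index
```

In this illustration the case rate rises by 40% while the geometric mean index changes by about 1%, which is the kind of pattern described for MN, VA, and WI in May.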

Reported case rates and test positivity rates have been widely used to inform or justify public health decisions, such as increasing or relaxing non-pharmaceutical interventions, for the control of the COVID-19 pandemic in the US [26, 27]. A recent report of the National Academies of Sciences, Engineering, and Medicine (NASEM) has urged caution about the reliability/validity of directly using data such as reported case rates and test positivity rates to inform decision making for COVID-19 [1]. Though these data are usually readily available, the NASEM report concludes that they are likely to substantially underestimate or overestimate the real state of disease spread [1]. Therefore, there is a critical need to develop simple and more reliable data-driven metrics/approaches to inform local public health decision-making.

We have developed a simple semi-empirical approach to estimate the prevalence and seroprevalence of COVID-19 infections in a population using only reported cases and testing rates that does not require developing and maintaining a complex, data-driven mathematical model. Based on a simple hypothesis that the bias in test positivity is a convex, negative power function of the testing rate, we find that the undiagnosed COVID-19 prevalence, with a 1-week lag, is well-approximated by the (weighted or unweighted) geometric mean of the positivity rate and the reported case rate averaged over the last 2 weeks (Eq 4). Seroprevalence can be calculated by taking a cumulative sum while accounting for the duration between infection and seropositivity, a period of typically 2 weeks, ongoing diagnoses as reflected in the testing rate, and a state-specific offset accounting for missed infections in the early part of the pandemic prior to establishment of regular testing (Eqs 10 and 11). Our model resulted in an accurate fit to recently available state-level seroprevalence data from across the U.S. Additionally, the prevalence estimates of our semi-empirical model were shown to compare favorably to those from two data-driven epidemiological models. We estimate the nation-wide total prevalence rate as of December 31, 2020 to be 1.4% [CrI: 1.0%-1.9%], corresponding to a test positivity bias of around 15, and nation-wide seroprevalence to be 13.2% [CrI: 12.3%-14.2%], so that cumulative reported cases correspond to approximately one-third of actual past infections. At the state level, estimated seroprevalence was 1.4 to 7 times cumulative reported cases. These estimates compare favorably to those previously published using more complicated approaches [28, 29].

Our analysis suggests that public health policy related to either non-pharmaceutical (masking and social distancing) or pharmaceutical interventions (vaccination) may be informed by available data in three main ways:

• First, decline in either positivity rate or reported case rates alone is insufficient to infer that prevalence is declining. In the case where one is increasing and one is decreasing, our model suggests that the direction of their geometric mean is a better indicator of increasing or decreasing prevalence (Fig I in S1 Text). Reported cases are particularly unreliable indicators when population testing rates are increasing or decreasing substantially, and at low testing rates, when the positivity rate bias is higher.

• Second, reported cases, test positivity, and testing rates should be publicly reported at the county or municipal level in order to provide local governments, health agencies, medical personnel, and the public with the necessary information to evaluate local pandemic conditions. Currently, only reported cases are routinely provided at the local level, with positivity and testing rates aggregated (often inconsistently) only at the state level.

• Finally, seroprevalence estimates can play a key role in forecasting future potential spread of the pandemic and threshold vaccination coverage needed to stamp out disease transmission at the state or community-level.

As with any model, ours has a number of limitations. The most significant limitation is the lack of more comprehensive, random sampling-based data with which to further validate the model. However, our model did accurately fit all the available seroprevalence data, including recent CDC data at multiple time points across all 50 states and the District of Columbia [6]. As further validation of the approach, we applied our model internationally to 15 countries for which both nation-wide seroprevalence data (Table F in S1 Text) and daily testing data were available early in the pandemic (March-August). The 95% CrI for our model, using the random effects posterior distributions from our U.S. state-level calibration, covered all the seroprevalence data except for Russia (Fig J in S1 Text), suggesting that this approach might be more broadly applicable, though requiring nation-specific calibration. With respect to prevalence, we could only compare against epidemiologic model-based estimates of prevalence due to the lack of random sample-based surveillance data. However, we believe this limitation is mitigated by our use of two independent estimates with completely different model structures, one of which is a more traditional extended-SEIR model, and the other of which is a "semi-mechanistic" model partially statistical in nature. Another important limitation is the relatively limited range of testing rate observations for most U.S. states. For this reason, we cannot guarantee that our results can be extrapolated to substantially higher testing rates. However, with higher testing rates, the difference between test positivity rates and reported case rates would decrease and reduce the effect of greater uncertainty in the degree of bias between the test positivity rate and the lagged prevalence. Our model also does not account for the potential impact of population movement on seroprevalence. In- and out-flow of seropositive individuals could alter a state's seropositivity rate. While population movement may have marginal impact on COVID-19 seroprevalence in most states/countries because of mobility restrictions, some US states such as New York have experienced a significant increase in population out-flow during the pandemic [30]. This population movement may explain in part the reduction in seroprevalence observed in New York. Moreover, our model did not account for the impact of rates of false-positives and false-negatives on COVID-19 prevalence and how these rates may change with testing methods/strategies. However, if time series data on false-positive and false-negative rates were available, these could be easily incorporated into our modeling framework. We anticipate that imperfect test accuracy (the sensitivity and specificity of diagnostic testing) is likely to have a minimal impact on our results. Finally, for simplicity, our model assumes that the power parameter n of the bias function b(t) remains constant during the course of the epidemic. This assumption can be relaxed by allowing the power term to change as testing behavior and strategies and/or infection prevalence change over time, for example by using a stepwise function in which the value of n is constant over periods of marginal changes in testing strategies and behavior. Future work can account for these different factors and could also extend the current framework to explicitly account for the impact of vaccination on estimating disease prevalence and seroprevalence.

[Fig 4 caption, partial] Fig F in S1 Text. The maps were generated using the R package usmap https://cran.r-project.org/web/packages/usmap/index.html (GPL-3), which uses shape files from the U.S. Census Bureau (the link provided in documentation is here: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html). https://doi.org/10.1371/journal.pcbi.1009374.g004

In conclusion, we found that the undiagnosed COVID-19 prevalence is well-approximated by the geometric mean of the positivity rate and the reported case rate, and that seroprevalence can be estimated by taking a cumulative sum while accounting for the duration between infection and seropositivity, a period of typically 2 weeks, ongoing diagnoses, and a state-specific offset. The use of this simple, reliable, and easy-to-communicate approach to estimating COVID-19 prevalence and seroprevalence will improve the ability to make public health decisions that effectively respond to the ongoing COVID-19 pandemic in the U.S.

First, we develop a model for infection prevalence. Test positivity rate $P_{+,\tau}(t) = N_{+,\tau}(t)/N_{test,\tau}(t)$ is defined as the percentage of diagnostic tests administered over a given period τ between t−τ and t that are positive, where time t is discretized by day (we use τ-averaged testing data throughout our analysis to smooth out day-to-day variations in reporting, including weekend effects). We hypothesize that, because testing is mainly done through passive case finding (i.e., only those considered more likely to be infected due to symptoms, contacts, etc., are tested), $P_{+,\tau}(t)$ is correlated to the lagged prevalence $I_U(t - t_{lag})/N$ of undiagnosed COVID-19-infected persons in the population, where N is the population size, with a time-dependent bias parameter b(t):

$$P_{+,\tau}(t) = b(t)\,\frac{I_U(t - t_{lag})}{N} \qquad (1)$$

Conceptually, this relationship is shown in Fig 1A and 1B. As shown in Fig 1C, we also hypothesize that the bias parameter b(t) is inversely related to the testing rate $\Lambda_\tau(t) = N_{test,\tau}(t)/N$ over the same period τ. At a testing rate of 1, where everyone is tested, there is no bias, so b = 1. On the other hand, for low testing rates, the bias is likely to be high, as mostly severely ill individuals will be tested. We assume large-scale passive testing as a baseline testing rate for our model, which is consistent with COVID-19 outbreak response in the US. Under this condition, increases in the testing rate from baseline, which reflect more active testing/contact tracing efforts, will preferentially increase the infected population testing rate relative to the general population testing rate; so b(t) may decline more rapidly than at higher testing rates, as there is "diminishing return" from increased testing. Thus, for simplicity, we assume that b(t) is a convex function of $\Lambda_\tau(t)$. We therefore model the bias as a negative power function of the testing rate:

$$b(t) = \left[\Lambda_\tau(t)\right]^{-n} \qquad (2)$$

with n restricted between 0 and 1. Though other more complex functional forms could be used, the inverse power function we chose has the advantage that the limit of n = 0 reflects no bias (random sampling) and the limit n = 1 reflects the case that everyone infected is tested. While this appears to imply an unbounded bias as the testing rate goes to zero, as shown below, our model will naturally limit the bias parameter when test positivity is 100%. Combining Eqs (1) and (2) and rearranging leads to the following relationship between test positivity and the undiagnosed infectious population:

$$\frac{I_U(t - t_{lag})}{N} = P_{+,\tau}(t)\,\left[\Lambda_\tau(t)\right]^{n} \qquad (3)$$

Additionally, because test positivity and the testing rate share a term $N_{test,\tau}(t)$, Eq (3) can be further rearranged as

$$\frac{I_U(t - t_{lag})}{N} = P_{+,\tau}(t)^{1-n}\,\left[P_{+,\tau}(t)\,\Lambda_\tau(t)\right]^{n} = P_{+,\tau}(t)^{1-n}\,C_{+,\tau}(t)^{n} \qquad (4)$$

where the last term is the reported cases per capita $C_{+,\tau}(t) = N_{+,\tau}(t)/N$. Thus, our hypothesis predicts that the infectious population is proportional to a weighted geometric mean of the positivity rate and the reported case rate, with n = ½ corresponding to equal weighting (simple geometric mean). For n = 1 the reported cases per capita is equal to the lagged undiagnosed prevalence rate regardless of the underlying disease dynamics and prevalence in the population. Such a scenario will likely occur only when everyone is tested.
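The relationships in Eqs (1)-(4) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' R implementation; the function names and the example counts (200 positives out of 2,000 daily tests in a population of 1,000,000) are hypothetical.

```python
import numpy as np

def undiagnosed_prevalence(n_pos, n_test, population, n=0.5):
    """Eq (4): lagged undiagnosed prevalence I_U(t - t_lag)/N.

    n_pos and n_test are tau-averaged daily positive tests and total tests.
    """
    n_pos = np.asarray(n_pos, dtype=float)
    n_test = np.asarray(n_test, dtype=float)
    positivity = n_pos / n_test        # P_{+,tau}(t)
    case_rate = n_pos / population     # C_{+,tau}(t)
    return positivity ** (1.0 - n) * case_rate ** n

def positivity_bias(n_test, population, n=0.5):
    """Eq (2): bias b(t) as a negative power of the testing rate."""
    testing_rate = np.asarray(n_test, dtype=float) / population
    return testing_rate ** (-n)

# Hypothetical inputs, for illustration only.
prev = undiagnosed_prevalence(200, 2000, 1_000_000)
b = positivity_bias(2000, 1_000_000)
```

With these inputs the positivity is 10%, and dividing it by the bias b recovers the same prevalence as the geometric mean, as Eq (1) requires. Setting n = 1 returns the case rate and n = 0 returns the positivity, matching the two limits discussed above.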

To obtain the overall infection prevalence, we need to add diagnosed infectious cases. We assume a recovery period after diagnosis of $T_{rec}$ = 10 days, consistent with the CDC quarantine recommendation for COVID-19 infection [31], so the diagnosed cases from the last $T_{rec}$ days constitute the active diagnosed infections $I_D$:

$$I_D(t) = \sum_{t'=t-T_{rec}}^{t-1} N_{+,\tau}(t') \qquad (5)$$

Note that t′ = t is not included because on the day individuals are diagnosed, they are considered part of the undiagnosed prevalence (i.e., testing is "sampling without replacement" of the undiagnosed population).
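The windowed sum for active diagnosed infections can be sketched as follows. This is a hypothetical Python helper, assuming a zero-indexed list of daily (τ-averaged) positive counts; truncating the window at the start of the series is an added assumption for early days.

```python
def active_diagnosed(daily_positives, t, t_rec=10):
    """Sum of cases diagnosed on days t - t_rec .. t - 1 (day t itself excluded)."""
    start = max(0, t - t_rec)          # truncate the window at the start of the data
    return sum(daily_positives[start:t])

# Hypothetical series: 1 case on day 0, 2 on day 1, ..., 20 on day 19.
daily = list(range(1, 21))
i_d_day15 = active_diagnosed(daily, 15)   # covers days 5..14
```

The slice `daily_positives[start:t]` stops at day t − 1, which reflects the convention that individuals diagnosed on day t are still counted as undiagnosed on that day.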

We can also rearrange Eq (1) and view the bias parameter as the relative efficacy of testing infected individuals compared to the general population:

$$b(t) = \frac{N_{+,\tau}(t)/I_U(t - t_{lag})}{N_{test,\tau}(t)/N} = \frac{\Lambda_{I,\tau,t_{lag}}(t)}{\Lambda_\tau(t)} \qquad (6)$$

Here, $\Lambda_{I,\tau,t_{lag}}(t)$ is the daily rate of testing of infectious individuals (with averaging time τ and lag $t_{lag}$), whereas $\Lambda_\tau(t)$ is the daily rate of testing of the general population, as previously defined. Thus, the bias reflects the extent to which infectious individuals are "preferentially" tested through passive case finding. Moreover, due to the way $I_U(t - t_{lag})$ is calculated, when positivity is 100% so that $N_{+,\tau}(t) = N_{test,\tau}(t)$, the bias appropriately equals $N/I_U(t - t_{lag})$.

We use this semi-empirical model for infection prevalence to estimate undiagnosed seroprevalence $SP_U(t)$ as follows. Assuming a time interval between infection and seropositivity of $T_{inf}$, at each time point t, we can subdivide the undiagnosed infection prevalence $I_U$ into $T_{inf}$ "sub-compartments" $I_{U,m}$ (m = 1, ..., $T_{inf}$) (see Fig K in S1 Text). Given the daily testing rate of infectious individuals $\Lambda_{I,\tau,t_{lag}}(t)$, the number of individuals in subsequent sub-compartments declines by a factor (1 − Λ) as diagnoses occur (leaving $I_U$ for $I_D$), so the sub-compartment sizes are:

$$I_{U,m}(t) = I_U(t)\,\frac{\left(1-\Lambda_{I,\tau,t_{lag}}(t)\right)^{m-1}}{\sum_{m'=1}^{T_{inf}} \left(1-\Lambda_{I,\tau,t_{lag}}(t)\right)^{m'-1}} \qquad (7)$$

Thus, the number of undiagnosed individuals who become newly undiagnosed seropositive each day is simply the number in the last sub-compartment $I_{U,T_{inf}}$ multiplied by another factor of (1 − Λ), which simplifies to

$$SP_U(t) = SP_U(t-1) + \frac{I_U(t-1)}{\sum_{m'=1}^{T_{inf}} \left(1-\Lambda_{I,\tau,t_{lag}}(t-1)\right)^{-m'}} \qquad (8)$$

Setting γ = 1/(1 − Λ), replacing the summation with the formula for the sum of a geometric sequence ($T_{inf}$ terms, first term and common ratio both equal to γ), and defining the sum as a time-dependent "effective" time $T_{eff} = \gamma\,(1-\gamma^{T_{inf}})/(1-\gamma)$, Eq (8) becomes

$$SP_U(t) = SP_U(t-1) + \frac{I_U(t-1)}{T_{eff}(t-1)} \qquad (9)$$

Therefore the fraction of $I_U$ becoming seropositive each day (while remaining undiagnosed) is $1/T_{eff}$ (see Fig K in S1 Text). As testing rates approach 0, so that virtually everyone remains undiagnosed, this fraction approaches $1/T_{inf}$, as would be calculated considering $I_U$ as a single "well-mixed" compartment. Additionally, as an initial condition, we allow for an offset $SP_o$ for missed infections during the early part of the pandemic before regular and large-scale testing was established. Therefore, combining with Eq (4) gives the undiagnosed seroprevalence rate as:

$$\frac{SP_U(t)}{N} = SP_o + \sum_{t' \le t-1} \frac{P_{+,\tau}(t'+t_{lag})^{1-n}\,C_{+,\tau}(t'+t_{lag})^{n}}{T_{eff}(t')} \qquad (10)$$
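The "effective" time can be checked numerically. The sketch below is illustrative Python with hypothetical function names; it computes $T_{eff}$ both from the closed-form geometric-series expression and by direct summation of the terms in Eq (8), using example values Λ = 0.05 and $T_{inf}$ = 14 days that are assumptions, not fitted values.

```python
def effective_time(test_rate_infected, t_inf):
    """Closed form: T_eff = gamma * (1 - gamma**T_inf) / (1 - gamma)."""
    gamma = 1.0 / (1.0 - test_rate_infected)
    if abs(gamma - 1.0) < 1e-12:       # no testing: the sum is T_inf terms of 1
        return float(t_inf)
    return gamma * (1.0 - gamma ** t_inf) / (1.0 - gamma)

def effective_time_direct(test_rate_infected, t_inf):
    """Direct sum of (1 - Lambda)**(-m) for m = 1 .. T_inf (integer T_inf)."""
    return sum((1.0 - test_rate_infected) ** (-m) for m in range(1, t_inf + 1))

t_eff_example = effective_time(0.05, 14)   # hypothetical inputs
```

With no testing of infectious individuals the function returns $T_{inf}$ itself, and any positive testing rate gives $T_{eff}$ larger than $T_{inf}$, so a smaller fraction of $I_U$ becomes seropositive while still undiagnosed.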

For the diagnosed seroprevalence $SP_D$, we make the simplifying assumption that it is equal to the cumulative reported cases lagged by the mean time interval between infection and seropositivity $T_{inf}$:

$$\frac{SP_D(t)}{N} = \sum_{t' \le t-T_{inf}} C_{+,\tau}(t') \qquad (11)$$
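The two seroprevalence components can be accumulated day by day, as in the following Python sketch. It is a simplified illustration, not the authors' code: the inputs are assumed to be already aligned in time (the $t_{lag}$ shift in Eq (4) is taken as applied upstream), $T_{inf}$ is assumed to be a whole number of days, and all names and example values are hypothetical.

```python
def seroprevalence(prev_undiag, case_rate, t_eff, t_inf, sp_offset=0.0):
    """Return per-capita (SP_U, SP_D) series for days 0 .. len(prev_undiag) - 1.

    prev_undiag[t] = I_U(t)/N, case_rate[t] = reported cases per capita on day t,
    t_eff[t] = effective time on day t, sp_offset = early missed infections SP_o.
    """
    sp_u, sp_d = [], []
    running_u = sp_offset
    for t in range(len(prev_undiag)):
        if t > 0:
            # daily increment: yesterday's undiagnosed prevalence over T_eff
            running_u += prev_undiag[t - 1] / t_eff[t - 1]
        sp_u.append(running_u)
        cutoff = t - t_inf             # last day whose cases count as seropositive
        sp_d.append(sum(case_rate[: cutoff + 1]) if cutoff >= 0 else 0.0)
    return sp_u, sp_d

# Hypothetical constant inputs over 30 days.
sp_u, sp_d = seroprevalence([0.01] * 30, [0.0002] * 30, [15.0] * 30,
                            t_inf=14, sp_offset=0.005)
total = [u + d for u, d in zip(sp_u, sp_d)]   # SP = SP_U + SP_D
```

The undiagnosed component starts at the offset and grows each day, while the diagnosed component stays at zero until $T_{inf}$ days of reported cases have elapsed.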

Eqs (4), (5), (10) and (11) therefore comprise the complete semi-empirical model for overall infection prevalence ($I = I_D + I_U$) and seroprevalence ($SP = SP_D + SP_U$) based solely on average positivity $P_{+,\tau}(t)$, averaged reported case rates $C_{+,\tau}(t)$, and corresponding reported cases $N_{+,\tau}(t)$, which we calculate from data obtained from the COVID Tracking Project [32]. We fix the averaging time τ at 14 days, and the lag time $t_{lag}$ = τ/2 at 7 days, so the semi-empirical model has only three remaining free parameters: the power parameter n, the infection-to-seropositive time interval $T_{inf}$, and the initial condition for seroprevalence $SP_o$. We consider two variations of the model: the primary "random effects" model in which n and $SP_o$ are considered as random effects across states and a simpler "geometric mean" model with a fixed n = ½ so that $I_U$ is the geometric mean of positivity and case rates. For both variations, a single value of $T_{inf}$ across states is used. We conducted sensitivity analyses for different values of the averaging time τ (7 and 28 days instead of 14 days); the posterior parameter estimates for n and $SP_o$ and the seroprevalence predictions were almost indistinguishable across different averaging times, while the infection prevalence predictions were much noisier using an averaging time of 7 days but little changed using 28 days (Figs L-O in S1 Text).

To calibrate and validate the model, we utilized state-wide seroprevalence data, which has only recently become available for all 50 states and the District of Columbia (Table A in S1 Text). Specifically, we fitted our model using data collected from 9-March-2020 to 15-Nov-2020 and validated our model predictions by comparing them to data collected from 9-Nov-2020 to 4-Jan-2021 that were not used for model fitting (the overlap in dates is due to overlapping end dates and start dates of CDC data collection rounds). The likelihood function assumes independent log-normal distributed errors given an observed and model-predicted seroprevalence. The log-transformed variance of the likelihood distribution was calculated as the sum of the reported error variance in the data (estimated from reported 95% CI for each observation) and a fitted residual error variance. We used a Bayesian MCMC approach to calibrate the model parameters (see Table 1 for prior and posterior distributions) and the potential scale reduction factor (PSRF) was used to assess convergence, with a value of <1.2 regarded as adequate [33, 34] . Additional details about model calibration and validation are found in S1 Text.
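The structure of this likelihood can be sketched for a single observation as follows. This Python snippet is an assumption-laden illustration: converting the reported 95% CI to a log-scale standard deviation by dividing its log-width by 2 × 1.96 is one plausible reading of "estimated from reported 95% CI", the Jacobian term that does not depend on model parameters is omitted, and the function name and example numbers are hypothetical.

```python
import math

def log_likelihood(observed, ci_low, ci_high, predicted, resid_sd):
    """Log-normal log-likelihood of one seroprevalence observation."""
    # Assumed conversion of the reported 95% CI to a log-scale standard deviation.
    data_sd = (math.log(ci_high) - math.log(ci_low)) / (2.0 * 1.96)
    # Total log-scale variance: reported data error plus fitted residual error.
    var = data_sd ** 2 + resid_sd ** 2
    z2 = (math.log(observed) - math.log(predicted)) ** 2 / var
    return -0.5 * (math.log(2.0 * math.pi * var) + z2)

# Hypothetical observation of 5% (95% CI 4%-6.25%) and two model predictions.
ll_close = log_likelihood(0.05, 0.04, 0.0625, predicted=0.05, resid_sd=0.3)
ll_far = log_likelihood(0.05, 0.04, 0.0625, predicted=0.10, resid_sd=0.3)
```

Summing such terms over all observations gives the independent-errors log-likelihood that an MCMC sampler would evaluate at each proposed parameter set.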

We compare prevalence estimates of our model to estimates from a Bayesian extended-SEIR model [23] and the Imperial model [35]. This was done by comparing the log-transformed posterior median estimates for each model for their overlapping time intervals (March 12 to July 22 for the extended-SEIR model and March 12 to July 20 for the Imperial model). The model performance was quantified by the residual standard error on the log-transformed values between models, the corresponding coefficient of variation, and the R-squared statistic. The extended-SEIR model [23] was calibrated to US state-level reported cases and deaths through a Markov Chain Monte Carlo (MCMC) approach using Metropolis-within-Gibbs sampling. The model explicitly estimated underreported symptomatic/mild symptomatic cases in each state and the District of Columbia. The Imperial model [35] uses a Bayesian semi-mechanistic model calibrated to US state-level reported deaths. Model calibration was done using an MCMC approach with an adaptive Hamiltonian Monte Carlo (HMC) sampler. The model back-calculates cases from estimated deaths through an estimated infection fatality rate. This approach implicitly accounts for under-reported cases. Both of these are Bayesian models, and we use these models' posterior distributions for comparison.
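The comparison metrics can be sketched in Python as below. The RSE/CV pairs reported in the Results (0.67 with 75%, and 0.66 with 74%) are consistent with the standard log-normal relation CV = sqrt(exp(RSE²) − 1), which is assumed here; the exact degrees-of-freedom convention used for the RSE is not stated in the text, so the simple root-mean-square form and the function names are assumptions.

```python
import math

def rse_log(ours, other):
    """Root-mean-square difference between log-transformed posterior medians."""
    diffs = [math.log(a) - math.log(b) for a, b in zip(ours, other)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def cv_from_rse(rse):
    """Coefficient of variation implied by a log-scale standard error."""
    return math.sqrt(math.exp(rse ** 2) - 1.0)

cv_seir = cv_from_rse(0.67)       # RSE reported versus the extended-SEIR model
cv_imperial = cv_from_rse(0.66)   # RSE reported versus the Imperial model
```

Rounded to two decimals, these evaluate to 0.75 and 0.74, matching the CVs quoted for the two model comparisons.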

Our model can be used to estimate the degree of bias in current measures of prevalence (test positivity and reported case rates) and seroprevalence (cumulative reported cases). The over-reporting bias of test positivity as a measure of prevalence is already given in Eq (2). The under-reporting bias of reported case rates can be calculated by rearranging Eq (4),

$$C_{+,\tau}(t) = b(t)^{(n-1)/n}\,\frac{I_U(t - t_{lag})}{N} \qquad (12)$$

so the under-reporting bias is $b(t)^{(n-1)/n}$, which is equal to $b(t)^{-1}$ for n = ½. The implied bias from cumulative reported cases as a measure of seroprevalence is calculated by dividing the sum of $C_{+,\tau}(t)$ by the seroprevalence estimated by Eqs (10) and (11).
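Both bias measures reduce to one-line calculations, sketched here in Python. The function names are hypothetical, and the cumulative case rate of 4.4% used with the 13.2% seroprevalence is a made-up input chosen only to illustrate a ratio near one third.

```python
def case_underreporting_factor(bias, n=0.5):
    """Ratio of reported case rate to lagged undiagnosed prevalence: b**((n-1)/n)."""
    return bias ** ((n - 1.0) / n)

def cumulative_reporting_ratio(cumulative_case_rate, seroprevalence):
    """Fraction of estimated past infections captured by cumulative reported cases."""
    return cumulative_case_rate / seroprevalence

factor = case_underreporting_factor(15.0)            # positivity bias of 15, n = 1/2
ratio = cumulative_reporting_ratio(0.044, 0.132)     # hypothetical cumulative case rate
```

For n = ½ the factor is simply the reciprocal of the positivity bias, so a bias of 15 implies that the reported case rate is about one fifteenth of the lagged undiagnosed prevalence.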

All analyses were performed using the R statistical software (R version 3.6.1) in RStudio (Version 1.2.1335). We have implemented our model in an online dashboard (https://wchiu.shinyapps.io/COVID-19-Prevalence-and-Seroprevalence/) to enable easy access to our results.

Ethical approval was not required for this work.

Supporting information S1 Text. Supplemental Methods. Table A. State-wide seroprevalence calibration data. Table B. State-wide seroprevalence validation data.

Evaluating Data Types: A Guide for Decision Makers using Data to Understand the Extent and Spread of COVID-19

Seroprevalence of Antibodies to SARS-CoV-2 in 10 Sites in the United States

Population Point Prevalence of SARS-CoV-2 Infection Based on a Statewide Random Sample-Indiana

Cumulative incidence and diagnosis of SARS-CoV-2 infection in New York

Prevalence of SARS-CoV-2 antibodies in a large nationwide sample of patients on dialysis in the USA: a cross-sectional study

Estimated SARS-CoV-2 Seroprevalence in the US as of

COVID-19) Testing-Statistics and Research-Our World in Data

How Bad Is the Coronavirus Outbreak? Here's a Key Number

What Is the Active Prevalence of COVID-19?

Modelling the positive testing rate of COVID-19 in South Africa Using A Semi-Parametric Smoother for Binomial Data

Predictive Capacity of COVID-19 Test Positivity Rate

Relationship of Test Positivity Rates with COVID-19 Epidemic Dynamics

How Bad Is the Coronavirus Outbreak? Here's a Key Number.-The Atlantic

Considerations for implementing and adjusting public health and social measures in the context of COVID-19

Underdetection of cases of COVID-19 in France threatens epidemic control

SARS-CoV-2 antibody testing for estimating COVID-19 prevalence in the population

Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe

COVID-19 Mathematical Modeling | COVID-19 | CDC

State-level tracking of COVID-19 in the United States-Imperial College London

State-level impact of social distancing and testing on COVID-19 in the United States

Urban Air Pollution

Case-Fatality and Mortality Rates in the United States

Association of Public Interest in Preventive Measures and Increased COVID-19 Cases After the Expiration of Stay-at-Home Orders: A Cross-Sectional Study

CDC Activities and Initiatives Supporting the COVID-19 Response and the President's Plan for Opening America Up Again

Substantial underestimation of SARS-CoV-2 infection in the United States

Estimated incidence of COVID-19 illness and hospitalization-United States

Census Estimates Show Population Decline in 16 States | The Pew Charitable Trusts

Science Brief: Options to Reduce Quarantine for Contacts of Persons with SARS-CoV-2 Infection Using Symptom Monitoring and Diagnostic Testing | CDC

The COVID Tracking Project | The COVID Tracking Project

Inference from Iterative Simulation Using Multiple Sequences

General Methods for Monitoring Convergence of Iterative Simulations

State-level tracking of COVID-19 in the United States

Serology Testing for COVID-19 at CDC | CDC

State-level tracking of COVID-19 in the United States

We thank the COVID Tracking Project and Our World in Data for compiling COVID-19 case and testing data and providing it to the public.