key: cord-0816452-k3ylgwaf authors: Rosenberg, Eli S; Bradley, Heather M title: Improving surveillance estimates of COVID-19 incidence in the United States date: 2020-12-04 journal: Clin Infect Dis DOI: 10.1093/cid/ciaa1813 sha: 850486936a5754d96b6d71b70af7d61c1226f339 doc_id: 816452 cord_uid: k3ylgwaf

Assessment of the cumulative burden of SARS-CoV-2 infection across demographic groups and geography informs disease control policies, public health prevention efforts, and risk communication to the public. Estimating burden of disease is challenging, however, because reported diagnoses represent a fraction of total infections, as a function of symptomatic status along with patient and provider behaviors, compounded in this year's pandemic by time-varying conditions that impacted test-seeking and receipt. In this issue of Clinical Infectious Diseases, Reese et al. adapt a multiplier method that has previously been used by CDC to convert the reported 2.1% symptomatic US diagnoses through September 2020 into 16.2%, or 53 million, persons estimated infected [1]. This estimate is a staggering assessment of both how many Americans have been touched by this infection and how many remain vulnerable in the months ahead, necessitating ongoing widespread prevention efforts. To date, two study designs have been employed to measure cumulative incidence in US jurisdictions. Studies of clinical laboratory residual serum use new serological testing for SARS-CoV-2 antibodies to understand history of infection and can be combined with external data to estimate rates of diagnosis, hospitalization, and death [2, 3]. These studies use plentiful, available specimens to understand burden of infection across jurisdictions and time, but have limited variables for describing cumulative incidence by demographic features and for adjusting estimates for biases resulting from passive sampling of persons attending medical care during a pandemic.
Population-based serological studies that collect specimens from participants through combinations of in-person, drive-through, and at-home modalities offer the opportunity for more extensive survey data collection and representative sampling, but come with the disadvantages of cost and threatened validity due to the complexities of recruiting large samples in a time of misinformation and societal closure [4, 5, 6]. Data from both designs require adjustment for serological test

Similar to models developed for influenza and viral hepatitis surveillance, this process involves applying serial correction factors to an under-ascertained, and potentially otherwise biased, disease indicator universally available in surveillance data [9, 10]. In the case of COVID-19, the authors begin with mandatorily reported symptomatic COVID-19 cases and apply four levels of successive corrections for the probabilities of detection given testing (test sensitivity), test ordering given clinical presentation with symptoms, care seeking given symptoms, and symptom development given infection. They stratify parts of this process by subgroups defined by demography, geography, and time, which facilitates the following: 1) control for heterogeneity in these probabilities across subgroups by varying the correction factors applied, and 2) stratification of results to display the differential burden of disease across subgroups. We discuss below the implications of this approach as performed, as well as enhancements that could be afforded by improved surveillance data.

prior existence of this method and data inputs, we lament that CDC could have far earlier and routinely made such estimates available to inform the public and public health community. As with all models, this approach's robustness may be evaluated by the external validity of its estimates, made challenging by the lack of external estimates of total US infections, and by evaluating the quality of its approach and inputs.
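The serial correction factors described above can be sketched in a few lines of Python. This is a minimal illustration of the general multiplier idea, not the implementation used by Reese et al.: the function name, the parameter names, and the numeric values are all hypothetical, and the sketch assumes a single stratum with point estimates rather than the stratified, uncertainty-aware approach of the actual analysis.

```python
def estimate_infections(reported_symptomatic_cases,
                        p_detect, p_tested, p_seek_care, p_symptomatic):
    """Scale reported symptomatic cases up through four correction steps.

    All argument names are hypothetical labels for the four probabilities
    named in the text:
      p_detect      -- detection given testing (test sensitivity)
      p_tested      -- test ordering given presentation with symptoms
      p_seek_care   -- care seeking given symptoms
      p_symptomatic -- symptom development given infection
    """
    probabilities = {
        "p_detect": p_detect,
        "p_tested": p_tested,
        "p_seek_care": p_seek_care,
        "p_symptomatic": p_symptomatic,
    }
    for name, value in probabilities.items():
        if not 0 < value <= 1:
            raise ValueError(f"{name} must be in (0, 1], got {value}")

    # Each division undoes one stage of under-ascertainment.
    symptomatic_illnesses = reported_symptomatic_cases / (
        p_detect * p_tested * p_seek_care
    )
    # The final step adds back infections that never produced symptoms.
    total_infections = symptomatic_illnesses / p_symptomatic
    return symptomatic_illnesses, total_infections


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
symptomatic, total = estimate_infections(
    reported_symptomatic_cases=1000,
    p_detect=0.8, p_tested=0.5, p_seek_care=0.5, p_symptomatic=0.5,
)
print(f"symptomatic illnesses: {symptomatic:.0f}, total infections: {total:.0f}")
```

Because the corrections multiply, a modest error in any one probability propagates directly into the final estimate, which is why the quality of each input matters as much as the structure of the method.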
Enhancements in input surveillance data would increase robustness of this method for estimating COVID-19 burden of disease. To produce optimally informative results, more granular geographic and demographic data are needed both to estimate parameters and to create the strata within which they are applied. The authors apply parameters B (the extent to which symptomatic patients seek care) and C (the proportion of care-seeking patients who are tested for COVID-19) within HHS census regions, but the parameter estimates themselves are not available by region or other characteristics. These parameters likely vary substantially by geography and population characteristics including race/ethnicity and other social determinants of health. Race/ethnicity is a particularly important characteristic that should be accounted for in the results, because racial and ethnic minority populations have the highest rates of COVID-19 diagnosis per population and are also geographically concentrated in the U.S. [11]. For this to be feasible, more complete race/ethnicity data on case reports are required. Despite many calls for more complete case report data on race/ethnicity, including from CDC, only 52% of reported cases currently have race/ethnicity specified [11]. Varying the parameter estimates applied within racial/ethnic strata would also require data on care-seeking and testing behaviors by race/ethnicity. The extent to which participation in voluntary surveillance such as Flu Near You and COVID Near You varies by characteristics of underlying populations by geography is unclear. A national surveillance system that routinely collects information on COVID-19 prevention and testing behaviors, symptom status, and vaccine readiness from a representative sample, with the ability to produce estimates by state and population characteristics, is needed to optimally parameterize this multiplier method and would serve other important functions.
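To illustrate why stratum-specific estimates of parameters B and C matter, the following sketch compares scaled-up case counts under one pooled set of parameters against counts under stratum-specific parameters. Everything here is an assumption made for illustration: the strata, the case counts, and the probabilities are invented, and the helper names are hypothetical rather than drawn from Reese et al.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """One hypothetical subgroup (for example, a region or demographic group)."""
    name: str
    reported_cases: float
    p_seek_care: float  # parameter B: care seeking given symptoms
    p_tested: float     # parameter C: testing given care seeking


def scale_up(reported_cases, p_seek_care, p_tested):
    """Correct reported cases for incomplete care seeking and testing."""
    return reported_cases / (p_seek_care * p_tested)


def compare_pooled_vs_stratified(strata, pooled_p_seek_care, pooled_p_tested):
    """Return {stratum name: (pooled estimate, stratum-specific estimate)}."""
    return {
        s.name: (
            scale_up(s.reported_cases, pooled_p_seek_care, pooled_p_tested),
            scale_up(s.reported_cases, s.p_seek_care, s.p_tested),
        )
        for s in strata
    }


# Invented values: stratum "B" seeks care less often than stratum "A".
strata = [
    Stratum("A", reported_cases=100, p_seek_care=0.5, p_tested=0.5),
    Stratum("B", reported_cases=100, p_seek_care=0.25, p_tested=0.5),
]
result = compare_pooled_vs_stratified(
    strata, pooled_p_seek_care=0.4, pooled_p_tested=0.5
)
for name, (pooled, specific) in result.items():
    print(f"{name}: pooled={pooled:.0f}, stratum-specific={specific:.0f}")
```

In this toy setup, the pooled parameters understate the estimate for the stratum with lower care seeking and overstate it for the other, which is the direction of bias at issue when a single parameter value is applied across heterogeneous populations.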
As applied in the present model, use of the same parameters for patient care-seeking and provider testing behaviors across geographic areas may underestimate burden of disease in geographic areas with larger minority populations. This is particularly problematic for future resource allocation decisions, including for vaccine distribution, which may be made based on estimated infections. Robust estimates from the method employed by Reese et al. also rely on jurisdiction-wide standardization of surveillance practices in two areas: de-duplication and merging of case reports, and symptoms ascertainment. In terms of de-duplication, insufficient merging of case reports by person will result in overestimation of burden of disease. Further, because of latency between COVID-19 diagnosis and hospitalization, unless hospital tests are merged to the original case data, many persons ultimately hospitalized will be missing this information on their case report, leading to underestimation of hospitalization rates. Uniform practices for collecting data on symptoms are also required. In the present model, asymptomatic cases were excluded from diagnoses, and an asymptomatic fraction was applied to the total number of estimated asymptomatic cases across the U.S. The underlying, but unstated, assumption is that missing symptom data reflect asymptomatic infection and that the quality and completeness of these data are relatively uniform across jurisdictions and, ultimately, HHS regions. As advancements are made in COVID-19 data quality and surveillance data systems, one interim solution for burden of disease estimates resulting from a multiplier approach is to stratify the method's steps and estimates by jurisdiction based on completeness and quality of surveillance data. CDC has previously performed such stratified reporting for HIV clinical outcomes and drug overdose deaths [12, 13].
Data quality related to completeness of race/ethnicity and symptoms status, as well as procedures used to merge patient information across case reports, should be considered [14]. Reese et al. have provided us with an important set of COVID-19 burden of disease estimates that can continue to be improved over time as the quality and completeness of surveillance data also improve. While their estimate of 53 million infections is distressing, it indicates the U.S. is far from achieving herd immunity even with minimal assumptions about waning immunity. Swift and equitable vaccine distribution will be critical to curbing the U.S. epidemic, and continual improvements to burden of disease estimates and how they vary by person, time, and place will be needed to optimally allocate resources toward such efforts and monitor success.

This work was supported by the National Institute on Drug Abuse [R01DA051302]. Neither author has any potential conflicts to disclose.

1. Estimated incidence of COVID-19 illness and hospitalization - United States
2. Estimated SARS-CoV-2 Seroprevalence in the US as of
3. Repeated cross-sectional sero-monitoring of SARS-CoV-2 in New York City
4. Cumulative incidence and diagnosis of SARS-CoV-2 infection in New York
5. Seroprevalence of SARS-CoV-2-Specific Antibodies Among Adults
6. Protocol for a national probability survey using home specimen collection methods to assess prevalence and incidence of SARS-CoV-2 infection and antibody response
7. Changes in SARS-CoV-2 Spike versus Nucleoprotein Antibody Responses Impact the Estimates of Infections in Population-Based Seroprevalence Studies
8. Decline in SARS-CoV-2 Antibodies After Mild Infection Among Frontline Health Care Personnel in a Multistate Hospital
9. Estimates of the prevalence of pandemic (H1N1)
10. Estimating Acute Viral Hepatitis Infections From Nationally Reported Cases
11. Demographic Trends of COVID-19 cases and deaths in the US reported to CDC
12. Monitoring Selected National HIV Prevention and Care Objectives by Using HIV Surveillance Data: United States and 6 Dependent Areas
13. Drug and Opioid-Involved Overdose Deaths - United States
14. Tracking COVID-19 in the United States: Progress and Opportunities