key: cord-1015949-z96qw3tl authors: Kou, S C; Yang, Shihao; Chang, Chia-Jung; Ho, Teck-Hua; Graver, Lisa title: Unmasking the Actual COVID-19 Case Count date: 2020-05-15 journal: Clin Infect Dis DOI: 10.1093/cid/ciaa580 sha: b149c2c7dc69c588bee6a7b18835bebd850476a7 doc_id: 1015949 cord_uid: z96qw3tl

This report presents a novel approach to estimate the number of COVID-19 cases, including undocumented infections, in the US, by combining CDC's influenza-like illness surveillance data with aggregated prescription data. We estimated that the cumulative number of COVID-19 cases in the US by April 4 was above 2.5 million.

During the COVID-19 pandemic, many infections with mild to no symptoms are not reported due to various factors, including limited testing [1, 2]. There is a critical need to estimate the true scale of the pandemic for hot-spot detection, resource allocation and intervention planning. Existing modeling approaches use epidemiology data [2] and digital technology/data [3] [4] [5] to estimate the scale of COVID-19.

In this report we present a novel approach to estimate the total number of COVID-19 cases, including undocumented infections, in the United States (US) by comparing (a) data from the US Centers for Disease Control and Prevention (CDC) Outpatient Influenza-like Illness Surveillance Network (ILINet), which targets all influenza-like illness (ILI), overlapping with COVID-19, against (b) the aggregated prescription data of oseltamivir [6], which targets influenza only. Our model indicates that the current official numbers are severe underestimates: we estimate that by the week ending March 21 there were over 1.3 million total COVID-19 infections in the US and that by the week ending April 4, there were over 2.5 million total infections in the US.

CDC defines ILI as "fever and a cough and/or a sore throat without a known cause other than influenza" [7], which covers the common symptoms of COVID-19.
CDC generates weekly reports on the ILI level [7] and conducts laboratory-based influenza virologic surveillance. Prior to mid-February 2020, these two surveillance measures moved in the same direction. Since mid-February, however, the two measures have diverged, with the difference between ILI and laboratory-confirmed influenza activity attributable to COVID-19 [7]. If we can obtain an accurate measure of the influenza level, we can then use the difference between the reported ILI level and the estimated influenza level to estimate the level of new COVID-19 cases on a weekly basis.

We used aggregated weekly prescription data of oseltamivir, which is prescribed to treat influenza A and B but not COVID-19, to estimate the influenza level. Specifically, we used a linear model to calibrate the CDC-reported ILI level to the oseltamivir prescription data from January 2010 to mid-February 2020, and then produced estimates of influenza activity for mid-February to March 2020 (see the figure). Our estimated influenza level (blue line) closely matches the CDC-reported ILI level (black line) (correlation 0.974) prior to mid-February 2020, but significant gaps between the two levels (red and black lines) emerge after mid-February, which can be attributed to COVID-19.

For the week ending March 21, 2020, we estimated that 47% of the reported ILI level could be from COVID-19, which corresponds to ~855,000 new symptomatic cases in the US. As the official confirmed number of new cases was 17,450 for that week [8], this result implies that there were more than 800,000 unreported symptomatic cases. The figure also shows that the cumulative number of COVID-19 symptomatic cases in the US was estimated to be above 2 million by the week ending March 28 and above 2.5 million by the week ending April 4.
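The calibrate-and-subtract idea described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the ordinary least-squares fit via `numpy.polyfit`, and the synthetic numbers are all assumptions made for illustration. In particular, the scale-up step assumes that the 8.5% ILINet sampling rate and the 50% care-seeking rate quoted in the figure caption combine multiplicatively, which the report does not spell out.

```python
import numpy as np


def fit_linear_calibration(prescriptions, ili_level):
    """Least-squares fit of ili_level ~ intercept + slope * prescriptions.

    Hypothetical helper: the report only says "a linear model" was used.
    """
    slope, intercept = np.polyfit(prescriptions, ili_level, deg=1)
    return intercept, slope


def excess_ili_fraction(ili_observed, flu_estimated):
    """Share of the observed ILI level not explained by estimated influenza.

    Negative gaps are clipped to zero so the share stays non-negative.
    """
    ili_observed = np.asarray(ili_observed, dtype=float)
    excess = np.clip(ili_observed - flu_estimated, 0.0, None)
    return excess / ili_observed


def scale_to_cases(excess_ili_visits, sampling_rate=0.085, care_seeking_rate=0.50):
    """Scale excess ILI visits seen in ILINet up to symptomatic cases.

    Assumption: visits are divided by the product of the two reported rates.
    """
    return excess_ili_visits / (sampling_rate * care_seeking_rate)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic pre-outbreak history (NOT real IQVIA or ILINet data).
    rx_hist = rng.uniform(10.0, 200.0, size=300)  # prescriptions, arbitrary units
    ili_hist = 1.0 + 0.02 * rx_hist + rng.normal(0.0, 0.1, size=300)
    intercept, slope = fit_linear_calibration(rx_hist, ili_hist)

    # One hypothetical post-outbreak week.
    rx_week, ili_week, ili_visits = 100.0, 5.0, 60_000
    flu_week = intercept + slope * rx_week
    frac = float(excess_ili_fraction(ili_week, flu_week))
    print(f"COVID-attributable share of ILI: {frac:.2f}")
    print(f"Estimated new symptomatic cases: {scale_to_cases(frac * ili_visits):,.0f}")
```

Under these assumptions, the weekly estimate is driven entirely by the gap between the observed ILI level and the prescription-based influenza estimate, so any error in the calibration passes directly into the case count.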
Our results show that the official numbers are severe underestimates, a conclusion that appears to be supported by a recent large-scale screening study covering more than 6% of the Icelandic population [9] and by an antibody survey in Santa Clara County, California (although caution has been raised about that study's design and potential sampling bias) [10]. Our study targeted symptomatic COVID-19 cases, as we used the CDC-reported percentage of patients with symptomatic illness who would seek medical care in our estimation. Therefore, if we consider the substantial presymptomatic and asymptomatic cases revealed by the Icelandic study [9], the total COVID-19 infections in the US are likely to be even higher than our estimates.

Our estimation method is simple and intuitive. It contrasts the CDC-reported ILI level with the influenza level estimated from influenza-specific prescription data to obtain an estimate of the COVID-19 level. Our approach combines the traditional syndromic surveillance system with data from pharmacy prescriptions. It provides a feasible solution for estimating unreported COVID-19 cases with mild symptoms.

One limitation of our model is that the estimate might become more conservative over time due to administrative/government interventions. Towards the start of April, the syndromic surveillance system ILINet was increasingly affected by changes in the healthcare system, including increased use of telemedicine, the recommendation to limit hospital visits to severe illness only, and tightened social distancing. These changes affect the total number of hospital visits, patients' inclination to seek outpatient healthcare and doctors' medication prescription. Thus, our estimates for early to mid-March could be more accurate, as these changes had not yet taken place, and our estimates would serve as a lower bound for the symptomatic cases of COVID-19 in later weeks.
Our study indicates the feasibility of estimating the COVID-19 case count using multiple data sources. This approach can be used in conjunction with approaches that use digital data sources for COVID-19 case estimation [11, 12]. COVID-19 presents an unprecedented challenge. Conquering it requires unprecedented levels of collaboration and data sharing across government agencies, research institutes and the private sector.

Figure. The estimated influenza level before and after mid-February 2020. Prior to mid-February 2020, our estimated influenza level, the blue line, closely matches the CDC-reported ILI level, the black line, but significant gaps between the two levels, the red and black lines, emerge after mid-February, which can be attributed to COVID-19. To estimate the COVID-19 weekly case counts shown in the figure, we used the ILI total counts reported in ILINet, the reported 8.5% sampling rate of ILINet and the reported 50% ± 8% rate of persons with symptomatic ILI seeking medical care for their illness. For the reported rates, see https://www.cdc.gov/flu/about/burden/preliminary-in-season-estimates.htm and https://www.cdc.gov/flu/about/burden/how-cdc-estimates.htm.

References
1. Coronavirus latest: confirmed cases cross the one-million mark
2. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2)
3. Share mobile and social-media data to curb COVID-19
4. Aggregated mobility data could help fight COVID-19
5. Digital technology and COVID-19
6. Data Source: IQVIA, IQVIA National Prescription Audit™
7. Influenza Surveillance System: Purpose and Methods. 2020. Available at
8. Situation update worldwide
9. Spread of SARS-CoV-2 in the Icelandic Population
10. COVID-19 Antibody Seroprevalence
11. Accurate estimation of influenza epidemics using Google search data via ARGO
12. Opinion | Google Searches Can Help Us Find Emerging Covid-19 Outbreaks