key: cord-0968227-3uqi3fxk authors: Noh, J.; Danuser, G. title: Estimation of the fraction of COVID-19 infected people in U.S. states and countries worldwide date: 2020-09-28 journal: nan DOI: 10.1101/2020.09.26.20202382 sha: 9350ef88a10359cabdc860b71671d1f8957cec66 doc_id: 968227 cord_uid: 3uqi3fxk

Since the beginning of the COVID-19 pandemic, daily counts of confirmed cases and deaths have been publicly reported in real time to help control the spread of the virus. However, substantial undocumented infections have obscured the true prevalence of the virus. A machine learning framework was developed to estimate time courses of actual new COVID-19 cases and current infections in 50 countries and 50 U.S. states from reported test results and deaths, as well as published epidemiological parameters. Severe under-reporting of cases was found to be universal. For countries such as Belgium, Brazil, and the U.S., our framework estimates that ~10% of the population has been infected at some point. In U.S. states such as Louisiana, Georgia, and Florida, more than 4% of the population is estimated to be currently infected as of September 3, 2020, while in New York the fraction is 0.12%. Estimating the actual fraction of currently infected people is crucial for defining public health policies, which up to this point may have been misguided by the reliance on confirmed cases.

the second round. As some studies reported that antibodies decreased over time in some patients (14, 15), the second-round seroprevalence rates in the four states seemed to be unstable. Indeed, the rates became even smaller than in the first round in Utah and remained almost the same in Connecticut and Missouri. In comparison to the first-round seroprevalence rates in Louisiana and Missouri, the cumulative incidence rates seemed to be slightly underestimated.
In Utah, the estimates and seroprevalence rates showed a prominent discrepancy, where the significantly low incidence estimate by April 20 suggested the possibility of under-reported death tolls.

Applied across countries and U.S. states, the proposed framework estimated actual time courses of new infections and currently infected cases. In early April, the U.S. reported ~30,000 daily confirmed cases. In striking contrast, the proposed estimation suggested more than 400,000 actual daily cases, showing that the daily ascertainment at that time was less than 10% (Fig. 3A). As of September 3, 2020, 0.9% (0.5%-1.6%) of the U.S. population was estimated to be currently infected. In Brazil, the under-reporting was also severe early in the pandemic, but gradually improved, as in the U.S. As a result, the peak in actual daily cases seemed to have occurred between June 1 and June 8, 2020, reaching nearly 250,000 cases in contrast to ~25,000 confirmed daily cases (Fig. 3B). This peak in new infections was earlier than the peak in confirmed cases, which fell between July 27 and August 3. The currently infected cases in Brazil were estimated to be 2.3% (1.1%-3.9%) of the total population as of September 3, 2020. Among U.S. states, Louisiana showed the highest estimated fraction of currently infected people, 6.9% (3.7%-11.3%) as of September 3, 2020 (Fig. 3C). The first peak in daily new cases in Louisiana was reportedly ~1,500 around April 6, but the actual new cases at that time were estimated to be more than 30,000, indicating the severity of under-reporting in Louisiana during the month of April.

The copyright holder for this preprint (which was not certified by peer review), this version posted September 28, 2020, is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
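The daily ascertainment figures quoted above follow from a simple ratio of confirmed to estimated actual new cases. The sketch below only redoes that arithmetic on the rounded numbers given in the text; the dictionary labels and the function name `ascertainment_rate` are illustrative choices, and because the inputs are rounded ("more than", "nearly", "~"), the resulting rates are approximate.

```python
# Rounded (confirmed, estimated actual) daily new cases as quoted in the text.
# These are approximate inputs, so the resulting rates are approximate as well.
examples = {
    "U.S., early April": (30_000, 400_000),
    "Brazil, early June": (25_000, 250_000),
    "Louisiana, around April 6": (1_500, 30_000),
}


def ascertainment_rate(confirmed: float, actual: float) -> float:
    """Fraction of actual new infections that were captured as confirmed cases."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return confirmed / actual


rates = {name: ascertainment_rate(c, a) for name, (c, a) in examples.items()}

for name, rate in rates.items():
    print(f"{name}: roughly {rate:.1%} of actual new cases were confirmed")
```

With these rounded inputs, the U.S. value comes out below 10%, in line with the statement in the text.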
The severe under-ascertainment was universal across the 50 countries with the most confirmed cases and the 50 U.S. states. The ascertainment rates for the whole period until September 3, 2020, varied widely, from 5% in Italy to 99% in Qatar, and from 8% in Connecticut to 71% in Alaska (Fig. 4A). Among them, 25 countries, 19 U.S. states, and Washington D.C. showed an ascertainment rate of less than 20% for the entire time of the pandemic. Focusing on only the past two weeks, the ascertainment rates unfortunately have not improved much in these countries, while the recent rates of U.S. states had increased overall as of September 3, 2020. Interestingly, many of the countries with high ascertainment rates from the beginning of the outbreak were the ones that previously experienced Middle East respiratory syndrome

The under-reporting adjustment allowed us to monitor the actual severity of the virus spread across countries and U.S. states, and the estimated sizes of currently infected populations in particular helped to identify fast-changing COVID-19 hotspots. In Peru, Belgium, and Brazil, more than 10% of the population was estimated to have been infected at some point as of September 3, 2020 (Fig. 4B). Across U.S. states, the cumulative incidence rates ranged from 28.2% (14.0%-47.7%) in New Jersey to 0.9% (0.6%-1.5%) in Hawaii (Fig. 4B). As of September 3, 2020, the COVID-19 hotspots among U.S. states were estimated to be Louisiana, Georgia, and Florida, where currently infected cases were estimated to exceed 4% of the population.

The estimated fractions of current infections differentiated New Jersey from New York, both of which experienced severe early outbreaks (Fig. 4B). The confirmed new cases per 100,000 population were 3.8 in New Jersey and 3.7 in New York as of September 3, 2020, suggesting that the virus spread was under control in both states.
However, because of the differences in recent ascertainment rates between the two states (Fig. 4A), the fractions of currently infected people were 1.05% (0.52%-1.78%) in New Jersey and 0.12% (0.06%-0.20%) in New York, as of September 3, 2020. This reveals that New Jersey still has a considerable infected population, whereas New York has become one of the safest states.

Since the beginning of the COVID-19 pandemic, the Case-Fatality-Rates (CFRs) have displayed huge differences between countries, adding confusion about how deadly SARS-CoV-2 is. The crude CFRs, which are the ratios of total confirmed deaths to total confirmed cases, ranged from 13.0% in Italy to 0.05% in Singapore as of September 3, 2020. Our analysis now reveals that the variation in reported CFRs across the countries and U.S. states is primarily associated with the massive differences in ascertainment rates between the locations (Fig. 4C). The Spearman rank correlations between CFR and ascertainment rate were -98% and -97% (P-values < 0.0001) for the analyzed countries and the U.S. states, respectively. After adjustment for the under-reporting, the inferred IFRs, which were based on the assumed IFR of 0.66%, did not correlate with the ascertainment rates (Fig. S2). Thus, a high CFR in a region is shown to be a result of severe under-reporting of cases.

This study demonstrates that severe under-ascertainment has obscured the true severity of widespread COVID-19 all over the world. In the majority of the 50 countries, actual cumulative cases were estimated to be 5-20 times greater than the confirmed cases.
Given that the confirmed cases capture only the tip of the iceberg in the middle of the pandemic, the estimated sizes of current infections in this study provide crucial information for determining the regional severity of COVID-19, which can be misjudged when relying on confirmed cases alone.

Estimating actual numbers of COVID-19 infections from limited, under-reported data is challenging, especially when the framework must be applicable to many regions that have displayed diverse dynamic patterns in infections and ascertainment rates. As the pandemic progresses, the pipeline would need to be adapted to the increasing complexity of the infection data. The proposed estimation relies heavily on the published estimate of the IFR, which has large estimation uncertainty. The estimates of actual cases would become more accurate if the IFR estimate were tailored to a specific region and its uncertainty could be reduced. The pipeline takes only the simple input of confirmed cases and deaths. Depending on the datasets available in each region, the estimation of actual cases could be improved by incorporating additional information such as daily positivity rates of diagnostic testing or daily hospitalized cases.
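The dependence on the IFR noted above can be illustrated with a back-of-the-envelope calculation. This is a sketch of a death-based estimate only, not the full pipeline: it assumes that the number of infections implied by a death count is simply deaths divided by the IFR. The point value 0.66% is taken from the text; the lower and upper IFR values (0.39% and 1.33%) and the death count are illustrative assumptions that are not stated in this text.

```python
def cases_from_deaths(deaths: float, ifr: float) -> float:
    """Infections implied by a death count under a given infection-fatality-rate.

    Assumption for illustration: implied infections = deaths / IFR.
    """
    if not 0 < ifr <= 1:
        raise ValueError("ifr must be in (0, 1]")
    return deaths / ifr


IFR_POINT = 0.0066                   # assumed IFR quoted in the text
IFR_LOW, IFR_HIGH = 0.0039, 0.0133   # illustrative bounds (assumption)
deaths = 1_000                       # hypothetical death count

point = cases_from_deaths(deaths, IFR_POINT)
upper = cases_from_deaths(deaths, IFR_LOW)    # lower IFR implies more infections
lower = cases_from_deaths(deaths, IFR_HIGH)   # higher IFR implies fewer infections

print(f"implied infections: {point:,.0f} (range {lower:,.0f} to {upper:,.0f})")
print(f"upper/lower ratio: {upper / lower:.2f}")
```

Under these assumed bounds, the same death count is compatible with infection totals that differ by more than a factor of three, which is why a region-specific IFR with less uncertainty would tighten the estimates.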
In an epidemic process, a population is categorized into susceptible, infected, deceased, or recovered individuals. Counts of confirmed COVID-19 cases, deaths, and recoveries are insufficient to calculate the number of currently infected individuals (purple dotted box) because of substantial undocumented infections not captured by diagnostic tests. The input to the proposed framework is the daily counts of confirmed new cases and deaths (black boxes). Using pandemic parameters such as the Infection-Fatality-Rate and the mean duration periods from infection to death and recovery, the framework estimates the counts of actual new cases

[Figure: axis labels listing U.S. state abbreviations]
Materials and Methods

The proposed framework provides estimates of time courses of actual new cases and current infections based on daily reported counts of confirmed positive tests and deaths in a particular region and published estimates of key pandemic parameters such as the infection-fatality-rate (IFR) and the mean duration from infection to death and recovery. For the purpose of this study, data of daily confirmed cases and deaths for countries and U.S. states were taken from the repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (1), and the COVID Tracking Project (2), respectively. Upon availability of other more granular data sets, the proposed framework can also be applied to smaller populations. The country population data as of July 2019 were from the United Nations (3), and the population data for U.S. states as of July 2019 were from the U.S. Census Bureau (4).

To begin with, the counts of daily confirmed cases and deaths were averaged in a 7-day rolling window to remove weekend effects, which tend to yield systematic drops in the Saturday and Sunday values. To obtain initial guesses on actual counts of daily new cases and recoveries, we first combined the daily death counts and the IFR estimate ($\mathrm{IFR}_0 = 0.66\%$) that was presented by Verity et al.
(5) where $I_t^{(0)}$ is the derived initial estimate of the currently infected cases on day $t$. These initial time courses led to two other initial estimates for the daily ascertainment rates ($A_t^{(0)}$) and the daily ratios of the confirmed new cases to currently infected cases, referred to as detected transmission rates ($DT_t^{(0)}$):

$$A_t^{(0)} = \frac{\text{confirmed new cases on day } t}{AC_t^{(0)}}, \qquad DT_t^{(0)} = \frac{\text{confirmed new cases on day } t}{I_t^{(0)}},$$

with $AC_t^{(0)}$ denoting the initial estimate of the actual new cases on day $t$.

Both ratio estimates displayed a common increasing trend in many countries, probably indicating that the under-ascertainment was gradually improving as the testing capacities were increasing. The common trend could be exploited to obtain better estimates of the daily ascertainment rates.

An expectation-maximization (EM) algorithm was implemented to update the latent time courses involved in actual infections. To first extract the temporal trends of the estimated daily ascertainment rates and detected transmission rates, the two rate time courses were spline-smoothed. Since the noise levels around the temporal trends differed depending on regional population sizes, infection dynamics, and test reporting schemes, we applied multiple levels of smoothness and selected the optimal level after the complete EM computation as discussed below. For smoothing, we applied the R (6) function smooth.spline(). Since the smoothed estimates of the detected transmission rates and ascertainment rates ($DT_t^{(0)}(h)$, $A_t^{(0)}(h)$, where $h$ is a smoothness parameter) possessed a common trend, the estimated ascertainment rates could be improved by augmenting the information from $DT_t^{(0)}(h)$.
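The 7-day averaging and the two ratio time courses described above can be sketched as follows. This is a minimal illustration, not the authors' R code: the trailing alignment of the window, the handling of the first six days, and all input numbers are assumptions made here, and the function names are hypothetical.

```python
def rolling_mean_7(counts):
    """Trailing 7-day mean; the first six days average over the days available.

    The trailing alignment is an assumption; the text only states that a
    7-day rolling window was used.
    """
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - 6): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def daily_ratio(numerators, denominators):
    """Element-wise ratio; None where the denominator is not positive."""
    return [n / d if d > 0 else None for n, d in zip(numerators, denominators)]


# Hypothetical confirmed counts with a weekend dip, repeated for three weeks.
week = [100, 100, 100, 100, 100, 40, 60]
confirmed = rolling_mean_7(week * 3)

# Hypothetical initial estimates of actual new cases and current infections.
actual_new = [1_000.0] * len(confirmed)
currently_infected = [15_000.0] * len(confirmed)

ascertainment = daily_ratio(confirmed, actual_new)              # confirmed / actual new
detected_transmission = daily_ratio(confirmed, currently_infected)  # confirmed / infected

print(confirmed[6:9])
print(ascertainment[10], detected_transmission[10])
```

Once a full week is inside the window, the weekend dip no longer appears in the smoothed series, which is the effect the averaging step is meant to achieve.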
The following regression model was applied to find a functional relation from $DT_t^{(0)}(h)$ to $A_t^{(0)}(h)$:

$$A_t^{(0)}(h) = \beta_0 + \beta_1 DT_t^{(0)}(h) + \beta_2 \left(DT_t^{(0)}(h)\right)^2 + \epsilon_t, \quad \text{for } t = 1, 2, \ldots, T, \qquad \text{(Eq. 1)}$$

where the regression coefficients were estimated with the constraints $\beta_1, \beta_2 \ge 0$ using the R function penalized(). Then, the estimated coefficients ($\hat{\beta}_i^{(0)}$, $i = 0, 1, 2$) were used to update the initial ascertainment rates (Eq. 2).

The above EM steps enabled iterative updates of the latent time courses on actual infections. For each smoothness parameter $h$, the following EM algorithm was applied to obtain converged estimates (Fig. S1A). After completing EM iterations with multiple smoothness parameters ($h$), the converged time courses of currently infected cases were assessed using the corresponding daily rates of deaths among the infected cases. The framework selected the smoothness parameter value ($\hat{h}$) and the final estimate of current infections ($\hat{I}(\hat{h})$) that produced the smallest coefficient of variation (CV) of the daily death rates over time. Underlying this choice is the assumption that increasing variation in daily death rates would be less plausible.

To obtain 95%-confidence intervals (CIs) of the current infection estimates, the same initialization and EM iterations were applied using the lower/upper limits of the IFR estimate and the selected smoothness ($\hat{h}$). The workflow of the initialization, EM iterations, and calculation of CIs is illustrated with the case of the U.S. time courses (Fig. S1B).
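The selection of the smoothness parameter by the smallest CV of daily death rates can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the daily death rate is taken here to be same-day deaths divided by the estimated currently infected cases, the CV uses the population standard deviation, and the function names and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; CV is undefined")
    return pstdev(values) / m


def select_smoothness(deaths, infected_by_h):
    """Return the smoothness value whose death-rate time course has the smallest CV.

    deaths: daily death counts.
    infected_by_h: maps each candidate smoothness value to its converged
    time course of currently infected cases (same length as deaths).
    Assumption: daily death rate = deaths on day t / infected on day t.
    """
    cvs = {}
    for h, infected in infected_by_h.items():
        rates = [d / i for d, i in zip(deaths, infected) if i > 0]
        cvs[h] = coefficient_of_variation(rates)
    best_h = min(cvs, key=cvs.get)
    return best_h, cvs


# Hypothetical inputs: two candidate smoothness levels.
deaths = [10, 12, 14, 16]
infected_by_h = {
    0.5: [1000, 1200, 1400, 1600],  # death rate stays constant over time
    1.0: [1000, 1000, 2000, 1600],  # death rate fluctuates
}

best_h, cvs = select_smoothness(deaths, infected_by_h)
print(best_h, cvs)
```

In this toy case the first candidate yields a constant death rate and is therefore selected, mirroring the stated assumption that large variation in daily death rates is less plausible.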
The estimates of actual new infections and currently infected cases have been updated daily for 50 countries with the most confirmed cases and 50 U.S. states since August 12, 2020, in the GitHub repository (https://github.com/JungsikNoh/COVID19_Estimated-Size-of-Infectious-Population). All code, daily updated estimates, and visualizations are publicly available in this online repository.

Scatter plots between the inferred infection-fatality-rates (IFR) and the whole-period ascertainment rates for the 50 countries (left) and 50 U.S. states (right). The inferred IFR is the ratio of total confirmed deaths to the under-reporting-adjusted total number of cases on the date 18 days earlier, accounting for the mean duration from infection to death. Spearman rank

1. An interactive web-based dashboard to track COVID-19 in real time
2. The COVID Tracking Project
3. United Nations, Department of Economic and Social Affairs, Population Division
4. United States Census Bureau, Population Division
5. Estimates of the severity of coronavirus disease 2019: a model-based analysis
6. R: A language and environment for statistical computing: R Foundation for Statistical Computing

[Figure: EM workflow labels: Detected Transmission Rates $DT^{(k)}$; spline-smoothed Detected Transmission Rates ($DT^{(k)}(h)$) and Ascertainment Rates ($A^{(k)}(h)$); fit a regression function from $DT^{(k)}(h)$ to $A^{(k)}(h)$; Ascertainment Rates $A^{(k+1)}$; Actual Daily New Cases $AC^{(k+1)}$; Currently Infected Cases $I^{(k+1)}$]

[Figure: panel "US, as of 2020-09-03, CV of death rates: 40%"; y-axis: Percentage; x-axis: weekly dates from Jun 01 to Sep 07]