key: cord-206391-1dj285h8
authors: Yan, Donghui; Xu, Ying; Wang, Pei
title: Estimating the Number of Infected Cases in COVID-19 Pandemic
date: 2020-05-24
doc_id: 206391
cord_uid: 1dj285h8

The COVID-19 pandemic has caused major disturbance to human life. An important reason behind the widespread social anxiety is the huge uncertainty about the pandemic. One major uncertainty is how many people, or what percentage of the population, have been infected. There are published and frequently updated data on various statistics of the pandemic at the local, country, or global level. However, for various reasons, many cases are not included in those reported numbers. We propose a structured approach for estimating the number of unreported cases, in which we distinguish cases that arrive late in the reported numbers from those that had mild or no symptoms and thus were never captured by any medical system. We use post-report data to estimate the former and population matching to estimate the latter. We estimate that the reported number of infected cases in the US should be multiplied by a factor of 220.54% as of Apr 20, 2020. The infection ratio in the US population is estimated to be 0.53%, implying a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO.

With its quick spread at the global scale, the COVID-19 pandemic has become one of the most tragic disasters in human history, with 2.74 million confirmed cases worldwide and a death toll of 192k as of April 24, 2020. These numbers are still rising in multiple countries. The riskiest aspects of the coronavirus are its long incubation period and the existence of a large number of asymptomatic cases. These cause a substantial proportion of infected cases to go untracked by medical systems.
For better policy making and disease control, as well as to reduce widespread public speculation about the extent of the disease spread, it is of significant interest to estimate the missing counts. Specifically, as the pandemic gradually comes under control, the world is considering the resumption of normal business. This requires a prudent assessment of the potential risk. Inevitably, such an assessment involves estimating the number of asymptomatic cases while such cases are still active. However, estimating the number of undocumented cases is very challenging, precisely because of the long incubation period and the asymptomatic cases. In this work, we present a structured approach for this estimation task and give approximate estimates at the US national and state levels.

The remainder of this paper is organized as follows. In Section 2, we describe our approach. This is followed by a presentation of results in Section 3. Finally, we conclude in Section 4.

Statistically, the estimation of the number of unreported cases is related to the problem of inference with missing data [5] or censored data [2]. However, certain characteristics of the coronavirus epidemiology allow us to take a different approach. We adopt a structured approach, inspired by the diagnostic analysis in remote sensing studies [8], where the errors in land use classification were decomposed according to their sources. Our approach is illustrated in Figure 1. The missing counts in the reported numbers come from two sources. The first consists of cases for which, at the report date, the symptoms were not severe enough for the affected individuals to seek testing; however, they would eventually visit a medical facility and be tested for coronavirus infection. We call such cases type I cases, and the waiting period before onset is termed the incubation (or dormant) period.
This is illustrated by the filled blue bars in Figure 1. At the time of the report, all such cases are still dormant and thus missing from the reported number. The second source of unreported cases consists of those who were infected but were either unaware of it or had symptoms too mild to prompt a visit to a medical facility, and who later recovered without any particular medical treatment. We call such cases type II cases. Type II cases never show up in any reported numbers, thus leaving few clues for estimation. But we cannot overlook such cases, because their number could be large and such individuals form an important source of infection. For the rest of this section, we describe our method for estimating each of the two types of cases.

The estimation of the number of type I cases is facilitated by a crucial observation. Though not included in the reported number while in the dormant period, such cases will eventually be reported when the symptoms become severe enough that the individuals have to seek medical treatment. By that time, the cases missed at the original report date (a few days earlier) will be counted as infected cases at some later report date (though one would not know at which particular report date). Such numbers should have been included at the original report date but surface only several days later; for this reason we call them delayed counts. If there is a way to estimate such delayed counts or their total, then one can estimate the number of type I cases for the original report date.

It is instructive to consider a simple ideal case where all infected cases have a dormant period of 7 days. In this ideal case, the numbers $D_1, D_2, \ldots, D_6$ in Figure 1 are exactly the numbers of cases that were at their 6th, 5th, ..., 1st day of infection, respectively, when counted at the original report date.
If we assume that the incubation period is 7 days, then these account for all the type I cases missed at the original report date, and they are reflected exactly in the numbers of newly reported cases during the 6-day time window following the original report date. So the total number of type I cases at the original report date can be calculated as their sum, $D_1 + D_2 + \cdots + D_6$.

The reality is, however, more complicated. First, the length of the dormant period varies across individual cases. Also, during the post-report time window, newly infected cases may arise and be reported. Thus the number of newly reported cases on any particular day within this time window may be mixed, in the sense that it includes both cases infected before the report date (which were in the dormant period) and cases infected after it. The former do not pose a problem, as such cases are counted towards $\hat{D}_{type1}$ anyway, though cases infected on the same day may now contribute to different $D_i$'s. The latter are undesirable but can be corrected, to a certain extent, by the truncating effect when we sum the counts only over a post-report time window of length $T$ days. That is, cases with a dormant period extending more than $T$ days post-report are truncated and not included in $\hat{D}_{type1}$, and the total count of such truncated cases is 'cancelled out' by the newly infected cases within the post-report time window of a properly chosen length $T$. This leads to the following estimate of the number of type I cases:

$$\hat{D}_{type1} = \sum_{i=1}^{T} D_i,$$

where $D_i$ is now the number of cases reported on the $i$-th day after the original report date. We can let $T$ take a value around or slightly larger than the mean incubation period. In the appendix, we justify why our estimate, $\hat{D}_{type1}$, is a reasonable one. If we keep track of the delayed estimate $\hat{D}_{type1}$ through time, we obtain a time series which, upon smoothing, can be used to estimate the current missing type I counts.
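The delayed-count estimate described above can be illustrated with a minimal Python sketch. The function names and the daily counts are hypothetical and chosen for illustration only; the sketch assumes a list of daily newly reported cases and simply sums the counts of the $T$ days that follow the report date, then divides by the cumulative number reported up to that date.

```python
from typing import Sequence


def estimate_type1(daily_new: Sequence[float], report_idx: int, T: int = 7) -> float:
    """Sum of newly reported cases on days 1..T after the report date.

    daily_new[k] is the number of cases newly reported on day k;
    report_idx is the index of the original report date.
    """
    window = daily_new[report_idx + 1 : report_idx + 1 + T]
    if len(window) < T:
        # The estimate needs T full days of post-report data.
        raise ValueError("need T full days of post-report data")
    return float(sum(window))


def type1_ratio(daily_new: Sequence[float], report_idx: int, T: int = 7) -> float:
    """Ratio of the type I estimate to the cumulative reported count."""
    reported = sum(daily_new[: report_idx + 1])
    return estimate_type1(daily_new, report_idx, T) / reported


# Hypothetical daily counts, for illustration only (not real data).
daily_new = [10, 12, 15, 20, 18, 22, 25, 24, 26, 30, 28, 27]
estimate = estimate_type1(daily_new, report_idx=3, T=7)
ratio = type1_ratio(daily_new, report_idx=3, T=7)
print(estimate, round(ratio, 3))
```

Calling `type1_ratio` for a sequence of consecutive report dates would give the kind of time series that the text proposes to smooth; the smoothing step itself is not sketched here.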
For such an estimation to be feasible, we have two requirements. One is that the daily reported counts do not change too abruptly over time. Thus, our approach would not work well while the infection trend is still rising very rapidly. During such a period, the safest strategy might be to strictly enforce social distancing. But as the overall situation gradually comes under control, our estimation applies. The other is knowledge of the duration of the incubation period. According to several studies [3, 1, 4], the incubation period has a median of around 4-7 days. While further studies or data analysis are required to confirm this, we take $T = 7$ in our estimation. Additionally, one should be cautious that our estimation is valid only if testing for coronavirus is sufficiently widespread in the population of interest; insufficient testing would lead to an underestimate of the number of type I cases.

In Figure 2, we plot the ratio of estimated type I cases to the reported number of cases for Connecticut (CT) and Massachusetts (MA) since Mar 8, 2020. These two states were chosen because they are similar in many respects, so we expect their ratios of type I cases to reported cases to be similar. In Figure 2, there is an initial difference in the ratios of type I cases in these two states, which we attribute to the late response and the small number of cases tested in CT. Later, the two states exhibit strikingly similar trends, as expected. We also explore the effect of the choice of $T$ by comparing $T = 6$ and $T = 7$. Again, the resulting estimates are initially fairly different, which reflects the rapid spread of coronavirus and the rapid rise in infected cases. Gradually, the difference between the resulting estimates diminishes, which implies that the choice of $T = 7$ leads to a fairly stable estimate at late stages of the disease spread. A similar observation can be made for the estimation of type I cases in the US.
This is shown in the right panel of Figure 2.

The estimation of the number of type II cases is extremely challenging, as there is barely anything observable. Our main strategy is based on the matching of population statistics, using what we observe well to infer what is missing. When we group reported infection cases by age, we notice a significant discrepancy between the age distribution of reported cases and that of the US population. We expect that, while people in most age groups in the population have a similar chance of being infected, type II cases occur much more often in the age groups 20-64 but rarely among people aged 65+. This is because people over 65 typically have a relatively weaker immune system along with some pre-existing medical conditions. Once they are infected with coronavirus, even a slight symptom would prompt them to seek medical treatment. As a result, such cases are very likely to be discovered. Thus reported counts for these age groups are more accurate and can serve as a reference to correct counts for other age groups. In contrast, cases in the 20-64 age groups are easily overlooked or go unnoticed unless their condition deteriorates. The reported counts for these age groups thus require a correction (termed the age correction).

The 85+ age group is more vulnerable to infection, as its members typically live in senior centers or extended-care nursing facilities which, as a matter of fact, have a very high risk of infection. The case statistics for this age group would be very thorough, but many in this age group get infected simply because they share a very confined living space with many other equally vulnerable seniors, and the infection of any one person (including staff) in a senior center will quickly spread to the rest (to a certain extent, one may think of this as a party of many people during COVID-19).
So statistics for this age group would not be a reliable reference for population matching, since people in other age groups have a very different mobility pattern (infants interact with the world through their parents and thus have a chance of infection not so different from that of the general population).

The main assumption we use for population matching is that all people aged 0-84 have a similar chance of being infected. As a result, the counts in different age groups should be proportional to their respective percentages in the population; we call such an approach population matching. Let $r_{pop}$ and $r_{case}$ be the proportions of the reference group in the population and in the reported cases, respectively, and let $x_{pop}$ and $x_{corrected}$ be the respective proportions for the target group. Then $r_{case} : r_{pop} = x_{corrected} : x_{pop}$, and the corrected percentage of infected cases for the target group can be calculated accordingly, that is, $x_{corrected} = x_{pop} \cdot r_{case} / r_{pop}$.

As we argued above, the case statistics for age groups 65-74 and 75-84 are reliable, but those for ages 0-64 are incomplete, with substantial missing data, so we use the reliable portion of the data to infer or correct statistics for the incomplete part. A simple calculation reveals that the age groups 65-74 and 75-84, according to Table 1, have a similar ratio of case percentage to population percentage, i.e., 9.00 : 4.70 ≈ 17.00 : 9.32. Thus, we can pool counts from these two groups and obtain $r_{case} : r_{pop} = (9.00 + 17.00)/(4.70 + 9.32) = 1.8544$. This yields the corrected ratios in the bottom row of Table 1. Adding up the numbers in the bottom row gives a total of 187.94%, implying that we should expand the reported counts by 87.94% in order for the reported case counts to match the population statistics across age groups. This gives the ratio of type II cases to reported cases.

An interesting question is whether the estimated counts of type II cases overlap with those of type I cases.
We claim that they do not, at least not significantly, so adding the estimated counts for type I and type II cases is valid. The reason is that type I cases still contribute to the reported numbers, albeit at a delayed time. These delayed cases can be thought of as a sample from the reported cases (assuming that the reported cases have stable proportions when broken down by age group). The inclusion of type I cases does not change the age-breakdown proportions. Thus, after the inclusion of type I cases, we still have the same age-breakdown proportions and still require an age correction.

We apply our approach to each of the 50 states and to the US as a whole. The data are available from Wikipedia [7]. Due to the large variation in population across states, we calculate the ratio of missing cases to the number of reported cases for individual states. The ratio for type I cases is shown in Figure 3. Due to the lack of reported case data by age group for individual states, we use the overall estimate, which is 87.94% according to the discussion in Section 2.2, as the ratio of type II cases for all states. The overall ratio of type I cases for the US is estimated to be 32.60%. Combined with the 87.94% ratio for type II cases, this gives an estimated ratio of missing cases to the reported number of 120.54%. In other words, the reported number should be multiplied by a factor of 220.54% to reflect the true number of infected cases.

With the unreported numbers estimated, we can estimate the infection ratio, defined as the ratio of the number of infected cases to the population. The overall infection ratio of the US is estimated to be 0.53%, or 1.75 million cases, as of Apr 20, 2020. If we use the associated death toll of about 50k, then the case mortality rate is calculated as 2.85%, which is close to the WHO's suggested estimate of 3.4% [6]. The infection ratios for individual states are visualized as a heatmap in Figure 4.
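The arithmetic behind the population-matching correction and the combined correction factor can be retraced with a short Python sketch. It uses only figures quoted in the text (the pooled reference-group percentages, the 32.60% and 87.94% ratios, the 1.75 million estimate, and the approximate death toll of 50k). The function names and the 10% target-group population share are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def reference_ratio(case_pcts: Sequence[float], pop_pcts: Sequence[float]) -> float:
    """Pooled r_case : r_pop for the reference age groups."""
    return sum(case_pcts) / sum(pop_pcts)


def corrected_share(x_pop: float, ratio: float) -> float:
    """Population matching: x_corrected = x_pop * (r_case / r_pop)."""
    return x_pop * ratio


# Reference groups 65-74 and 75-84, with the percentages quoted in the text.
ratio = reference_ratio([9.00, 17.00], [4.70, 9.32])

# Hypothetical target group making up 10% of the population (illustration only).
example_corrected = corrected_share(10.0, ratio)

# Ratios of missing cases to reported cases, as quoted in the text.
type1_ratio = 0.3260
type2_ratio = 0.8794
missing_ratio = type1_ratio + type2_ratio
correction_factor = 1.0 + missing_ratio

# Case mortality rate from the rounded figures in the text.
estimated_infected = 1.75e6
deaths = 50e3  # "about 50k" in the text
case_mortality = deaths / estimated_infected

print(round(ratio, 4), round(correction_factor, 4), round(case_mortality, 4))
```

With these rounded inputs, the sketch gives a pooled ratio of about 1.85, a correction factor of about 2.2054, and a case mortality rate of about 2.86%, in line with the rounded figures reported in the text.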
Heavily hit states are NY, NJ, CT, RI, MA, and LA, with infection ratios estimated at 2.61%, 2.11%, 1.22%, 1.15%, 1.31%, and 1.04%, respectively, as of Apr 20, 2020. The trends of the infection ratio and the number of cases over time for these states are shown in Figure 5. It can be seen that, except for LA, the infection ratios of the other five states are still increasing rapidly. NJ shows a growth pattern similar to that of NY, while the three New England states, CT, RI, and MA, are similar to one another.

We have proposed a structured approach for estimating the number of infected cases not included in the reported number at a given date. We distinguish two types of 'missing' cases: those that were infected but are still in the dormant period, and asymptomatic cases that later recover on their own without medical treatment. The numbers of these two types of cases are estimated by accumulating reported counts within a properly chosen post-report time window and by population matching, respectively. The reported number of infected cases in the US, as of Apr 20, 2020, should be multiplied by a factor of 220.54%. The overall infection ratio of the US is estimated to be 0.53%, with a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO. Our estimation can potentially be used for risk assessment. The infant age group may be worth further consideration: infants are at much lower risk than other age groups because they interact with the rest of the world through their parents, so the number of cases for this group may need to be adjusted accordingly to reflect the true risk.

[1] Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China.
[2] Nonparametric estimation from incomplete observations.
[3] The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application.
[4] Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data.
[5] Statistical Analysis with Missing Data.
[6] World Health Organization. Coronavirus (COVID-19) Mortality Rate.
[7] COVID-19 pandemic in the United States.
[8] A structured approach to the analysis of remote sensing images.

In this appendix, we give a justification of our estimation algorithm. We show that the error between our estimate, $\hat{D}_{type1}$, of the number of type I cases and its actual value $D_{type1}$ is small in expectation under reasonable assumptions about the distribution of the incubation period.

Denote by the random variable $X$ the length of the incubation period, and for simplicity further assume that $X \ge 0$ takes integer values. Let $N_{-1}, N_{-2}, \ldots$ denote the numbers of cases that were infected one day, two days, and so on before the report date (for which we use $N_0$). Here we limit ourselves to type I cases, as we can conveniently assume that type II cases have an infinite incubation period. Then the expected number of cases that are discovered during the time window of $T$ days following the original report date, $D_a$, is given by Equation (1) [equation missing in the source text].

For simplicity, we assume that all the $EN_{-i}$'s take a constant value $N$. We feel that this is not too unrealistic, as we expect that the number of newly infected cases per day does not vary too much once the pandemic reaches a stable stage (counts from the very distant past would be small, but they make up a very small fraction of the total and so can be ignored). Also, we abuse notation a bit by using the $D_{\cdot}$'s to also indicate the expected values of the associated random variables; the exact meaning will be clear from the context. Then Equation (1) can be rewritten in terms of $N$ [equation missing in the source text]. Under the same assumption, the number of new cases generated during the post-report time window of length $T$ days is

$$D_{new} = [\ldots]\,(T - i) \cdot P(i - 1 \le X < i) + N \cdot P(X < T),$$

where the leading part of the expression is missing in the source text. Thus the total number of reported cases during the $T$-day post-report time window is

$$D_a + D_{new} = (T - 1) \cdot N + N \cdot P(X \le T) = TN - N \cdot P(X > T).$$

Assuming that the random variable $X$ has a finite mean $\mu = EX$, we have $P(X > T) \le EX/T = \mu/T$, implying that the estimated number of type I cases satisfies

$$TN - \mu N / T \le \hat{D}_{type1} \le TN.$$

The actual number of cases that have accumulated but not been discovered before the report date consists of missing cases from the previous $T$ days and those from even earlier, which has an expected value

$$D_{type1} = \mu N. \qquad (2)$$

Equation (2) indicates that the mean number of type I cases equals the product of the mean daily number of infected type I cases and the mean length of the incubation period, which is consistent with the ideal case discussed in Section 2.1. Let $T = (1 + \epsilon)\mu$. Then we have the following error bound for the estimated number of type I cases:

$$|\hat{D}_{type1} - D_{type1}| \le \mu N \cdot \max(\epsilon, |\epsilon - 1/T|).$$

It follows that the relative error of the estimate satisfies

$$\frac{|\hat{D}_{type1} - D_{type1}|}{D_{type1}} \le \max(\epsilon, |\epsilon - 1/T|) = \max\big(\epsilon, |\epsilon - 1/((1 + \epsilon)\mu)|\big).$$

For a given $\mu$, one can pick $\epsilon$ to optimize the above bound. For example, when $\mu = 7$, one can take $\epsilon = 0.07$ to achieve a relative error bound of about 7%.
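The relative error bound from the appendix, $\max(\epsilon, |\epsilon - 1/T|)$ with $T = (1 + \epsilon)\mu$, can be evaluated numerically with a small Python sketch. The function names are hypothetical. The helper `balanced_eps` is an addition for illustration rather than part of the paper: it solves $\epsilon = 1/T - \epsilon$, the point where the two terms of the maximum coincide, which is one natural way to choose $\epsilon$ for a given $\mu$.

```python
import math


def relative_error_bound(mu: float, eps: float) -> float:
    """Evaluate max(eps, |eps - 1/T|) with T = (1 + eps) * mu."""
    T = (1.0 + eps) * mu
    return max(eps, abs(eps - 1.0 / T))


def balanced_eps(mu: float) -> float:
    """Positive root of 2*mu*eps**2 + 2*mu*eps - 1 = 0.

    At this eps the two terms inside the max are equal
    (eps = 1/T - eps), assuming eps >= 0.
    """
    return (-mu + math.sqrt(mu * mu + 2.0 * mu)) / (2.0 * mu)


mu = 7.0
bound_at_007 = relative_error_bound(mu, 0.07)
eps_star = balanced_eps(mu)
bound_at_star = relative_error_bound(mu, eps_star)
print(bound_at_007, eps_star, bound_at_star)
```

For $\mu = 7$, the bound at $\epsilon = 0.07$ evaluates to 0.07, and the balancing value of $\epsilon$ is slightly below 0.07, which agrees with the relative error bound of about 7% quoted in the appendix.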