key: cord-0607464-60jkbasc authors: Alvarez, Agust'in; Fragal'a, Marina; Valdora, Marina title: Estimating the case fatality rate of a disease during the course of an epidemic with an application to COVID-19 in Argentina date: 2021-09-06 journal: nan DOI: nan sha: 79df6d98978e37ffafdb7d879363f3667299097b doc_id: 607464 cord_uid: 60jkbasc We present an accurate estimator of the case fatality rate that can be computed during the course of an epidemic outbreak and we prove its asymptotic properties. We apply it to the real case of covid-19 in Argentina during 2020 and to simulated data and show its very good performance. One of the most important questions when a new infectious disease arises, as Covid-19, is to know how deadly it is, that is to say, to know the proportion of infected people that dies. Note that the term infected includes confirmed and some unconfirmed cases, that are nevertheless infected. This proportion is known as infection fatality rate (ifr). This rate is a key epidemiological measure for quantifying disease severity and is particularly important during outbreaks of emerging infectious diseases such as Covid-19. It is difficult to estimate this rate during the course of an outbreak for several reasons: the great number of cases which are never diagnosed, bias from reporting delays and the time gap between diagnosis and death (see Baud et al. (2020) ). Since the Covid-19 information is limited to confirmed cases and, in general, the real number of infected people remains unknown due to the presence of asymptomatic cases or patients with very mild symptoms, we focus on the proportion of diagnosed people that dies. Traditionally, the "naive"case fatality rate over a certain period of time (a day, a week, a month, the whole epidemic) is understood as the proportion of people diagnosed over this period of time who die from the disease in the same period. The World Health Organization and many countries report this naive case fatality rate on a daily basis, considering the period of time from the beginning of the epidemic up to that day. The reason for using this rate is that it requires minimum information for computation and eventually equals the final case fatality rate when the epidemic is over, see Kim et al. (2021) . This rate has a tendency to underestimation, so many other estimators have been proposed to reduce the bias. There have been many attempts to define and calculate the case fatality rate for Covid-19, see for instance, Chang et al. (2020) and Shim et al. (2020) . Lipsitch et al. (2015) and Marschner (2021) define the case fatality rate as the probability of death from the disease among confirmed cases and discuss the potential biases in its estimation. Another attempt in this direction is made in Lee and Lim (2019) . Most of the rates mentioned so far, use data of daily recovered patients. The data base publicly available in Argentina does not include this information. Garske et al. (2009) proposed a definition of case fatality rate that can be computed with minimum information. The rate informed in Argentina daily during the Covid-19 pandemic underestimates the true case fatality rate since the estimation is made on day t by dividing the number of deceased people from Covid-19 until day t by the number of confirmed cases until day t. The denominator also includes confirmed cases that have not yet died from the disease, but will do so in the future. Thus, the delay between confirmation and death might result in bias when estimating the case fatality rate in this naive way. This bias may be high, especially when the estimation is made in a moment of rapid growth of confirmed cases or if the time between diagnosis and death is long. At the beginning of the outbreak, the number of cofirmed cases may double in a few days. However, only a small proportion of the patients that will finally die, do so in the first days after diagnosis. In this work we define the case fatality rate until day t (cf r(t)) as the expected proportion of confirmed cases that die or will die from Covid-19 infection among cases confirmed since the beginning of the epidemic until day t included. The defined cf r(t) could be called cumulative case fatality rate but, for simplicity, we will refer to it as case fatality rate by day t. We propose an unbiased estimator of cf r(t). This estimation is achieved by accounting for the expected future fatalities. Garske et al. (2009) attempt to correct the problem of the time gap between diagnosis and death. The estimator of cf r(t) they propose is unbiased if the probability of death among confirmed cases each day is constant in time but it may be biased if these probabilities change. In this work, we propose an estimator of cf r(t) that can be computed on day t and we show that it is unbiased even when the probability of death among confirmed cases is not constant in time. We apply our proposal, Garske et al's estimator and the naive estimator and compare them in the real case of Covid-19 in Argentina during 2020 as well as in simulated data and we show the very good performance of our proposal. The paper is organized as follows. In Section 2 we introduce the main definitions and notations, including the definition of cf r(t) together with the proposed estimators and confidence intervals for cf r(t). In Section 3 we report the results of a simulation study. In Section 4 we apply the proposed estimators to the real data set from Argentina in 2020. In section 5 we comment on how we estimate the distribution of time between diagnosis and death for patients who died. In Section 6 we summarize the concluding remarks. Finally, in the Appendix we present the proofs of the assymptotic properties of the proposed estimator. We shall denote random variables with upper case letters and non random parameters with lower case letters. We define • p d the probability of death among cases confirmed during day d. • c d the number of cases confirmed during day d. • M d,i a dichotomous variable that equals 1 if the i-th confirmed case on day d dies because of the disease and 0 if it does not. • M d,i (t) a dichotomous variable that equals 1 if the i-th confirmed case on day d has died because of the disease by day t and 0 if it has not, defined for d ≤ t. • M f (t) the total number of persons that die from Covid-19 infection among cases confirmed until day t included, once the epidemic has ended. • M (t) the number of confirmed cases that died from Covid-19 from the beginning of the epidemic (day 0), until day t inclusive. We call CF R N the naive estimator of the case fatality rate that is usually reported on day t. With this notation, this estimator is We define CF R F (t) as the proportion of cases confirmed unitl day t which finally die because of the disease, and refer to this proportion as the final case fatality rate by day t. In terms of the defined variables we have that Notice that CF R F (t) cannot be computed on day t; one would have to wait for all the diagnosed people by day t to recover or die. We define the case fatality rate by day t as the expected value of CF R F (t), i.e, cf r(t) = E (CF R F (t)). It is worth noting that the case fatality rate by t defined in this way is a population parameter and it is not observable, not even at the end of the epidemic. Using (1), we easily obtain cf r(t) is a weihted sum of daily case fatality rates p d , where each day's weight is the proportion of cases that was confirmed that day with respect to the total number of cases confirmed until day t. cf r(t) can be interpreted as the probability of dying from the disease for a randomly picked person among those diagnosed by day t. If p d = p (p d is constant all along the epidemic), cf r(t) = p. We remark that cf r(t) is the parameter of interest, of which the case fatality rate observed at the end of the epidemic is an observation. In order to estimate cf r(t) we define T d,i = Number of days since confirmation until death in the i-th case confirmed on day d. Dorigatti et al. (2020) . Note that we are allowing the cdf to change in time unlike Garske et al. (2009) . This seems realistic for several reasons; for instance, new treatments may lengthen the survival time, the sanitary system may collapse and this may shorten the survival time or the confirmation of cases may be faster, lengthening also the time from confirmation until death, among others. Our goal is to predict on day t the number of people that eventually will die, among confirmed cases until day t, that is to say M f (t), as defined in (1). The idea of our proposal is the following: we know that, among the cases confirmed on day d, with d ≤ t, the expected proportion that will have died by day t, among those that have died or will die eventually is F d (t − d). Therefore, dividing the cases confirmed during day d that have died by day t by F d (t − d), we will obtain a predictor of the number of confirmed cases during day d that will finally die. In concrete, we define the predictor of M f (t) as follows: We then estimate the case fatality rate by day t, cf r(t), by Note that the probability of dying from covid by day t for a case confirmed on day d can be expressed in the following way Straightforward calculations show that E(CF R(t)) = cf r(t) and, therefore, the proposed estimator is unbiased. ). However, since the sequence of variables satisfy the Lindemberg condition (under certain conditions, see appendix) we will be able to build an asymptotic (1 − α) · 100% confidence interval for the expected value of the average, which is cf r(t). The confidence interval takes the form CF R(t) and r t = t d=0 c d is the total number of variables. The estimator CF R(t) can only be used if F d is known. Since this is not the case when we work with real data, we will replace it by an estimatorF d , see Section 5. Therefore, the predictor we will actually use for real data is To build the confidence interval on day t we also need the p d values for 0 ≤ d ≤ t, which we do not have in a real data analysis. In such case, for t 0 ≤ t 1 , call cf r t1 t0 the expected proportion of people that dies from Covid-19 between those who are confirmed in the period of time from t 0 to t 1 days after the start of the epidemic. In order to estimate p d * on day t, for d * ≤ t we propose to use an estimator of cf r d * +3 d * −3 based on the same idea of the cumulative case fatality rate estimator CF R, as follows: We compute estimation (7) for days d * , with 3 ≤ d * ≤ t − 3 and we estimatê Estimation of p d is here used as an auxiliary calculus in order to estimate the variance of the confidence intervals but it is a problem of interest in its own since it allows to estimate the actual probability of dying from Covid-19 for the cases confirmed on day d. An analogous calculation can be done in order to estimate the daily probability of needing an intensive care unit (ICU), which might be useful in order to predict the number of people who will arrive to ICU. This may be the subject of future work. Garske et al. (2009) do not consider the possibility that the distribution of the time from confirmation to death, F d or the daily case fatality rate p d may vary with time. Their estimator of the case fatality rate is defined as follows . Where F is the distribution of the time from confirmation to death and M (t) is the random variable that counts the number of confirmed cases that died from the beginning of the epidemic (day 0), to day t. If we consider varying F d their estimator becomes . Equation (9) shows that the expected value of this estimator is a weighted mean of p d , but the weights are not the same as in cf r(t) and therefore, the estimator is not necessarily unbiased, unless p d = p for all d. Of course these estimators are also computed by replacing F or F d by their estimates. Despite this bias, Garske et al's estimator has some advantages if p d is constant. First, if the distribution function F d = F is known or estimated previously, it can be computed for day t using only the number of confirmed cases and the number of deaths until day t. The estimator we propose, on the other hand, requires knowing the number of deaths by day t, among all confirmed cases on day d, for all d ≤ t. To evaluate the performance of the proposed estimators, we performed a Monte Carlo study. The hypothetical epidemic develops in the following way. For the first 158 days it has the same number of daily cases c d as we had in Argentina from March 3rd until august 7th, 2020. During the subsequent 158 days, the daily cases diminish in such a way that the curve is symmetric. That is to say, c 158+d = c 157−d for 0 ≤ d ≤ 157. Then, from d ≥ 316 onwards, we set c d ≡ 0. For each confirmed case on day d we set its probability of dying from covid in p d . For p d we consider the following setting: where d * is the day in which p d changes and 315 is the day of the last confirmed case. We consider d * ∈ {90, 120} and we vary c 1 and c 2 in {0.02, 0.05, 0.1} with c 1 = c 2 . For those who die from the disease, we set a survival time since confirmation of F d ≡ F , using F = BN(µ = 10.79, r = 0.88). This distribution has been seen to provide a good fit in the real case of Argentina by August 2020, and the values of the parameters are the maximum likelihood estimates. In all cases we calulate all the estimators presented in Section 2.1, namely CF R N (t), CF R(t) and CF R G (t), as defined in equations (2), (5) and (8), respectively. In order to calculate CF R(t) and CF R G (t) we do it in two different ways. At a first instance we make the estimation using the known distribution of survival times F used to generate the data, and in such case we make the estimation for all times t, for 0 ≤ t ≤ t f , where t f is the final day of epidemic, i.e., every confirmed case has either died or recovered from Covid by t f . At the second instance we estimate the distribution F by what we call the "empirical estimation"F emp in the way described in section 5. Since for the empirical estimation of F d , we analyze the survival time of cases confirmed 45 days before day t (see Section 5), at this second instance we calculate the estimators for days t with 90 ≤ t ≤ t f . At both instances we calculate 95% confidence intervals for cf r(t) presented in Section 2.1: CF R(t) ± z 0.975 V (CF R(t)). Since V (CF R(t)) depends both on the daily case fatality rates p d and on the distribution functions F d for 0 ≤ d ≤ t, for the first instance estimation we use the known p d values and the known F d ≡ F distribution used to generate the data while for the second instance estimation we used the empirical estimationF emp as the estimate of the F d and we performed the estimation of the daily p d presented in (7). For the derivation of the confidence intervals, see the Appendix. We compare three estimators of the case fatality rate by day t that can be computed on day t, namely CF R N (t), CF R(t) and CF R G (t) with cf r(t). The three estimators are defined in equations (2), (5) and (8), respectively In Figure 1 we observe the cf r(t) values and its estimators together with the confidence intervals based on CF R, presented as a shaded band around CF R(t). The graph corresponds to the scenario of an estimated F with parameters c 1 = 0.1, c 2 = 0.05 and d * = 120, i.e., the daily case fatality rates jump from 0.1 to 0.05 on day 120 since the beginning of the epidemic. One can observe that the CF R estimator works very well in the whole epidemic as well as the confidence intervals which contain cf r(t) during almost the whole epidemic. The estimator proposed by Garske et al. (2009) , CF R G , overestimates cf r(t) for a long period of time after day 120, the day when the cumulative case fatality rates cf r(t) start to decrease. On the other hand one observes that the "naive" case fatality rate CF R N underestimates cf r(t) during most of the epidemic. At the end of the epidemic all the estimators work well. In Figure 2 we observe the same scenario as in Figure 1 but using the known F for the estimators and also the known p d values to calculate the variance of the confidence intervals. We graph since day 30 from the start of epidemic because there are few cases and much noise during the first month. In addition to Figure 1 comments we can observe here the unbiased behaviour of CF R G while p d and cf r(t) values remain constant, until day 120. In that period a lower variability of CF R G estimator compared to CF R is observed as well. Again after day 120 when the cf r(t) values start to change a better behaviour of CF R estimator, as compared to CF R G , is observed. The same comments of Figures 1 and 2 apply to all scenarios we have studied and so we only present these two graphs. Figure 2 : Case fatality rate estimated by different methods in simulated data, generated using p d = 0.1 until day 120 and p d = 0.05 after day 120, assuming F is known. In the Appendix we prove that the proposed confidence intervals (C.I.) have assympthotic level (1 − α) × 100%. However, it is desirable to have an idea of how many confirmed cases we need in order for the C.I. to have approximately the desired level. To deal with this interest we performed a simulation study. To generate the data we made 1000 replicates. In each replication, we used c d , the number of confirmed cases of the first 301 days of the epidemic in Argentina, see Figure 3 . For each day d, the value of p d used is the proportion of confirmed cases which finally died between those cases confirmed between days d − 3 and d + 3, i.e, during the week centered at d, see Figure 4 . We used F d ≡ F where F is a Zero Inflated Negative Binomial (ZINB) distribution, i.e., a convex combination of F 0 , the distribution of a constantly zero variable, and F 1 a negative binomial distribution. We set F = πF 0 + (1 − π)F 1 , where π = 0.103, and F 1 ∼ N B(µ = 12.59, r = 1.2191), where the values of the parameters correspond to the maximium likelihood ZINB estimate for the distribution F of the survival times since diagnosis for people who died from Covid and were confirmed positive during the first 300 days of the epidemic in Argentina. For each replication we built the confidence interval (a(t), b(t)) estimating F d witĥ F emp as in the former simulation and also estimating the p d values, as it was explained in (7). For each replication we checked for every t with 90 ≤ t ≤ 301 whether the real interval contained the case fatality rate by day t, cf r(t) or not. Then, for each t we measured the proportion of times between the 1000 replications in which a(t) ≤ cf r(t) ≤ b(t), and we call that proportion the "mean coverage" of the intervals at time t. In Figure 5 (left) we observe that since day 90 of the epidemic the C.I. have "mean coverage" greater than 85%, since day 200 approximately the "mean coverage" is greater than 90% and the mean coverage approaches 95% when t approaches 300. In Figure 5 (right) we see how the "mean coverage" behaves depending on the number of cases, and see that for more than 500, 000 cases the "mean coverage" is over 90% and that the coverage is around 95% when the number of cases is around 1, 500, 000. In this section we analise the behaviour of the estimators presented in Section 2.1 in a real data example. Of course, in this case, we cannot compare the values of the estimators with the value of interest cf r(t) because this value is not observable. The natural way to deal with this problem is to compare the estimators to the final case fatality rate by day t, CF R F (t). We compute, for each day t from June 1st to December 31st 2020, different estimators of cf r(t) using the data base from the Ministry of Health of Argentina as of April 4th, 2021. We compare three estimators of the case fatality rate by day t, cf r(t), that can be computed on day t, namely CF R N (t), CF R(t) and CF R G (t) to the final case fatality rate until day t observed on April 4th, which we call CF R F (t), assuming that all the confirmed cases during 2020 have either recovered or die by that date. The three estimators are defined in equations (2), (5) and (8), respectively, and CF R F (t) is computed by where t F is the number of days from March 3rd 2020 to April 4th 2021. Observe that CF R F (t) defined in (10) equals the one defined in (3) if all people diagnosed until day t have already died or recovered by day t F , which might not be true but both values should be very close one from the other. In fact, the Argentine data set shows that the 0.98 quantile of the times between confirmation and death is 45 days and that the proportion of dead among cases confirmed during 2020 that died in 90 days or less is 0.9967. Figure 6 displays the estimated curves, together with the observed CF R F (t). First, observe that both CF R G (t) and CF R(t) do a much better job than the usual CF R N (t) at estimating cf r(t), since they are both much closer to CF R F (t). Second, we remark that CF R G (t) is nearer to CF R F (t) until around day 110 but soon after that day, the biased nature of CF R G (t) becomes apparent. Our simulations (unreported here) show that CF R G (t) has less variability than CF R(t). However, since for both estimators the variance converges to zero as the number of observations goes to infinity, the weight of the variance in the bias-variance trade off gets smaller and smaller as the number of cases rises. By day 150 approximately, we begin to see clearly that CF R G (t) is biased and converges to a different value, not the cf r(t) we are trying to estimate. Finally we would like to remark that the most realistic thing to do would have been to use the data base reported on day t to compute the estimators, for each t in the analysis, instead of using the data base of April 4th for all the estimations. This was not possible because we did not have all the data bases from june to december of 2020. However, we had access to some data bases from june 2020 and we observed (in estimations not reported here) that the estimators did not work as well as those we report here. The reason is the problem of delays in entering data, i.e., many people who was already a confirmed case and a dead case by day t, only appear in the data base of day t as a confirmed case, but not as a dead case, whereas if we take a posterior data base (say, a two month later data base), these people appear correctly as diagnosed and dead by day t. This problem of delays in entering the data may be dealt with nowcasting techniques, see for example Bastos et al. (2019) . This may be subject of further work. We now explain briefly how we estimate F d , for d ≤ t. For such estimation, we assume that F d ≡ F for all d and, on day t, we estimate F d (k) byF emp (k), the proportion of confirmed cases who died in k or less days since confirmation, among those who died form the disease. For estimating this proportion we consider only cases who were confirmed by day t − 45 and dead by day t. The reason for considering only cases confirmed 45 days before the day of interest is that in the Argentine data set, approximately 98% of confirmed cases that die, do so in 45 days or less. Considering all confirmed cases until day t would lead to an overestimation of F (k), for small values of k. We studied different estimators of F , assuming parametric models for this distribution, including regression of y on x with y = number of days from diagnosis to death and x= number of days from the beginning of the epidemic (March 3rd in this case) until diagnosis date. We fitted a linear model and different generalized linear models. However for the vast majority of values of t the regressions were non significant. Moreover, it was necessary to use the fitted models to estimate F d for values of d that were way out of the range of the training data. This induced large extrapolation errors. We believe this problem can be solved by estimating with techniques designed for censored data. This does not seem simple for this kind of data and may be the subject of further work. Given the diagnosed cases of an epidemic disease we set a statistical model for the outcomes of the disease for patients diagnosed in different days. Based on the model, we define the case fatality rate of the disease by day t as the probability of dying from the disease for a randomly picked person among those diagnosed by day t and we propose an estimator for this rate. This estimator is based on the distribution of the times between confirmation and death of confirmed cases that die because of the disease. We prove that the proposed estimator is unbiased, consistent and asymptotically normal and we derive asymptotic confidence intervals. Both the estimator and confidence intervals for the case fatality rate can be computed during the course of the epidemic. We show the very good performance of the proposed estimator and the confidence intervals for large samples, as compared to the estimator proposed by Garske et al. (2009) , by means of a Monte Carlo study and an analysis of the Covid-19 epidemic in Argentina during 2020. We also performe a Monte Carlo simulation to study the mean coverage of the asymptotic confidence intervals as a function of the cumulative number of cases and we show that it is very near the nominal confidence level when the number of cases is large. We also propose an estimator of the daily case fatality rate and an extension of the estimator proposed by Garske et al. (2009) that allows the distribution of the times between confirmation and death to change in time. for all ε > 0, where 1 {··· } is the indicator function, then the central limit theorem holds, i.e. S t /σ t D −→ N (0, 1) Proof. See Theorem 27.2 in Billingsley (2008) For a collection of variables M d,i (t), with t ≥ 0, 0 ≤ d ≤ t and 1 ≤ i ≤ c d , and distributions F d for d ≥ 0 defined as in Section 2, define the following assumptions: Theorem 2 For each t ∈ N, let {M d,i (t)} d,i for 0 ≤ d ≤ t and 1 ≤ i ≤ c d be independent random variables Be(p d F d (t − d)). Assume that the total number of confirmed cases until day t, r t = Proof. Even though weak consistency is a consequence of asymptotic normality, we write the proof of part (i) because it is interesting in itself and has a simple proof independent of (ii). The estimator CF R(t) we propose for the case fatality rate, cf r(t), is a sample average of the variables Z d,i (t) = M d,i (t)/F d (t − d). Easy calculations show that E(Z d,i (t)) = p d and V (Z d,i (t)) = p d (1 − p d F d (t − d))/F d (t − d) are finite. (i) Since A1 holds, we get for all d and for all i. Then (ii) For each day t by taking the centered variables Z d,i (t) − p d for 0 ≤ d ≤ t and 1 ≤ i ≤ c d , we have r t independent variables of mean 0. If we rename these variables as X t,1 , . . . , X t,rt , we will see that the Lindemberg condition of Theorem 1 is satisfied and then if S t = X t,1 + X t,2 + . . . + X t,rt and σ 2 t = Var(S t ), the convergence in distribution S t /σ t D −→ N (0, 1) is obtained. The proof concludes by observing with elementary calculations that CF R(t) − cf r(t) Var(CF R(t)) = S t σ t . Let us see that the variables {Z d,i (t) − p d } d,i satisfy the Lindemberg condition A modelling approach for correcting reporting delays in disease surveillance data Real estimates of mortality following covid-19 infection Probability and measure The computation of case fatality rate for novel coronavirus (covid-19) based on bayes theorem: An observational study Report 4: severity of 2019-novel coronavirus (ncov) Assessing the severity of the novel influenza a/h1n1 pandemic Estimation of the case fatality rate based on stratification for the covid-19 outbreak Online estimation of the case fatality rate using a run-off triangle data approach: An application to the korean mers outbreak in 2015 Potential biases in estimating absolute and relative case-fatality risks during outbreaks Case fatality risk estimated from routinely collected disease surveillance data: application to covid-19 Early epidemiological assessment of the virulence of emerging infectious diseases: a case study of an influenza pandemic Estimating the risk of covid-19 death during the course of the outbreak in korea We would like to thank our colleagues who participated in the Covid-19 data analysis meetings organized by Instituto de Cálculo from the University of Buenos Aires, where many of these ideas came up and were discussed. In this section we prove the consistency and asymptotic normality of the CF R(t). We call consistency of CF R(t) to the fact that CF R(t)−cf r(t)First of all, recall the Central Limit Theorem for triangular arrays.Theorem 1 Suppose that for each t ∈ N 0 , X t,1 , X t,2 , . . . , X t,rt are independent random variables. Let S t = X t,1 + X t,2 + . . . + X t,rt . Suppose that E(X t,k ) = 0 for all t and k, and that the variances E(X 2 t,k ) are finite. Call σ 2 t = Var(S t ). If the Lindemberg condition is satisfied, i.e.: