key: cord-1003125-ns30mkif
authors: E. P., S.; P. G., S.
title: Statistical methods for estimating cure fraction of COVID-19 patients in India
date: 2020-06-03
journal: nan
DOI: 10.1101/2020.05.30.20117804
sha: 4ee09ebb9ed02f045797bba2b6fc72b6330a44a7
doc_id: 1003125
cord_uid: ns30mkif

The human race has been under the menace of the COVID-19 pandemic since the beginning of 2020. Even though the disease is easily transmissible, a large fraction of the affected people are recovering. Most of the recovered patients will not experience death due to COVID-19, even if they are observed for a long period. They can be treated as long term survivors (the cured population) in the context of lifetime data analysis. In this article, we present some statistical methods to estimate the cure fraction of COVID-19 patients in India. A proportional hazards mixture cure model is used to estimate the cure fraction and the effect of the covariates gender and age on lifetime. The data available on the website https://api.cvoid19india.org are used in this study. We find that the cure fraction of COVID-19 patients in India is more than 90%, which is indeed optimistic information.

The outbreak of novel coronavirus disease 2019 (COVID-19) has created a global health crisis since January 2020. The first case of COVID-19 was reported in

Many studies on COVID-19 patient data have been reported using SIR models and other popular statistical techniques (Nadia and Hazem, 2020; Waquas et al., 2020). In the present paper, our goal is to analyse COVID-19 patient data from India using lifetime data models, which are widely used in epidemiological studies and public health research (Lee and Go, 1997; Cole and Hudgens, 2010).

This preprint (version posted June 3, 2020, which was not certified by peer review) is made available under a CC-BY-NC-ND 4.0 International license. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
In lifetime data analysis we model the time to occurrence of an event of interest. In this context, we define death due to COVID-19 as the event of interest, and the time to this event is known as the lifetime. For various reasons, the exact time of the event is not available for many individuals (patients). This situation leads to the phenomenon known as censoring. For censored patients only partial information on the lifetime is available: it is greater than a particular time. The possibility of censoring distinguishes lifetime data analysis from other fields of statistics. An extensive review of lifetime data analysis is given in Lawless (2011).

For many COVID-19 patients in India, the date of confirmation of the disease and the date of death due to, or recovery from, the disease are available, along with information on age and gender. The data are freely accessible at the site 'https://api.cvoid19india.org'. When the event is defined as death due to COVID-19, recovered and hospitalized patients can be treated as having censored lifetimes, since we have no information on them after the given date (as per records).

In lifetime studies, researchers generally assume that all of the study subjects will experience the event of interest if they are followed long enough (Maller and Zhao, 1996). However, in some situations a non-negligible proportion of individuals may not experience the event of interest even after a long period of time. For example, a COVID-19 patient who has recovered from the disease is assumed to have acquired immunity to it and will not experience it in future. These patients can be treated as long term survivors or cured patients. Cure models are treated as an effective statistical tool for analysing lifetime data with long term survivors.
Accordingly, the COVID-19 affected population in India includes many long term survivors, and an estimate of the cure fraction amounts to an estimate of the percentage of individuals who will recover. We should note that censored individuals and cured individuals are distinct: a censored individual has not experienced the event of interest by a given time (the censoring time), while a cured individual will not experience the event of interest even if observed infinitely long. We can use the mixture cure model, the basic cure model proposed by Boag (1949), to analyse these data. The mixture cure model assumes that the population under study is a mixture of susceptible (uncured) individuals, who may experience the event of interest, and non-susceptible (cured) individuals, who will never experience the event. Cox's PH model (Cox, 1972) for data with a cured fraction is used to investigate the association between survival time and covariates.

In the present paper, we analyse the data on COVID-19 patients in India (available as on 19 May, 2020) using statistical techniques in lifetime data analysis. This appears to be the first study on COVID-19 patient data in this direction.

The rest of the article is organised as follows. In Section 2, we describe the statistical models and computational procedures employed in this study. We use the Kaplan-Meier estimate of the survivor function to get basic information about the presence of long term survivors in the population. The effect of covariates on lifetime is analysed using the proportional hazards model, and the cure fraction is estimated in the presence of covariates. Here the
regression parameters as well as the cure fraction are estimated simultaneously. We use the package 'smcure' in the R language (Cai et al., 2012) for this purpose. In Section 3, we present the results and discuss them in detail. Finally, Section 4 gives concluding remarks and possible future work.

In general, suppose that we have $n$ patients under study. Define $T$ as the time to death of a COVID-19 patient from the date of confirmation of the disease. Hence $T$ is counted as the number of days in hospital. The number of hospitalised days of a patient who has died is counted as an observed lifetime. If the patient has recovered, we know only that the event has not happened to the patient so far, hence the number of hospitalised days of such patients is considered a censored lifetime. For patients with current status 'hospitalized', no information is available after the given date either. Hence those lifetimes are also treated as censored lifetimes.

The basic quantities of lifetime data analysis are the survivor function, the hazard rate function and the distribution function. The survivor function, denoted by $S(t)$, gives the probability of a patient surviving beyond time $t$ and is defined by $S(t) = P(T > t)$. In general, the survivor function is a non-increasing function with $\lim_{t \to \infty} S(t) = 0$. The hazard rate function, represented by $h(t)$, indicates the instantaneous rate of failure at time $t$ and is given by

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}.$$

The distribution function, defined as
the complement of the survivor function, gives the probability of failure by a particular time $t$:

$$F(t) = P(T \le t) = 1 - S(t).$$

Information on age and gender is available for some patients. The covariates are represented by a vector $z$, which is partially or completely missing for some patients. In general, we assume that $z$ has dimension $p \times 1$. For each patient, we observe a lifetime $T$ and an indicator variable $\delta$, which is defined as 1 if death is observed and 0 if the patient is recovered/hospitalized (censored). Thus each patient contributes a pair of observations $\{T, \delta\}$, the observed lifetime and an indicator telling us whether $T$ is an event time or a censoring time, along with a $p \times 1$ covariate vector $z$. Let $S(t \mid z)$ be the survivor function of $T$ given the covariate vector $z$, defined as

$$S(t \mid z) = P(T > t \mid z). \tag{1}$$

We suppose that some patients will not experience death due to COVID-19 even if they are observed infinitely long, and our aim is to estimate the fraction of such patients. In the cure model, we define the latent indicator variable $Y$ as

$$Y = \begin{cases} 1, & \text{if the individual is uncured (will experience the event)}, \\ 0, & \text{otherwise}. \end{cases}$$

Now the lifetime $T$ can be decomposed as

$$T = Y\,T^{*} + (1 - Y)\,\infty,$$

where $T^{*}$ is the lifetime of susceptible (uncured) individuals. Note that the variable $Y$ denotes the true event status, while the variable $\delta$ denotes the observed failure status. Writing $\pi(z) = P(Y = 1 \mid z)$ for the probability of being uncured, the survivor function given in Eq. (1) can be written as

$$S(t \mid z) = 1 - \pi(z) + \pi(z)\,S^{*}(t \mid z), \tag{2}$$

where $S^{*}(t \mid z) = P(T > t \mid Y = 1, z)$ is a proper survivor function.
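To illustrate how the mixture cure survivor function levels off at the cure rate, the following minimal sketch evaluates $S(t) = 1 - \pi + \pi S^{*}(t)$ for a single covariate value. The exponential form chosen for $S^{*}$ and the numbers used (an uncured probability of 0.1 and a rate of 0.05 per day) are assumptions made purely for illustration; they are not estimates from this study.

```python
import math


def latency_survival(t, rate):
    """Assumed exponential survivor function S*(t) of uncured patients."""
    return math.exp(-rate * t)


def population_survival(t, uncured_prob, rate):
    """Mixture cure survivor function: S(t) = 1 - pi + pi * S*(t)."""
    return 1.0 - uncured_prob + uncured_prob * latency_survival(t, rate)


if __name__ == "__main__":
    # Hypothetical values: pi = 0.1 (uncured probability), rate = 0.05 per day.
    for day in (0, 10, 30, 100, 1000):
        print(day, round(population_survival(day, 0.1, 0.05), 4))
```

As $t$ grows, the printed values approach $1 - \pi = 0.9$ rather than 0, which is the plateau that signals the presence of long term survivors.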
Researchers usually model $\pi(z)$ through the logistic form (Farewell, 1982)

$$\pi(z) = \frac{\exp(\gamma^{\top} z)}{1 + \exp(\gamma^{\top} z)}, \tag{3}$$

where $\gamma$ is the vector of unknown parameters. In the presence of long term survivors, the survivor function $S(t \mid z)$ is such that $\lim_{t \to \infty} S(t \mid z) > 0$, and this limiting value, given by $1 - \pi(z)$, corresponds to the proportion of cured subjects, known as the cure rate.

To assess the impact of the covariate values $z$ on the survivor function of uncured individuals, we can model the survivor function $S^{*}(t \mid z)$. In this paper, we use the Cox PH model, which is the most popular regression model in medical research, to model the survivor function in the presence of covariates. In the Cox PH model, the survivor function $S^{*}(t \mid z)$ can be written as

$$S^{*}(t \mid z) = \left[S^{*}_{0}(t)\right]^{\exp(\beta^{\top} z)}, \tag{4}$$

where $S^{*}_{0}(t)$ is the baseline survivor function, which is common to all individuals, and $\beta$ is the $p \times 1$ vector of regression parameters. Now Eq. (4) can be used to model $S^{*}(t \mid z)$ in Eq. (2), and the resulting model is termed the Proportional Hazards Mixture Cure (PHMC) model.

The Kaplan-Meier (KM) estimator of the survivor function is given by

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right), \tag{5}$$

where $d_i$ is the number of events and $n_i$ is the number of individuals at risk at time $t_i$. In an ideal situation where all individuals have an observed lifetime, we have $\hat{S}(\tau) = 0$, where $\tau$ is the largest observed lifetime. But when cured individuals or long term survivors are present in the data, $\hat{S}(\tau) > 0$, and $\hat{S}(\tau)$ gives a rough estimate of the fraction of patients who have not experienced the event by the end of follow-up, which includes the cured proportion. Hence, a high value of $\hat{S}(\tau)$ gives evidence of the presence of long term survivors (Maller and Zhao, 1996).

Our aim is to estimate the baseline survivor function $S^{*}_{0}(t)$ (which is different from the KM estimator), the vector of regression parameters $\beta$ and the cure fraction $1 - \pi(z)$ simultaneously from the given data.
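The product-limit form of the KM estimator, $\hat{S}(t) = \prod_{t_i \le t} (1 - d_i/n_i)$, can be sketched in a few lines. This is a minimal illustration, not the code used in the study; the function name `kaplan_meier` and the toy data are hypothetical, and events are coded as 1 for an observed death and 0 for a censored lifetime.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at the distinct observed event times.

    times  : observed lifetimes (days)
    events : 1 if death was observed, 0 if the lifetime is censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        at_risk = sum(1 for u in times if u >= t)   # n_i: still under observation
        survival *= 1.0 - deaths[t] / at_risk       # factor (1 - d_i / n_i)
        curve.append((t, survival))
    return curve


if __name__ == "__main__":
    # Toy data (hypothetical): three deaths, five censored lifetimes.
    toy_times = [2, 3, 3, 5, 8, 8, 10, 12]
    toy_events = [1, 1, 0, 1, 0, 0, 0, 0]
    for t, s in kaplan_meier(toy_times, toy_events):
        print(t, round(s, 3))
```

In the toy data the curve stops decreasing after the last observed death and stays well above zero, which is the kind of plateau that suggests a cured fraction.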
To estimate the parameters under a PH model, Peng and Dear (2000) and Sy and Taylor (2000) proposed a partial likelihood method, with which $\beta$ can be estimated without specifying the baseline survivor function $S^{*}_{0}(t)$. Let $\Phi = (T, \delta, z)$ denote the observed data. The Expectation-Maximisation (EM) algorithm can be used to estimate the parameters of interest in the PHMC model. Following the notation in Section 2.1, given $y = (y_1, y_2, \ldots, y_n)$ and $\Phi$, the complete likelihood can be written as

$$L_c = \prod_{i=1}^{n} \left[1 - \pi(z_i)\right]^{1 - y_i} \pi(z_i)^{y_i}\, h(t_i \mid z_i)^{\delta_i y_i}\, S^{*}(t_i \mid z_i)^{y_i}, \tag{6}$$

where $h(\cdot)$ is the hazard rate function corresponding to $S^{*}(\cdot)$. The logarithm of the complete likelihood, $l$, can be written as $l = l_1 + l_2$, where

$$l_1 = \sum_{i=1}^{n} \left\{ y_i \log \pi(z_i) + (1 - y_i) \log\left[1 - \pi(z_i)\right] \right\} \tag{7}$$

and

$$l_2 = \sum_{i=1}^{n} \left\{ y_i \delta_i \log h(t_i \mid z_i) + y_i \log S^{*}(t_i \mid z_i) \right\}. \tag{8}$$

The conditional expectation of the complete log likelihood with respect to $y_i$ given $\Phi$ is calculated in the E-step of the EM algorithm, using the current estimates of $\beta$ and $S^{*}_{0}(t)$. Since Eq. (7) and Eq. (8) are linear functions of $y_i$, we need only the conditional expectation of $y_i$ to perform this computation. Let $\beta^{(k)}$ and $S^{*(k)}_{0}(t)$ denote the estimates of $\beta$ and $S^{*}_{0}(t)$ obtained in the $k$th iteration. Now the conditional expectation of $y_i$ given the current estimates can be written as

$$b_i^{(k)} = E\left(y_i \mid \Phi, \beta^{(k)}, S^{*(k)}_{0}\right) = \delta_i + (1 - \delta_i)\, \frac{\pi(z_i)\, S^{*}(t_i \mid z_i)}{1 - \pi(z_i) + \pi(z_i)\, S^{*}(t_i \mid z_i)}, \tag{9}$$

with the right-hand side evaluated at the current estimates. We can see that $b_i^{(k)} = 1$ if $\delta_i = 1$ and, if $\delta_i = 0$, it is the conditional probability that the $i$th individual remains uncured. Since $\delta_i \log b_i^{(k)} = 0$ and $\delta_i b_i^{(k)} = \delta_i$, the expectations of Eq. (7) and Eq.
(8) can be written as

$$E(l_1) = \sum_{i=1}^{n} \left\{ b_i^{(k)} \log \pi(z_i) + \left(1 - b_i^{(k)}\right) \log\left[1 - \pi(z_i)\right] \right\} \tag{10}$$

and

$$E(l_2) = \sum_{i=1}^{n} \left\{ \delta_i \log h(t_i \mid z_i) + b_i^{(k)} \log S^{*}(t_i \mid z_i) \right\}. \tag{11}$$

The Maximisation step (M-step) of the EM algorithm maximizes Eq. (10) and Eq. (11). To estimate the parameters under a PH model, we can employ the methods given in Peng and Dear (2000). Let $t_{(1)} < t_{(2)} < \cdots < t_{(m)}$ denote the distinct uncensored failure times, $d_{t_{(j)}}$ denote the number of events and $R(t_{(j)})$ denote the risk set at time $t_{(j)}$. Now the Breslow type estimator for $S^{*}_{0}(t \mid Y = 1)$ is

$$\hat{S}^{*}_{0}(t \mid Y = 1) = \exp\left( - \sum_{j:\, t_{(j)} \le t} \frac{d_{t_{(j)}}}{\sum_{i \in R(t_{(j)})} b_i^{(k)} \exp(\hat{\beta}^{\top} z_i)} \right),$$

where $\hat{\beta}$ is the estimate of $\beta$ obtained in the previous step. Now we can estimate the probability $\pi(z)$, and hence the cure fraction $1 - \pi(z)$, using Eq. (3) given in Section 2.1.

In this section, we analyse the data on COVID-19 patients in India using the statistical methods explained in Section 2. Since the response variable in this study is time (number of days in hospital), to employ lifetime data models we need information on the 'date of admission to hospital' and the 'status change date'. The information available in the file 'patient raw data' from the site https://api.cvoid19india.org as on May 19, 2020 is used to carry out the analysis. From the raw data, we can see that 10.83% of the lifetimes are actually observed lifetimes (the current status of the patient is deceased) and the remaining 89.17% of the lifetimes are censored (the current status of the patient is recovered/hospitalized).

We plot a Kaplan-Meier curve using Eq. (5) given in Section 2.1 to get basic information about the presence of long term survivors in the data. The minimum value of the estimated survival probability is 0.874, with a standard error of 0.0114. The plot of the KM curve is given in Figure 1. From the plot it is evident that a large fraction of patients in this data set will be long term survivors.
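Returning to the E-step of Section 2, the weight of a patient in the EM algorithm is the conditional probability of being uncured: it equals 1 when death is observed and $\pi S^{*} / (1 - \pi + \pi S^{*})$ when the lifetime is censored, which is the standard form for mixture cure models. The sketch below combines it with the logistic model for $\pi(z)$ and the PH model for $S^{*}(t \mid z)$. All function names, the baseline survivor function and the parameter values are hypothetical and chosen only for illustration; this is not the 'smcure' implementation.

```python
import math


def uncured_prob(z, gamma):
    """Logistic model for pi(z) = P(Y = 1 | z); z and gamma are equal-length lists."""
    eta = sum(g * x for g, x in zip(gamma, z))
    return 1.0 / (1.0 + math.exp(-eta))


def latency_survival(t, z, beta, baseline_survival):
    """PH model: S*(t | z) = S0*(t) ** exp(beta'z)."""
    linear = sum(b * x for b, x in zip(beta, z))
    return baseline_survival(t) ** math.exp(linear)


def e_step_weight(t, delta, z_incidence, z_latency, gamma, beta, baseline_survival):
    """Conditional probability that a patient is uncured, given the observed data.

    delta = 1 (death observed) implies the patient is uncured, so the weight is 1.
    delta = 0 (censored) gives pi * S* / (1 - pi + pi * S*).
    """
    if delta == 1:
        return 1.0
    pi = uncured_prob(z_incidence, gamma)
    s = latency_survival(t, z_latency, beta, baseline_survival)
    return pi * s / (1.0 - pi + pi * s)


if __name__ == "__main__":
    # Hypothetical baseline survivor function and parameter values.
    baseline = lambda t: math.exp(-0.1 * t)
    gamma = [-2.0, 0.5]   # intercept and gender coefficient (assumed)
    beta = [0.4]          # gender coefficient in the PH part (assumed)
    for days in (5, 20, 60):
        w = e_step_weight(days, 0, [1.0, 1.0], [1.0], gamma, beta, baseline)
        print(days, round(w, 4))
```

The weight of a censored patient shrinks as the censoring time grows: the longer a patient has been observed without the event, the more likely it is that the patient belongs to the cured group.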
We analyse the above data to estimate the cure fraction in the presence of the covariates gender and age. A separate analysis is done for each covariate, since information on gender is missing for some patients and information on age is missing for some others.

First we estimate the cure fraction considering the covariate gender. In the accessible data, 24.6% are females and 73.4% are males. We denote females by 0 and males by 1 in the analysis. The regression parameter is estimated as 0.386. Since the parameter value is greater than 0, females have a greater survival probability than males.

Figure 2. Predicted survival probability of males and females

We now consider the covariate age to estimate the cure fraction. The minimum recorded age is 1 and the maximum is 96. The regression parameter is estimated as 0.2381, which means that the hazard ratio is greater than 1. Hence age can be treated as a 'bad prognostic factor': as age increases, the hazard increases. With respect to age, the cure fraction is estimated to be 0.9313. We plot the survival probability curve for patients by dividing them into two groups with respect to age: below 60 years, and equal to or above 60 years. The value 60 is chosen to determine how much the survival pattern differs between senior citizens and others.
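Under a proportional hazards model, a regression parameter $\beta$ translates into a hazard ratio of $\exp(\beta)$ per unit increase of the covariate. The short sketch below applies this to the two estimates quoted above, assuming they are the coefficients of the PH part of the model. The helper name `hazard_ratio` is hypothetical, and the scale on which age enters the model is not stated in the text, so the second value should be read as "per unit of age as coded in the analysis".

```python
import math


def hazard_ratio(beta, covariate_change=1.0):
    """Hazard ratio implied by a PH regression parameter: exp(beta * change)."""
    return math.exp(beta * covariate_change)


if __name__ == "__main__":
    hr_gender = hazard_ratio(0.386)    # males (1) relative to females (0)
    hr_age = hazard_ratio(0.2381)      # per unit of age as coded (assumption)
    print("gender:", round(hr_gender, 3))
    print("age:", round(hr_age, 3))
```

Both ratios exceed 1 (roughly 1.47 and 1.27), which matches the qualitative reading in the text: a higher hazard for males than for females, and a hazard that increases with age.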
In the plot of predicted survival probability by age group, the solid line represents the survival probability of patients below 60 years and the dotted line represents the survival probability of patients aged 60 years and above.

We analyse the data on COVID-19 patients in India using statistical methods in lifetime data analysis. To our knowledge, the reported statistical studies on COVID-19 data from various nations use compartmental models in epidemiology. We carry out an exploratory data analysis from a new perspective and estimate the fraction of long term survivors in the data in the presence of the covariates age and gender. It is shown that female patients have a greater chance of survival than male patients. While the younger population has a survival probability of more than 0.95, for the older population it is only around 0.5.

A COVID-19 patient entering the study may have been identified with the disease (or its symptoms) at an earlier time, which may be unknown. This possibility of partial information on lifetime leads to data with left censoring. The study can be extended to incorporate left censored individuals as well. Grouped data models, where the lifetime is grouped into several non-overlapping intervals, can also be considered in this context. Studies in these directions will be reported elsewhere.
Also, it will be worthwhile to analyse the data on COVID-19 patients taking into account information on the current health condition of the patient, since patients with cardiac problems and diabetes may have a greater hazard rate.

References
Maximum likelihood estimates of the proportion of patients cured by cancer therapy
An R-Package for estimating semiparametric mixture cure models, Computer Methods and Programs in Biomedicine
Survival analysis in infectious disease research: Describing events in time
Regression models and life tables
The use of mixture models for the analysis of survival data with long-term survivors
Nonparametric estimation from incomplete observations
Statistical Methods and Models for Lifetime Data
Survival analysis in public health research
Survival Analysis with Long-Term Survivors
Data Analysis of Coronavirus CoVID-19 Epidemic in South Korea Based on Recovered and Death Cases
Cure models as a useful statistical tool for analysing survival
A nonparametric mixture model for cure rate estimation
The cure model in perinatal epidemiology
Estimation in a Cox proportional hazards cure model
Analysis and Prediction of COVID-19 Pandemic in Pakistan using Time-dependent SIR Model

Sreedevi E. P. would like to thank Kerala State Council for Science Technology and Environment, Kerala, India for the financial support provided to carry out this research work.