key: cord-0805948-721x81up
authors: Ghosh, Sarada; Samanta, G.P.; Nieto, Juan J.
title: Application of non-parametric models for analyzing survival data of COVID-19 patients
date: 2021-08-27
journal: J Infect Public Health
DOI: 10.1016/j.jiph.2021.08.025
sha: 89b16fe8a044991b99eb77b5b9732a59a1b85178
doc_id: 805948
cord_uid: 721x81up

BACKGROUND: COVID-19 coronavirus variants are emerging across the globe, causing ongoing pandemics. It is important to estimate the case fatality ratio (CFR) during such an epidemic of a potentially fatal disease.

METHODS: First, we applied a non-parametric approach to estimate odds ratios with corresponding confidence intervals (CIs) and illustrated relative risks and cumulative mortality rates using COVID-19 data from Spain. We demonstrated a modified non-parametric approach based on the Kaplan-Meier (KM) technique using COVID-19 data from Italy. We also assessed the significance of patient characteristics for outcome by age for both genders. Furthermore, we applied a non-parametric cure model using Nadaraya-Watson weights to estimate the cure-rate using data from Israel. Simulations are based on R software.

RESULTS: The analytical illustrations of these approaches predict the effects on patients of covariates in different scenarios. Sex differences increase from ages below 60 years to 60-69 years but decrease thereafter, with the smallest sex difference at ages 80 years and above, for both the RR (Relative Risk) and the OR (Odds Ratio). Under the non-parametric approach, the cure-rate ranges from approximately 5.3% to 9% for males and from approximately 4% to 7% for females. The modified KM estimator handles such censored data and detects changes in the CFR more rapidly, both by gender and by age.

CONCLUSION: Older age, male sex, number of comorbidities and access to timely health care are identified as some of the risk factors associated with COVID-19 mortality in Spain.
The non-parametric approach has investigated the influence of covariates on the models and shows an effect of both gender and age. Inaccurate estimates, inconsistent intelligence, conflicting messages and the resulting misinformation have an impact on public health: they can increase awareness among people but also induce the panic that accompanies major outbreaks of COVID-19.

The 'severe acute respiratory syndrome (SARS)' pandemic, a viral respiratory disease, has shown how rapidly new infectious diseases can spread worldwide. In December 2019, a pneumonia outbreak was first reported in Wuhan, China. On 31 December 2019, the outbreak was traced to a novel strain of coronavirus, given the interim name 2019-nCoV by the WHO. It is also known as COVID-19. The virus primarily spreads among people via respiratory droplets exhaled when coughing or sneezing. Among those who died from the disease, the time from development of symptoms to death is between 6 and 41 days, with a median of 14 days [1]. As of 20 March 2020, more than 11,000 deaths had been attributed to COVID-19 [2]. Most of the people who died were elderly: about 80% of deaths were in those over 60, and 75% had pre-existing health conditions including cardiovascular diseases and diabetes.

Scientists in Italy have found traces of the new coronavirus in waste water collected from Milan and Turin in December 2019, suggesting that COVID-19 was already circulating in northern Italy before China reported the first cases. The Italian National Institute of Health examined 40 sewage samples collected from waste water treatment plants in northern Italy between October 2019 and February 2020. An analysis released late on Thursday (December 19, 2019) said samples taken in Milan and Turin on 18 December showed the presence of the COVID-19 virus [3].
In this work we use a dataset for Spain that reports daily cumulative deaths due to COVID-19 by age and sex, available on the datalab of the Institut National d'Études Démographiques [4]. Next, we consider a total of 73,780 cases of COVID-19 diagnosed as positive by the regional reference laboratories (Istituto Superiore di Sanità) in Italy [5]. Lastly, the dataset of patients recovered from COVID-19 in Israel includes anonymized records of 15,383 Israeli COVID-19 patients [6]. We have used the data of only 14,835 of these patients (8,202 men and 6,633 women), since the gender group is not clear for the remainder.

In this work, we have estimated the relation between mortality due to COVID-19 and age-specific gender using a non-parametric approach with logistic regression [7, 8]. We have calculated sex-specific cumulative mortality rates together with relative risks with 95% CIs for the covered time periods. Since the function $f(x)$ is arbitrary and unknown, it cannot be explicitly expressed through coefficients. The odds ratio $\mathrm{OR}(x, x_r)$ is defined in (1), where $x_r$ is a specific value of the exposure taken as the reference, and the CI of the odds ratio in (1) is

$$\exp\left(\ln \mathrm{OR}(x, x_r) \pm z_{\alpha/2}\, \mathrm{SE}\left(\ln \mathrm{OR}(x, x_r)\right)\right),$$

where SE is the standard error, $z_{\alpha/2}$ is the corresponding critical value and $\alpha$ is the level of significance.

Statistical approach for estimating CFR

Preliminaries

Let $\zeta(t)$, $R(t)$ and $\delta(t)$ denote the cases, the recoveries and the cumulative number of deaths, respectively, where $t$ denotes the time point. We consider the two simple estimators

$$e_1(t) = \frac{\delta(t)}{\zeta(t)}, \qquad (2)$$

$$e_2(t) = \frac{\delta(t)}{\delta(t) + R(t)}. \qquad (3)$$

Equation (2) ignores the censoring that arises because some affected persons remain in hospital. The second estimator assumes that the CFR for those who remain in hospital will be the same as for those with a known outcome. In addition, equation (3) implies that, at any time $t$ elapsed since admission to hospital, the hazards of death and recovery must be proportional.
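The two simple estimators described above can be illustrated with a short Python sketch. It assumes the forms implied by the verbal descriptions in the text: $e_1$ as the ratio of deaths to cases, and $e_2$ as the ratio of deaths to cases with a known outcome (deaths plus recoveries). The function name and the counts are hypothetical and are not taken from the datasets used in this work.

```python
def simple_cfr_estimates(cases, deaths, recoveries):
    """Return (e1, e2) from cumulative counts at a single time point.

    e1 = deaths / cases                  (ignores censoring)
    e2 = deaths / (deaths + recoveries)  (known outcomes only)
    These forms are assumed from the verbal definitions in the text.
    """
    if cases <= 0:
        raise ValueError("cases must be positive")
    resolved = deaths + recoveries
    if resolved <= 0:
        raise ValueError("at least one known outcome is required")
    e1 = deaths / cases
    e2 = deaths / resolved
    return e1, e2


# Hypothetical counts: 1000 cases, of which 50 died, 150 recovered
# and 800 are still in hospital (censored).
e1, e2 = simple_cfr_estimates(cases=1000, deaths=50, recoveries=150)
print(f"e1 = {e1:.3f}, e2 = {e2:.3f}")
```

With heavy censoring the two values can differ widely: here most cases are still unresolved, so the deaths-to-cases ratio is much smaller than the ratio based on known outcomes.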
Let $h_0(t)$ and $h_1(t)$ be the hazard functions of the death and recovery states, respectively, and let $t_{\max}(s)$ be the maximum time of observation from admission to hospital for a death or recovery occurring at time $s$ in the pandemic. The probability of discharge $\theta_1(s)$ or death $\theta_0(s)$ at or before time $s$ is obtained from these hazard functions, where $\Theta(t)$ denotes the survival function when both endpoints are considered as a single composite endpoint. If the epidemic has ended, then $\theta_0(s) + \theta_1(s) = 1$ and $\hat{\theta} = \hat{\theta}_0(s)$ is an estimate of the CFR. During the pandemic, however, the $S_\iota(t)$ are not complete, and therefore $\theta_0(s) + \theta_1(s) < 1$. The CFR at time $s$, as estimated here, should lie between $\hat{\theta}_0(s)$ and $1 - \hat{\theta}_1(s)$.

At any time $s$, the hazard function is estimated as

$$\hat{h}_{\iota j}(s) = \frac{\delta_{\iota j}(s)}{n_j(s)},$$

where $\delta_{\iota j}(s)$ is the number of events of the $\iota$-th type on the $j$-th day ($j$ being the time elapsed since admission) and $n_j(s)$ is the number remaining at risk at $j$ days. The corresponding matrix of $\hat{\Theta}_j(s)$ is $\Omega(s)$, defined for $j > k$, where $n^{*}(s)$ is the total sample size at time point $s$, taken as halfway between the total uncensored sample size and the total sample size in the pandemic.

Non-parametric approach for cure-rate

Estimator

One of the most important cure models is the proportional hazards cure (PHC) model, which is defined in terms of a proper baseline cdf (cumulative distribution function) and a positive function $\theta(z)$ of $z$, formulated as $\exp(\beta^{T} x)$. Let $Y_i$ be the failure time of the $i$-th subject, let $C_i$ be the $i$-th censoring time and let $t_i = \min(Y_i, C_i)$ be the observed failure time. Let $\delta_i$ be the censoring indicator, with $\delta_i = 0$ when $t_i = C_i$ and $\delta_i = 1$ when $t_i = Y_i$; $z_i$ and $x_i$ are the covariates corresponding to $t_i$.
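As a small illustration of these definitions, the following Python sketch builds the observed pairs $(t_i, \delta_i)$ from failure and censoring times. The function name and the numbers are hypothetical, and the handling of ties ($Y_i = C_i$ treated as an observed failure) is an assumption, since the text does not specify it.

```python
def observed_data(failure_times, censoring_times):
    """Return [(t_i, delta_i)] with t_i = min(Y_i, C_i).

    delta_i = 1 when the failure time is observed (t_i = Y_i),
    delta_i = 0 when the subject is censored (t_i = C_i).
    Ties Y_i == C_i are counted as failures (an assumption).
    """
    pairs = []
    for y, c in zip(failure_times, censoring_times):
        if y <= c:
            pairs.append((y, 1))
        else:
            pairs.append((c, 0))
    return pairs


# Hypothetical failure and censoring times (in days) for three subjects.
Y = [5, 12, 7]
C = [10, 8, 7]
pairs = observed_data(Y, C)
print(pairs)
```

In practice only the pairs $(t_i, \delta_i)$ and the covariates are recorded; $Y_i$ and $C_i$ are never both observed for the same subject.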
The observed data are

$$(t_i, \delta_i, z_i, x_i), \quad i = 1, \ldots, n. \qquad (9)$$

In this work we use the dataset $(t_i, \delta_i, z_i)$, $i = 1, \ldots, n$. Compared with the definition in equation (9), $x$ is suppressed, as it only affects the survival function of patients who are not cured. A non-parametric estimator of the cure-rate, based on the KM method evaluated at the largest uncensored failure time, is [14]:

$$\hat{\pi} = \hat{S}(T_{(n)}), \qquad (11)$$

where $\hat{S}(\cdot)$ is the KM estimate of the population survival function and $T_{(n)}$ is the largest uncensored failure time. The generalized maximum likelihood estimates of the population survival function and of the cure-rate [15] are expressed in terms of the ordered failure times $t_{(1)} < t_{(2)} < \ldots < t_{(n)}$. Whenever $m = 0$, the survival function estimate reduces to the Kaplan-Meier estimate and the cure-rate estimate reduces to the estimate of Maller and Zhou (1992) given in (11). After the estimate of $\theta$ is obtained, with the covariate $z$ suppressed, the cure-rate is estimated from it.

For survival data $(Y_i, \delta_i, z_i)$, $i = 1, \ldots, n$, without a cure fraction, the generalized Kaplan-Meier estimator $\hat{S}(y \mid z)$ of [16], referred to here as (14), is built from weights $B_j(z)$ and the indicator function $I(A)$. A proper weight function $B_j(z)$ must satisfy certain conditions, in which $Y_{(n)}$ denotes the largest failure time. When the covariate $z$ is suppressed, (14) reduces to the KM method. For survival data as in (9) with a cure fraction, we propose estimating $\pi(z)$ as

$$\hat{\pi}(z) = \hat{S}\left(Y^{1}_{(n)} \mid z\right), \qquad (15)$$

where $\hat{S}(y \mid z)$ is the generalized Kaplan-Meier estimate computed from the data in (9) as if there were no cure fraction, and $Y^{1}_{(n)}$ is the largest uncensored failure time.

The bandwidth in the weight functions is $h = \alpha n^{-1/3}$ [17]. We have tried $\alpha$ from 0.1 to 2 and observed that the bandwidths $h_m = 0.2$ for males and $h_f = 0.1$ for females work very well [17]. The weight function $B(z)$ considered in this work is the Nadaraya-Watson weight function

$$B_j(z) = \frac{K\left(\frac{z_j - z}{h}\right)}{\sum_{r=1}^{n} K\left(\frac{z_r - z}{h}\right)}, \quad j = 1, 2, \ldots, n,$$

where $K(\cdot)$ is a proper kernel function.
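The covariate-dependent cure-rate estimate described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation (the paper reports using R): the Nadaraya-Watson weights are taken in their standard form $B_j(z) = K((z_j - z)/h) / \sum_r K((z_r - z)/h)$, and the generalized Kaplan-Meier estimator is assumed to have the usual Beran-type product-limit form, $\hat{S}(y \mid z) = \prod_{i:\, t_i \le y,\ \delta_i = 1} \left[1 - B_i(z) / \sum_r I(t_r \ge t_i) B_r(z)\right]$. The Gaussian kernel, the function names and the toy data are illustrative choices.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an illustrative kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_weights(z, covariates, h, kernel=gaussian_kernel):
    """Nadaraya-Watson weights B_j(z) at covariate value z with bandwidth h."""
    k = [kernel((zj - z) / h) for zj in covariates]
    total = sum(k)
    if total <= 0.0:
        raise ValueError("no kernel mass at z; try a larger bandwidth h")
    return [kj / total for kj in k]


def generalized_km(y, z, times, events, covariates, h):
    """Assumed Beran-type estimate of S(y | z).

    Product over uncensored times t_i <= y of
    1 - B_i(z) / sum_r I(t_r >= t_i) * B_r(z).
    """
    b = nw_weights(z, covariates, h)
    surv = 1.0
    for t_i, d_i, b_i in zip(times, events, b):
        if d_i == 1 and t_i <= y:
            at_risk = sum(b_r for t_r, b_r in zip(times, b) if t_r >= t_i)
            if at_risk > 0.0:
                surv *= 1.0 - b_i / at_risk
    return surv


def cure_rate(z, times, events, covariates, h):
    """Cure-rate estimate: S(y | z) at the largest uncensored failure time."""
    uncensored = [t for t, d in zip(times, events) if d == 1]
    if not uncensored:
        raise ValueError("no uncensored failure times in the sample")
    return generalized_km(max(uncensored), z, times, events, covariates, h)


# Hypothetical toy data: observed times, event indicators (1 = uncensored)
# and an age covariate standardized to (0, 1).
times = [2.0, 3.0, 5.0, 7.0, 8.0, 11.0]
events = [1, 0, 1, 1, 0, 0]
ages = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
pi_hat = cure_rate(z=0.5, times=times, events=events, covariates=ages, h=0.2)
print(f"estimated cure-rate at z = 0.5: {pi_hat:.3f}")
```

When all subjects share the same covariate value, every weight equals $1/n$ and the sketch returns the ordinary Kaplan-Meier estimate at the largest uncensored failure time, which gives a simple check of the code.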
If the covariate is absent, $B_r(z) = 1/n$, $r = 1, 2, \ldots, n$, and the proposed estimator becomes the KM estimator at $Y^{1}_{(n)}$. Therefore, the proposed estimator in (15) reduces to the cure-rate estimator of the no-covariate case [14].

The following analysis is based on the COVID-19 dataset for Spain available on the Institut National d'Études Démographiques (INED) website. Sex-related cumulative mortality rates (CMRs) from COVID-19 per 100,000 men and women are calculated for each day of the covered time periods in Spain. Furthermore, RRs and ORs with 95% CIs are calculated for four age-groups (< 60, 60-69, 70-79 and ≥ 80 years) using the COVID-19 dataset for Spain. In Spain, the CMRs increase with advancing age (Table 1).

The non-parametric KM-based approach provides sensible estimates of the CFR whenever the degree of censoring is moderate. In Italy, there are various reasons for the higher CFR, such as: (i) the age structure of the Italian population (the 2nd oldest in the world), and (ii) the highest rate of antibiotic-resistance deaths in Europe. In this work, the data were collected from the Istituto Superiore di Sanità (ISS), Rome, at 4 pm on March 26, 2020. A total of 73,780 cases of COVID-19 were reported as diagnosed positive by the regional reference laboratories (15,781 more cases than in the previous bulletin, of March 23, 2020). The diagnosis of COVID-19 infection was confirmed in 99% of the samples sent by the regional reference laboratories and processed by the national reference laboratory (ISS). Fig. 1 clearly shows that the number of censored cases increases gradually over the successive time points. Information on sex is known for 73,044 out of 73,780 cases.
The difference in the number of cases reported by sex increases progressively in favour of male subjects up to the 70-79 age-group, with the exception of the 20-29 and 30-39 year groups, in which the number of female subjects is slightly higher. For ≥ 90 years, the number of female cases exceeds the number of male cases, probably because of the demographic structure of the population. The observed CFRs increase with age and reflect the vulnerability of the older age-groups in both sexes.

In such a pandemic, the simple estimator from equation (2), the ratio of deaths to cases, denoted $e_1$, underestimates the CFR, because its numerator underestimates the total number of COVID-19 deaths that will eventually occur in the sample. The other estimate, obtained from equation (3) as the ratio of deaths to cases whose outcome is already known and denoted $e_2$, is sensible in such a pandemic. In different scenarios, group-specific estimates of the CFR are very desirable for both age and gender. Table 2 provides the estimates of the CFR obtained for several age-groups. The CFR differences between age-groups are also observed to change for the larger Italian cohorts. This analysis shows a marked age effect on the CFR for patients aged 70-79 years and > 79 years, for both genders. It also shows a significantly lower mortality rate among female patients (especially among those aged 80 years or more). By comparison, the age-groups below 70 years show an average CFR, which indicates a comparatively low rate for younger people.

The summary of the COVID-19 dataset for Israel used in this work was posted (June 28, 2020) by the Israel Ministry of Health (IMH). The mortality and morbidity due to COVID-19 are higher for older patients [18]. This important information can improve our knowledge of COVID-19 and help public healthcare in future coronavirus pandemics or epidemics.
The age subgroups in the underlying dataset follow decades for ages 20-59 years, while older and younger COVID-19 patients are assigned to the age-groups ≥ 60 years and 0-19 years, respectively. In the dataset, recovery is determined by two consecutive negative COVID-19 results from throat swab tests. For the proposed cure-rate estimation, we have used the bandwidths $h_m = 0.2$ for males and $h_f = 0.1$ for females, with the age variable standardized to (0, 1). For the non-parametric model, Table 3 shows that the cure-rates are substantial for the age-groups under consideration: the cure-rate ranges from approximately 5.3% to 9% for males and from approximately 4% to 7% for females. Whenever age is very large or very small, the cure-rates tend to increase as age increases, producing an S-shaped curve.

For a full view of how age affects the cure-rate for both genders, we have estimated the cure-rate corresponding to the actual age, from ≤ 19 to ≥ 60, and plotted the relation between the cure-rate and age in Fig. 2, which depicts the S-shaped relationship between cure-rate and age in the underlying dataset. The original dataset is divided into 6 subgroups based on age: ≤ 19, (19, 29], (29, 39], (39, 49], (49, 59] and ≥ 60, for both males (red line in Fig. 2) and females (blue line in Fig. 2). The approach proposed by Maller and Zhou (1992) is used in each of the 6 subgroups to estimate the cure-rates, which are shown in Fig. 2 per age subgroup [14].

Table 3 shows that the cure-rates for males decrease as age increases from subgroup 2 to subgroup 5. The cure-rates of subgroup 1 and subgroup 6 do not follow this pattern: the former is greater than the others (except group 2), and the latter is the second greatest (except group 2), relative to the cure-rates of their respective adjacent groups.
Table 3 also shows that the cure-rates for females decrease as age increases from subgroup 1 to subgroup 5. However, the cure-rates of subgroup 6 do not follow this pattern. The results of the non-parametric approach capture the influence of covariates on these models and indicate that both gender and age have a significant effect (shown in Tables 2 and 3); the corresponding curves for the individual groups are displayed in Fig. 2.

In epidemiology, the case fatality rate is typically used as a measure of disease severity and is often used for prognosis. It can also be used to evaluate the effect of new treatments, with the measure decreasing as treatments improve. From the daily cumulative deaths by age and sex due to COVID-19 in Spain, and using population estimates, it is concluded that the risk of death increases with age and that men have higher mortality from COVID-19 than women in almost all age-groups. Sex differences increase from ages < 60 years to 60-69 years but decrease thereafter, with the smallest sex difference at ages ≥ 80 years, for both the RR and the OR.

A recent work examines survival in seven populations under extreme conditions from famines, epidemics and slavery: even when mortality is very high, women survive better on average than men [19, 20]. That work shows that the survival advantage of women has fundamental biological underpinnings, but that the female advantage is modulated by a complex interaction of biological, environmental and social factors [20]. Although we have demonstrated a higher mortality from COVID-19 for men than for women in almost all age-groups, we find, as hypothesised, a reduction in the relative risk of mortality for men at later ages, consistent with findings elsewhere [21].
This work highlights the importance of addressing the impact of sex on mortality from disease epidemics, but studies using individual-level data are needed to confirm an interaction between age and sex in COVID-19 mortality, in order to guide clinical care personnel and to address the question of whether men require additional surveillance, prevention and earlier intervention than women. The work also provides two approaches for estimating the CFR computed from the known outcomes, and then proceeds with a modified KM approach that adequately estimates the CFR during the COVID-19 pandemic. The first technique is appealing because of its simplicity and the ease with which it can be computed. The modified KM estimator handles such censored data and detects changes in the CFR more rapidly, both by gender and by age.

Inaccurate estimates, inconsistent intelligence, conflicting messages and the resulting misinformation have an impact on public health: they can increase awareness among people but also induce the panic that accompanies major outbreaks of COVID-19. Among several important aims, another major challenge encountered during a disease epidemic such as COVID-19 is understanding the reasons underlying the contrast in the CFRs reported for Italy. In future epidemics (in the context of COVID-19), careful estimation and proper analysis of the CFR can be used to evaluate the effectiveness of any new treatments.

Under pandemic conditions, we have first demonstrated the CFR as defined by the modified KM technique presented in this work. In addition, we have illustrated the non-parametric approach for exploring the effect of covariates on the cure-rate for survival data with a cure fraction. Our simulations show that the proposed non-parametric approach is flexible and can accommodate a complex effect of a covariate on the cure-rate.
The proposed non-parametric method is applied to distinct age-groups for both genders. It can also be extended to the multiple-covariate case, especially when the number of covariates is greater than two. In the present work, both gender and age play a significant role in the prevalence predicted by the underlying approaches with current coronavirus data. The COVID-19 pandemic shows an increased number of cases and a greater risk of severe disease with increasing age and for males. Overall, under the weighted approach, the cure-rate tends to decrease as age increases, for both genders.

The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
COVID-19) Situation Report-61
Italy sewage study suggests COVID-19 was there in
Institut National D'Etudes Demographiques (INED) COVID-19. Demographics of COVID-19 deaths 2020
Generalized additive models
Non-parametric regression and generalized additive models: a roughness penalty approach
The use of mixture models for the analysis of survival data with long-term survivors
A generalized F mixture model for cure rate estimation
Statistical modeling for cancer mortality
A proportional hazards model taking account of long-term survivors
Estimation of survival based on proportional hazards when cure is a possibility
Estimating the proportion of immunes in a censored sample
Nonparametric estimation and testing in a cure model
Nonparametric regression with randomly censored survival data
Non-Parametric regression with censored survival time data
Comparison of Regression Approaches for Analyzing Survival Data in the Presence of Competing Risks: An Application to COVID-19
Life expectancy: women now on top everywhere
Women live longer than men even during severe famines and epidemics
A geroscience perspective on COVID-19 mortality
The authors are grateful to the learned reviewers and Dr. Rehab Hosny El-Sokkary (Editor) for their careful reading, valuable comments and helpful suggestions, which have helped them to improve the presentation of this work significantly. The research of J.J. Nieto has been partially supported by the Agencia Estatal de Investigacion (AEI) of Spain, cofinanced by the European Fund for Regional Development (FEDER) corresponding to the 2014-2020 multiyear financial framework, project PID2020-113275GB-I00; and by Xunta de Galicia under grant ED431C 2019/02, and Instituto de Salud Carlos III, grant COV20/00617.