key: cord-0701894-wjhfat4w authors: Atlam, Mostafa; Torkey, Hanaa; El-Fishawy, Nawal; Salem, Hanaa title: Coronavirus disease 2019 (COVID-19): survival analysis using deep learning and Cox regression model date: 2021-02-15 journal: Pattern Anal Appl DOI: 10.1007/s10044-021-00958-0 sha: 3b6bc445728c3e9f2cfce467925bc35e49a3d187 doc_id: 701894 cord_uid: wjhfat4w Coronavirus (COVID-19) is one of the most serious problems that has caused stopping the wheel of life all over the world. It is widely spread to the extent that hospital places are not available for all patients. Therefore, most hospitals accept patients whose recovery rate is high. Machine learning techniques and artificial intelligence have been deployed for computing infection risks, performing survival analysis and classification. Survival analysis (time-to-event analysis) is widely used in many areas such as engineering and medicine. This paper presents two systems, Cox_COVID_19 and Deep_ Cox_COVID_19 that are based on Cox regression to study the survival analysis for COVID-19 and help hospitals to choose patients with better chances of survival and predict the most important symptoms (features) affecting survival probability. Cox_COVID_19 is based on Cox regression and Deep_Cox_COVID_19 is a combination of autoencoder deep neural network and Cox regression to enhance prediction accuracy. A clinical dataset for COVID-19 patients is used. This dataset consists of 1085 patients. The results show that applying an autoencoder on the data to reconstruct features, before applying Cox regression algorithm, would improve the results by increasing concordance, accuracy and precision. For Deep_ Cox_COVID_19 system, it has a concordance of 0.983 for training and 0.999 for testing, but for Cox_COVID_19 system, it has a concordance of 0.923 for training and 0.896 for testing. The most important features affecting mortality are, age, muscle pain, pneumonia and throat pain. Both Cox_COVID_19 and Deep_ Cox_COVID_19 prediction systems can predict the survival probability and present significant symptoms (features) that differentiate severe cases and death cases. But the accuracy of Deep_Cox_Covid_19 outperforms that of Cox_Covid_19. Both systems can provide definite information for doctors about detection and intervention to be taken, which can reduce mortality. Coronaviruses problem is one of the most serious problems, that faces the world [1] . Coronaviruses were first discovered in the 1930s, but only animals were infected with it. Human coronaviruses were discovered in the 1960s. Coronaviruses have taken many phases of mutation; it started as the common cold in 1960s, till reaching the current form with respiratory effects. A novel coronavirus was reported as the cause of a cluster of cases of pneumonia in Wuhan, a city in China's Hubei Province, at the end of 2019. It spread exponentially, leading to an epidemic across China, followed by several cases across the world in other countries. The World Health Organization (WHO) identified the disease COVID-19 in February 2020, which stands for coronavirus disease 2019 [2] . Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes COVID-19; previously, and it was referred to as 2019-nCoV. Because of the increase in mortality due to COVID-19 and the increase in the speed of its spread, many methods have been developed to reliably predict patient survival based on symptom data and specific clinical parameters. Artificial intelligence is now needed in order to help expert epidemiologists. AI provides a useful tool, that can help in computing risk factor, classification, even drug analysis, and finally responding to crisis according to health data specialists. Because of the increase in COVID-19 patients and the lack of equipment to receive all patients, a hard choice is taken. The necessary medical care is applied only to patients with more probability to survive. Calculating the probability to survive and the effect of each feature like symptoms in our case on survival probability is done using survival analysis. Survival analysis is a model for time until a certain "event". Time-to-event data encounters several research challenges such as censoring, symptoms (features) correlations, high-dimensionality, temporal dependencies, and difficulty in acquiring sufficient event data in a reasonable period of time [3] . There are many current literature techniques for conducting this sort of survival study. Among them, the Cox Proportional Hazards Model (Cox) [4] which is a Regression models that is commonly used in survival analysis [5] . Survival analysis methods can work with specific problems, with a data type that waits for the event to occur. Cox regression is the most appropriate method to deal with this kind of data. Occasionally, the basic assumptions of the model, for example, non-proportionality for the Cox model, are not true. The choice of an appropriate model varies depending on the complexity and features that affect the suitability of the model [6] , in model building. Data-driven approaches are robust. A long-range regression analysis of COVID-19 using immunological, epidemiological, and seasonal effects on US data is going to be done out to 2025 [7] . The implementation of the designated dataset in its original form in this research led to the appearance of some trammels such as high collinearity and convergence, so autoencoder deep neural network is implemented to solve such problems. To predict the survivability analysis, Cox regression is implemented in two situations: the first is called Cox_Covid_19, and the second is called Deep_Cox_ Covid_19. Cox_Covid_19 implements Cox regression on the original dataset, while Deep_Cox_Covid_19 implements autoencoder deep learning before Cox regression to solve the problems associated with the dataset. Our main objective in this paper is to define the main features affecting the survival probability for COVID-19 with the aid of the most suitable machine learning algorithm. This information can help doctors in taking the right decision about each patient's case according to the available treatment and medical instruments. The principal contributions of this research could be summarized as the following: • At first, finding the survival probability for each patient. • At second, finding the impact of each feature on survival probability by calculating p value for each feature. The p value is an indication of the impact of each feature on survival. • Enhancing Cox_COVID_19 system by applying deep neural network, for increasing accuracy, presenting a new system called Deep_ Cox_COVID_19. • Finally, a comparison between Cox_COVID_19 and Deep_ Cox_COVID_19 is provided in terms of concordance, accuracy, precision, and recall. The paper is organized as follows: The related work is presented in Sect. 2. Then, the proposed system design and implementation are introduced in Sect. 3. Section 4 introduces the results and discussion. Finally, the conclusion is discussed in the last section. Up to now, the 2019 novel coronavirus pneumonia (COVID-19) is one the greatest public health problems that faced the world throughout its history. Worldwide, as of 2:00 a.m. CEST, 14 Apr 2020, 1,980,704 confirmed cases of COVID-19 have been reported to the WHO, including 67,666 deaths [2] . One of the earliest applications of data mining techniques was in medical fields in which it can accurately predict and diagnose diseases and improve medical decisionmaking. Many researchers have focused on conducting work in COVID_19 application and experimental application of medical datasets for scientific purposes. Khan et al. [8] for accelerated failure time models, developed variable selection approaches consisting of a group of algorithms based on a combination of the Dantzig selector and the Buckley-James method, two frequently used techniques in the field of variable selection for survival analysis. Additionally, Khan et al. [9] proposed new approaches to variable selection for censored results, based on optimized AFT models using regularized weighted least squares. A mixture of L 1 and L 2 standard penalties under two proposed elastic net type approaches is used by the regularized technique. The two proposed methods are also expanded by incorporating censoring observations into their model optimization structures as constraints. XGBoost machine learning algorithm predictive modelling is presented by Yan et al. [10] . Their methodology is able to predict mortality risk for COVID-19 dataset. The proposed model defined that lactic dehydrogenase (LDH), lymphocyte and high-sensitivity C-reactive protein (hs-CRP) are the key features for differentiating between critical patients from the two classes. Xiang et al. [11] present an algorithm for determining COVID-19 biomarker for earlier diagnosis based on IBM SPSS statistics 22.0 (New York, USA) software for statistical analysis, the Wilcoxon signedrank test for data comparison between two groups. They found that using serum urea, CREA, CysC, DBIL, CHE, and LDH can be used to differentiate severely COVID-19 cases from non-severe COVID-19 cases. In addition of using many parametric models, but the best was chosen by Bayesian information criterion (BIC) as presented by Mollazehi et al. [12] . The authors apply their algorithm on COVID-19 dataset for patients in Singapore to predict the recovery time from COVID-19 in Singapore between 23 January and 13 March 2020. Nemati et al. [13] introduced an approach that was applied on 1,182 COVID-19 cases and was based on several statistical methods to evaluate survival characteristics. By using various ML and statistical analysis approaches, the discharge-time prediction of COVID-19 cases was assessed. The findings show that in this analysis the Gradient Boosting survival model outperforms other models of patient survival. A semi-supervised learning method based on the Cox and AFT models is presented by Liang et al. [14] applied on DLBCL (2002), DLBCL (2003), lung cancer and AML to overcome small sample size and censored data that limit accuracy. The semi-supervised model of learning can substantially increase the predictive performance of Cox and AFT models in the survival study. Lee et al. [15] presented algorithms based on cause-specific version of the Cox Proportional Hazards Model (cs-Cox) and DeepHit applied in breast cancer dataset for handling competing risks. Moreover, Ranganath et al. [16] presented a heterogeneous data types that occur in the electronic health record based on baseline Framingham risk score deep survival analysis. Survival analysis is a time till event analysis. Cox regression is one of the most commonly methods in survival analysis. Here, we used an autoencoder deep neural network to improve performance. In this section, all methods that we used in this paper and different types of survival analysis methods are presented. Survival analysis models [17, 18] are categorized to parametric, nonparametric and semi-parametric models, as shown in Fig. 1 [18] . For the parametric model, it isn't suitable for normal distribution in which negative values can be found. The parametric model assumes that the survival time follows a known distribution. Methods of parametric model are such as Tobit, Buckley-James, Penalized regression and Accelerated Failure Time. For the semi-parametric model, even if the regression parameters are known, the distribution of the survival time is still unknown. It isn't a fully parametric or a fully nonparametric. Methods of the semi-parametric model are such as Cox model, Regularized Cox, CoxBoost and Time-Dependent Cox. For the nonparametric model, it is difficult to understand and gives unreliable estimates but more efficient when the appropriate theoretical distributions aren't known. Methods of nonparametric model are such as Kaplan-Meier, Nelson-Aalen and Life-Table. Cox regression [19] method is a statistical method that is used often in medical research, for predicting the survival time for different patients. Cox Regression method is used for predicting the degree of effect of different features upon survival which is called hazard rate. Cox regression method is considered as an example of semi-parametric models. The hazard function h(t) can be used to express the Cox model. Shortly, the probability of dying at time t is provided by the hazard function. It can be estimated as follows: where the survival time is expressed by "t". The hazard function h(t) is determined using "n covariates" (x 1 , x 2 , … , x n ) . The impact of covariates is measured using the coefficients Autoencoder is an unsupervised artificial neural network. An autoencoder is a neural network that can efficiently encode data then reshaping and reconstructing the data back from the encoded representation, removing noise from the data that can affect the performance of the prediction model. Figure 2 shows the layers of the used autoencoder neural network. Figure 2 shows the autoencoder components: • Encoder In this step, reducing the input dimensions and compressing the input data into an encoded representation is the main goal. • Bottleneck This layer contains the lowest possible dimensions of the input data; the compressed representation of the input is presented in this layer. • Decoder In this step, the data are reconstructed from the encoded version to extract a new representation of data, that is as close to the original input as possible and removing noise. • Reconstruction loss This is the method for measuring the performance of our decoder and how close the new representation to the original input. The two systems proposed in this paper are Deep_Cox_ COVID_19 and Cox_COVID_19. Both systems are used to define significant symptoms (features) that differentiate between sever and death cases. Cox_COVID-19 is based on Cox regression to predict the survival probability. Deep_ Cox_COVID_19 is a combination of deep neural network which is autoencoder and Cox regression method to predict the survival probability. The system is made up of three major stages as shown in Fig. 3 : • Data preprocessing in this paper contributes reading COVID_19 dataset, then solving categorical data problems, and handling missing data. After reading the dataset, there were some categorical to be handled such as gender and country, Labelencoder is applied to solve the categorical problem of gender feature, and oneho- In this section, a dataset description, validation, and the findings of adding an autoencoder deep neural network to a Cox regression model are presented. In this study, a clinical dataset for COVID-19 patients is used [20] . This dataset consists of 1085 patients, and as features, it has an id for each patient, reporting date, summary, location, country, gender, age, symptom, hospital visit, exposure_Start, exposure_end, visiting Wuhan, from Wuhan, death and recovered. The symptoms can be divided into demographics symptoms, common symptoms and other symptoms and all shown in Fig. 4 . The demographics symptoms are age, gender, country, from Wuhan and visiting Wuhan. The common symptoms are like fever, cough, pneumonia, headache and throat pain. The other symptoms are like chills, joint pain, thirst, flu and reflux. For each patient, if died, the date of death is shown in the summary column and if alive it means till the date of the dataset downloaded. So, the duration till death or being alive is calculated. For patients with no symptom's information, they were removed from the dataset and for patients with no gender information, they were removed from the dataset leaving the information for only 509 patients. For the train subset, there are 36 samples that have died, and the other 333 sample that are still alive. For the test subset, there are 10 samples that have died and the other 130 are still alive. The features that are used in this system, age, gender, and different symptoms. The information about the used dataset after removing patients that don't have enough information is shown in Table 1 . COVID-19 features can be categorized into demographics and symptoms. According to common protocols all over the world, symptoms can be categorized into common symptoms with most patients and other symptoms. Figure 4 shows different characteristics for patients with COVID-19. To rank the model, concordance index [21, 22] is used. The concordance index is like measuring accuracy in classification problems but in survival analysis. Concordance is used as a rating of how well the model is. Concordance maximum value is one. The closer the concordance to one, the where " | | " is the number of edges in the order graph, and f x i is the predicted survival time for an item " i". Cox_COVID_19 model has a concordance of 0.923 out of 1 for training and 0.896 out of 1 for testing, so it is a very good Cox model. Deep_Cox_COVID_19 improved the performance as it has a concordance of 0.983 for training and 0.999 for testing. For accuracy [23] , it can be defined as the percentage of right predictions that could be done by machine learning model. Formally, accuracy can be defined using Eq. 3 [21] : where "TP": True positives; "TN": True negatives; "FP": False positives, and "FN": False negatives. Precision Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. For each feature, the percentage of each occurrence in the dataset, p value, and coefficient are computed. The p value is used with each feature for testing the null hypothesis that the coefficient is equal to zero (no effect). For a predictor to have a lower p value (that can be less than 0.05), it means that it has a big impact on your model, and any changes in the predictor's value will lead to changes in the response variable. A positive coefficient indicates a worse prognosis, and a negative coefficient indicates a protective effect of the variable with which it is associated. The features are categorized into demographics and symptoms; the symptoms are categorized into common symptoms and other symptoms according to common protocols all over the world. Table 2 shows that age, muscle pain, pneumonia all have p value which is less than 0.05 and throat pain has a p value that is higher than 0.05 but still very small. So that age, pneumonia, muscle pain and throat pain are the most important factors affecting mortality. The main contribution for Cox_COVID_19 model is to predict the survival probability overtime. Figure 4 shows the survival probability for fifteen patients' overtime that are chosen as examples for predicting survival probability. Survival probability for randomly ten patients using Cox_COVID_19 is shown in Fig. 5a . It is shown that patient 6 and patient 8 have a survival probability up to one over the whole time, patient 7 has a survival probability that is close to one, patient 9 has a probability of surviving, that is decreasing to be close to 0.7, and patient 5 has a probability that is decreasing overtime till reaching less than 0.1. Survival probability for another randomly five patients using Cox_COVID_19 is shown in Fig. 5b. For Fig. 6a , survival probability for the first randomly five patients using Deep_Cox_COVID_19 is presented. Figure 6b shows the survival probability for the second randomly five patients using Deep_Cox_COVID_19. The implementation of the designated dataset in its original form in this research led to the appearance of some trammels such as high collinearity and convergence, so autoencoder deep neural network is implemented to solve such problems by reconstructing features presenting a new system called Deep_Cox_COVID_19. After comparing results, it is shown that Deep_Cox_COVID_19 system outperforms Cox_COVID_19 system in terms of concordance, accuracy and precision. Table 3 shows the internal construction of the autoencoder, we built in this paper. It consists of one layer with 31 nodes for the input layer, of one layer with 31 nodes as an encoder, one layer with 30 nodes as bottleneck and finally the decoder to reconstruct data, with one layer and 31 nodes. Loss functions are used to determine the error between the prediction of our model and the actual target variable, the used reconstruction loss function in this paper is binary_crossentropy. An activation function is the function used to predict the output based on the input, and the used activation function in this paper is Rectified Linear Unit (ReLU). ReLU can be considered as a linear function in which the input will be directly outputted if it is positive; otherwise, it will output zero. For several forms of neural networks, it has become the default activation feature because it is easier to train a model that uses it and often achieves better performance. Cox regression [17] is used to predict the probability of survival for each patient. For evaluating the accuracy of the model, a threshold is used. So that, the patients with a probability for survival higher than the threshold, they are the closest to survive, and patients with a probability for survival lower than the threshold, they are the furthest to survive. Table 4 shows a comparison between Cox_COVID_19 system and Deep_Cox_COVID_19 system. It is shown in Table 4 that, for Cox_COVID_19 system, the best accuracy of the test subset is up to 95.71%, with a threshold up to 0.1 and accuracy for the train subset up to 93.5%. The best accuracy with the train subset is 96.21%, with a threshold up to 0.3 and accuracy with the test subset up to 95.71%. As shown in the results below, Deep_Cox_COVID_19 system always gives better accuracy for both test and train subsets with all subsets. As shown in Table 5, Deep_Cox_COVID_19 system always gives better precision for both test and train subsets with all subsets. A comparison between our proposed system (Cox_ COVID_19, and Deep_Cox_COVID_19) results and different algorithms (IPCRidge, CoxPH, Coxnet, Stagewise GB, Componentwise GB, Fast SVM, and Fast Kernel SVM) [24] applied on the clinical dataset for COVID-19 is presented in Table 6 . The results show the proposed system achieves higher survival function accuracy as shown in Fig. 7 . A comparison between our Cox_COVID_19 results and another algorithm applied on the same clinical dataset for COVID-19 is presented in Table 7 . The results show that age, muscle pain, and pneumonia all have p value that is less than 0.05 and throat pain has a p value that is higher than 0.05 but still very small. So that, age, pneumonia, muscle pain and throat pain are the most important factors affecting the mortality. For evaluating the accuracy of the model, a threshold is used. So that the patients with a probability for survival higher than the threshold, they are the closest to survive, and patients with a probability for survival lower than the threshold, they are the furthest to survive. The best result for Cox_COVID_19 is when the threshold up to 0.1 with a training accuracy up to 93.5% and a testing accuracy up to 95.71%. Deep_ Cox_COVID_19 shows better results with all threshold values, but the best result was when the threshold was up to 0.3 with a training accuracy up to 96.21% and a testing accuracy up to 95.71%. An autoencoder deep neural network is implemented to solve problems like high collinearity and convergence presenting Deep_Cox_COVID_19 system. Deep_Cox_COVID_19 system outperforms Cox_ COVID_19 in terms of concordance, accuracy and precision. Coronaviruses: a paradigm of new emerging zoonotic diseases Director-General's remarks at the media briefing on Application of machine learning for survival analysis-a review A method for checking regression models in survival analysis based on the risk score Nonlinear Survival Regression Using Artificial Neural Network Projecting the transmission dynamics of SARS-CoV-2 through the post-pandemic period Variable selection for accelerated lifetime models with synthesized estimation techniques Variable selection for survival data with a class of adaptive elastic net techniques Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan Potential biochemical markers to identify severe cases among COVID-19 patients Modeling survival time to recovery from COVID-19: a case study on Singapore Machine-learning approaches in COVID-19 survival analysis and discharge-time likelihood prediction using clinical data Cancer survival analysis using semisupervised learning method based on Cox and AFT models with L12⁄regularization DeepHit: a deep learning approach to survival analysis with competing risks Deep survival analysis Regression models and life tables Machine learning for survival analysis: a survey Survival analysis part II: multivariate data analysis -an introduction to concepts and methods On Ranking in Survival Analysis: Bounds on the Concordance Index Estimating the concordance probability in a survival analysis with a discrete number of risk groups Intelligent decision support system for breast cancer diagnosis by gene expression profiles Machine-learning approaches in COVID-19 survival analysis and discharge-time likelihood prediction using clinical data Development and validation of a risk factorbased system to predict short-term survival in adult hospitalized patients with COVID-19 a multicenter retrospective cohort study Acknowledgements We thank Dr. Noura Atef for useful discussions and providing us with the necessary medical information.