key: cord-204125-fvd6d44c authors: Chowdhury, Muhammad E. H.; Rahman, Tawsifur; Khandakar, Amith; Al-Madeed, Somaya; Zughaier, Susu M.; Doi, Suhail A. R.; Hassen, Hanadi; Islam, Mohammad T. title: An early warning tool for predicting mortality risk of COVID-19 patients using machine learning date: 2020-07-29 journal: nan DOI: nan sha: doc_id: 204125 cord_uid: fvd6d44c
The COVID-19 pandemic has put extreme pressure on global healthcare services. Fast, reliable and early clinical assessment of disease severity can help in allocating and prioritizing resources to reduce mortality. To study the important blood biomarkers for predicting disease mortality, a retrospective study was conducted on 375 COVID-19 positive patients admitted to Tongji Hospital (China) from January 10 to February 18, 2020. Demographic and clinical characteristics and patient outcomes were investigated using machine learning tools to identify key biomarkers for predicting the mortality of individual patients. A nomogram was developed for predicting the mortality risk among COVID-19 patients. Lactate dehydrogenase, neutrophils (%), lymphocytes (%), high-sensitivity C-reactive protein, and age, all acquired at hospital admission, were identified as key predictors of death by a multi-tree XGBoost model. The area under the curve (AUC) of the nomogram for the derivation and validation cohorts was 0.961 and 0.991, respectively. An integrated score (LNLCA) was calculated with the corresponding death probability. COVID-19 patients were divided into three subgroups (low-, moderate- and high-risk) using LNLCA cut-off values of 10.4 and 12.65, with death probabilities of less than 5%, 5% to 50%, and above 50%, respectively. The prognostic model, nomogram and LNLCA score can help in the early detection of high mortality risk in COVID-19 patients, which will help doctors improve patient stratification and management.
The novel coronavirus disease (COVID-19) has spread rapidly throughout the world from Wuhan (Hubei, China) since December 2019 [1-5]. Since the outbreak, the number of reported cases has surpassed 12 million, with more than 550 thousand deaths worldwide as of 12 July 2020 [6]. COVID-19 is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a member of the coronavirus family. On 11 March 2020, COVID-19 was declared a pandemic by the World Health Organization (WHO) [7]. Due to the pandemic, hospital capacity is being exceeded in many places, and hospitals face shortages of medical staff, personal protective equipment, life-support equipment and other resources [8, 9]. Symptoms of COVID-19 are nonspecific, and infected individuals may develop fever (83-99%), cough (59-82%), loss of appetite (40-84%), fatigue (44-70%), shortness of breath (31-40%), coughing up sputum (28-33%) or muscle aches (11-35%) [10]. The disease can progress further into severe pneumonia, acute respiratory distress syndrome (ARDS), myocardial injury, sepsis, septic shock, and even death [11]. Although most COVID-19 patients have a mild illness, some patients deteriorate rapidly (particularly within 7-14 days of symptom onset) into severe COVID-19 with or without ARDS [12, 13]. Current epidemiological data suggest that the mortality rate of patients with severe COVID-19 is higher than that of patients with non-severe COVID-19 [14, 15]. It has been reported that 26.1-32.0% of patients infected with COVID-19 are prone to progressing to critical illness [16]. Recent studies have confirmed a high fatality rate of 61.5% for critical cases, which increases with age and other medical comorbidities [16]. A large cohort study of 2449 patients has reported that during this pandemic the healthcare system can be overwhelmed by hospitalization (20-31%) and intensive care unit (ICU) admission rates (4.9-11.5%) [17].
This can be avoided by prioritizing hospital treatment for patients at high risk of deterioration and death, and by treating low-risk patients in ambulatory environments or through home-based self-quarantine. An effective tool is required to predict the disease trajectory, so that resources can be allocated efficiently and the patient's condition improved. Given the potential of this approach, it is important to identify key patient variables that can help to predict the course of the disease at diagnosis. In other words, early identification of patients at high risk of progression to severe COVID-19 will help in the efficient utilization of healthcare resources, via patient prioritization, to reduce the mortality rate. Several studies indicate that biomarkers can help to classify COVID-19 patients with an elevated risk of serious disease and mortality by providing crucial information about the patients' health status. Al Youha et al. [18] proposed a prognostic model called the Kuwait Progression Indicator (KPI) Score for predicting progression of severity in COVID-19. The KPI model was based on quantifiable laboratory readings, unlike other scoring systems based on self-reported symptoms and other subjective parameters. The KPI score categorizes patients as low risk if the score is below -7 and as high risk if the score is above 16; however, the progression risk in the intermediate group (patients with scores from -6 to 15) was deemed uncertain by the authors. Such an intermediate category exists in many prognostic systems. Weng et al. [19] reported an early prediction score called ANDC to predict mortality risk for COVID-19 patients using data from 301 adult patients. LASSO regression identified age, neutrophil-to-lymphocyte ratio (NLR), D-dimer, and C-reactive protein recorded at admission as mortality predictors for COVID-19 patients [19].
They developed a nomogram demonstrating good performance and also derived an integrated score, ANDC, with its corresponding death probability. They also derived cut-off ANDC values to classify COVID-19 patients into three groups: low-, moderate- and high-risk. The death probability was less than 5%, 5% to 50%, and more than 50% in the low-, moderate- and high-risk groups, respectively. Using a cohort of 444 patients, Xie et al. [20] proposed a prognostic model using lactate dehydrogenase, lymphocyte count, age, and SpO2 as key predictors of COVID-19 related death. The model showed good discrimination in internal and external validation, with C-statistics of 0.89 and 0.98, respectively. Although internal calibration was promising, external validation showed over-prediction for low-risk patients and under-prediction for high-risk patients. Yan et al. [21] reported a machine learning approach that selected three biomarkers (lactic dehydrogenase (LDH), lymphocytes and high-sensitivity C-reactive protein (hs-CRP)) and used them to predict the mortality of individual patients 10 days in advance with more than 90 percent accuracy. In particular, high levels of LDH alone were found to play a crucial role in identifying the vast majority of cases that require immediate medical attention. However, that work reported no scoring system that could help clinicians to quantify the risk of individual patients. Another clinical study on 82 COVID-19 patients showed that respiratory, cardiac, hemorrhagic, hepatic, and renal injury had caused death in 100%, 89%, 80.5%, 78.0%, and 31.7% of patients, respectively. Most of the patients had increased CRP (100%) and D-dimer (97.1%) [22]. A D-dimer level greater than 1 μg/mL upon admission was also shown to significantly increase the odds of death [23, 24].
Although several prognostic models have been proposed for the early detection of individuals at high risk of COVID-19 mortality, a major gap remains in the design of state-of-the-art interpretable machine learning algorithms and high-performance quantitative scoring systems that identify the most predictive biomarkers of patient death. Identifying and prioritizing those at severe risk is important for both resource planning and treatment. Moreover, it should be possible to monitor high-risk patients continuously during their hospital stay using a reliable scoring tool. Likewise, avoiding the admission of patients with a very low risk of complications, who can be managed safely by self-quarantine, will help to minimize the pressure on healthcare facilities. Therefore, using a state-of-the-art machine learning algorithm, an early prediction scoring system was developed and implemented to identify the most discriminatory biomarkers of patient mortality. The problem was initially framed as a classification problem for determining the most appropriate biomarkers with the aid of the corresponding survival or death outcomes at the end of the test period. The top-ranked features with the best classification performance were used to develop a multivariable logistic regression-based nomogram, which was validated for the prognosis of death and survival. The findings of this study provide a simple, easy-to-use and reliable algorithm for the prognosis of high-risk individuals, with potential for clinical application.
Blood samples collected between 10 January and 18 February 2020 from 375 patients in Wuhan, China were retrospectively analyzed to identify reliable and relevant markers of mortality risk. Medical records were collected using standard case report forms, which included epidemiological, demographic, clinical, laboratory and mortality outcome information. Yan et al.
[21] have published the dataset along with their article, and the original study was approved by the Tongji Hospital Ethics Committee. The exclusion criteria for the study were age below 18 years, pregnancy, breastfeeding, and more than 20% missing data. Out of 375 patients, 187 (49.9%) had fever, while cough, fatigue, dyspnea, chest distress and muscular soreness were present in 52 (13.9%), 14
Even though data from multiple blood samples were available for the patients, only the data from the first sample were used as inputs for model training and validation to identify the key predictors of disease severity. The model also helps in distinguishing patients who require immediate medical assistance.
Research using clinically captured data often suffers from missing data, which can introduce bias or degrade the analytical outcomes. A simple way to handle this challenge is to delete the affected rows of data from further analysis. This simple approach is not very useful, as it discards valuable information that would have benefited the analysis and can also lead to biased estimates [25].
A diagnostic nomogram was constructed with Alexander Zlotnik's Nomolog [29], based on multivariate logistic regression, and an analysis was carried out, using Stata software, to identify the threshold values at which the nomogram was clinically useful. Each parameter is drawn as a numbered horizontal scale, and the patient's values are marked on these scales. A vertical line is drawn from each parameter scale down to a score axis. All five scores on the score axis are added to give a total score, which is linked to a death probability. According to the nomogram, a higher score corresponds to a higher death probability. The model was designed using the initial blood sample of the patients.
However, it can also be applied to biomarkers collected later during the patients' hospital stay to predict the death probability longitudinally using the LNLCA score.
Of the 375 patients, 174 (46.4%) died, while 201 (53.6%) recovered from COVID-19 and were discharged from hospital.
Figure 1
To determine the independent variables associated with death, univariate logistic regression analysis was performed with the Top-1, Top-2, and up to Top-10 features identified using two different techniques. Figure 3 shows that the five top-ranked features produced the highest AUC, 0.97, for data imputed using the MICE algorithm, while the three top-ranked features produced the highest AUC, 0.95, for data in which missing values were filled with -1. Table 2 shows the overall accuracies and the weighted averages of the other performance metrics for the models using the Top 1 to 10 features, obtained by 5-fold cross-validation with the logistic regression classifier, along with the confusion matrix for each case. A multivariate logistic regression-based nomogram for predicting early COVID-19 mortality was built using the five top-ranked biomarkers, which were found to be important both statistically and by the ML-based classifier (as shown in Tables 1 and 2 and Figure 3). The relationship between the linear prediction of death and these biomarkers was evaluated using multivariable logistic regression and is reported in Table 3, which lists the regression coefficient, standard error, z-value and statistical significance of each biomarker along with the 95% confidence interval. The z-value is the ratio of the regression coefficient to its standard error. Typically, the z-value distinguishes strong contributors from weak ones in logistic regression. The corresponding probability of death for a given LNLCA score was determined from the model. Figure 7 shows an example of the nomogram-based scoring system for a COVID-19 patient, using the variable values at admission.
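For reference, the z-value and the link between the linear prediction and the death probability can be written out under the standard logistic regression formulation. This is a sketch only: the symbols $\hat{\beta}_i$, $x_i$, $\mathrm{LP}$ and $p$ are notation introduced here for illustration, and the fitted coefficient values (Table 3) are not reproduced.

```latex
\begin{align*}
z_i &= \frac{\hat{\beta}_i}{\mathrm{SE}(\hat{\beta}_i)}
  && \text{(z-value of predictor } i\text{)} \\
\mathrm{LP} &= \hat{\beta}_0 + \sum_{i=1}^{5} \hat{\beta}_i \, x_i
  && \text{(linear prediction from the five biomarkers)} \\
p(\text{death}) &= \frac{1}{1 + e^{-\mathrm{LP}}}
  && \text{(logistic link)}
\end{align*}
```

In a nomogram of this kind, the points assigned to each predictor are proportional to its contribution $\hat{\beta}_i x_i$, so reading the total score off the score axis amounts to evaluating $\mathrm{LP}$ graphically before converting it to a probability.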
The individual score for each predictor was calculated, and the scores were added to produce the total score, which gave a death probability of 80%. This can be done as early as 9 days before the death of the patient. Furthermore, we categorized the patients from the training and testing sets into three subgroups (low-, moderate- and high-risk) by associating the actual outcome with the outcome predicted using the LNLCA score. For the training set (Table 4 ... and prioritize the moderate- and high-risk group patients. There were 52 patients in the test set who died after different durations of hospital stay. Some patients were hospitalized at very late stages, while others were admitted at early stages. The minimum, maximum,
262 (100.0%)
P-value among the three groups is less than 0.001. P-value of low-risk group vs moderate-risk group is less than 0.001. P-value of low-risk group vs high-risk group is less than 0.001. P-value of moderate-risk group vs high-risk group is less than 0.001.
113 (100.0%)
P-value among the three groups is less than 0.001. P-value of low-risk group vs moderate-risk group is 0.0037. P-value of low-risk group vs high-risk group is less than 0.001. P-value of moderate-risk group vs high-risk group is less than 0.001.
Age was identified as a key predictor of mortality in previous studies on the coronavirus family, including SARS [30], Middle East respiratory syndrome (MERS) [31] and COVID-19 [32]. This study reached a similar finding, because with older age, immunosenescence and/or multiple medical conditions tend to make patients more prone to critical COVID-19 illness [19]. Yan et al. [16] showed that in patients with severe pulmonary interstitial disease there is a significant increase in LDH, which can be associated with indications of lung injury or idiopathic pulmonary fibrosis [33].
This study found results consistent with the previous research: critically ill patients with COVID-19 had elevated levels of LDH, suggesting an increase in the activity and severity of lung injury. LDH is an intracellular enzyme that leaks from cells damaged by infection and viral replication, leading to elevated levels in circulation. Recently, Liu et al. [34] proposed that an increased neutrophil-to-lymphocyte ratio (NLR) can aid in the early prediction of the severity of COVID-19 illness. Both neutrophils and lymphocytes are critical components of the immune system and play a very important role in host defense and in clearing infections. Lymphopenia, a condition in which the number of lymphocytes in the blood is low, is a typical feature of COVID-19 patients and may be a key factor in disease severity and mortality [35]. In this study, we used the neutrophil and lymphocyte percentages and, similar to the previous studies, found that lower percentages of these two quantities were associated with severe COVID-19. According to previous research, patients with community-acquired pneumonia have significant immune system activation and/or immune dysfunction, leading to changes in these quantities [35]. In addition, in the event of immunosuppression and apoptosis of lymphocytes caused by specific anti-inflammatory cytokines, the bone marrow circulates neutrophils [36], resulting in an increased NLR. However, in contrast to other models, both parameters were observed to be low in high-risk patients in this study. Lu et al. [37] stated that CRP tested upon admission may assist in predicting short-term mortality in patients with confirmed or suspected COVID-19. CRP is an acute-phase protein produced by hepatocytes in response to leukocyte-derived cytokines induced by infection, inflammation or tissue damage [38-40].
Similar findings were obtained in this study, where increased CRP levels were measured at admission in COVID-19 patients at high mortality risk. This indicates that these patients had developed serious lung inflammation or possibly a secondary bacterial infection, and clinical antibiotic treatment might be appropriate for them [21]. Non-survivors in our study had lower lymphocyte and neutrophil percentages, and higher age, hs-CRP and LDH, than survivors. In addition to the dysregulation of the coagulation system and/or the immune system, COVID-19 severity was significantly linked to the inflammatory response to the infection. This could lead to worse medical consequences such as ARDS, septic shock and coagulopathy. Therefore, this kind of prognostic model will aid in the development of a rational and personalized therapeutic plan for patients with critical illness. Weng et al. [19] recently suggested that age, NLR, D-dimer and CRP were individual key predictors correlated with death probability. These key predictors were used to create a nomogram for predicting death due to COVID-19. In our research, the five key predictors recorded at admission were chosen by XGBoost feature selection to create a nomogram-based prognostic model that exhibits excellent calibration and discrimination in predicting the death probability of COVID-19 patients. It was also validated on an unseen validation cohort. Moreover, it was verified with data from multiple blood samples collected from the patients during their hospital stay, and the model holds for those cases as well. The AUC values of the proposed nomogram for the development and validation cohorts, 0.961 and 0.991 respectively, show strong discrimination and, to the best of our knowledge, outperform those of any other nomogram-based model for COVID-19 mortality prediction.
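As a reminder of what the reported AUC values measure, the following minimal Python sketch computes the AUC as the probability that a randomly chosen non-survivor receives a higher predicted risk than a randomly chosen survivor (ties counted as one half). The function name `auc_score` and the toy inputs are illustrative assumptions; they are not the study's code or patient data.

```python
from typing import Sequence


def auc_score(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Rank-based AUC (hypothetical helper, not from the paper).

    outcomes: 1 for death, 0 for survival.
    risks: predicted death probability (or any score where higher means riskier).
    """
    if len(outcomes) != len(risks):
        raise ValueError("outcomes and risks must have the same length")
    deaths = [r for y, r in zip(outcomes, risks) if y == 1]
    survivors = [r for y, r in zip(outcomes, risks) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one patient with each outcome")

    # Count the (death, survivor) pairs that are ranked correctly.
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5  # a tie counts as half a correct ranking
    return concordant / (len(deaths) * len(survivors))


# Toy values for illustration only (not patient data).
toy_outcomes = [1, 1, 0, 0, 1, 0]
toy_risks = [0.90, 0.80, 0.30, 0.85, 0.60, 0.10]
toy_auc = auc_score(toy_outcomes, toy_risks)
```

An AUC of 1.0 means every non-survivor was ranked above every survivor, while 0.5 corresponds to chance-level ranking.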
In addition, the nomogram-derived LNLCA score offers a simple, easy-to-understand and interpretable early detection tool for stratifying high-risk COVID-19 patients at admission, thereby assisting their clinical management. COVID-19 patients were categorized into three risk groups with varying risk of death using the LNLCA score calculated at admission. Low-risk cases could be isolated and treated in an isolation center, while moderate-risk patients could be treated in an isolation ward of a specialized hospital. Patients in the high-risk group, on the other hand, could be kept under close monitoring and should be moved to critical care services or the ICU for urgent treatment if required. This study has scope for further improvement, which will be addressed in future work. Firstly, the study motivates research on COVID-19 clinical data for early mortality prediction, but the proposed machine learning method is purely data-driven, and its results may vary when starting from different datasets. The model can be further improved with the help of a larger dataset. Secondly, the modelling principle adopted here is to use a minimal number of features for accurate prediction in order to avoid overfitting; this can be revisited with several other models to identify other sets of best features on multi-center and multi-country data and so produce a generalized model. In summary, based on multiple risk factors (lactate dehydrogenase, neutrophils (%), lymphocytes (%), high-sensitivity C-reactive protein, and age), our nomogram can predict the prognosis of patients with COVID-19 with good discrimination and calibration. The model can predict the patient's outcome well ahead of the day of the primary clinical outcome with very high accuracy.
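The three-group stratification can be sketched in a few lines of Python, using the LNLCA cut-off values of 10.4 and 12.65 reported for this model. The names `lnlca_risk_group` and `CARE_SETTING` are hypothetical, and the handling of scores that fall exactly on a cut-off is an assumption, since the text does not specify it.

```python
from typing import Literal

RiskGroup = Literal["low", "moderate", "high"]

# LNLCA cut-off values reported for the score.
LOW_CUTOFF = 10.4
HIGH_CUTOFF = 12.65

# Care settings suggested in the discussion for each risk group.
CARE_SETTING = {
    "low": "isolation center",
    "moderate": "isolation ward of a specialized hospital",
    "high": "close monitoring, with transfer to critical care or ICU if required",
}


def lnlca_risk_group(score: float) -> RiskGroup:
    """Map an LNLCA score to a risk group (illustrative helper).

    Assumption: a score exactly equal to a cut-off is assigned to the
    moderate-risk group; the source does not state the boundary rule.
    """
    if score < LOW_CUTOFF:
        return "low"        # death probability below 5%
    if score <= HIGH_CUTOFF:
        return "moderate"   # death probability 5% to 50%
    return "high"           # death probability above 50%


# Example: look up the suggested care setting for a hypothetical score.
example_group = lnlca_risk_group(13.2)
example_setting = CARE_SETTING[example_group]
```

Keeping the cut-offs as named constants makes it straightforward to revise them if the score is recalibrated on a larger or multi-center dataset.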
Therefore, the application of LNLCA would help clinicians make an efficient and optimized patient stratification and management plan without overloading healthcare resources, and would also reduce deaths through an improved and planned response. The authors also plan to further improve the performance of the model with the help of a larger dataset containing multi-center and multi-country data.
The authors declare that they have no conflict of interest.
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet
Clinical characteristics of coronavirus disease 2019 in China
Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China
Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention
Coronavirus disease 2019 (COVID-19): situation report-107
Coronavirus disease 2019 (COVID-19) Situation Report -68
Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response
Projecting hospital utilization during the COVID-19 outbreaks in the United States
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship
Rapid Progression to Acute Respiratory Distress Syndrome: Review of Current Understanding of Critical Illness from Coronavirus Disease 2019 (COVID-19) Infection
Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis. International Journal of Infectious Diseases
Comorbidity and its impact on 1590 patients with Covid-19 in China: A Nationwide Analysis
A machine learning-based model for survival prediction in patients with severe COVID-19 infection
Severe outcomes among patients with coronavirus disease 2019 (COVID-19)-United States
Validation of the Kuwait Progression Indicator Score for predicting progression of severity in COVID19. medRxiv
ANDC: an early warning score to predict mortality risk for patients with Coronavirus Disease
Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19
An interpretable mortality prediction model for COVID-19 patients
Clinical characteristics of 82 death cases with COVID-19
Clinical decision support tool and rapid point-of-care platform for determining disease severity in patients with COVID-19
D-dimer levels on admission to predict in-hospital mortality in patients with Covid-19
MICE vs PPCA: Missing data imputation in healthcare
Xgboost: extreme gradient boosting. R package version 04-2
Ridge estimators in logistic regression
A general-purpose nomogram generator for predictive logistic regression models
Prognostication in severe acute respiratory syndrome: A retrospective time-course analysis of 1312 laboratory-confirmed patients in Hong Kong
Epidemiological, demographic, and clinical characteristics of 47 cases of Middle East respiratory syndrome coronavirus disease from Saudi Arabia: a descriptive study. The Lancet Infectious Diseases
Risk factors of fatal outcome in hospitalized subjects with coronavirus disease 2019 from a nationwide analysis in China
Staging of acute exacerbation in patients with idiopathic pulmonary fibrosis
Neutrophil-to-lymphocyte ratio predicts severe illness patients with 2019 novel coronavirus in the early stage
Lymphopenia in severe coronavirus disease-2019 (COVID-19): systematic review and meta-analysis
An increased alveolar CD4+ CD25+ Foxp3+ T-regulatory cell ratio in acute respiratory distress syndrome is associated with increased 30-day mortality
ACP risk grade: a simple mortality index for patients with confirmed or suspected severe acute respiratory syndrome coronavirus 2 disease (COVID-19) during the early stage of outbreak in Wuhan
Predictive factors for pneumonia development and progression to respiratory failure in MERS-CoV infected patients
Dynamic changes and diagnostic and prognostic significance of serum PCT, hs-CRP and s-100 protein in central nervous system infection. Experimental and Therapeutic Medicine
High sensitive C-reactive protein: a new marker for urinary tract infection, VUR and renal scar
Ethical approval: This article uses clinical data that were made publicly available by Yan et al. [21]. Therefore, the authors of this study were not involved with human participants or animals. However, the original retrospective study carried out by Yan et al. [21] was approved by the Tongji Hospital Ethics Committee.