key: cord-0803201-s9j21zsy
authors: Yan, Li; Zhang, Hai-Tao; Goncalves, Jorge; Xiao, Yang; Wang, Maolin; Guo, Yuqi; Sun, Chuan; Tang, Xiuchuan; Jin, Liang; Zhang, Mingyang; Huang, Xiang; Xiao, Ying; Cao, Haosen; Chen, Yanyan; Ren, Tongxin; Wang, Fang; Xiao, Yaru; Huang, Sufang; Tan, Xi; Huang, Niannian; Jiao, Bo; Zhang, Yong; Luo, Ailin; Mombaerts, Laurent; Jin, Junyang; Cao, Zhiguo; Li, Shusheng; Xu, Hui; Yuan, Ye
title: A machine learning-based model for survival prediction in patients with severe COVID-19 infection
date: 2020-03-01
journal: nan
DOI: 10.1101/2020.02.27.20028027
sha: b38f3c83aaa3fa7c8f5c87af266793afcbd11c86
doc_id: 803201
cord_uid: s9j21zsy

The sudden increase in COVID-19 cases is putting high pressure on healthcare services worldwide. At the current stage, fast, accurate and early clinical assessment of disease severity is vital. To support decision making and logistical planning in healthcare systems, this study leverages a database of blood samples from 404 infected patients in the region of Wuhan, China to identify crucial predictive biomarkers of disease severity. For this purpose, machine learning tools selected three biomarkers that predict the survival of individual patients with more than 90% accuracy: lactic dehydrogenase (LDH), lymphocytes and high-sensitivity C-reactive protein (hs-CRP). In particular, relatively high levels of LDH alone seem to play a crucial role in distinguishing the vast majority of cases that require immediate medical attention. This finding is consistent with current medical knowledge that high LDH levels are associated with tissue breakdown occurring in various diseases, including pulmonary disorders such as pneumonia. Overall, this paper suggests a simple and operable formula to quickly identify patients at the highest risk, allowing them to be prioritised and potentially reducing the mortality rate.

The outbreak of the COVID-19 epidemic has caused worldwide health concern since December 2019. It has been shown in literature [ patients is astonishingly 61.5%. However, it is arduous to identify these patients manually among the infected population. Hence, identifying critically ill cases from clinical data with the assistance of machine learning has become an urgent yet challenging mission. Such a prognostic model could enable early treatment of critical patients, thus potentially reducing mortality.

For this retrospective, single-center study, we collected the electronic records of 2,779 validated or suspected COVID-19 patients from January 10th to February 18th, 2020 at Tongji Hospital in Wuhan, China. We extracted epidemiological, demographic, clinical, laboratory, medication, nursing record, and outcome data from the electronic medical records. The clinical outcomes were followed up to February 18th. The study was approved by the Tongji Hospital Ethics Committee.

As shown in Figure 1, of the 2,779 individuals in our hospital's records, 2,259 were excluded because they were still in treatment before February 19th, 2020. Of the remaining 520 cases, 375 (including 201 survivors) had complete data. Pregnant or breast-feeding women and patients younger than 18 years were excluded.

After February 19th, 2020, 26 further severe patients were cleared; these were used for testing together with 3 cleared severe patients from Ying Cheng People's Hospital. Note that all types of patients were included as samples for the study, whereas only severe patients were selected for testing.

We apply the following diagnostic criteria [4] : 1) Epidemiological history: Traveled or lived in Wuhan within 14 days before onset; Had contact with patients with fever and respiratory symptoms from Wuhan within 14 days before onset; Had contact with COVID-19 patients (positive for COVID-19 nucleic acid) within 14 days before onset; Or part of a familial cluster of onsets;

2) Clinical manifestations: Fever and/or respiratory symptoms; Normal or decreased total white blood cell count or decreased lymphocyte count during early stage of onset; Typical imaging features.

Subjects that meet any one epidemiological history or meet two clinical manifestations without epidemiological history are defined as suspected cases. Suspected cases with one of the following etiological evidence are defined as validated cases: 1) SARS-CoV-2 nucleic acid is positive in respiratory or blood samples detected by RT-PCR; 2) virus sequence detected in respiratory or blood samples shares high homology with the known sequence of SARS-CoV-2.

This medRxiv preprint (which was not peer-reviewed) is made available under a CC-BY-NC-ND 4.0 International license. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. https://doi.org/10.1101/2020.02.27.20028027

A case meeting any one of the following three conditions is defined as critical: 1) shock; 2) need for mechanical ventilation; 3) admission to the ICU because of multiple organ dysfunction syndrome (MODS). A severe case is defined as one with a respiratory rate (RR) ≥ 30 breaths per minute or SpO2 ≤ 93% at rest.
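The critical and severe case definitions above can be written as a small rule. The following is a minimal sketch, not part of the study's code: the field names are hypothetical, and checking the critical criteria before the severe criteria is an assumption, since the text does not state a precedence.

```python
from dataclasses import dataclass


@dataclass
class PatientStatus:
    # Hypothetical field names that mirror the criteria in the text.
    shock: bool = False
    mechanical_ventilation: bool = False
    icu_for_mods: bool = False
    respiratory_rate: float = 0.0  # breaths per minute, at rest
    spo2: float = 100.0            # oxygen saturation in percent, at rest


def classify_case(p: PatientStatus) -> str:
    """Return 'critical', 'severe' or 'other' following the stated criteria."""
    # Assumption: the critical criteria take precedence over the severe ones.
    if p.shock or p.mechanical_ventilation or p.icu_for_mods:
        return "critical"
    if p.respiratory_rate >= 30 or p.spo2 <= 93:
        return "severe"
    return "other"
```

For instance, `classify_case(PatientStatus(respiratory_rate=32))` falls in the severe group, while any patient in shock is classified as critical regardless of the other values.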

Data were entered in Excel 2016 and, after verification, double-checked and analysed with SPSS 26.0. Normally distributed continuous variables were described by mean ± standard deviation, and non-normally distributed continuous variables by median and quartiles.

First, the general data were tested for normality with the one-sample Kolmogorov-Smirnov test (K-S test for short), which examines whether a sample comes from a particular distribution. The test level was α = 0.05, and P < 0.05 indicates that the sample does not fit a normal distribution. Because age, total protein, albumin, and calcium content passed the normality test, mean ± standard deviation was used to describe their central tendency. As the other continuous variables were not normally distributed, the median was used to describe their central tendency.
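The choice between the two descriptive summaries can be sketched as follows. This is an illustration using only the Python standard library, not the SPSS procedure used in the study; the function name and return layout are hypothetical, and the normality decision is passed in as a flag (it would come from a K-S test run separately, for example with `scipy.stats.kstest`).

```python
import statistics


def describe(values, is_normal):
    """Summarise a continuous variable.

    `is_normal` is the outcome of a normality test run elsewhere
    (True when P >= 0.05 in the one-sample K-S test).
    """
    if is_normal:
        # Normally distributed: mean and sample standard deviation.
        return ("mean_sd", statistics.mean(values), statistics.stdev(values))
    # Otherwise: median with lower and upper quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return ("median_quartiles", median, (q1, q3))
```

Note that quartile conventions differ between packages; `statistics.quantiles` uses its default "exclusive" method here, which may not match SPSS output exactly.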

The performance of the model was first evaluated by its classification accuracy, that is, the proportion of test samples predicted correctly. The precision, sensitivity/recall and F1 score of each class are defined as below, in which n ∈ N represents (

Modeling and analysis with the machine learning algorithm were performed in Python.

In this study, a supervised XGBoost classifier [5] was chosen as the predictor, due to its superb pattern characterization and feature selection ability. As shown in Figure 2 , its step-by-step procedure is detailed as below.

Data Pre-processing: Patients' data were imported, all clinical measurements from each patient's last available date were used as features, and 'survival' and 'death' were set as the labels of the two classes. Missing clinical measurements were padded with "-1".

Model Training (Multi-tree XGBoost): The two-class data were randomly split into a training set and a validation set at a ratio of 7:3. Multi-tree XGBoost was trained with a maximum depth of 4, a learning rate of 0.2, 150 trees (estimators), and the regularization parameter α set to 1; 'subsample' and 'colsample_bytree' were both set to 0.9 to prevent overfitting when there are many features but the sample size is not large [5].

Figure 2 shows that increasing the number of top features to 4 gave no performance improvement. Therefore, the number of key features was set to 3, and Multi-tree XGBoost was retrained with a maximum depth of 4, a learning rate of 0.2, and the regularization parameter α set to 1. Some of the earlier parameters were dropped and the others kept unchanged: with only a few selected features, the parameters previously added to prevent overfitting are no longer needed.
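The pre-processing and multi-tree training settings described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names are hypothetical, missing values are assumed to be represented as `None`, the random seed is arbitrary, and the mapping of the quoted settings onto `xgboost.XGBClassifier` keyword arguments is an assumption.

```python
import random

# Hyperparameters quoted in the text. The key names follow the keyword
# arguments of xgboost.XGBClassifier (an assumed mapping).
MULTI_TREE_PARAMS = {
    "max_depth": 4,
    "learning_rate": 0.2,
    "n_estimators": 150,
    "reg_alpha": 1,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
}


def pad_missing(rows, fill=-1):
    """Replace missing measurements (None) with the padding value."""
    return [[fill if v is None else v for v in row] for row in rows]


def split_train_validation(rows, labels, train_fraction=0.7, seed=0):
    """Shuffle the sample indices and split them 7:3 by default."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(indices))
    train, valid = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train], [labels[i] for i in train],
        [rows[i] for i in valid], [labels[i] for i in valid],
    )


def build_multi_tree_model():
    """Create the classifier; requires the third-party xgboost package."""
    from xgboost import XGBClassifier  # imported lazily, optional dependency
    return XGBClassifier(**MULTI_TREE_PARAMS)
```

For the reduced three-feature model, the text implies the same call with `subsample`, `colsample_bytree` and the explicit tree count removed from the parameter dictionary.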

Explainable Model (Single-tree XGBoost): XGBoost was applied for the final prediction using only the three key features and a single tree (so that the model is explainable). Patients with incomplete measurements for any of these three features were removed, leaving 351 of the 375 patients. XGBoost was retrained with the number of trees (estimators) set to 1, the two regularization parameters α and β both set to 0, and the subsample and max features both set to 1.

Model Prediction: The trained model was used to predict the class of each sample in the test set. The predicted and ground-truth labels of the test set were used to calculate the standard metrics for evaluating prediction performance.
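As an illustration of these standard metrics, the sketch below computes accuracy and the per-class precision, recall and F1 score from predicted and ground-truth labels, using the usual textbook definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The function name is hypothetical and the example labels are made up; they are not the study's results.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for the class `positive`, plus accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up labels for illustration only.
truth = ["death", "death", "survival", "survival", "survival"]
predicted = ["death", "death", "death", "survival", "survival"]
death_metrics = class_metrics(truth, predicted, positive="death")
```

Calling the function once per class ('survival' and 'death') gives the per-class rows of a report; macro and weighted averages follow by averaging those rows without or with class-size weights.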

Data collection and analysis of cases and close contacts were approved by the Ethical Committee of Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology.

The mean age of the 375 patients was 58.83 ± 16.46 years, and 58.7% were male. Fever was the most common initial symptom (49.9%), followed by cough (13.9%), fatigue (3.7%), and dyspnea (2.1%). The epidemiological history included Wuhan residents (37.9%), familial cluster (6.4%), and health workers (only 1.9%). Of the 375 patients, 46.1% were critical patients. 

Using the procedure specified in the Supplementary Information, we first discovered that three key features (LDH, lymphocyte (%), and hs-CRP) are needed to separate patients into the two classes (a visualization of the samples is shown in Supplementary Figure 6). More importantly, the retrained Single-tree XGBoost algorithm outputs a clinical route (a decision tree, in machine learning terms), as shown in Figure 3. It can simply be used to classify all severe patients.
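A single-tree rule of this kind can be written as a few nested comparisons. The sketch below uses the early warning thresholds reported in the summary of this paper (LDH 365 U/L, hs-CRP 41.2 mg/L, lymphocytes 14.7%). The order of the splits and the direction of each comparison are assumptions inferred from the text (high LDH, high hs-CRP and a low lymphocyte percentage indicating higher risk); Figure 3 holds the actual tree, and the function name is hypothetical.

```python
# Thresholds reported in the paper's summary.
LDH_THRESHOLD = 365.0        # U/L
HS_CRP_THRESHOLD = 41.2      # mg/L
LYMPHOCYTE_THRESHOLD = 14.7  # percent


def predict_outcome(ldh, hs_crp, lymphocyte_pct):
    """Illustrative three-threshold rule.

    ASSUMPTION: the split order (LDH, then hs-CRP, then lymphocytes) and
    the comparison directions are inferred from the text, not copied from
    Figure 3.
    """
    if ldh >= LDH_THRESHOLD:
        return "death"
    if hs_crp < HS_CRP_THRESHOLD:
        return "survival"
    if lymphocyte_pct > LYMPHOCYTE_THRESHOLD:
        return "survival"
    return "death"
```

Under these assumptions, a patient with LDH of 200 U/L, hs-CRP of 80 mg/L and a lymphocyte percentage of 10% would be flagged as high risk, whereas the same patient with a lymphocyte percentage of 20% would not.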

To validate the results, we blindly tested the decision rule in Figure 3 on 29 patients whose outcomes were confirmed after February 19th. The confusion matrix of the testing data is shown in Supplementary Figure 2; 100% death prediction accuracy and 90% survival prediction accuracy were achieved. The precision, recall, F1-score and corresponding support on the testing data are given in Table 2. The scores for survival and death prediction, the accuracy, and the macro and weighted averages over all the samples are consistently larger than 0.90. It is worth noting that Multi-tree XGBoost and Single-tree XGBoost return the same predictions. The labels of some patients were predicted incorrectly. Yet, the prognosis of these patients is not optimistic. One of the patients had been admitted to the ICU in a critical condition and recovered after emergency rescue. The other patient was in the cerebrovascular sequelae period and in an extremely weak condition. Although this patient is currently alive, the prognosis is extremely poor. 

Coronavirus is prevalent in China and all over the world, with high morbidity, and with high mortality in critically ill patients. According to recent reports [2 6], older patients are more prone to COVID-19 infection, especially those with underlying diseases.

The severity of patients' conditions is placing great pressure on already scarce intensive care resources.

Unfortunately, the specific clinical features of COVID-19 pneumonia at different critical stages remain unclear. Under these circumstances, novel data-driven approaches that help clinicians identify high-risk patients as early as possible, improve the prognosis of patients and reduce the mortality of critically ill patients are in high demand and of clinical significance. In this study, we used XGBoost

The significance of our work is three-fold. First, instead of merely providing high-risk factors as earlier published articles did, the present study provides a general operable formula to precisely and quickly quantify the risk of death, representing significant progress for clinical practice. For example, for patients with SpO2 below 93%, respiratory support therapy includes intranasal catheterization of oxygen, oxygen supply through a mask, high-flow oxygen supply through a nasal catheter, non-invasive ventilation support, invasive ventilation support, and ECMO. However, the routine sequential use of these oxygen therapies usually leads to unsatisfactory therapeutic effects in severe patients. Significantly, our predictive model is likely to identify high-risk patients before irreversible lesions occur. By using appropriate respiratory support therapy as soon as possible, we may be able to improve the prognosis. Second, the three key features can be conveniently collected by any hospital, which helps divert the large streams of patients crowding top-tier hospitals. As a result, our model can alleviate the pressure caused by the shortage of medical resources and facilitate the formation of a hierarchical medical care system for COVID-19. Third, the millisecond-level prediction speed of the present model could improve the efficiency of frontline doctors in terms of classifying severity and predicting a fatal course, thereby greatly relieving the heavy workload of doctors.

The most common fatal complication of COVID-19 is acute respiratory distress syndrome (ARDS). Although the pathological features of COVID-19 are very similar to those of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) [7], it is known from the latest systematic autopsy report that pulmonary fibrosis and consolidation in COVID-19 patients are not as serious as those caused by SARS, but the exudative reaction is more severe than that of SARS [8].

Histological examination of COVID-19 showed bilateral diffuse alveolar damage with cellular fibro-myxoid exudates, evident desquamation of pneumocytes and hyaline membrane formation [7], and then interstitial fibrosis.

The increase of LDH reflects tissue/cell destruction and is regarded as a common sign of tissue/cell damage. Serum LDH has been identified as an important biomarker for the activity and severity of idiopathic pulmonary fibrosis (IPF) [9]. In patients with severe pulmonary interstitial disease, the increase of LDH is significant and is one of the most important prognostic markers of lung injury [9]. For critically ill patients with COVID-19, a rise in LDH level indicates an increase in the activity and extent of lung injury.

Our analysis showed that higher serum hs-CRP could be used to predict the risk of death in severe COVID-19 patients. The increase of hs-CRP, an important marker for poor prognosis in ARDS [10, 11] , reflects the persistent state of inflammation [12] .

The result of this persistent inflammatory response is large gray-white lesions in the lungs of patients with COVID-19 (as seen at autopsy) [8]. In the tissue sections, a large amount of sticky secretion was also seen overflowing from the alveoli [8].

Our results also suggest that lymphocytes play a vital role in forecasting progression from mild to critical illness and may serve as a potential therapeutic target. The hypothesis is supported by the results of clinical studies [2 6 ]. Moreover, lymphopenia is a common feature in patients with COVID-19 and might be a critical factor associated with disease severity and mortality. Injured alveolar epithelial cells could induce the infiltration of lymphocytes, leading to persistent lymphopenia as with SARS-CoV and MERS-CoV, given that these viruses share a similar pathway of alveolar penetration and impairment of antigen-presenting cells (APC) [14, 15]. A biopsy study has provided strong evidence that the counts of peripheral CD4 and CD8 T cells were substantially reduced, while their status was hyperactivated. Also, Jing and colleagues reported that the lymphopenia is mainly related to the decrease of CD4+ and CD8+ T cells [16]. Thus, it is likely that lymphocytes play distinct roles in COVID-19, which deserves further investigation.

Nevertheless, this study has several notable limitations. First, since the proposed machine learning method is purely data driven, the model may vary given different training and validation datasets. Given the limited number of samples in this study, we struck a balance between model complexity and performance; the whole procedure should be repeated when more data become available. Secondly, this is a single-center, retrospective study, which provides only a preliminary assessment of the clinical course and outcome of severe patients.

Although this database covers more than 3,000 patients, most clinical outcomes have not yet been released. As we have a pool of more than 300 clinical measurements, our modeling principle is a trade-off between a minimal number of features and good predictive capacity. If a larger number of features were selected, the model might perform better. In this regard, we look forward to subsequent large-sample, multicenter studies.

In summary, we have identified three indicators (LDH, hs-CRP, and lymphocytes), together with early warning thresholds (LDH: 365 U/L, hs-CRP: 41.2 mg/L, and lymphocytes: 14.7%), for COVID-19 prognostic prediction, and developed an XGBoost machine learning-based prognostic model that can predict the survival of severe patients with more than 90% accuracy, enabling early detection, early intervention and a reduction of mortality in high-risk patients with COVID-19. From a technical point of view, this work helps pave the way for using machine learning in COVID-19 prediction and diagnosis when triaging large-scale, explosive epidemic COVID-19 cases. Further studies are needed to consider more clinical confounding factors and to increase the sample size in order to refine our model.

Algorithm 1 Feature selection
2) F_selected_last ← F_selected
3) Add element F_All[i] to F_selected
4) X_train_selected is the matrix formed by the corresponding columns of F_selected in X_train.
5) X_validation_selected is the matrix formed by the corresponding columns of F_selected in X_validation.

XGBoost is trained with a maximum depth of 4, a learning rate of 0.2, and the regularization parameter α set to 1.
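The surviving steps of Algorithm 1 suggest a greedy forward selection over features ranked by importance. The sketch below is one possible reading, not the authors' algorithm: the first and last steps are missing from the text, so the stopping rule (stop as soon as adding the next feature no longer improves the validation score) is an assumption based on the remark that a fourth feature gave no improvement. `score_fn` is a hypothetical callable that would train XGBoost on the selected columns of the training set and return the validation score.

```python
def select_features(ranked_features, score_fn, tolerance=0.0):
    """Greedy forward selection over features ranked by importance.

    `score_fn(features)` is a hypothetical callable: it should train a
    model on the given training columns and return its validation score.
    """
    selected = []
    best_score = float("-inf")
    for feature in ranked_features:
        candidate = selected + [feature]  # step 3: add the next feature
        score = score_fn(candidate)       # steps 4-5: train and validate
        if score <= best_score + tolerance:
            break  # no improvement: keep the previous selection (step 2)
        selected, best_score = candidate, score
    return selected, best_score
```

Passing the scoring step in as a function keeps the loop independent of any particular model, so the same sketch works with a multi-tree or single-tree classifier.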

Step 3. The results on the Multi-tree XGBoost with Top3 features selected in Step 2 (375 samples).

Supplementary Figure 3 : Confusion matrix for the testing dataset using multi-tree XGBoost algorithm.


Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. The Lancet

Diagnosis and treatment of pneumonia infected by the new novel coronavirus (the trial fifth edition). National Health Commission of the people's Republic of China, The medical letter from the National Health Office

Xgboost: A scalable tree boosting system

Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan

Pathological findings of covid-19 associated with acute respiratory distress syndrome. The Lancet. Respiratory medicine

A general report on the systematic anatomy of COVID-19

Staging of acute exacerbation in patients with idiopathic pulmonary fibrosis

Rosuvastatin to prevent vascular events in men and women with elevated C-reactive protein

Aetiology, outcomes & predictors of mortality in acute respiratory distress syndrome from a tertiary care centre in north India

Plasma C-reactive protein levels are associated with improved outcome in ARDS

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

Structure of SARS coronavirus spike receptor-binding domain complexed with receptor

Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor

Xin Zheng. Longitudinal characteristics of lymphocyte responses and cytokine profiles in the peripheral blood of SARS-CoV-2 infected patients

In the Supplementary Information, we illustrate the data analysis using the step-by-step procedure below:

Step 1. Obtain the Top10 features using 375 samples with all features:

Supplementary Figure 1 : Top ten key clinical features, ranked according to their importance in the XGBoost algorithm.

XGBoost Trees with 375 samples (all features):

XGBoost is trained with a maximum depth of 4, a learning rate of 0.2, 150 trees (estimators), and the regularization parameter α set to 1; 'subsample' and 'colsample_bytree' are both set to 0.9 to prevent overfitting when there are many features and the sample size is not large.

Step 2. Feature selection