key: cord-335061-wn8u7u9y
authors: Zheng, Yichao; Zhu, Yinheng; Ji, Mengqi; Wang, Rongpin; Liu, Xinfeng; Zhang, Mudan; Qin, Choo Hui; Fang, Lu; Ma, Shaohua
title: A Learning-based Model to Evaluate Hospitalization Priority in COVID-19 Pandemics
date: 2020-08-03
journal: Patterns (N Y)
DOI: 10.1016/j.patter.2020.100092
sha:
doc_id: 335061
cord_uid: wn8u7u9y

Summary
The emergence of novel coronavirus disease 2019 (COVID-19) is placing an increasing burden on healthcare systems. Although the majority of infected patients have non-severe symptoms and can be managed at home, some individuals develop severe disease and require hospital admission. It is therefore paramount to assess the severity of COVID-19 efficiently and to identify hospitalization priority with precision. To this end, a 4-variable assessment model, comprising lymphocyte, lactate dehydrogenase (LDH), C-reactive protein (CRP), and neutrophil, was established and validated using the XGBoost algorithm. The model was found effective in identifying severe COVID-19 cases on admission, with a sensitivity of 84.6%, a specificity of 84.6%, and an accuracy of 100% in predicting disease progression toward rapid deterioration. It also suggests that a computation-derived formula of clinical measures is practically applicable for healthcare administrators to distribute hospitalization resources to those most in need during epidemics and pandemics.

The novel coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, was first reported in December 2019 in China and rapidly spread across the world, affecting over 16 million people worldwide and killing more than half a million infected patients to date 1-3. Even worse, the global pandemic of COVID-19 is expected to continue growing, as no effective vaccines have been officially approved for prophylaxis of this disease 4.
Though the growth in detected infections has declined in East Asia and Europe, the number of infections in the U.S., South America, and Africa continues to grow 3. Moreover, concern about a second wave of the pandemic persists 5.

In a pandemic, a nation's healthcare system bears extraordinary burdens. However, the majority of patients infected with SARS-CoV-2 have non-severe disease progression, can be safely managed at home or in self-quarantine, and recover under limited and basic medical care 6. For infections with severe symptoms or progression toward rapid deterioration, immediate admission to hospital for close monitoring and intensive treatment has been proven effective in reducing complications and mortality 7. Therefore, identifying COVID-19 patients at high risk for severe illness and prioritizing them for immediate hospital admission is urgently needed, especially in nations and territories where healthcare systems are insufficient to manage all confirmed and suspected infections.

Some studies have been reported that predict deterioration and mortality during hospitalization 8-15, and even predict the probability of SARS-CoV-2 infection, enabling timely quarantine of individuals with a high probability of infection and preventing further spread 15-25. But none of these studies aims to provide a solution for rational triage of patients in places where medical resources are limited.

In light of this unmet need for efficient triage of COVID-19 cases, this study sought to develop and validate a learning-based model that evaluates patients' priority for admission to hospital care based on their presentation of, or susceptibility to, severe COVID-19. The model, provided with a simple user interface, can efficiently assess the severity of COVID-19 and predict disease progression with high accuracy.
Our study is expected to have a prolonged social impact under the current circumstances, once the simple and practical model is accepted to assist clinicians in quick and efficient triage of COVID-19 patients. This study was approved by the Guizhou Provincial People's Hospital Ethics Committee.

The patient cohorts enrolled in this study comprised 134 COVID-19 cases retrieved from the World Health Organization (WHO) COVID-19 database 26 (Figure S1, Table S1) and 467 COVID-19 cases recruited from a multi-center dataset in China. Among the 601 patients, 25.4% had developed severe disease on admission and 6.5% presented with non-severe disease on admission but progressed to severe disease after admission. The minimum, median, and maximum times from hospital admission to severe disease progression were less than 1 day, 5 days, and 12 days, respectively. The prevalence of underlying comorbidities was [...] diseases (6.7%). The median age was 48 years. Fever was the most common initial symptom (63.1%), followed by cough (50.1%), fatigue (21.3%), and dyspnea (9.3%). Table 1 shows the baseline laboratory results obtained on or soon after admission. All patients were laboratory-confirmed COVID-19 cases, and the severity of COVID-19 was stratified into severe and non-severe categories according to the criteria shown in Table 2.

The clinical variables of most patients were measured multiple times on different days during hospitalization to assess prognosis. As this study sought to identify hospitalization priority according to the prehospital assessment of severe COVID-19 risk, only clinical data obtained on admission were used to evaluate the importance of clinical variables in identifying severe or potentially severe cases.
Given the missing data across different clinical variables and patients, a threshold value alpha was set to remove missing data and minimize their impact on the analysis. It was found that as the threshold alpha increases, the number of available variables decreases while the number of available observations, i.e., the available COVID-19 cases, increases (Figure S2). A threshold alpha of 350 was selected to remove the missing data and to retain as many clinical variables and observations as possible. Hence, a total of [...] clinical variables between the severe and the non-severe groups. As shown in Table S2, a total of 12 clinical variables were significantly different between the two groups: age, fever, dyspnea, lymphocyte, neutrophil, C-reactive protein (CRP), lactic dehydrogenase (LDH), creatine kinase (CK), D-dimer, alanine aminotransferase (ALT), aspartate aminotransferase (AST), and albumin. These 12 clinical variables could be used to discriminate between severe and non-severe COVID-19 cases.

Extreme Gradient Boosting (XGBoost), a high-performance machine learning algorithm that builds a sequence of decision trees in which each new tree tries to minimize the net error of the prior trees, was used to generate the risk assessment model. [...] accuracy, F1 score, sensitivity, specificity, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The definition of these evaluation metrics [...] 89.2% in discriminating severe COVID-19 cases from their non-severe counterparts (Table 3). Moreover, it outperformed the other classifiers in the aforementioned evaluations, except for specificity (Table 3 and Figure 1). Our study was in agreement with a reported conclusion that the XGBoost algorithm had high discriminative performance 9, and thus could be used to assess hospitalization priority with precision.
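The threshold-based evaluation metrics named above (accuracy, F1 score, sensitivity, and specificity) can be illustrated with a minimal sketch. The function name `classification_metrics`, the label encoding (1 = severe, 0 = non-severe), and the toy labels are assumptions made for illustration and are not taken from the study. AUC is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and F1 for binary labels.

    Assumes 1 marks a severe case and 0 a non-severe case (illustrative encoding).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # severe, flagged severe
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-severe, flagged non-severe
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-severe, flagged severe
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # severe, missed

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on severe cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-severe cases
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Toy labels, invented for illustration only (not study data).
example = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 1],
    y_pred=[1, 1, 0, 0, 0, 1, 0, 1],
)
```

In a triage setting, sensitivity is the share of truly severe cases that the model flags, so a missed severe case lowers sensitivity while an unnecessary admission lowers specificity.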
Next, the set of variables was minimized to ease clinical use. For this purpose, a sequential variable selection approach was used to find the optimal variable set based on its assessment performance. Briefly, the important variables ranked by XGBoost (Figure 2) were added one at a time, in ranked order, to investigate their incremental effects on the AUC score under cross-validation. The AUC score ceased to grow when the number of included variables reached 4 (Figure 3). Thus, the previous 12-variable models were reduced to the selected 4-variable models, in which the XGBoost classifier achieved an accuracy of 84.6% in identifying severe COVID-19 cases. Table 4 compares the performance of the various classifiers in the 4-variable model. The AUC score of XGBoost was slightly lower than those of the other models, but XGBoost achieved the highest F1 score and accuracy among the classifiers (Table 4, Figure 4). An accuracy above 80% indicated that the 4-variable XGBoost model could play a crucial role in distinguishing the majority of cases that require immediate medical attention. Overall, the 4-variable XGBoost model was judged the most competitive and easiest to use in comparison with other prevalent choices. [...] was effective in predicting the risk of deterioration for patients who presented with non-severe symptoms on admission. For this purpose, a total of 39 patients who had non-severe COVID-19 on admission but experienced deterioration during hospitalization were enrolled as an external testing set (Figure S3). The 4-variable XGBoost model achieved 100% accuracy in predicting the risk of rapid deterioration (Table 5).
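The sequential variable selection described earlier in this passage can be sketched as a simple loop: variables are added in importance order and the process stops once the score no longer improves. This is only an illustration under stated assumptions. The function `sequential_selection`, the `min_gain` parameter, the placeholder variable names, and the toy scores are all hypothetical; in practice `score_fn` would return a cross-validated AUC for a model trained on the candidate variables, and the study inspected the full AUC curve rather than stopping early.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_selection(
    ranked_vars: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], List[float]]:
    """Add variables in ranked order until the score stops improving.

    ranked_vars: variable names sorted from most to least important.
    score_fn:    returns a performance score (e.g. cross-validated AUC)
                 for a given list of variables; supplied by the caller.
    min_gain:    smallest improvement that still counts as growth.
    Returns the selected variables and the scores evaluated along the way.
    """
    selected: List[str] = []
    history: List[float] = []
    best = float("-inf")
    for var in ranked_vars:
        candidate = selected + [var]
        score = score_fn(candidate)
        history.append(score)
        if score - best <= min_gain:
            break  # the score ceased to grow, so keep the previous set
        selected, best = candidate, score
    return selected, history


# Hypothetical scores keyed by number of variables (invented, not study results).
toy_scores = {1: 0.80, 2: 0.85, 3: 0.88, 4: 0.90, 5: 0.90, 6: 0.90}
ranked = ["var_a", "var_b", "var_c", "var_d", "var_e", "var_f"]
chosen, scores = sequential_selection(ranked, lambda vs: toy_scores[len(vs)])
```

With these toy scores the loop keeps the first four variables and stops when the fifth adds nothing, which mirrors the plateau at four variables reported in the text.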
For the 17 patients with a complete time course of exacerbation, the minimum, median, and maximum prediction horizons were less than 1 day, 5 days, and 12 days, respectively, suggesting that our model could predict the risk of disease deterioration as long as 12 days before its occurrence.

To test whether a clinically operable single-tree XGBoost classifier based on the lymphocyte count, CRP level, and LDH level, as reported by Li et al. 9, was able to accurately identify the risk of severe disease on admission, we applied the single-tree XGBoost to the identification of severe COVID-19 cases as well as to the prediction of the risk of in-hospital deterioration from non-severe to severe disease. Table S4 shows that the single-tree XGBoost had 80% accuracy in identifying severe COVID-19 cases on admission, but only 38.5% accuracy in predicting the risk of in-hospital deterioration. This suggested that a model established for other purposes or reported in other works does not fulfill our goal in this study, that is, identifying hospitalization priority for COVID-19 [...]

Collectively, the 4-variable XGBoost model is the first computational model established to assess hospitalization priority, enabling rational triage of infected patients and prioritizing hospitalization for those most in need.

SHapley Additive exPlanations (SHAP), a game-theoretic approach that interprets the impact of each input variable on the model output, was used for model interpretation. In Figure S4, each dot corresponds to an individual case in the study. Different colors encode different values of the input variables, while the SHAP value represents the impact of each variable on the prediction outcome. As shown, the risk of severe COVID-19 was associated with a decrease in the lymphocyte count and an increase in the LDH level, the CRP level, and the neutrophil count.
Next, the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, a technique for dimensionality reduction, was used to project the four-dimensional data (lymphocyte, LDH, CRP, neutrophil) into a 3-dimensional (3D) feature space for visualization 27. This made it possible to visualize the differences in features among the three groups of patients: the severe cases, the non-severe cases, and the progressed severe cases. The progressed severe cases were patients with non-severe disease on admission who developed severe disease afterward. There was a clear separation between the non-severe cases (distributed in the core) and the severe cases (distributed in the periphery), whereas the progressed severe cases were distributed between the core and the periphery (Figure S5, Video S1). [...] suggests that the previously reported prediction models are not suitable for identifying the hospitalization priority of severe or potentially severe COVID-19 cases (Table S4).

More recently, a nomogram developed by Gong et al. was shown to assist the early identification of severe COVID-19 cases with a sensitivity of 77.5% and a specificity of 78.4% 10; however, that study was completed prematurely, before all participants had fully experienced the outcome event. Therefore, our study becomes the first to focus on the [...] triage patients, achieving a relatively high sensitivity and specificity without its performance being negatively affected by the inclusion of participants still in treatment.

In this study, the clinical features of COVID-19 were screened to identify a total of 12 critical variables that were found to be associated with the risk of severe COVID-19 [...] severe cases that require immediate medical attention (Table 4). Importantly, it precisely predicts the risk of progression toward rapid deterioration as long as 12 days ahead of its occurrence (Table 5).
Our study is in line with previous studies that showed the increased inflammatory [...] Figure S2). In this way, the impact of missing values on data analysis can be reduced. Third, there was no significant difference in the prevalence of comorbidities between the severe and the non-severe COVID-19 cases in our datasets (Table S2) [...] Shaohua Ma; ma.shaohua@sz.tsinghua.edu.cn.

No new unique reagents were generated in the present study.

The clinical data used in this study were obtained from the WHO COVID-19 database [...] (Table 2). Specifically, a severe case of COVID-19 was defined by the presence of any of the following conditions at rest: a respiratory rate of ≥ 30 breaths/minute, an oxygenation index of ≤ 300 mmHg, or an SpO2 of ≤ 93%. Moreover, patients who developed shock, multiple organ failure (MOF) requiring admission to the Intensive Care Unit (ICU), or respiratory failure warranting mechanical ventilation were stratified into the severe category in the present study. Finally, patients with pulmonary lesions that showed rapid progression of over 50% within 24-48 hours were considered to have severe disease.

As there were various missing data in the datasets, a threshold alpha was set to remove these missing data (Figure S2). First, the clinical variables whose missing data exceeded the threshold alpha were removed (bottom figure). Second, the observations (i.e., COVID-19 cases) with missing data for any of the remaining clinical variables were removed. An optimal threshold alpha should be selected to retain as many clinical variables and observations as possible.

Next, COVID-19 cases with no missing variable values were grouped into severe and non-severe categories according to disease severity.
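The two-step missing-data filter described above can be sketched in a few lines. The text does not define alpha precisely, so this sketch makes an explicit assumption: alpha is read as the minimum number of non-missing values a variable must have to be kept, which is one reading consistent with the reported trend that a larger alpha leaves fewer variables but more complete cases. The function name `filter_by_threshold`, the use of `None` for missing values, and the toy records are hypothetical.

```python
def filter_by_threshold(rows, alpha):
    """Two-step missing-data filter (illustrative reading of the alpha threshold).

    rows:  list of dicts sharing the same keys; None marks a missing value.
    alpha: assumed here to be the minimum count of non-missing values
           a variable needs in order to be retained.
    Returns (kept_variables, complete_cases).
    """
    if not rows:
        return [], []
    variables = list(rows[0].keys())

    # Step 1: drop variables that are observed in fewer than alpha cases.
    kept = [
        v for v in variables
        if sum(1 for r in rows if r[v] is not None) >= alpha
    ]

    # Step 2: drop cases with a missing value in any retained variable.
    complete = [
        {v: r[v] for v in kept}
        for r in rows
        if all(r[v] is not None for v in kept)
    ]
    return kept, complete


# Toy records, invented for illustration only (not study data).
toy_rows = [
    {"crp": 10.0, "ldh": 200.0, "ck": None},
    {"crp": 12.0, "ldh": None, "ck": None},
    {"crp": 8.0, "ldh": 250.0, "ck": 90.0},
    {"crp": None, "ldh": 300.0, "ck": None},
    {"crp": 30.0, "ldh": 280.0, "ck": None},
]
vars_low, cases_low = filter_by_threshold(toy_rows, alpha=1)
vars_high, cases_high = filter_by_threshold(toy_rows, alpha=2)
```

On the toy records, raising alpha from 1 to 2 drops the sparsely measured variable and in exchange retains more complete cases, which is the trade-off the threshold is meant to balance.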
The differences in clinical variables between the two groups were identified via univariate descriptive statistics (Table S2). A p-value of less than 0.05 was considered statistically significant and was used as the threshold to identify key clinical variables for model development. These key clinical variables were assembled to generate risk assessment models based on the XGBoost classifier as well as other classifiers, such as LDA, Logistic Regression, SVM, Random Forest, and Decision Tree. [...] set at a ratio of 7:3. The different models were trained on the training set and evaluated on the holdout testing set by comparing accuracy, F1 score, sensitivity, specificity, and the AUC of the ROC curve (Table S3).

Subsequently, the set of key clinical variables was minimized to generate simplified models. For this purpose, all key clinical variables were ranked according to the importance calculated by XGBoost (Figure 2), followed by the sequential variable selection approach (Figure 3). The aim was to minimize the variable set while optimizing model performance. The simplified models based on the minimized variable set were trained and evaluated by the method described above. To assess the effectiveness of the models in early prediction of severe progression, patients who presented with non-severe symptoms on admission but developed severe disease during hospitalization were enrolled as an external testing set. Model performance was measured by accuracy (Table 5). For comparison, the performance of a previously validated single-tree XGBoost model 9 in identifying severe COVID-19 risk was assessed on our datasets (Table S4). [...] variable models in discriminating the severe COVID-19 cases.
The 12 variables included age, fever, dyspnea, lymphocyte, neutrophil, C-reactive protein, lactic dehydrogenase, creatine kinase, D-dimer, alanine aminotransferase, aspartate aminotransferase, and albumin.

Highlights:
A model was developed to evaluate hospitalization priority in COVID-19 pandemics.
This model used easily accessible biomarkers to evaluate the risk of severe COVID-19.
The evaluation can be performed rapidly using an online program.
The performance of different algorithms in evaluating COVID-19 severity was explored.

eTOC blurb: The authors proposed a learning-based model to assist clinicians in quick and efficient triage of patients in places where medical resources are limited during the COVID-19 pandemic. This model used four easily accessible biomarkers to assess the severity of COVID-19 and was found effective in identifying the risk of severe COVID-19. It will enable healthcare administrators to distribute hospitalization resources to those most in need.

Bigger Picture: The COVID-19 pandemic is threatening millions of lives and stressing medical systems worldwide. Though the growth of infections in some areas has declined, the risk of a second wave of outbreak remains. A sustainable strategy to fight the pandemic using the currently limited but effective healthcare resources is therefore in high demand. Our study aims to find a solution that triages patients for hospitalization by identifying their severity progression. In this study, a model that used four easily accessible biomarkers to assess the risk of severe COVID-19 was successfully developed. The model is easy to use and removes the dependence on sophisticated equipment for making a decision. It was found effective in identifying the risk of severe COVID-19. It is therefore practically applicable for general practitioners to triage infected patients effectively and allocate inpatient care to those most in need.
Our study is expected to have a prolonged social impact under the current circumstances.

Association of radiologic findings with mortality of patients infected with 2019 novel coronavirus in Wuhan
Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
Real-time tracking of self-reported symptoms to predict potential COVID-19
Chest CT for Typical 2019-nCoV Pneumonia: Relationship to Negative RT-PCR Testing
A model to predict SARS-CoV-2 infection based on the first three-month surveillance data in Brazil. medRxiv
Deep learning-based model for detecting 2019 novel [...] study. medRxiv
Development and utilization of an intelligent application for aiding COVID-19 diagnosis. medRxiv, 2020
COVID-19 early warning score: a multi-parameter screening tool to identify highly suspected patients. medRxiv
An artificial intelligence-based first-line defence against COVID-19: digitally screening citizens for risks via a chatbot. bioRxiv
Development and Validation of a Diagnostic Nomogram to Predict COVID-19 Pneumonia. medRxiv
Rapid and accurate identification of COVID-19 infection through machine learning based on clinical available blood test results. medRxiv
Development and validation of chest CT-based imaging biomarkers for [...]
Visualizing data using t-SNE
Diagnosis and treatment of novel coronavirus pneumonia
Clinical Characteristics of Coronavirus Disease 2019 in China
COVID-19: treating and managing severe cases
Clinical management of severe acute respiratory infection when COVID-19 is suspected: Interim guidance V 1
Laboratory testing for coronavirus disease 2019 (COVID-19) in suspected human cases: interim guidance
Clinical Characteristics of 138 Hospitalized Patients [...]
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
COVID-19: towards understanding of pathogenesis

Table 1 (row labels only; values not recovered)
Symptoms: Cough; Expectoration; Hemoptysis; Dyspnea; Catarrh; Fatigue; Anorexia; Nausea/Emesis; Myalgia; Dizziness/Headache; Pharyngalgia; Abdominal pain/diarrhea
Laboratory findings, mean ± std: White blood cell count, 10^9/L; Lymphocyte count, 10^9/L; Neutrophil count, 10^9/L; Erythrocyte sedimentation rate, mm/h; C-reactive protein, mg/L; Procalcitonin, ng/ml; D-dimer, ug/ml; Alanine aminotransferase, U/L; Aspartate aminotransferase, U/L; Total bilirubin, umol/l; Albumin, g/L; Lactate dehydrogenase, U/L; Blood urea nitrogen

We thank the above-mentioned cooperating hospitals for kindly sharing the data with us, in accordance with the Declaration of Helsinki. The work was supported by the national [...]

The authors declare no competing interests.

Definitions
Non-severe COVID-19: Patients have non-specific symptoms such as fever, cough, fatigue, myalgia, and pharyngalgia, but have no signs of dehydration, sepsis, or shortness of breath. The radiological examination shows no signs of severe pneumonia.
Severe COVID-19: Adult cases meeting any of the following criteria: (1) Respiratory rate ≥ 30 breaths/min; (2) Oxygen saturation ≤ 93% at rest; (3) Oxygenation index (PaO2/FiO2) ≤ 300 mmHg; (4) Pulmonary lesion progression exceeding 50% within 24-48 hours; (5) Respiratory failure that requires mechanical ventilation; (6) Shock; (7) Organ failure that requires management in the Intensive Care Unit.
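The severity definition above is an any-of rule, which can be written as a short predicate. The sketch below is only an illustration of how the listed criteria combine. The class name `Assessment`, its field names, and the default values are assumptions introduced here, and criterion (3) is taken to be the oxygenation index (PaO2/FiO2) as described in the methods.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Admission measurements for one adult patient (hypothetical field names)."""
    respiratory_rate: float               # breaths per minute
    spo2_at_rest: float                   # oxygen saturation at rest, percent
    oxygenation_index: float              # PaO2/FiO2, mmHg
    lesion_progression_pct: float = 0.0   # pulmonary lesion progression in 24-48 h, percent
    respiratory_failure_ventilated: bool = False  # needs mechanical ventilation
    shock: bool = False
    organ_failure_icu: bool = False       # organ failure managed in the ICU


def is_severe(a: Assessment) -> bool:
    """Return True if any one of the listed severity criteria is met."""
    return any([
        a.respiratory_rate >= 30,          # criterion (1)
        a.spo2_at_rest <= 93,              # criterion (2)
        a.oxygenation_index <= 300,        # criterion (3)
        a.lesion_progression_pct > 50,     # criterion (4), "exceeding 50%"
        a.respiratory_failure_ventilated,  # criterion (5)
        a.shock,                           # criterion (6)
        a.organ_failure_icu,               # criterion (7)
    ])


# Invented example values, not patient data.
stable = Assessment(respiratory_rate=20, spo2_at_rest=98, oxygenation_index=400)
tachypneic = Assessment(respiratory_rate=30, spo2_at_rest=98, oxygenation_index=400)
```

Because the rule is a disjunction, a single criterion at its boundary value (for example a respiratory rate of exactly 30) is enough to place a case in the severe category.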