title: Clinical prediction system of complications among COVID-19 patients: a development and validation retrospective multicentre study
authors: Ghosheh, Ghadeer O.; Alamad, Bana; Yang, Kai-Wen; Syed, Faisil; Hayat, Nasir; Iqbal, Imran; Kindi, Fatima Al; Junaibi, Sara Al; Safi, Maha Al; Ali, Raghib; Zaher, Walid; Harbi, Mariam Al; Shamout, Farah E.
date: 2020-11-28

Existing prognostic tools mainly focus on predicting the risk of mortality among patients with coronavirus disease 2019 (COVID-19). However, clinical evidence suggests that COVID-19 can result in non-mortal complications that affect patient prognosis. To support patient risk stratification, we aimed to develop a prognostic system that predicts complications common to COVID-19. In this retrospective study, we used data collected from 3,352 COVID-19 patient encounters admitted to 18 facilities between April 1 and April 30, 2020, in Abu Dhabi (AD), UAE. To assess our proposed system's generalizability, the hospitals were split based on geographical proximity into the AD Middle region and the AD Western & Eastern regions, denoted A and B, respectively. Using data collected during the first 24 hours of admission, the machine learning-based prognostic system predicts the risk of developing any of seven complications during the hospital stay. The complications include secondary bacterial infection, AKI, ARDS, and elevated biomarkers linked to increased patient severity, including d-dimer, interleukin-6, aminotransferases, and troponin. During training, the system applies exclusion criteria, hyperparameter tuning, and model selection for each complication-specific model. The system achieves good accuracy across all complications and both regions. In test set A (587 patient encounters), the system achieves 0.91 AUROC for AKI and >0.80 AUROC for most of the other complications.
In test set B (225 patient encounters), the system achieves 0.90 AUROC for AKI, elevated troponin, and elevated interleukin-6, and >0.80 AUROC for most of the other complications. The best performing models, as selected by our system, were mainly gradient boosting models and logistic regression. Our results show that a data-driven approach using machine learning can predict the risk of such complications with high accuracy. The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has led to a global health emergency since the emergence of the coronavirus disease 2019 (COVID-19). Despite containment efforts, more than 55 million confirmed cases have been reported globally, of which 157,785 cases are in the United Arab Emirates (UAE) as of November 21, 2020 [1]. Due to unexpected burdens on healthcare systems, identifying high-risk groups using prognostic models has become vital to support patient triage. Most of the recently published prognostic models focus on predicting mortality, the need for intubation, or admission into the intensive care unit [2]. While the prediction of such adverse events is important for patient triage, clinical evidence suggests that COVID-19 may result in a variety of complications in organ systems that may eventually lead to mortality [3]. For example, Acute Respiratory Distress Syndrome (ARDS)-related pneumonia has been reported as a major complication of COVID-19 [4]. Other studies reported alarming proportions of hospitalized COVID-19 patients who developed hematological complications [5], organ dysfunction [6], or secondary bacterial infection [4]. Table 1 summarizes key studies that reported diagnosed complications or biomarkers which may lead to severe complications across different COVID-19 patient populations. These findings suggest a pressing need for the development and validation of a prognostic system that predicts such complications in COVID-19 patients to support patient management.
Here, we address this need by proposing an automated prognostic system that learns to predict a variety of non-mortal complications among COVID-19 patients admitted to the Abu Dhabi Health Services (SEHA) facilities, UAE. The system uses multi-variable data collected during the first 24 hours of the patient admission, including vital-sign measurements, laboratory-test results, and baseline information. We particularly focus on seven complications based on the clinical evidence presented in Table 1, which are either based on clinical diagnosis or on biomarkers that are indicative of patient severity. To allow for reproducibility and external validation, we made our code and a test set publicly available at: https://github.com/nyuad-cai/COVID19Complications.

Figure 1: Flow diagram for the overall dataset showing how the inclusion and exclusion criteria were applied to obtain the final training and test sets, where n represents the number of patient encounters, and p represents the number of unique patients.

We reported this study following the TRIPOD guidance [16]. There were 9 facilities in the Middle region, which includes the capital city, and 9 facilities in the Eastern and Western regions. Those regions are highlighted in Figure 1(a). Based on clinical evidence and in collaboration with clinical experts, we focused on predicting seven complications: three clinically diagnosed events, namely secondary bacterial infection (SBI), Acute Kidney Injury (AKI) [17], and ARDS [18], and four biomarkers that may be indicative of patient severity.
In particular, among COVID-19 patients, elevated troponin reflects myocardial injury and has been reported to be associated with a higher risk of mortality [7], elevated d-dimer is associated with thrombotic events [9], elevated interleukin-6 is a proinflammatory cytokine that has been shown to be associated with disease severity and in-hospital mortality [12], and elevated aminotransferases have been reported to be associated with liver injury [11].

Table 2: Criteria used to define the occurrence of the complications that our system aims to predict.
Elevated troponin: Troponin T ≥ 14 ng/L [19]
Elevated d-dimer: D-dimer ≥ 500 ng/mL [20]
Elevated aminotransferases: AST ≥ 40 U/l AND ALT ≥ 40 U/l *

For each patient encounter in the training and test sets, we identified the first occurrence (i.e., date and time), if any, of each complication based on the criteria shown in Table 2. The biomarker-based complications are defined based on elevated laboratory-test results, SBI is defined based on positive cultures, AKI is defined based on the KDIGO classification criteria [17], and ARDS is defined based on the Berlin definition [18], which required the processing of free-text chest radiology reports. Further details on the processing of those reports are described in Supplementary Section A. We considered data recorded within the first 24 hours of admission as input features to our predictive models. These data included continuous and categorical features related to the patient baseline information and demographics, vital signs, and laboratory-test results. Within the patient's baseline and demographic information, age and body mass index (BMI) were treated as continuous features, while sex, pre-existing medical conditions (i.e., hypertension, diabetes, chronic kidney disease, and cancer), and symptoms recorded at admission (i.e., cough, fever, shortness of breath, sore throat, and rash) were treated as binary features.
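To make the criteria in Table 2 concrete, the biomarker thresholds can be sketched as a small labeling routine. This is an illustrative sketch, not the authors' released code; the record format (hours from admission, test name, value) and the function names are assumptions.

```python
# Illustrative sketch of the Table 2 biomarker criteria; the record format
# (hours_from_admission, test_name, value) is an assumption, not the study's schema.
THRESHOLDS = {
    "elevated_troponin": ("troponin_t", 14.0),   # Troponin T >= 14 ng/L
    "elevated_d_dimer": ("d_dimer", 500.0),      # D-dimer >= 500 ng/mL
}

def first_occurrence(lab_records, complication):
    """Hour of the first record meeting the threshold criterion, or None."""
    test_name, cutoff = THRESHOLDS[complication]
    hits = [t for t, name, value in lab_records
            if name == test_name and value >= cutoff]
    return min(hits) if hits else None

def elevated_aminotransferases(lab_records):
    """AST >= 40 U/l AND ALT >= 40 U/l: first hour at which both are elevated."""
    ast = [t for t, n, v in lab_records if n == "ast" and v >= 40.0]
    alt = [t for t, n, v in lab_records if n == "alt" and v >= 40.0]
    return max(min(ast), min(alt)) if ast and alt else None
```

In the study, the first such occurrence per encounter provides both the label and its timestamp, which later determines whether a patient is excluded from a complication-specific model.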
As for the vital-sign measurements and laboratory-test results, we excluded any variable that was used to define the presence of any complication in order to avoid label leakage. In particular, we considered seven continuous vital-sign features, including systolic blood pressure, diastolic blood pressure, respiratory rate, peripheral pulse rate, oxygen saturation, axillary temperature, and the Glasgow Coma Scale score, and 19 laboratory-test results, including albumin, activated partial thromboplastin time (APTT), bilirubin, calcium, chloride, c-reactive protein, ferritin, hematocrit, hemoglobin, international normalized ratio (INR), lactate dehydrogenase (LDH), lymphocytes count, prothrombin time, procalcitonin, sodium, red blood cell count (RBC), urea, uric acid, and neutrophils count. All vital-sign measurements and laboratory-test results were processed into minimum, maximum, and mean statistics. We also defined seven binary input features to represent whether a complication had occurred within the first 24 hours of admission, to allow the models to learn from any dependencies between the complications. The proposed system predicts the risk of developing each of the complications during the patient's stay after 24 hours of admission. This is represented by a vector y consisting of seven risk scores, where each risk score is computed by a complication-specific model, such that y = [y_El.troponin, y_El.d-dimer, y_El.aminotransferases, y_El.interleukin-6, y_SBI, y_AKI, y_ARDS]. The overall workflow of the model development is depicted in Figure 2. For each complication-specific model, we excluded from its training and test sets patients who developed that complication prior to the time of prediction. For AKI, we also excluded patients with chronic kidney disease.
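The minimum/maximum/mean summarization of the first-24-hour time series described above can be sketched as follows; the dictionary-of-time-series input format and the function names are illustrative assumptions, not the study's schema.

```python
import math

# Sketch of reducing first-24-hour vital-sign / laboratory time series to
# min/max/mean features; variables with no measurements become NaN so that
# downstream median imputation (or LGBM's native handling) can deal with them.
def summarize(values):
    if not values:
        return {"min": math.nan, "max": math.nan, "mean": math.nan}
    return {"min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}

def build_features(timeseries, window_h=24.0):
    """timeseries: {variable_name: [(hours_from_admission, value), ...]}."""
    feats = {}
    for var, records in timeseries.items():
        in_window = [v for t, v in records if t <= window_h]
        for stat, val in summarize(in_window).items():
            feats[f"{var}_{stat}"] = val
    return feats
```

Measurements outside the 24-hour window are discarded, mirroring the prediction-time setup in which only admission data are available.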
Then for each complication, our system trains five model ensembles based on five types of base learners: logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), multi-layer perceptron (MLP), and a light gradient boosting model (LGBM). Missing data were imputed using median imputation for all models except for LGBM, which can natively learn from missing data, and the data were further scaled using min-max scaling for LR and MLP and standard scaling for SVM and KNN. For each type of base learner, the system performs a stratified k-fold cross-validation using the complication's respective training set with k = 3. We performed a random search over each base learner's hyperparameters [21] for 20 iterations, resulting in 3 trained models per hyperparameter set. The hyperparameter search ranges are described in Supplementary Section B. We selected the top two sets of hyperparameters that achieved the highest average area under the receiver operating characteristic curve (AUROC) on the validation sets, resulting in 6 trained models per ensemble. Then, we selected the ensemble that achieved the highest average AUROC on the validation set. Each model within the selected ensemble was further calibrated using isotonic regression on its respective validation set to ensure non-harmful decision making [22], except for the LR models. The final prediction of each complication consisted of an average of the predictions of all calibrated base learners per ensemble. To understand which input features were most predictive of each complication, we performed post-hoc feature importance analysis using tree SHAP (SHapley Additive exPlanations) [23]. All analyses were performed using Python (version 3.7.3); the LR, KNN, SVM, and MLP models were trained using the scikit-learn package, and the LGBM models were trained using the LightGBM package [24].
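The per-complication training loop (median imputation, scaling, stratified 3-fold cross-validation, and a random hyperparameter search scored by validation AUROC) can be sketched with scikit-learn. This is a minimal single-learner sketch restricted to logistic regression on an assumed synthetic dataset, not the released pipeline; the function name and data are illustrative.

```python
# Minimal sketch (assumed names, synthetic data) of the per-complication
# model search: impute -> scale -> classifier, with a random search over
# hyperparameters scored by AUROC under stratified 3-fold cross-validation.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def fit_lr_ensemble(X, y, n_iter=5, seed=0):
    """Search C for a logistic-regression base learner; the study's system
    repeats this for KNN/SVM/MLP/LGBM and keeps the best ensemble by AUROC."""
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # median imputation
        ("scale", MinMaxScaler()),                      # min-max scaling for LR
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    search = RandomizedSearchCV(
        pipe,
        param_distributions={"clf__C": np.logspace(-3, 2, 50)},
        n_iter=n_iter,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=seed),
        random_state=seed,
    )
    search.fit(X, y)
    return search

# Synthetic stand-in for one complication's training set, with missing labs.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan
model = fit_lr_ensemble(X, y)
```

The study additionally keeps the top two hyperparameter sets (6 cross-validation models per ensemble) and calibrates the non-LR members with isotonic regression, which this sketch omits.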
We evaluated each complication ensemble using the AUROC and the area under the precision-recall curve (AUPRC) on the test set. Confidence intervals for all of the evaluation metrics were computed using bootstrapping with 1,000 iterations [25]. We also assessed the calibration of the ensemble, after post-hoc calibration of its trained models, using reliability plots, and reported calibration intercepts and slopes [22]. The funding source had no role in the study design or data analysis. The study was performed by all co-authors, who had access to the anonymized dataset.

Figure 2: Overview of our proposed model development approach and expected application in practice. In the first row, we developed our complication-specific models by first preprocessing the data, identifying the occurrences of the complications based on the criteria shown in Table 2, training and selecting the best-performing models on the validation set, and then evaluating the performance on the test set, retrospectively. As for deployment, we expect our system to predict the risk of developing any of the seven complications for any patient after 24 hours of admission.

A total of 3,352 encounters were included in the study, and the characteristics of the final data splits are presented in Table 3. Across all the data splits, the mean age ranges between 39.3 and 45.5 years, and the proportion of males ranges between 84.8% and 88.9%. The mortality rate was also less than 4% across all data splits, ranging between 1.3% and 3.7%. The most prevalent complication across all datasets was elevated d-dimer, although most patients mainly exhibited elevated d-dimer during the first 24 hours of admission. Elevated interleukin-6 was the most prevalent complication developed after 24 hours of admission across all datasets.
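The bootstrapped confidence intervals described above can be sketched as follows: resample the (label, score) pairs with replacement 1,000 times, recompute the AUROC on each resample, and take empirical percentiles. The helper name is illustrative, not the authors' code.

```python
# Sketch of percentile-bootstrap confidence intervals for a test-set AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_iter=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    while len(stats) < n_iter:
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # a resample must contain both classes for AUROC
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

The same resampling indices can be reused across metrics (AUROC, AUPRC, calibration slope) so that all intervals reflect the same sampling variability.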
The incidence of the complications developed after 24 hours was higher in the test sets than in their respective training sets, except for elevated troponin and d-dimer, which were higher in training set A (3.0% and 6.6%, respectively) than in test set A (2.4% and 4.8%, respectively). The distributions of the vital signs and laboratory-test results, in terms of the mean and interquartile ranges, are shown in Table 4. The performance of the models selected by our system across the two test sets in terms of the AUROC and AUPRC is shown in Table 5. The ROC, PRC, and reliability plots are also visualized in Figure 3. Across both test sets, our data-driven approach achieved good accuracy (>0.80 AUROC) for most complications. In test set A, AKI was the best discriminated endpoint at 24 hours from admission, with 0.905 AUROC (95% CI 0.861, 0.946). This is followed by ARDS (0.864 AUROC), SBI (0.862 AUROC), elevated troponin (0.843 AUROC), elevated interleukin-6 (0.820 AUROC), and elevated aminotransferases (0.801 AUROC). The complication with the worst discrimination was elevated d-dimer (0.717 AUROC). In test set B, AKI was also the best discriminated endpoint with 0.958 AUROC (95% CI 0.913, 0.994), followed by elevated troponin (0.913 AUROC), and elevated interleukin-6 (0.899 AUROC). Similar to test set A, elevated d-dimer was the worst discriminated endpoint (0.714 AUROC). We also observe that LGBM was selected as the best performing model on the validation sets for most complications, as shown in Supplementary Section C. LR was selected for AKI in both datasets, for elevated d-dimer in dataset A, and for SBI in dataset B, highlighting its predictive power despite its simplicity compared to the other machine learning models. The top four important features for each complication are shown in Figure 4 across the two test sets. In test set A, age was among the top predictive features for all the complications except for elevated interleukin-6 and AKI.
In test set B, C-reactive protein was among the top predictive features for predicting elevated aminotransferases, elevated d-dimer, elevated interleukin-6, and ARDS.

Table 3: Summary of the baseline characteristics of the patient cohort in the training sets and test sets and the prevalence of the predicted complications. Note that n represents the total number of patients while % is the proportion of patients within the respective dataset.

Other features were also among the top predictive features for several complications across both sets: ferritin and LDH, specifically for AKI, and BMI, specifically for ARDS. We also visualize the timelines of two patients in Figure 5, along with the predictions of our system. In Figure 5(a), the patient shown developed all seven complications during their hospital stay of 43 days. This highlights the importance of predicting all complications simultaneously, especially for patients who may develop more than one complication. In Figure 5(b), the patient did not develop any complications during their hospital stay of two days. Comparing the two patients, the system's predictions for patient (a) were consistently higher than those for patient (b). For example, the AKI predictions were 0.54 and 0.03, respectively, even though patient (a) only developed AKI at around 20 days from admission. This demonstrates the value of our system in predicting the risk of developing complications early during the patient's stay. In this study, we developed a predictive system of commonly occurring complications among COVID-19 patients to support patient triage. During validation, the system was assessed for performance and calibration. To the best of our knowledge, this is one of the few machine learning studies that predict non-mortal complications secondary to COVID-19 and the first to demonstrate a system that predicts the risk of such complications simultaneously.
The system achieves good performance across all complications, for example, reaching above 0.9 AUROC for AKI across two independent datasets. This study has several strengths and limitations. One of the main strengths is that we used multicentre data collected from 18 facilities across several regions in Abu Dhabi, UAE. COVID-19 treatment is free for all patients, hence there were no obvious gaps in terms of access to healthcare services in our dataset. Our dataset is diverse, since Abu Dhabi is home to more than 200 nationalities, with Emiratis making up only 19.0% of the population. These characteristics of the dataset make our findings relevant to a global audience. This is also the first data-driven study to represent the UAE population and one of the few COVID-19 studies with a large sample size (3,352 patient encounters), while most previous studies have focused on European or Chinese patient cohorts. Despite the diversity of the dataset, one limitation is that we did not perform validation on a patient cohort external to the UAE. Compared to other international patient cohorts, our patient cohort is relatively younger, with a lower overall mortality rate, suggesting that our system needs to be further validated on populations with different demographic distributions [4, 7, 13, 14]. Our data-driven approach and open-access code can be easily adapted for such purposes. Several studies reported worse prognosis among COVID-19 infected patients who had multi-organ failure, severe inflammatory response, and other hematological complications [5, 6, 7, 10]. Most existing studies focus on predicting the mortality endpoint [2]. The low mortality rate in our dataset strongly discouraged the development of a mortality risk prediction score, as small sample sizes may lead to biased models [2]. Our work was instead motivated by predicting the precursors of such severe adverse events, as identified by the World Health Organization [3].
We identified and predicted seven complications indicative of patient severity in order to avoid worse patient outcomes. The prevalence of the predicted complications ranged between 2% and 10% in our training sets and between 2% and 13% in our test sets. This high class imbalance is reflected in the AUPRC results. Since most of those tasks have not been investigated thoroughly before, our results introduce new benchmarks against which to evaluate other competing models. Future work should also investigate the use of multi-label deep learning classifiers, while accounting for the exclusion criteria during training. An important aspect of this study is that the labeling criteria rely on established clinical standards and hospital-acquired data to identify the exact time of the occurrence of such complications. In collaboration with the clinical experts, this approach was considered more reliable than relying on International Classification of Disease (ICD) codes, since ICD codes are generally used for billing purposes and their derivation may vary across facilities, especially during a pandemic. One limitation of the labeling procedure is that it could miss patients for whom the data used in identifying a particular complication were not collected. However, this issue relates more closely to data collection practices at institutions, and clinical data are often not missing at random. We also avoided label leakage by ensuring that there is no overlap between the set of input features and the features used to identify complications. The feature importance analysis revealed that age, oxygen saturation, and respiratory rate are highly predictive of several complications. Since COVID-19 is predominantly a pulmonary illness, it was not surprising that oxygen saturation and respiratory rate ranked among the highest predictive features. Such features are routinely collected at hospitals and do not incur any additional data collection costs.
We also identified C-reactive protein, ferritin, LDH, procalcitonin, systolic blood pressure, and diastolic blood pressure as markers for severity among COVID-19 patients, which is aligned with the clinical literature [20, 26]. This analysis demonstrates that our system's learning is clinically meaningful and relevant. We assessed our models' calibration by reporting the calibration slopes and intercepts with confidence intervals and visualizing the calibration curves. Sufficiently large datasets are usually needed to produce stable calibration curves at the model validation stage. Despite the size of our dataset, we found that reporting the calibration slopes and intercepts provides a concise summary of potential problems with our system's risk calibration, to avoid harmful decision-making [22]. Overall, our results show that our ensemble models were adequately calibrated across all complications, as shown in Table 5 and Figure 3(c). This is also reflected in the sample patient timelines shown in Figure 5, where the predicted risks for the patient who experienced the complications were relatively higher than those predicted for the patient who did not experience any complications. Factors limiting perfect calibration include the small dataset size and the fact that the ensemble prediction consists of an average of the predictions of the individually calibrated models. Further work should investigate how to improve the calibration of ensemble models. Our data-driven approach and results highlight the promise of machine learning in predicting the risk of complications among COVID-19 patients. The proposed approach performs well when applied to two independent multicentre training and test sets in the UAE. The system can be easily implemented in practice due to several factors. First, the input features that our system uses are routinely collected by hospitals that accommodate COVID-19 patients, as recommended by the WHO.
Second, training the machine learning models within our system does not require high computational resources. Finally, through feature importance analysis, our system offers interpretability, and it is fully automated as it does not require any manual intervention. To conclude, we propose a clinically applicable prognostic system that predicts non-mortal complications among COVID-19 patients. Our system can serve as a guide to anticipate the course of COVID-19 patients and to help initiate more targeted and complication-specific decision-making on treatment and triage.

We are unable to share the full dataset used in this study due to restrictions by the data provider. However, to allow for reproducibility and benchmarking on our dataset, we are sharing test set B (n=225), the trained models, and the source code online at https://github.com/nyuad-cai/COVID19Complications.

A Details of data pre-processing for labeling the complications

The KDIGO classification was used to label AKI encounters [17]. The definition has three criteria, and a patient was assigned a diagnosis of AKI if any of them was satisfied: an increase in serum creatinine of 0.3 mg/dl within 48 hours, an increase of 1.5 times the baseline serum creatinine measurement, or a urine output of less than 0.5 ml/kg/hr for 6 hours [17]. We only assessed the first two criteria, since urine output was not available in our dataset. The patient's first record of serum creatinine was treated as the baseline for that patient. Patients with reported chronic kidney disease were excluded from the AKI training and test subsets. The Berlin definition was employed to identify the timing and incidence of ARDS [18]. The full ARDS labeling process is illustrated by the flow diagram in Figure 1.
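The two KDIGO criteria assessed here can be sketched as a labeling helper; the time-sorted (hours, mg/dl) record format and the function name are assumptions for illustration, not the study's code.

```python
# Sketch of the two KDIGO criteria used in the study (urine output unavailable):
# AKI if creatinine rises by >= 0.3 mg/dl within 48 h, or reaches 1.5x the
# patient's baseline, where the baseline is the first recorded value.
def label_aki(creatinine):
    """Return the hour at which the AKI criteria are first met, or None.

    creatinine: time-sorted list of (hours_from_admission, value_mg_dl).
    """
    if not creatinine:
        return None
    baseline = creatinine[0][1]
    for i, (t, value) in enumerate(creatinine):
        # Criterion 2: rise to 1.5x the baseline measurement.
        if value >= 1.5 * baseline:
            return t
        # Criterion 1: increase of >= 0.3 mg/dl within a 48-hour window.
        for t_prev, v_prev in creatinine[:i]:
            if t - t_prev <= 48 and value - v_prev >= 0.3:
                return t
    return None
```

As in the study, encounters with chronic kidney disease would be removed before this labeling is applied.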
Textual chest X-ray and CT scan reports were processed using natural language processing (NLP) techniques to identify three categories of key terms: opacity, bilaterality, and ARDS. The lexicon was developed in reference to the Herasevich [27] and ASSIST [28] sniffers, and was further refined and validated based on clinical expertise. To minimize the influence of uncertainty profiles, the negation expression "no" was searched for within the 40 characters preceding each identified opacity term. The ARDS diagnosis was confirmed if either of two criteria was satisfied: (1) an ARDS term is present, or (2) both a bilaterality term and an opacity term are present in the report. We identified the first radiology observation of bilateral opacity, as subsequent reports usually refer to previously conducted ones for the same patient instead of repeating the full interpretation and findings. Manual inspection of portions of the reports was performed to validate the algorithm. For the oxygenation criterion, 13,862 arterial partial pressure of oxygen (PaO2) measurements acquired through arterial blood gas (ABG) tests were recorded for 358 unique patients. We confirmed with SEHA clinicians that such a test is only conducted for patients suspected of ARDS or with severe symptoms, and therefore patients without one can be ruled out of ARDS directly. Each PaO2 measurement was matched with the closest prior record of FiO2 (the fraction of inspired oxygen) for the given patient to obtain the P/F ratio. For patients with missing FiO2 measurements, we assumed that they were not on oxygen therapy and assigned a value of 0.2095 (20.95% oxygen in air). Patients were then labeled as potentially having ARDS if their P/F ratio was ≤ 300 mm Hg.
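The report screening and oxygenation steps can be sketched as below. The three-term lexicon is a toy stand-in for the refined clinical lexicon, and the record formats and function names are assumptions; the sketch only checks the first occurrence of each term.

```python
# Toy lexicon standing in for the refined clinical one; categories follow the text.
LEXICON = {
    "opacity": ["opacity", "opacities", "consolidation"],
    "bilateral": ["bilateral", "both lungs"],
    "ards": ["ards", "acute respiratory distress"],
}

def term_present(report, category, window=40):
    """Match a key term, rejecting it when 'no' appears in the preceding window."""
    text = report.lower()
    for term in LEXICON[category]:
        i = text.find(term)
        if i >= 0 and "no" not in text[max(0, i - window):i].split():
            return True
    return False

def report_suggests_ards(report):
    """Criterion (1): an ARDS term; or (2): bilaterality and opacity together."""
    return term_present(report, "ards") or (
        term_present(report, "bilateral") and term_present(report, "opacity"))

def pf_ratio(pao2, fio2_records, t):
    """PaO2 over the closest prior FiO2; room air (0.2095) when none is recorded.

    fio2_records: time-sorted list of (hours_from_admission, fio2_fraction).
    """
    prior = [f for t_f, f in fio2_records if t_f <= t]
    return pao2 / (prior[-1] if prior else 0.2095)
```

A P/F ratio ≤ 300 mm Hg then flags a potential ARDS encounter, per the oxygenation criterion above.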
The earliest recorded time (either the arrival time, the admission time, or the first time the patient tested positive for COVID-19) was used in lieu of the precise time of the clinical insult or onset of respiratory symptoms for the timing criterion of the Berlin definition. To rule out pulmonary edema of other origin, patients with cardiac edema prior to the onset of ARDS were identified from the vitals and excluded. With the criteria and steps delineated herein, 243 patients were identified as having ARDS across both the training and test sets.

Figure 1: The ARDS labeling process in our dataset, in accordance with the four criteria of the Berlin definition [18]: imaging, oxygenation, timing, and origin. The lexicon developed for identifying bilateral opacity in radiology reports is also shown within the table on the left.

B Hyperparameter search ranges

Our system performs a random search over the hyperparameters of the machine learning models and then evaluates their performance on the validation sets. The searched hyperparameters for each of the models are shown in Table B.

Table B: Hyperparameter values considered during the random hyperparameter search. Ranges are indicated with a '-'.
Multi-layer Perceptron: Learning rate [Constant, Adaptive]; Weight optimization solver [SGD, Adam]; Hidden layer sizes

C Model selection results

After preprocessing the data, we compared the performance of 5 ensembles based on 5 types of base learners on the validation sets: Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Multi-layer Perceptron (MLP), and Light Gradient Boosting Model (LGBM). The models were compared using the AUROC and AUPRC, and the results are shown in Table C. We selected the ensemble that achieved the highest AUROC on the validation set.

References

- An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases.
- Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal.
- A minimal common outcome measure set for COVID-19 clinical research. The Lancet Infectious Diseases.
- Hua Chen, and Bin Cao. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
- Prevalence and outcomes of d-dimer elevation in hospitalized patients with COVID-19. Arteriosclerosis, Thrombosis, and Vascular Biology.
- Acute kidney injury in patients hospitalized with COVID-19.
- Association of troponin levels with mortality in Italian patients hospitalized with coronavirus disease 2019: results of a multicenter study.
- Prevalence and impact of cardiovascular metabolic diseases on COVID-19 in China.
- D-dimer as a biomarker for disease severity and mortality in COVID-19 patients: a case control study.
- Pattern of liver injury in adult patients with COVID-19: a retrospective analysis of 105 patients.
- and the Northwell COVID-19 Research Consortium. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the
- Dynamic interleukin-6 level changes as a prognostic indicator in patients with COVID-19.
- Clinical characteristics of COVID-19 in New York City.
- Clinical features and outcomes of 98 patients hospitalized with SARS-CoV-2 infection in Daegu, South Korea: a brief descriptive study.
- Clinical characteristics of coronavirus disease 2019 in China.
- Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement.
- KDIGO clinical practice guidelines for acute kidney injury.
- Acute respiratory distress syndrome: the Berlin definition.
- Diagnostic accuracy of single baseline measurement of Elecsys Troponin T high-sensitive assay for diagnosis of acute myocardial infarction in emergency department: systematic review and meta-analysis.
- C-reactive protein, procalcitonin, d-dimer, and ferritin in severe coronavirus disease-2019: a meta-analysis.
- Random search for hyper-parameter optimization.
- A calibration hierarchy for risk models was defined: from utopia to empirical data.
- From local explanations to global understanding with explainable AI for trees.
- LightGBM: a highly efficient gradient boosting decision tree.
- Blood pressure control and adverse outcomes of COVID-19 infection in patients with concomitant hypertension in Wuhan, China.
- Validation of an electronic surveillance system for acute lung injury.
- Validation study of an automated electronic acute lung injury screening tool.

Acknowledgments

We would like to thank NYU Abu Dhabi for the generous funding. We would also like to thank Waqqas Zia and Benoit Marchand from the Dalma team at New York University Abu Dhabi for supporting data management and access to computational resources. This study was supported through the data resources and staff expertise provided by Abu Dhabi Health Services.