key: cord-0836206-i7adp6qs authors: Arvind, Varun; Kim, Jun S.; Cho, Brian H.; Geng, Eric; Cho, Samuel K. title: Development of a machine learning algorithm to predict intubation among hospitalized patients with COVID-19 date: 2020-11-16 journal: J Crit Care DOI: 10.1016/j.jcrc.2020.10.033 sha: 0ca502429d712bbeb39eea32e18f20e6ca0b52df doc_id: 836206 cord_uid: i7adp6qs PURPOSE: The purpose of this study is to develop a machine learning algorithm to predict future intubation among patients diagnosed with or suspected of COVID-19. MATERIALS AND METHODS: This is a retrospective cohort study of patients diagnosed with or under investigation for COVID-19. A machine learning algorithm was trained to predict future presence of intubation based on prior vitals, laboratory, and demographic data. Model performance was compared to the ROX index, a validated prognostic tool for prediction of mechanical ventilation. RESULTS: 4087 patients admitted to five hospitals between February 2020 and April 2020 were included. 11.03% of patients were intubated. The machine learning model outperformed the ROX index, demonstrating an area under the receiver operating characteristic curve (AUC) of 0.84 and 0.64, and an area under the precision-recall curve (AUPRC) of 0.30 and 0.13, respectively. In the Kaplan-Meier analysis, patients alerted by the model were more likely to require intubation during their admission (p < 0.0001). CONCLUSION: In patients diagnosed with or under investigation for COVID-19, machine learning can be used to predict future risk of intubation based on clinical data which are routinely collected and available in the clinical setting. Such an approach may facilitate identification of high-risk patients to assist in clinical care.
The novel coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as defined by the World Health Organization (WHO), is an infectious disease that has spread quickly around the globe since the first case was reported in December 2019. It has become a pandemic of unprecedented proportions in modern times that has caught many countries and their healthcare systems woefully unprepared. Although there are numerous ongoing clinical trials, no curative therapeutic or vaccine against COVID-19 is available to date. This has left the medical community to rely largely on supportive care, especially ventilator support, leading to a significant shortage of intensive care unit (ICU) availability. In February and March 2020, New York City, its surrounding boroughs, and the tristate area quickly became the epicenter of the SARS-CoV-2 pandemic. As of May 12, 2020, there had been 184,319 reported cases, 48,939 hospitalized patients, and a staggering 20,237 deaths. This rapid rise in the number of positive cases has generated massive amounts of COVID-19 patient data, which can be analyzed with machine learning algorithms to provide useful insights. Researchers, in a very short time frame, have analyzed publicly available clinical datasets using natural language processing, convolutional neural networks [1,2], and dense neural nets [3-5] to improve diagnostic speed and accuracy, develop and analyze the effects of therapeutic approaches [6], and identify susceptible patients based on personalized genetics [7], demographics, laboratory values, comorbidities, and imaging.
In response to this crisis, the medical and academic centers in New York City issued a call to action to artificial intelligence researchers to leverage their electronic medical record (EMR) data to better understand and manage the disease.

This study is a retrospective cohort study for the development and validation of a machine learning model to predict intubation among patients diagnosed with or suspected of COVID-19. The study was performed at an academic healthcare system in an urban setting. All ethical regulations and concerns for patients' privacy were followed during this study. The Institutional Review Board approved the present study and granted a waiver of consent for patient data owing to the retrospective nature of the study. Demographic, vitals, and laboratory data were retrospectively queried from the electronic health records of 4,087 patients admitted to five hospitals within an academic healthcare system between February 2020 and April 2020. Given the novelty of COVID-19, all patient data were used to maximize sample size and available training data for the model. Patients ≥18 years of age who were either COVID-19-positive by polymerase chain reaction (PCR) testing or deemed a patient under investigation for COVID-19 were included. Within the EMR, there were several instances of the same variable being recorded under separate identifiers (e.g., point-of-care testing vs. routine labs, or unique identifiers based on facility). This led to increased sparsity, as data were distributed over several identifiers. To ensure our model was not specific to features within a specific patient cohort and setting, we employed the following strategies to improve model generalizability. Unique identifiers for variables were grouped manually with input from a board-certified physician (J.S.K.), ensuring that lab units and detection limits were comparable, allowing us to merge 31 variables into 15.
For multiple lab or vital measurements recorded at the same time within the same group, we used the mean of the simultaneous measurements. Using these strategies, we reduced our variable set from 31 to 15. The following time-series variables were used: Time from Admission, Diastolic BP, Systolic BP, Pulse, Respiratory Rate, Temperature, pH, HCO3, Oxygen Saturation, Arterial CO2, Arterial O2, Platelet Count, WBC Count, Creatinine, C-Reactive Protein, D-Dimer. Comorbidities were imputed from ICD-10 diagnostic codes using Elixhauser comorbidity measures [8]. The following comorbidities were used: Hypertension With Complications, Hypertension, Liver Disease, Renal Disease, Diabetes, Diabetes With Complications, Chronic Obstructive Pulmonary Disease. For each visit, the time of first intubation and time of last extubation were extracted from electronic health records. A positive label was assigned if a patient was or remained intubated 72 hours from the end of the 24-hour sampling window (Fig. S1). A negative label was assigned if a patient was not intubated 72 hours from the end of the 24-hour sampling window, such as if they were never intubated or had been extubated. For patients intubated and deceased within 72 hours from the end of the sampling window, the label remained positive (Fig. S1). Laboratory and vitals information was imputed using an indefinite feed-forward method based on the assumption that previous values remain constant until a value is updated. Of note, laboratory and vitals data were updated with varying frequencies, often due to differences in patient acuity.
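The labeling rule and feed-forward imputation described above can be sketched as follows (a minimal illustration; the function names and the simplified hours-from-admission time representation are hypothetical, not taken from the study's released code):

```python
import pandas as pd

def label_window(intubation_start, extubation_end, window_end, horizon_h=72):
    """Binary label: positive if the patient is (still) intubated 72 h after
    the 24-hour sampling window ends. Times are hours from admission;
    None means the event never occurred (e.g. death while intubated
    leaves extubation_end as None, so the label stays positive)."""
    t = window_end + horizon_h
    if intubation_start is None or intubation_start > t:
        return 0                      # never intubated, or intubated later
    if extubation_end is not None and extubation_end <= t:
        return 0                      # already extubated by the horizon
    return 1                          # intubated at the horizon

def feed_forward(df):
    """Indefinite feed-forward imputation: carry the last observed value
    forward until it is updated."""
    return df.sort_index().ffill()
```

The released repository linked later in the paper should be consulted for the exact implementation; this sketch only captures the stated rules.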
To ensure models were not learning to predict future intubation status based on the update frequency of specific laboratory values or vitals, rather than underlying physiologic patterns, we normalized sampling frequency using an upsampling-interpolation method in which values were carried forward (x1, x1, x2, x2, ..., xN, xN) to a frequency of one value every ~30 minutes within each 24-hour sampling window. In instances where feature values from all previous timepoints were completely missing for a patient, the respective population mean for the value from the training set was substituted. These mean values were saved and substituted for missing values in the testing dataset as well to prevent bias. We defined a supervised binary classification task using a sliding-window approach to predict the presence of intubation 72 hours from the end of the 24-hour sampling window. Model development reporting followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) criteria [9]. We defined a prediction task to be performed every 12 hours after the first 24 hours from the time of admission. Comorbidity and time-series data were used to fit a random forest classifier (RandomForestClassifier, scikit-learn version 0.23.0). Class weight balancing was employed to correct for class imbalance, and tree depth was optimized on the training/validation cohort. To test our model, we employed a randomized split in which patients were randomly allocated to a training/validation cohort (70%) and a hold-out testing cohort (30%). The same feature data processing protocol was applied to both cohorts. Following optimization of the model on the training/validation cohort, model weights were frozen and evaluated in a blinded fashion on the testing cohort.
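Under these assumptions, the pipeline can be approximated as below. The data are synthetic, and the 30-minute grid helper, feature dimensions, and max_depth value are illustrative stand-ins, not the study's tuned settings:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def to_30min_grid(series, window_hours=24):
    """Upsample irregular measurements onto a ~30-minute grid within the
    sampling window, carrying each value forward (x1, x1, x2, x2, ...)."""
    grid = pd.date_range(series.index[0], periods=window_hours * 2, freq="30min")
    return series.reindex(series.index.union(grid)).ffill().reindex(grid)

# Synthetic stand-in for the engineered per-window feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))
y = (X[:, 0] + rng.normal(size=500) > 1.5).astype(int)   # imbalanced labels

# 70/30 randomized split into training/validation and hold-out cohorts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Training-set means would fill any remaining gaps in BOTH cohorts,
# preventing leakage from the test set (shown here for completeness).
train_means = X_train.mean(axis=0)

clf = RandomForestClassifier(class_weight="balanced", max_depth=8,
                             random_state=0).fit(X_train, y_train)
risk = clf.predict_proba(X_test)[:, 1]                   # per-window risk
```

The class_weight="balanced" option is scikit-learn's built-in class-weight balancing, matching the correction for class imbalance described above.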
Model classifications were defined as follows: true positive, positive alert prior to intubation; false positive, positive alert with no intubation; true negative, no alert with no intubation; false negative, no alert with intubation, or initial alert after intubation. Performance of the model was evaluated using AUC and AUPRC. All performance metrics (AUC, AUPRC, average precision) are reported based on evaluation of the model on the hold-out testing cohort, which was blinded from the model during optimization and training. Survival statistics were conducted using the presence of intubation during the course of the hospital stay as the outcome event, to compare the intubation-free event rate between patients alerted and not alerted by the model. Kaplan-Meier curves were generated with computed 95% confidence intervals. Significance was tested using the log-rank test. All analyses were performed on secure computer clusters within the institution. Code was written in Python 3 using the numpy, pandas, and scikit-learn libraries. Data collected for this study are not publicly available because they contain patient health information; however, relevant code, in addition to the trained random forest model, can be found at https://github.com/varvind17/covid-intubation-prediction. Derived aggregate data or findings are available from the corresponding author on request.

We retrospectively collected data from 4,087 COVID-19-positive patients or patients under investigation (PUI) for COVID-19 admitted to a large academic healthcare system in an urban setting from February 2020 to April 2020. The 4,087 patients included in this study had a mean age of 58.6 ± 21.9 years, and 65.4% of patients were female. 11.03% of patients were intubated, and 24.9% of patients died. Among patients intubated during their admission, 35.29% were extubated.
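The hold-out evaluation metrics correspond to standard scikit-learn calls; a minimal sketch with toy labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score):
    """Hold-out metrics used in the paper: area under the ROC curve (AUC)
    and area under the precision-recall curve (average precision)."""
    return {"AUC": roc_auc_score(y_true, y_score),
            "AUPRC": average_precision_score(y_true, y_score)}

# Toy example: 2 positives among 6 windows, with imperfect ranking.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9])
metrics = evaluate(y_true, y_score)
```

With a rare positive class, AUPRC is far more sensitive than AUC to false positives among the top-ranked windows, which is why both metrics are reported.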
A detailed description of data pre-processing can be found in Methods. In brief, 16 continuous vitals and laboratory variables and 7 static comorbidity variables were consistently measured and aggregated. Data were upsampled to a ~30-minute resolution using feed-forward imputation. Intubation state during the course of admission was annotated by extracting intubation and extubation times from electronic health records (EHR). In total, 451 (11.03%) patients were intubated, with a mean length of stay (LOS) of 6.7 ± 8.4 days for non-intubated patients and 11.9 ± 11.3 days for intubated patients. Among patients who were intubated, intubation occurred on average 3.1 days from admission and lasted 5.0 days on average (Table S1). Using a 24-hour sampling window, we aimed to predict intubation status 72 hours from the end of the sampling window (Figs. 1, S1). Continuous risk predictions were made every 12 hours from the onset of admission (Fig. 1). In total, 27,226 and 18,371 time-intervals were generated for the training and testing datasets, respectively. Among the time-intervals, 6.8% in the training set and 6.5% in the testing set were labeled as positive. Among time-intervals corresponding with a negative label, 89.3% of windows were associated with patients never intubated, while the remaining 10.7% were associated with patients who were intubated at some point during their admission. Prior to feed-forward filling, on average 36% of time-series values were missing. When stratified by type of feature, 56% of laboratory values were missing (range: 2.7-74.4%) and 1.0% of vital sign values were missing (range: 0.9-1.2%). For each variable, model weights were extracted and standardized to assess the relative importance of each feature. Among comorbidities, complicated hypertension was highly weighted while diabetes was not. Decision criteria learned by the model closely match normal-abnormal boundary thresholds for vitals and laboratory values (Figs.
S2a, S3). Respiratory rate was the most heavily weighted continuous variable (Fig. S2b). Model performance was compared to the ROX index, a validated tool which uses oxygen saturation, respiratory rate, and fraction of inspired oxygen to predict progression to intubation [10,11]. ROX scores were calculated using the last observed measurement from the end of the sampling window. The area under the receiver operating characteristic curve (AUC) for the ROX index was 0.64, similar to what has been reported [10,11]. The AUC for the model was 0.84 (Table 1). In class-imbalanced prediction problems with rare events, predictive tools that maintain good PPV without sacrificing sensitivity are challenging to develop. To evaluate this, areas under the precision-recall curves (AUPRCs) were generated. The AUPRC for the ROX index and the model were 0.13 (mean precision: 0.10) and 0.30 (mean precision: 0.22), respectively (Fig. 2). In all subsequent analyses, an optimal threshold was identified using Youden's index, which yielded a recall of 79.2% [12]. Patients with a predicted risk greater than the optimal threshold (alerted) were more likely to require intubation during their admission (Fig. 3). To ensure good generalizability, we evaluated the model among patients of different ages and genders and with high and low body-mass index, an emerging risk modifier for COVID-19 prognosis [13,14]. The model had a consistent AUC across all ages except patients less than 40 years of age. There was a noticeable difference in model AUPRC and average precision between patients aged 40-60 and 60-80 (Fig. 4a). There were no significant differences in AUC between men and women. There was an increase in AUPRC and average precision among obese patients (BMI ≥ 35). Among suspected but not confirmed cases, the model performed with an AUC, AUPRC, and average precision of 0.81, 0.08, and 0.04, respectively. Additionally, we stratified patients by hospital to assess whether there were biases in model performance based on facility.
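For context, the ROX index and the Youden-index threshold selection can be computed as follows (a generic sketch; the variable names are illustrative, and the ROX formula follows its published definition of SpO2/FiO2 divided by respiratory rate):

```python
import numpy as np
from sklearn.metrics import roc_curve

def rox_index(spo2_pct, fio2_frac, resp_rate):
    """ROX index: (SpO2 [%] / FiO2 [fraction]) / respiratory rate [breaths/min].
    Lower values indicate higher risk of progression to intubation."""
    return (spo2_pct / fio2_frac) / resp_rate

def youden_threshold(y_true, y_score):
    """Operating point maximizing Youden's J = sensitivity + specificity - 1,
    i.e. the ROC point with the largest TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
```

A patient on room air (FiO2 = 0.21) with SpO2 96% and a respiratory rate of 22 would score roughly 20.8; published work dichotomizes ROX around values near 4.88 for patients on high-flow oxygen.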
There were no differences in AUC; however, there were differences in AUPRC and average precision across hospitals (Fig. 4b).

As SARS-CoV-2 continues to wreak havoc on the world's population, there has been an enormous surge of effort from the medical community at large to characterize, understand, and predict the nature of COVID-19. With the availability of large databases and collaboration between researchers on an international scale, there has been significant progress on the AI front. To this end, we defined a supervised binary classification task using a sliding-window approach to predict the presence of intubation 72 hours from the end of the 24-hour sampling window. We defined a prediction task to be performed every 12 hours from the time of admission. Our machine learning algorithm performed with an AUC of 0.83 and an AUPRC of 0.32, significantly outperforming the ROX index for intubation risk. We demonstrate a machine learning algorithm that frontline physicians in the emergency department and on the inpatient floors can use to better risk-assess patients and predict those who will require intubation and mechanical ventilation. To our knowledge, while some studies have assessed the risk of progression to ARDS and mortality, no recent studies have modeled an algorithm to predict intubation in COVID-19-positive patients. In a large report, 49% of 2,087 critically ill patients with COVID-19 died. Single-center studies showed 62% and 67% mortality rates among ICU patients in Wuhan, China and Washington State, respectively [15-17]. While the decision to intubate is often based on the discretion of the treating physician, there are empirical guidelines released by the Chinese Society of Anaesthesiology that recommend intubation in a timely fashion [15]. They note that in severe cases, the disease progresses rapidly and develops into acute respiratory distress syndrome, septic shock, metabolic acidosis, and coagulopathy.
A more recent study by Meng et al. published criteria for intubation, including an SpO2 less than 93% on room air, a PaO2-to-FiO2 ratio less than 300 mmHg, a respiratory rate > 30 breaths/minute, cardiopulmonary arrest, and/or a lost or jeopardized airway [18]. Although there is currently not enough evidence to show that early intubation reduces mortality, based on the work by Shoemaker et al., there is an association between accumulated oxygen debt and survival in patients undergoing high-risk surgery and ICU admission [18,19]. The high mortality in intubated patients, the empirical guidelines that promote vigilant care and early, albeit not premature, intubation, and the possibility of reduced mortality in patients who do not accumulate oxygen debt emphasize the importance of this study and of an algorithm that can predict intubation risk in COVID-19-positive patients. Using a 24-hour sampling window, a random forest classifier was trained and tested on retrospective data sampled from a large, urban academic healthcare system at the center of the COVID-19 outbreak. The model was capable of predicting intubation status 72 hours from the end of the sampling window, with risk predictions made every 12 hours from the onset of admission. Risk assessments every 12 hours were chosen because physician shifts are often 12 hours, so newly generated risk predictions would serve to update the risk assessment of COVID-19 patients on the floor. Additionally, predictions every 12 hours allow for measurement of new vitals and laboratory values that would update risk scores. Interestingly, decision criteria learned by the model closely matched thresholds from clinical guidelines (Fig. S2). The values represented as mean ± standard deviation (Fig. S2a) signify decision boundaries for patients being intubated.
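The 12-hourly prediction schedule described above amounts to a simple sliding schedule; a hypothetical helper illustrating the timing logic (not the study's code):

```python
def prediction_times(admission_h, discharge_h, first_after_h=24, every_h=12):
    """Hours (from admission) at which a risk prediction is generated:
    the first after 24 h of data have accrued, then every 12 h until
    discharge. Patients with fewer than 24 h of data get no prediction."""
    t, times = admission_h + first_after_h, []
    while t <= discharge_h:
        times.append(t)
        t += every_h
    return times
```

For a patient discharged 60 hours after admission, predictions would fall at 24, 36, 48, and 60 hours, each one scoring the preceding 24-hour window.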
A pH < 7.3 ± 0.1, CRP > 199 ± 59 mg/L, heart rate 126 ± 17 beats/min, respiratory rate > 22 ± 5 breaths/min, temperature > 100.0 ± 2.0 °F, O2 saturation < 96 ± 6%, PaO2 < 93 ± 25 mmHg, PaCO2 > 39 ± 11 mmHg, HCO3 < 25 ± 9, D-Dimer > 6 ± 4 mg/L, creatinine > 3 ± 3 mg/dL, WBC > 16.8 ± 11.6 (1000 cells/mm³), platelets < 163 ± 105, systolic pressure 116 ± 14 mmHg, and diastolic pressure 80 ± 17 mmHg closely match normal-abnormal laboratory and vital boundaries, suggesting the model learned physiologically relevant patterns. Respiratory rate was the most heavily weighted time-series variable for prediction of intubation and has been shown to be a factor associated with poor survivorship in COVID-19-positive patients in prior studies [20] (Fig. S2b). Zhou et al. described an odds ratio of 8.89 for in-hospital mortality in univariate analysis among patients with a respiratory rate > 24. Among comorbidities, complicated hypertension was more heavily weighted than diabetes. This finding is in concordance with risk factors associated with progression to acute respiratory distress syndrome (ARDS) among COVID-19-positive patients [21]. Recent studies have also shown coronary heart disease, diabetes, and hypertension to be significant risk factors for in-hospital mortality. Model performance was compared to the ROX index, a validated tool which uses oxygen saturation, respiratory rate, and fraction of inspired oxygen to predict progression to intubation. The area under the receiver operating characteristic curve (AUC) for the ROX index was 0.64, similar to what has been reported [10,11]. The model (AUC = 0.84) significantly outperformed the ROX index (AUC = 0.64) (p = 0.005) (Table 1). Like most machine learning problems involving clinical data, low prevalence of positive cases leads to class imbalance. Therefore, predictive tools that maintain good PPV without sacrificing sensitivity are challenging to develop. To evaluate this, areas under the precision-recall curves (AUPRCs) were generated.
The random forest model (AUPRC = 0.32) significantly outperformed the ROX index (AUPRC = 0.13) (p = 0.022) (Fig. 2). With regard to generalizability across demographics, we observed that the random forest model was generalizable across gender and age. The model performed with a similar AUC across all ages except for patients <40 years old, which we surmise is due to the rarity of positive cases within this cohort for the model to train from. Among patients less than 40 years of age, only 3.6% (13/358) of patients were intubated, while among patients older than 40, 8.6% of patients were intubated (110/1275). When considering BMI, the model performed with greater AUPRC and average precision among obese patients (BMI ≥ 35). This is likely due to the increased prevalence of intubation among this cohort (15.6% for BMI ≥ 35 vs. 6.7% for BMI < 35) (Fig. 4b). When we compared the performance of the model on patients with documented versus suspected COVID-19, we observed similar AUC. Expectedly, with respect to AUPRC and average precision, the model performed worse among suspected but not confirmed patients, possibly due to a decreased prevalence of intubation among these patients. In the hold-out dataset, 11.4% of patients were intubated. Among patients suspected of COVID-19, 5.3% were intubated. It is likely that some of these cases correspond to patients who were initially suspected of COVID-19, proceeded to deteriorate, and were clinically diagnosed with COVID-19 without laboratory testing, therefore not updating their status. Our proposed model does have some limitations. It currently requires a 24-hour sampling window in order to generate a prediction. Therefore, rapid risk assessments for patients without 24 hours of data are not possible. In future studies, it may be possible to reduce this sampling window.
Furthermore, since various laboratory and vital sign values are updated at different times and at different frequencies, we employed an indefinite feed-forward and upsampling-interpolation method to normalize feature sampling frequency. Since laboratory and vital signs were updated at differing times, this led to missing values following feature alignment to time. Laboratory values were missing at a greater percentage than vital signs; this difference is explained by the decreased frequency of measurement of laboratory values compared to vital signs. While we used upsampling to interpolate the sampling frequency of vitals and labs to approximately every 30 minutes, it is possible that differences in laboratory and vital sign recording frequency at different hospitals may lead to different results. Thus, future studies are required to validate the model. Other interpolation methods may lead to improved model performance and generalizability. In a study by Hyland et al., an adaptive imputation scheme was used to resample data to a ~5-minute resolution to predict circulatory failure among patients in the ICU [22]. Interestingly, the study found that similar performance could be achieved using simpler imputation schemes, such as an indefinite feed-forward method, when using decision tree models such as random forests. In this study, patients were not exclusively included from the ICU; therefore, we decided to upsample data to a resolution of ~30 minutes to accommodate instances where features were not updated as frequently. Additionally, we found that our random forest classifier was able to learn meaningful decision criteria to predict future intubation and was robust to our imputation scheme. Nevertheless, future studies should investigate handling of missing data and imputation schemes to optimize model performance. The use of supervised learning models may be subject to regional, institutional, or practice-specific bias.
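The difference between the indefinite feed-forward scheme used here and an alternative interpolation scheme can be illustrated on a toy series (the values and timestamps are synthetic):

```python
import numpy as np
import pandas as pd

# A sparse pH series on a 30-minute grid: most timepoints are missing
# because the lab is drawn far less often than vitals are recorded.
idx = pd.date_range("2020-03-01", periods=8, freq="30min")
sparse = pd.Series([7.40, np.nan, np.nan, 7.28, np.nan, np.nan, np.nan, 7.45],
                   index=idx, name="pH")

ffilled = sparse.ffill()             # carries the last value until updated
interp = sparse.interpolate("time")  # assumes smooth change between draws
```

Feed-forward yields a step function (7.40 until the 7.28 draw), whereas time interpolation assumes a gradual transition between draws; a decision-tree model that thresholds on values, as noted above, may be comparatively insensitive to this choice.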
In this study, the model was developed and tested on patients from a single healthcare system within a single region. While we stratified model performance across different hospitals in the hold-out dataset, caution should be taken when assessing external generalizability of the model. External generalizability should only be appraised following evaluation on patient data from a different healthcare system, as differences in hospital intubation culture may arise. Therefore, future validation studies should test the model on diverse patient cohorts from different hospital systems to assess its validity. Machine learning models, by their nature, learn patterns in data based on differences between positive and negative samples. While the model learned decision thresholds that resemble clinical guidelines and strongly weighted factors associated with poor outcomes among COVID-19-positive patients, machine learning models are by nature black boxes, and it is therefore difficult to fully dissect the rationale behind their prediction strategies. Among intubated patients who died, death occurred a mean of 152.27 hours following intubation. While this suggests the model learned time intervals associated with future intubation, rather than future death within 72 hours, future studies are required to understand the model's learned pattern recognition, with validation to confirm prediction of future intubation and not generalized deterioration. Lastly, COVID-19 has been a moving target, with ongoing changes in clinical guidelines and even virus biology [23,24]. Therefore, as new patient data are generated, efforts to retrain machine learning models should be undertaken to update models to changes in clinical practice and disease progression. All of these limitations indicate that this is a preliminary result requiring further work and validation before it can be practically used.
However, our findings also suggest that with larger numbers and further refinement, machine learning has the potential to quickly assess patients for intubation risk with a high degree of accuracy and, hopefully, reduce mortality. As SARS-CoV-2 continues to impact our lives in unprecedented ways, we present a possible tool for frontline physicians in the emergency department and on the inpatient floors to better risk-assess patients and predict those who will require intubation and mechanical ventilation. Our machine learning algorithm achieved an AUC of 0.83 and an AUPRC of 0.32, significantly outperforming the ROX index for intubation risk.

Table 1. Performance metrics AUC, AUPRC, and average precision for the machine learning model and ROX index to predict presence of intubation at 72 hours, evaluated on the hold-out dataset.

Supplemental Table 1. Demographic, comorbidity, intubation, and mortality data for the train, test, and combined cohorts.

Supplemental Figure 3. Schematic of a decision tree from the random forest model. Orange boxes denote predictions as negative (no intubation within 72 hours) or positive (intubated within 72 hours). Decisions to the left are made if the above node logic is satisfied, whereas decisions to the right are performed if the logic is false. For example, in the top node, if the respiratory rate at the start of the sampling window is ≤ 22.74 breaths per minute, the logic is satisfied and the patient progresses to the next decision node on the left.
References

[1] Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks.
[2] Drawing Insights from COVID-19 Infected Patients With no Past Medical History Using CT Scan Images and Machine Learning Techniques: A Study on 200 Patients.
[3] Modeling the trend of coronavirus disease 2019 and restoration of operational capability of metropolitan medical service in China: a machine learning and mathematical model-based analysis.
[4] Using Machine Learning to Estimate Unobserved COVID-19 Infections in North America.
[5] Early prediction of mortality risk among severe COVID-19 patients using machine learning.
[6] A data-driven drug repositioning framework discovered a potential therapeutic agent targeting COVID-19.
[7] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.
[8] Risk adjustment performance of Charlson and Elixhauser comorbidities in ICD-9 and ICD-10 administrative databases.
[9] Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD Statement.
[10] Predicting success of high-flow nasal cannula in pneumonia patients with hypoxemic respiratory failure: The utility of the ROX index.
[11] An Index Combining Respiratory Rate and Oxygenation to Predict Outcome of Nasal High-Flow Therapy.
[12] Estimation of the Youden Index and its associated cutoff point.
[13] Obesity a Risk Factor for Severe COVID-19 Infection: Multiple Potential Mechanisms.
[14] Obesity in patients younger than 60 years is a risk factor for COVID-19 hospital admission.
[15] Expert Recommendations for Tracheal Intubation in Critically Ill Patients with Novel Coronavirus Disease.
[16] Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study.
[17] Characteristics and Outcomes of 21 Critically Ill Patients With COVID-19 in Washington State.
[18] Intubation and Ventilation amid the COVID-19 Outbreak.
[19] Role of oxygen debt in the development of organ failure, sepsis, and death in high-risk surgical patients.
[20] Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
[21] Risk Factors Associated With Acute Respiratory Distress Syndrome and Death in Patients With Coronavirus Disease.
[22] Early prediction of circulatory failure in the intensive care unit using machine learning.
[23] Consensus guidelines for managing the airway in patients with COVID-19: Guidelines from the Difficult Airway Society, the Association of Anaesthetists, the Intensive Care Society, the Faculty of Intensive Care Medicine and the Royal College of Anaesthetists.
[24] Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus.