key: cord-0929560-zijytcw5
authors: Song, Wenyu; Zhang, Linying; Liu, Luwei; Sainlaire, Michael; Karvar, Mehran; Kang, Min-Jeoung; Pullman, Avery; Lipsitz, Stuart; Massaro, Anthony; Patil, Namrata; Jasuja, Ravi; Dykes, Patricia C
title: Predicting Hospitalization of COVID-19 Positive Patients Using Clinician-guided Machine Learning Methods
date: 2022-05-20
journal: J Am Med Inform Assoc
DOI: 10.1093/jamia/ocac083
sha: b02775dc79c7f3a27b7a5bea2d8432e9e1f1f743
doc_id: 929560
cord_uid: zijytcw5

OBJECTIVE: The coronavirus disease 2019 (COVID-19) outbreak is a resource-intensive global pandemic. It is important for healthcare systems to identify high-risk COVID-19 positive patients who need timely health care. This study was conducted to predict hospitalization of older adults who have tested positive for COVID-19.

METHODS: We screened all patients with COVID test records from 11 Mass General Brigham (MGB) hospitals to identify the study population. A total of 1,495 patients aged 65 and above from the outpatient setting were included in the final cohort, of whom 459 were hospitalized. We conducted a clinician-guided, three-stage feature selection and phenotyping process using iterative combinations of literature review, clinician expert opinion, and EHR data exploration. A list of 44 features, including temporal features, was generated from this process and used for model training. Four machine learning prediction models were developed: regularized logistic regression, support vector machine, random forest, and neural network.

RESULTS: All four models achieved AUC greater than 0.80. Random forest achieved the best predictive performance (AUC = 0.83). Albumin, an index of nutritional status, was found to have the strongest association with hospitalization among COVID positive older adults.

CONCLUSIONS: In this study, we developed four machine learning models for predicting general hospitalization among COVID positive older adults. We identified important clinical factors associated with hospitalization and observed temporal patterns in our study cohort. Our modeling pipeline and algorithm could potentially be used to facilitate more accurate and efficient decision support for triaging COVID positive patients.

The recent outbreak of coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern (PHEIC) by the World Health Organization on January 30, 2020 [1]. As of December 2021, more than 275 million COVID-19 cases have been confirmed worldwide, and over 5.35 million people have died [1]. In the United States, around 97,000 people are currently in the hospital due to COVID-19 [2]. The high volume of patients during the pandemic has placed unprecedented pressure on healthcare systems. Many hospitals in the United States are over capacity due to limited clinical resources, including beds, intensive care units (ICUs), and ventilators, which are crucial for the treatment of COVID-19 patients with severe symptoms [3, 4]. Older patients are the most susceptible to severe illness and have a higher mortality rate [5]. It is critical for clinicians and hospitals to provide appropriate clinical care to patients in the right setting, e.g., home for less severe COVID-19 cases, while reserving hospital beds for the more severe cases requiring acute intervention.
Given potentially large infected populations during future pandemic waves, it is important to develop accurate and efficient clinical decision support for triaging COVID positive patients before they are admitted to the hospital [18]. There is an urgent need for an individual-level risk prediction model that can identify people in need of hospitalization in order to optimize this limited clinical resource. With the wide adoption of electronic health record (EHR) systems, machine learning predictive models have great potential to leverage the large volume of data and provide tools to support medical decision making [6]. An EHR-based predictive tool can improve COVID patient care by facilitating an informed, proactive decision-making process, which can be particularly useful in managing large populations [19].

During the pandemic, many studies were conducted to develop machine learning-based models to predict COVID-19 disease progression [7]. However, most studies focused on predicting severe adverse outcomes (such as ICU admission, mechanical ventilation, and death) among hospitalized COVID-19 patients [8-10]. Few studies have focused on predicting hospitalization among patients with confirmed COVID-19, and the patient cohorts used in those studies are now relatively old, spanning March 2020 to October 2020 [11-18]. Jehi et al. developed a robust individualized prediction model among COVID positive patients using data from 2020 and identified important risk factors for hospitalization [14]. They also provided strategies to integrate the model into the clinical workflow and to link their informatics findings with clinical practice. Given the rapid progress of the pandemic, studies with newer datasets could be useful to reflect current clinical status and to further validate and expand previous work.

In this study, we developed and validated machine learning-based prediction models, using more recent EHR data (March 2020 to May 2021) from the Mass General Brigham (MGB) Health system, to estimate hospitalization in confirmed cases. The outcome of the models is whether older COVID-19 positive patients are hospitalized within 14 days of the COVID-19 positive test date. Unlike Jehi's study population, we focused on patients aged 65 and above, particularly because older adults with multiple comorbidities have been shown to be at increased risk. Our current goal is to predict hospitalization among COVID-19 positive patients using existing EHR information in an outpatient population. The cause of hospitalization is very likely to be COVID-19, as a hospital policy was in place at the time of the study limiting elective services (e.g., non-emergency procedures and surgeries were cancelled or postponed) to ensure adequate beds were available for COVID patients. With further validation and optimization, our long-term goal is to provide a testable framework for subsequent validation and refinement towards a comprehensive prediction system that facilitates timely and appropriate care for COVID-19 patients.

We used clinical databases within the MGB Healthcare system, which has a centralized clinical data warehouse for all types of clinical information from multiple Harvard-affiliated hospitals. Available data items include patient demographics, diagnoses, procedures, medications, laboratory tests, inpatient and outpatient encounter information, and provider data. For the current study, clinical data from 11 MGB hospitals were included.
Using the MGB database, we collected all patients with COVID test records (304,113 patient visits). We further identified 11,348 patients aged 65 and above with a positive COVID test result. We then removed inpatients, leaving 6,765 patients. After the removal of patients with high missing values and low data quality, 1,495 patients remained in the final study cohort (Figure 1).

We identified risk factors for severity of COVID-19 manifestation using an iterative combination of literature review, qualitative methods (interviews with clinical experts: physicians with experience in treating COVID-19 patients), and EHR data exploration (clinical data review and feature engineering). Our study design used the COVID test date as an objective proxy (index date) for the disease onset time. We used a one-month time window: the model outcome is whether a COVID positive patient (age 65+) is hospitalized within 2 weeks before or 2 weeks after the COVID test date (Table 1).

To address overfitting, we tuned the models through cross-validation to select the best set of parameters and evaluated their performance on an independent test set. Before training any of the models, we randomly split the data into an 80% training set and a 20% test set. To ensure that our results would be generalizable, we repeated this random splitting process 30 times and reported the average model performance on the test set over the 30 splits. For a given split, we further divided the 80% training data into 5 equal-sized folds (stratified by class to ensure the minority class was present in equal proportion across all folds for the hospitalization outcome). We trained the model on 4 folds and evaluated its performance on the fifth fold (validation set). We repeated this process 5 times, with a different fold serving as the validation set each time. Model performance was averaged across the 5 folds to determine the best hyperparameters. To evaluate model performance, we used the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, precision, and F1 score. All metrics were calculated on the test set. The mean and standard deviation of each evaluation metric across the 30 test sets were reported.
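To make this evaluation protocol concrete, a minimal sketch is shown below. It assumes a preprocessed feature matrix X (the 44 model inputs) and a binary outcome vector y (1 = hospitalized within the four-week window). The study's own code is not published, so the random forest and its hyperparameter grid here are illustrative stand-ins rather than the exact models and settings used.

```python
# Sketch of repeated 80/20 splits with stratified 5-fold CV for tuning,
# then evaluation of the tuned model on the held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split


def repeated_split_evaluation(X, y, n_repeats=30):
    """Repeat the 80/20 split, tune on 5 stratified folds, report mean test AUC."""
    test_aucs = []
    for seed in range(n_repeats):
        # 80% training / 20% test split, stratified on the outcome
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.20, stratify=y, random_state=seed)

        # Stratified 5-fold cross-validation on the training data to pick
        # hyperparameters (each fold serves once as the validation set)
        folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed),
            param_grid={"n_estimators": [200, 500], "max_depth": [None, 5, 10]},
            scoring="roc_auc", cv=folds)
        search.fit(X_train, y_train)

        # Evaluate the tuned model on the held-out test set
        y_score = search.predict_proba(X_test)[:, 1]
        test_aucs.append(roc_auc_score(y_test, y_score))

    return float(np.mean(test_aucs)), float(np.std(test_aucs))
```

The same loop structure would apply to the other model types, with the estimator and parameter grid swapped accordingly, and with the remaining metrics (accuracy, sensitivity, specificity, precision, F1) computed on the same test sets.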
In the final study cohort, the case group (n=459) had a record of hospitalization (with a hospital stay of more than 24 hours after admission) during the four-week window surrounding the COVID-19 test date (14 days prior to and 14 days after the COVID-19 test), and the control group (n=1,036) had no record of hospitalization (Figure 1). Similar age distributions were observed between the case and control groups, although the case group was slightly older, more likely to be male, and slightly more likely to be smokers (Supplementary Table 2).

We conducted a three-stage, clinician-guided feature engineering process (Table 2). First, we generated a list of 80 variables based on previous studies of COVID severity models; all of these variables had been used as predictors at least once. Second, we modified this list based on feedback from an online survey of 36 clinicians and generated a list of 45 variables. Third, through a clinician group interview and a data quality assessment in the MGB database, we developed a final list of 29 variables, including demographics (age, gender), vital signs (such as SpO2 and temperature), lab tests (such as albumin), and chronic diseases (such as respiratory disease and heart failure) (Table 1 and Supplementary Table 3).

We extracted the duration of hospital stay from the database and, based on clinicians' recommendation, used a hospital stay of more than 24 hours as the definition of "hospitalization" (Table 1). We investigated the distributions of case and control patients by binning the cohort by month (15 months in total) (Figure 3). We observed a similar distribution of case and control patients over time. There were two peaks, in 04/2020 and 12/2020, which could reflect the accumulated infection trend in our study cohort. In addition, we created temporal variables to include in model training. Specifically, based on the COVID test date, each month of the study duration (03/2020 to 05/2021, 15 months) was encoded as a binary variable for each patient: 0 for patients not tested in that month and 1 for patients tested in that month (Table 1). We combined the 29 clinical variables and the 15 temporal variables (44 variables in total) as the model input.

All predictive models achieved AUC > 0.80 (Table 3), indicating that good prediction was achieved with all four models; random forest had the highest AUC (0.83), while the other three models, including SVM, had AUCs of 0.81-0.82. We did not observe a statistically significant difference in AUC among the four models. Among the statistically significant factors associated with hospitalization in the logistic regression model (Table 4), albumin, a plasma protein that is an important index of nutritional status, was found to have the greatest impact on the outcome. Multiple EHR variables, including vital signs, lab values, and chronic diseases, were also important for the prediction. Two temporal variables, 08/2020 and 05/2021, were also identified as important associated factors.
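As an illustration of how such factor rankings can be surfaced from a regularized logistic regression, the sketch below standardizes the inputs, fits a penalized model, and sorts coefficients by magnitude. This is not the authors' analysis code; the L2 penalty, the pipeline structure, and the column names are assumptions made for illustration only.

```python
# Sketch: rank candidate factors by the magnitude of standardized
# logistic regression coefficients (illustrative, not the study's code).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rank_associated_factors(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Return model coefficients sorted by absolute magnitude (strongest first)."""
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(penalty="l2", max_iter=1000))
    model.fit(X, y)
    coefs = pd.Series(model[-1].coef_[0], index=X.columns)
    # On standardized inputs, the sign indicates the direction of association;
    # a negative coefficient (e.g., for albumin) would suggest a protective effect.
    return coefs.reindex(coefs.abs().sort_values(ascending=False).index)
```

The returned series gives one coefficient per clinical or temporal variable, which is the kind of ranking that a table such as Table 4 summarizes.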
In this study, we developed a clinician-guided machine learning predictive algorithm to identify high-risk COVID positive patients, using EHR-defined "hospitalization" as the outcome and EHR-derived variables as the input. We used a multi-hospital study cohort and an iterative feature engineering process for model development. Our final input variable list was based on consideration of previous studies, expert opinion, and the quality of the EHR-based clinical dataset. Feedback from clinicians played an important role in optimizing the input variables for model training. For example, suggestions such as "fever (especially prolonged fever) may be an early indicator" and "a measure for frailty/vulnerable baseline state (such as albumin) would be useful" helped us narrow down and finalize the feature list. More importantly, during the survey and interviews with clinicians, valuable knowledge of the practical advantages and disadvantages of each input variable was carefully reviewed. Since we focused on the outpatient setting, less information is available and data quality is less stable than in inpatient settings; selecting useful and clinically available features was therefore an important task during our feature engineering process and is reflected in the final feature list. In general, guidance from the clinical team enabled our model to represent real clinical settings and significantly improved our models' performance and practical value.

One important goal of our study was to identify risk factors for severe COVID patients. Our results suggest that albumin could be a strong protective factor associated with hospitalization risk, which is consistent with previous studies [19-21]. Our study further validated that albumin is an important factor associated with COVID-related hospitalization, along with SpO2 and temperature. Since albumin tests are commonly conducted and can be easily obtained from EHR datasets [22], albumin can potentially serve as a useful marker for severe COVID patients. Because all of our input variables are routinely available patient features, we expect that the algorithm can be adjusted and applied in other health care systems in the current and potential future pandemics. Moreover, we expect the proposed algorithm could help health care providers identify, among COVID-19 positive patients in the community, those at high risk who need timely inpatient services, in order to optimize the use of limited clinical resources.

Gaining a deeper understanding of "Long COVID", a term generally used to describe the long-term effects of COVID infection, is an increasingly important topic [23, 24]. "Hospitalization", as an objective and comprehensive indication of patient status, could be a very useful phenotype for this kind of study. Also, how to utilize fast-growing, large-scale clinical datasets to develop practical COVID tools is a challenge for the informatics field. We believe that the integrated model leveraging informatics and clinical components presented in this study provides a useful framework for other researchers and future studies. For example, databases with national-level information (e.g., the National COVID Cohort Collaborative (N3C)) [25] would be a good target for further testing and improving the algorithm in the future.

In addition, a better understanding of the temporality of COVID is a very important topic. Different COVID strains may have played a role in pandemic progression, and the mechanisms of the disease could have changed over time. We did not include COVID strain information in this study due to the lack of data, but this is an important topic for future research. We did, however, explore the temporal patterns of the disease from two angles: 1) we visualized the trend of hospitalization and identified two peak times during the study period, consistent with the trend of accumulated infection events in the MGB system; and 2) we included time components in the prediction model and estimated their strength in predicting hospitalization. For instance, two particular months were important associated factors, and both corresponded with downward periods of the COVID infection trend. This could be informative for future studies seeking a better understanding of the relationship between time and COVID-related outcomes. For example, in potential future waves, 3 to 6 months after onset may be an appropriate time point for developing severity models.
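The time components referenced here are the per-month indicator variables described above. A minimal sketch of that encoding is shown below; it assumes a pandas DataFrame named cohort with one row per patient and a datetime column covid_test_date, both hypothetical names, and is an illustration rather than the authors' actual code.

```python
# Sketch: encode each study month as a binary indicator per patient.
import pandas as pd


def add_month_indicators(cohort: pd.DataFrame,
                         start: str = "2020-03", end: str = "2021-05") -> pd.DataFrame:
    """Add one binary column per study month: 1 if the patient's COVID test
    fell in that month, 0 otherwise (15 columns for 03/2020 through 05/2021)."""
    test_month = cohort["covid_test_date"].dt.to_period("M")
    for month in pd.period_range(start=start, end=end, freq="M"):
        cohort[f"tested_{month}"] = (test_month == month).astype(int)
    return cohort
```

The resulting indicator columns correspond to the 15 temporal variables included alongside the 29 clinical features in the model input.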
The current study has several limitations. First, our study focused on patients aged 65 and above; a large number of COVID positive patients aged 64 and below were removed during the cohort development stage. Also, we focused on the outpatient setting. Many features in the initial feature list (Table 1) have high missing rates for outpatients, for example medication records (hospitalized patients had more complete records). Therefore, we did not include medications and other data types with high missingness in our model. We also removed a significant number of patients with high missing values to develop the final study cohort, which gave us a high-quality dataset for model training but could also introduce a certain level of bias into our study population. Second, only structured data were used for the current study; unstructured data (clinical notes) may provide additional predictive power. Third, we did not observe a statistically significant difference in AUC among the four models. The observation that more advanced machine learning models did not outperform regularized logistic regression could be due to the relatively small sample size and the presence of a few strongly predictive features (e.g., albumin, SpO2, and temperature), under which the power of machine learning models in handling large and complex data could not be fully leveraged. Fourth, our study cohort mainly comes from a pre-vaccine period (March 2020 to May 2021), so we did not include vaccine status in our model. Due to this characteristic of the study population, this model will be more closely applicable to a non-vaccinated population. In addition, studies have shown that different COVID variants can respond differently to current vaccines; Omicron is about 2.7-3.7 times more infectious than Delta in vaccinated and boosted people [26]. This will be an important topic for our future studies as more vaccination data become available. Fifth, we extracted both the model outcome (hospitalization) and time-sensitive input features (lab values) from the four-week time window surrounding the COVID test date, leading to the possibility that the outcome might precede the inputs. This could bias the relationship between input and outcome in our model and limit its prediction ability. Our current model design is based on known limitations of outpatient EHR datasets and the undetermined relationship between the COVID test date (known) and the actual COVID onset time (unknown). Several data challenges existed over the course of the pandemic, including limited understanding of COVID progression and the incubation period, limited testing capacity and delayed results reporting, limited medical resources, and inconsistent workflows and EHR documentation patterns in outpatient settings. These limitations are not unique to our project but rather reflect the availability of testing and the speed at which tests were processed at different periods during the COVID pandemic. Based on available data, we used the test date to approximate the onset time; the difference between these two time points can vary depending on COVID incubation periods and testing systems. During our study period (especially in the early days of the pandemic), COVID tests were difficult to get and it took relatively longer for patients to get results after testing.
Therefore, some patients were sick with COVID but were not tested until they were hospitalized. Today COVID tests are widely available and patients can get results within 24 hours. However, with increasing home-based testing, this may remain a problem for EHR datasets, since the results of home tests are not consistently reported; a COVID positive patient (home test) may not have an EHR-documented COVID test result until hospitalization. We used a 14-day time window on each side of the COVID test date to capture the potential COVID incubation period. This choice is based on clinical expert opinion and published standards (WHO: 0-14 days; ECDC: 2-12 days) [27]. We used a fixed time window to standardize the model pipeline. As part of our future studies, we are working on improving the model by refining the input and outcome time windows and using newer datasets to make the algorithm more accurate. This also requires a deeper understanding of the basic science of COVID, which is slowly evolving; such knowledge will lead to a better definition of the COVID onset time point. Lastly, the current study uses data from the MGB site only, which could limit the generalizability and utility of the final algorithms. We are currently conducting the second stage of this project by obtaining comparable data from an independent site, which will allow testing of the generalizability of this algorithm on an extended dataset and further validation of the key features identified here.

In the current study, we developed prediction models for general hospitalization among older COVID positive patients. Our input variables are routinely available patient features, and the model development process was guided by clinicians.

World Health Organization, Coronavirus disease (COVID-2019) situation reports.
Coronavirus Pandemic (COVID-19).
Projecting hospital utilization during the COVID-19 outbreaks in the United States.
The high volume of patients admitted during the SARS-CoV-2 pandemic has an independent harmful impact on in-hospital mortality from COVID-19.
COVID-19 and the elderly: insights into pathogenesis and clinical decision-making.
Predicting the Future - Big Data, Machine Learning, and Clinical Medicine.
Role of Machine Learning Techniques to Tackle the COVID-19 Crisis: Systematic Review.
Predictive Modeling of Morbidity and Mortality in Patients Hospitalized With COVID-19 and its Clinical Implications: Algorithm Development and Interpretation.
Federated Learning of Electronic Health Records Improves Mortality Prediction in Patients Hospitalized with COVID-19.
Early risk assessment for COVID-19 patients from emergency department data using machine learning.
An observational study to develop a scoring system and model to detect risk of hospital admission due to COVID-19.
Understanding Demographic Risk Factors for Adverse Outcomes in COVID-19 Patients: Explanation of a Deep Learning Model.
Individualized prediction of COVID-19 adverse outcomes with MLHO.
Development and validation of a model for individualized prediction of hospitalization risk in 4,536 patients with COVID-19.
Predictability of COVID-19 Hospitalizations, Intensive Care Unit Admissions, and Respiratory Assistance in Portugal: Longitudinal Cohort Study.
Early prediction of level-of-care requirements in patients with COVID-19.
Algorithm for Individual Prediction of COVID-19-Related Hospitalization Based on Symptoms: Development and Implementation Study.
Personalized predictive models for symptomatic COVID-19 patients using basic preconditions: Hospitalizations, mortality, and the need for an ICU or ventilator.
The Prognostic Effect of Serum Albumin Level on Outcomes of Hospitalized COVID-19 Patients.
Is Albumin Predictor of Mortality in COVID-19?
Derivation of a Clinical Risk Score to Predict 14-Day Occurrence of Hypoxia, ICU Admission, and Death Among Patients with Coronavirus Disease.
Long COVID: An overview.
More than 50 Long-term effects of COVID-19: a systematic review and meta-analysis.
The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment.
Omicron infection of vaccinated individuals enhances neutralizing immunity against the Delta variant.

The clinical dataset generated and/or analyzed during the current study is not publicly available due to patient privacy and IRB regulations. There is no sponsor role in the manuscript.