key: cord-1047632-ozmkmy6a
authors: King, Z.; Farrington, J.; Utley, M.; Kung, E.; Elkhodair, S.; Harris, S.; Sekula, R.; Li, K.; Crowe, S.
title: Machine Learning for Real-Time Aggregated Prediction of Hospital Admission for Emergency Patients
date: 2022-03-10
journal: nan
DOI: 10.1101/2022.03.07.22271999
sha: 77a1283daeae00b833d216d225c68d11bfd9768a
doc_id: 1047632
cord_uid: ozmkmy6a
license: CC-BY-NC-ND 4.0 International

Machine learning for hospital operations is under-studied. We present a prediction pipeline that uses live electronic health-records for patients in a UK teaching hospital emergency department (ED) to generate short-term, probabilistic forecasts of emergency admissions. A set of XGBoost classifiers applied to 109,465 ED visits yielded AUROCs from 0.82 to 0.90 depending on elapsed visit-time at the point of prediction. Patient-level probabilities of admission were aggregated to forecast the number of admissions among current ED patients and, incorporating patients yet to arrive, total emergency admissions within specified time-windows. The pipeline gave a mean absolute error (MAE) of 4.0 admissions (mean percentage error of 17%) versus 6.5 (32%) for a benchmark metric. Models developed with 104,504 later visits during the Covid-19 pandemic gave AUROCs of 0.68-0.90 and MAE of 4.2 (30%) versus a 4.9 (33%) benchmark. We discuss how we surmounted challenges of designing and implementing models for real-time use, including temporal framing, data preparation, and changing operational conditions.

To date, most applications of Artificial Intelligence (AI) in healthcare have addressed clinical questions at the level of individual patients 1 . Now that many hospitals have electronic health records (EHRs) and data-warehouse capabilities, there is the potential to exploit the promise of AI for operational purposes 2 .
Hospitals are highly connected systems in which capacity constraints in one area (for example, a lack of ward beds) impede the flow of patients from other locations, such as the emergency department (ED) 3 or intensive care units with patients ready for discharge 4 . Arrivals to the ED show diurnal and seasonal variations, with predictable peaks in the morning and early evening, but workflows elsewhere in a hospital mean that discharges from the hospital happen late in the day, creating flow problems 5 . This mismatch of cadence between different parts of the hospital results in patients boarding in the ED, or being admitted to inappropriate wards, with adverse consequences including longer stays 6 , greater risk of medical errors 7 and worse long-term outcomes in elderly patients 8 . Hospital services can be managed more efficiently if accurate short-term forecasts of emergency demand are available 9,10 . Currently, most hospitals use simple heuristics to make short-term forecasts of the number of emergency admissions, based on rolling averages for each day of the week 11 . Scholars have suggested improvements using Bayesian approaches or auto-regressive integrated moving averages with meteorological, public-health and geographic data 9,12,13 . However, such methods do not take account of the stochastic nature of ED arrivals 14 and cannot be adapted to reflect the case mix of people in the ED at a given point in time. In hospitals with EHRs, where staff record patient data at the point of care, there is an opportunity to use EHR data to generate short-horizon predictions of bed demand. These would help the teams responsible for allocating beds make best use of available capacity and reduce cancellations of elective admissions. ML is attractive for such predictions because its aggregation of weak predictors may create a strong prediction model 2 . Emergency medicine scholars have compared predictions made
by ML algorithms against conventional approaches like linear regression and naïve Bayes 10,15 . It is common for such studies to use arrival characteristics (e.g. arrival by ambulance or on foot), triage data and prior visit history 16-18 to make predictions, although recent studies have included a wider variety of data captured by EHRs, including medical history, presenting condition and pathology data 10,19-21 . Hong et al 10 showed that ML algorithms like gradient-boosted trees and deep neural networks, applied to a large EHR dataset of 972 variables, improved predictive performance. By including data on lab-test results and procedures, El-Bouri et al 21 were able to predict which medical specialty patients would be admitted to. Barak-Corren et al's study 19 is one of few in emergency medicine to address the challenges of making predictions during a patient's visit to the ED. They built progressive datasets from historical data, each intended to reflect the data usually available at 10, 60 and 120 minutes after presentation to the ED. Notwithstanding their use of chief-complaint data that was entered by ED receptionists as free text and retrospectively coded by the researchers, they were able to show that the later datasets offered better predictions than the one at 10 minutes. Their study demonstrates the potential that EHRs offer for improving on approaches that use triage data only. Although these studies demonstrate the predictive utility of ML, they do not unlock its potential to generate predictions in real time to help managers address problems of patient flow.
Building a model for implementation involves several additional challenges beyond those encountered when simply optimising the technical performance of a prediction model. Among these is the fact that data flows vary greatly from patient to patient in the ED. A patient in the resuscitation area of an ED may have frequent observations, while a patient in the waiting room has no data collected. These heterogeneous data profiles are themselves indicative of likelihood of admission. From the bed planners' point of view, knowing the probability that a particular patient will be admitted is less valuable than knowing in aggregate how many patients to plan for. In this respect, a prediction tool that can provide a probability distribution for the number of admissions in a given time frame is more useful than one that solely estimates probability of admission at the patient level. One study in emergency medicine derived an expected number of admissions among a roomful of patients in the ED by summing their individual probabilities of admission 28 , but did not present the uncertainty of its point estimates. Also, when making predictions for admissions within a time window after the prediction is made, projections must allow for the number of patients not in the ED at the prediction time who will arrive and be admitted within the window 29 . If models are to be used operationally, their performance needs to be sustained over time as care provision, patient characteristics and the systems used to capture data evolve 24 . Real-time operational models also need to cover the 'last mile' of AI deployment; that is, the applications that generate predictions must run end-to-end without human intervention.
This last mile is the most neglected 30 , leading to calls for a delivery science for AI, in which AI is viewed as an enabling component within an operational workflow rather than an end in itself 31 . This research aimed to harness the heterogeneous stream of real-time data coming from patients in the ED of a UK hospital to make predictions of aggregate admissions over a short time horizon. Bed planners at the hospital were closely involved with the research team to specify their requirements. They requested predictions of bed requirements in the next four and eight hours, sent four times daily to coincide with their own capacity reporting. As part of the project, we developed an application that formats and sends an email to the bed planners at the four report times. In this paper, we explain how the predictions are generated, evaluate their performance and compare them with standard benchmarks. The contributions of the research are: the development and deployment of a ML-based information product in use in hospital operations; the demonstration of a method to train ML models for real-time use when patient-level data is variable between patients and over the course of individual visits; the incorporation of a method to aggregate individual-level predictions for operational planning purposes; and an exposition of some of the challenges associated with developing models for real-time implementation.
[Figure 1 caption, partial:] ... a set of ML models. These are combined in panel c into a probability distribution for the number of admissions among this roomful of patients. Panel d shows the probability of admission within 4 hours, calculated from recent data on time to admission and taking into account the time each patient has been in the ED up to the prediction time. Panel e shows a distribution over the number of admissions among the roomful of patients within the 4-hour prediction window. Panel f shows a probability distribution over the number of patients who have not yet arrived but will be admitted within the prediction window, generated by a Poisson equation. Panel g shows the final probability distribution for the number of admissions within the prediction window.

Figure 1 illustrates a real example of predictions generated at 16:00 on 11 May 2021 using the seven-step pipeline built through this work. As noted above, the bed planners wanted these predictions at four times daily (06:00, 12:00, 16:00 and 22:00). The following paragraphs present an evaluation of the predictions made at the four prediction times on a test set of 97 days from 13 December 2019 to 18 March 2020. At each prediction time, EHR data on the set of patients in the ED was retrieved (Step 1). At Step 2, a ML prediction was made for each patient about their probability of admission. At Step 3, the individual probabilities were combined to give a probability distribution for the number of admissions from the patients currently in the ED.
At Step 4, the individual probability of admission for each patient was combined with survival analysis to give, for each patient, the probability that they would be admitted within the prediction window, accounting for when they arrived and the number of patients in the ED when they arrived. At Step 5, the individual probabilities from Step 4 were combined to give a probability distribution for the number of admissions within the prediction window from patients currently in the ED. At Step 6, Poisson regression was used to give a probability distribution for the number of additional patients that would arrive and be admitted within the prediction window. Finally, at Step 7, the distributions obtained at Steps 5 and 6 were convolved to give a probability distribution for the total number of admissions within the prediction window by patients currently in the ED and others yet to arrive. The most important features for admission prediction selected by the XGBoost classifier within each model are shown in Figure 2a. The later models relate to smaller numbers of visits (Figure 2b), and their case mix includes a higher proportion of more clinically complex cases which are harder to predict.
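The aggregation in Steps 3, 5 and 7 can be sketched as follows. This is a minimal Python illustration (the study itself used R with mlr3), assuming patient outcomes are independent: `admissions_distribution` builds the Poisson-binomial distribution over the number of admissions by convolving one Bernoulli outcome at a time, and `convolve` combines it with a truncated Poisson distribution for patients yet to arrive. The function names are illustrative, not the authors' code.

```python
import math

def admissions_distribution(probs):
    """Steps 3/5: distribution of the number of admissions among the
    patients currently in the ED, given each one's admission probability."""
    dist = [1.0]  # probability of 0 admissions among 0 patients
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)   # this patient is not admitted
            new[k + 1] += mass * p     # this patient is admitted
        dist = new
    return dist

def poisson_pmf(lam, max_k):
    """Step 6: truncated Poisson pmf for not-yet-arrived admissions."""
    return [math.exp(-lam) * lam ** k / math.factorial(k)
            for k in range(max_k + 1)]

def convolve(a, b):
    """Step 7: distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out
```

For example, `admissions_distribution([0.9, 0.2, 0.6])` returns a four-point distribution whose mean is exactly 0.9 + 0.2 + 0.6 = 1.7 beds, and convolving it with `poisson_pmf(lam, max_k)` yields the final distribution over total admissions within the window.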
For simplicity of presentation, a feature is excluded from the figure if it had a raw importance of less than 0.01 in all models. Figure 2b shows the number of visits, admission proportion and performance of each model. See Supplementary Table S2 for a glossary of features and Supplementary Note 6 for an equivalent analysis of later visits during the Covid-19 pandemic. Calibration plots for each ML model, applied to all visits in the test set, are shown in Figure 3. All models are well calibrated, except for the final two models, which relate to a very small subset of visits where patients remain in the ED after 8 hours. From visual inspection of the QQ plots, there is very good concurrence between the predicted distributions and observations after Step 3. After Step 5, concurrence remains good, although (especially for an eight-hour prediction window) the predicted distributions slightly underestimate the number of admissions within the prediction window, suggesting that patients were taking less time to be admitted than predicted. Similar concurrence is observed after Step 7. Overall, the pipeline's predictions were able to improve on the benchmark.
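A calibration check of the kind plotted in Figure 3 can be computed by binning predicted probabilities and comparing each bin's mean prediction with the observed admission rate. The sketch below is a generic illustration of that technique (ten equal-width bins, hypothetical function name), not the authors' implementation.

```python
def calibration_table(pred_probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins and report
    (mean predicted probability, observed admission rate, count) per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(pred_probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:  # skip empty bins
            mean_p = sum(p for p, _ in b) / len(b)
            obs_rate = sum(y for _, y in b) / len(b)
            table.append((round(mean_p, 3), round(obs_rate, 3), len(b)))
    return table
```

A well-calibrated model produces rows where the first two entries are close; sparse bins (like the final two models' small subsets) yield noisy observed rates.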
The results were achieved using only data that are available for inference in real time. The predictions were based on ML models with performance equivalent to or better than that reported in other studies, such as Barak-Corren et al's 19 using logistic regression. Historical data were used to train the models, and the survival curves are updated nightly to use only time-to-admission data from the last six weeks. The need for post-hoc adjustment of models reflects wider difficulties with drift for all predictive applications that learn from past data, including but not limited to those using ML. Methods have been proposed for dealing with non-stationary learning problems 35,36 . Here a sliding window was applied to one step of the model, but changing patterns of ED presentations and/or changing operational practice may introduce drift or sudden changes in model performance, necessitating continuous monitoring. The modular nature of the aggregation made it easy to swap out the component that changed and retrain. No model of an evolving system can be expected to predict accurately in perpetuity, so some human action to monitor and revise models is acceptable and indeed expected by patients 37 . Nonetheless there is evidence for the robustness of the models. The ML models draw on similar features before and during Covid (see Supplementary Note 6), suggesting that the signals of likelihood of admission remain somewhat consistent. The hospital where this work was conducted is urban, with a student and commuter case mix and no major trauma centre. As ED organisation varies between hospitals (especially location and pathways), the cross-site generalisability of the findings may be restricted to similar hospitals.
Aside from those based on location, however, the risk factors for admission found here are familiar in ED practice and therefore likely to be common to other sites 33 . There are limitations to this study. Some features used in other studies 19,21 are not available here, particularly chief complaint and imaging. Because presentations at the ED are seasonal, models ideally have more than one complete annual cycle to learn patterns from 12 ; in fact the two years in this study were very different, affected by organisational changes and the impact of the Covid-19 pandemic. This study was novel in two respects. From a research point of view, the use of real-time EHR data is new in the literature on emergency admissions, as is the aggregation of patient-level predictions into data for operational use. In future studies, researchers may find it useful to deploy the prediction pipeline proposed here in healthcare settings with sparse and heterogeneous flows of patient data, such as outpatient clinics. Second, it presented an information product in use, co-designed with bed managers and with continued, ongoing iterations to meet their needs. Few published studies complete the 'last mile' of AI deployment by reporting on models in production 38 . Here we demonstrate an application of ML in that last mile, providing an example of how ML for healthcare will need to be delivered if it is to become a dependable and reliable tool. This work draws attention to some challenges that make technically high-performing systems perform poorly for their intended use, including model drift, and considers how to address them. Ultimately, of course, crowding in EDs is an outcome of system-wide issues downstream, such as bottlenecks and capacity constraints 7,39 .
AI does not do anything on its own; to succeed, it must be connected to real-world processes 40 . The source of the data is HL7 messages generated by Epic, the hospital's EHR system. These are captured as they are issued and stored in EMAP, a PostgreSQL relational database that is kept up to date with a latency of less than 5 minutes. The database records a subset of the full patient record, including observations, pathology orders and results, location of patients, consult requests, and a summary of prior visit history. Data were analysed with R version 4.0.0 using mlr3 packages 41 to manage the ML pipeline. The real-time application runs on a security-enhanced Linux machine within the hospital network. The study was deemed exempt from NHS Research Ethics Committee review as there is no change to treatment or services or any study randomisation of patients into different treatment groups. It was considered a Service Evaluation according to the NHS Health Research Authority decision tool (http://www.hra-decisiontools.org.uk/research/). The data include all inpatient and emergency visits involving an ED location from 1 May. Visits were added to training, validation and test sets chronologically, with the test set containing the most recent 20% of days and the validation set the 10% before that, as shown in Supplementary Figure S3. This temporal split avoids any leak of future information into the past, and allows for a fairer test of the problems with temporal drift that commonly arise in real-time implementation 24 .
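The chronological split described above can be sketched as follows: the most recent 20% of days form the test set and the 10% of days before that form the validation set, so no future information leaks into training. This is an illustrative Python sketch (the study used R); the function name and fractions' parameterisation are assumptions.

```python
from datetime import date, timedelta

def chronological_split(visit_days, test_frac=0.20, val_frac=0.10):
    """Split a collection of visit days into train/validation/test sets
    chronologically, with the test set holding the most recent days."""
    days = sorted(set(visit_days))
    n = len(days)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train_days = set(days[: n - n_test - n_val])
    val_days = set(days[n - n_test - n_val : n - n_test])
    test_days = set(days[n - n_test :])
    return train_days, val_days, test_days
```

Every training day precedes every validation day, which precedes every test day, mimicking how the model would actually be used: trained on the past, evaluated on the future.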
See Supplementary Table S1 for a detailed explanation of how the training, validation and test sets were used to prepare models and regression equations, and to evaluate the predictions, at each step of the pipeline. Our pipeline was designed to generate bed-level predictions from real-time patient-level data streams. We have four prediction times in the day and use data from an observation window to make predictions about the number of admissions in prediction windows of 4 and 8 hours after each prediction time (italics refer to the terminology of Lauritsen et al 23 ). We constructed the aggregate predictions in a series of seven steps (see Figure 1). Figure 6 shows the temporal detail for each step at a hypothetical moment when four patients were in the ED at the prediction time, and an unknown number of patients could be expected to arrive after the prediction time and be admitted within the prediction window. The steps are:

Step 1: Retrieve real-time data on the patients currently in the ED
Step 2: Get each visit's probability of admission
Step 3: Generate a probability distribution for the total number of beds needed
Step 4: Get each visit's probability of being admitted within the prediction window
Step 5: Generate a probability distribution for the total number of beds needed within the prediction window
Step 6: Generate a Poisson distribution for patients yet to arrive
Step 7: Generate a probability distribution for the total number of beds needed within the prediction window, including patients yet to arrive

[...] clinically associated with acute illness. Pathology features were summarised by counting the number of out-of-range results that were higher, and the number that were lower, than the relevant target range for the patient. A glossary and descriptive statistics for features used in the T0, T90 and T240 models for the pre-Covid period are in Supplementary Table S2.
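The pathology summarisation described above can be illustrated with a short sketch. The feature names and the tuple layout of the results are assumptions for illustration; only the counting logic (results above and below the patient's target range) comes from the paper.

```python
def pathology_features(results):
    """Summarise pathology results as two counts: the number of values
    above, and the number below, the patient's target range.

    `results` is a list of (value, range_low, range_high) tuples.
    """
    above = sum(1 for value, low, high in results if value > high)
    below = sum(1 for value, low, high in results if value < low)
    # Hypothetical feature names for the downstream classifier.
    return {"num_results_above_range": above,
            "num_results_below_range": below}
```

Collapsing variable-length lab panels into fixed-size counts like this lets patients with very different volumes of pathology data share one feature space.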
Supplementary Table S3 has descriptive statistics for the T90 model in the various periods analysed (pre-Covid, during Covid and post-SDEC). Supplementary Figure S4 shows how a Covid surge feature was derived for the periods after the Covid outbreak.

Figure 7: Temporal framing of ML models. A series of left-aligned datasets was constructed for training, as shown in a. In each, the observation window began at the patient's arrival time in the ED. Successive models were trained on longer observation windows. Model T0 had a zero-minute cutoff, i.e. it was given only the data known at the arrival time. Model T15 was trained on any data known up to 15 min. Thus visit A, which lasted less than 15 min, only appears in T0. Model T30 was trained on any data known up to 30 min, so would only include data from visits D, E and F. To make predictions, right-aligned datasets of all patients in the ED at the time of prediction were created, as shown in b. For each patient, the elapsed time of their visit determined which model was used to predict their probability of admission. Visit I began just less than 30 minutes before the prediction time, so this would be predicted by model T15. Visit K began just more than 30 minutes before the prediction time, so model T30 would be used for this visit.

An XGBoost classifier was trained on each of the 12 datasets using a binary logistic loss function to generate a probability of admission. XGBoost was chosen for its efficiency and capability of processing large datasets, its ability to handle missing data and imbalanced datasets, and (relative to other ML algorithms) its interpretability. Ten-fold cross-validation was used to find hyperparameters for each model, optimising for log loss, and the validation set was used to assess held-out performance of individual-level models and the whole pipeline during development 43 . See Supplementary Note 3 for details on class balance, tuning and feature generation.
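The right-aligned model selection described above reduces to picking, for each patient, the model whose observation-window cutoff is the longest one that has fully elapsed. A sketch, in Python for illustration: the paper names only the T0, T15, T30, T90 and T240 cutoffs, so the full 12-value cutoff list below is a hypothetical stand-in for the authors' actual series.

```python
import bisect

# Hypothetical cutoffs in minutes for the 12 models T0, T15, T30, ...
CUTOFFS = [0, 15, 30, 60, 90, 120, 180, 240, 300, 360, 480, 720]

def select_model(elapsed_minutes):
    """Return the name of the model trained on the longest observation
    window that has fully elapsed for this visit."""
    idx = bisect.bisect_right(CUTOFFS, elapsed_minutes) - 1
    return f"T{CUTOFFS[idx]}"
```

This reproduces the figure's examples: a visit that began just under 30 minutes before the prediction time is scored by T15, and one just over 30 minutes by T30.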
To evaluate the individual-level probabilities, predictions were generated for all visits in the test-set period and scored using log loss and AUROC, as shown in the Results. Log loss was considered the most important metric because, for input into the aggregation steps, accurate probabilities are more important than classification. At each prediction time, the probabilities of admission for every individual in the ED estimated at Step 2 were combined to give a predicted cumulative distribution function (cdf) for the aggregate number of admissions among this group (see for instance Utley et al 29 ). The observed number of admissions associated with each prediction time was mapped to the midpoint of the relevant portion of its respective predicted cdf 44 . A plot of the cumulative distribution of these mapped observations against the predicted cdf was constructed to give a visual guide to the concurrence between the predicted distributions and the observations, analogous to a QQ plot 43 . Survival analysis applied to ED visit durations among admitted patients was used to estimate the probability that a patient who had been in the ED for a given time would be admitted within the prediction window, conditional on them being admitted eventually, with Cox regression used to adjust such probabilities to account for the time of the patient's arrival (time of day, weekday or weekend, quarter of year) and the occupancy of the ED at that time. (See Supplementary Note 4 for more details and Supplementary Table S4 for regression coefficients.) This analysis was combined with the probabilities of admission estimated at Step 2 to give, for each patient in the ED at the prediction time, a probability of being admitted within the prediction window. These probabilities were then combined to give a predicted cumulative distribution function for the aggregate number of admissions within the prediction window among this group (as per Step 3).
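The combination at Step 4 can be written out as a small calculation. Assuming the Cox-adjusted survival function S(t) (the probability that an eventually-admitted patient's ED stay exceeds t) is already available, the probability of admission within the window is the product of the overall admission probability and the conditional probability that the remaining stay ends inside the window. This is a sketch under that assumption; the function and argument names are hypothetical.

```python
def prob_admitted_in_window(p_admission, s_elapsed, s_window_end):
    """P(admitted within window) =
       P(admitted eventually) * P(stay ends within window | admitted,
                                  still present at the elapsed time).

    s_elapsed     -- S(t): P(stay > t) for this patient's elapsed time t
    s_window_end  -- S(t + w): P(stay > t + w) for window length w
    """
    if s_elapsed == 0:
        return 0.0  # no admitted patient's stay lasts this long
    p_within_given_admit = (s_elapsed - s_window_end) / s_elapsed
    return p_admission * p_within_given_admit
```

For example, with an admission probability of 0.8, S(elapsed) = 0.5 and S(elapsed + window) = 0.25, the conditional probability of leaving within the window is 0.5 and the combined probability is 0.4; these per-patient values then feed the Step 5 aggregation.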
At each prediction time during the training-set periods, a count was made of the number of patients not in the ED at the prediction time who were admitted via the ED within the prediction window. A Poisson regression was fitted to the count data, with coefficients for the prediction time of day (06:00, 12:00, 16:00 and 22:00), quarter of year, and weekday or weekend. (See Supplementary Note 5 for more details and Supplementary Table S5 for regression coefficients.) The resulting coefficients were used to generate a probability distribution for the relevant prediction time of patients who have not yet arrived, and this was convolved with the output from Step 5 to generate the final aggregated predictions of the number of admissions within the prediction window, which were evaluated using QQ plots as in Step 3. Comparison with the commonly used six-week rolling average was not straightforward, as this metric is for a 24-hour prediction window from midnight. Following practice in the hospital, the observed number of admissions up to 16:00 was subtracted from the daily rolling average to derive a prediction for the remaining 8 hours, and this was compared with the models' 8-hour predictions for all report times in the test set, as illustrated in Figure 5. The datasets analysed during the current study are not publicly available: due to reasonable privacy and security concerns, the underlying EHR data are not easily redistributable to researchers other than those engaged in approved research collaborations with the hospital. Access to a private GitHub repository is available upon reasonable request.

References
1. Artificial intelligence in healthcare
2. Improving healthcare operations management with machine learning
3. The opportunity loss of boarding admitted patients in the emergency department
4. The Relationship between Inpatient Discharge Timing and Emergency Department Boarding
5. Understanding patient flow in hospitals
6. Are medical outliers associated with worse patient outcomes? A retrospective study within a regional NHS hospital using routine data
7. Emergency department and hospital crowding: causes, consequences, and cures
8. Older medical outliers on surgical wards: impact on 6-month outcomes
9. A hierarchical Bayesian model for improving short-term forecasting of hospital demand by including meteorological information
10. Predicting hospital admission at emergency department triage using machine learning
11. A framework for operational modelling of hospital resources
12. Volatility in bed occupancy for emergency admissions
13. Approach to Prediction Using the Gravity Model, with an Application to Patient Flow Modeling
14. A hybrid tabu search algorithm for automatically assigning patients to beds
15. Generalizability of a Simple Approach for Predicting Hospital Admission From an Emergency Department
16. Predicting hospital admissions at emergency department triage using routine administrative data
17. Using Data Mining to Predict Hospital Admissions from the Emergency Department
18. Emergency department triage prediction of clinical outcomes using machine learning models
19. Progressive prediction of hospitalisation in the emergency department: Uncovering hidden patterns to improve patient flow
20. Early prediction of hospital admission for emergency department patients: A comparison between patients younger or older than 70 years
21. Hospital Admission Location Prediction via Deep Interpretable Networks for the Year-Round Improvement of Emergency Patient Care
22. Featurize: A Cross Domain Framework for Prediction Engineering
23. The Framing of machine learning risk prediction models illustrated by evaluation of sepsis in general wards
24. Feature Robustness in Non-stationary Health Records: Caveats to Deployable Model Performance in Common Clinical Machine Learning Tasks
25. Early prediction of circulatory failure in the intensive care unit using machine learning
26. Using machine learning techniques to develop forecasting algorithms for postoperative complications: protocol for a retrospective study
27. Machine learning for real-time prediction of complications in critical care: a retrospective study
28. Predicting Emergency Department Inpatient Admissions to Improve Same-day Patient Flow
29. Analytical Methods for Calculating the Capacity Required to Operate an Effective Booked Admissions Policy for Elective Inpatient Services
30. Bridging the implementation gap of machine learning in healthcare
31. Developing a delivery science for artificial intelligence in healthcare
32. Access to same day emergency care
33. Prediction across healthcare settings: a case study in predicting emergency department disposition
34. Near real-time bed modelling feasibility study
35. Concept Drift Detection for Streaming Data
36. Learning under Concept Drift: an Overview
37. Patient apprehensions about the use of artificial intelligence in healthcare
38. Bridging the "last mile" gap between AI implementation and operation: "data awareness" that matters
39. Can network science reveal structure in a complex healthcare system? A network analysis using data from emergency surgical services
40. The Last Mile: Where Artificial Intelligence Meets Reality
41. mlr3: A modern object-oriented machine learning framework in R
42. XGBoost: A scalable tree boosting system
43. Probability plotting methods for the analysis of data
44. Development, implementation and evaluation of a tool for forecasting short term demand for beds in an intensive care unit

The authors declare no competing interests.