A unified machine learning approach to time series forecasting applied to demand at emergency departments

Vollmer, Michaela A. C.; Glampson, Ben; Mellan, Thomas A.; Mishra, Swapnil; Mercuri, Luca; Costello, Ceire; Klaber, Robert; Cooke, Graham; Flaxman, Seth; Bhatt, Samir

2020-07-13

Abstract

There were 25.6 million attendances at Emergency Departments (EDs) in England in 2019, corresponding to an increase of 12 million attendances over the past ten years. The steadily rising demand at EDs creates a constant challenge to provide adequate quality of care while maintaining standards and productivity. Managing hospital demand effectively requires adequate knowledge of the future rate of admission. Using eight years of electronic admissions data from two major acute care hospitals in London, we develop a novel ensemble methodology that combines the outcomes of the best-performing time series and machine learning approaches in order to make highly accurate forecasts of demand 1, 3 and 7 days in the future. The two hospitals face an average daily demand of 208 and 106 attendances respectively and experience considerable volatility around these means. Nevertheless, our approach is able to predict attendances at these emergency departments one day in advance to within a mean absolute error of +/- 14 and +/- 10 patients, corresponding to mean absolute percentage errors of 6.8% and 8.6% respectively. Our analysis compares machine learning algorithms to more traditional linear models. We find that linear models often outperform machine learning methods and that the quality of our predictions for any of the forecasting horizons of 1, 3 or 7 days is comparable as measured by MAE. In addition to comparing and combining state-of-the-art forecasting methods to predict hospital demand, we consider two different hyperparameter tuning methods, enabling a faster deployment of our models without compromising performance. We believe our framework can readily be used to forecast a wide range of policy-relevant indicators.

In 2019 there were 25.6 million attendances at emergency departments (EDs) in England, corresponding to 70,231 patients attending every day [3]. National Health Service (NHS) trusts across England are under very high pressure to maintain current standards and quality of care [29]. The rate of attendance has grown by 4.8% since 2018 and by 10.6% over the last five years, faster than population growth, putting high pressure on the health care system. Failure to make provision for surges in demand can lead to overcrowding, which in turn has been linked to multiple adverse patient outcomes such as reduced patient satisfaction, poor quality of care and diseconomies of scale [27]. In order to decrease overcrowding, the NHS introduced a new operational standard in 2010, commonly known as the "four hour target", requiring that at least 95% of patients attending EDs be discharged, admitted or transferred within four hours of arrival. However, this target has not been met since 2014, and failure rates reached a new high in January 2020, with 18% of patients at EDs waiting longer than four hours, despite the overall number of attendances being lower in January 2020 than in January 2019.
In fact, 15.3% of patients spent more than four hours in EDs in 2019, compared to 11.9% the year before and 5.5% five years earlier, making 2019 the year with the worst annual performance on record [3]. A shortage of staff is not the predominant cause of long waiting times and low quality of care [31]; rather, the available staffing is not matched to patient demand. This inefficiency in staffing can have substantial impacts: in 2019, 86,264 elective surgeries were cancelled for non-clinical reasons on the day the patient was due to arrive. These cancellations cost NHS hospital trusts the time of surgeons, anaesthetists and nurses as well as surgical session time and theatre capacity. Moreover, in the same year, the percentage of patients who had not been treated within 28 days of a cancellation decreased from 9.8% in 2018 to 8.8% in 2019, still failing one of the NHS's improvement objectives. In addition, NHS England advises that an 85% bed occupancy rate is the maximum safe level of occupancy and that trusts should keep bed occupancy below 92%. However, in the third quarter of 2019, 126 out of 170 trusts recorded bed occupancy above 85%, 58 trusts recorded rates above 92%, eight trusts had occupancy above 98% and one trust recorded 100% bed occupancy (NHS SitRep). Overall, the NHS in England spent around £130 billion in 2018/19 on the delivery of health services [13]. A major fraction of this spending, 44.9% in the financial year 2016-2017 [9], goes to the 1.2 million people employed by NHS hospital and community health services. Between February 2018 and February 2019 the number of doctors alone rose by 2.5%, and by 10% over the past five years [23]. This emphasises that long waiting times may not simply result from a shortage of staff. Nonetheless, every year high agency and staffing costs are required to cover staffing shortfalls. A key step in addressing the issue of staffing is the prediction of the rates of admission and the duration of stay at EDs (collectively referred to as "demand"). In particular, an adequate prediction of demand and a better understanding of the reasons for demand are fundamental to the delivery of high-quality care. Despite its importance, methods for predicting rates of admission and understanding the underlying dynamics have not been studied extensively in the literature. Existing methods have been limited to the application of classical time series forecasting methods. In an Australian study, Boyle et al [6] predicted monthly, daily and four-hourly demand at 27 EDs in Queensland. Taking public holidays into account as predictors, the authors used an autoregressive integrated moving average (ARIMA) model as well as regression and exponential smoothing methods to predict demand to within a mean absolute percentage error (MAPE) of 11% for daily admissions. McCarthy et al [28] predicted hourly presentations to American emergency departments, including several temporal, weather and patient factors in order to characterise arrival behaviour at EDs. Jones et al [24] used seasonal autoregressive integrated moving average (SARIMA), time series regression, exponential smoothing and artificial neural network models to forecast demand. The authors had access to two years of data and made predictions ranging from 1 to 30 days in advance for three hospitals in the USA.
The authors found seasonal and weekly patterns to be most important for an accurate prediction and obtained a mean absolute percentage error (MAPE) of 9-13%, depending on the facility. Champion et al [12] performed an analysis of monthly demand at an emergency department in regional Victoria, applying exponential smoothing and Box-Jenkins methods to five years' worth of admissions data. Hoot et al [21] carried out a discrete event simulation to obtain forecasts for 2, 4, 6 and 8 hours into the future, analysing waiting times, length of stay and bed occupancy.

We develop a novel predictive framework to understand the temporal dynamics of hospital demand and apply an exhaustive statistical analysis to daily presentations at the EDs of St Mary's and Charing Cross Hospitals, evaluating a range of standard time series and machine learning approaches and ultimately developing our own unique approach. In contrast to existing studies, we do not only apply time series algorithms to characterise demand but also develop a generic procedure that allows us to compare and combine both time series and machine learning algorithms, obtaining a more informative and consistently accurate approach to the prediction of demand. Our models can be retrained regularly and efficiently and are therefore a powerful tool for online platforms and near-real-time prediction. Using novel data from electronic logging systems covering eight years of daily presentations to the EDs at St Mary's and Charing Cross Hospitals in London, we construct a model that predicts the number of daily arrivals at both hospitals. Our analysis accounts for seasonal fluctuations, daily observed weather data and specific pre-planned events indicated by staff at both EDs, such as the yearly Notting Hill Carnival. Our procedure can help the hospitals provide the right staffing numbers and deploy resources in the most effective way. (Figure 1 shows the daily attendance time series for both hospitals.)

In order to investigate the demand dynamics, we also collected data on school [30] and bank holidays [20], as well as on the weather and on Google search volume for the word "flu" [2, 19] (see the Appendix for more details). Finally, experienced staff at the EDs of both hospitals provided us with a list of specific known events in the locality that cause surges in demand (e.g. the Notting Hill Carnival, an annual festival taking place in the catchment area of the hospital).

A central aim of this paper is not only to predict hospital demand accurately but also to understand the factors driving it. Daily data is driven by a complex web of exogenous variables, many of which are related to seasonal patterns or trends. Both time series depicted in Figure 1 show a strong underlying trend. In the case of Charing Cross Hospital this trend is clearly upwards, while for St Mary's Hospital it initially goes downwards due to a change in the hospital's infrastructure (see Figure 2). There is also clear seasonality: the monthly attendance at St Mary's Hospital shows a very clear periodic pattern, with troughs in January, April and August and a rise in attendance during the winter months, likely due to increases in acute respiratory infections (see Figure 3). Figure 3 also plots ED admissions together with the flu season and shows that the flu season's contribution to increased demand runs well into spring. While the monthly attendance at Charing Cross Hospital also shows some periodic behaviour, it is not as strong (see Figure 4). It is therefore useful to note that demand dynamics differ even between geographically close hospitals with overlapping catchments. Both series show clear day-of-week patterns, characterised by a strong autocorrelation with their lagged values of order 7 (see Figure 5). Mondays have the highest volume of attendances at both hospitals, while attendance reaches its minimum during weekends. This finding confirms other studies on hospital demand [15, 14].
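The trend, seasonal and day-of-week structure described above can be inspected with a standard decomposition. Below is a minimal R sketch (ours, not the authors' code), assuming a numeric vector `daily_counts` of consecutive daily attendance totals (a hypothetical name):

```r
# Decompose daily attendances into trend, weekly seasonality and remainder,
# and inspect the lag-7 autocorrelation discussed in the text.
y   <- ts(daily_counts, frequency = 7)   # treat the week as the seasonal period
fit <- stl(y, s.window = "periodic")     # LOESS-based trend/seasonal/remainder split
plot(fit)                                # visualise trend and weekly pattern

acf(daily_counts, lag.max = 28)          # spikes at lags 7, 14, 21, 28
```

The ACF spikes at multiples of 7 correspond to the day-of-week pattern shown in Figure 5.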
We focus on forecasting demand one, three and seven days into the future. These horizons are relevant because they allow the hospitals to take action through short-term measures such as cancelling elective surgeries or hiring additional staff through agencies. We use two different kinds of algorithms for our predictions: traditional time series algorithms and machine learning algorithms [26]. A discrete time series is a sequence of data points in chronological order, divided into regular time intervals. The fundamental assumption behind both classes of algorithms is that data points close to each other in time behave similarly and that there is a dependency between data points at the same position within the time interval, e.g. the same time of year or the same day of the week. Time series algorithms use both the chronology of the events and the specified interval to make inference, splitting the time series into linear components such as seasonality, trend and a residual. The residual, which is assumed to contain some correlative structure, is usually modelled using an autoregressive stochastic process or exponential decay, where future values are predicted from past values [26]. In contrast, machine learning algorithms specify a broad function class (such as trees or smooth curves) with sufficient capacity to learn complex functions; these algorithms learn from data, balancing function complexity with predictive accuracy. For both sets of algorithms we create a model (design) matrix containing explanatory variables. For the time series algorithms, only lagged demand from previous time points was used. For the machine learning algorithms, lagged demand values were used alongside other covariates (see Table 1). As predictors we have chosen past values, such as demand on the previous day, demand a week earlier and the average demand over the past week, as well as indicators for bank holidays and school holidays. Moreover, we use data from surrounding weather stations on precipitation and minimum and maximum temperature as covariates. Finally, we use search engine query data as a covariate, as it has proven to be an effective measure for detecting influenza epidemics [18]. Table 1 shows a few rows of our model matrix.
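To make the design matrix concrete, the following R sketch builds lagged-demand and calendar covariates of the kind listed in Table 1. The data frame `df` (one row per day, with columns `date` and `attendance`), the column names and the holiday lookup vectors are all hypothetical, not the paper's code:

```r
library(dplyr)

# bank_holiday_dates and school_holiday_dates are assumed Date vectors.
model_matrix <- df %>%
  arrange(date) %>%
  mutate(
    lag_1          = lag(attendance, 1),                                # demand yesterday
    lag_7          = lag(attendance, 7),                                # same weekday last week
    mean_last_7    = zoo::rollmeanr(lag(attendance, 1), 7, fill = NA),  # past-week average
    day_of_week    = weekdays(date),
    month          = months(date),
    bank_holiday   = date %in% bank_holiday_dates,
    school_holiday = date %in% school_holiday_dates
  ) %>%
  tidyr::drop_na()                                                      # drop warm-up rows
```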
2. ETS, exponential smoothing methods [22]. In general, exponential smoothing refers to forecasting methods that also regress on lagged values of the target variable but use exponentially decreasing weights for past observations. The ETS model we employ uses exponential smoothing for the error, trend and seasonality components.

3. STLM, a seasonal decomposition of time series by LOESS (STL). STLM is another type of exponential smoothing model. The time series is decomposed into its seasonal components using LOESS (locally estimated scatterplot smoothing) before exponential smoothing is used to model the error and trend components. Finally, the series is re-seasonalised.

4. StructTS. A StructTS model is formulated directly in terms of unobserved components, such as trends, seasonality and exogenous factors, that have a natural interpretation. The StructTS model has a state-space formulation, which makes it similar to the ARIMA and ETS models.

The machine learning algorithms are as follows.

1. glmnet [17] is a penalised generalised linear model with built-in variable selection. The glmnet model extends the generalised linear model by introducing bias (penalty/weight decay/regulariser) in the form of a mixture of penalties on the ℓ1 and ℓ2 norms of the parameter vector. The magnitude of this penalty is tuned to balance overfitting against underfitting, with the goal of reducing variance by introducing some bias.

2. ranger [33] is a fast implementation of random forests [8]. Random forests are a bagged ensemble of decision trees, built on bootstrapped random subsets of the covariates. Random forests prevent overfitting by averaging to reduce variance.

3. Gradient boosting machines (gbm) [16] are generalised boosted regression models. In contrast to random forests, which build a collection of independent decision trees, the trees created by gradient boosting machines depend additively on each other: each new tree is added to the ensemble to improve on the previous trees' residual error (the functional negative gradient). Hyperparameters are tuned to balance over- and underfitting.

4. k-nearest neighbours (k-NN) [5] is a classic supervised learning algorithm based on the idea that data points close to each other in covariate space should have similar predictions. Hence, to make a prediction at a given location in covariate space, an average of the labels of the k nearest neighbours is taken. A disadvantage of this algorithm is that it slows down quickly as the volume of data points increases.

All the above algorithms have strengths and weaknesses, so it is challenging to choose a single model for prediction. We therefore create a consensus model by adopting stacked regression, a particularly effective ensemble approach. The idea of stacked regression is to combine the diversity and strengths of multiple algorithms into a single model with a better overall ability to generalise. We choose a linear stacking model subject to convex combination constraints [7]. Following [7, 32], we train a linear stacker on the cross-validation predictions of the individual time series and machine learning models. This procedure helps us avoid selecting models that overfit the data. Stacking is not only empirically motivated but also has strong theoretical backing: it has been proven to perform asymptotically exactly as well (by some loss metric) as the best possible choice, for the given data set, among the family of weighted combinations of the estimators. Stacking has also been shown to work in a variety of settings [4]. Developing an algorithm with the best possible forecast accuracy is the main goal of this study. However, care must be taken to ensure that these forecasts are not simply overfitting to noise in the data but are accurate and can truly generalise to unseen data.
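As an illustration of the convex-combination constraint, the following R sketch (our interpretation of [7], not the authors' code) fits stacking weights that are nonnegative and sum to one via quadratic programming. `P` is an n x k matrix of cross-validated predictions from the k base models and `y` the observed daily demand:

```r
library(quadprog)

stack_weights <- function(P, y) {
  k    <- ncol(P)
  Dmat <- crossprod(P) + 1e-8 * diag(k)   # P'P, lightly ridged for numerical stability
  dvec <- drop(crossprod(P, y))           # P'y
  Amat <- cbind(rep(1, k), diag(k))       # column 1: sum(w) = 1; remaining: w >= 0
  bvec <- c(1, rep(0, k))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}

# Usage: w <- stack_weights(P_cv, y_train); ensemble forecast is P_new %*% w
```

The convexity constraint keeps the ensemble forecast within the range of its members, guarding against extreme extrapolation by any single model.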
A key innovation of this paper is the development of a novel, general-purpose time series cross-validation procedure to ensure exactly this.

Temporal, or time series, cross-validation [22] is a method of splitting the data into testing and training sets that accounts for the temporal structure of the data. The main idea is that each test set consists of a forecasting window of a single day lying one, three or seven days in the future, while the corresponding training set consists of a number of observations prior to the forecasting window. The origin can either be fixed, so that the training window grows by one, three or seven days for each new test set, or it can move forward, so that the training window size remains fixed. We employ the latter method. Using temporal cross-validation for the time series algorithms only requires splitting the data into a training and a test set. We hold out the last 730 days (24 months) of data for testing, leaving 2,140 days (roughly six years) for training, which corresponds to a ~75%/25% split. We then apply temporal cross-validation to each day in the test set (see Figure 6) using a rolling window, so that all training sets consist of the same number of days. Given the fairness and robustness of our cross-validation scheme, we are confident that our results are robust to data shift and not only valid for the particular period we collected.

Figure 6: Splitting the data into training, (one-day) validation and test sets.

To allow a fair comparison of the machine learning algorithms, the temporal cross-validation method has to be adapted, as most machine learning algorithms require tuning of their hyperparameters to balance over- and underfitting. We therefore also split the training data into a training and a validation set in order to choose the set of hyperparameters that minimises the error on the validation set. All machine learning algorithms listed in Section 4.2 were trained on four years' worth of data and their hyperparameters were compared on a validation set consisting of 730 days. In our analysis we used two different approaches to split the training set and choose the hyperparameters, which we call the batch and the online method.

To choose the set of hyperparameters that minimises the error on the validation set, we consider five different approaches. We choose the set of hyperparameters that minimises the error

1. on the previous day,
2. on the past n days,
3. over the whole validation period using an exponential moving average,
4. averaged over the whole validation period, or
5. according to caret's built-in rules (see [25]).

For each of the five cases above, we choose the best set of hyperparameters for each day of the test set and refit all models on a daily basis. Refitting every model for each day of the test set is, of course, computationally expensive; we therefore develop an online method, described below.

In the batch method described above, all models are refitted daily, which takes significant computational power and might not be feasible for a deployed version of our methods in a hospital setting. We therefore explore whether keeping the parameters fixed for longer periods, which yields significant computational savings, hurts performance. We refit each model over several testing periods of various sizes, which are subsets of the test set of length 730 days, with a rolling origin. Our chosen testing periods are 1, 7, 30, 60, 365 and 730 days long. That is, we validate the best set of hyperparameters for each algorithm for every testing period and then roll forwards. The final error rates are the overall error of all predictions on the whole test set of length 730 days.
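The rolling-origin evaluation described above can be sketched generically in R as follows. This is a minimal illustration under our assumptions, not the deployed code; `fit_fun` and `predict_fun` are hypothetical stand-ins for any of the models:

```r
# y: full daily series; train_len: fixed training window size (~2140 days here);
# h: forecast horizon (1, 3 or 7 days ahead).
rolling_origin_cv <- function(y, train_len, h, fit_fun, predict_fun) {
  n <- length(y)
  origins <- seq(train_len, n - h)            # last index of each training window
  errs <- sapply(origins, function(t) {
    train <- y[(t - train_len + 1):t]         # fixed-size window rolls forward
    fit   <- fit_fun(train)
    abs(predict_fun(fit, h) - y[t + h])       # absolute error h days ahead
  })
  c(MAE = mean(errs), MAPE = 100 * mean(errs / y[origins + h]))
}

# Example with an ETS model from the forecast package:
# library(forecast)
# rolling_origin_cv(daily_counts, 2140, 7,
#                   fit_fun = function(tr) ets(ts(tr, frequency = 7)),
#                   predict_fun = function(f, h) forecast(f, h = h)$mean[h])
```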
We compare the mean absolute error (MAE) and mean absolute percentage error (MAPE) for all time series algorithms as well as for the batch and the online methods, as outlined in Sections 5.1.1 and 5.1.2. Finally, we compare our results with the stacked regression. For St Mary's Hospital, the MAE of the time series algorithms ranges from 14.46 to 17, as shown in Table 2, with a MAPE ranging from 6.9% to 8.3%. Hence, although the time series algorithms are simpler and based only on the series itself, without any other predictors such as weather or bank and school holidays (see Table 1 for details on all predictors), they already give relatively accurate results. For the machine learning algorithms, independent of the type of hyperparameter tuning we use, most of the results for the batch method lie between an MAE of 14.3 and 15.2 and a MAPE of 6.8%-7.4%, as shown in Table 3. Only the k-nearest neighbours algorithm performs worse, regardless of how it is tuned. Especially when using the online method (see Table 5), the linear models produce the best results for St Mary's Hospital, with a MAPE of 6.8% in the case of daily retraining of the generalised linear model.

The results for Charing Cross Hospital are similar, although the error rates are slightly higher. As shown in Table 2, most time series algorithms yield an MAE between 10.5 and 14.5, while the corresponding MAPE ranges from 8.5% to 12%. The best performance reached using the batch method is also an MAE of around 10.5, although the overall performance is slightly better (see Table 4). With the online method, the MAPE is as low as 8.6% for the gradient boosting machine, the generalised linear model and a simple linear model retrained on a daily basis, as shown in Table 5.

To make use of the strengths of all algorithms, we applied a generalised linear model as well as a penalised regression to create an effective ensemble approach (see Figure 6 for details). The best performance was achieved using penalised regression, with an MAE of 14.51 for St Mary's Hospital and 10.60 for Charing Cross Hospital.

In our analysis we consider a variety of predictors, ranging from past values of ED attendance to the weather forecast and school and bank holidays. Which of these predictors are actually important for making accurate predictions, and could some of them be redundant? In machine learning, variable importance can be defined as the dependence between input and output variables, computed by permuting the values of a given predictor and calculating the error on a held-out set. This measure has drawbacks in the presence of multicollinearity, which can suppress the importance of certain variables in some models. As shown in Figure 8, the most important variable for predicting demand is the average demand during the week before (except for glmnet), followed by specific days and months. The importance of these variables varies considerably between algorithms but is largely consistent across them. To our knowledge, this is the first attempt to quantify the importance of variables in predicting demand.
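A generic version of this permutation measure can be sketched as below. This is our implementation, not the authors' code; it assumes `predict(model, X)` returns numeric forecasts, whereas some packages (e.g. ranger) instead wrap predictions in an object:

```r
# model: any fitted model with a predict method; X: held-out covariate data
# frame; y: held-out outcomes; n_rep: permutations per variable.
permutation_importance <- function(model, X, y, n_rep = 20) {
  base_mae <- mean(abs(predict(model, X) - y))
  sapply(names(X), function(v) {
    mean(replicate(n_rep, {
      Xp      <- X
      Xp[[v]] <- sample(Xp[[v]])       # break the link between predictor v and y
      mean(abs(predict(model, Xp) - y)) - base_mae
    }))                                # average increase in held-out MAE
  })
}
```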
We believe that these importance percentages can be used as heuristics to help staff improve their prediction of demand in the absence of statistical analysis.

The results of our analysis highlight interesting statistical properties: simple linear methods such as generalised linear models are often better than, or at least as good as, ensemble learning methods such as gradient boosting or random forests. However, although sophisticated machine learning methods are not necessarily better than linear models, they improve the diversity of model predictions, so that stacked predictions are more robust and accurate than any single model, including the best-performing one. This largely confirms a 'wisdom of the crowd' rule for ensemble learning. The online method we have created can provide accurate results with a very quick turnaround: an average model takes only a couple of minutes to train and forecast. Running it over longer periods of time without retraining the model is much faster and, at the same time, not significantly worse than tuning the hyperparameters on a daily basis.

The framework, analysis and methodology proposed in this paper are highly relevant from an operational viewpoint. To the authors' knowledge, the majority of EDs in the UK do try to account for demand, but do so using ad hoc heuristics. Our approach provides scientifically backed information to improve these heuristics and, more importantly, a framework that is quick and easy to implement. Estimates of 1-, 3- and 7-day demand forecasts can be created at any time and updated easily, allowing statistically backed estimates to inform hospital policy and practice. Our hope is that this paper will fill a knowledge gap and increase hospitals' uptake of these methods. Given that epidemics and disease outbreaks can strain health systems, our approach can provide added precision to help EDs operate as efficiently as possible. The challenge will then be for teams in hospitals to implement more insight-driven ways of working, more flexible approaches to rostering staff and innovative ways of communicating with their local communities. In an era of precision medicine, future work will undoubtedly focus on granular data sets including individual-level patient data with diagnoses and testing information. This will not only improve the performance of the predictions but also help us understand dependencies between seasonality, events and the reasons for ED presentations. It may come at the cost of longer running times, but only an in-depth analysis of patients' pathways, the rates of admission for different diagnoses and the corresponding bed occupancy will substantially improve hospital planning.

The authors declare that they have no competing interests.

Appendix

For weather data we used minimum and maximum daily temperatures as reported by the Met Office [10, 11] from the following weather station:

• ID 697 St. James's Park (postcode SW1A 1), straight-line distance 2.3 miles/3.7 km to St Mary's Hospital and 4.23 miles/6.81 km to Charing Cross Hospital.

Data on the daily precipitation in London is publicly available [1]. The number of Google searches for "flu" on the Google search engine is also publicly available and can be downloaded from Google Trends [19]. Google provides data on the relative search volume and assigns a score between 1 and 100 to every time unit within the time frame specified by the user. However, the length of this time unit is determined by Google, so the data within a given time frame may only show the weekly or monthly searches for "flu". We therefore adjust the data by combining daily data, downloaded within time frames of six months, with monthly data downloaded over the past few years [2].
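Our reading of this adjustment can be sketched in R as follows, assuming hypothetical data frames `daily` (columns `date`, `hits`, from six-month download windows) and `monthly` (columns `month`, `hits`, long-run monthly scores):

```r
library(dplyr)

# Daily Trends scores are only comparable within each downloaded window, so
# each day is rescaled by the long-run monthly score for its month divided
# by that month's mean daily score.
adjusted <- daily %>%
  mutate(month = as.Date(format(date, "%Y-%m-01"))) %>%
  group_by(month) %>%
  mutate(month_mean = mean(hits)) %>%            # mean daily score in the month
  ungroup() %>%
  left_join(monthly, by = "month", suffix = c("_daily", "_monthly")) %>%
  mutate(hits_adj = hits_daily * hits_monthly / month_mean)
```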
References

[1] Daily areal rainfall calculations.
[2] Google Trends: how to acquire daily data for broad time frames.
[3] NHS key statistics, England.
[4] Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalisation.
[5] Pattern Recognition and Machine Learning.
[6] Predicting emergency department admissions.
[7] Stacked regressions. Machine Learning.
[8] Random forests. Machine Learning.
[9] The NHS Pay Review Body (NHSPRB) review for 2018: written evidence from the Department for Health and Social Care for England.
[10] MIDAS: UK daily rainfall data.
[11] MIDAS: UK daily temperature data.
[12] Forecasting emergency department presentations.
[13] Reference costs 2017/18: highlights, analysis and introduction to the data.
[14] Hospital accident and emergency activity 2017-18.
[15] Accident and emergency attendances in England.
[16] Greedy function approximation: a gradient boosting machine.
[17] Regularization paths for generalized linear models via coordinate descent.
[18] Detecting influenza epidemics using search engine query data.
[19] Google Trends.
[20] Bank holidays.
[21] Forecasting emergency department crowding: a discrete event simulation.
[22] Forecasting: Principles and Practice.
[23] Falling short: the NHS workforce challenge. Workforce profile and trends of the NHS in England.
[24] Forecasting daily patient volumes in the emergency department.
[25] Building predictive models in R using the caret package.
[26] The M4 Competition: 100,000 time series and 61 forecasting methods.
[27] Overcrowding in emergency departments and adverse outcomes.
[28] The challenge of predicting demand for emergency department services.
[29] Briefing Paper 7281.
[30] School term and holiday dates.
[31] Reducing waiting times in the NHS: is lack of capacity the problem?
[32] Stacked generalization.
[33] ranger: a fast implementation of random forests for high dimensional data in C++ and R.