Optimal discharge of patients from intensive care via a data-driven policy learning framework

Fernando Lejarza, Jacob Calvert, Misty M. Attwood, Daniel Evans, Qingqing Mao

December 17, 2021

Abstract: Clinical decision support tools rooted in machine learning and optimization can provide significant value to healthcare providers, including through better management of intensive care units. In particular, it is important that the patient discharge task addresses the nuanced trade-off between decreasing a patient's length of stay (and associated hospitalization costs) and the risk of readmission or even death following the discharge decision. This work introduces an end-to-end general framework for capturing this trade-off to recommend optimal discharge timing decisions given a patient's electronic health records. A data-driven approach is used to derive a parsimonious, discrete state space representation that captures a patient's physiological condition. Based on this model and a given cost function, an infinite-horizon discounted Markov decision process is formulated and solved numerically to compute an optimal discharge policy, whose value is assessed using off-policy evaluation strategies. Extensive numerical experiments are performed to validate the proposed framework using real-life intensive care unit patient data.

The demand for accurate recommendation systems for discharging patients poses a substantial challenge in intensive care units (ICUs). Advanced patient needs, coupled with the greater demand for staff and resources as well as limited capacity, elevate the costs of healthcare delivery in ICUs to levels much greater than those in generic medical-surgical hospital wards (Badawi and Breslow, 2012; Halpern and Pastores, 2010).
These factors intuitively motivate the discharge of patients from ICUs as soon as their health status no longer demands such extensive resources . Nonetheless, patients who are readmitted after being prematurely discharged from ICUs, or whose health status declines rapidly after discharge, impose an even greater burden on hospitals and have disproportionately high mortality rates (Badawi and Breslow, 2012; Mcneill and Khairat, 2020) . A recent review of hospital mortality studies reported that readmitted patients can have readmission mortality rates as high as 40 percent compared to non-readmitted mortality rates of 3.6 to 8.4 percent (Mcneill and Khairat, 2020) . Furthermore, hospitals in the United States have an additional financial motivation to reduce rates of premature or improper discharge, as they can be penalized through the Hospital Readmissions Reduction Program (HRRP) if they fail to meet federal standards (Wasfy et al., 2017) . HRRP penalizes the majority of American hospitals for substandard readmission rates, up to a maximum penalty rate of one percent of the hospital's Medicare base payments (James, 2013) . Ensuring efficient patient discharge with minimal probability of ICU readmission is therefore an important priority for critical care providers, which incentivizes the creation and deployment of decision support tools that can help clinicians optimize the timing of patient discharge (Levin et al., 2021) . This priority has become even more pressing with the ongoing COVID-19 public health crisis, which has imposed a dramatically higher burden on critical care worldwide (Arabi et al., 2021) . The development of well-designed machine learning and optimization schemes could meet this need by interpreting ICU patients' electronic health records (EHR) data in real-time, assessing their probability of readmission, and providing clinicians with concise assessments to assist them with these difficult, high-stakes decisions. 
Algorithms and decision support tools may also encourage clinicians to discharge patients outside of normal rounding hours, allowing for greater fluidity in patient turnover (McWilliams et al., 2019). The fields of operations research (OR) and artificial intelligence (AI) are well acquainted with the challenges of developing algorithms that meet the complex demands of this clinical problem. ICU patient populations are highly heterogeneous with respect to disease state and comorbidity (Forte et al., 2021), and the relatively low proportion of ICU patients who are readmission cases may yield imbalanced data sets (Loreto et al., 2020). The death of patients after discharge from ICUs, either at home or at other healthcare facilities outside the purview of data collection, is a competing risk in algorithm development that may often be driven by the same factors that would indicate the need for readmission (Loreto et al., 2020). The increased mortality rate within ICUs relative to other hospital wards also generates a higher attrition rate for data collection (Wilcox and Ely, 2019; McWilliams et al., 2019). Despite these challenges, tools for classifying patients by their probability of readmission continue to be developed with steadily improving performance (McWilliams et al., 2019; Mcneill and Khairat, 2020; Balshi et al., 2020; Czajka et al., 2020; Loreto et al., 2020; Levin et al., 2021). In this work we present an unsupervised machine learning approach for representing patients' physiological conditions that inherently reflects mortality and readmission risks. This health state representation is then used to formulate an infinite-horizon Markov decision process in order to numerically compute an optimal discharge policy for a given cost function. The proposed policy is validated using recently developed off-policy evaluation algorithms on a large-scale, real-life ICU patient data set.
We discuss the resulting policy's implications for hospital management, as well as its interpretability relative to clinicians' decisions. Several machine learning methods to predict unplanned hospital or ICU ward readmission within 48 hours (Desautels et al., 2017), within 7, 14, or 30 days (Maali et al., 2018; Jaotombo et al., 2020; Lo et al., 2021), or within the same hospitalization (Rojas et al., 2018) have been published. A recent review of the application of machine learning to predicting hospital readmission identified 43 relevant studies employing a variety of modeling methods (Huang et al., 2021). Additional studies have focused on predicting multiple outcomes simultaneously, including both post-discharge mortality and readmission (Badawi and Breslow, 2012; Campbell et al., 2008; Ouanes et al., 2012; Thoral et al., 2021), as well as length of stay and readmission (Hilton et al., 2020). Machine learning models leveraging EHR data suffer from several limitations, including the potential exclusion of patients due to missing data, varying feature consistency and availability across electronic medical record systems, and the temporal availability of data codes for classifying patients (Temple et al., 2015). For these reasons, models developed to predict patient discharge or time to discharge have been studied to a lesser extent. McWilliams et al. (2019) proposed random forest and logistic regression classifiers to identify discharge-ready patients based on a variety of demographic and EHR features. The developed models significantly outperformed a much more conservative nurse-led discharge criterion (Knight, 2003) used as a benchmark, although the relatively high rate of false positives indicated the necessity of further development before clinical deployment. Temple et al. (2015) developed a random forest model to identify neonatal ICU patients with a high likelihood of discharge within a time window of two to ten days.
A random forest regression model was developed by Cuadrado et al. (2019) to predict the discharge time, i.e., the length of stay or time to discharge, of ICU patients based on 49 clinically relevant variables and values from 15 different assessment scores. Several artificial neural network methods have been used to develop prediction algorithms for the length of stay of ICU patients (Gholipour et al., 2015), and specifically for cardiac patients (Rowan et al., 2007; LaFaro et al., 2016). Safavi et al. (2019) developed a feedforward neural network model to predict daily inpatient surgical care discharges within a 24-hour time window. Further, the framework proposed by Safavi et al. (2019) was used to quantitatively identify clinical milestones of recovery and barriers to discharge that, respectively, indicate progression towards or postponement of discharge. As noted in several studies (Rubenfeld and Rhodes, 2014; Stelfox et al., 2015; McWilliams et al., 2019), identifying patients that are suitable for ICU discharge is a complex task due to the myriad factors that can drive such recommendations (e.g., patient demands, hospital management culture, procedures favored by a given clinician, resource availability, profit motivations, etc.). In discharge prediction models, determining the risk level or threshold above which a patient can be safely discharged is challenging and involves balancing true-positive and false-positive rates, depending on the needs of the clinical team. For example, clinicians may prefer a lower threshold that identifies all potential discharge candidates, including those that may not yet be ready for discharge, whereas hospital bed managers may prefer higher specificity so that they can more accurately balance beds against incoming patient demand (Safavi et al., 2019).
Limitations in the aforementioned prediction algorithms, such as restrictive feature sets and high false-positive rates (McWilliams et al., 2019), as well as missing hospital-level factors such as ICU bed availability and hospital census data (Rojas et al., 2018), have fostered the development of prescriptive (i.e., recommending a specific course of action as opposed to predicting a likely outcome) artificial intelligence frameworks. The focus of this work is to develop patient-specific, data-driven discharge recommendation policies, rather than to train models to predict either patient time to discharge, or readmission and out-of-hospital mortality instances. The prescriptive approach optimizes a custom objective function that captures patient health dynamics and reflects clinician and hospital management preferences, recommending a specific course of action and providing more easily translatable, clinically usable health-based decision rules. Several related modeling schemes have been proposed for constructing optimal policies for sequential decision-making to address various types of medical problems (Schaefer et al., 2005; Bennett and Hauser, 2013; Komorowski et al., 2018). These studies introduce frameworks typically based on different variants of Markov decision process (MDP) models (Bertsekas, 2012; Puterman, 2014) that can effectively capture the dynamic nature of patients' health conditions and provide optimal clinical decisions within the limits of the structural assumptions invoked in defining the system states, actions, transitions, and costs. These so-called "artificial intelligence clinicians" (AICs) (Komorowski et al., 2018) simulate clinical decision-making through consideration of multiple dynamic factors, including patient health conditions and demographics, as well as hospital system-related characteristics and costs.
For example, Bennett and Hauser (2013) developed a general-purpose (non-disease-specific) computational environment to sequentially simulate multiple decision paths for replicating and even improving clinician decision-making in real time. The framework was evaluated in a chronic care setting, in which the "clinical utility" (capturing the cost-effectiveness of treatments reflected by outcomes and costs) was optimized, improving patient outcomes by 30-35% at a much lower cost of care. A plethora of frameworks have been proposed for constructing treatment policies for specific chronic medical conditions including HIV (Shechter et al., 2008), diabetes (Denton et al., 2009), anemia (Gaweda et al., 2005), and breast cancer (Ayvaci et al., 2012), among several others. Recently, Komorowski et al. (2018) proposed an AIC to compute real-time optimal sepsis treatment strategies, consisting of intravenous fluid and vasopressor dosages, to maximize patients' 90-day expected survival. Komorowski et al. (2018) demonstrated that the AIC consistently outperformed the clinician policy even on external validation data corresponding to different hospitals, and confirmed that the AIC decisions were also highly interpretable. Comprehensive reviews of successful applications of MDPs and reinforcement learning to medical treatment policies can be found in Schaefer et al. (2005). Similar frameworks related to optimal patient discharging have been less studied in the literature. Kreke et al. (2008) proposed a finite-horizon MDP framework that modeled when to discharge patients with pneumonia-related sepsis to maximize their expected survival rate. Through structural and computational analyses, Kreke et al. (2008) showed that for specific problem instances the optimal discharge strategy follows a non-stationary, control-limit-type policy, implying that the level of illness at which it is optimal to discharge a patient changes over the course of the hospital stay.
The Sequential Organ Failure Assessment (SOFA) score was used to represent patient health states, which is recognized to be a simplistic metric that may fail to capture the complex set of clinical features available to represent the patient's physiological condition. Chan et al. (2012) developed a demand-driven, index-based greedy policy that discharges the ICU patient with the lowest "criticality index" when a new patient arriving in the ICU must be accommodated. The proposed state space reflected the total number of ICU patients at a given time period, where each patient was labeled with one of a finite set of ailments/health conditions. The resulting greedy policy balanced patient health-related costs (such as physiological deterioration leading to readmission or post-discharge mortality) with system-related costs (relating to increased length of stay and capacity-limited ICU resources), and was theoretically proven and empirically demonstrated to be near-optimal in settings of low ICU utilization. Numerical experiments demonstrated that the resulting policy reduced the patient readmission load by nearly 30% relative to selected benchmark policies. Ouyang et al. (2020) considered a related ICU bed allocation problem during periods of high patient demand. The objective was to minimize the long-run average expected mortality rate, and the model considered two possible health states: critical and highly critical condition. Seven different policies were evaluated, and numerical simulations showed that the proposed framework consistently reduced patient mortality risk. Shi et al. (2021) developed a large-scale MDP to optimize ICU patient flow management, patient quality of care, and patient outcomes (readmission risk). The proposed model featured a personalized readmission prediction model to dynamically determine which patients to discharge on each day based on current ICU occupancy.
Under the assumption that costs related to ICU congestion and discharge (dependent on the expected number of readmissions for a given number of discharges) have quadratic structure, Shi et al. (2021) developed an efficient linear decision rule approximation for the optimal policy, which was empirically demonstrated to decrease readmission risk from 32% to 28% while increasing the average length of stay from 3.33 days to 3.55 days. To our knowledge, our work is the first to provide an end-to-end framework that includes modeling patient health states from high-dimensional time series EHR data, computing optimal discharge policies based on different cost trade-offs, and evaluating their performance and ICU management implications on real ICU patient health trajectories. The main contributions of the present work are:

1. A flexible, data-driven prescriptive framework for making discharge decisions such that a given cost function (reflecting the trade-off between hospitalization expenses, readmission penalties, and out-of-hospital mortality rate) is minimized.

2. A clustering-based approach for identifying discrete patient health states from high-dimensional continuous EHR data that inherently reflects the relationship between physiological conditions and outcomes such as mortality and readmission.

3. Numerical validation experiments performed using a large ICU database (MIMIC-III, Johnson et al. (2016)) collected from over 40,000 adult patients in the United States.

4. Off-policy evaluation algorithms and results to derive statistically significant performance guarantees for the proposed discharge policy relative to selected policy benchmarks.

5. A resulting policy that can effectively lower hospital readmissions by ∼2% while at the same time reducing the average length of stay by ∼1 day relative to the clinician policy on a hold-out testing set.
We developed and validated the proposed framework using the Medical Information Mart for Intensive Care (MIMIC-III, version 1.3) data set (Johnson et al., 2016). MIMIC-III consists of EHR data from the ICU visits of over 40,000 patients, collected from a large, tertiary care hospital in Boston, Massachusetts, from 2001 to 2012. Each encounter included vital signs, medication information, laboratory measurements, and diagnostic codes, among other data. Because the MIMIC-III data are de-identified, this study qualified as a non-human-subjects study according to the definition of human subjects research put forth in 45 CFR 46, and was exempt from Institutional Review Board approval. Encounters were subjected to the inclusion procedure depicted in Figure S1. Encounters consisted of measurements of clinical variables, including demographic information, vital signs, and laboratory tests (Tables S1 and S2). The first step of the inclusion procedure removed encounters missing measurements for more than half of the vital signs and laboratory tests. Measurements were binned into 12-hour periods. Specifically, the measurements of each clinical variable within each 12-hour period were aggregated either by averaging the measurements (for most vital signs and laboratory tests) or by summing them (e.g., for urine output and drug and fluid administration). Missing measurements were imputed by carrying forward the last observation: if no aggregated measurement was available in a period, then the latest available observation of the patient's health state (prior to that time period) was applied. If values were missing for medications or treatments, it was assumed that none were administered or performed, and these values were imputed with zeros. Outliers were handled by capping the aggregated feature measurements at their 99.9th and 0.1st percentiles.
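The binning, aggregation, forward-fill, and outlier-capping steps can be sketched as follows. This is a minimal illustration under assumed column names (`encounter_id`, `hours_since_admission`) and a tiny hypothetical feature set, not the full variable list of Tables S1 and S2.

```python
import numpy as np
import pandas as pd

# Hypothetical feature groups: most vitals/labs are averaged per 12-hour bin,
# while cumulative quantities (e.g., urine output) are summed.
MEAN_FEATURES = ["heart_rate", "creatinine"]
SUM_FEATURES = ["urine_output"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Bin raw measurements into 12-hour periods, aggregate, impute, cap outliers."""
    df = df.copy()
    df["period"] = (df["hours_since_admission"] // 12).astype(int)

    # Aggregate within each (encounter, 12-hour period) bin.
    agg = {f: "mean" for f in MEAN_FEATURES}
    agg.update({f: "sum" for f in SUM_FEATURES})
    binned = df.groupby(["encounter_id", "period"]).agg(agg)

    # Carry the last observation forward within each encounter.
    binned = binned.groupby(level="encounter_id").ffill()

    # Cap aggregated values at their 0.1st and 99.9th percentiles.
    for col in MEAN_FEATURES + SUM_FEATURES:
        lo, hi = binned[col].quantile([0.001, 0.999])
        binned[col] = binned[col].clip(lo, hi)
    return binned
```

The remaining gaps (a variable never measured for a patient, or missing values at the start of a series) would then go to the iterative imputer described next.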
If any missing values remained (e.g., for patients with no measurement of a variable at all, or for the first observation of a given measurement time series), iterative imputation was performed using Bayesian ridge regression as the estimator (Pedregosa et al., 2011). We considered the problem of whether or not to discharge a patient from the ICU based on the information presently available regarding their condition (Table S2). The objective was to minimize a given cost function comprising two components:

• the daily cost of hospitalization and implied treatments; and

• the T-day expected rate of unsuccessful discharges, where typical values of T include 30, 60, and 90 days (Kreke et al., 2008).

We considered the discharge of a patient to be unsuccessful if the patient was readmitted to the ICU within T days of discharge, or if the patient died within T days of discharge. The cost function was intended to capture the trade-off between keeping a patient in intensive care to improve the odds of a successful discharge, at the expense of higher operating costs to the hospital, higher medical bills to the patient, and higher ICU bed occupancy. The latter may be undesirable, especially in the context of the COVID-19 pandemic. The relative weights of each component of the cost function are to be determined on a case-by-case basis, in such a way that the resulting discharge policy reflects the hospital's ICU management strategy. We formalized the problem using the notation of Bertsekas (2012). According to the model, discharge decisions were made every 12 hours. The discharge task was formulated as an infinite-horizon discounted MDP that captured the transitions of patients through discrete health "states" representing the condition of the patient in consecutive 12-hour time periods indexed by t ∈ {1, 2, . . . }.
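The unsuccessful-discharge definition can be expressed as a small labeling helper; the argument names and datetime representation here are illustrative assumptions rather than the paper's data schema.

```python
from datetime import datetime, timedelta
from typing import Optional


def discharge_outcome(discharge_time: datetime,
                      readmission_time: Optional[datetime],
                      death_time: Optional[datetime],
                      T: int = 30) -> str:
    """Label a discharge 'UD' (unsuccessful) if the patient was readmitted
    to the ICU or died within T days of discharge; otherwise 'SD' (successful)."""
    window = timedelta(days=T)
    readmitted = (readmission_time is not None
                  and readmission_time - discharge_time <= window)
    died = (death_time is not None
            and death_time - discharge_time <= window)
    return "UD" if (readmitted or died) else "SD"
```

Varying T (30, 60, or 90 days) changes which historical discharges count as unsuccessful and hence the estimated discharge-outcome probabilities.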
While only the T-day time window following discharge was considered, a patient could in principle be kept in the ICU indefinitely if their health condition was not seen to improve or if a high weight was placed on the readmission component of the cost function. This property is elaborated on in Section 6. The MDP state space, S, was the set of health states {1, 2, . . . , H, SD, UD}, where states 1, 2, . . . , H represented the patient's physiological condition during their ICU stay, and where the (absorbing) states SD and UD corresponded to successful and unsuccessful outcomes following the decision to discharge a patient. Denote the condition of a given patient in the t-th 12-hour time period of their stay by x_t ∈ S. In Section 4, we detail how states {1, 2, . . . , H} were learned from data and defined in terms of the information in Table S2. We remark that, while continuous state-space representations can retain more information about a patient (Raghu et al., 2017), their use can lead to computational tractability issues, which motivates the use of discrete representations (Kreke et al., 2008; Komorowski et al., 2018). The action space A consisted of K, the action to keep a patient in the ICU for at least one more time period, and D, the action to discharge them. The action u_t taken at the end of time period t depended on the health state x_t: if x_t ∈ {1, 2, . . . , H}, then u_t ∈ A, but if x_t ∈ {SD, UD}, then u_t = ∅ (no further action could be taken). Assumption 1. This framework does not consider any actions/decisions with respect to medical treatments, laboratory measurements, or other procedures received by the patients during the ICU stay. When estimating the transition probabilities P(x̃|x, K) from data, an implicit policy is learned regarding the procedures used to treat the patient's condition when the health state is x.
This treatment policy is tacitly followed every time that the proposed discharge policy recommends keeping the patient in care. Let P(x̃|x, u) denote the probability that a patient's health state is x̃ in the next time period, given that the current state is x and action u is taken. These probabilities were estimated using data corresponding to the health transitions observed for a large number of different patients during their stays in the ICU. By definition, if x ∈ {SD, UD}, then P(x̃|x, u) = 1 if and only if x̃ = x and u = ∅. Note also that a patient could not transition to SD or UD unless discharged, i.e., P(SD|x, K) = P(UD|x, K) = 0. We formulated the MDP problem in terms of cost minimization. Denote the immediate cost of taking action u in state x by g(x, u). If x ∈ {1, 2, . . . , H}, then g(x, K) was the cost of care for one time period. The cost g(UD, ∅), or simply g(UD), corresponded to a penalty for an unsuccessful discharge. A reward in the form of g(SD) < 0 reflected a favorable discharge outcome. As the choice of cost function g will vary across hospitals and according to ICU bed utilization preferences, the numerical results presented below explore the effects of different cost functions on the behavior of the optimal policy. Together, the states, actions, transition probabilities, and costs define an MDP. The objective of solving an MDP is to find a policy, a mapping from a patient's health state to a discharge decision at each decision epoch, that minimizes a measure of long-term, expected, discounted costs for a patient's stay in the ICU. While discounting in the context of this problem had no practical significance, time-discounted MDPs have been studied extensively and benefit from favorable convergence properties and multiple solution algorithms (Puterman, 2014). For an initial state x_0 and discount factor α ∈ (0, 1), we are interested in finding the policy π = {µ_0, µ_1, . . .
} in terms of the mappings µ_t : S → A, that minimizes the cost

J_π(x_0) = lim_{N→∞} E [ Σ_{t=0}^{N−1} α^t g(x_t, µ_t(x_t)) ]. (1)

The optimal policy, for some initial health state x_0, is the one attaining the least cost J*(x_0), given by

J*(x_0) = min_{π∈Π} J_π(x_0), (2)

where Π is the set of all admissible policies. A policy is said to be stationary if it has the form π = {µ, µ, . . . }, in which case π is simply referred to as µ. The reader is referred to Bertsekas (2012) and Puterman (2014) for a comprehensive discussion of MDPs.

4. Modeling patients' health states and dynamics

We assigned to each 12-hour time period (preceding discharge) of each patient's stay a health state in {1, 2, . . . , H}, on the basis of the information in Table S2 available during that period. This involved (i) determining the number H of health states to define and then (ii) defining each health state in terms of the clinical variables. Concerning (i), if too few health states are defined, then they will reflect the patient's condition poorly (Alagoz et al., 2010). However, if too many health states are defined, then the subsequent policy optimization step will not be computationally tractable, and there will not be enough data to accurately estimate the transition probabilities between them. Concerning (ii), unsupervised machine learning methods, specifically clustering algorithms, have been used to identify groups of patients with similar medical characteristics (Forte et al., 2021; El-Darzi et al., 2009; Alashwal et al., 2019; Komorowski et al., 2018). We employed the k-means clustering algorithm to identify health states, where k corresponds to the number of possible states, H. In brief, the algorithm seeks to partition the collection of data from each 12-hour time period in the training data into k clusters, such that each example belongs to the cluster with the nearest centroid, where the centroids are found by minimizing the intra-cluster distance. Formally, the objective was to solve

min_Y Σ_{i=1}^{k} Σ_{y∈Y_i} ||y − c_i||^2, (3)

where ||·|| denotes the Euclidean distance and y denotes the examples in the training set. The constituent sets of Y = {Y_1, . .
. , Y_k} are the clusters, and c_i is the centroid of the points of Y_i. Solving (3) directly is NP-hard, and thus heuristic algorithms, like Lloyd's algorithm, are used instead to iteratively find a local optimum of (3) and the corresponding cluster centroids. The training data were scaled using a min-max scaler prior to training the clustering algorithm. The number of possible health states H, equivalent to the number of clusters k, was chosen to be 400 according to the elbow method (Syakur et al., 2018) (Figure S2). This number of clusters was large enough to ensure separability between patients based on their distinct clinical features, but small enough that a sufficient number of transitions was observed in the training set to estimate the transition probabilities empirically. The identified clusters are shown in Figure 1. We observed a gradient of average in-hospital mortality rate across the health states visualized in Figure 1A, despite the fact that mortality information was not provided to the clustering algorithm, which is unsupervised by definition. This observation supports the validity of the cluster assignment as a reflection of the severity of the patient's condition. We observed an analogous gradient in the average 30-day readmission rates of the learned health states (Figure 1B). The high-mortality health states corresponded to the states with the lowest 30-day readmission rates, because the patients assigned to these states either died in the hospital, and therefore never had the chance to be readmitted, or were in such critical condition that their health must have improved before they could be discharged. Figure 1B indicates that the highest readmission rates typically came from patients in moderate-mortality clusters, perhaps discharged after their condition improved but before they had fully recuperated, which may have resulted in readmission.
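The health-state learning step can be sketched as below, assuming a matrix `X` of per-period feature vectors; in the paper k = H = 400, whereas the toy test here uses a small k for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def learn_health_states(X: np.ndarray, n_states: int, seed: int = 0):
    """Min-max scale the per-period feature vectors and cluster them with
    k-means (Lloyd's algorithm); each cluster index serves as a discrete
    health state in {0, ..., n_states - 1}."""
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    km = KMeans(n_clusters=n_states, n_init=10, random_state=seed)
    states = km.fit_predict(X_scaled)
    return states, km, scaler
```

New 12-hour periods are assigned a state by scaling with the fitted `scaler` and calling `km.predict`, which picks the nearest learned centroid.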
An example time series of 12-hour periods of EHR data and the associated health states is shown in Figure 2. The bold vertical line indicates the time period at which the optimal policy recommended discharge (details on how this policy is computed are presented shortly), and the grey-shaded region corresponds to the continuation of the stay under the clinician policy. After determining the 400 health states, we estimated the transition probabilities between health states. For example, the transition probability P(x̃|x, K) was estimated by counting all observations in the training set in which a transition from state x to state x̃ occurred, and dividing the count by the total number of observations of state x for which the patient was not discharged. To hedge against discharging patients with very small readmission rates but high mortality rates, P(UD|x, D) was estimated by also taking into consideration instances of in-hospital mortality. That is, P(UD|x, D) was estimated by counting three events: (1) the number of 30-day readmissions when patients were discharged with health state x, (2) the number of 30-day out-of-hospital mortality instances when patients were discharged with health state x, and (3) the number of in-hospital deaths for patients with last observed health state x. The sum of these three counts was divided by the total number of discharges plus the number of in-hospital deaths observed for patients with health state x. Similarly, P(SD|x, D) was estimated as 1 − P(UD|x, D), as UD and SD are the only two possible outcomes when a patient is discharged. We evaluated the quality of the estimated transition probability matrices using the observations in the hold-out testing data set.
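The empirical counting estimator for the keep-action transition matrix can be sketched as follows. This minimal version covers only P(x̃|x, K); the discharge-outcome probabilities P(UD|x, D) and P(SD|x, D) would additionally fold in the readmission, out-of-hospital mortality, and in-hospital death counts described above.

```python
import numpy as np


def estimate_keep_transitions(trajectories, n_states: int) -> np.ndarray:
    """Empirically estimate P(x_next | x, K) from in-ICU state sequences:
    count observed x -> x_next transitions and normalize each row."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:  # each traj: list of health states while kept
        for x, x_next in zip(traj, traj[1:]):
            counts[x, x_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observations are left as all-zero rather than NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        P = np.where(row_sums > 0, counts / row_sums, 0.0)
    return P
```

Each row of the returned matrix sums to one wherever state x was observed at least once without a discharge.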
Given the number of transitions contained in the test set, we used the estimates of P(x̃|x, K), P(SD|x, D), and P(UD|x, D) to compute the expected number of observations of all possible transitions (e.g., from state x to state x̃ if the patient was kept, or from x to SD or UD if they were discharged). We compared this expected number against the number of transitions observed in the testing set to assess the goodness of fit of the empirical transition matrices. The validation results for all three transition probability matrices are shown in Figure 3. The estimated transition probabilities generalized well to the testing set, as indicated by the high coefficient of determination computed for each transition matrix. Table S3 lists the median R^2 and interquartile range obtained from repeating the clustering and validation process with 50 random train-test splits of the entire data set. We further validated the estimated transition probability matrix P(x̃|x, K) by simulating the first exit time from each state, as in Komorowski et al. (2018). For each initial health state x, we conducted 500 random simulations of the dynamics to observe the length of time spent in health state x before transitioning to a different health state x̃. The results, shown in Figure S3, confirmed that this time is roughly exponentially distributed for all states. An exponential fit of the times for each state resulted in a median R^2 coefficient of 0.988 and an interquartile range of (0.996-0.999). In the preceding sections, we defined the MDP components {S, A, P, g, α}, estimated the parameters from the training data, and validated the estimates using a hold-out test set. We now turn our focus to identifying the policy that minimizes (1). First, we state the assumptions and properties that are critical for the numerical algorithms we used to solve the MDP.
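The first-exit-time check can be sketched as follows: under the estimated Markov dynamics, the sojourn time in a state follows a geometric distribution, the discrete-time analogue of the exponential behavior observed in Figure S3. A minimal simulation sketch:

```python
import numpy as np


def simulate_exit_times(P: np.ndarray, state: int,
                        n_sims: int = 500, seed: int = 0) -> np.ndarray:
    """Simulate how many 12-hour periods the chain stays in `state` before
    first transitioning to a different state, repeated n_sims times."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    times = []
    for _ in range(n_sims):
        t, x = 0, state
        while x == state:
            t += 1
            x = rng.choice(n, p=P[x])  # one step of the estimated dynamics
        times.append(t)
    return np.array(times)
```

With self-transition probability p, the mean sojourn time is 1/(1 − p), and the empirical distribution of simulated times can be fit against an exponential/geometric curve to produce the R^2 values reported above.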
The reader is referred to Puterman (2014) and Bertsekas (2012) for a comprehensive analysis of these properties, as well as of the ensuing computational aspects.

Assumption 2 ((Bertsekas, 2012, Assumption D)). The costs per stage g(x, u) are bounded, that is, |g(x, u)| < M for all (x, u) ∈ S × A, where M is some scalar, and the discounting factor satisfies 0 < α < 1.

Assumption 2 is used to derive necessary and sufficient conditions for the optimality of a policy and the convergence of numerical methods (Bertsekas, 2012). It holds because the rewards and penalties for successful and unsuccessful discharges, respectively, are finite, because S and A are finite sets, and because the costs associated with hospitalization are bounded. The assumption that α ∈ (0, 1) ensures that the series in (1) converges.

Proposition 1 ((Bertsekas, 2012, Proposition 1.2.2)). The optimal cost function J* satisfies Bellman's optimality equations, given by J* = T J*, where the mapping T is given by

(T J)(x) = min_{u ∈ A} [ g(x, u) + α Σ_{x̃ ∈ S} P(x̃|x, u) J(x̃) ],  x ∈ S. (4)

Furthermore, J* is unique within the class of all bounded functions.

Proposition 1 establishes that the optimal cost function is the fixed point of the mapping T defined in (4). Bellman's optimality equation for a health state x ∈ {1, . . . , H} is given by

J*(x) = min{ g(x, K) + α Σ_{x̃} P(x̃|x, K) J*(x̃),  g(x, D) + α [ P(SD|x, D) J*(SD) + P(UD|x, D) J*(UD) ] }.

Bellman's equations associated with the post-discharge absorbing states x ∈ {SD, UD} are given by

J*(x) = g(x) + α J*(x),  that is,  J*(x) = g(x)/(1 − α).

For every stationary policy µ, the associated cost function satisfies J µ = T µ J µ, where the mapping T µ is given by

(T µ J)(x) = g(x, µ(x)) + α Σ_{x̃ ∈ S} P(x̃|x, µ(x)) J(x̃),  x ∈ S.

Based on these results, we provide the conditions under which the existence and optimality of the discharge policy are guaranteed, which is critical for understanding the numerical solution schemes discussed subsequently in Section 5.2.

Proposition 2 ((Bertsekas, 2012, Proposition 1.2.3)). A stationary policy µ is optimal if and only if µ(x) attains the minimum in Bellman's equations J* = T J* for each x ∈ S, i.e., T J* = T µ J*.

Theorem 1 ((Puterman, 2014, Theorem 6.2.10)). 
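The fixed-point characterization in Proposition 1 can be illustrated numerically: iterating the Bellman operator T from any bounded starting point converges to J* under Assumption 2. The sketch below assumes a generic finite MDP with stage costs g[x, u] and per-action transition matrices P[u]; the data layout is our own choice, not the paper's:

```python
import numpy as np

def bellman_operator(J, P, g, alpha):
    """(T J)(x) = min_u [ g(x, u) + alpha * sum_x' P[u][x, x'] J(x') ]."""
    Q = np.stack([g[:, u] + alpha * P[u] @ J for u in range(g.shape[1])],
                 axis=1)  # Q[x, u]: cost-to-go of taking u in x
    return Q.min(axis=1)

def value_iteration(P, g, alpha, tol=1e-10):
    """Iterate T to its fixed point J* (contraction under 0 < alpha < 1)."""
    J = np.zeros(g.shape[0])
    while True:
        J_new = bellman_operator(J, P, g, alpha)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
```

On a toy two-state, two-action MDP one can verify T J* = J* to numerical precision, in line with Proposition 1.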
Assume that S is discrete and that A is finite for each x ∈ S; then there exists an optimal deterministic stationary policy that solves the optimality equation J* = T J*.

Theorem 1 establishes the existence of an optimal discharge policy, while Proposition 2 gives a way to identify it in terms of Bellman's equations. We proceed to discuss the algorithm we used to compute the optimal policy. We used policy iteration (Bertsekas, 2012) to find a numerical solution to the discounted cost problem (1). We chose it because the finiteness of the state and action spaces, together with the satisfaction of Assumption 2, guarantees that policy iteration terminates in a finite number of steps (Bertsekas, 2012). Because the number of states was not particularly large, we did not expect computational complexity to be an obstacle, and thus preferred it over other approaches, such as value iteration. The premise of the policy iteration algorithm is to sequentially compute stationary policies using Bellman's equations, such that the cost of each policy is lower relative to the preceding one. The steps involved in policy iteration are outlined in Algorithm 1.

Algorithm 1: Policy iteration (Bertsekas, 2012)
Inputs: MDP elements {S, A, P, g, α}, initial policy guess µ 0
for k ∈ {0, 1, 2, . . .} do
    Policy evaluation: compute J µ k by solving (I − α P µ k ) J µ k = g µ k
    Policy improvement: compute µ k+1 such that T µ k+1 J µ k = T J µ k
    if J µ k+1 = J µ k , terminate and return µ k
end

The policy evaluation step involves solving a system of linear equations to compute the value function J µ k for the current best available policy µ k . P µ k and g µ k denote, respectively, the transition probabilities and the cost per stage evaluated for the current policy, and I denotes the identity matrix of appropriate dimensions. We note that policy evaluation is susceptible to the curse of dimensionality, as the problem size increases with the number of state variables, and other numerical algorithms and effective approximations (e.g., De Farias and Van Roy, 2003) might be preferable for MDPs with larger state spaces. 
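The steps of Algorithm 1 can be sketched in Python as follows, with exact policy evaluation by a linear solve and greedy policy improvement. The array layout and function names are our own assumptions, not the authors' implementation:

```python
import numpy as np

def policy_iteration(P, g, alpha, mu0=None):
    """Policy iteration for a finite discounted-cost MDP.
    P[u]: transition matrix under action u; g[x, u]: cost per stage."""
    n, m = g.shape
    mu = np.zeros(n, dtype=int) if mu0 is None else np.array(mu0)
    while True:
        # Policy evaluation: solve (I - alpha * P_mu) J = g_mu exactly.
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: greedy action w.r.t. the current value function.
        Q = np.stack([g[:, u] + alpha * P[u] @ J for u in range(m)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return mu, J
        mu = mu_new
```

Because each improvement step strictly lowers the policy cost and there are finitely many deterministic stationary policies, the loop terminates in finitely many iterations, as noted above.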
Once the current value function has been computed, policy improvement is performed by applying the dynamic programming mapping T defined in (4) to the value function J to compute the subsequent, lower-cost policy. The algorithm terminates once the value function no longer improves, which implies convergence to a fixed point and to the optimal policy (Proposition 2). The policy learning framework presented in this work uses a clinician policy (µ CP ) implicitly represented in the training and testing data to, respectively, learn and estimate the value of a different policy (µ OP ), which is optimal with respect to the given cost function. Because the execution of a bad policy can be costly and even dangerous in practical settings, particularly in ICUs, where erroneous discharge decisions can be fatal, it is important to establish the performance of µ OP off-line using an available testing set of trajectories D = {H 1 , H 2 , . . . , H n }, where a trajectory H of length L is a state-action history {(x 0 , u 0 ), (x 1 , u 1 ), . . . , (x L−1 , u L−1 )} generated by following the clinician's policy, H ∼ µ CP . Approaches for off-policy evaluation are an active topic of research in the context of related reinforcement learning frameworks (Hanna et al., 2017; Thomas et al., 2015; Festor et al., 2021) . It should be noted that the factors driving µ CP may differ from the cost function chosen to derive µ OP , which is why effective calibration is important to objectively compare the two policies. We employed the model-based bootstrapping algorithm proposed in (Hanna et al., 2017 , Algorithm 1) to estimate the value (1) of µ OP using trajectories in the hold-out test set that follow µ CP . Specifically, given a policy µ OP i for i ∈ {1, 2, . . . , 50} (each i corresponds to a random train-test split of the data and the resulting clustering solution), the cost of the policy is estimated by following the test trajectories D i reflecting µ CP until a deviation between the policies is observed: either µ OP i discharges the patient earlier than µ CP , or µ CP discharges the patient while µ OP i keeps the patient in care. In these instances, model-based off-policy evaluation (Hanna et al., 2017) is performed by simulating future trajectories following µ OP i and the empirical transition probabilities estimated using D i (not the transition probabilities estimated using the training data) to estimate the expected policy costs. If the empirical transition probabilities generalize well to unseen data sets, then this approach reduces the variance of the estimate, at the expense of introducing bias resulting from, e.g., data sparsity. The average cost over all trajectories in D i was used to estimate the value of µ OP i . Note that the costs associated with µ CP are estimated directly by following the trajectories in D. To obtain a distribution for the cost associated with a policy µ OP , we performed bootstrapping in combination with the aforementioned model-based off-policy estimation approach. The 95% confidence upper bound (95% UB) of the cost of µ OP i was determined as proposed in (Hanna et al., 2017) , by using bootstrapping with replacement to generate B trajectory sets {D̃ i j , j ∈ {1, 2, . . . , B}}, where D̃ i j = {H j 1 , H j 2 , . . . , H j n } and H j k ∼ U(D i ), with U(·) denoting the uniform distribution. Similarly, the 95% confidence lower bound (95% LB) corresponding to the implicit clinician policy µ CP was determined using the same bootstrapped samples. The upper and lower bounds for each policy served to ensure that the worst-case realization of µ OP still performed better than the best-case realization of µ CP for a number of trained models. 
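The model-based simulation step described above can be sketched as a Monte Carlo rollout from the estimated transition model. The cost parameters, state layout, and function name below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rollout_cost(P_keep, p_ud, policy, x0, g_keep=1.0, g_ud=30.0,
                 alpha=0.95, horizon=200, rng=None):
    """Discounted cost of one simulated trajectory of a discharge policy,
    sampled from the *estimated* transition model (model-based OPE).
    policy[x] == 1 means discharge in state x; 0 means keep in care."""
    rng = np.random.default_rng(rng)
    cost, disc, x = 0.0, 1.0, x0
    for _ in range(horizon):
        if policy[x] == 1:
            # Terminal outcome: unsuccessful discharge with prob p_ud[x].
            if rng.random() < p_ud[x]:
                cost += disc * g_ud
            return cost
        cost += disc * g_keep                      # one more period in care
        x = rng.choice(len(p_ud), p=P_keep[x])     # sample next health state
        disc *= alpha
    return cost
```

Averaging `rollout_cost` over many simulated trajectories per test encounter yields the expected-cost estimate whose variance the model-based approach reduces, at the price of model bias.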
The UB and LB were obtained by computing the 97.5 and 2.5 percentiles (denoted by Q 97.5 and Q 2.5 ) of the bootstrapped policy cost estimates. This off-policy evaluation procedure was previously demonstrated for clinical decision support tools in the context of sepsis treatment (Komorowski et al., 2018) . Algorithm 2 presents this evaluation procedure in detail, and numerical results pertaining to off-policy evaluation are discussed below.

Algorithm 2: Model-based off-policy evaluation, performed for all policies µ OP i , i ∈ {1, 2, . . . , 50} (Hanna et al., 2017)
Inputs: Evaluation policy µ OP i , set of test trajectories D i , required number of bootstrap estimates B, confidence level δ ∈ [0, 1]
Estimate transition matrices P̂ i from D i
for j ∈ {1, 2, . . . , B} do
    Sample a bootstrapped trajectory set D̃ i j from D i uniformly with replacement
    Estimate the cost of the evaluation policy on D̃ i j by simulating trajectories from P̂ i
end
Return the Q 100(1−δ/2) and Q 100(δ/2) percentiles of the B bootstrapped cost estimates

In addition to comparing µ OP and µ CP against one another, we evaluated their performance against policies that make discharge decisions at random. The first random policy, µ RP 1 , selects the discharge action D or the keep action K with equal probability 1/2 at every time period and in every state. In words, under µ RP 1 , a patient was discharged or kept in care uniformly at random by flipping a fair coin at each time period. This random policy was used to compare the costs relative to µ OP and µ CP , and was expected to result in higher expected discounted costs. Since the proposed policy µ OP was intended to reduce the number of ICU readmissions and out-of-hospital mortality instances (unsuccessful discharges) by keeping and treating patients in the ICU longer, we wanted to assess how effectively µ OP selected which patients to discharge and which ones to keep in care. To this end, we considered a second, pseudo-random policy, denoted by µ RP 2 , that operated as follows: given an encounter trajectory H that implicitly follows µ CP , µ RP 2 determines whether or not to extend the patient's length of stay at random with probability γ (i.e., by flipping a biased coin). 
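The percentile bootstrap bounds described above can be sketched as follows, resampling per-trajectory cost estimates with replacement (the function name and interface are illustrative assumptions):

```python
import numpy as np

def bootstrap_bounds(costs, n_boot=2000, rng=None):
    """95% percentile-bootstrap bounds (Q 2.5, Q 97.5) on the mean policy
    cost, resampling the per-trajectory costs with replacement."""
    rng = np.random.default_rng(rng)
    costs = np.asarray(costs, dtype=float)
    means = np.array([
        rng.choice(costs, size=len(costs), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(means, 2.5), np.percentile(means, 97.5)
```

Comparing the upper bound for µ OP against the lower bound for µ CP, as done in Figure 4A, then gives a conservative check that the learned policy outperforms the clinician baseline.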
To compare against µ OP i , γ was set to the fraction of patients kept in care by µ OP i relative to all the patients observed in the test set D i . Therefore, by definition, for a given test set of trajectories D i , µ RP 2 discharged the same number of patients as µ OP i , but without considering their health states when making the discharge decision. A summary of the four policies considered in this work is displayed in Table 1.

Table 1: Summary of the policies considered in this work.
Policy    Description
µ OP      Optimal policy computed via Algorithm 1 for a given cost function
µ CP      Clinician policy implicit in the training and testing data
µ RP 1    Purely random policy for arbitrary discharges at any time
µ RP 2    Pseudo-random policy with the same net discharges as µ OP

To assess the effectiveness of the proposed data-driven policy, the computed discharge policies were evaluated against the clinician policy observed in the hold-out test set, as well as against the previously described random policies. The test set trajectories exclusively consisted of encounters in which the clinician was observed to discharge a patient, excluding in-hospital death instances. For the subsequent numerical studies, 50 different clustering solutions were computed using different random train-test splits of the entire data set. We compare the policies on the basis of their resulting expected discounted costs in Section 6.1, and evaluate their performance in terms of the ICU management outcomes of interest (e.g., fraction of unsuccessful discharges and average patient length of stay) in Section 6.2. We discuss the interpretability of the proposed policy in relation to the clinician policy in Section 6.3. 
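The two random baselines summarized in Table 1 can be sketched in a few lines (function names are our own; this is an illustration of the coin-flip constructions, not the authors' code):

```python
import numpy as np

def mu_rp1(rng):
    """mu RP 1: discharge (D) or keep (K) uniformly at random (fair coin),
    independently of the patient's health state."""
    return "D" if rng.random() < 0.5 else "K"

def mu_rp2_mask(n_encounters, gamma, rng):
    """mu RP 2: for each clinician-discharged encounter, extend the stay
    with probability gamma, chosen so the expected number of discharges
    matches that of the learned policy.  True = discharge as mu CP did."""
    return rng.random(n_encounters) >= gamma
```

Setting `gamma` to the fraction of patients kept in care by µ OP i makes µ RP 2 a matched-volume control: any advantage of µ OP over µ RP 2 must come from *which* patients it keeps, not *how many*.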
For a fixed cost function that is well calibrated, meaning that it results in high costs for patients with high 30-day out-of-hospital mortality and readmission rates, as well as in low costs for patients that have a high rate of successful discharge (Figure S4), it is important to establish the statistical significance of the performance difference between µ OP and µ CP on the test set trajectories D. The results corresponding to the model-based off-policy evaluation procedure discussed in Section 5.3 are displayed in Figures 4A and 4B . The policy cost estimates for µ OP for each individual trajectory were computed using 100 random simulations by sampling the transition matrices, and 2000 bootstrapped samples were used to estimate the associated cost distributions (and associated confidence bounds) of µ OP and µ CP . Figure 4A shows that the 95% UB on the costs of µ OP is consistently and substantially lower than the 95% LB on the costs of µ CP . The difference in performance between the two policies grew with the number of models trained. The distribution of estimated costs for µ OP and µ CP is shown in Figure 4B . Furthermore, we used off-policy evaluation to demonstrate that the performance of the learned policies is, to some extent, robust to the number of discrete health states or clusters used to represent a patient's physiological condition. The off-policy evaluation results corresponding to policies computed using 350, 400, and 450 clusters are shown in Table S4 . The policies associated with each of the 50 clustering solutions were computed via policy iteration for a range of different cost functions, which were determined by varying the penalty associated with unsuccessful discharges, g(UD), while keeping all else constant (that is, g(x, K) = 1, g(x, D) = 0, g(SD) = 0, α = 0.95). Naturally, the proposed policy µ OP discharged at most as many ICU stays as were observed in each of the hold-out test sets. 
Nonetheless, depending on the cost function of choice, µ OP may recommend that a fraction of the patients be kept in care until their health states are seen to improve (i.e., until the patient transitions to a state having a lower probability of unsuccessful discharge), thus leading to a lower total number of discharges than observed in the test set. To confirm that the improvements stemming from µ OP were not a mere result of reducing the number of patients discharged, the pseudo-random policy µ RP 2 was used as a benchmark. Results comparing µ OP and µ RP 2 against µ CP for a range of g(UD) values are shown in Figure 5 . In the extreme case when g(UD) approaches zero, meaning there is no penalty associated with unsuccessful discharges, µ OP discharged all patients in order to minimize the costs associated with hospitalization. Conversely, increasing g(UD) resulted in a greater fraction of patients kept in care under µ OP (and, by definition, also under µ RP 2 ) relative to µ CP , at the expense of a longer length of stay. Since the discharged encounters are sampled at random from the test set under µ RP 2 , the resulting percentage of patients with UD aligned on average with the clinician policy (the average and standard deviation of the p-values over all instances of g(UD) were 0.67 and 0.28, respectively). On the other hand, for the optimal policy, the fraction of unsuccessful discharges consistently decreased relative to µ CP and µ RP 2 , indicating that the proposed framework effectively identified patients at high risk of being unsuccessfully discharged. The higher variance observed in the fraction of UD corresponding to large values of g(UD) for both µ OP and µ RP 2 is explained by the small sample of discharged patients used to compute the corresponding mean. In addition to reducing the number of unsuccessful discharges, µ OP can identify patients that can be safely discharged at a much earlier stage of their ICU stay than µ CP . 
For low values of g(UD), µ OP is expected to favor discharges that minimize hospitalization costs, thus reducing the patients' length of stay. This property of µ OP is demonstrated in Figure 6 , which shows the resulting fraction of UD and the average length of stay for the same range of values of g(UD) used in Figure 5 . For instances in which µ OP recommended delayed discharge relative to µ CP , 100 transitions were simulated from the last observed health state to compute the expected length of stay (which is naturally higher than that of µ CP ). The corresponding results for µ RP 2 relative to µ CP are not shown, since the timing of discharge under µ RP 2 was the same as that under µ CP for the patients it discharged, and thus µ RP 2 had an average expected length of stay greater than or equal to the one observed for µ CP . Interestingly, Figure 6 suggests that µ OP obtained using g(UD) ∼ 2 reduced the mean expected length of stay by 0.88 days (p-value < 0.001) while also reducing the mean percentage of unsuccessful discharges by 2.00% (p-value < 0.001), relative to the mean values observed for µ CP . Setting g(UD) ∼ 3 (corresponding to the cost function chosen for the results in Section 6.1) yielded an increase in the mean patient length of stay of 2.4 days (p-value < 0.001) and a reduction in the mean percentage of unsuccessful discharges of 2.83% (p-value < 0.001). From Figure 6 it is clear that the two objectives (i.e., the costs associated with hospitalization and with unsuccessful discharges) are conflicting, since reducing the number of patients that are readmitted or deceased after discharge can only be accomplished by increasing their length of stay (and the implied medical treatment) in the ICU. The parameter g(UD) can be used by hospital management and clinicians to express their preferences over the two cost factors, reflecting aspects such as hospital capacity and resource availability at a given period in time. 
Despite the conflicting objectives, these results show that µ OP improved upon the clinician policy with respect to both metrics on the hold-out test sets. To better understand the differences in the factors that drive the clinician and proposed policies, classifiers were trained to predict the discharge decisions associated with each policy based on the full-dimensional data (i.e., using the complete patient EHR without any clustering or dimensionality reduction). The binary classification task at hand used the full set of clinical and demographic features observed at a point in time to predict whether or not a patient was discharged by a given policy. A gradient-boosted decision trees classifier (maximum tree depth of 4 and 100 estimators) was used to predict the discharge outcome for each policy on the same training set used to perform the clustering that generates the patient's health state. The receiver operating characteristic (ROC) curves for the classifiers and the corresponding areas under the curve (AUC) for the two policies are shown in Figure S6 , indicating that both policies can be predicted with high performance (median AUC of 0.937 and 0.920 for µ OP and µ CP , respectively) using the full-dimensional data. The feature importances corresponding to the mean decrease in impurity (MDI), or Gini importance, for each of the classifiers are shown in Figure 7 , which serves to shed light on the main factors considered by each policy.

Figure 7: Average feature importances corresponding to classifiers trained to predict µ OP and µ CP , for all 50 train-test splits with associated clustering models and optimal policies identified using g(UD) = 3. Error bars correspond to the feature importance standard deviation over the 50 train-test splits.

The average feature importances shown in Figure 7 confirm that the discharge decisions corresponding to µ OP are clinically interpretable and rely on sensible clinical features. 
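The classifier-based interpretability analysis can be sketched as follows, using synthetic stand-in features in place of the real EHR data (which is not public); only the model settings (depth 4, 100 estimators, MDI importances) come from the text:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-in for full-dimensional EHR features and binary
# discharge decisions; feature 0 is made informative by construction.
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

# Settings from the paper: max depth 4, 100 estimators.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=4).fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
# Mean-decrease-in-impurity (Gini) importances, as plotted in Figure 7.
importances = clf.feature_importances_
```

In the paper, one such classifier is fit per policy (µ OP and µ CP), and comparing their `feature_importances_` vectors is what reveals the overlapping, but broader, feature set used by the learned policy.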
In particular, several of the most important features used in predicting the clinician policy also have high importance in predicting the proposed optimal policy, which further supports its interpretability. Nonetheless, the feature importances corresponding to µ OP reveal that the proposed policy employs a much larger set of relevant clinical features in recommending patient discharge than µ CP . This property enables µ OP to discern patients that are at high risk of being unsuccessfully discharged, thus leading to an overall improved policy relative to the clinician policy and the random policies used as benchmarks. An inherent limitation of the proposed policy learning framework is that the MDP's Markov property may prevent leveraging a patient's past clinical history when making the discharge decision. In particular, we note that µ OP may result in discharging patients that show a brief, temporary improvement in their health condition but later exhibit a rapid deterioration and in-hospital death. These results are shown in Figure S5 , which evaluates µ OP on a hold-out test set with in-hospital death (IHD) instances. While, at the moment of discharge, future IHD and non-IHD instances are almost indistinguishable based on their respective mortality risks, it is evident that: (i) patients with IHD were observed to be in clusters with higher mortality rates than non-IHD patients prior to µ OP recommending discharge, and (ii) the transitions following the µ OP discharge recommendation also correspond to clusters with much higher mortality rates. These observations suggest that µ OP can be employed in practical settings in combination with the clinician's assessment of the patient's clinical history, and with a "wait-and-see" period in which the patient is kept in care for a brief observation period. From a modeling perspective, it would be interesting to explore higher-order Markov models, which consider explicit dependencies on prior system states. 
Additional features reflecting prior health measurements could also be incorporated into the state clustering procedure to implicitly include a patient's clinical history. Moreover, based on the structural elements introduced in Section 3.2, the resulting MDP is time-homogeneous with respect to the transition probabilities and the costs per stage (i.e., these components do not vary in time). Certain practical settings, in which the transition probabilities might be time-varying depending on a patient's length of stay, would require formulating time-inhomogeneous MDPs. These models have been much less studied in the literature relative to their time-homogeneous counterparts, and their application to clinical decision-making problems is an interesting direction for future research. We note that, despite the proposed framework's superior performance relative to clinician discharge decisions, the resulting policy is still a proof of concept and does not constitute a medical device that is ready for deployment. The proposed framework should be further validated prospectively on observational data sets collected from multiple different hospitals and in a variety of clinical settings. For clinical use, extensive clinical trials must be performed to evaluate this work in real time, in closed loop, and interactively with ICU patients (Komorowski et al., 2018). Furthermore, it is clear that the implementation of clinical decision support tools is hindered by several factors, including their compatibility and interoperability with current EHR systems, inconsistency and hospital-specific practices in the recording and storing of patient data, and concerns of eventual over-reliance on models over medical expertise (Sutton et al., 2020). 
Given these limitations, it is evident that, in addition to the required validation studies, the proposed tool must always be used in conjunction with physicians' subjective judgement about patient discharge decisions and treatment strategies. The development of clinical decision support tools can play a significant role in improving the outcomes of ICU patient discharges, as well as in improving overall hospital management efficiency. In this paper, we presented an end-to-end framework for modeling patient health states and subsequently deriving data-driven policies for optimal patient discharge recommendation. The proposed policies balance competing objectives relating to hospitalization costs and penalties associated with unsuccessful discharges (i.e., T-day ICU readmission or out-of-hospital mortality). Extensive numerical experiments performed on real-life ICU patient EHR data demonstrated that the proposed policy reduced both the rate of readmission and the average patient length of stay for certain cost parameters. Furthermore, the off-policy evaluation strategies showed that the proposed policy consistently outperformed the clinician and random policy benchmarks. At the same time, the policy was shown to be clinically interpretable, an auspicious property for its deployment in practical settings.

Supporting Information: Optimal discharge of patients from intensive care via a data-driven policy learning framework

3. Additional numerical results for policy assessment

Figure S4: Policy calibration showing the rate of unsuccessful discharge related to the costs incurred by the clinician policy using g(UD).

Figure S5: (a) Average observed cluster mortality as a function of time relative to the moment when µ OP recommends discharge at time t = 0, for instances in a single hold-out test set (error bars correspond to the standard deviation). 
µ OP was generated using cost parameter g(UD) = 3. (b) Distribution of the time difference, in time periods, between the observed time of death and the timing of discharge by µ OP (the vertical black line corresponds to the sample median).

Figure S6: Receiver operating characteristic (ROC) curves for all 50 models developed to predict µ OP and µ CP .

References
Markov decision processes: a tool for sequential decision making under uncertainty
The application of unsupervised clustering methods to Alzheimer's disease
How the COVID-19 pandemic will change the future of critical care
The effect of budgetary restrictions on breast cancer diagnostic decisions
Readmissions and death after ICU discharge: development and validation of two predictive models
Modified Early Warning Score as a predictor of intensive care unit readmission within 48 hours: a retrospective observational study
Artificial intelligence framework for simulating clinical decision-making: a Markov decision process approach
Dynamic programming and optimal control: Volume II
Predicting death and readmission after intensive care discharge
Optimizing intensive care unit discharge decisions with patient readmissions
Pursuing optimal prediction of discharge time in ICUs with machine learning methods
Validation of APACHE II, APACHE III and SAPS II scores in in-hospital and one year mortality prediction in a mixed intensive care unit in Poland: a cohort study
The linear programming approach to approximate dynamic programming
Optimizing the start time of statin therapy for patients with diabetes
Prediction of early unplanned intensive care unit readmission in a UK tertiary care hospital: a cross-sectional machine learning approach
Length of stay-based clustering methods for patient grouping, in: Intelligent patient management
Enabling risk-aware reinforcement learning for medical interventions through uncertainty decomposition
Identifying and characterizing high-risk clusters in a heterogeneous ICU population with deep embedded clustering
Individualization of pharmacological anemia management using reinforcement learning
Using an artificial neural networks (ANNs) model for prediction of intensive care unit (ICU) outcome and length of stay at hospital in traumatic patients
Prevalence and nature of financial considerations documented in narrative clinical records in intensive care units
Critical care medicine in the United States 2000-2005: an analysis of bed numbers, occupancy rates, payer mix, and costs
Bootstrapping with models: confidence intervals for off-policy evaluation
Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence
Application of machine learning in predicting hospital readmissions: a scoping review of the literature
Medicare hospital readmissions reduction program. Health Affairs Health Policy Brief
Machine-learning prediction of unplanned 30-day rehospitalization using the French hospital medico-administrative database
MIMIC-III, a freely accessible critical care database
Nurse-led discharge from high dependency unit
The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care
Modeling hospital discharge policies for patients with pneumonia-related sepsis
Neural network prediction of ICU length of stay following cardiac surgery based on preincision variables
Machine-learning-based hospital discharge predictions can support multidisciplinary rounds and decrease hospital length-of-stay
Predictive modeling for 14-day unplanned hospital readmission risk by using machine learning algorithms
Early prediction of ICU readmissions using classification algorithms
Predicting 7-day, 30-day and 60-day all-cause unplanned readmission: a case study of a Sydney hospital
Impact of intensive care unit readmissions on patient outcomes and the evaluation of the National Early Warning Score to prevent readmissions: literature review. JMIR Perioperative Medicine 3
Towards a decision support tool for intensive care discharge: machine learning algorithm development using electronic healthcare data from MIMIC-III and Bristol
A model to predict short-term death or readmission after intensive care unit discharge
Allocation of intensive care unit beds in periods of high demand
Scikit-learn: machine learning in Python
Markov decision processes: discrete stochastic dynamic programming
Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach
Predicting intensive care unit readmission with machine learning using electronic health record data
The use of artificial neural networks to stratify the length of stay of cardiac patients based on preoperative and initial postoperative factors
How many intensive care beds are enough?
Development and validation of a machine learning model to aid discharge processes for inpatient surgical care
Modeling medical treatment using Markov decision processes
The optimal time to initiate HIV therapy under ordered health states
Timing it right: balancing inpatient congestion vs. readmission risk at discharge
Markov decision processes for screening and treatment of chronic diseases
A scoping review of patient discharge from intensive care: opportunities and tools to improve care
An overview of clinical decision support systems: benefits, risks, and strategies for success
Integration k-means clustering method and elbow method for identification of the best customer profile cluster
Predicting discharge dates from the NICU using progress note data
High-confidence off-policy evaluation
Explainable machine learning on AmsterdamUMCdb for ICU discharge decision support: uniting intensivists and data scientists
Readmission rates after passage of the Hospital Readmissions Reduction Program: a pre-post analysis
Challenges in conducting long-term outcomes studies in critical care
Reinforcement learning in healthcare: a survey

Acknowledgments
We acknowledge the insightful comments provided by Dr. Jana Hoffman in writing the manuscript. All authors who have affiliations listed with Dascena (Houston, Texas, U.S.A.) are employees or contractors of Dascena.