BEDS-Bench: Behavior of EHR-models under Distributional Shift -- A Benchmark
Anand Avati, Martin Seneviratne, Emily Xue, Zhen Xu, Balaji Lakshminarayanan, Andrew M. Dai
2021-07-17

Abstract: Machine learning has recently demonstrated impressive progress in predictive accuracy across a wide array of tasks. Most ML approaches focus on generalization performance on unseen data that are similar to the training data (In-Distribution, or IND). However, real-world applications and deployments of ML rarely enjoy the comfort of encountering examples that are always IND. In such situations, most ML models commonly display erratic behavior on Out-of-Distribution (OOD) examples, such as assigning high confidence to wrong predictions, or vice versa. The implications of such unusual model behavior are further exacerbated in the healthcare setting, where patient health can potentially be put at risk. It is crucial to study the behavior and robustness properties of models under distributional shift, understand common failure modes, and take mitigation steps before the model is deployed. Having a benchmark that shines light upon these aspects of a model is a first and necessary step in addressing the issue. Recent work and interest in increasing model robustness in OOD settings have focused more on the image modality, while the Electronic Health Record (EHR) modality is still largely under-explored. We aim to bridge this gap by releasing BEDS-Bench, a benchmark for quantifying the behavior of ML models over EHR data under OOD settings. We use two open-access, de-identified EHR datasets to construct several OOD data settings to run tests on, and measure relevant metrics that characterize crucial aspects of a model's OOD behavior. We evaluate several learning algorithms under BEDS-Bench and find that, in general, all of them show poor generalization performance under distributional shift. Our results highlight the need and the potential to improve the robustness of EHR models under distributional shift, and BEDS-Bench provides one way to measure progress towards that goal.

Figure 1: Illustration of a scenario where a model can encounter data that is very dissimilar to the training data distribution.

Machine Learning models are typically validated on test sets that are similar to the training set. A common assumption in statistical learning theory is that all examples (both train and test) are drawn independently and identically from the same data distribution (IID). Though the IID assumption is a strong one, in practice it is hard to ascertain whether it is always being met. When an ML model is deployed in a real-world setting, the likelihood of encountering OOD inputs is far higher. In situations where an ML model is presented with OOD inputs, its behavior can be hard to describe theoretically, and tends to be unknown practically (Figure 1). The first step in fixing the behavior of models in OOD settings is to measure and quantify it with benchmarks. Benchmarks and datasets paint a target for the research community to focus and align on, thereby catalyzing the progress of the field (Deng et al., 2009; Dua and Graff, 2017). They also serve a crucial role as an objective measure of progress towards that goal. Yet, there is a lack of good benchmarks for studying the behavior of models on EHR data under OOD settings, which our work attempts to address.
Studying the behavior of EHR models under distributional shift is more than just a purely academic endeavour (Nestor et al., 2019). There are numerous real-world situations where a model may encounter patients who are systematically different from the training data for legitimate reasons. Some specific examples where the train and test distributions may differ include:

• Changes in the patient population: The demographics of a patient population may change over time due to gentrification of neighborhoods around a health system, maturing public health policies, global population dynamics, etc. Consider, for example, the rising proportion of female patients in the Veterans Affairs system. This may result in models encountering patients from a different distribution than the historical data on which the model was trained.

• Changes in the practice of medicine: The COVID-19 pandemic is an example of a dramatic shift in the field of medicine as a whole. It introduced major distributional shifts via changes in the patient population, but also changes in the practice of medicine, the therapies being used, and the operational processes of the hospital (e.g. due to resource shortages).

• Portability of models between health systems: There is increased sharing of pre-trained EHR models between hospital sites, with vendors offering pre-built models and academic consortia such as Observational Health Data Sciences and Informatics (OHDSI) enabling model portability via common data standards. While this is excellent for broadening the impact of machine learning and encouraging research reproducibility, it also increases the likelihood of training and deployment datasets being divergent due to differences in both populations and data formats.

When the behavior of an EHR model under distributional shift is unknown, there is a risk that predictions on OOD inputs might be wrong yet highly confident, thereby potentially increasing clinical risk for those patients. This is particularly important as EHR models start to be deployed in real-world clinical settings (Sendak et al., 2020). While OOD benchmarks have been extremely impactful in the imaging domain, creating an analogous EHR benchmark is challenging. First, privacy concerns make it hard to even get access to multiple large EHR datasets. In addition, EHR data is complex, heterogeneous, and highly site-specific. This makes it difficult to harmonize multiple EHR datasets in order to perform cross-site experiments that evaluate OOD behaviour. Furthermore, while benchmark tasks in the imaging domain are typically classification problems with readily available labels, EHR tasks are often less straightforward. Defining a task on EHR data necessarily involves nuanced data and temporal considerations: deciding a consistent prediction time for all examples (e.g. predicting onset of diabetes is meaningful only when the disease is not yet diagnosed), choosing a suitable time window and data sources from which features are extracted (a broad window makes for more accurate models, but reduces the population with sufficient data for the model to be applied to), determining a suitable representation for the extracted sparse and heterogeneous data (handling a mixture of real values, categorical values, ordinal values, timestamps, handwritten text, images, missing values, etc.), and assigning labels (e.g. how to accurately determine which patients actually have diabetes), among other challenges.
To this end, BEDS-Bench is a benchmark created using two open-access, de-identified EHR datasets. BEDS-Bench simulates OOD settings by creating intentionally dissimilar train and test sets, and measures several metrics of model performance in each of these settings (see Figure 2), for three common downstream classification tasks. The code for pre-processing and model evaluation is open-sourced, and we hope that these benchmarks are a useful resource for the EHR community to develop more rigorous methods to characterize OOD behaviour.

Summary of contributions. We summarize our contributions below:
1. We design an OOD benchmark on EHR data that includes suitable definitions of data partitions and splits, downstream tasks, and evaluation metrics.
2. We open-source the data pre-processing and model evaluation code.
3. We evaluate several algorithms on this benchmark and report on their performance.

The rest of the paper is organized as follows. In Section 2 we give an overview of related work on robustness to distributional shift as well as various benchmark efforts. In Section 3 we describe the BEDS-Bench benchmark in detail. Sections 4 and 5 describe the experiments and results, and we conclude with Section 6.

Nestor et al. (2019) used a timestamped version of the MIMIC-III dataset to demonstrate significant deterioration in model performance when EHR models were evaluated on data more recent than the training set. Their proposed mitigation strategy involved harmonizing features into clinical concept groupings. While pre-processing strategies can be effective, there is a complementary need for better model-based strategies for OOD detection and mitigation, motivating the present work. While there is a paucity of literature on EHR robustness evaluation, there has been some progress in the image modality, such as the ImageNet-C dataset (Hendrycks and Dietterich, 2019). Many image-based OOD works use multiple datasets, such as MNIST (LeCun and Cortes, 2010), ImageNet (Deng et al., 2009), SVHN (Netzer et al., 2011), etc., to conduct cross-dataset experiments that analyze model behavior (Nalisnick et al., 2019). There have also been methods developed to improve model calibration and robustness to OOD examples, though these works mostly focused their experiments and efficacy tests on images (Lakshminarayanan et al., 2017; Liu et al., 2020). The most closely related work to ours is a recent paper that evaluated several ML algorithms on their ability to detect OOD EHR inputs by assigning higher uncertainty to their outputs (Ulmer et al., 2020). Their focus is limited to OOD detection, while BEDS-Bench takes a more holistic view of model behaviour under distributional shift (described in Section 3.1). We discuss additional challenges of OOD detection under class imbalance and certain choices of uncertainty metrics in Section 5.

The BEDS-Bench tool generates a report on the behavior of models trained by a given learning algorithm under various types of distributional shift in the test data. The general approach taken by BEDS-Bench is to partition data in several ways into intentionally dissimilar subsets in order to artificially simulate IND vs OOD settings. Models are trained on the train split of a particular subset for one of the standardized tasks, and tested on the test splits of all the subsets while measuring relevant metrics. The test split corresponding to the subset on which the model was trained is considered IND, while test splits from the other subsets in the partition are considered OOD.
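To make this protocol concrete, the following is a minimal sketch of the evaluation loop. The helpers load_slice, train_model, and evaluate are hypothetical, passed in as arguments; they are illustrative stand-ins and not the released BEDS-Bench API.

```python
def run_benchmark(partitions, tasks, algorithm, load_slice, train_model, evaluate):
    """Cycle through every slice of every partition as the IND training set.

    partitions: dict mapping partition name -> list of slice names.
    load_slice / train_model / evaluate: caller-supplied (hypothetical) helpers.
    """
    results = []
    for partition_name, slices in partitions.items():
        for ind_slice in slices:
            for task in tasks:
                # Train on the train split of the IND slice.
                X_train, y_train = load_slice(ind_slice, split="train", task=task)
                model = train_model(algorithm, X_train, y_train)
                # Evaluate on the test split of every slice in the partition.
                for test_slice in slices:
                    X_test, y_test = load_slice(test_slice, split="test", task=task)
                    metrics = evaluate(model, X_test, y_test)  # e.g. AUC, ECE
                    results.append({
                        "partition": partition_name,
                        "train_slice": ind_slice,
                        "test_slice": test_slice,
                        "setting": "IND" if test_slice == ind_slice else "OOD",
                        "task": task,
                        **metrics,
                    })
    return results
```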
Figure 2 describes the workflow in one particular setting. This procedure is repeated by cycling through every subset of every partition as the IND set, with the other subsets in the partition treated as OOD. In the rest of this section we describe what an ideal model's behavior under OOD data looks like, and the details of the BEDS-Bench methodology, including descriptions of the datasets used, the partitions created, the tasks for which models are trained, and the metrics measured on the test sets.

Before designing a benchmark, it is crucial to first define what we consider the ideal behavior of a model. The tests and metrics of the benchmark then need to be chosen to shine light on these aspects of the model and enable objective comparison across multiple algorithms. The following notion of ideal model behavior informs the design of BEDS-Bench:

• Generalization: When a model is tested on a distribution that is different from the one it was trained on, it is possible, and understandable, for the model performance to drop to some extent. A common generalization metric (for classification tasks) is the Area Under the Receiver Operating Characteristic Curve (AUC), which measures the ability of a model to discriminate between two classes. The drop in generalization performance is likely to be larger for test distributions that are "farther" from the train distribution. Yet, an ideal model should have at least a minimal level of generalization robustness to OOD data, such as not performing worse than random guessing (i.e. maintaining AUC ≥ 0.5).

• Calibration: Calibration refers to the property that the probabilities output by a model agree with the observed empirical frequency of events. For example, among all days which had a rain forecast probability of 80%, approximately 8 out of 10 days should observe rain in the long run. Calibration is a property that is orthogonal to discrimination, and hypothetically it is possible to have models with any mix of levels of calibration and discrimination. An ideal model is well-calibrated in its predictions not only on IND data, but also on OOD data, especially when generalization on OOD data has worsened.

• Confidence: Closely related to the notion of calibration is confidence. Typically, the confidence of a prediction is measured with metrics such as predictive entropy or predictive variance. The larger the entropy or variance, the lower the confidence of that prediction. If an ideal model's OOD generalization performance is lower than its IND performance, then its confidence in the OOD predictions should be lower than in the IND predictions. While we do measure the ability of a model to discriminate OOD vs IND inputs by assigning lower confidence scores to OOD, we also emphasize that this test involves additional nuances that need to be considered before interpreting the results. We discuss this further in Section 5.
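As a concrete illustration of the confidence notion above, the entropy and variance of a predicted Bernoulli distribution can be computed directly from the predicted probability. This is a minimal sketch, not code from the benchmark itself:

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) prediction; maximal at p = 0.5."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def bernoulli_variance(p):
    """Variance of a Bernoulli(p) prediction; also maximal at p = 0.5."""
    return p * (1.0 - p)

# Both measures order predictions the same way: lower confidence near 0.5,
# higher confidence near 0 or 1.
probs = np.array([0.02, 0.35, 0.5, 0.8, 0.99])
print(bernoulli_entropy(probs))
print(bernoulli_variance(probs))
```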
To develop the benchmark, we make use of two open-access, de-identified EHR datasets: MIMIC-III (MIMIC) and the Paediatric Intensive Care Database (PICDB). The MIMIC dataset contains data on patients who were admitted to the ICU at Beth Israel Deaconess Medical Center between 2001 and 2012. The dataset covers 58,977 ICU stays of 46,520 patients. All the patients were either adults or neonates (newborn babies). The PICDB dataset was collected at the Children's Hospital of Zhejiang University School of Medicine between 2010 and 2018. It covers a total of 13,450 ICU stays of 12,811 patients, who were all minors (newborn up to 18 years of age).

An overview and comparison of the two datasets, including the types of data present in each dataset and their encoding formats, is presented in Table 1. Both datasets are represented as relational databases, with one comma-separated value (CSV) file per table. BEDS-Bench works by creating intentionally dissimilar subsets of data to simulate OOD settings. One natural setting is to consider model behavior when trained on data from MIMIC and tested on PICDB data, and vice versa. Conducting such cross-dataset experiments is quite common, and straightforward, with image data: the main considerations when harmonizing two image datasets are matching the resolution, channel count, bits per color, etc., which are all quite easily handled. Harmonizing two different sources of EHR data is far more involved, with careful consideration required in finding a common set of tables, vocabularies (to codify categorical data), units (to represent continuous data), representations of time, and other semantic reconciliations. The broad strategy we follow in harmonizing the two datasets is to identify a subset of tables, columns, and rows which can potentially be matched up, and exclude the remainder.

In PICDB the diagnostic codes are coded in the Chinese Edition of the International Classification of Diseases, Tenth Revision (ICD-10CN), whereas MIMIC uses the International Classification of Diseases, Ninth Revision (ICD-9). We perform a one-to-many mapping from ICD-10CN to ICD-9 using the Unified Medical Language System (UMLS) database (Bodenreider, 2004). For medication codes, we map both data sources to the RXCUI coding. MIMIC medications are coded in the National Drug Code (NDC), which maps uniquely to RXCUI. For PICDB we start with the textual descriptions of the medications and run them through the MedEx system to extract the RXCUI codes (Xu et al., 2010). While the laboratory tests are coded with custom codes in both datasets, some of the custom codes have an accompanying Logical Observation Identifiers Names and Codes (LOINC) code. We use the LOINC code as the common vocabulary and include only those rows for which the custom code has a corresponding LOINC code. MIMIC has a very rich representation of vitals and chart events; PICDB, on the other hand, has a total of nineteen vital and chart event types. We use the event type groupings from the MIMIC-Extract project to map a subset of the MIMIC chart event codes to the corresponding PICDB chart event codes (Wang et al., 2020). The Inputevents and Outputevents tables record the total volumes of different types of fluids that enter and exit the patient during the stay. While MIMIC records both the volumes and the types of the fluids, PICDB only records the volumes (without an associated fluid type). From a medical perspective, while knowing the type of fluid is certainly useful, just knowing the volume of fluids going in and out of the patient is informative in itself. Thus we exclude the fluid type codes from MIMIC and retain only the volume information for the purposes of harmonization. Table 7 in the Appendix summarizes the various code harmonization approaches that were applied.
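As an illustration of the diagnosis-code harmonization, a one-to-many ICD-10CN to ICD-9 mapping (such as one derived from UMLS) can be applied by expanding each source code into its target codes. The mapping values below are hypothetical stand-ins, not the actual UMLS-derived table:

```python
import pandas as pd

# Hypothetical one-to-many mapping derived from UMLS (illustrative values only).
ICD10CN_TO_ICD9 = {
    "J18.900": ["486"],             # pneumonia, unspecified (illustrative)
    "K35.902": ["540.0", "540.9"],  # acute appendicitis variants (illustrative)
}

def harmonize_diagnoses(diagnoses: pd.DataFrame) -> pd.DataFrame:
    """Map ICD-10CN codes to ICD-9, dropping rows with no known mapping."""
    out = diagnoses.copy()
    out["icd9_code"] = out["icd10cn_code"].map(ICD10CN_TO_ICD9)
    out = out.dropna(subset=["icd9_code"])   # exclude unmappable codes
    return out.explode("icd9_code")          # one row per target ICD-9 code

picdb_dx = pd.DataFrame({
    "hadm_id": [1, 2, 3],
    "icd10cn_code": ["J18.900", "K35.902", "Z99.999"],  # last has no mapping
})
print(harmonize_diagnoses(picdb_dx))
```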
In order to create a supervised learning dataset out of an EHR relational database, certain additional data processing steps are necessary. Each example in the supervised learning dataset corresponds to the data from one hospital admission. First, we exclude all hospital admissions that are shorter than 30 hours, and within those included, we use data up to the first 24 hours since admission. The additional 6-hour "gap" after the first 24 hours of data is common practice to avoid leaking label information into the covariates (Wang et al., 2020). Further, we only include the first admission of a patient, and exclude admissions after the first discharge, if any. Finally, the dataset is randomly divided into train and test splits (80% train, 20% test). The set of resulting tables after applying the inclusion and exclusion criteria (including both train and test), with their row and column counts, is summarized in Table 9 in the Appendix. This is the harmonized dataset from which the various experimental OOD settings are created.

The benchmark creates three different partitions of the data, each partition having between two and five slices. Within each partition, the slices are completely non-overlapping, and are characteristically different to varying degrees depending on the partition. The names and definitions (inclusion criteria) of each of the slices of all the partitions are in Table 2. The Demographics partition has three slices: MIMIC-adult, MIMIC-neonate, and PICDB-paed. The differences between the slices in this partition are somewhat stark. Not only are the differences between paediatric (especially neonate) and adult patients particularly pronounced, but the MIMIC vs PICDB slices present even more differences, including very distinct populations, health systems, and accompanying treatment practices. The Biological Sex partition separates the MIMIC dataset into Female and Male slices. Both EHR datasets codify sex as binary, and BEDS-Bench follows that convention. This partition intends to highlight model behavior under extreme cases of shift in the sex balance. The Ageing partition slices the adults into different age bands, representing progressively older patients with each band. The age ranges in years used to define the bands are (15, 50], (50, 60], (60, 70], (70, 80], and (80, ∞). It may be observed that some of the partitions have slices which are so blatantly dissimilar that it would sometimes be unreasonable to expect a model to ever generalize to such a distinctly different dataset, or even to consider such generalization goals as clinically relevant. Yet, we argue that these obviously-OOD settings are great examples of scenarios where any reasonably safe model would necessarily need to display some degree of robustness, and hence they make for good tests to include as part of an OOD benchmark suite. We also note that the distribution of race in the EHR datasets is quite skewed, with several races having too few examples to form a partition that includes slices for all races. After considerations of fairness and ethics, we look forward to finding additional EHR datasets that will allow us to construct a more inclusive race-based partitioning.

BEDS-Bench includes three supervised learning tasks to evaluate algorithms on: In-Hospital Mortality (Mort), Remaining Length-of-Stay > 3 days (LoS3+), and Remaining Length-of-Stay > 7 days (LoS7+). All three tasks are canonical EHR tasks widely explored in the literature (Wang et al., 2020; Rajkomar et al., 2018), and all are framed as binary classification, with names suggestive of their labels. The Mort task has a label of 1 only if the patient passed away during the hospital stay of that example. Even if the patient passed away soon after discharge or during a follow-up admission, the label remains 0. The LoS3+ (or LoS7+) task has a label of 1 only if the patient ends up having at least 3 (or 7) days of remaining time in their current stay.
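For illustration, the cohort inclusion rule and the three task labels could be derived from admission timestamps roughly as follows. This is a minimal sketch assuming "remaining" length of stay is measured from the 24-hour prediction time; the field names are hypothetical rather than taken from the released code.

```python
from datetime import timedelta

PREDICTION_OFFSET = timedelta(hours=24)   # data window used for features
MIN_STAY = timedelta(hours=30)            # 24h window + 6h gap

def make_labels(admit_time, discharge_time, died_in_hospital):
    """Return (include, mort, los3, los7) for one admission (hypothetical fields)."""
    stay = discharge_time - admit_time
    if stay < MIN_STAY:
        return False, None, None, None            # excluded from the cohort
    prediction_time = admit_time + PREDICTION_OFFSET
    remaining = discharge_time - prediction_time  # assumed "remaining" LoS
    mort = int(died_in_hospital)                  # Mort
    los3 = int(remaining >= timedelta(days=3))    # LoS3+
    los7 = int(remaining >= timedelta(days=7))    # LoS7+
    return True, mort, los3, los7
```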
The class balance varies significantly depending on the task and data slice. While the mortality rate of the MIMIC-neonate slice is as low as 0.5% (fortunately), the three-day length-of-stay rate for the PICDB-paed slice is as high as 91.2%. The class balances for each of the three tasks on all the slices, along with the number of examples in each slice, are listed in the Appendix (Table 8).

BEDS-Bench evaluates the performance of an algorithm in several test settings, as measured by several metrics. For notation, let n ∈ ℕ denote the number of examples, i the example index with 1 ≤ i ≤ n, y_i ∈ {0, 1} the label (correct answer) of the i-th example, and ŷ_i ∈ [0, 1] the probability predicted by a model for the i-th example.

• Task-AUC: the Area Under the Receiver Operating Characteristic (ROC) curve (AUROC), measured in the context of the model predicting the downstream task label (Mort, LoS3+, LoS7+).

• ECE: Expected Calibration Error. To define ECE, we first divide the probability range [0, 1] into K equal, non-overlapping intervals, each interval denoted I_k, k ∈ [K]. We also define K corresponding bins B_k, k ∈ [K], where each bin is the collection of example indices whose predicted probability falls in the interval I_k, i.e. B_k = {i : ŷ_i ∈ I_k}. With this, the ECE is defined as ECE = (1/n) Σ_{k=1}^{K} | Σ_{i ∈ B_k} (y_i − ŷ_i) |.

• OOD-AUC: Confidence is typically taken to be the variance or entropy of the predicted Bernoulli distribution in the case of a binary classification task. The OOD-AUC is an AUROC measured with the label set to 1 if the example is OOD (and 0 if IND), and the confidence measure as the score assigned to the example. The OOD-AUC will be high when OOD examples have higher variance or entropy than IND examples. Since the variance and entropy of a Bernoulli distribution are similarly ordered (ŷ = 0.5 having the highest variance or entropy, and ŷ = 0 or ŷ = 1 the lowest), the resulting OOD-AUC metric is the same under either choice.

These metrics are measured for each downstream task, on each data slice (IND) and the other data slices within the same partition (OOD). The metrics are tabulated by algorithm, presenting the metrics of different algorithms in the same setting side by side.
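For concreteness, here is a minimal sketch of the ECE and OOD-AUC computations defined above; the entropy-based confidence follows the Bernoulli formulation, and the helper names are ours, not the benchmark's.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(y_true, y_prob, num_bins=10):
    """ECE = (1/n) * sum_k | sum_{i in B_k} (y_i - p_i) | over equal-width bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.minimum((y_prob * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for k in range(num_bins):
        in_bin = bins == k
        if in_bin.any():
            ece += abs(np.sum(y_true[in_bin] - y_prob[in_bin]))
    return ece / len(y_true)

def ood_auc(ind_probs, ood_probs, eps=1e-12):
    """AUROC of separating OOD (label 1) from IND (label 0) via predictive entropy."""
    probs = np.clip(np.concatenate([ind_probs, ood_probs]), eps, 1 - eps)
    entropy = -(probs * np.log(probs) + (1 - probs) * np.log(1 - probs))
    is_ood = np.concatenate([np.zeros(len(ind_probs)), np.ones(len(ood_probs))])
    return roc_auc_score(is_ood, entropy)
```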
We evaluate seven algorithms on BEDS-Bench and analyze their performance: Logistic Regression (LogReg), Gaussian Process (GP) (Rasmussen and Williams, 2005), Random Forest (RF) (Breiman, 2001), Mondrian Forest (MF) (Lakshminarayanan et al., 2014), Multi-Layer Perceptron (MLP), Bayesian Recurrent Neural Network (BRNN) (Fortunato et al., 2019) with the same setup as Dusenberry et al. (2020), and Spectral-normalized Neural Gaussian Process (SNGP) (Liu et al., 2020). We use both the Scikit-Learn (Pedregosa et al., 2011) and TensorFlow (Abadi et al., 2016) software frameworks for the experiments, depending on the specific algorithm. Six of these models use a fixed-length representation and one (BRNN) uses a sequential embedding representation. A summary of the models evaluated in this work is presented in Table 6 in the Appendix. The fixed-length representation is calculated as an array of binary indicators, one for each of the possible codes that might appear in the training data. Age is represented in years, and volumes (inputevents and outputevents) are aggregated over the 24-hour period. The inputs are standardized by column. The sequential embedding representation creates an embedding for each categorical value and maintains the temporal ordering of all the codes. The data format of the generated representation matches that of the MIMIC-III SequenceExamples for TensorFlow modeling resource. In all our experiments, within each partition, we randomly subsample (without replacement) the training examples down to the size of the smallest training slice. This keeps all the training sets the same size and makes it easy to compare metrics across different training slices.

The results for one of the tasks (In-Hospital Mortality) are presented in Tables 3 (AUC), 4 (ECE), and 5 (OOD-AUC). The metrics are described in Section 3.7. Each row corresponds to a specific combination of train and test set, and each column (starting from the third column) corresponds to a learning algorithm. In the AUC table (Table 3) and the ECE table (Table 4), the IND rows have values colored in gray. The red-colored values are those where an improvement in performance is desired, and the green-colored values are those where the performance is, somewhat unexpectedly, better than expected. Within each row, the best-performing algorithm's value is set in bold. We do not color-code the OOD table (Table 5), since interpreting those numbers is a little more nuanced; they should only be viewed in conjunction with the corresponding values in the other two tables, for reasons described in the following section. The remaining results for the other tasks are reported in the Appendix.

In the results from our experiments, we broadly observe that, among the algorithms we tested, no algorithm dominates another across the board. We also observe that every model does particularly poorly at staying calibrated in OOD settings, with the exception of testing on Neonates. Among the various algorithms tested in our experiments, a few are specifically designed with robustness to OOD inputs in mind. The SNGP, MF, and GP algorithms in particular are known to have stronger uncertainty estimation properties, even under distributional shift, relative to other algorithms. The SNGP algorithm in our experiment is essentially an MLP with an additional GP layer as the final layer and spectral normalization in the fully connected layers. SNGP is designed to be distance-preserving at each of the individual layers in the deep model, with the hypothesis that this helps prevent collapsing of OOD and IND inputs at the final layer.

Figure 3: In the left plot, the OOD class is "in between" the positive and negative classes. In the right plot, the OOD class is "closer to 0.5" than both the positives and negatives. Which of these two models is assigning "lower confidence" to the OOD examples?

What we observe is that, while these algorithms do sometimes perform better in OOD settings with respect to the AUC score, the overall performance, especially the ECE score, has scope for improvement. Our hypothesis, and hope, is that as these algorithms are increasingly tested against the EHR data modality, as they have been against images, improvements to the methods will result in increased robustness and better performance in such OOD settings.
It is also interesting to note that the differences between the Male and Female distributions seem not to matter much for the downstream tasks, and the models generalize across them just fine with respect to both AUC and ECE. Indeed, this is also reflected in the fact that the OOD-AUC between these two subgroups is very close to 0.5. This example highlights the nuances in interpreting the OOD table, especially in isolation. While other works (Ulmer et al., 2020) highlight that models fail to distinguish the Male and Female distributions with appropriate levels of predictive uncertainty and consider this a failure mode, a more careful analysis shows that it might not be a failure in itself. When models are able to generalize perfectly from Male to Female and vice versa with respect to both the AUC and ECE metrics, the expectation that they assign lower confidence (higher uncertainty) to the OOD examples is undue, and even incorrect. Having a high OOD-AUC (i.e. assigning lower confidence to OOD examples) becomes a desideratum only when the model fails to generalize well to those examples. Our view is that while OOD detection performance can be interesting in some situations, the AUC and ECE metrics carry most of the story in terms of a model's OOD behavior.

A related consideration is how to measure the confidence of a model's prediction, which is in turn used to detect OOD examples. Among the class of models which produce a set of predictions for each input, such as ensembles (Lakshminarayanan et al., 2017) or MC dropout (Gal and Ghahramani, 2016), confidence is typically measured via the standard deviation (or mutual information (Ulmer et al., 2020)) of the set of predictions. A large standard deviation among the predictions represents higher uncertainty, and vice versa. For the second class of models, including many we have tested in this work, which output just one prediction per input, confidence is typically measured as the entropy of the predicted Bernoulli distribution (or Categorical distribution for multi-class classification). A Bernoulli distribution with mean parameter p = 0.5 has the highest entropy in its family, and therefore represents a prediction with the least confidence, while predictions with mean parameter close to 0 or 1 have lower entropy and hence represent predictions with high confidence. Among this second class of models, an alternative way of describing confidence would be to consider the model's discrimination ability, and inspect whether a given prediction is close to the threshold of maximum discrimination or away from it (closer being "less confident"). Here the threshold of maximum discrimination refers to the threshold value that maximizes the mean (either arithmetic or geometric) of the sensitivity and specificity of the model. The question now is: is a low-confidence prediction a Bernoulli distribution with high entropy, or one whose mean is close to the threshold of maximum discrimination? This distinction is typically a moot point when there is perfect class balance between the positive and negative classes (i.e. the marginal probability p(y = 1) = 0.5). However, when the classes are not well balanced, the two interpretations of confidence start to diverge, as illustrated in Figure 3.
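To make the distinction concrete, the following sketch contrasts the two notions of confidence on purely illustrative, synthetic numbers: entropy of the predicted Bernoulli versus distance from the threshold of maximum discrimination (here taken as the threshold maximizing sensitivity plus specificity, i.e. Youden's J).

```python
import numpy as np
from sklearn.metrics import roc_curve

def entropy_confidence(p, eps=1e-12):
    """Higher value = lower confidence (entropy of Bernoulli(p))."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def threshold_distance_confidence(p, y_true, y_prob):
    """Higher value = lower confidence (closeness to the max-discrimination threshold)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    t_star = thresholds[np.argmax(tpr - fpr)]   # maximizes sensitivity + specificity
    return -np.abs(np.asarray(p) - t_star)

# Illustrative imbalanced IND data: mostly negatives, so t_star ends up well below 0.5.
rng = np.random.default_rng(0)
y_ind = (rng.random(1000) < 0.1).astype(int)
p_ind = np.clip(0.08 + 0.5 * y_ind + 0.05 * rng.standard_normal(1000), 0.0, 1.0)

p_ood = np.array([0.45, 0.5, 0.55])  # OOD predictions near 0.5
print(entropy_confidence(p_ood))                            # "least confident" by entropy
print(threshold_distance_confidence(p_ood, y_ind, p_ind))   # far from t_star, so "confident"
```

Under the entropy view, scores near 0.5 are the least confident; under the threshold-distance view, the same scores can sit far from t_star and therefore look confident, which is exactly the ambiguity discussed here.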
In Figure 3, the left plot is from a situation where the OOD examples are closer to the threshold of maximum discrimination. The right plot is from a different situation where the OOD examples have higher entropy overall (note that the scales of the X-axes in the two plots differ), and thus the model achieves a high OOD-AUC score. Yet one might also reasonably interpret the right plot as the OOD examples being assigned a "strongly positive" score relative to the two IND classes, in which case the model is in fact being "more confident" on the OOD examples and therefore ought to have a low OOD-AUC score under an appropriately chosen measure of confidence. These observations, in our opinion, make the problem of OOD detection itself a little less well defined under class imbalance for the class of models that use predictive entropy as a measure of confidence, in addition to it not being a very useful metric in isolation.

In this work, we propose a benchmark, BEDS-Bench, to evaluate the performance of EHR ML models under distributional shift of the test data. We evaluate several algorithms with this benchmark, and find that no single algorithm demonstrates satisfactory robustness behavior over a wide range of OOD settings. We also find that no single algorithm works better than another across the board, including algorithms designed with OOD robustness in mind. While prior work has identified that most discriminative models are not reliable at detecting OOD examples in medical tabular data, our work confirms this and in addition finds that all the models we tested fare poorly at maintaining calibration under distributional shift of EHR data. This underscores the need for further research into robustness evaluation of EHR models, especially as these models are increasingly deployed in real-world clinical settings.

References
Abadi et al. (2016). TensorFlow: A system for large-scale machine learning.
Bodenreider (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology.
Breiman (2001). Random forests.
Deng et al. (2009). ImageNet: A large-scale hierarchical image database.
Dua and Graff (2017). UCI Machine Learning Repository.
Dusenberry et al. (2020). Analyzing the role of model uncertainty for electronic health records.
Fortunato et al. (2019). Bayesian recurrent neural networks.
Gal and Ghahramani (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
Goldberger et al. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals.
Hendrycks and Dietterich (2019). Benchmarking neural network robustness to common corruptions and perturbations.
Johnson et al. (2016). MIMIC-III, a freely accessible critical care database.
MIMIC-III -- SequenceExamples for TensorFlow modeling.
Lakshminarayanan et al. (2014). Mondrian forests: Efficient online random forests.
Lakshminarayanan et al. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles.
LeCun and Cortes (2010). MNIST handwritten digit database.
Liu et al. (2020). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness.
Nalisnick et al. (2019). Do deep generative models know what they don't know?
Nestor et al. (2019). Feature robustness in non-stationary health records: Caveats to deployable model performance in common clinical machine learning tasks.
Netzer et al. (2011). Reading digits in natural images with unsupervised feature learning.
Pedregosa et al. (2011). Scikit-learn: Machine learning in Python.
Rajkomar et al. (2018). Scalable and accurate deep learning for electronic health records.
Rasmussen and Williams (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning).
Sendak et al. (2020). A path for translation of machine learning products into healthcare delivery.
Validation of a retrospective computing model for mortality risk in the intensive care unit.
Ulmer et al. (2020). Trust issues: Uncertainty estimation does not enable reliable OOD detection on medical tabular data.
Wang et al. (2020). MIMIC-Extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III.
Xu et al. (2010). MedEx: a medication information extraction system for clinical narratives.
Zeng et al. (2020). PIC, a paediatric-specific intensive care database.