key: cord-0433405-nozl0org
authors: Zhang, Jingqing; Bolanos, Luis; Tanwar, Ashwani; Ive, Julia; Gupta, Vibhor; Guo, Yike
title: Clinical Utility of the Automatic Phenotype Annotation in Unstructured Clinical Notes: ICU Use Cases
date: 2021-07-24
journal: nan
DOI: nan
sha: 2edf3364b0e0c33e93303dc06bcddcb824c0e14e
doc_id: 433405
cord_uid: nozl0org

Objective: Clinical notes contain information not present elsewhere, including drug response and symptoms, all of which are highly important when predicting key outcomes in acute care patients. We propose the automatic annotation of phenotypes from clinical notes as a method to capture essential information, complementary to the typically used vital signs and laboratory test results, to predict outcomes in the Intensive Care Unit (ICU).
Methods: We develop a novel phenotype annotation model to annotate phenotypic features of patients, which are then used as input features of predictive models to predict ICU patient outcomes. We demonstrate and validate our approach by conducting experiments on three ICU prediction tasks (in-hospital mortality, physiological decompensation and length of stay) for over 24,000 patients using the MIMIC-III dataset.
Results: The predictive models incorporating phenotypic information achieve an AUC-ROC of 0.845 for in-hospital mortality, an AUC-ROC of 0.839 for physiological decompensation and a Kappa of 0.430 for length of stay, all of which consistently outperform the baseline models leveraging only vital signs and laboratory test results. Moreover, we conduct a thorough interpretability study, showing that phenotypes provide valuable insights at the patient and cohort levels.
Conclusion: The proposed approach demonstrates that phenotypic information complements the traditionally used vital signs and laboratory test results, significantly improving the forecasting of outcomes in the ICU.
The accumulation of healthcare data today has reached unprecedented levels: NHS datasets alone record billions of patient interactions every year 1. In particular, due to the close monitoring of patients in an Intensive Care Unit (ICU), a wealth of data is generated for each patient 2, with some information being recorded every minute. In the typical setting, an Electronic Health Record (EHR) contains two types of information: structured (e.g. blood tests, temperature, lab results) and unstructured (e.g. nursing notes, radiology reports, discharge summaries), with the latter making up the largest part (typically up to 80% 3). Both types of information are valuable for ICU monitoring. However, the majority of recent research 4-6 relies on the more straightforward structured information, typically laboratory test results and vital signs. Among the unstructured data, the phenotype 1 has received the least attention for ICU monitoring 8. This is mainly due to the difficulty of extracting phenotypic information that is expressed through a variety of contextual synonyms. For example, a phenotype such as Hypotension can be expressed in text as "drop in blood pressure" or "BP of 79/48". Nevertheless, phenotypes are crucial for understanding disease diagnosis, identifying important disease-specific information, stratifying patients and identifying novel disease subtypes 9.

Our work thoroughly investigates the value of phenotypic information extracted from text for ICU monitoring. We automatically extract mentions of phenotypes from clinical text using a self-supervised methodology built on recent advancements in clinical NLP, namely contextualized word embeddings 10, which are particularly helpful for the detection of contextual synonyms. We extract those mentions for over 15,000 phenotype concepts of the Human Phenotype Ontology (HPO) 11.
We enrich the phenotypic features extracted in this manner with information from the structured data (i.e., bedside measurements and laboratory test results). To interpret our results, we use SHAP values 12. We benchmark our approach on three mainstream ICU tasks, following common practice 4 for comparability: length of stay, in-hospital mortality and physiological decompensation. Our main contributions are: (i) an approach to incorporate phenotypic features into the modelling of ICU time-series prediction tasks; (ii) an investigation of the importance of phenotypic features in combination with structured information for predicting the patient course at the micro (individual patient) and macro (cohort) levels; (iii) a thorough interpretability study demonstrating the importance of phenotypic and structured features for the ICU cases; (iv) a demonstration of the utility of automatic phenotyping for ICU use cases.

In this study, we use the publicly available ICU database MIMIC-III 13 and follow common practice 4 to define the three ICU tasks, data collection and data preprocessing. We formulate in-hospital mortality as a binary classification problem at 48 hours after admission, in which the label indicates whether the patient dies before discharge. We formulate physiological decompensation as a binary classification problem, in which the target label indicates whether the patient will die in the next 24 hours. We cast length of stay (LOS) prediction as a multi-class classification problem, where the labels correspond to the remaining length of stay. Possible values are divided into 10 bins: one for stays of less than a day, 7 bins for each day of the first week, another bin for stays of more than a week but less than two, and a final bin for stays of more than two weeks. For data collection, we use both structured data (e.g. bedside measurements) and unstructured data (e.g.
clinical notes) following the filtering criteria 4 for the patients, admissions and ICU stays in all three tasks. In addition, we discard all the ICU episodes in which a clinical note is not recorded. This reduces our train and test data as compared to the benchmark 4 , so we recalculate the baseline scores using their code on our new test set for fair comparison. Overall, there are over 24,000 patients in total and the exact numbers of patients, ICU episodes and timesteps per task are reported in Table A1 . Mortality rate across all patients is 13.12% and decompensation rate across all timesteps is 2.01%. Most patients stay in ICU less than 7 days, and the distribution of ICU stays per LOS class is presented in detail in Table A2 . For data preprocessing of structured data, we follow the steps 2 to collect 17 clinical features (i.e., capillary refill rate, diastolic blood pressure, fraction inspired oxygen, Glasgow coma scale eye opening, Glasgow coma scale motor response, Glasgow coma scale verbal response, Glasgow coma scale total, glucose, heart rate, height, mean blood pressure, oxygen saturation, respiratory rate, systolic blood pressure, temperature, weight, and pH). For data preprocessing of unstructured data, we collect all clinical notes including nursing notes, physician notes and discharge summaries at all timesteps during ICU stays and we observe there is high data sparsity as clinical notes are recorded roughly every 12 hours. The processed structured data and unstructured data are then used as inputs to our approach. The proposed approach consists of two steps. The first step is to collect clinical features (more specifically, phenotypic features, standardised by Human Phenotype Ontology (HPO) 11 ) from unstructured data by using Natural Language Processing (NLP) algorithms. 
The second step is to combine the phenotypic features from unstructured data and the 17 clinical features from structured data as input features for machine learning classifiers that predict in-hospital mortality, physiological decompensation and LOS separately.

First, to extract phenotypes from free-text clinical notes, we develop a state-of-the-art phenotyping model, which leverages contextualized word embeddings and data augmentation techniques (paraphrasing and synthetic text generation) to capture names, synonyms, abbreviations and, more importantly, contextual synonyms of phenotypes. For example, "drop in blood pressure" and "BP of 79/48" are both contextual synonyms of Hypotension (HP:0002615). As a result of this contextual detection of phenotypes, the phenotyping model outperforms alternative phenotyping algorithms. We refer readers to the work 14 for methodological details. For comparison, we also use alternative phenotyping methods, including ClinicalBERT 10 (fine-tuned for phenotyping) and NCR 15. NCR uses a convolutional neural network (CNN) to assign HPO concept similarity scores to phrases encoded using pre-trained non-contextualized word embeddings.

Second, the phenotypic features are combined with the structured clinical features as input features to machine learning classifiers for prediction of the three ICU tasks. We use standard machine learning classifiers for prediction: Random Forest (RF) 16 and Long Short-Term Memory Network (LSTM) 17. To reduce feature sparsity, we distinguish between persistent and transient phenotypic features. More precisely, if a phenotype is clinically deemed likely to last an entire admission in the vast majority of typical cases (e.g., tuberculosis, cancer), it is marked as 'persistent'. In contrast, if the phenotype can be acquired or can improve during an ICU stay, such as pain, fever or cough, it is marked as 'transient'.
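The persistent/transient marks can be turned into time-series features by carrying each phenotype forward in time; in this study, transient phenotypes are held until the next clinical note and persistent ones until the end of the ICU stay. The following is a minimal sketch of that rule, not the authors' implementation. The function name `propagate_phenotypes`, the `notes` mapping and the hourly timestep are assumptions made for illustration.

```python
from typing import Dict, List, Set


def propagate_phenotypes(
    notes: Dict[int, Set[str]],
    persistent: Set[str],
    n_hours: int,
) -> List[Set[str]]:
    """Return, for each hour 0..n_hours-1, the phenotypes considered present.

    `notes` maps the hour at which a clinical note was recorded to the
    phenotypes annotated in that note (hypothetical input layout).
    `persistent` is the set of phenotypes marked as persistent; every
    other phenotype is treated as transient.
    """
    active_persistent: Set[str] = set()
    active_transient: Set[str] = set()
    timeline: List[Set[str]] = []
    for hour in range(n_hours):
        if hour in notes:
            found = notes[hour]
            # Persistent phenotypes accumulate and stay until the end of the stay.
            active_persistent |= found & persistent
            # Transient phenotypes are replaced by what the new note mentions.
            active_transient = found - persistent
        timeline.append(active_persistent | active_transient)
    return timeline


example_notes = {2: {"Fever", "Neoplasm"}, 5: {"Cough"}}
timeline = propagate_phenotypes(example_notes, persistent={"Neoplasm"}, n_hours=7)
```

In this toy example, "Fever" disappears at hour 5 because the second note does not mention it, while "Neoplasm" remains present through the last hour.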
A transient phenotype is kept present from the moment it appears until a new clinical note appears, whereas a persistent phenotype is kept present until the end of the ICU stay. We find this beneficial and will discuss it in Section 5.1. We also address data sparsity by aggregating HPO terms into their parents (according to the HPO hierarchy).

To compare with the previous study 4, we use the Area Under the Curve of Receiver Operating Characteristic (AUC-ROC) 18 and the Area Under the Curve of Precision-Recall (AUC-PR) for the In-Hospital Mortality and Physiological Decompensation tasks. We rely primarily on AUC-ROC for statistical analysis, as it is threshold independent and is used by the benchmark 4 as the primary metric. For the LOS task, we use Cohen's Kappa 19 and Mean Absolute Deviation 20 (MAD), relying primarily on the Kappa scores for statistical analysis. We use a train-test split based on the benchmark 4, but exclude patients without clinical notes, resulting in 21,346 and 3,824 patients for the train and test sets, respectively. Further, we perform 4-fold cross validation on the training set. All splits are deterministic, so that all the classifiers with different data settings are trained and evaluated on the same subsets of data. Following the benchmark, we use bootstrap resampling for statistical analysis of the scores. To compute confidence intervals on the test set, we resample it 1,000 times for length of stay and decompensation, and 10,000 times for the in-hospital mortality task. We then compute the scores on the resampled data to calculate 95% confidence intervals.

To provide interpretability and insights into model predictions, we use SHAP values 12, the implementation details of which are explained further in Appendix A.1. SHAP values are typically used to explain black box models, and allow us to quantify the importance of a feature and whether it impacts the outcome positively or negatively.
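The bootstrap procedure for the confidence intervals can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the benchmark's code: the percentile method for the interval, the rank-based AUC-ROC helper, and the names `auc_roc` and `bootstrap_ci` are all assumptions. Resamples containing only one class are redrawn, since AUC-ROC is undefined for them.

```python
import random
from typing import Callable, List, Sequence, Tuple


def auc_roc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC-ROC as the probability that a positive outranks a negative.

    Ties count as 0.5. Quadratic in the number of samples, so only
    suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    y_true: Sequence[int],
    y_score: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[float]], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for `metric` on a test set."""
    rng = random.Random(seed)  # fixed seed keeps the resampling deterministic
    n = len(y_true)
    scores: List[float] = []
    while len(scores) < n_resamples:
        # Sample the test set with replacement, keeping labels and scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            scores.append(metric([y_true[i] for i in idx],
                                 [y_score[i] for i in idx]))
        except ValueError:
            continue  # resample contained a single class; draw again
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


labels = [0, 0, 1, 1]
preds = [0.1, 0.2, 0.8, 0.9]
lo, hi = bootstrap_ci(labels, preds, auc_roc, n_resamples=200)
```

With `alpha=0.05` the function returns the 2.5th and 97.5th percentiles of the resampled scores, which gives a 95% interval.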
Across all three tasks, ClinicalBERT finds 664 phenotypes, NCR finds 1,441 phenotypes, and our methodology finds 1,446, on average. On average across tasks, 30% of these phenotypes are persistent. We investigate the performance of two classifiers: Random Forest (RF) and LSTM. For each of them we investigate the following sets of features: structured features only (S), and structured features enriched with phenotypic features from one of the three phenotype annotators (ClinicalBERT, NCR, ours). The main results are presented in Table 1 and the results of the statistical tests are presented in Table A4. Overall, they show that phenotypic information complements the structured information and improves performance on all tasks. The improvements with our phenotyping model are statistically significant across all tasks when compared against using structured features only or alternative phenotyping algorithms, except for In-Hospital Mortality with RF. This is explained by the fact that phenotypes carry highly valuable information, including response to therapy, development of complications, comorbidities and unmeasured indicators of illness severity, all of which are fundamental to correctly estimating the LOS and mortality risk of a patient 21.

While we conclude that phenotypic information is useful for the three ICU tasks, decision support systems in the healthcare domain should also be reliable, interpretable and robust. Therefore, we accompany the above results with a thorough study on interpretability, providing explanations for the observed predictions at both the patient and cohort levels, and with an assessment of robustness that studies performance across disease-specific sub-cohorts.

We find it beneficial to propagate phenotypes forwards in time. More precisely, each phenotype is marked by one human clinical expert based on whether or not it would typically persist throughout an entire ICU stay.
Consequently, transient (e.g., fever, cough, dyspnea) and persistent (e.g., diabetes, cancer) phenotypes are propagated until the appearance of a new clinical note or the end of the ICU stay, respectively. We perform an ablation study and observe the phenotype propagation is more beneficial to Random Forest (RF) than LSTM. The RF models with phenotype propagation achieve 4.6% higher AUC-ROC for in-hospital mortality, 2.5% higher AUC-ROC for decompensation and 3.4% higher Kappa for LOS than RF without phenotype propagation. However, the LSTM models with phenotype propagation achieve 1.4% higher AUC-ROC for in-hospital mortality, comparable results for decompensation and 1.1% lower Kappa for LOS. We hypothesise this is because LSTM by design can better capture temporal relationship given a large amount of data to learn from. The full results can be found in Table A6 . We believe further investigation focused on learning persistency of phenotypes would be beneficial, not only to boost prediction accuracy, but also to provide insights about temporal duration of phenotypes in the ICU. To further understand the contribution of phenotypic features to the prediction performance, we have studied the most important features with the help of SHAP values 12 . This analysis and all involving SHAP values are conducted on the Random Forest (RF) models. An illustration of our investigation is in Figure 1 , where we present the top predicting features for in-hospital mortality and physiological decompensation. It confirms that phenotypic features are particularly helpful for the in-hospital mortality prediction, given that 13 out of the 20 most important features are phenotypes. This is explained by the fact that forecasts need to rely on information that is able to provide insights accurately into the long-term future. 
Contrary to bedside measurements which may not correlate well with future outcomes due to their dynamic nature, phenotypes are highly informative given that they capture, for instance, comorbidities, which are essential for predicting mortality 22 . Furthermore, another study 23 including 230,000 ICU patients found that combining the comorbidities with acute physiological measurements yielded the best results, outperforming all mortality scores (APACHE-II, SAPS-II). Unexpectedly and interestingly, the top ranking feature for mortality prediction is whether the patient experiences pain or not. We observe also that the second top ranking feature is Constitutional symptom (HP:0025142). Noting this is actually the resulting phenotype after aggregating all of its children, this phenotype should be interpreted not as a textual mention in the patient's EHR of the broad term, but rather as a mention of any of its children (most notably generalized pain). Consequently, the second top feature again highlights the importance of pain. Although not decisive, there is some initial evidence corroborating the fact that pain management improves outcomes in the ICU 24 . However, pain could also be interpreted as a proxy for establishing a high level of consciousness, which has been correlated with better outcomes in the ICU 25 . The other top ranking phenotypes, such as atrial arrhythmia, and nausea and vomiting, cover most of the body systems (i.e., heart, lungs, GI tract, central nervous system, coagulation, infection, kidneys) which are typically assessed through clinically validated scores e.g., APACHE, SAPS. Our study also showed that though phenotypic features are not as important for decompensation as for in-hospital mortality (only 3 out of the top 20 features for this task were phenotypic ones), they are still useful because they provide a better estimation of the predicted risk. 
Given that this task concerns predicting mortality within the next 24 hours, bedside measurements become more informative thanks to their temporal correlation (also shown in Figure A3). Nevertheless, bedside measurements can be ambiguous or provide an incomplete picture of the patient's status without the data found in clinical notes. For example, for one patient Neoplasm of the respiratory system (HP:0100606) was found to be the top feature, and although this phenotype is persistent, it appropriately increases the risk of decompensation, giving a better overall estimate. This patient is illustrated in Figure A4. Similarly, the top features for a long length of stay (more than 1 week) are presented in Figure A1, where 10 of the top 20 features are phenotypes.

Calibration of machine learning models compares the distribution of the probabilities predicted by the models with the distribution of probabilities observed in real data (e.g. real patients). To measure model calibration, we use the Brier score 26 (the lower the better). Our investigation of the respective calibration curves (see Figure 2 and Figure A2) shows that phenotypes from unstructured notes improve model calibration across setups, especially for physiological decompensation and in-hospital mortality, which means the distribution predicted by the models is closer to the real distribution of patients. In addition, LSTM overall produces better calibration than RF.

Calibration curves are presented with their Brier scores (the lower the better). Note that, overall, the inclusion of phenotypic features from unstructured data helps with calibration. LSTM in the legend refers to using structured features only. Ours, NCR, CB: phenotypic features from our phenotyping model, NCR and ClinicalBERT, respectively.
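For reference, the Brier score for a binary outcome is the mean squared difference between the predicted probability and the observed 0/1 label, and a calibration curve compares the mean predicted probability with the observed event rate within probability bins. The sketch below illustrates both; the function names, the equal-width binning and the toy inputs are assumptions for illustration and are not taken from the study's code.

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_curve(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, observed event rate) per non-empty bin.

    Bins are equal-width over [0, 1]; a probability of exactly 1.0 falls
    into the last bin.
    """
    sums = [0.0] * n_bins    # sum of predicted probabilities per bin
    events = [0] * n_bins    # number of positive outcomes per bin
    counts = [0] * n_bins    # number of samples per bin
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        events[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], events[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


outcomes = [0, 0, 1, 1]
probabilities = [0.1, 0.1, 0.9, 0.9]
score = brier_score(outcomes, probabilities)
curve = calibration_curve(outcomes, probabilities, n_bins=2)
```

A perfectly calibrated model has curve points lying on the diagonal (mean prediction equal to observed rate), and a lower Brier score indicates predictions closer to the observed outcomes.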
Beyond producing clinically relevant explanations at the cohort level, SHAP values let us shed light on a patient's journey and discover retrospectively when the patient was most vulnerable and why. For example, the fragment of a patient's LOS forecast in Figure 3 shows that, 41 hours after admission, the estimated probability of a LOS longer than 14 days is 69%, mainly because the patient scored 1 in the Glasgow Coma Scale Verbal Response. One hour later, when a clinical note becomes available, worrisome phenotypes appear (including edema, hypotension and abnormality of the respiratory system). Consequently, the estimated probability increases to 88%.

We assess the performance of our approach on cohorts of patients with different diseases, especially underrepresented ones, to understand its robustness and generalisability. The test set is split into four disease-specific cohorts for patients with cardiovascular diseases, diabetes, cancer, and depression, and the accuracies of the best LSTM models (using structured and phenotypic features) are then reported individually for each cohort on each ICU task. We note that the number of patients with cardiovascular diseases or diabetes is at least twice that with cancer and around five times that with depression. For in-hospital mortality and physiological decompensation, we observe comparable accuracies across the four cohorts: AUC-ROC ranges between 0.780 and 0.826 for in-hospital mortality and between 0.792 and 0.820 for physiological decompensation. In contrast, for LOS, we observe lower Kappa scores of 0.321 and 0.330 for the small cancer and depression cohorts, respectively, as opposed to 0.413 and 0.424 for the larger cohorts with cardiovascular diseases and diabetes.
We hypothesise that the nature of the disease has strong implications for in-hospital mortality and physiological decompensation, while LOS can be influenced by more factors, which require larger data samples to model their interactions. The full results are available in Table A7.

We have investigated only one data source, MIMIC-III, and our observations remain to be confirmed with other data sources. The analysis of phenotype importance is produced on the Random Forest, whose accuracy is superior to that of the baselines but not as good as that of LSTM. This choice is due to the poor computational efficiency of SHAP values for LSTM; explanations from neural-network-based models are left to future studies. Moreover, the phenotypes annotated as transient are made present only until a new clinical note appears in the timeline. This has the drawback that phenotypes might be prematurely considered as not present because the next available clinical note did not mention them. Even though the LSTM classifier is able to learn temporal correlations on its own, more elaborate feature modelling could prove useful.

Figure 3. Illustrative case for an ICU length of stay of more than 14 days. Top plot: time course of the normalised predicted probability for a stay of more than 14 days, and feature heatmap for a representative segment of the ICU stay. Each row of the heatmap represents one of the top features. At each time step, a feature can contribute positively (red) or negatively (blue) to predicting a stay of 14 days or more. Black horizontal bars at the right of each row represent the importance of the features. Note that a new clinical note that becomes available at the 42nd hour (vertical dashed line) leads to increased confidence in a longer stay due to new features. Bottom plot: Inspection of the contributing features at the 41st and 42nd hours. Each force plot illustrates features contributing positively (red) and negatively (blue) to the prediction of a longer stay.
Probability of long stay increases from 69% to 88% when the clinical note provides critical information.

We would like to thank Dr. Rick Sax, Dr. Garima Gupta and Dr. Matt Wiener for their feedback throughout this research. We would also like to thank Dr. Garima Gupta, Dr. Deepa (M.R.S.H) and Dr. Ashok (M.S.) for helping us create gold-standard phenotype annotation data.

Hyperparameters
Random Forest: num of estimators=300, criterion="gini", max depth=None, min samples split=2, min samples leaf=1
LSTM: epochs=30, hidden size=128, batch size=8, num of layers=1, patience=10, dropout rate=0, learning rate=1e-4, weight decay=0.0

Figure A3. AUC-ROC for (a) physiological decompensation and (b) in-hospital mortality for LSTM for patients with different LOS values. While the in-hospital mortality task benefits consistently for any duration of the ICU stay, decompensation sees the best improvements when patients stay the longest. This behaviour is a natural consequence of the fact that while near-future forecasts can rely strongly on bedside measurements, forecasting without a fixed endpoint in time is significantly more difficult. Nevertheless, patients who stayed for less than two weeks still saw a benefit from the introduction of phenotypic features, as these features better calibrate the algorithm's predictions. Here, S represents structured features and Ours refers to phenotypes from our phenotyping model.

Figure A4. Time course of the physiological decompensation prediction for an illustrative patient in the test set. The top plot represents the time series of the prediction as a probability (0 for no risk of decompensation, 1 for decompensation). The heatmap illustrates how the contribution of each feature (i.e., each row) varies across time for this subject. Features are sorted in decreasing order according to their importance for this patient, represented by the black horizontal bar at the right of each row.
The colour of a row indicates how that feature contributes to the prediction at a moment in time, with red representing a positive contribution (i.e., that the patient will decompensate), and blue a negative contribution. For this patient, although fluctuations in the prediction come from changes in structured data, taking into account the neoplasm of the respiratory system allows a better estimate of the baseline risk of decompensation.

References:
1. NHS Digital annual report and accounts
2. Machine learning and decision support in critical care
3. Managing unstructured big data in healthcare system
4. Multitask learning and benchmarking with clinical time series data
5. Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19. npj Digit
6. Dynamic prediction of ICU mortality risk using domain adaptation
7. Deep phenotyping for precision medicine
8. Physician documentation matters. Using natural language processing to predict mortality in sepsis
9. Ensembles of natural language processing systems for portable phenotyping solutions
10. Publicly Available Clinical BERT Embeddings
11. The Human Phenotype Ontology in 2021
12. A unified approach to interpreting model predictions
13. MIMIC-III, a freely accessible critical care database
14. Self-supervised detection of contextual synonyms in a multi-class setting: Phenotype annotation use case
15. Identifying clinical terms in medical text using ontology-guided machine learning
16. Random forests
17. Long short-term memory
18. The use of receiver operating characteristic curves in biomedical informatics
19. A coefficient of agreement for nominal scales
20. The mean and median absolute deviations
21. Are ICU length of stay predictions worthwhile?
22. Comorbidities and medical history essential for mortality prediction in critically ill patients
23. Survival prediction in intensive-care units based on aggregation of long-term disease history and acute physiology: a retrospective study of the Danish National Patient Registry and electronic patient records
24. The impact of pain assessment on critically ill patients' outcomes: A systematic review
25. Glasgow Coma Scale score in the evaluation of outcome in the intensive care unit: Findings from the Acute Physiology and Chronic Health Evaluation III study
26. Verification of forecasts expressed in terms of probability
27. From local explanations to global understanding with explainable AI for trees

Competing interests: This study is under collaboration with Imperial College London, Queen Mary University of London, Hong Kong Baptist University and Pangaea Data Limited.

Shapley values come from game theory and are used to estimate the impact of a feature on a system's output. Feature impact is defined as the variation in the output of the model when the feature is observed versus when it is unknown. Shapley values belong to a category of methods called additive. In particular, the additivity is formulated as

$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$$

where $f(x)$ is the prediction made by the model, $x$ are the features fed to the model, $M$ is the number of features, $\phi_i$ is the Shapley value of the $i$-th feature, and $\phi_0 = E[f(x)]$ is the expected value of the model over the training dataset. This assumption also ensures that the values correctly reflect the difference between the expected model output and the output for a particular prediction. The Shapley value of a feature is computed via

$$\phi_i = \sum_{S \subseteq \{1, \dots, M\} \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]$$

where $S$ is a subset of all $M$ input features, and $f_x(S) = E[f(x) \mid x_S]$, with $x_S$ denoting the input features restricted to those belonging to $S$. In this study we used the SHAP library 12 and its optimisation for tree-based classifiers 27.

Class Label   Class Description (Days)   CV-1     CV-2     CV-3     CV-4     Test
0             <1                         131913   129634   131693   133186   95439
1             1-2                        85311    83558    84065    85818    61372
2             2-3                        56353    54074    54007    54780    38858
3             3-4                        39416    37605    38106    38054    27142
4             4-5                        29384    27982    28760    28573    20171
5             5-6                        22830    22384    22360    22626    15878
6             6-7                        18816    18612    18626    18582    12940
7             7-8                        15925    15583    15697    15863    10953
8             8-14                       62655    58512    59905    60611    40856
9             >14                        69800    62283    60928    72238    45741
Total                                    532403   510227   514147   530331   369350
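The class labels in the table above can be read as a mapping from the remaining length of stay to one of 10 buckets. A minimal sketch of such a mapping is given below. The function name `los_class`, the use of hours as the input unit, and the handling of exact bucket boundaries (for example, treating exactly 14 days as class 8) are assumptions for illustration, since the table does not specify them.

```python
def los_class(remaining_hours: float) -> int:
    """Map a remaining length of stay (in hours) to a class label 0..9.

    Buckets follow the class descriptions in days:
    0: <1, 1..7: one bucket per day from 1-2 up to 7-8, 8: 8-14, 9: >14.
    """
    if remaining_hours < 0:
        raise ValueError("remaining length of stay cannot be negative")
    days = remaining_hours / 24.0
    if days < 1:
        return 0
    if days < 8:
        # Classes 1..7 cover [1, 2), [2, 3), ..., [7, 8) days.
        return int(days)
    if days <= 14:
        return 8
    return 9


example_labels = [los_class(h) for h in (12, 36, 180, 240, 480)]
```

Here 12 hours falls in class 0, 36 hours (1.5 days) in class 1, 180 hours (7.5 days) in class 7, 240 hours (10 days) in class 8 and 480 hours (20 days) in class 9.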