title: Utilizing machine learning dimensionality reduction for risk stratification of chest pain patients in the emergency department
authors: Liu, Nan; Chee, Marcel Lucas; Koh, Zhi Xiong; Leow, Su Li; Ho, Andrew Fu Wah; Guo, Dagang; Ong, Marcus Eng Hock
date: 2021-04-17
journal: BMC Med Res Methodol
DOI: 10.1186/s12874-021-01265-2

BACKGROUND: Chest pain is among the most common presenting complaints in the emergency department (ED). Swift and accurate risk stratification of chest pain patients in the ED may improve patient outcomes and reduce unnecessary costs. Traditional logistic regression with stepwise variable selection has been used to build risk prediction models for ED chest pain patients. In this study, we aimed to investigate whether machine learning dimensionality reduction methods can improve performance in deriving risk stratification models.

METHODS: A retrospective analysis was conducted on the data of patients > 20 years old who presented to the ED of Singapore General Hospital with chest pain between September 2010 and July 2015. Variables used included demographics, medical history, laboratory findings, heart rate variability (HRV), and heart rate n-variability (HRnV) parameters calculated from five- to six-minute electrocardiograms (ECGs). The primary outcome was 30-day major adverse cardiac events (MACE), which included death, acute myocardial infarction, and revascularization within 30 days of ED presentation. We used eight machine learning dimensionality reduction methods and logistic regression to create different prediction models. We further excluded cardiac troponin from the candidate variables and derived a separate set of models to evaluate the performance of models without using laboratory tests. Receiver operating characteristic (ROC) and calibration analyses were used to compare model performance.
RESULTS: Seven hundred ninety-five patients were included in the analysis, of whom 247 (31%) met the primary outcome of 30-day MACE. Patients with MACE were older and more likely to be male. All eight dimensionality reduction methods achieved performance comparable to that of traditional stepwise variable selection; the multidimensional scaling algorithm performed best, with an area under the curve of 0.901. All prediction models generated in this study outperformed several existing clinical scores in ROC analysis.

CONCLUSIONS: Dimensionality reduction models showed marginal value in improving the prediction of 30-day MACE for ED chest pain patients. Moreover, they are black box models, making them difficult to explain and interpret in clinical practice.

Chest pain is among the most common chief complaints presenting to the emergency department (ED) [1] [2] [3]. The assessment of chest pain patients poses a diagnostic challenge in balancing risk and cost. Inadvertent discharge of acute coronary syndrome (ACS) patients is associated with higher mortality rates, while inappropriate admission of patients with more benign conditions increases health service costs [4, 5]. Hence, the challenge lies in recognizing low-risk chest pain patients for safe and early discharge from the ED. There has been increasing focus on the development of risk stratification scores. Initially, risk scores such as the Thrombolysis in Myocardial Infarction (TIMI) score [6, 7] and the Global Registry of Acute Coronary Events (GRACE) score [8] were developed from post-ACS patients to estimate short-term mortality and recurrence of myocardial infarction. The History, Electrocardiogram (ECG), Age, Risk factors, and initial Troponin (HEART) score was subsequently designed for ED chest pain patients [9], and it demonstrated superior performance in many comparative studies on the identification of low-risk chest pain patients [10] [11] [12] [13] [14] [15] [16] [17].
Nonetheless, the HEART score has its disadvantages. Many potential factors can affect its diagnostic and prognostic accuracy, such as variation in patient populations, provider determination of low-risk HEART score criteria, and the specific troponin reagent used, all of which contribute to clinical heterogeneity [18] [19] [20] [21]. In addition, most risk scores still require variables that may not be available during the initial presentation of the patient to the ED, such as troponin. There remains a need for a more efficient risk stratification tool. We had previously developed a heart rate variability (HRV) prediction model using variables readily available at the ED, in an attempt to reduce both diagnostic time and subjective components [22]. HRV characterizes beat-to-beat variation using time domain, frequency domain, and nonlinear analysis [23] and has proven to be a good predictor of major adverse cardiac events (MACE) [22, 24, 25]. Most HRV-based scores were reported to be superior to the TIMI and GRACE scores while achieving performance comparable to that of the HEART score [17, 24, 26, 27]. Recently, we established a new representation of beat-to-beat variation in ECGs, the heart rate n-variability (HRnV) [28]. HRnV utilizes variation in sampling RR-intervals and overlapping RR-intervals to derive additional parameters from a single strip of ECG reading. As an extension to HRV, HRnV potentially supplements additional information about adverse cardiac events while reducing unwanted noise caused by abnormal heartbeats. Moreover, HRV is a special case of HRnV when n = 1. The HRnV prediction model, developed from multivariable stepwise logistic regression, outperformed the HEART, TIMI, and GRACE scores in predicting 30-day MACE [28]. Nevertheless, multicollinearity is a common problem in logistic regression models where supposedly independent predictor variables are correlated.
Such models tend to overestimate the variance of regression parameters and hinder the determination of the exact effect of each parameter, which could potentially result in inaccurate identification of significant predictors [29, 30]. In the HRnV study [28], 115 HRnV parameters were derived but only seven variables were left in the final prediction model, which implies the possible elimination of relevant information. Within the general medical literature, machine learning dimensionality reduction methods are uncommon and limited to a few specific areas, such as bioinformatics studies on genetics [31, 32] and diagnostic radiological imaging [33, 34]. Despite this, dimensionality reduction in HRV has been investigated and shown to effectively compress multidimensional HRV data for the assessment of cardiac autonomic neuropathy [35]. In this paper, we investigated several machine learning dimensionality reduction algorithms for building predictive models, hypothesizing that these algorithms could preserve useful information while improving prediction performance. We aimed to compare the performance of the dimensionality reduction models against the traditional stepwise logistic regression model [28] and conventional risk stratification tools such as the HEART, TIMI, and GRACE scores, in the prediction of 30-day MACE in chest pain patients presenting to the ED.

A retrospective analysis was conducted on data collected from patients > 20 years old who presented to Singapore General Hospital ED with chest pain between September 2010 and July 2015. These patients were triaged using the Patient Acuity Category Scale (PACS) and those with PACS 1 or 2 were included in the study. Patients were excluded if they were lost to the 30-day follow-up or if they presented with ST-elevation myocardial infarction (STEMI) or non-cardiac etiology chest pain such as pneumothorax, pneumonia, and trauma as diagnosed by the ED physician.
Patients with ECG findings that precluded quality HRnV analysis, such as artifacts, ectopic beats, and paced or non-sinus rhythm, were also excluded. For each patient, HRV and HRnV parameters were calculated using the HRnV-Calc software suite [28, 36] from a five- to six-minute single-lead (lead II) ECG performed via the X-series Monitor (ZOLL Medical Corporation, Chelmsford, MA). Table 1 shows the full list of HRV and HRnV parameters used in this study. In addition, the first 12-lead ECGs taken during patients' presentation to the ED were interpreted by two independent clinical reviewers, and any pathological ST changes, T wave inversions, and Q-waves were noted. Patients' demographics, medical history, first set of vital signs, and troponin-T values were obtained from the hospital's electronic health records (EHR). In this study, high-sensitivity troponin-T was selected as the cardiac biomarker and an abnormal value was defined as > 0.03 ng/mL.

The primary outcome measured was any MACE within 30 days, including acute myocardial infarction, emergent revascularization procedures such as percutaneous coronary intervention (PCI) or coronary artery bypass graft (CABG), or death. The primary outcome was captured through a retrospective review of patients' EHR.

Dimensionality reduction in machine learning and data mining [37] refers to the process of transforming high-dimensional data into lower dimensions such that fewer features are selected or extracted while preserving essential information of the original data. Two types of dimensionality reduction approaches are available, namely variable selection and feature extraction. Variable selection methods generally reduce data dimensionality by choosing a subset of variables, while feature extraction methods transform the original feature space into a lower-dimensional space through linear or nonlinear feature projection.
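The distinction between the two approaches can be illustrated with a minimal sketch. It is not the study's pipeline: the toy matrix, the function names, and the use of the first principal component (found by power iteration) as the linear projection are all assumptions made for illustration.

```python
import math
import random


def select_columns(X, keep):
    """Variable selection: keep a subset of the original columns unchanged."""
    return [[row[j] for j in keep] for row in X]


def center(X):
    """Subtract each column's mean so projections describe variation around it."""
    n, D = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(D)]
    return [[row[j] - means[j] for j in range(D)] for row in X]


def first_principal_direction(Xc, iters=200, seed=0):
    """Power iteration on Xc^T Xc; returns a unit-length weight vector (length D)."""
    rng = random.Random(seed)
    n, D = len(Xc), len(Xc[0])
    w = [rng.gauss(0, 1) for _ in range(D)]
    for _ in range(iters):
        scores = [sum(row[j] * w[j] for j in range(D)) for row in Xc]    # Xc w
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(D)]  # Xc^T (Xc w)
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def project(Xc, W):
    """Linear feature extraction: each new feature mixes all original columns."""
    return [[sum(row[j] * w[j] for j in range(len(w))) for w in W] for row in Xc]


# Hypothetical toy data: 6 samples, D = 3 features; column 1 is roughly 2 x column 0.
X = [[1.0, 2.1, 0.5], [2.0, 3.9, 0.4], [3.0, 6.2, 0.6],
     [4.0, 7.8, 0.5], [5.0, 10.1, 0.4], [6.0, 12.0, 0.6]]

selected = select_columns(X, keep=[0, 2])    # n x 2, original variables retained
w = first_principal_direction(center(X))
extracted = project(center(X), [w])          # n x 1, a new composite feature
```

In this sketch the selected columns remain directly interpretable as the original variables, whereas the extracted feature is a weighted combination of all three, which hints at why projected features are harder to explain clinically.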
In clinical predictive modeling, variable selection techniques such as stepwise logistic regression are popular for constructing prediction models [38]. In contrast, feature extraction approaches [39] are less commonly used in medical research, although they have been widely used in computational biology [40], image analysis [41, 42], and physiological signal analysis [43], among others. In this study, we investigated the implementation of eight feature extraction algorithms and evaluated their contributions to prediction performance in risk stratification of ED chest pain patients. We also compared them with a prediction model that was built using conventional stepwise variable selection [28]. Henceforth, we use the terms "dimensionality reduction" and "feature extraction" interchangeably.

Table 1 List of traditional heart rate variability (HRV) and novel heart rate n-variability (HRnV) parameters used in this study. HRnV is a new representation of beat-to-beat variation in ECGs, and the parameter "n" controls the formation of new RR-intervals that are used for parameter calculation. Details of the HRnV definition can be found in [28].
Abbreviations: Mean NN, average of R-R intervals; SDNN, standard deviation of R-R intervals; RMSSD, square root of the mean squared differences between R-R intervals; NN50, the number of times that the absolute difference between 2 successive R-R intervals exceeds 50 ms; pNN50, NN50 divided by the total number of R-R intervals; NN50n, the number of times that the absolute difference between 2 successive RR_nI/RR_nI_m sequences exceeds 50 × n ms; pNN50n, NN50n divided by the total number of RR_nI/RR_nI_m sequences; VLF, very low frequency; LF, low frequency; HF, high frequency; SD, standard deviation; SampEn, sample entropy; ApEn, approximate entropy; DFA, detrended fluctuation analysis.
a In frequency domain analysis, the power of spectral components is the area below the relevant frequencies presented in absolute units (square milliseconds).
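The time-domain definitions given with Table 1 can be sketched in a few lines. This is an illustration only, not the HRnV-Calc implementation: the function names and the sample intervals are hypothetical, the use of the sample (n − 1) standard deviation for SDNN is an assumption, and the construction of RR_nI/RR_nI_m as sums of n consecutive RR intervals taken with a given stride is an assumption about the definition detailed in [28].

```python
import math


def aggregate_rr(rr_ms, n, stride=None):
    """Assumed RR_nI construction: sums of n consecutive RR intervals.

    stride=None gives non-overlapping windows (step n); a stride smaller
    than n gives overlapping windows (assumed to correspond to RR_nI_m).
    """
    step = n if stride is None else stride
    return [sum(rr_ms[i:i + n]) for i in range(0, len(rr_ms) - n + 1, step)]


def time_domain_metrics(intervals_ms, n=1):
    """Mean NN, SDNN, RMSSD, NN50n and pNN50n following the Table 1 wording."""
    count = len(intervals_ms)
    diffs = [b - a for a, b in zip(intervals_ms, intervals_ms[1:])]
    mean_nn = sum(intervals_ms) / count
    # Sample standard deviation (divides by count - 1); an assumption.
    sdnn = math.sqrt(sum((x - mean_nn) ** 2 for x in intervals_ms) / (count - 1))
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    nn50 = sum(1 for d in diffs if abs(d) > 50 * n)   # threshold scales with n
    pnn50 = nn50 / count                              # as a fraction of all intervals
    return {"mean_nn": mean_nn, "sdnn": sdnn, "rmssd": rmssd,
            "nn50": nn50, "pnn50": pnn50}


# Hypothetical RR intervals in milliseconds.
rr = [800, 810, 790, 860, 800, 795, 870, 805]

hrv = time_domain_metrics(aggregate_rr(rr, n=1), n=1)       # n = 1 reduces to HRV
hr2v = time_domain_metrics(aggregate_rr(rr, n=2), n=2)      # non-overlapping, n = 2
hr2v1 = time_domain_metrics(aggregate_rr(rr, n=2, stride=1), n=2)  # overlapping
```

With n = 1 the aggregated sequence equals the original RR intervals, consistent with HRV being a special case of HRnV; larger n and different strides yield additional parameter sets from the same ECG strip.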
Suppose there were $n$ samples $(x_i, y_i)$, $i = 1, 2, \ldots, n$, in the dataset $(X, y)$, where each sample $x_i$ had $D$ original features and its label $y_i$ was 1 or 0, with 1 indicating a positive primary outcome, i.e., MACE within 30 days. We applied dimensionality reduction algorithms to project $x_i$ into a $d$-dimensional space ($d < D$). As a result, the original dataset $X \in \mathbb{R}^{n \times D}$ became $\hat{X} \in \mathbb{R}^{n \times d}$. There was a total of $D = 174$ candidate variables in this study. As suggested in Liu et al. [28], some variables were less statistically significant in terms of contributions to the prediction performance. Thus, we conducted univariable analysis and preselected a subset of $\tilde{D}$ variables if their p