key: cord-0687038-ejyni96u
authors: Mueller, Brianna; Kinoshita, Takahiro; Peebles, Alexander; Graber, Mark A.; Lee, Sangil
title: Artificial intelligence and machine learning in emergency medicine: a narrative review
date: 2022-03-01
journal: Acute Med Surg
DOI: 10.1002/ams2.740
sha: 40e1cf002b558479a1a40da8d1e3714cfc85fcad
doc_id: 687038
cord_uid: ejyni96u

AIM: The emergence and evolution of artificial intelligence (AI) has generated increasing interest in machine learning applications for health care. Specifically, researchers are grasping the potential of machine learning solutions to enhance the quality of care in emergency medicine. METHODS: We undertook a narrative review of published works on machine learning applications in emergency medicine and provide a synopsis of recent developments. RESULTS: This review describes fundamental concepts of machine learning and presents clinical applications for triage, risk stratification specific to disease, medical imaging, and emergency department operations. Additionally, we consider how machine learning models could contribute to the improvement of causal inference in medicine, and to conclude, we discuss barriers to safe implementation of AI. CONCLUSION: We intend that this review serves as an introduction to AI and machine learning in emergency medicine.

In recent years, advancements in artificial intelligence (AI) technologies have resulted in the rapid growth of machine learning (ML) research in medicine. 1 Specifically, the development of unprecedented ML applications has shown great potential to significantly impact the field of emergency medicine. These applications address prevailing challenges in the emergency department such as triage and disposition, early detection of conditions and outcomes, emergency department operations, and therapeutic intervention. With the increasing availability of clinical data, it is highly advantageous for emergency medicine clinicians to understand computational techniques like ML that can meaningfully process large quantities of complex data. This review aims to provide a conceptual introduction to AI/ML and increase awareness of emerging clinical tools derived from ML methods. We present examples of ML models used in clinical research and highlight recent applications in the field of emergency medicine. Specifically, we focus not only on predictive studies, which the vast majority of ML research has targeted until the present, but also on causal inference studies because the goal of clinical research is often determining the effects of interventions on clinical outcomes. To conclude, we discuss challenges to the implementation of AI and consider reasons why only a few ML solutions have been applied in actual clinical practice despite the proliferation of applications in clinical literature. By examining barriers to clinical adoption, we also intend for this review to encourage more discussion on how to practically address these concerns and integrate machine learning into routine clinical operations.

Artificial intelligence broadly refers to computer systems that perform tasks normally associated with human intelligence, such as problem-solving and learning. Machine learning is a branch of AI focused on leveraging data to develop computer systems that are able to learn and improve from experience without being explicitly programmed. 2 Statistical methods and algorithms are used to recognize patterns and learn relationships from data in order to build models capable of making predictions or decisions. Machine learning algorithms fall into three main categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised ML is defined by the use of labeled data to learn a mapping between input variables and an outcome variable of interest (e.g., positive diagnosis or negative diagnosis). The process of developing a supervised model involves three datasets. The algorithm first learns on the training dataset by adjusting weights to minimize a loss function that computes the distance between the predicted outcome and the true outcome for a given data point. After fitting the model, a validation set is used for optimization through tuning of the model's hyperparameters. The validation set can also detect overfitting, which is observed when model performance is significantly better on the training set than on the validation set. Finally, a test set is used to provide an estimate of how well the model can generalize to new data.
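The three-dataset workflow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a method taken from any cited study: the function names `split_dataset` and `overfitting_gap` and the 60/20/20 split are assumptions chosen for the example.

```python
import random


def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # held back until the model is final
    return train, val, test


def overfitting_gap(train_score, val_score):
    """How much better the model scores on training data than on validation data."""
    return train_score - val_score


# Hypothetical usage with 100 patient record IDs.
train, val, test = split_dataset(range(100))
gap = overfitting_gap(train_score=0.95, val_score=0.80)
```

A large positive gap is the warning sign for overfitting mentioned above; how large is too large depends on the metric and the clinical task.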

Conversely, unsupervised learning refers to methods that use unlabeled data to find naturally occurring groups or clusters. These clusters are analyzed to identify similarities and differences between data points, and understand the distribution of data in the feature space. In reinforcement learning, a computer agent learns to achieve a goal in an interactive environment by trial and error. Unlike supervised learning where data labels serve as model feedback, reinforcement learning uses rewards or penalties as feedback based on the actions the agent performs. Over time, the agent learns action sequences that maximize the reward.
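To make the reward-driven feedback loop concrete, the following sketch shows one of the simplest reinforcement learning settings, an epsilon-greedy agent choosing among a few actions. The reward functions, the exploration rate, and all names (`epsilon_greedy`, `make_arm`) are hypothetical and chosen only for illustration.

```python
import random


def epsilon_greedy(reward_fns, steps=2000, epsilon=0.1, seed=0):
    """Learn by trial and error which action yields the highest average reward."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    values = [0.0] * len(reward_fns)  # running mean reward observed per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_fns))  # explore a random action
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        reward = reward_fns[action](rng)  # feedback from the environment
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


def make_arm(mean):
    """Hypothetical environment: reward is a fixed mean plus small uniform noise."""
    return lambda rng: mean + rng.uniform(-0.05, 0.05)


arms = [make_arm(0.2), make_arm(0.5), make_arm(0.8)]
values, counts = epsilon_greedy(arms)
best_action = max(range(len(values)), key=values.__getitem__)
```

The agent never sees a label saying which action is correct. It only observes rewards, and over time it concentrates its choices on the action with the highest estimated value.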

Deep learning is a subfield of ML that has gained massive popularity in health care over the past few years due to its success on a variety of complex classification tasks. 3 This can partly be attributed to increases in computational power and access to ever-growing amounts of data. Inspired by the structure and function of neurons in the cerebral cortex, a neural network is the backbone of deep learning algorithms. A neural network architecture consists of layers of interconnected nodes that are analogous to neurons (Fig. 1). In a process called forward propagation, data is fed into an input layer and flows through the system of hidden nodes connected by weights. The input to each node is a weighted linear combination of node outputs from the previous layer, and a nonlinear transformation is applied to produce the node's output. A loss function evaluates the difference between the predicted value from the output layer and the true value. Backpropagation then computes how the prediction error changes with each weight, and an optimization algorithm such as gradient descent uses these gradients to iteratively adjust the weights, gradually improving model accuracy on the training data.
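Forward propagation and a weight update can be sketched for a very small network with two inputs, two hidden nodes, and one output node. The sketch below assumes sigmoid activations, a squared-error loss, no bias terms, and a single hypothetical training example; none of these choices comes from the studies cited in this review.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """Forward propagation: inputs -> hidden layer -> single output node."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)))
    return hidden, output


def loss(x, y, W1, W2):
    """Squared-error loss between the prediction and the true value."""
    return 0.5 * (forward(x, W1, W2)[1] - y) ** 2


def train_step(x, y, W1, W2, lr=0.5):
    """One backpropagation pass followed by a gradient descent update."""
    hidden, output = forward(x, W1, W2)
    # Gradient of the loss with respect to the output node's weighted input.
    delta_out = (output - y) * output * (1.0 - output)
    # Error propagated back to each hidden node (uses the pre-update W2).
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    new_W2 = [w - lr * delta_out * h for w, h in zip(W2, hidden)]
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hidden)]
    return new_W1, new_W2


# Hypothetical example: one data point and arbitrary starting weights.
x, y = [1.0, 0.5], 1.0
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [0.2, -0.1]
loss_before = loss(x, y, W1, W2)
for _ in range(100):
    W1, W2 = train_step(x, y, W1, W2)
loss_after = loss(x, y, W1, W2)
```

Real deep learning models repeat the same two steps over many layers, many examples, and millions of weights, usually with a dedicated numerical library rather than plain lists.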

One of the primary ways ML and deep learning techniques differ is in the learning process. Deep learning algorithms have the ability to automatically learn feature hierarchies, whereas traditional ML algorithms require manual feature extraction and engineering. Second, deep learning often requires a considerable amount of data to make predictions, whereas traditional ML methods could reach a level where model performance no longer scales with the amount of data. Another major difference between the two techniques is the execution time. Due to the large number of parameters to learn, deep learning methods often take significantly longer to train. Finally, deep learning methods involve a large number of matrix multiplication operations, which can result in a heavy dependence on high-end machines.

One of the most interpretable ML models used in clinical research is a decision tree. A decision tree is a supervised learning algorithm structured like a flowchart that can be used for both classification and regression tasks. The goal is to develop a model that can be used to predict the target variable of future instances based on a set of decision rules. The algorithm recursively partitions the data into subsets (decision nodes) based on the value of the feature that reduces the impurity of the resulting subsets the most. A node is considered to be "pure" if all of the data points in the node are of the same class. When a node contains equal percentages of each class, impurity is maximized. Figure 2 shows an example of a decision tree to predict medication dosage. The first split is based on body mass index (BMI), indicating that BMI is the best predictor for dosage level. Each of the final subsets (leaf nodes) assigns class membership probabilities to the data points that fall into it.
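The notion of impurity can be made concrete with a small calculation. The sketch below uses Gini impurity, one common impurity measure (the text above does not specify which measure is meant), and searches for the threshold on a single feature that minimizes the weighted impurity of the two resulting subsets. The BMI values and dosage labels are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, largest when classes are evenly mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best = (None, gini(labels))  # baseline: no split at all
    for t in sorted(set(values))[:-1]:  # splitting at the maximum leaves one side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical data: BMI and the dosage class each patient received.
bmi = [19, 22, 24, 31, 33, 36]
dose = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_threshold(bmi, dose)  # (24, 0.0): both subsets are pure
```

A full decision tree applies this search across all features at every node and recurses on the resulting subsets until a stopping rule is met.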

Ensemble learning is an ML method that aims to increase accuracy and reduce variance by combining multiple algorithms. A random forest is a commonly used ensemble model that aggregates the outputs of several decision trees to make a single prediction. In general, ensembles have higher predictive power than their constituents do individually. Other widely used algorithms include linear regression, naive Bayes, support vector machine (SVM), k-nearest neighbors, and various ensemble methods such as gradient and adaptive boosting.
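A short calculation shows why combining models can help. Assuming, as a simplification, that each classifier is correct with the same probability and that their errors are independent (trees in a real random forest are only partly independent), the accuracy of a majority vote follows from the binomial distribution. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent classifiers is correct (n odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


# Three voters that are each right 70% of the time:
# 0.7**3 + 3 * 0.7**2 * 0.3 = 0.343 + 0.441 = 0.784
three_voters = majority_vote_accuracy(0.7, 3)
eleven_voters = majority_vote_accuracy(0.7, 11)
```

Under these assumptions the vote of three 70%-accurate classifiers is right about 78% of the time, and adding more voters raises accuracy further. When errors are correlated the gain is smaller.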

Among breakthroughs in deep learning, the convolutional neural network (CNN) has gained significant attention from researchers because of its high performance in computer vision tasks. A CNN is a specialized type of deep neural network that adaptively learns features through backpropagation, eliminating the need for manual feature extraction. Several studies in recent years have illustrated the potential of deep learning for medical imaging tasks. [4] [5] [6] Mzoughi et al., Khan et al., and Qummar et al. have reported applications of a CNN for brain tumor classification, COVID-19 diagnosis from chest X-ray images, and diabetic retinopathy detection, respectively. [4] [5] [6] With the emergence of high-performing deep learning models, neural networks have become an attractive tool for radiologists due to their ability to automatically learn feature hierarchies. For some classification tasks, CNNs have shown success in overcoming the limitations of traditional ML models.

In addition to computer vision, deep learning has made large contributions to the field of natural language processing (NLP), which is concerned with the development of machines to analyze and derive meaning from human language. 7 Natural language processing has become a part of the clinical flow in emergency medicine as a way to harness the vast amounts of textual data in electronic health records. Certain neural network architectures are designed to effectively extract valuable information from unstructured text data in electronic health records such as clinical reports and health-care provider notes. 8, 9 Natural language processing methods have shown potential in leveraging medical records for various clinical tasks such as identifying sepsis, appendicitis, and influenza. [10] [11] [12] [13] Examples of deep learning methods that have garnered interest for NLP tasks include recurrent neural networks, CNNs, and deep generative models. Unsupervised learning and reinforcement learning methods have also been applied to NLP.

Supervised ML algorithms have rapidly replaced traditional methods in predictive studies aiming to forecast the occurrence of outcomes using patient characteristics that are measured prior to the outcomes. As these models allow us to capture relationships between features (predictors) and outcomes flexibly, their prediction performance is expected to exceed that of simple scoring systems. In addition to predictive studies, supervised ML algorithms have come into use in causal inference studies targeting the investigation of the effects of interventions on outcomes of interest. As causal inference in observational studies generally needs models to estimate treatment effects, we can expect to reduce bias using sophisticated ML models.

Triage and disposition

Triage refers to the process by which patients are assessed upon arrival to the emergency department (ED) and prioritized based on the severity and urgency of their medical condition. Traditionally, a triage nurse will carry out the evaluation using vital signs, demographics, and ordered tests. Proficiency in triage takes time and experience. When immediate life-threatening conditions are not identified, Emergency Severity Index level 3 is a default choice, which could leave a large number of patients waiting long hours for a provider and evaluation. 14 The use of ML models in the ED can facilitate triage with more accuracy and efficiency, requiring only information routinely collected by the triage staff. In addition to predicting the urgency of medical conditions, ML techniques can be applied to develop screening tools for disease-specific risk prediction.

The ED provider does not always have timely access to radiology interpretation. Accurate identification of a fracture on X-ray images or a stroke on magnetic resonance imaging scans conventionally requires prompt expert interpretation to avoid misdiagnosis and a delay in treatment. This is especially critical when working in a smaller ED with limited access to specialists. Deep learning models for medical imaging with high sensitivities could help clinicians quickly identify life-threatening pathologies. Recent reports suggest that the quality of AI interpretation is not inferior to that of an expert radiologist. [15] [16] [17] [18]

Emergency department operations and management

Stochasticity in ED operations, such as patient arrivals, the types of medical treatments and diagnostic tests required, and the duration of treatments and tests, poses unique challenges for predicting future service demands. Artificial intelligence has the potential to transform ED operations and hospital leadership at multiple steps in the patient care process from arrival to discharge. The integration of ML could improve ED operations by better matching resources to patient needs, ultimately reducing costs and improving patient outcomes. 19 Emergency department overcrowding is an increasing issue in health care that can have negative implications for the quality of patient care. 20 Predictive models for ED volume could help plan staffing and prepare for surge and disaster situations. In addition, many complicating factors make it difficult to estimate ED wait times. Machine learning algorithms with the ability to identify patterns in complex feature sets have the potential to produce more accurate estimates of ED wait times. On a larger scale, if multiple nearby hospitals report accurate wait times, low-acuity patients could choose an ED based on wait time and travel distance.

Not all AI applications will survive and win trust from clinicians and patients. It seems that triage and radiology models are likely to be adopted faster than other applications in ED operations. 19 Table 1 outlines the selected works covered in this narrative to provide an overview of the most recent ML techniques in emergency medicine that have shown promise to improve patient outcomes.

Hernán et al. have categorized data science approaches in medical and epidemiological research into three tasks: description, prediction, and causal inference (counterfactual prediction). 34 Although descriptive and predictive studies are essential to understanding the frequency, determinants, and prognosis of diseases or conditions, clinicians usually cannot achieve the ultimate goal by solely using these types of research methods when the aim is to improve patient outcomes through interventions. Thus, causal inference, a type of study that compares hypothetical potential outcomes under two or more different treatments in a targeted population, attracts great interest in medical research. For example, we aspire to know whether resuscitative endovascular balloon occlusion of the aorta (REBOA) improves mortality in patients with life-threatening trauma.

Machine learning models can contribute to the improvement of causal inference in several ways. First, unsupervised learning models can identify groups of patients that share specific characteristics. A recent article identified four clinically meaningful phenotypes of sepsis using data from several observational studies and randomized controlled trials. 35 In this study, k-means-based consensus clustering was used for the grouping. Although this study did not explicitly conduct causal inference, the authors suggested that the effects of early goal-directed therapy differed across these identified phenotypes.
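As a rough illustration of how clustering groups patients, the sketch below implements plain k-means (Lloyd's algorithm). It is not the consensus clustering procedure used in the cited sepsis study, which repeats clustering over many resampled datasets. The two-feature data points and the starting centroids are invented.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical standardized measurements for six patients (two features each).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

No outcome labels are used at any point. The groups emerge only from similarity between patients, which is why clinical interpretation of the resulting clusters remains essential.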

Second, predictions from supervised learning models can find high-risk patients more accurately than previous approaches. As an example, a study referenced earlier (Table 1) predicted critical care (intensive care unit admission or in-hospital death) and hospitalization of patients presenting to the ED using baseline demographics, vital signs, chief complaints, and patient comorbidities. 21 The authors found that the discrimination accuracies of the four ML models (lasso regression, random forest, gradient boosting, and deep neural network) were higher than that of the prediction model using logistic regression. Even though the purpose of this study was not causal inference, earlier detection of high-risk patients might lead to the identification of a subgroup that benefits from immediate aggressive interventions.
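Discrimination of the kind compared above is commonly summarized by the area under the receiver operating characteristic curve, which equals the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it. The sketch below computes it by direct pairwise comparison. The labels and scores are invented, and the function name is an assumption.

```python
def auroc(labels, scores):
    """AUROC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for four patients; 1 = outcome occurred.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
value = auroc(labels, scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of patients, so rank-based formulas are preferable for large datasets.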

Finally, ML models can be directly used in causal inference to improve the model fit of either a treatment model to construct a propensity score or an outcome model or both. Although logistic regression is almost always used for propensity score estimation, it is plausible to use more sophisticated methods for this purpose. A representative example was a study that evaluated the effect of transthoracic echocardiography on 28-day mortality in intensive care unit patients with sepsis. 30 The authors used gradient boosting, rather than logistic regression, for the treatment model to estimate propensity scores to receive transthoracic echocardiography. Similarly, we can also use ML models for the outcome model to estimate treatment effect to draw causal inference.
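Once propensity scores have been estimated by any treatment model, whether logistic regression or gradient boosting, they can be used to weight patients. The sketch below shows a normalized inverse probability weighting estimate of the average treatment effect. It is a generic illustration, not the analysis of the cited echocardiography study, and it takes the propensity scores as given rather than fitting a model. All values are invented.

```python
def ipw_ate(treated, outcomes, propensity):
    """Normalized inverse-probability-weighted average treatment effect.

    treated:    1 if the patient received the treatment, else 0
    outcomes:   observed outcome for each patient
    propensity: estimated probability of treatment given covariates
    """
    if any(not 0.0 < e < 1.0 for e in propensity):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    w_treated = [a / e for a, e in zip(treated, propensity)]
    w_control = [(1 - a) / (1 - e) for a, e in zip(treated, propensity)]
    mean_treated = sum(w * y for w, y in zip(w_treated, outcomes)) / sum(w_treated)
    mean_control = sum(w * y for w, y in zip(w_control, outcomes)) / sum(w_control)
    return mean_treated - mean_control


# Hypothetical data with constant propensity, as in a randomized trial.
treated = [1, 1, 1, 0, 0, 0]
outcomes = [1, 1, 0, 1, 0, 0]
effect = ipw_ate(treated, outcomes, propensity=[0.5] * 6)  # 2/3 - 1/3
```

With a constant propensity score the estimate reduces to the simple difference in group means. With covariate-dependent scores, patients who were unlikely to receive the treatment they actually received are weighted more heavily.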

A challenge in using ML models for causal inference is the lack of reliable ways to verify that these methods outperform traditional parametric models based on linear or logistic regression. In predictive studies, we can compare the performance of ML models with that of parametric models on held-out test datasets using metrics such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Additionally, there is no guarantee that ML algorithms better eliminate confounding, even when the prediction accuracy of the treatment model and outcome model is improved. As scientists do not know the ground truth of the causal effect, there is always a risk of overfitting when we use complicated models. Hernán et al. advocated the use of doubly robust estimators, which combine a model that predicts the outcome from multiple covariates with a model for the exposure (i.e., the propensity score model), to estimate the causal effect of an exposure on an outcome. They also suggest that sample splitting with cross-fitting could overcome the risk of overfitting of ML models. Machine learning models are not "magic wands" that automatically answer causal questions. However, they can help researchers estimate the effects of interventions accurately.
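One widely used doubly robust estimator is augmented inverse probability weighting (AIPW). The text above does not say which doubly robust estimator is meant, so this choice is an assumption. The sketch takes the outcome-model predictions (`mu1`, `mu0`) and the propensity scores as given. Under cross-fitting, each patient's predictions would come from models fitted on the other folds of the data. All names and numbers are hypothetical.

```python
def aipw_ate(treated, outcomes, propensity, mu1, mu0):
    """Augmented inverse probability weighting estimate of the average treatment effect.

    mu1, mu0: outcome-model predictions for each patient under treatment and control
    """
    total = 0.0
    for a, y, e, m1, m0 in zip(treated, outcomes, propensity, mu1, mu0):
        total += (m1 - m0                          # outcome-model contrast
                  + a * (y - m1) / e               # correction from treated patients
                  - (1 - a) * (y - m0) / (1 - e))  # correction from control patients
    return total / len(outcomes)


# Hypothetical example: the outcome model is exactly right for the observed arm,
# while the propensity scores are arbitrary (possibly misspecified).
treated = [1, 0, 1, 0]
outcomes = [3.0, 1.0, 5.0, 2.0]
mu1 = [3.0, 4.0, 5.0, 4.0]
mu0 = [1.0, 1.0, 2.0, 2.0]
effect = aipw_ate(treated, outcomes, [0.9, 0.2, 0.3, 0.6], mu1, mu0)  # 2.5
```

Because the outcome model fits the observed outcomes exactly in this toy example, the correction terms vanish and the estimate equals the mean of `mu1 - mu0` regardless of the propensity scores. This illustrates one half of the double robustness property: the estimator remains consistent if either of the two models is correctly specified.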

Artificial intelligence is not a panacea for diagnostic and therapeutic dilemmas. Many prediction models using AI are presented in academic articles. However, the number of algorithms that have been used for the improvement of patient care is still limited. Yin et al. 36 found that only 51 relevant studies reported implementing and

Garbage in, garbage out

One of the greatest barriers to the safe implementation of AI is the accuracy of input. Datasets used for ML training will be chosen and scored by "expert" clinicians. These datasets could contain cases that are incorrectly diagnosed, leading to fundamental flaws in decision making. Additionally, the composition of these datasets might be subject to bias. Cases that stand out in the programmer/clinician's mind may be over-represented (e.g., "availability bias"). There can also be "spectrum bias" in ML. For example, lung cancers in a computed tomography (CT) dataset of biopsy-proven cases could look different from those diagnosed incidentally in the ED, leading to degraded AI performance. Additionally, there may be "base rate neglect". A weighting of "cancer" versus "not-cancer" should be based not only on lesion characteristics but also on the pretest probability in the population to which the AI is being applied. This pretest probability will differ between, for example, an academic referral center where a cancer CT ML/AI dataset is developed and a community ED. Errors can be amplified in ML iterations (Table 2). While the inclusion of a misdiagnosis in the first dataset can lead to less-than-ideal diagnostic accuracy, further ML based on this initial dataset can reinforce this bias. One can think of it as a form of "confirmation bias"; the AI is looking for patterns it already knows even if they are erroneous. For example, several mislabeled echocardiograms or radiographs could lead to the incorporation of similar, erroneously interpreted studies into the ML process. This is not just theoretical; several algorithms have become "self-fulfilling prophecies". In one case, questionable race-based adjustments for glomerular filtration rate biased the process of referral for kidney transplants against Black patients. 37 There are cautionary tales of algorithms used to identify "drug seekers" that include spurious information. 38

Concern has also been raised about the accuracy of AI in those with a disability. 39 The accuracy of AI can also be hampered at the bedside by the subjective nature of required data. Even with something so fundamental to the diagnostic process as patient history, there is often a lack of interobserver agreement. This can lead to variable scoring of predictive models. For example, interobserver agreement on the patient history is poor even for something as straightforward as the HEART score, designed to predict the 6-week risk of major cardiac events. 40 As history will necessarily make up part of a predictive model, the prediction for any individual patient will depend on the accuracy of these data.
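The effect of pretest probability raised earlier in this section can be shown with Bayes' theorem. The sketch below computes the positive predictive value of the same hypothetical classifier in two settings. The sensitivity, specificity, and prevalence figures are invented for illustration and do not describe any real tool.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive result, by Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Same hypothetical classifier (90% sensitive, 90% specific) in two populations.
referral_center = positive_predictive_value(0.9, 0.9, prevalence=0.30)  # about 0.79
community_ed = positive_predictive_value(0.9, 0.9, prevalence=0.02)     # about 0.16
```

With identical test characteristics, a positive result means disease in roughly four of five flagged patients at the hypothetical referral center but in fewer than one of five in the low-prevalence setting. This is why a model tuned where disease is common can generate many false alarms elsewhere.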

Another barrier to the safe implementation of AI is the proprietary nature of most systems. External validation must be assured as part of the quality control process. For example, the sepsis decision support tool in the EPIC electronic medical record was found to be neither sensitive nor specific when applied to an external validation set. 41 Data sharing is another issue in the ongoing training, validation, and improvement of AI algorithms. A model can quickly become outdated as clinical practice evolves. Thus, users need to continuously provide data to fine-tune the model to fit the current situation. Substantial effort is also needed to anonymize and de-identify the data to protect patients' privacy.

We cannot expect diagnostic perfection; whether made by an AI or human, diagnosis and treatment decisions are probabilistic. Errors will be made. Ideally, we would accept the same rate of "misses" by an AI as by a human provider. This remains a fertile area for research.

Some regulation is already in place. The European Union ranks medical AI applications as "high risk," making them subject to stricter oversight than, for example, the AI that fills out your music playlist or recommends your next binge-watch. Artificial intelligence/ML is also subject to regulation by the Food and Drug Administration as a "medical device" in the United States. We have argued in a prior paper that, given the role of AI/ML in patient care, programmers/designers of diagnostic software should be considered medical providers and should be subject to traditional principles of medical ethics such as beneficence and nonmaleficence. 42

LIMITATIONS

Many research studies at the forefront of innovation may be found in nonmedical journals or in preprints that have yet to be peer-reviewed. Therefore, we are not able to capture all state-of-the-art ML technologies in emergency medicine.

This review summarized the current status of ML research in emergency medicine. Although many applications have demonstrated efficacy in academic literature, few have been implemented in practice due to barriers such as potential bias in datasets, the proprietorship of systems, and regulation. Quality measures and ethical controls need to be developed, including appropriate external oversight. Although this might make AI development more cumbersome, ensuring accuracy when potentially life-changing decisions are being made is critical. Future research should work towards overcoming these challenges to bridge the gap between academic research and clinical integration.

Machine learning in relation to emergency medicine clinical and operational scenarios: an overview

Machine learning and artificial intelligence: definitions, applications, and future directions

A guide to deep learning in healthcare

Deep Multi-Scale 3D Convolutional Neural Network (CNN) for MRI gliomas brain tumor classification

CoroNet: a deep neural network for detection and diagnosis of COVID-19 from chest x-ray images

A deep learning ensemble approach for diabetic retinopathy detection

Recent trends in deep learning based natural language processing

Adverse drug event detection from electronic health records using hierarchical recurrent neural networks with dual-level embedding

Deep learning in clinical natural language processing: a methodical review

Natural language processing-enabled and conventional data capture methods for input to electronic health records: a comparative usability study

Extracting actionable findings of appendicitis from radiology reports using natural language processing

Building a natural language processing tool to identify patients with high clinical suspicion for Kawasaki disease from emergency department notes

The effects of natural language processing on cross-institutional portability of influenza case detection for disease surveillance

Decreasing length of stay in the emergency department with a split emergency severity index 3 patient flow model

Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study

Analysis of head CT scans flagged by deep learning software for acute intracranial hemorrhage

Utility of artificial intelligence tool as a prospective radiology peer reviewer - detection of unreported intracranial hemorrhage

Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms

How artificial intelligence could transform emergency department operations

Short and Long term predictions of Hospital emergency department attendances

Emergency department triage prediction of clinical outcomes using machine learning models

Improving ED emergency severity index acuity assignment using machine learning and clinical natural language processing

Emergency department disposition prediction using a deep neural network with integrated clinical narratives and structured data

Automated detection of altered mental status in emergency department clinical notes: a deep learning approach

A machine learning approach to predicting need for hospitalization for pediatric asthma exacerbation at the time of emergency department triage

Predicting adult neuroscience intensive care unit admission from emergency department triage using a retrospective, tabular-free text machine learning approach

Machine learning for prediction of septic shock at initial triage in emergency department

Predicting urinary tract infections in the emergency department with machine learning

Deep neural network improves fracture detection by clinicians

Transthoracic echocardiography and mortality in sepsis: analysis of the MIMIC-III database

Predicting waiting time to treatment for emergency department patients

Improving emergency department efficiency by patient scheduling using deep reinforcement learning

A medical procedure-based patient grouping method for an emergency department

A second chance to get causal inference right: A classification of data science tasks

Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis

Role of artificial intelligence applications in real-life clinical practice: systematic review

Examining the potential impact of race multiplier utilization in estimated glomerular filtration rate calculation on African-American care outcomes

A drug addiction risk algorithm and its grim toll on chronic pain sufferers | WIRED

Toward fairness in AI for people with disabilities

A prospective evaluation of clinical HEART score agreement, accuracy, and adherence in emergency department chest pain patients

External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients

The wizard behind the curtain: programmers as providers

© 2022 The Authors. Acute Medicine & Surgery published by John Wiley & Sons Australia, Ltd on behalf of Japanese Association for Acute Medicine