Problems in the deployment of machine-learned models in health care

Authors: Cohen, Joseph Paul; Cao, Tianshi; Viviano, Joseph D.; Huang, Chin-Wei; Fralick, Michael; Ghassemi, Marzyeh; Mamdani, Muhammad; Greiner, Russell; Bengio, Yoshua
Journal: CMAJ. Date: 2021-09-07. DOI: 10.1503/cmaj.202066

In a companion article, Verma and colleagues discuss how machine-learned solutions can be developed and implemented to support medical decision-making. 1 Both decision-support systems and clinical prediction tools developed using machine learning (including the special case of deep learning) are similar to clinical support tools developed using classical statistical models and, as such, have similar limitations. 2,3 A model that makes incorrect predictions can lead its users to make errors they otherwise would not have made when caring for patients, and it is therefore important to understand how these models can fail. 4 We discuss these limitations, focusing on 2 issues in particular, out-of-distribution (or out-of-sample) generalization and incorrect feature attribution, to underscore the need to consider potential caveats when using machine-learned solutions.

Herein we use the term "machine-learned model" to refer to a model that has been created by running a supervised machine learning algorithm on a labelled data set. Machine-learned models are trained on specific data sets, known as their training distribution. Training data are typically drawn from specific ranges of demographics, countries, hospitals, devices, protocols and so on. Machine-learned models are not dynamic unless they are explicitly designed to be, meaning that they do not change as they are used. Typically, a machine-learned model is deterministic, having learned a fixed set of weights (i.e., coefficients or parameters) that do not change as the model is run; that is, for any specific input, it will return the same prediction every time. Although "adaptive systems" have been developed that can "learn" while being deployed by incorporating new data, such systems may give a different prediction for the same input, and their safety and oversight are still unclear. 5

We refer to the data that a machine-learned model will encounter when it is deployed for use as the model's performance distribution. If a machine-learned model's training distribution does not match its performance distribution, the performance of the model may be lower than expected, 6,7 a challenge commonly referred to as out-of-distribution generalization (discussed in detail below). Another challenge arises if the training data contain features that are spuriously correlated with the outcomes the tool is being designed to predict, as this may cause a machine-learned model to make predictions from the "wrong" features (also discussed below). A model's creator should seek a training data distribution that matches the performance distribution as closely as possible, and clinicians who use the tool should be aware of the exact limitations of the model's training distribution and its potential shortcomings.
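A simple, concrete first check on the mismatch between a training distribution and a performance distribution is to compare the two data sets feature by feature. The sketch below is a minimal illustration, not part of the article; the file names and the choice of a two-sample Kolmogorov-Smirnov test per numeric column are assumptions made for the example.

```python
# Minimal sketch: screen for covariate shift between training and deployment data.
# Assumes two CSV files with the same numeric columns; the file names are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("training_cohort.csv")      # data the model was trained on
deploy = pd.read_csv("deployment_cohort.csv")   # data encountered in clinical use

ALPHA = 0.01  # flagging threshold; adjust for the number of features tested

for column in train.select_dtypes("number").columns:
    stat, p_value = ks_2samp(train[column].dropna(), deploy[column].dropna())
    if p_value < ALPHA:
        print(f"Possible shift in '{column}': KS statistic {stat:.2f}, p = {p_value:.1e}")
```

A per-feature test of this kind cannot detect every form of dataset shift (for example, changes in how features relate to one another or to the outcome), but it is a cheap screen to run before more formal out-of-distribution detection.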
What are some potential problems of machine-learned models?

Newly graduated physicians are typically most comfortable managing patients who exhibit conditions they encountered during their residency training, but they are also able to manage patients with conditions they have not previously seen because they can use theoretical knowledge to recognize patterns of illness. In contrast, machine-learned methods are limited by the data provided during the training and development phase. Furthermore, machine-learned models do not typically know their own limits unless components are included to help the model detect when data it encounters are out of distribution (for example, a component may be built in that prevents a human chest radiograph diagnostic system from processing a photo of a cat and diagnosing pneumonia; 8 see the strategies listed below).

Three categories of out-of-distribution data, 9 summarized in Figure 1, include the following:
• Data that are unrelated to the task, such as obviously wrong images from a different domain (for example, magnetic resonance images presented to a machine-learned model that was trained on radiographs) and less obviously wrong images, such as a wrist radiograph processed by a model trained on chest radiographs
• Incorrectly prepared data; for example, blurry chest radiographs, images with poor contrast or an incorrect view of the anatomy, images presented in an incorrect file format or improperly processed, and images acquired with an incorrect data acquisition protocol
• Data not included in the training data owing to a selection bias; for example, images showing a disease not present in the training data or images arising from a population demographic not represented in the training data set

A machine-learned model will perform suboptimally or deliver unexpected results on out-of-distribution data. Many strategies have been developed to detect out-of-distribution data and prevent it from being processed. A typical approach is for the model to compute a score reflecting how closely a data sample matches the model's training distribution; if the sample is judged too dissimilar (for example, because a mismatch score exceeds a chosen threshold), the model can decline to process it. One way to do this, in the case of image interpretation, is for the model to attempt to reconstruct the image and compare the reconstruction to the original by some measure of similarity, such as the absolute pixel difference. 8,10 Typically, a model will do a poor job of reconstructing an image unlike those it encountered in training. If the reconstructed image is scored as similar enough to be judged "correct," the model can proceed to process that image; if not, processing will not occur (a minimal sketch follows below). However, in order to build and evaluate such out-of-distribution detection systems, known out-of-distribution examples must be used; so even strategies to prevent errors have limits.
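The reconstruction-based check described above can be sketched in a few lines. The example below is a minimal, hedged illustration rather than the authors' implementation: it assumes a small convolutional autoencoder has already been trained on in-distribution chest radiographs, and it rejects any input whose mean absolute reconstruction error exceeds a threshold; the architecture and the threshold value are placeholders.

```python
# Minimal sketch of reconstruction-based out-of-distribution screening.
# Assumes the autoencoder has already been trained on in-distribution images;
# the architecture and the threshold below are illustrative placeholders.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def reconstruction_error(autoencoder, image):
    """Mean absolute pixel difference between an image and its reconstruction."""
    reconstruction = autoencoder(image)
    return (image - reconstruction).abs().mean().item()

def should_process(autoencoder, image, threshold=0.05):
    """Process the image only if it looks like the training distribution."""
    return reconstruction_error(autoencoder, image) <= threshold
```

In practice the threshold would be calibrated on held-out in-distribution images and, as noted above, the detector itself can only be validated against out-of-distribution examples that are already known.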
Machine-learned models typically use only the minimally complicated set of features required to reliably discriminate between the target outputs in their training data set. That is, the model takes a "path of least resistance" during its learning, 11-13 finding features that are highly predictive of the target output, which helps to make it accurate. However, a learning model may also find some distractor feature in the data that is spuriously correlated with the target output 14 and, once this happens, the model may stop looking for new, truly discriminative features even if they exist. 15 For example, in a model learning to read chest radiographs, distractor features may include the hospital, image acquisition parameters, radiograph view (e.g., anteroposterior v. anteroposterior supine) and artifacts such as the presence of a pacemaker or endotracheal tube. If clinical protocols or image processing change over time, this can create patterns in the training data that the model detects and uses as a distractor. 16 Similarly, if images from multiple hospitals are grouped together and the rate of a disease varies among hospitals, a model may learn to detect the hospital using subtle visual cues and may then base its predictions on the hospital associated with the image rather than the data in the image itself. This can lead to a model appearing more accurate than it actually is if the evaluation data contain the same artifacts (e.g., the same hospital-specific distribution), but the same model could fail dramatically if the performance data do not exhibit these artifacts.

Figure 1: Three categories of out-of-distribution data, all in the context of training a machine-learned algorithm to read adult chest radiographs (see image C iii). A) Images that are unrelated to the task. B) Images that are incorrectly acquired. C) Images that are not encountered owing to a selection bias in the training distribution (e.g., images with lung cancer lesions and pacemakers were not included in the training set and therefore were unseen during training); C iii) training data that are subject to a selection bias.

Furthermore, patient demographics (e.g., age or sex) can be inferred from aspects of the training data and may be used by a learning model to predict outcome prevalence (that is, prior probability) in the training sample if better, truly outcome-related features are less obvious in the data. Medical data sets are often relatively small, which may increase the likelihood of spuriously correlated features. Research into altering the ways models learn so as to avoid this problem is ongoing. 11,17 However, using a large, diverse data set for training a machine-learned model will help to reduce the effect of distractors. Other solutions include unsupervised learning and transfer learning, 18 which train models using data that are unlabelled or labelled for another task, thereby reducing the chance of learning spurious features specific to a particular data set. These methods typically enable the use of much more data and have a better chance of learning features that are general enough to be useful for the intended task. 18 In cases where pathology-specific features are simply not predictive enough for some images, the learning model may be forced to guess and to predict the prevalence of a disease or outcome in the training distribution.
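One practical way to look for distractor features such as the hospital of origin, as in the example above, is to stratify evaluation by that attribute and compare performance across strata. The sketch below is a minimal illustration and not from the article; the file and column names (label, score, hospital) are hypothetical. Large gaps in discrimination between sites, or disease prevalence that closely tracks the site, are warning signs that the model may be relying on site-specific cues.

```python
# Minimal sketch: stratify evaluation by an attribute that should be irrelevant
# (here, the hospital of origin) to surface possible shortcut learning.
# 'results' is assumed to hold one row per test image with hypothetical columns:
#   label (0/1 ground truth), score (model output), hospital (site identifier).
import pandas as pd
from sklearn.metrics import roc_auc_score

results = pd.read_csv("test_set_predictions.csv")  # hypothetical file

print(f"Overall AUROC: {roc_auc_score(results['label'], results['score']):.3f}")

for hospital, group in results.groupby("hospital"):
    if group["label"].nunique() < 2:
        continue  # AUROC is undefined when only one class is present at a site
    site_auc = roc_auc_score(group["label"], group["score"])
    print(f"  {hospital}: AUROC {site_auc:.3f} (n = {len(group)})")

# If disease prevalence differs sharply across sites, the site itself is
# informative about the label and the model may be exploiting it.
print(results.groupby("hospital")["label"].mean())
```

A model whose performance collapses at sites underrepresented in training, or whose outputs mostly track site-level prevalence, warrants further investigation before deployment.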
The machine-learned model will appear to work when applied to data in which the disease or outcome prevalence is the same as in the training data; it may give the "right" answer. However, when applied to a different population with a different outcome prevalence, the model will likely predict incorrectly 19,20 and lead to harm. It is therefore important that model developers and users verify that the machine-learned model detects features that are truly associated with the prediction or outcome of interest, using a feature attribution method such as the "image gradient" method 21 or by creating a counterfactual input showing what would change the classifier's prediction, 22 both during development and after deployment (a minimal sketch of a gradient-based attribution check appears at the end of this article).

Related to this point, another concern is that some models may simply learn to copy the actions taken by clinicians when the data were generated. For example, if a model is trained to predict the need for blood transfusions based on historical data about transfusions, it may not have anything informative to predict from and will instead learn to replicate existing practices. A model will learn "bad habits" unless the data set used to develop it is corrected. One approach to overcome this problem would be to have expert reviewers label the data set with the true outcomes of interest (e.g., appropriate v. inappropriate blood transfusions), although this may be resource intensive and experts may not always agree on labels. It would be even better to use only labels that are objective and do not depend on human experts.

Avoiding errors related to the issues discussed above involves careful evaluation of machine-learned models 23 using new data from the performance distribution, including samples that are expected to expose model failures, such as those with different population demographics, difficult conditions, poor-quality images or errors. A potentially useful approach is to create simulated test distributions by balancing data on attributes unrelated to the target task, to observe differences in the model's performance according to factors such as demographic minority class 24 or geographic region. 25 If a model has learned to focus on a spurious feature such as age, deploying it on a population composed of a single age group, even one balanced with respect to the target variable the model was trained to predict, would lead to poor performance. Results of such tests of a model's performance should be transparently presented to illustrate its limitations in use. 26 A related article discusses evaluation of machine-learned models in some depth. 27

It is important to understand and tackle these problems of machine-learned models before deployment so that large investments do not end in failure, which could be costly or catastrophic. IBM's "Watson for Oncology" program 28 was suspended after an investment of $62 million, allegedly owing to problematic clinical recommendations that resulted in poor acceptance by clinicians. Google's machine-learned initiative to detect diabetic retinopathy 29 struggled when it encountered "real-world" images in clinics in Thailand that were of lower quality than those in its training set, causing considerable frustration to both patients and staff. Anticipating and mitigating the challenges outlined herein will be key to avoiding such costly failures.
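The "image gradient" attribution check referred to above can be approximated with a short gradient computation: the gradient of the model's output with respect to the input pixels highlights which regions most influence the prediction. The sketch below is a minimal illustration rather than the authors' pipeline; the trained classifier, the preprocessed image tensor and the target class index are assumed to exist and are hypothetical.

```python
# Minimal sketch of gradient-based feature attribution (a saliency map).
# 'model' is assumed to be a trained image classifier and 'image' a single
# preprocessed input tensor of shape (1, channels, height, width).
import torch

def saliency_map(model, image, target_class):
    """Absolute gradient of the target-class score with respect to the input pixels."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]  # scalar score for the class of interest
    score.backward()                       # fills image.grad with d(score)/d(pixel)
    # Collapse the colour channels by taking the maximum absolute gradient.
    return image.grad.detach().abs().max(dim=1)[0].squeeze(0)

# Usage with hypothetical objects: overlay the map on the radiograph and check
# whether the high-gradient regions correspond to plausible pathology rather
# than to markers, devices or other artifacts.
# heatmap = saliency_map(model, image, target_class=pneumonia_index)
```

As one of the article's references notes, saliency can be a red herring when diagnosing poor generalization, so attribution maps should supplement, not replace, evaluation on new data from the performance distribution.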
References
1. Implementing machine learning in medicine.
2. How to read articles that use machine learning: users' guides to the medical literature.
3. Artificial intelligence for medical image analysis: a guide for authors and reviewers.
4. Human-computer collaboration for skin cancer recognition.
5. medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device
6. Learning from data: a short course.
7. Potential biases in machine learning algorithms using electronic health record data.
8. Chester: a web delivered locally computed chest x-ray disease prediction system.
9. A benchmark of medical out of distribution detection. Uncertainty & Robustness in Deep Learning Workshop at ICML; arXiv.
10. A less biased evaluation of out of distribution sample detectors.
11. Right for the right reasons: training differentiable models by constraining their explanations.
12. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study.
13. Deep learning predicts hip fracture using confounding patient and healthcare variables.
14. Saliency is a possible red herring when diagnosing poor generalization.
15. Neural smithing: supervised learning in feedforward artificial neural networks.
16. Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks.
17. Unsupervised domain adaptation by backpropagation.
18. Deep learning of representations for unsupervised and transfer learning.
19. A unifying view on dataset shift in classification.
20. ADHD-200 global competition: diagnosing ADHD using personal characteristic data can outperform resting state fMRI measurements.
21. Deep inside convolutional networks: visualising image classification models and saliency maps.
22. Gifsplanation via latent shift: a simple autoencoder approach to counterfactual generation for chest x-rays.
23. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
24. CheXclusion: fairness gaps in deep chest x-ray classifiers.
25. No classification without representation: assessing geodiversity issues in open data sets for the developing world.
26. Model cards for model reporting.
27. Evaluation of machine learning solutions in medicine.
28. Stories of AI failure and how to avoid similar AI fails. Lexalytics.
29. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy.

Contributors: All authors contributed to the conception and design of the work, drafted the manuscript, revised it critically for important intellectual content, gave final approval of the version to be published and agreed to be accountable for all aspects of the work.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY-NC-ND 4.0) licence, which permits use, distribution and reproduction in any medium, provided that the original publication is properly cited, the use is noncommercial (i.e., research or educational use), and no modifications or adaptations are made. See: https://creativecommons.org/licenses/by-nc-nd/4.0/