key: cord-0977481-fo9ekd1l
authors: Wiemken, Timothy L; Rutschman, Ana Santos
title: Methodology Minute: A Machine Learning Primer for Infection Prevention and Control
date: 2020-10-01
journal: Am J Infect Control
DOI: 10.1016/j.ajic.2020.09.009
sha: 4ee1c9a97cc95e131128b4607f05dd418c489d18
doc_id: 977481
cord_uid: fo9ekd1l

The use of machine learning and predictive modeling in infection prevention and control activities is increasing dramatically. For infection preventionists to make informed decisions about the performance of any particular model, and to determine whether its output will be useful for their program needs, a suitable understanding of how these models are created and evaluated is necessary. The purpose of this primer is to introduce the infection preventionist to the most commonly used machine learning method in infection prevention: supervised learning.

Machine learning is an umbrella term encompassing many algorithms used to assist human understanding of large amounts of data (1). Machine learning is often conflated with 'predictive' or 'prescriptive' modeling, though the terms are not directly interchangeable. It is also commonly confused with 'artificial intelligence', a phrase with little agreement on definition.
Simply put, machine learning can be considered a computational method, while general artificial intelligence could be considered a physical manifestation that uses machine learning to perform one or more tasks. Several subclasses of machine learning have been developed and used in medicine (2); the three major groups are unsupervised, supervised, and reinforcement learning. Each group includes different algorithms, each with its own pros and cons for specific tasks. Here, we focus on supervised machine learning, as it is the most commonly used approach for predictive modeling in infection prevention and control (IPC). Because this is only a brief primer, readers can find reviews elsewhere in the literature for further study (1).

When the phrase machine learning is used, the speaker is most often talking about supervised machine learning, the approach most closely associated with predictive modeling. Supervised machine learning is a method in which an algorithm learns from available data to create a model, which is then used to predict an outcome for new data. The learning process is called 'training the model', in which various features (also known as variables) are provided to an algorithm. The prefix 'supervised' means that the outcome is known to the model during training. For example, one may be interested in predicting whether someone has a catheter-associated urinary tract infection (CAUTI) before a diagnostic test. In this example, a binary variable, CAUTI, is the outcome of interest and would be obtained retrospectively from an electronic health record. A multitude of features that the modeler thinks might explain the presence or absence of CAUTI are added to the model as well. Depending on the algorithm used, different mathematical computations are performed to allow the computer to learn and compare the complex patterns of the features in patients with and without CAUTI. Next, new data without the outcome known (e.g.
prospective patients with unknown CAUTI status) are passed to this model, which will output a prediction of the presence or absence of CAUTI. Supervised machine learning is used regularly in infection prevention for prediction of various health outcomes such as Clostridioides difficile infection (3) and other healthcare-associated infections (4, 5), as well as for development of vaccine candidates for SARS-CoV-2 (6).

Model performance is summarized by several statistics that describe how well the model predicts the outcome. Care must be taken when evaluating performance statistics, as there are many and they are often used inappropriately, particularly model accuracy. A case in point is a product vendor who creates a model to predict sepsis and reports the model to be 90% accurate. Although this sounds impressive at face value, it is more important to look at the sensitivity, specificity, and positive and negative predictive values than at a single grand statistic such as accuracy. Accuracy is calculated as the total correct predictions divided by the total predictions, and it does not separate false positives from false negatives. Furthermore, if the model was trained inappropriately, the accuracy statistic may be incongruent with the performance of the model in real life. For example, if the model was trained on a dataset where 90% of the patients did not have sepsis and 10% did, the model could simply report that every patient did not have sepsis and it would still be 90% accurate. It would miss the 10% with sepsis, defeating the purpose of the prediction: to identify what is called the 'minority class', or the outcome group that is less frequent (7).

A complex part of building a supervised machine learning model is 'feature engineering': creating and modifying variables (features, in machine learning language) for inclusion. It is not as simple an approach as traditional explanatory regression modeling, where only 'clinically meaningful' or 'biologically plausible' variables are added to the model (e.g.
age, gender, race, etc.). Those variables can certainly be added to a machine learning model, but the result should not be expected to be superior to that of any traditional regression model with the same variables (8). With machine learning, the goal is not to quantify the impact of variables on the outcome with a risk ratio or odds ratio, but rather to have a model that can accurately predict an outcome. The features used in the model are largely irrelevant as long as they improve the performance of the model.

Natural Language Processing (NLP) is another umbrella term, encompassing many methods for dealing with textual data. These range from text mining (extracting specific words or phrases from a corpus of text) to topic modeling (defining topics from a corpus) and defining the sentiment of text, among a wide variety of other methods often used in machine learning models. NLP has proven useful, including for identifying healthcare-associated infections from notes in the electronic health record (9, 10).

It is critical to understand that machine learning models are based on data that are likely to reflect contextual biases because they were produced in high-resource settings, such as university-affiliated hospitals (11). Outside these settings, data are unlikely to properly account for diversity in the clinical features of patients, or may not have the same variables (e.g. various laboratory values) available for modeling. Since machine learning models strictly learn patterns present in the data supplied to the algorithm, any biased input data will result in biased outputs and predictions. As we begin to rely more on computational predictions in all areas of medicine and health, we must recognize and start addressing the data gaps. Failing to do so can lead to racism, sexism, ageism, or other forms of discrimination.

The future will bring more and more electronic decision support tools using machine learning to our workday.
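The sepsis accuracy example given earlier can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming a hypothetical cohort of 1,000 patients (100 with sepsis, 900 without) and a "model" that predicts no sepsis for everyone. The cohort size and the function name `performance_summary` are illustrative assumptions, not taken from any cited study.

```python
def performance_summary(tp, fp, tn, fn):
    """Compute common performance statistics from confusion-matrix counts.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives.
    """
    def ratio(numerator, denominator):
        # None marks a statistic that is undefined (zero denominator).
        return numerator / denominator if denominator else None

    total = tp + fp + tn + fn
    return {
        "accuracy": ratio(tp + tn, total),     # correct / all predictions
        "sensitivity": ratio(tp, tp + fn),     # detected among true cases
        "specificity": ratio(tn, tn + fp),     # cleared among non-cases
        "ppv": ratio(tp, tp + fp),             # positive predictive value
        "npv": ratio(tn, tn + fn),             # negative predictive value
    }


# Hypothetical cohort: 100 septic and 900 non-septic patients.
# A model that always predicts "no sepsis" yields no positives at all.
always_negative = performance_summary(tp=0, fp=0, tn=900, fn=100)

for name, value in always_negative.items():
    print(name, "undefined" if value is None else f"{value:.2f}")
```

Under these assumptions the accuracy is 0.90 while the sensitivity is 0.00, and the positive predictive value cannot be computed at all because the model never predicts a positive. This is why the component statistics are more informative than accuracy alone when the outcome is uncommon.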
Remembering that any computer model is created by humans and may be error prone underscores the need to implement any decision support tool with caution, ensuring the clinical aspects of any predictions are not disregarded.

1. Machine Learning in Epidemiology and Health Outcomes Research
2. High-performance medicine: the convergence of human and artificial intelligence
3. Data-Driven Approach to Predict Daily Risk of Clostridium difficile Infection at Two Large Academic Health Centers
4. Detecting Healthcare-Associated Infections in Electronic Health Records: Evaluation of Machine Learning and Preprocessing Techniques
5. Detecting hospital-acquired infections: A document classification approach using support vector machines and gradient tree boosting
6. Computationally Optimized SARS-CoV-2 MHC Class I and II Vaccine Formulations Predicted to Target Human Haplotype Distributions
7. Commentary: The problem of class imbalance in biomedical data
8. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models
9. Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections
10. Detection of healthcare-associated urinary tract infection in Swedish electronic health records. Stud Health Technol Inform