key: cord-1027999-uro4rd8n
authors: Subudhi, Sonu; Verma, Ashish; B.Patel, Ankit
title: Prognostic machine learning models for COVID-19 to facilitate decision making
date: 2020-08-18
journal: Int J Clin Pract
DOI: 10.1111/ijcp.13685
sha: a9722062af73fb62973e6718ce601e6712141ddf
doc_id: 1027999
cord_uid: uro4rd8n

Abstract: An increasing number of COVID-19 cases worldwide has overwhelmed the healthcare system. Physicians are struggling to allocate resources and to focus their attention on high-risk patients, partly because early identification of high-risk individuals is difficult. This can be attributed to the fact that COVID-19 is a novel disease and its pathogenesis is still only partially understood. However, machine learning algorithms have the capability to analyze a large number of parameters within a short period of time to identify the predictors of disease outcome. Implementing such an algorithm to predict high-risk individuals during the early stages of infection would be helpful in decision making for clinicians such that irreversible damage could be prevented. Here, we propose recommendations to develop prognostic machine learning models using electronic health records so that a real-time risk score can be developed for COVID-19.

The current surge in COVID-19 patients has created unprecedented stress on healthcare infrastructure. Early identification of high-risk patients can allow healthcare workers to allocate their efforts and resources early during the clinical course to maximize their impact on patient health. Early critical care management in certain clinical settings has demonstrated improvement in mortality [1]. However, identification of patients at high risk of progressive and severe disease remains a challenge. Although patient characteristics, such as chest radiograph findings, lymphopenia, age, gender and viral load, are associated with severe COVID-19 disease, there are currently no available biomarkers that can reliably distinguish patients who need immediate medical attention [2]. Here we lay out recommendations to implement a machine learning algorithm which would facilitate early clinical decision making during outbreaks like COVID-19.

Rationale for machine learning: In the case of the COVID-19 outbreak, there had been more than 2.3 million cases in the United States and more than 9 million cases worldwide as of June 23, 2020 [3]. Given the number of cases, a manual, case-by-case review to identify patterns that indicate poor prognosis is not feasible. The large number of cases has particularly stressed intensive care unit (ICU) settings, with an increased need for ICU beds. With this increased demand for ICU beds comes an immense need for ventilators and continuous renal replacement machines, given the high rates of pulmonary and renal failure [2].
A prediction model that can identify, at an early stage, the patients more likely to deteriorate and require ICU care will allow physicians to allocate manpower and resources in an expeditious and informed manner. In certain cases, it has been demonstrated that machine learning models might outperform traditional clinical scoring systems or regression methods [4, 5]. Machine learning models that use decision trees or neural networks can detect non-linear relationships and interactions between variables, which could explain their better performance. Prediction models can also estimate risk for specific diseases, e.g. distinguishing those who will develop respiratory failure and require ventilators from those who will develop renal failure and require renal replacement therapy, as well as identifying patients at risk of requiring both life-supporting treatments. Integrating a prediction model with the electronic health record (EHR) can give physicians immediate information about the expected patient course and predicted response to treatments.

Outcome of interest and applicability: Machine learning models can be trained to detect patterns in a large number of records in a fraction of the time that a manual review would take. Supervised machine learning is a type of machine learning in which the model is trained using patient traits as input and disease outcome as output. Early clinical, radiological, and laboratory data could be considered as input, while disease severity, measured by a variety of metrics, could be the output used to train a predictive model for COVID-19. With input data drawn from electronic health records, certain characteristics or laboratory values that have yet to be associated with disease severity could be found to be strong predictors in specific situations, giving clinicians information that they would otherwise not have had time to investigate in a novel disease such as COVID-19. Multiple examples of machine learning in predicting clinical outcomes currently exist.
Using a longitudinal dataset of electronic health records (EHR) from more than 700,000 patients, a machine learning model was able to predict future acute kidney injury [6]. A similar machine learning model, based on hospital data from a Portuguese and an American hospital, was able to predict the risk of ICU admission [7]. A study from Denmark used a machine learning model to predict 90-day mortality for intensive care unit patients from time series data [8]. The key findings of this study were that predictive performance improved significantly over the time course of the ICU stay, and that patient input features interacted and compensated for one another to pull the patient towards survival at one timepoint and towards mortality at another. Static prognostic scoring systems usually fail to adapt to such patient dynamics. In the field of cardiology, machine learning techniques have been employed for in-hospital monitoring, precision medicine, imaging and electrocardiography, and have been shown to improve patient outcomes [9]. These examples underscore the capabilities of machine learning.

Building a machine learning model for COVID-19 would require early-stage clinical, radiological, and laboratory data from a large cohort (Figure 1). The training dataset must also include information about the patient outcomes one is looking to predict, which forms the primary basis of machine learning training. To improve the robustness of the model, we suggest removing patients with unknown outcomes during model development. The training dataset should include patient data, such as demographic data, co-morbidities, outpatient medications, vital signs and laboratory values, which could be obtained from an EHR database.
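The dataset preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the feature names in `FEATURES`, the `severe` outcome label, the `build_training_set` helper and the three toy records are all hypothetical, and a real EHR export would define its own schema.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical feature names; a real EHR export would define its own schema.
FEATURES = ["age", "ldh", "lymphocyte_pct", "hs_crp"]

Record = Dict[str, Optional[float]]


def build_training_set(
    records: List[Record], outcome_key: str = "severe"
) -> Tuple[List[List[Optional[float]]], List[int]]:
    """Return (X, y), keeping only patients whose outcome is known."""
    X: List[List[Optional[float]]] = []
    y: List[int] = []
    for rec in records:
        outcome = rec.get(outcome_key)
        if outcome is None:
            # Unknown outcome: excluded from model development.
            continue
        # Missing feature values stay as None and need separate handling.
        X.append([rec.get(name) for name in FEATURES])
        y.append(int(outcome))
    return X, y


# Toy records with made-up values, for illustration only.
records = [
    {"age": 71, "ldh": 480.0, "lymphocyte_pct": 9.5, "hs_crp": 95.0, "severe": 1},
    {"age": 34, "ldh": 210.0, "lymphocyte_pct": 28.0, "hs_crp": 4.0, "severe": 0},
    {"age": 58, "ldh": 350.0, "lymphocyte_pct": 15.0, "hs_crp": None, "severe": None},
]

X, y = build_training_set(records)
print(len(X), y)  # 2 [1, 0]
```

The third record is dropped because its outcome is unknown; the resulting `X` and `y` could then be passed to any supervised classifier.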
In this situation, supervised machine learning algorithms for classification, such as random forest, gradient boosting, support vector machines, naïve Bayes, k-nearest neighbor and logistic regression, could be used. Machine learning approaches have been implemented on COVID-19 patient data in China [10, 11]. The aim was to predict the severity of disease based on initial presentation data. One of the models was able to predict disease outcome accurately in 90% of cases [11]. In this model, the most important features used for prediction were lactate dehydrogenase (LDH), lymphocytes and high-sensitivity C-reactive protein (hs-CRP). Similar implementations of the machine learning approach in larger cohorts from other countries can provide more specific models to understand local factors as predictors of disease.

Advantages and challenges: Machine learning models benefit from larger sample sizes, which in most cases improve the accuracy of models [12]. Machine learning algorithms can also be used to detect novel biomarkers that have non-linear interactions with each other. An added advantage of this approach is that the risk score can be generated in real time by integrating the model with electronic health records. As vital signs or laboratory values change, the model can predict a new score based on the most recent data, allowing real-time monitoring. A key challenge of clinical data is missing values. Although different imputation methods are available for substituting missing data, we recommend not using them for variables with high levels of missingness. For variables with a small percentage of missingness, however, imputation using the k-nearest neighbour algorithm could be used, which is more accurate than using mean or median values [13]. With the introduction of newer medications, model performance might be affected.
This limitation needs to be assessed, and the covariates should be updated as necessary to ensure good performance of the model. The existing machine learning models usually come from single-centre, retrospective studies [10, 11, 14]. Although they identified LDH, hs-CRP and lymphocytes as key factors, these findings need to be evaluated at a multicenter level with a larger dataset, which would also reduce any bias that might have been introduced in a single-center study [11]. These studies have also not discussed how different co-morbidities alter COVID-19 severity. For example, patients with diabetes and patients with cancer might not have the same risk factors. Splitting patient data into subgroups and generating a model for each would allow us to identify important covariates for specific co-morbidities. Some complex machine learning models, such as gradient boosting and neural network models, are black box models that are difficult to interpret but generally perform better than white box models. When selecting the best model, this trade-off between accuracy and interpretability should be considered, and appropriate decisions should be taken based on a risk-benefit assessment. It is important to be cautious of the model overfitting the data, which can be mitigated by increasing the number of patients used in training the model. As the COVID-19 outbreak expands and more patient data become available, the accuracy of the model should improve. It is also important for the clinician to remember that machine learning provides only a prediction. Blind reliance on predictive models may lead to automation bias, which must be carefully monitored during implementation of predictive models. The overall goal of this approach would be to provide a system to identify high-risk individuals at an early stage to help in allocating adequate resources, such as ICU beds, and to provide necessary interventions before irreversible clinical damage occurs.
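The missing-data recommendation made under Advantages and challenges can be sketched as follows. This is a dependency-free illustration of k-nearest neighbour imputation, not the procedure evaluated in reference 13: the 20% missingness cut-off (`max_missing`), the choice of k = 2, the function names and the toy values are all assumptions made for the example. In practice features would normally be scaled before distances are computed.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]


def missing_fraction(rows: List[Row], col: int) -> float:
    """Fraction of rows in which column `col` is missing."""
    return sum(r[col] is None for r in rows) / len(rows)


def distance(a: Row, b: Row, skip: int) -> Optional[float]:
    """Root-mean-square difference over features observed in both rows."""
    shared = [
        (x, y)
        for i, (x, y) in enumerate(zip(a, b))
        if i != skip and x is not None and y is not None
    ]
    if not shared:
        return None  # no common features, so the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows: List[Row], k: int = 2, max_missing: float = 0.2) -> List[Row]:
    """Fill gaps with the mean of the k nearest rows; skip sparse columns."""
    result = [list(r) for r in rows]
    for col in range(len(rows[0])):
        if missing_fraction(rows, col) > max_missing:
            continue  # high missingness: leave the column unimputed
        for i, row in enumerate(rows):
            if row[col] is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue
                d = distance(row, other, col)
                if d is not None:
                    candidates.append((d, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [value for _, value in candidates[:k]]
            if nearest:
                result[i][col] = sum(nearest) / len(nearest)
    return result


# Toy rows (age, LDH, hs-CRP) with made-up values, for illustration only.
rows = [
    [70.0, 500.0, 90.0],
    [68.0, 480.0, None],
    [30.0, 200.0, 5.0],
    [72.0, 510.0, 100.0],
    [35.0, 220.0, 8.0],
]
filled = knn_impute(rows)
print(filled[1][2])  # 95.0
```

The missing hs-CRP value is filled from the two most similar patients, whereas a column missing in most rows would be left untouched and could instead be excluded from the model.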
1. Lower mortality of COVID-19 by early recognition and intervention: experience from Jiangsu Province
2. Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations
3. Coronavirus disease (COVID-2019) situation reports
4. Prediction of cardiac arrest in critically ill patients presenting to the emergency department using a machine learning score incorporating heart rate variability compared with the modified early warning score
5. Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards
6. A clinically applicable approach to continuous prediction of future acute kidney injury
7. Predicting Intensive Care Unit admission among patients presenting to the emergency department using machine learning and natural language processing
8. Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. The Lancet Digital Health
9. State-of-the-Art Machine Learning Techniques Aiming to Improve Patient Outcomes Pertaining to the Cardiovascular System
10. Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity
11. An interpretable mortality prediction model for COVID-19 patients
12. Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners
13. Comparison of Performance of Data Imputation Methods for Numeric Dataset
14. Predicting COVID-19 Pneumonia Severity on Chest X-ray with Deep Learning. arXiv e-prints

This article is protected by copyright. All rights reserved