key: cord-0587897-9ihm3mgz
authors: Li, Yifan; Yoon, Garrett; Nasir-Moin, Mustafa; Rosenberg, David; Neifert, Sean; Kondziolka, Douglas; Oermann, Eric Karl
title: Identifying and mitigating bias in algorithms used to manage patients in a pandemic
date: 2021-10-30
journal: nan
DOI: nan
sha: 63fefba381f003fb28740905efbee2b398b96e07
doc_id: 587897
cord_uid: 9ihm3mgz

Introduction: Numerous COVID-19 clinical decision support systems have been developed. However, many of these systems do not merit validation because of methodological shortcomings, including algorithmic bias.

Methods: Logistic regression models were created to predict COVID-19 mortality, ventilator status, and inpatient status using a real-world dataset from four hospitals in New York City, and were analyzed for bias with respect to race, gender, and age. Simple thresholding adjustments were applied during training to establish more equitable models.

Results: Compared to the naively trained models, the calibrated models showed a 57% decrease in the number of biased trials, while predictive performance, measured by area under the receiver operating characteristic curve (AUC), remained unchanged. After calibration, the average sensitivity of the predictive models increased from 0.527 to 0.955.

Conclusion: We demonstrate that naively training and deploying machine learning models on real-world data for predictive analytics of COVID-19 carries a high risk of bias. Simple, easily implemented adjustments or calibrations during model training can lead to substantial and sustained gains in fairness on subsequent deployment.

Biased algorithms risk worsening disparities and decreasing access in healthcare, as modern healthcare increasingly relies on algorithms to guide diagnostic and treatment decision making.1-3 An ongoing "living review" found that only two out of 232 published models are sufficiently well reported and free of bias to merit further validation,4,5 while a recent review of COVID-19 studies found that none of the 2,212 surveyed are of clinical use due to methodological shortcomings.4,5 This study is the first to assess the risk of naively obtaining a predictive model biased by race, age, or sex when training on a real-world dataset from four hospitals in New York City to predict the outcomes of patients with COVID-19. We subsequently demonstrate how simple adjustments to the training process to account for bias can lead to more equitable models. We focused on point-of-care screening for COVID-19 and used overall classification performance, measured by area under the receiver operating characteristic curve (AUROC), and sensitivity as key performance metrics.

We included all patients who were screened for COVID-19 by nasal swab in hospitals affiliated with the NYU Langone Health System between January 1st, 2020 and December 31st, 2020. We constructed logistic regression models to predict three clinical outcomes: requirement for admission, requirement for a ventilator, and mortality. A model was considered biased if it failed to achieve equal sensitivity across the subgroups of each protected feature (see Supplemental Methods for details).6 Protected features tested in this study included race, sex, and age over 62 (senior). We trained models naively and with a bias-minimization constraint for equal sensitivity (calibrated models). We conducted post-hoc analyses to compare sensitivity between subgroups of each protected feature using two-sample z-tests on one hundred held-out test sets to simulate how the models would perform in a real-world setting.
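As a concrete illustration of this bias check, the following is a minimal sketch of how a two-sample z-test for proportions could compare subgroup sensitivities on a held-out test set. It uses statsmodels' proportions_ztest; the variable and function names, and the toy data, are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch: flag a model as potentially biased for one protected feature
# when its sensitivities (true positive rates) differ between the two subgroups
# by a two-sample z-test for proportions. Bonferroni adjustment across
# features/outcomes is applied externally to the alpha passed in here.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest


def subgroup_sensitivity_gap(y_true, y_pred, group, alpha=0.05):
    """Compare sensitivity between the two subgroups of one protected feature.

    y_true, y_pred: binary arrays of labels and model predictions.
    group: binary array marking subgroup membership (e.g., senior vs non-senior).
    Returns the absolute sensitivity gap, the p-value, and a significance flag.
    """
    counts = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)               # positives in this subgroup
        counts.append((int(y_pred[mask].sum()), int(mask.sum())))  # (true positives, positives)
    (tp0, n0), (tp1, n1) = counts
    stat, p_value = proportions_ztest([tp0, tp1], [n0, n1])
    gap = abs(tp0 / n0 - tp1 / n1)
    return gap, p_value, p_value < alpha


# Toy usage with simulated predictions: the model is less sensitive for group 1
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
p_detect = np.where(group == 0, 0.9, 0.6)
y_pred = np.where((y_true == 1) & (rng.random(1000) < p_detect), 1, 0)
print(subgroup_sensitivity_gap(y_true, y_pred, group))
```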
All analyses used an alpha of 0.05 with Bonferroni adjustment.

For the 21,768 patients who screened positive for COVID-19, the median age was between 48 and 52, 48% were female, 32% were over the age of 62, and 56% were a racial minority (non-white). Subgroups within each protected feature had unequal label representation (Supplemental Table 1). Naively trained predictive models frequently had a non-zero risk of bias by sex, race, or age (Supplemental Table 2), with six out of nine analyses showing bias for the particular feature tested. This effect was present for all protected features (Table 1). The naively trained models achieved an average AUC of 0.943, an average sensitivity of 0.527, and an average difference in sensitivity of 0.063 between subgroups. After re-calibrating the models to mitigate bias, the average AUC remained unchanged at 0.943, the average sensitivity increased to 0.955, and the average difference in sensitivity (recall) between subgroups of each protected feature decreased by 75% to 0.016. Recalibrating the models to eliminate bias led to an expected increase in false positives, with the magnitude depending upon the underlying distribution of risk (Supplemental Figure 1). After simulating real-world deployment on 100 bootstrapped samples, re-calibration decreased the probability of obtaining a biased algorithm in six out of nine cases and decreased the overall risk of biased predictions by 56% (Table 1).

We demonstrate that naively training and deploying models on real-world data for predictive analytics of COVID-19 carries a high risk of producing models biased by sex, age, or race. While prior studies have suggested a risk of bias4 and demonstrated it in select cases,1,2 this is the first study to systematically assess the risk of biased model training on real-world data for COVID-19 and to provide validation of these concerns. We demonstrate how simple awareness of the problem and easily implemented post-hoc or train-time solutions can lead to substantial and sustained gains in fairness on subsequent deployment.6 There are multiple causes of biased medical algorithms, including datasets, modelling decisions, and the choice of deployment environment, and addressing them begins with awareness of the underlying issue. Study limitations include using data from only a single hospital system in New York City and using an easily implemented method rather than more sophisticated pre-existing solutions for reducing bias. The first and most critical safeguard against bias in medical AI models is awareness on the part of physicians and developers, together with a plan to actively screen for bias before deployment. In our increasingly algorithmically driven medical systems, failing to recognize and account for bias risks worsening inequalities.

YL and EKO designed the study. YL and GY conducted the experiments and analysis. YL and EKO drafted the manuscript. All authors provided comments used to finalize the manuscript. All authors approved the final manuscript.

Table 1: Bootstrap testing of bias correction over trial datasets

Supplemental Methods

We studied all patients who were screened for COVID-19 by nasal swab in hospitals affiliated with the NYU Langone Health System between January 1st, 2020 and December 31st, 2020. We constructed 100 logistic regression models on all COVID-19-positive patients to predict three clinical outcomes: requirement for admission, requirement for a ventilator, and mortality.
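To illustrate this naive training arm, the sketch below repeatedly splits a cohort 60/20/20, fits an unconstrained logistic regression for one outcome, and records AUC and sensitivity at a single global threshold. The DataFrame columns, feature list, and helper name are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of the naive training arm: for each trial, split the
# cohort into 60% train / 20% validation / 20% test, fit a logistic regression
# for one outcome (e.g., ventilator requirement), and score the test split with
# a single global 0.5 threshold.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def run_naive_trials(df: pd.DataFrame, features, outcome, n_trials=100):
    aucs, sensitivities = [], []
    for seed in range(n_trials):
        train_val, test = train_test_split(
            df, test_size=0.2, random_state=seed, stratify=df[outcome])
        train, val = train_test_split(      # 0.25 of 80% -> 60/20/20 overall
            train_val, test_size=0.25, random_state=seed, stratify=train_val[outcome])
        # `val` is reserved for threshold calibration in the bias-mitigated arm;
        # it is unused in this naive version.
        model = LogisticRegression(max_iter=1000)
        model.fit(train[features], train[outcome])
        prob = model.predict_proba(test[features])[:, 1]
        pred = (prob >= 0.5).astype(int)    # naive single global threshold
        aucs.append(roc_auc_score(test[outcome], prob))
        sensitivities.append(recall_score(test[outcome], pred))
    return np.mean(aucs), np.mean(sensitivities)
```

In the bias-mitigated arm described in the supplemental methods, the same loop would additionally adjust per-subgroup thresholds on the validation split before scoring the test split.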
A model was considered biased if it failed to satisfy an equal opportunity criterion, defined as an equal likelihood of truly case-positive patients from each group within a protected feature being predicted as positive (i.e., equal sensitivity). The protected features tested in this study included race, gender, and age over 62. For each model, the dataset was split into training (60%), validation (20%), and test (20%) sets. We trained models naively and under a bias-mitigation constraint. To mitigate bias we employed two separate methods. First, we trained models naively and then, after training, adjusted the model's predictive threshold to obtain equal sensitivities for each subgroup while ensuring the model had greater than 85% sensitivity (see the code sketch following the figure legend below). Second, we trained penalized logistic regression models using stochastic gradient descent with a penalty for having different sensitivities across subgroups. After training, we conducted a post-hoc analysis to compare model sensitivity between members of each protected feature (e.g., male vs. female patients) using two-sample z-tests for proportions on the validation set. We then calibrated the thresholds per subgroup to equalize sensitivities and applied those new subgroup thresholds to the test set. Lastly, we compared models trained to mitigate bias (post-calibration) against naively trained models (pre-calibration) on 100 newly created bootstrapped datasets drawn from the original data, to simulate how the models would perform in the real world pre- and post-bias mitigation. All analyses used an alpha of 0.05 with Bonferroni adjustment for multiple comparisons. We chose to focus on screening models that

References

1. Dissecting racial bias in an algorithm used to manage the health of populations
2. An algorithmic approach to reducing unexplained pain disparities in underserved populations
3. Key challenges for delivering clinical impact with artificial intelligence
4. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
5. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
6. Equality of opportunity in supervised learning

Disclosures: Dr Oermann reported consulting for Google, equity in Artisight Inc, and employment at Merck. No other disclosures were reported. No funding was provided for the research.

Figure 1: a. Underlying distribution of ventilator requirements by age for all patients, patients younger than 62 (non-senior), and patients 62 and older (senior), showing a significantly higher rate of ventilator requirement in the senior group, with the original threshold (dotted blue line) chosen to capture this quantity. The bias-mitigated threshold (red line) is shifted left to equalize recall with the non-senior (under 62) group. b. Predicted positives versus negatives in the pre- and post-mitigation age groups, showing an increase in false positives for the senior group. c. Calibration largely increased both true positives and false positives and decreased false negatives; the calibration penalizes accuracy.
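Following the supplemental methods above, this is a minimal sketch of the post-hoc, per-subgroup threshold adjustment: thresholds are chosen on the validation set so that each subgroup reaches approximately the same target sensitivity (mirroring the paper's >85% floor), and then reused at test time. The function names, the fallback threshold, the exact threshold-selection rule, and the toy data are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of per-subgroup threshold calibration
# for equal opportunity: on the validation set, pick the largest threshold for
# each subgroup that still classifies at least `target_sensitivity` of that
# subgroup's true positives as positive, then apply those thresholds at test time.
import numpy as np


def equal_opportunity_thresholds(prob_val, y_val, group_val, target_sensitivity=0.85):
    """Return {subgroup: threshold} so each subgroup attains >= target sensitivity."""
    thresholds = {}
    for g in np.unique(group_val):
        pos_scores = np.sort(prob_val[(group_val == g) & (y_val == 1)])
        if len(pos_scores) == 0:
            thresholds[g] = 0.5   # no positives observed for this subgroup; fall back
            continue
        # We may exclude at most floor(n * (1 - target)) of the lowest-scoring
        # positives and still keep sensitivity >= target.
        k = int(np.floor((1 - target_sensitivity) * len(pos_scores)))
        thresholds[g] = pos_scores[min(k, len(pos_scores) - 1)]
    return thresholds


def predict_with_group_thresholds(prob, group, thresholds):
    """Apply subgroup-specific thresholds to predicted probabilities."""
    return np.array([int(p >= thresholds[g]) for p, g in zip(prob, group)])


# Toy usage: simulated validation scores that are systematically lower for subgroup 1
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, 500)
group_val = rng.integers(0, 2, 500)
prob_val = np.clip(0.5 * y_val + 0.4 * rng.random(500) - 0.15 * group_val, 0, 1)
thr = equal_opportunity_thresholds(prob_val, y_val, group_val)
print(thr)
print(predict_with_group_thresholds(prob_val[:10], group_val[:10], thr))
```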