key: cord-0910935-kl2qofui
authors: Gutierrez, J. M.; Volkovs, M.; Poutanen, T.; Watson, T.; Rosella, L.
title: Development of a Multivariable Model for COVID-19 Risk Stratification Based on Gradient Boosting Decision Trees
date: 2020-12-30
journal: nan
DOI: 10.1101/2020.12.23.20248783
sha: 7bb3308161e6297d9e3f821b9235283cec04684c
doc_id: 910935
cord_uid: kl2qofui

Importance Population stratification of the adult population in Ontario, Canada by their risk of COVID-19 complications can support rapid pandemic response, resource allocation, and decision making. Objective To develop and validate a multivariable model to predict risk of hospitalization due to COVID-19 severity from routinely collected health records of the entire adult population of Ontario, Canada. Design, Setting, and Participants This cohort study included 36,323 adult patients (age >= 18 years) from the province of Ontario, Canada, who tested positive for SARS-CoV-2 nucleic acid by polymerase chain reaction between February 2 and October 5, 2020, and followed up through November 5, 2020. Patients living in long-term care facilities were excluded from the analysis. Main Outcomes and Measures Risk of hospitalization within 30 days of COVID-19 diagnosis was estimated via Gradient Boosting Decision Trees, and risk factor importance was examined via Shapley values. Results The study cohort included 36,323 patients with majority female sex (18,895 [52.02%]) and median (IQR) age of 45 (31-58) years. The cohort had a hospitalization rate of 7.11% (2,583 hospitalizations) with median (IQR) time to hospitalization of 1 (0-5) days, and a mortality rate of 2.49% (906 deaths) with median (IQR) time to death of 12 (6-27) days. In contrast to patients who were not hospitalized, those who were hospitalized had a higher median age (64 years vs 43 years, p-value < 0.001), majority male (56.25% vs 47.35%, p-value<0.001), and had a higher median [IQR] number of comorbidities (3 [2-6] vs 1 [0-3], p-value<0.001). Patients were randomly split into development (n=29,058, 80%) and held-out validation (n=7,265, 20%) cohorts. The final Gradient Boosting model was built using the XGBoost algorithm and achieved high discrimination (development cohort: mean area under the receiver operating characteristic curve across the five folds of 0.852; held-out validation cohort: 0.8475) as well as excellent calibration (R2=0.998, slope=1.01, intercept=-0.01). The patients who scored at the top 10% in the validation cohort captured 47.41% of the actual hospitalizations, whereas those scored at the top 30% captured 80.56%. Patients in the held-out validation cohort (n=7,265) with a score of at least 0.5 (n=2,149, 29.58%) had a 20.29% hospitalization rate (positive predictive value 20.29%) compared with 2.2% hospitalization rate for those with a score less than 0.5 (n=5,116, 70.42%; negative predictive value 97.8%). Aside from age, gender and number of comorbidities, the features that most contribute to model predictions were: history of abnormal blood levels of creatinine, neutrophils and leukocytes, geography and chronic kidney disease. Conclusions A risk stratification model has been developed and validated using unique, de-identified, and linked routinely collected health administrative data available in Ontario, Canada. The final XGBoost model showed a high discrimination rate, with the potential utility to stratify patients at risk of serious COVID-19 outcomes. This model demonstrates that routinely collected health system data can be successfully leveraged as a proxy for the potential risk of severe COVID-19 complications. Specifically, past laboratory results and demographic factors provide a strong signal for identifying patients who are susceptible to complications. The model can support population risk stratification that informs patients protection most at risk for severe COVID-19 complications.

As of 5 October 2020, over 35 million cases of confirmed coronavirus disease 2019 and at least 1 million deaths had been reported worldwide [1] . The COVID-19 outbreak has led to an increased demand for healthcare resources and a shortage of medical equipment and staff. Governments and healthcare organizations around the globe are currently working on containing and slowing down the spread of infections while trying to understand the risk factors associated with severe complications of COVID-19. It remains unclear which and how risk factors contribute to COVID-19 severity. Such understanding is crucial to help mitigate the healthcare system's burden by prioritizing testing and resource allocation for those patients at the highest risk. Furthermore, now that vaccines are available [2] , the ability to accurately estimate population risk can guide vaccine rollout strategies and return-to-work prioritization.

A number of diagnostic and prognostic models for COVID- 19 have been developed to support medical decision making [3] . The majority of these models depend on clinical data obtained upon hospital admission (e.g. X-ray images, blood tests) as well as on demographic and medical records (age, comorbidity history) to make a prediction [4] , [5] , [6] , [7] . Since these models can only be applied to patients already hospitalized for COVID-19, it is not possible to extend their use for the general population to identify individuals with the highest potential risk of hospitalization or death from COVID-19. Therefore, risk stratification models that depend only on historical medical records are necessary to fill this gap. Such models are particularly effective in countries with single-payer healthcare systems such as Canada, the United Kingdom and Australia since single-payer systems facilitate access to population-wide medical records. Access to extensive medical records is not limited to single-payer countries. However, databases of commercial insurance claim data are also available for large portions of the population in countries with private healthcare systems such as the United States. Consequently, we believe that with sufficient adaptation, our proposed model has wide applicability for assessing the risk of severe COVID-19 complications in the population using routinely collected data.

The province of Ontario in Canada is one of a number of jurisdictions in the world that has linked medical records on its entire population due to its single-payer health system and robust infrastructure that links all residents through a unique identifier. Analyzing the medical records of Ontario's population is particularly interesting due to its high diversity. Almost three in ten Ontarians identify as members of visible minorities, with most of these individuals living in large metropolitan areas, such as the city of Toronto [8] . Here, we leveraged this extensive and comprehensive data source from COVID-19 positive patients in Ontario to develop a machine learning model to predict the risk of COVID-19 hospitalization.

Our methodology utilizes general medical and demographic attributes commonly collected in claims data in other countries, thus facilitating its repurposing in other jurisdictions.

We obtained health administrative records from a comprehensive data repository held at ICES, a not-for-profit research institute in Ontario. To support COVID-19 research, ICES developed a high-performance infrastructure, the Health AI Data Analytics Platform (HAIDAP), as well as a continually-updated COVID-19 data resource. Using this resource, we identified all patients at least 18 years of age who were enrolled in the Ontario Health Insurance Plan (OHIP), which covers all Ontario residents, and had nasopharyngeal swabs tested for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) between February 2, 2020, and October 5 2020 ( Figure 1 ). Patients were followed up through November 5, 2020 to allow a follow-up period of 30 days.

Individual laboratory data from the Ontario Laboratories Information System (OLIS) to relevant datasets containing healthcare use, demographic, and geographic information using unique encoders and held at ICES. It is important to note that the OLIS data captures most SARS-CoV-2 but maybe missing results from certain private laboratories in the province, which may result in discrepancies between the number of cases in our study and those officially reported [9] . This study was approved by the ICES Privacy and Compliance Office. This study followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) reporting guideline [10] .

Patients were defined as positive for COVID-19 if they had one viral RNA positive polymerase chain reaction (PCR) test during the observation period. The index date for all analyses was defined as the date of the first recorded positive test.

We determined the adverse outcome as hospitalization due to COVID-19 (ICD10 code U071) at or within 30 days of the index date, since the median time to event from the index date and interquartile range (IQR) were 1 day and 0-5 days, respectively. We included baseline sociodemographic and clinical characteristics such as age, sex and comorbidity history ( Table 1 ) .

To ensure that only the most recent data prior to COVID-19 diagnosis were included in our model, we included medical records dated no later than 30 days prior to the index date and not earlier than 2 years before the index date ( Figure 2 ). The 2-year window was selected due to the fact that the majority (>92%) of patients in our study cohort had at least two years of recorded clinical history in our database. Other windows were considered (3, 4, and 5 years), but we discarded them for not covering more than 90% of our patients. The 30-day buffer before the index date was applied to ensure that only historical medical records are used to make a prediction for each patient and not tests are done as a result of the COVID-19

infection.

In addition to hospitalizations, we aggregated historical records of doctor visits, outpatient services, drug prescriptions, and laboratory results for each individual. For each type of laboratory result (blood and urine), we calculated absolute deviations from normal ranges and counted the number of abnormally high or abnormally low measurements recorded in the 2-year window prior to the index date. We excluded variables with records for less than 50% of the patients in our cohort. For example, visits to a nephrologist are only recorded for those patients seeking kidney care; thus if the variable "number of visits to a nephrologist in the last 2 years" is recorded for less than 50% of the patients, then this variable would be discarded from our model. The full list of independent predictor variables extracted from the COVID-19 data source, as well as the fraction of patients lacking observations for each variable can be found in the Supplementary Table 1 .

An 80%/20% random split of the dataset (where each example corresponds to one patient) was used to define development and validation sets. The validation dataset was held back and not used for model training or tuning. For the final model, we built a Gradient Boosted Trees model using the XGBoost All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ; https://doi.org/10.1101/2020.12.23.20248783 doi: medRxiv preprint algorithm [11] . The set of variables included in our XGBoost model were selected using a backward search approach, and hyperparameter tuning (learning rate, maximum tree depth, number of trees, alpha and gamma) was done with the grid-search algorithm to maximize the cross validation area under the receiver operating characteristic curve (AUCROC). XGBoost allows explicit handling of missing values and thus we did not perform data imputation in our model. Finally, discrimination was evaluated by using the AUCROC, in which a value of 0.5 indicates no predictive ability and 1.0 indicates perfect discrimination.

The ICES COVID-19 data source included 58,948 patients who tested positive for COVID-19 between February 2 and November 5, 2020. From these, we excluded patients with index date after October 5, 2020 as well as those patients currently living in a long-term care facility. After exclusions, 36,323 patients were included in our study cohort and followed up for 30 days (Figure 1 ). The hospitalization rate was 7.11% (2,583 hospitalizations) with median (IQR) time to event of 1 (0-5) day(s), and the mortality rate was 2.49% (906 deaths) with median (IQR) time to death of 12 (6-27) days after the index date. The median age of patients in the cohort was 45 years (IQR of 31-58). Table 1 shows the baseline characteristics for all patients in the cohort, and Table 2 shows the same characteristics for the development (29,058; 80%) and validation (7,265; 20%) datasets.

We identified 18 important predictor variables of COVID-19 hospitalization and ranked them by their Shapley contribution score which measures the impact of the variable value on the model output [12] ( Table 3 and Supplementary Figure 1 ). These variables are age, days since the last creatinine blood test, geographical latitude, days since the last basophils blood test, gender, number of doctor visits in the last 2 years, number of comorbidities, number of different unique subclasses of drugs taken in the last 2 years, highest value of creatinine recorded in the last 2 years, number of radiology studies in the last 2 years, average value of neutrophils in blood in the last 2 years, the average value of leukocytes in blood in the last 2 years, number of creatinine blood test in the last 2 years, highest value of hemoglobin recorded in the last 2 years, history of chronic kidney disease, and days since the last mean corpuscular hemoglobin test in the last 2 years. All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ; https://doi.org/10.1101/2020.12.23.20248783 doi: medRxiv preprint The final XGBoost model achieved high discrimination in the 5-fold cross-validation setting with a mean AUCROC of 0.852, and an AUCROC of 0.8475 in the held-out validation cohort. Figure 3A shows the ROC curve of the final model. The model also shows excellent calibration with R 2 =0.998, slope=1.01, and intercept=-0.01 ( Figure 3B ). Patients in the validation cohort with a score of at least 0.5 (n=2149, 29.58%) had a 20.29% hospitalization rate compared with 2.2% hospitalization rate for those patients with a score less than 0.5 (n=5116, 70.42%). Furthermore, patients in the validation cohort scored at the top 10% represent 47.41% of actual hospitalizations, and those scored at the top 30% capture 80.56% of hospitalizations ( Figure 4 , top bar). Interestingly, geographical latitude and laboratory test history were ranked among the top 10 predictors in our model (Supplementary Figure 1 ). Of note, blood biomarkers such as basophils, creatinine, and leukocytes were identified as important predictors, despite the fact that these are historical values and not measurements taken at admission [13] .

After vaccines with sufficient efficacy were announced in October and November of 2020 [14] , governments of virtually all affected countries started to actively develop vaccine rollout and prioritization schedules [15, 16] . The most prevalent approach to assessing vaccine risk is to start with the oldest patients (especially those in the long term care facilities) that account for the majority of reported deaths, and healthcare workers that have a high risk of exposure [17] . However, after these two groups, no agreement has been reached on how to proceed with the rest of the population. Two factors that commonly influence expert recommendations are age and pre-existing conditions (comorbidities) [18] .

Our model provides an alternative approach to leverage machine learning to predict a risk score for every patient. These risk scores can then be used to rank patients and prioritize vaccination. To compare commonly recommended risk factors, we constructed two empirical rules and applied them to the held-out validation cohort to see how many actual hospitalizations we could capture. These two empirical rules are: (1) rank patients by age and select the oldest patients; (2) rank patients by the number of comorbidities and select patients with the most comorbidities.

We split the held-out validation cohort (n=7,265) into percentiles and calculated the recall at the top 10th (n=726), 20th (n=1,453), and 30th (n=2,180) percentiles after ranking the cohort according to our model or each empirical rule. The results of this comparison are shown in Figure 4 . The final XGBoost model outperformed these rules across the top three percentiles with relative gains between 10% and 30%. These results indicate that our model can more accurately identify people at high risk of severe complications from the COVID-19 infection.

We have developed and validated a Gradient Boosted Trees model for predicting the risk of COVID-19 hospitalization in a cohort of patients in Ontario. Our model showed excellent calibration and a high discrimination performance consistent across 5-fold cross-validation cohorts, which was comparably superior to two empirical rules. We envision our model to be deployed and utilized by public health organizations, as opposed to by clinicians directly, to effectively plan resource allocation and vaccination campaigns. Interestingly, past laboratory test results contributed to model predictions and suggests that legacy blood tests can be leveraged as a proxy for future risk of COVID-19 hospitalization. We identified past neutrophil counts in blood as a strong contributor to our model predictions. These findings are consistent with recent studies documenting the role of excessive neutrophil counts in severe COVID-19 pneumonia [19, 20] .

The availability of variables included in the final model is not limited to Ontario's region, as these are variables readily available in most medical record and insurance claim databases around the world. Thus, our methodology could be extended for scoring populations and informing decision making in other jurisdictions outside of Ontario, Canada.

Many recently developed prognostic models for COVID-19 rely on information that must be collected post-infection or at admission into a hospital [21, 22] . A key strength of our model is that it depends only on historical medical records and demographic variables available before infection. These are variables that are routinely collected and readily available in both public and private medical claims databases used across many countries. Furthermore, Ontario has a diverse population that covers a range of population groups and thus will likely have applicability outside of Ontario or could be easily adapted to score other populations. Although future work with an external dataset would be required to validate the model performance in other geographies, we have observed that models developed on these data can usually be repurposed to other jurisdictions [23] [24] [25] . An important strength of our study is the use of Gradient All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ;

Boosted trees, which allow for highly interpretable models to yield novel insights and relationships among predictor variables.

Our study has limitations. Although we have a diverse data source that captures all healthcare interactions, we are limited to some data elements that are not collected in routine data holdings. For example risk factors such as diet and physical activity associated with disease immunity [26] and not including the ability to perform risk stratification at a population-wide level, for being based on an accurate and explainable algorithm, and finally for demonstrating the potential use of legacy laboratory data as a proxy for potential risk of severe COVID-19 complications. These risk stratification models are currently not in practice in our setting to support health system decision-making for COVID-19. We envision our model to provide a more effective way to use routinely collected data to support strategies that protect patients most at risk for serious COVID-19 complications and more careful and precise management for those at low risk.

We thank Elisa Candido and Keng-Yuan Liu for providing technical support and giving relevant advice on handling the data. The icons used in Figure 2 were made by Freepik, Monkik, and Vitaly Gorbachev from www.flaticon.com .

LR is supported by a Tier 2 Canada Research Chair in Population Health Analytics. This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The study sponsors did not participate in the design and conduct of the All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ; https://doi.org/10.1101/2020.12.23.20248783 doi: medRxiv preprint study; collection; management, analysis and interpretation of the data; preparation, review or approval of the manuscript; or the decision to submit the manuscript for publication.

ICES has obtained ethical approval (and repeats this review tri-annually) for its privacy and security policies, procedures, and practices. Each research project that is conducted at ICES is also subject to internal ethical review by the ICES Privacy and Compliance Office. Please find attached to this submission a letter with more information regarding the ethical review and approval process for this research; please do not hesitate to contact me with any questions. ICES is a prescribed entity under section 45 of Ontario's Personal Health Information Protection Act (PHIPA). Section 45 is the provision that enables analysis and compilation of statistical information related to the management, evaluation, and monitoring of, allocation of resources to, and planning for the health system. Section 45 authorizes health information custodians to disclose personal health information to a prescribed entity, like ICES, without consent for such purposes. Projects conducted wholly under section 45, by definition, do not require review by a Research Ethics Board. As a prescribed entity, ICES must submit to trio-annual review and approval of its privacy and security policies, procedures and practices by Ontario's Information and Privacy Commissioner. These include policies, practices and procedures that require internal review and approval of every project by ICES' Privacy and Compliance Office. Figure 1 -Study design . The ICES COVID-19 cohort was last updated on November 7th 2020 and it includes patients with index (diagnosis) dates between February 2 2020 and November 5th 2020. Patients with an index date after October 5, 2020 or currently living in a long-term care facility were excluded. Included patients were followed up for 30 days for the outcome of hospitalization due to COVID-19. All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ; preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ; preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 30, 2020. ; https://doi.org/10.1101/2020.12.23.20248783 doi: medRxiv preprint

COVID-19 Map -Johns Hopkins Coronavirus Resource Center

Pfizer-BioNTech COVID-19 vaccine: What you should know

Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal

Early triage of critically ill COVID-19 patients using deep learning

Living risk prediction algorithm (QCOVID) for risk of hospital admission and mortality from coronavirus 19 in adults: national derivation and validation cohort study

An interpretable mortality prediction model for COVID-19 patients

Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score

Government of Ontario, Ministry of Finance

COVID-19 (coronavirus) in Ontario

XGBoost: A Scalable Tree Boosting System

From local explanations to global understanding with explainable AI for trees

Risk Factors for Hospitalization, Mechanical Ventilation, or Death Among 10 131 US Veterans With SARS-CoV-2 Infection

Who gets a COVID vaccine first? Access plans are taking shape

How COVID vaccines are being divvied up around the world

Immunological considerations for COVID-19 vaccine strategies

A Vaccine Is on Its Way to Canada. Who Will Get It First? The New York Times

Excessive Neutrophils and Neutrophil Extracellular Traps in COVID-19

The emerging role of neutrophil extracellular traps in severe acute respiratory syndrome coronavirus 2 (COVID-19)

Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19

ACP risk grade: a simple mortality index for patients with confirmed or suspected severe acute respiratory syndrome coronavirus 2 disease (COVID-19) during the early stage of outbreak in Wuhan

for the PHIAT-DM team. A population-based risk algorithm for the development of diabetes: development and validation of the Diabetes Population Risk Tool (DPoRT)

Development and Validation of the Chronic Disease Population Risk Tool (CDPoRT) to Predict Incidence of Adult Chronic Disease

Measuring Burden of Unhealthy Behaviours Using a Multivariable Predictive Approach: Life Expectancy Lost in Canada Attributable to Smoking, Alcohol, Physical Inactivity, and Diet

Lifestyle factors in the prevention of COVID-19