key: cord-1052852-r4nzqbgv authors: Adeoye, Elijah A.; Rozenfeld, Yelena; Beam, Jennifer; Boudreau, Karen; Cox, Emily J.; Scanlan, James M. title: Who was at risk for COVID-19 late in the US pandemic? Insights from a population health machine learning model date: 2022-05-11 journal: Med Biol Eng Comput DOI: 10.1007/s11517-022-02549-5 sha: e43c414b75023bfa431f135ef1c33421b2c47b07 doc_id: 1052852 cord_uid: r4nzqbgv Notable discrepancies in vulnerability to COVID-19 infection have been identified between specific population groups and regions in the USA. The purpose of this study was to estimate the likelihood of COVID-19 infection using a machine-learning algorithm that can be updated continuously based on health care data. Patient records were extracted for all COVID-19 nasal swab PCR tests performed within the Providence St. Joseph Health system from February to October of 2020. A total of 316,599 participants were included in this study, and approximately 7.7% (n = 24,358) tested positive for COVID-19. A gradient boosting model, LightGBM (LGBM), predicted risk of initial infection with an area under the receiver operating characteristic curve of 0.819. Factors that predicted infection were cough, fever, being a member of the Hispanic or Latino community, being Spanish speaking, having a history of diabetes or dementia, and living in a neighborhood with housing insecurity. A model trained on sociodemographic, environmental, and medical history data performed well in predicting risk of a positive COVID-19 test. This model could be used to tailor education, public health policy, and resources for communities that are at the greatest risk of infection. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11517-022-02549-5.
Early in the coronavirus disease 2019 (COVID-19) pandemic, popular interest in predicting risk of infection gave rise to mobile applications and tools for predicting exposure risk. These tools used factors such as medical history, mask compliance, location, demographics, and social activity to predict likelihood of infection or mortality [1]. As the pandemic progressed, systematic reviews elucidated additional individual- and population-level characteristics associated with disease progression and mortality. At-risk groups identified by our group and others included people who were older, had laboratory markers of kidney or liver dysfunction, were current smokers, had pre-existing cardiovascular disease, or were Asian, Black, Hispanic or Latino, or non-English-speaking [2][3][4]. These early efforts to categorize at-risk populations were instructive and shaped the initial clinical and population-level responses to the pandemic. However, they generally relied on traditional statistical techniques and the limited amounts of data available at the time. In parallel with simpler prediction tools, artificial intelligence (AI) has been used since the early days of the pandemic to classify and predict risk. For example, a recent review of 130 publications found 71 papers related to computational epidemiology of COVID-19, 40 papers related to early detection and diagnosis of COVID-19, and 19 papers related to COVID-19 disease progression [5]. Common techniques used by these studies were deep learning and transfer learning [5]. Elsewhere, an analysis of 264 papers found that the convolutional neural network was the most frequently applied AI technique in COVID-19 studies, followed by the random forest classifier, ResNet, support vector machines, and deep learning [6]. These reviews document the rapid expansion of machine learning and AI tools during the COVID-19 pandemic.
In 2021, mass vaccinations altered the risk of COVID-19 infection for much of the US population but did not eliminate the need for risk prediction. The emergence of vaccine-eluding variants, barriers to accessing vaccines, and widespread vaccine refusal have made it important to re-evaluate risk on an ongoing basis, particularly because disparities in vaccine acceptance may overlap with disparities in infection and/or severe outcomes. For example, older individuals (who were the first to be offered vaccines) are more likely to accept COVID-19 vaccination than younger individuals, and acceptance rates are highest among Asian and Alaska Native/American Indian populations and lowest among Black people [7, 8]. In 2020, we used logistic regression to examine risk factors associated with COVID-19 infection in 34,503 cases from the Providence health system [2]. As the pandemic evolved, we recognized the need for updated risk assessments and the utility of AI in risk assessment across our growing number of cases. Thus, the present paper updates our previous risk predictions [2] using a more sophisticated machine learning technique in a larger sample of patient data. Our findings confirm the need for ongoing risk assessment and for focusing public resources on the highest-risk communities.

The Providence Institutional Review Board (IRB) approved this study and waived the requirement for written informed consent (IRB identifier STUDY2020000220). The study was conducted in compliance with IRB rules and the Declaration of Helsinki. Data for the development and validation data sets were collected from the electronic medical record (EMR) of Providence St. Joseph Health. Records were included for all people from Alaska, Washington, Oregon, Montana, and California who had at least one COVID-19 PCR test result on a nasal swab sample between February 21, 2020, and October 20, 2020.
People with at least one positive test were coded as positive for infection; people with exclusively negative tests were coded as negative for infection. Location outcomes were evaluated by linking EMR geocoded data to data from the US Census Bureau's 2018 American Community Survey at the census block group or tract level, as previously described [2]. Two rounds of data splitting were employed. In initial tests, data were split into training and test sets at a 75/25 ratio, with a random seed for reproducibility (Fig. 1). After we determined that a light gradient boosted model (LGBM) produced the most accurate results, we performed additional modeling with a train/test/validation split (80/10/10 ratio, respectively). This was done (1) to increase the size of the training set and (2) to check for overfitting by evaluating performance on both a test set and a validation set. Two sets of training data were also generated: with clinical symptoms (fever, cough, myalgia, sore throat, chills, and shortness of breath) and without (Fig. 1). All major statistical analyses were performed using Python version 3.6.12 on a 64-bit computer and version 3.6.10 leveraging a GPU instance in the Azure Machine Learning ecosystem. Continuous variables were standardized or log-normalized to address skew and the influence of large values and outliers on the predictive power of trained models. Count of mental health diagnoses, comorbidities, community size, polypharmacy, and population density had skews of 2.58, 1.56, 28.85, 1.04, and -0.44, respectively. Scaling did not affect the skew of any of these variables; however, log transforming community size reduced its skewness to 4.39. Categorical variables were encoded, and dummy variables were created for variables with more than two classes. Variables were treated mostly as missing not at random (MNAR), except body mass index (BMI) and gender. Missing data for MNAR variables were coded as a separate category, e.g.
"Unknown." For BMI, median imputation was used to fill in the large amount of missing data (n = 25,646 from the initial participant pool, approximately 8%). Gender was analyzed as legal sex, and missing values were dropped (n = 119; 0.04% of the initial participant pool). We used a randomized search approach, with cross-validation, to tune and identify critical hyperparameters for each model (Supplementary Material Table 1). The set of hyperparameters that produced the best area under the curve (AUC) on the training set was selected as part of the final ensemble. This was performed with repeated, stratified k-fold cross-validation with 10 splits and 3 repeats. A random seed was set for reproducibility of the cross-validation step. We chose a randomized approach because the alternative, a more comprehensive grid search, is computationally intensive. We report the best hyperparameters selected for the best model with symptoms (Supplementary Material Table 1). Most COVID-19 test results were negative. Thus, different data augmentation techniques were applied to address class imbalance by over-sampling the minority class and/or down-sampling the majority class. This was done to address model bias towards the negative class (i.e., the population of persons who tested negative for COVID-19), which is important to prevent the model from simply learning to predict the dominant negative class. We used the synthetic minority oversampling technique (SMOTE) and a case-control approach to augment the training data as part of multiple modeling experiments. SMOTE creates synthetic minority-class samples that are close, in a nearest-neighbor sense, to real minority-class samples in the feature space [9]. We also experimented with a case-control (CC) approach, typically used in epidemiological studies, to create a 1:1 match by down-sampling the majority class (COVID-19 negative) to the size of the minority class. Negative classes were selected using a simple random sample method without replacement.
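As an illustration, this 1:1 case-control down-sampling can be sketched in a few lines of pandas (a minimal sketch only; the `covid_positive` column name and the toy cohort below are assumptions for illustration, not the study's actual schema):

```python
import numpy as np
import pandas as pd

def case_control_downsample(df: pd.DataFrame, label: str = "covid_positive",
                            seed: int = 42) -> pd.DataFrame:
    """Down-sample the majority class to a 1:1 match with the minority class."""
    pos = df[df[label] == 1]
    neg = df[df[label] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Simple random sample of the majority class, without replacement
    matched = majority.sample(n=len(minority), replace=False, random_state=seed)
    # Concatenate and shuffle the balanced training set
    return pd.concat([minority, matched]).sample(frac=1, random_state=seed)

# Toy cohort with roughly the study's 7.7% positive rate (synthetic data)
rng = np.random.default_rng(0)
cohort = pd.DataFrame({"covid_positive": (rng.random(10_000) < 0.077).astype(int)})
balanced = case_control_downsample(cohort)
```

After this step, `balanced` contains every positive case and an equal number of randomly matched negatives, so a classifier trained on it is not biased toward the dominant negative class.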
This strategy, unlike SMOTE, uses real, non-synthetic data for model training. Both approaches created a 1:1 match between the negative (majority) class and the positive class. No augmentation was performed on the validation/test data sets. Twelve experiments were conducted such that, in each experiment, models were fitted on the training set depending on whether data augmentation and dimensionality reduction techniques were applied to that set (Fig. 1). For dimensionality reduction, we applied principal component analysis (PCA) to compute the minimal set of principal components that explained 95% of the variance in the data. A recursive feature elimination (RFE) approach was also used, as part of different experiments, to select the minimal set of predictors that were most predictive of a positive COVID-19 test. Dimensionality reduction techniques were also applied to the test/validation sets; PCA was not applied to the comparative logistic regression models. An ensemble approach was used as the predictive model for each experiment. Four models (logistic regression, random forest, and two gradient boosting libraries, XGBoost (XGB) and LightGBM (LGBM)) were used as classifiers for training. We selected the best hyperparameters for each classifier, after hyperparameter tuning, and included these as part of the ensemble for the prediction task. We used a soft-voting ensemble due to the need to compute probabilities of a positive test or event. Two LGBMs were generated, one with symptoms and one without symptoms (Fig. 1). We used the Python implementation of SHAP (SHapley Additive exPlanations) [10] to examine the key predictor variables that contribute to a patient's probability of a positive COVID-19 test result.
The library computes Shapley values, which quantify the marginal contribution of each feature to the predicted outcome for a given instance [11]. This approach examines how much each feature pushes the predicted value of that instance away from a baseline, or average, prediction (the expected value). SHAP thereby improves the interpretability of a machine learning model. SHAP values were computed using the final selected model. In general, models trained with CC-augmented data performed better on the test/validation sets than those trained with SMOTE-augmented data. Area under the receiver operating characteristic curve (AUC) scores for models that included symptoms and were trained on augmented data ranged from approximately 0.756 to 0.816, while the logistic regression model trained on non-augmented data yielded an AUC of 0.767. The gradient boosting library LightGBM (LGBM) produced an AUC of 0.816. Because this model is computationally lightweight compared to ensembling all models, separate analyses were performed with it on CC-augmented training data split into training/testing/validation sets (80/10/10 ratio, respectively). LGBM AUC on the training set with repeated, stratified k-fold cross-validation (10 splits, 3 repeats) was 0.811 ± 0.007 (mean ± SD). AUC was approximately 0.819 on the test set and 0.814 on the validation set. When symptoms (fever, cough, myalgia, sore throat, chills, and shortness of breath) were not included as predictive variables, AUC on the training set with the same cross-validation approach was acceptable but comparatively poorer (0.735 ± 0.007). AUC on the test and validation sets was 0.734 and 0.727, respectively (Table 2). When symptoms were included as predictors of infection risk, cough and fever were the two most important predictors (Fig. 2A).
Being a member of the Hispanic or Latino community, living in the Washington-Montana or Southern California regions, being non-English-speaking (and especially Spanish-speaking), polypharmacy, and having shortness of breath had comparable influences on the risk of a positive COVID-19 test (SHAP scores 0.10-0.30). All of these features except polypharmacy were directly associated with risk of infection from COVID-19, while polypharmacy, comorbidity, higher income, and tobacco or alcohol use were inversely associated with risk of infection (Fig. 2B). Because symptom information may not always be available for risk assessments of the population at large, a second model was developed to assess the importance of static population factors. When symptoms were removed from the predictive model, being of Hispanic/Latino ethnicity became the most important predictor of COVID-19 infection in this patient population (Fig. 3A). Other risk factors, with at least two-fold lower SHAP scores, included speaking Spanish, being from Montana or a region with housing instability, identifying with an "other" race category, using tobacco, being male, being Christian, and having an "other" BMI. Tobacco use, comorbidity, polypharmacy, an "other" BMI category, income level, and illicit drug use were inversely associated with risk of infection, while the other features were positively associated with this risk (Fig. 3B). Although COVID-19 vaccines are now widely available, predicting the risk of COVID-19 infection remains critical. Unvaccinated populations and new variants of COVID-19 present an ongoing threat to disease control worldwide, and risk prediction is still needed to (1) assist clinicians and care managers in patient education, (2) guide policy, and (3) allocate resources to the highest-risk areas and populations. Our findings indicate that, as expected, fever and cough were the strongest predictors of infection.
This validates public guidance to quarantine based on symptoms alone. However, when we removed symptoms from the model to assess static (i.e., not symptom-based) features alone, the following groups in the western USA emerged with the highest risk for infection: Hispanic and Latino people, individuals in the "other" race category, non-English-speaking people (particularly Spanish-speaking people), people living in areas with housing insecurity, and people from the Washington-Montana region. Compared to previous similar projects, the advantages of the current analysis are the size and geographical spread of the dataset and the machine learning technique, which allows the results to be updated in nearly real time. We intend to update these results as the pandemic continues. Immediate recommendations based on the results of this project are as follows. Culturally literate and language-appropriate resources are needed to combat surging infection rates in Hispanic, Latino, and non-English-speaking populations in the western USA. Partnering with communities to assure broad availability of information and access to services is critical to reducing disproportionate burden, and such partnership may increase trust in the information that is provided. Clinicians should be aware that individuals from these populations may be at higher risk and should conduct assessments and provide education accordingly. For example, clinicians may ask their patients whether they have access to masks and cleaning/disinfection supplies, or whether they need assistance accessing vaccine appointment registration systems. Individuals who are not at high risk themselves but have frequent contact with high-risk groups may require more frequent or intense training on infection control precautions.

[Table 1 legend: Characteristics of the patient population included in this analysis. "% of total" is the percentage of the total N (316,599); "in-group %" is the percentage of the total tested people for each row.]
Finally, public efforts to combat the spread of COVID-19 must address issues such as access, the physical proximity of vaccine clinics to high-infection-risk populations, and pro-active program development for non-English-speaking groups.

[Fig. 3 legend, panel B: The top 20 demographic predictors of COVID-19, without symptoms, shown in descending order. All other computational and graphic elements (use of dots, color coding, association strength shown on the horizontal axis) are identical to those used in Fig. 2A and B.]

We have previously published modeling work on this topic [2]. The previous model employed logistic regression (LR) and achieved an acceptable AUC of 0.78 on the validation set. It is important to note that the features selected as strong predictors can differ across machine and statistical learning approaches. This can be due to factors such as, but not limited to, the penalization or regularization methods used to reduce overfitting. Other factors include how decision-tree-based models like LGBM estimate the information gained from all possible splits (using predictor values), different hyperparameters (e.g., tree sizes, number of subsamples, learning rate), etc. Nevertheless, we computed a comparative logistic regression model and report its output (see Supplementary Table 2). Variables with P < 0.25 were considered for the final model, consistent with the previous model [2]. This model was trained on 75% of the data and validated on the remaining 25%. AUC on the validation data was 0.80, slightly outperforming the previous logistic regression model (AUC = 0.78).
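A comparative baseline of this kind can be sketched as follows (a minimal illustration on synthetic data; the study's actual features, preprocessing, and P < 0.25 variable selection are described above and are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the EMR feature matrix (illustration only)
rng = np.random.default_rng(0)
X = rng.random((5_000, 10))
# Outcome driven by the first two columns plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(5_000) > 1.0).astype(int)

# 75/25 train/validation split, mirroring the comparative LR design
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_va, lr.predict_proba(X_va)[:, 1])
```

Evaluating AUC on a held-out validation split, as above, is what makes the 0.78 vs. 0.80 comparison between the previous and new LR models meaningful.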
Results from the LR and LGBM models are consistent with the previous model with respect to symptoms (cough, fever, shortness of breath, and myalgia), Hispanic or Latino racial/ethnic group, non-English language (specifically Spanish), housing insecurity, age 18 to 29, and the Washington-Montana and Southern California regions being more predictive of, or "associated" with, a positive COVID-19 test result. Likewise, a history of tobacco use and higher numbers of prescription drugs and chronic conditions were more associated with a negative COVID-19 test, also consistent with the previous model (see Supplementary Table 2). The new LGBM model was notably different from the previous LR model regarding age: being between ages 18 and 29 had a relatively small impact on the prediction of a positive test. The comparative new LR model is consistent with the previous model in that adults 40 and older have greater adjusted odds of contracting COVID-19 when compared to younger patients (reference group: ages 17 or younger in this LR model vs. 18 to 29 in the previous model). We also observed differences in the impact of existing comorbidities (e.g., diagnoses of diabetes, HIV/AIDS, dementia, and kidney disease) across models. The LGBM and the comparative new LR models do indicate some impact of an existing diagnosis of diabetes or dementia on an increased probability of COVID-19 infection, consistent with the previous model. Also consistent with the previous model, the LR model shows some impact of a history of kidney disease (OR 1.70; 95% CI 1.07-2.72, p = 0.026) on COVID-19 risk. Neither new model, unlike the previous model, indicates that being immunocompromised (HIV/AIDS diagnosis) increases an individual's risk of an initial infection. Notwithstanding, we suspect that comorbidities will be significant predictors of severe illness or mortality after a COVID-19 infection.
These results differ from our previous results from the early period of the pandemic [2]. The present results did not confirm that older, immunocompromised, or Black people were at significantly greater risk of COVID-19 infection in this study population. This difference may reflect the change in technique from traditional logistic regression to a machine learning algorithm. The previous LR model was fitted on data available early in the pandemic, between February 28, 2020, and April 27, 2020, roughly one-tenth the amount of the current data. The more sophisticated technique may have elucidated underlying factors that were not immediately apparent with logistic regression, because it focused on predictive performance rather than traditional inference about individual variables and strict cut-off thresholds based on statistical significance. It is also possible that these groups are genuinely at higher risk but became underrepresented and under-counted in the larger dataset, and thus their risk levels may have been underestimated. An additional explanation for the shifting results is the expansion of the window of time over which results were counted. The previous work examined data from February to April of 2020 [2], while the present work extended the data to October of 2020, encompassing the second and early third "waves" of cases occurring between mid-June and October. During this later period, state and local public health departments instituted substantially more stringent transmission-reduction strategies, including tight restrictions on public gatherings, remote school and work, universal masking requirements in public spaces, and "stay-at-home" policies. Thus, we may have captured real changes in population risk as the pandemic progressed. This may underlie the finding that young people between 18 and 29 were at higher risk, while older people were no longer at higher risk.
As the pandemic progressed, older individuals may have been more compliant with stringent quarantine and isolation precautions due to well-publicized fears of mortality, while younger individuals were perhaps less cautious and thus continued to become infected. We suspect that differences in results from the current data reflect varied shifts in phased stay-at-home policies across the regions Providence serves over time. Comparing results from both models is nevertheless encouraging, as the new model demonstrates an ability to discriminate on new data that is as stable and strong as the previous model's. In the present study, we developed two predictive models, one including and one excluding symptoms, for different purposes. Modeling risk of infection without symptoms was done to evaluate static risk for populations in the western USA. The intention of this step was to aid planning for disease control and prevention within the Providence St. Joseph Health system. In response to this model, Providence St. Joseph Health tailored the selection of sites for COVID testing and vaccination as well as engagement with community organizations. We recommend that other large health systems implement models of this kind to understand underlying risk factors in their patient populations and target infection control responses accordingly. There are a number of strengths to this study. We used advanced analytic procedures and tested a variety of models seeking the optimal solution. We had a very large data set (316,599 participants) collected across a single hospital system, which gave us the statistical power to examine many possible influences on risk of infection simultaneously. The use of a single hospital system ensures that data collection, variable coding, and data extraction were done in a consistent manner, in contrast to meta-analyses and reviews, which are forced to merge data sets that can have real methodological differences.
Our list of examined variables is long and comprehensive, including age, gender, education, employment, race, ethnicity, religious affiliation, relationship status, language, BMI, chronic illness conditions, drug use, COVID-19 symptoms, geographic region, and living environment. Ours may be the only paper to date to have examined all of these variables, in a single hospital system, with more than 300,000 participants. There are several limitations to this study. First, models were trained on data that would be available to an outpatient clinician (patient medical history, sociodemographic data, self-reportable symptoms, and environmental data). While this was intentional, in order to make the model generalizable to various clinical settings, laboratory values such as white blood cell counts (lymphocyte, eosinophil, basophil, and neutrophil values) [12] may have improved performance of the model that included symptoms. Second, the data collection period (February-October 2020) spanned a period of rapidly evolving public health guidelines, which may have influenced some of the findings. For example, the finding that older age was not predictive of a higher risk of COVID-19 infection may reflect greater caution and compliance with stay-at-home orders among older populations. Third, the study did not include the largest part of the third wave, from October 2020 to March 2021; consequently, we intend to update these findings using the same machine learning method as the pandemic continues to progress. Fourth, we suggest that the population-level characteristics spotlighted by this model (e.g., race, ethnicity, language) are not inherent predictors of risk, but rather proxy indicators for living conditions (housing density and the ability to socially isolate) and social structures, such as systemic racism in healthcare and public policy.
Our results confirm that the following social and demographic factors increased the risk of COVID-19 infection between February and October of 2020: being Hispanic or Latino, being non-English-speaking (and especially Spanish-speaking), residing in an area with housing insecurity, or being from the Washington-Montana region. These findings confirm that social determinants of health were major drivers of infection risk in the late part of the pre-vaccine US COVID-19 pandemic. Language-appropriate and community-based education is needed to mitigate the effects of social factors on infection risk. Additionally, providers should focus education efforts on patients who fall into high-risk categories or are frequently in contact with individuals from high-risk categories.

References
[1] What's your risk of catching COVID? These tools help you to find out
[2] A model of disparities: risk factors associated with COVID-19 infection
[3] Risk factors of critical & mortal COVID-19 cases: a systematic literature review and meta-analysis
[4] Risk factors for Covid-19 severity and fatality: a structured literature review
[5] Role of machine learning techniques to tackle the COVID-19 crisis: systematic review
[6] A systematic review on AI/ML approaches against COVID-19 outbreak
[7] Determinants of COVID-19 vaccine acceptance in the US
[8] COVID-19 vaccination hesitancy in the United States: a rapid national assessment
[9] SMOTE for imbalanced classification with Python. Machine Learning Mastery
[10] Welcome to the SHAP documentation. SHAP latest documentation
[11] SHAP (SHapley Additive exPlanations). Interpretable Machine Learning
[12] Pre-test probability for SARS-CoV-2-related infection score: the PARIS score

The authors wish to thank Uma Kodali Bhavani and Morgan Goodwin for their incredible efforts providing the data that was critical to this work. The authors declare no competing interests.