key: cord-0892667-vhi5tzkr
authors: Huang, Y.; Radenkovic, D.; Perez, K.; Nadeau, K.; Verdin, E.; Furman, D.
title: Age-dependent and Independent Symptoms and Comorbidities Predictive of COVID-19 Hospitalization
date: 2020-08-16
journal: nan
DOI: 10.1101/2020.08.14.20170365
sha: d9b0f126debfecef639a643f55177375930f0ad3
doc_id: 892667
cord_uid: vhi5tzkr

The coronavirus disease 2019 (COVID-19) pandemic, caused by Severe Acute Respiratory Syndrome (SARS)-CoV-2, continues to burden medical institutions around the world by increasing total hospitalizations and Intensive Care Unit (ICU) admissions. A better understanding of symptoms, comorbidities and medications used for pre-existing conditions in patients with COVID-19 could help healthcare workers identify patients at increased risk of developing more severe disease. Here, we used self-reported data (symptoms, medications and comorbidities) from more than 3 million users of the COVID-19 Symptom Tracker app [12] to identify previously reported and novel features predictive of patients being admitted to a hospital setting. Despite the previously reported association between age and more severe disease phenotypes, we found that a patient's age, sex and ethnic group were minimally predictive when compared to the patient's symptoms and comorbidities. The most important variables selected by our predictive algorithm were fever, the use of immunosuppressant medication, mobility aid, shortness of breath and fatigue. It is anticipated that early administration of preventative measures in COVID-19 positive patients (COVID+) who exhibit a high-risk-of-hospitalization signature may prevent severe disease progression.

The features we used in all subsequent models are listed in Table 1. All features were binary except for age and BMI, which were continuous, and shortness of breath (SOB), fatigue, race, and gender, which were categorical. For the study cohort, we extracted all users who tested positive for COVID-19 (n = 10,948).
Of those COVID+ users, some cases were severe enough to require a hospital visit, while others managed their disease at home (Fig. S1). We used comorbidities, demographics, and symptoms to predict patients' admission to a hospital setting. To do so, we first divided the COVID+ patients into two groups: (A) negative for hospitalization, including COVID+ patients who stayed strictly at home without ever being admitted to a hospital setting (n = 10,413), and (B) positive for hospitalization, including COVID+ users who reported being admitted to the hospital (n = 535). The average age of group A was 40.2 (standard deviation: 13.6) compared to 47.8 (standard deviation: 18.8) for group B. For group A, we used comorbidities, demographics, and symptoms recorded in the patient's last entry, and for group B, we used features recorded one entry prior to the entry in which the patient indicated admission to a hospital setting (scenario 1) (see Methods). We also analyzed the data considering whether a patient ever reported a given symptom, along with comorbidities, demographics, and pre-existing medications (scenario 2), with results similar to those of scenario 1.

Table 1: Features of symptoms, medication history, comorbidities, and demographics investigated in relation to whether a user was admitted to a hospital setting. All features were binary except for age and BMI, which were continuous, and shortness of breath, fatigue, race, and gender, which were categorical. For each feature, NA indicates not available/missing data.

We performed an Elastic Net regularized regression to analyze the predictive performance of the features and used LASSO regularization to select the most important features for predicting a patient's admission to a hospital setting. The dataset was divided into training and test sets (ratio: 70:30).
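As a rough illustration of the 70:30 division mentioned above, the following sketch splits a cohort into training and test indices. It is not the authors' code: the function name `stratified_split`, the `test_percent` parameter, and the choice to stratify by outcome are assumptions made here for illustration (the Methods describe sampling by missingness pattern instead), and the toy labels only mirror the reported group sizes.

```python
import random


def stratified_split(labels, test_percent=30, seed=0):
    """Split indices into train/test, handling each outcome class separately.

    Stratifying by outcome is an assumption of this sketch; it keeps the rare
    hospitalized class represented in both sets.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_test = len(idx) * test_percent // 100
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the reported group sizes: 10,413 at home (0), 535 hospitalized (1).
labels = [0] * 10413 + [1] * 535
train_idx, test_idx = stratified_split(labels)
```

With these toy labels, both sets keep roughly the same share of hospitalized users as the full cohort, which matters when the positive class is this small.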
Since patients often neglect to report all available fields, we used multiple imputation to account for missing values, a standard procedure that predicts missing data using all other features (besides the outcome) that are not missing [23-25]. Since the number of patients in group A was considerably larger than in group B (class imbalance), both undersampling of the majority cases and oversampling of the minority cases were used to achieve a balanced training set (see Methods). Using cross-validation on the training set, we tuned the parameters of the Elastic Net regression to obtain the best predictive performance with the most parsimonious number of features. We were able to predict patient hospitalization with relatively good accuracy: the cross-validated area under the receiver operating characteristic curve (cvAUC) for the training set at the optimal parameters was 0.77 (Fig. 1A). Using the features selected by this analysis (Fig. 1B) to predict hospitalization on the test set, we obtained a similar accuracy (cvAUC = 0.78) (Fig. 1C). The most important variables of this signature selected by our predictive algorithm were fever, the use of immunosuppressant medication, mobility aid, shortness of breath and fatigue. Age had a relatively small regression coefficient, indicating that pre-existing clinical conditions and symptom presentation are much stronger predictors of hospitalization.

This version was posted August 16, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. https://doi.org/10.1101/2020.08.14.20170365

Unexpectedly, body mass index (BMI) was not selected as a significant predictor. Finally, female gender was negatively associated with hospitalization.
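The combined undersampling and oversampling used to balance the training set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `balance_classes` and the target size `n_per_class` are hypothetical, since the text only states that resampling continued until both classes were equal in number.

```python
import random


def balance_classes(rows, labels, n_per_class, seed=0):
    """Resample so that every class contributes exactly n_per_class rows.

    Classes larger than the target are undersampled without replacement;
    smaller classes keep all rows and are topped up by sampling with
    replacement (oversampling).
    """
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        if len(members) >= n_per_class:
            sample = rng.sample(members, n_per_class)
        else:
            extra = rng.choices(members, k=n_per_class - len(members))
            sample = members + extra
        out_rows.extend(sample)
        out_labels.extend([cls] * n_per_class)
    return out_rows, out_labels


# Toy data: 50 positive rows (ids 0-49) and 950 negative rows (ids 50-999).
rows = list(range(1000))
labels = [1 if i < 50 else 0 for i in rows]
bal_rows, bal_labels = balance_classes(rows, labels, n_per_class=200)
```

Resampling should be applied to the training set only, so that the test set keeps the original class proportions.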
We used Elastic Net regression in which the outcome (being admitted to a hospital setting or not) was regressed on the features in Table 1. We next estimated the odds ratio for each feature from a logistic regression in which the outcome (being admitted to a hospital setting) was regressed onto all features (Fig. S4). The most important features are consistent with the Elastic Net results. Elastic Net regression was also applied to scenario 2. The prediction performance is comparable to that of scenario 1, and the selected features were also very similar (Fig. S3). Logistic regression and Elastic Net regression under scenarios 1 and 2 all selected similar features predictive of the outcome, lending robustness to the results. To better understand the effect of age, given its small contribution to predicting the outcome, we analyzed the association between age and the other selected features. We divided all the COVID+ users into three age groups: young, middle age, and old. Univariate logistic regression, in which the outcome of being admitted to a hospital setting was regressed onto each feature selected by the Elastic Net model, shows that the coefficients of the features do not vary substantially between age groups (Fig. S7). These results suggest that the features' association with the outcome does not depend on age. To better understand the fluctuations in the symptoms selected by the Elastic Net model, we then analyzed the eight symptoms longitudinally. We examined a window of 20 days before the patient goes to the hospital (for positive cases), and 20
days before the last entry (for negative cases) (Fig. 2). For each day, we estimated the frequency of each symptom for the positive and negative groups. Day 0 for the positive group corresponds to the day when the patient was admitted to a hospital setting, and day 0 for the negative group corresponds to the patient's last entry.

Fig. [...] related symptoms in users who were admitted to a hospital setting.

SARS-CoV-2 has been shown to cause more severe disease in older adults [26]. Even though age was not a major contributor to the prediction of COVID-19-related hospitalization, we explored whether age was associated with other features selected by the model. In addition, we examined other demographic variables, such as race, BMI and gender. We fitted multivariate logistic regression models in which each of the features selected by the Elastic Net model was regressed on the demographic variables analyzed (Fig. 3). Age was associated with 10 of the 13 predictive features (P < 0.01). The most age-correlated features were mobility aid, limited activity, blood pressure medication and immunosuppressant medication use. This indicates that age-related phenotypes in this cohort are associated with hospitalization due to COVID-19.
This emphasizes that, regardless of age, any population that exhibits the features selected by our model could be susceptible to a more severe form of COVID-19. Understanding what makes some young individuals biologically older than their chronological age, so that they exhibit features generally associated with the older population, could help identify susceptible young populations. In addition to age, being of black ethnicity was associated with a number of features selected by the Elastic Net, such as a high frequency of delirium, limited activity, and blood pressure medication usage. However, whether this is associated with socioeconomic status or an innate biological difference in people of African descent needs further investigation. The gender feature was a predictor of hospitalization (Fig. 1C) but was not significantly correlated with any of the predictive features, suggesting that the sex of an individual affects other aspects of disease severity not evaluated in this study.

Figure 3: Multivariate logistic regression in which each feature selected by Elastic Net regression is regressed onto demographic information such as age, BMI, gender, and race. Coefficients are plotted in a heatmap. Only statistically significant associations are plotted. Age has a significant but weak association with many selected features. Users identified as being of black ethnicity in the UK have many positive associations with high coefficients.
A relatively small effect of loss of smell, associated with mild disease outcomes (Fig. 1C), has also been reported in recent studies [27, 28]. However, we also show that this feature is associated with age and race. Female gender also had a negative correlation with hospitalization, consistent with recent findings in large populations [29, 30]. In our data, gender did not correlate strongly with any features, indicating that there may be factors other than comorbidities or symptoms that give females a better prognosis. Underlying immunological differences in females [31-34] could lead to a better immune response that neutralizes the virus more efficiently than in men. Besides the need for additional research into the mechanisms behind some of the features associated with a more severe disease state, time is an important variable that is not explored in depth in this paper. A Cox survival analysis would be informative; however, the start time of each user is inconsistent, and thus the application of Cox survival analysis is inappropriate. Some users' first entries already indicate testing positive for COVID-19 with symptoms, suggesting that they are already in the midst of the disease course, while others slowly develop symptoms and test positive for COVID-19 later in time. Age has been shown to be important in the severity of COVID-19 [13]. In our results, age shows a slight positive correlation with being admitted to a hospital setting.
The difference between the average age of those who were admitted to a hospital and those who were not was relatively small, consistent with age not being a strong predictor. It is possible that the older population was less likely to use a smartphone app, leading to under-representation of the sick older population. The fact that age-associated variables outperform age in the prediction of patient hospitalization indicates that biological age or immunological age [36, 37] could be appropriate measures for assessing an individual's prognosis. In conclusion, we identify age-dependent and age-independent sets of symptoms and comorbidities predictive of COVID-19 patient hospitalization. Our analyses show features that predict disease severity in advance, which can be used to flag potentially severe cases of COVID-19 even in younger individuals who may not be labeled as high risk. The continued rise in the number of cases, as societies struggle to balance reopening the economy and 'flattening the curve', places an enormous burden on healthcare systems around the world. Knowing the signs of possible severe cases, like the ones derived in this study, could help healthcare systems devote resources to intervening in potentially severe cases before they become costly to manage.

Of all the users who signed up for the Tracker app, we extracted all users who indicated testing positive for COVID-19 from March 24, 2020 to June 23, 2020 (Fig. S1). United States users were excluded from the study to maintain homogeneity of the study cohort, reducing potential noise. Users who did not enter values for more than 90% of variables were excluded.
It is extremely difficult to impute the missing values for such users or to derive any meaningful analysis from them. For the study cohort, the outcome (dependent variable) of interest is whether or not a user of the Tracker app was admitted to a hospital setting in any capacity. Since users can enter their symptoms every day, there are many time points we can use as features. In what we call scenario 1, for users who were admitted to a hospital setting, we used the time point right before the user indicated being in the hospital, and the features at that time point, for analysis. For users who were always at home, we used the last time point and the features at that time point for analysis (Fig. S2A). In what we call scenario 2, for users who were admitted to a hospital setting, if a user indicated having a feature in any entry before the day of admission to a hospital setting, we labeled that feature as positive for that user. For users who were always at home, if they reported a feature anywhere in their entry log, we labeled that feature as positive for that user (Fig. S2B). These methods apply only to symptoms, since they can change every day, and not to comorbidities, pre-existing medication use, or demographics. Multiple imputation was used to impute missing values. Instead of imputing each missing value with a single value, multiple imputation repeatedly samples the data n times and imputes the missing values n times, using different methods for different data types.
We used predictive mean matching for numerical data (age, BMI), polytomous regression for unordered categorical data (gender, race), a proportional odds model for ordered categorical data (fatigue, SOB), and logistic regression for binary data (all other features). The predictors for the logistic regression were all other independent variables, while the outcome was the missing variable. The most stringent process would impute only the training set, but there are not enough complete instances to include both positive and negative cases; therefore, we imputed the training and test sets together. To account for bias, when creating the test set we assessed the pattern of missingness and sampled each pattern so that the test set is representative of all missingness patterns. Multiple imputation produces n imputations, and we pooled the n imputed matrices together to form a larger training set. Some variables had a large percentage of missing values, as seen in Fig. S3B. A comparison of the imputed distributions with the original distribution indicates that, for some variables, the distribution varies widely from one imputation to another and differs too much from the original distribution (Fig. S3). Therefore, those variables were removed from the datasets. Class imbalance is an issue because users who specified that they were in the hospital account for 1.5% of the total entries. To balance the training set so that Elastic Net regularization is not biased toward negative cases, we oversampled the positive cases and undersampled the negative cases until the numbers of positive and negative cases were equal. Two parameters can be tuned in Elastic Net, alpha and lambda.
Alpha is the mixing parameter indicating how much lasso regularization and ridge regularization should contribute to the model. Lambda is the amount of shrinkage or regularization the model should apply as a whole. A series of alpha values is used in each cross-validation of lambda. The alpha that produces the highest AUROC at the minimum lambda is chosen. For scenario 1, the alpha is 0.1. Two lambdas are commonly used: the lambda that gives the best performance (lambda.min), or the lambda that selects the fewest features while remaining within one standard error of the best-performing lambda (lambda.1se). We used lambda.1se because it yields a more generalizable model, avoiding overfitting and selecting the most salient variables.

The likelihood ratio test was used to assess whether there are statistically significant differences between the slopes of positive and negative cases in the trajectory analysis. Linear regression was used to quantify the association between days and the frequency of each selected symptom in positive and negative cases.

Only users who tested positive for COVID-19 were included. Users with too many predictors missing were excluded.
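The roles of alpha and lambda, and the difference between lambda.min and lambda.1se described in the Elastic Net tuning paragraph above, can be sketched in a few lines. This is a hedged illustration rather than the authors' pipeline: it assumes the glmnet-style parameterization of the penalty, the helper names `elastic_net_penalty` and `pick_lambdas` are hypothetical, and the cross-validation summaries below are made-up numbers, not results from the study.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Penalty lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).

    alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
    (glmnet-style parameterization; assumed, not stated in the text.)
    """
    l1 = sum(abs(b) for b in coefs)
    l2_sq = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)


def pick_lambdas(lambdas, mean_auc, se_auc):
    """Return (lambda_min, lambda_1se) from cross-validated AUC summaries.

    lambda_min maximizes the mean AUC. lambda_1se is the largest lambda
    (strongest shrinkage, typically fewest features) whose mean AUC is still
    within one standard error of the best mean AUC.
    """
    best = max(range(len(lambdas)), key=lambda i: mean_auc[i])
    threshold = mean_auc[best] - se_auc[best]
    eligible = [lam for lam, auc in zip(lambdas, mean_auc) if auc >= threshold]
    return lambdas[best], max(eligible)


# Illustrative (made-up) cross-validation summaries for a fixed alpha.
lambdas = [0.001, 0.01, 0.05, 0.1, 0.5]
mean_auc = [0.765, 0.772, 0.768, 0.755, 0.640]
se_auc = [0.010, 0.009, 0.010, 0.011, 0.015]
lambda_min, lambda_1se = pick_lambdas(lambdas, mean_auc, se_auc)
```

In this toy example the two rules disagree: lambda.min picks the best-scoring value, while lambda.1se accepts a slightly lower AUC in exchange for stronger shrinkage.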
(A) For users who were admitted to a hospital setting, we used the time point right before the user indicated being in the hospital, and the features at that time point, for analysis. For users who were always at home, we used the last time point and the features at that time point for analysis. (B) For users who were admitted to a hospital setting, if a user indicated having a feature in any entry before the day of admission to a hospital setting, we labeled that feature as positive for that user. For users who were always at home, if they reported a feature anywhere in their entry log, we labeled that feature as positive for that user. Such methods apply only to symptoms, since they can change every day, and not to comorbidities, pre-existing medication use, or demographics.

Figure 4: Estimated odds ratios for each potential risk factor from a logistic regression model. Error bars represent the 95% confidence interval for the odds ratio. All odds ratios are adjusted for all other factors listed. Only significant features are shown.

Supplementary Figure 5. Univariate logistic regression of young, middle age, and old age groups.
All the COVID+ users were divided into three groups: young, middle age, and old age. The number of positive cases (admitted to a hospital setting) and the number of negative cases (stayed home) are shown. The outcome of whether a user was admitted to the hospital was regressed onto each of the features selected by the Elastic Net regression. The coefficients for each feature for each age group are plotted. Only significant ones are colored. The three groups have similar patterns of expression in the features selected.

Figure 6: Results of Elastic Net regression using scenario 2. In scenario 2, for each feature, if a user indicated having that feature in any entry, we labeled that feature as positive for that user.

Supplementary Figure 7. Likelihood ratio test between positive and negative groups. A 20-day window was examined for positive and negative cases. For each day, the frequency of users having the feature in the positive and negative groups is plotted. Linear regression was fitted with the frequency regressed on the number of days before the last day. Slopes and intercepts were obtained, and the likelihood ratio test was used to evaluate whether the slopes were statistically different. A P-value < 0.05 indicates that the positive and negative groups have statistically different slopes.
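One way to compare the two slopes with a likelihood ratio test is sketched below. The text does not specify the exact pair of nested models, so this sketch makes its own assumptions: Gaussian errors, a reduced model with separate intercepts but a common slope, a full model with separate slopes, and a chi-square reference distribution with one degree of freedom. The function names and the toy frequencies are hypothetical.

```python
import math


def _group_stats(x, y):
    """Return n and the centered sums of squares/products for one group."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return n, sxx, sxy, syy


def slope_lrt(x_pos, y_pos, x_neg, y_neg):
    """Likelihood ratio test of 'separate slopes' vs 'common slope'.

    Returns (slope_pos, slope_neg, statistic, p_value).
    """
    n1, sxx1, sxy1, syy1 = _group_stats(x_pos, y_pos)
    n2, sxx2, sxy2, syy2 = _group_stats(x_neg, y_neg)
    b1, b2 = sxy1 / sxx1, sxy2 / sxx2
    # Full model: each group has its own least-squares line.
    rss_full = (syy1 - b1 * sxy1) + (syy2 - b2 * sxy2)
    # Reduced model: own intercepts, one pooled within-group slope.
    b_common = (sxy1 + sxy2) / (sxx1 + sxx2)
    rss_reduced = (syy1 + syy2) - b_common * (sxy1 + sxy2)
    # Gaussian log-likelihood ratio; clamp tiny negative rounding errors.
    stat = max(0.0, (n1 + n2) * math.log(rss_reduced / rss_full))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return b1, b2, stat, p_value


# Toy symptom frequencies over days -20..0 (made-up, with deterministic noise).
days = list(range(-20, 1))
noise = [0.01 * ((i * 7) % 5 - 2) for i in range(len(days))]
freq_pos = [0.20 + 0.020 * (d + 20) + e for d, e in zip(days, noise)]
freq_neg = [0.20 + 0.001 * (d + 20) + e for d, e in zip(days, noise)]
b_pos, b_neg, stat, p_value = slope_lrt(days, freq_pos, days, freq_neg)
```

In the toy data the positive group's frequency rises much faster than the negative group's, so the test rejects a common slope; passing the same series for both groups gives a statistic near zero.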
Critically-ill COVID-19 patient
Outcomes from intensive care in patients with COVID-19: a systematic review and meta-analysis of observational studies
The Potential Health Care Costs And Resource Use Associated With COVID-19 In The United States. Health Aff
The COVID-19 pandemic in Brazil: analysis of supply and demand of hospital and ICU beds and mechanical ventilators under different scenarios
Locally Informed Simulation to Predict Hospital Capacity Needs During the COVID-19 Pandemic
Evaluation of the Anticipated Burden of COVID-19 on Hospital-Based Healthcare Services Across the United States
Hospital surge capacity in a tertiary emergency referral centre during the COVID-19 outbreak in Italy
Projecting hospital utilization during the COVID-19 outbreaks in the United States
Lower mortality of COVID-19 by early recognition and intervention: experience from Jiangsu Province
Key to successful treatment of COVID-19: accurate identification of severe risks and early intervention of disease progression
Rapid implementation of mobile technology for real-time epidemiology of COVID-19
Hospitalization Rates and Characteristics of Patients Hospitalized with Laboratory-Confirmed Coronavirus Disease 2019 - COVID-NET, 14 States
Age-dependent effects in the transmission and control of COVID-19 epidemics
Age Related Morbidity and Mortality among Patients with COVID-19
Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
Pathophysiology of COVID-19: Why Children Fare Better than Adults?
Updated understanding of the outbreak of 2019 novel coronavirus (2019-nCoV) in Wuhan
A case study of the Secure Anonymous Information Linkage (SAIL) Gateway: a privacy-protecting remote access system for health-related research and evaluation
The SAIL Databank: building a national architecture for e-health research and evaluation
The SAIL databank: linking multiple health and social care datasets
Protecting health data privacy while using residence-based environment and demographic data
mice: Multivariate Imputation by Chained Equations in R
Missing data and multiple imputation in clinical epidemiological research
Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls
COVID-19 and Older Adults: What We Know
Self-reported olfactory loss associates with outpatient clinical course in COVID-19
Real-time tracking of self-reported symptoms to predict potential COVID-19
Gender differences in patients with COVID-19: Focus on severity and mortality
Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the
Sex differences in immune responses in COVID-19
Considering how biological sex impacts immune responses and COVID-19 outcomes
Sex differences in immune responses to SARS-CoV-2 that underlie disease outcomes. medRxiv (2020)
Sexual dimorphism in immunity: improving our understanding of vaccine immune responses in men
The pathogenesis and treatment of the 'Cytokine Storm' in COVID-19
A clinically meaningful metric of immune age derived from high-dimensional longitudinal monitoring
An Inflammatory Clock Predicts Multi-morbidity, Immunosenescence and Cardiovascular Aging in Humans

This work uses data provided by participants of the COVID-19 Symptoms Study, developed by ZOE Global Limited with scientific and clinical input from King's College London. We would also like to acknowledge all data providers who made anonymised data available for research. We wish to acknowledge the collaborative partnership that enabled acquisition and access to the de-identified data, which led to this output.