key: cord-0904656-wa7zy5es authors: Dite, G. S.; Murphy, N. M.; Allman, R. title: An integrated clinical and genetic model for predicting risk of severe COVID-19 date: 2020-09-30 journal: nan DOI: 10.1101/2020.09.30.20204453 sha: 0fc9c1c05500f7a2908c1f8b78d9c2ed5fbf7b18 doc_id: 904656 cord_uid: wa7zy5es Background: Age and gender are often the only considerations in determining risk of severe COVID-19. There is an urgent need for accurate prediction of the risk of severe COVID-19 for use in workplaces and healthcare settings, and for individual risk management. Methods: Clinical risk factors and a panel of 64 single-nucleotide polymorphisms were identified from published data. We used logistic regression to develop a model for severe COVID-19 in 1,582 UK Biobank participants aged 50 years and over who tested positive for the SARS-CoV-2 virus: 1,018 with severe disease and 564 without severe disease. Model discrimination was assessed using the area under the receiver operating characteristic curve (AUC). Results: A model incorporating the SNP score and clinical risk factors (AUC=0.786) had 111% better discrimination of disease severity than a model with just age and gender (AUC=0.635). The effects of age and gender are attenuated by the other risk factors, suggesting that it is those risk factors -- not age and gender -- that confer risk of severe disease. In the whole UK Biobank, most are at low or only slightly elevated risk, but one-third are at two-fold or more increased risk. Conclusions: We have developed a model that enables accurate prediction of severe COVID-19. Continuing to rely on age and gender alone to determine risk of severe COVID-19 will unnecessarily classify healthy older people as being at high risk and will fail to accurately quantify the increased risk for younger people with comorbidities. The current COVID-19 pandemic is a dominating and urgent threat to public health and the global economy. 
While COVID-19 can be a mild disease in many individuals, with cough and fever the most commonly reported symptoms, up to 30% of those affected may require hospitalisation, and some will require intensive intervention for acute respiratory distress syndrome. 1, 2 Globally, public health responses have been aimed at limiting new cases by preventing community transmission through mask wearing, social distancing, curtailing non-essential services and broad travel restrictions. The economic and social impacts of these interventions have been devastating, with foundational damage to local economies 3 and unprecedented increases in mental health diagnoses being reported. 4 As the protracted strain of the pandemic increases pressure to re-open economies, there is an urgent need for tests to predict an individual's risk of severe COVID-19. In the community, a risk prediction test could enable workplaces to confidently manage employees who are at increased risk of severe disease and should work from home or avoid client-facing roles. In the healthcare setting, a risk prediction test could inform patient triage when hospital resources are limited and be useful in prioritising pathology tests and vaccination (when one becomes available). On a personal level, knowledge of individual risk can empower individuals to make informed choices about day-to-day activities. Age, gender and comorbidities are frequently cited as risk factors for severe COVID-19, 5 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. . https://doi.org/10.1101/2020.09.30.20204453 doi: medRxiv preprint 4 now appeared, including an analysis of a cohort of 17 million people by Williamson et al. 6 and a prospective cohort study of 5,279 people in New York, 7 both based on the analysis of electronic health records. 
The analysis of human genetic variation that may affect response to viral infection has been slower, largely due to the lack of available data. Nevertheless, the COVID-19 Host Genetics Initiative has undertaken meta-analyses of the genetic determinants of COVID-19 severity and has made the summary statistics publicly available. 8, 9 In addition, Ellinghaus et al. 10 identified two loci (3p21.31 and 9q34.2) that are strongly associated with severe disease. We used the UK Biobank to develop a comprehensive model to predict risk of severe COVID-19 by integrating demographic information, comorbidity risk factors and a panel of genetic markers.

The UK Biobank is a population-based prospective cohort of over 500,000 participants from England, Wales and Scotland who were aged 40 to 69 years when recruited from 2006 to 2010. 11 The UK Biobank has extensive genotyping 12 and phenotypic data obtained from baseline assessment and from linkage to hospital and primary care databases and to cancer and death registries. 11 In response to the COVID-19 pandemic, the UK Biobank made available up-to-date SARS-CoV-2 testing, hospital, primary care and death data for use in COVID-19 research by approved researchers. 13 We extracted testing and hospital records from the UK Biobank COVID-19 data portal on 15 September 2020. We extracted single-nucleotide polymorphism (SNP) and baseline assessment data from files previously downloaded as part of our approved project. At the time of data extraction, primary care data were available for just over half of the identified participants and were therefore not used in these analyses.
Eligible participants were those who had tested positive for SARS-CoV-2 and for whom SNP genotyping data and linked hospital records were available. Of the 18,221 participants with SARS-CoV-2 test results, 1,713 had tested positive and 1,582 of those had both SNP and hospital data available. We used source of test result as a proxy for severity of disease: outpatient representing non-severe disease and inpatient representing severe disease. For participants with multiple test results, we considered the disease to be severe if at least one result came from an inpatient setting.

We identified 62 SNPs from the (release 2) results of the meta-analysis of non-hospitalised versus hospitalised cases of COVID-19 conducted by the COVID-19 Host Genetics Initiative consortium. 8, 9 We used P<0.0001 as the threshold for loci selection, and variants that were associated with hospitalisation in only one of the five studies in the meta-analysis were removed. We pruned for linkage disequilibrium using an r² threshold of 0.5 against the 1000 Genomes European populations (CEU, TSI, FIN, GBR and IBS) representing the ethnicities of the submitted populations. 14 Variants that had a minor allele frequency of ≥0.01 and beta coefficients from −1 to 1 were then retained. 15 Where possible, SNP variants were chosen over insertion-deletion variants to facilitate laboratory validation testing. We also included the two lead SNPs from the loci found by Ellinghaus et al. 10 that reached genome-wide significance. Therefore, we used a panel of 64 SNPs for severe COVID-19 in this study (Table S1, Supplementary Appendix).
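The variant-level selection rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` record, its field names and the example values are hypothetical, the bounds on the beta coefficient are assumed to be inclusive, and the linkage disequilibrium pruning step is omitted because it needs a reference panel.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical summary-statistic record for one variant."""
    rsid: str
    p_value: float   # meta-analysis P value
    n_studies: int   # contributing studies in which the variant was associated
    maf: float       # minor allele frequency
    beta: float      # log odds ratio from the meta-analysis


def passes_filters(v: Variant) -> bool:
    """Apply the thresholds quoted in the text (LD pruning not included)."""
    return (
        v.p_value < 1e-4          # P < 0.0001 for loci selection
        and v.n_studies >= 2      # drop variants supported by only one study
        and v.maf >= 0.01         # minor allele frequency of at least 0.01
        and -1.0 <= v.beta <= 1.0 # beta from -1 to 1 (assumed inclusive)
    )


# Made-up records, purely to exercise the filter.
candidates = [
    Variant("rs_example_1", 5e-5, 3, 0.12, 0.40),   # kept
    Variant("rs_example_2", 5e-5, 1, 0.12, 0.40),   # single study
    Variant("rs_example_3", 2e-3, 4, 0.30, 0.10),   # P value too large
    Variant("rs_example_4", 1e-6, 2, 0.005, 0.20),  # too rare
    Variant("rs_example_5", 1e-6, 2, 0.05, 1.50),   # beta out of range
]
selected = [v.rsid for v in candidates if passes_filters(v)]
print(selected)
```

Only the first made-up record satisfies all four conditions; each of the others fails exactly one.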
While we would normally construct a SNP relative risk score using published data to calculate population-averaged risk values for each SNP and then multiply the risks for each SNP, 16 the size of the odds ratios for the 64 SNPs meant that this approach could result in relative risks of several orders of magnitude. Therefore, for this study, we calculated the percentage of risk alleles present in the genotyped SNPs for each participant. We used the percentage rather than a count because some of the eligible participants had missing data for some SNPs (9% had all SNPs genotyped, 82% were missing 1-5 SNPs and 9% were missing 6-15 SNPs). Blood type was imputed for genotyped UK Biobank participants using three SNPs (rs505922, rs8176719 and rs8176746) in the ABO gene on chromosome 9q34.2. A rs8176719 deletion (or for those with no result for rs8176719, a T allele at rs505922) indicated haplotype O. At rs8176746, haplotype A was indicated by the presence of the G allele and haplotype B was indicated by the presence of the T allele. 17, 18

Risk factors for severe COVID-19 were identified from large epidemiological studies of electronic health records 6, 7 and advice posted on the Centers for Disease Control and Prevention website. 19 Rare monogenic diseases (thalassemia, cystic fibrosis and sickle cell disease) were not included in these analyses. Age was classified as 50-59 years, 60-69 years and 70+ years. This was based on the participants' approximate age at the peak of the first wave of infections (April 2020) and was calculated using the participants' month and year of birth. Self-reported ethnicity was classified as white and other (including unknown).
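As a rough illustration of the SNP score described above, the following Python sketch computes the percentage of risk alleles among the genotyped SNPs for one participant. The function name and input encoding are hypothetical, and the denominator (two alleles per genotyped SNP) is an assumption, since the text does not spell out the formula.

```python
from typing import Optional, Sequence


def snp_score_percent(risk_allele_counts: Sequence[Optional[int]]) -> Optional[float]:
    """Percentage of risk alleles among genotyped SNPs.

    Each entry is the number of risk alleles carried at one SNP (0, 1 or 2),
    or None when that SNP was not genotyped. Missing SNPs are left out of
    both the numerator and the denominator, which is why a percentage is
    comparable across participants with different amounts of missing data.
    """
    genotyped = [c for c in risk_allele_counts if c is not None]
    if not genotyped:
        return None  # no usable genotypes for this participant
    return 100.0 * sum(genotyped) / (2 * len(genotyped))


# Four SNPs, one missing: 3 risk alleles out of 6 genotyped alleles.
example_score = snp_score_percent([2, 1, 0, None])
print(example_score)  # 50.0
```

A raw count would penalise participants with missing SNPs, whereas the percentage rescales by what was actually observed.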
The Townsend deprivation score at baseline was classified into quintiles defined by the distribution in the UK Biobank as a whole. Body mass index and smoking status were also obtained from the baseline assessment data. Body mass index was inverse transformed and then rescaled by multiplying by 10. Smoking status was defined as current versus past, never or unknown. The other clinical risk factors were extracted from hospital records by selecting records with ICD9 or ICD10 codes for the disease of interest (Table S2, Supplementary Appendix).

We used logistic regression to examine the association of risk factors with severity of COVID-19 disease. We began with a base model that included SNP score, age group and gender. We then included all the candidate variables and used backwards step-wise selection to remove those with P values >0.05. We then refined the final model by considering the addition of the removed candidate variables one at a time. Model selection was informed by examination of the Akaike information criterion and the Bayesian information criterion, with a decrease of >2 indicating a statistically significant improvement. Model calibration was assessed using the Pearson-Windmeijer goodness-of-fit test and model discrimination was measured using the area under the receiver operating characteristic curve (AUC). To compare the effect sizes of the variables in the final model, we used the odds per adjusted standard deviation 20 using dummy variables for age group and ABO blood type. Sensitivity analyses were undertaken by including participants with no hospital records.
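For readers less familiar with the discrimination measure used here, the AUC can be read as the probability that a randomly chosen case has a higher predicted score than a randomly chosen control, with ties counted as one half. The sketch below is a plain-Python illustration of that definition, not the authors' Stata code; the function name and the toy scores are made up.

```python
from typing import Sequence


def auc_from_scores(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Rank-based AUC: P(case score > control score) + 0.5 * P(tie).

    This pairwise form is quadratic in sample size, which is fine for an
    illustration but slower than a sort-based implementation.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy predicted probabilities: 5 of the 6 case-control pairs are ordered correctly.
toy_auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3])
print(round(toy_auc, 4))  # 0.8333
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.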
We used the intercept and beta coefficients from the final model to calculate the COVID-19 risk score for all UK Biobank participants. All statistical analyses were conducted by GSD. We used Stata (version 16.1) 21 for analyses; all statistical tests were two-sided and P values <0.05 were considered nominally statistically significant.

The UK Biobank has Research Tissue Bank approval (REC #11/NW/0382) that covers analysis of data by approved researchers. All participants provided written informed consent to the UK Biobank before data collection began. We conducted this research using the UK Biobank resource under Application Number 47401.

Of the 1,582 UK Biobank participants with a positive SARS-CoV-2 test result and hospital and SNP data available, 564 (35.7%) were from an outpatient setting and considered not to have severe disease (controls), while 1,018 (64.3%) were from an inpatient setting and considered to have severe disease (cases). Cases ranged in age from 51 to 82 years with a Table 1 .

The adjusted odds ratios for the variables included in the final model are shown in Table 2. This model included SNP score, age group, gender, ethnicity, ABO blood type, and a history of autoimmune disease (rheumatoid arthritis, lupus or psoriasis), haematological cancer, non-haematological cancer, diabetes, hypertension or respiratory disease (excluding asthma) and was a good fit to the data (Windmeijer's H=0.02, P=0.88). The SNP score was strongly associated with severity of disease, increasing risk by 19% per percentage-point increase in risk alleles.
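To give a feel for the size of the SNP score effect, the short Python sketch below compounds the reported 19% increase across several percentage points. It assumes that the 19% figure is an odds ratio of 1.19 per percentage point from the logistic model and that the effect is log-linear in the score; the function name is hypothetical and the other model terms are ignored.

```python
import math

# Assumption: "19% per percentage point" is read as an odds ratio of 1.19.
ODDS_RATIO_PER_POINT = 1.19


def odds_multiplier(delta_points: float, odds_ratio: float = ODDS_RATIO_PER_POINT) -> float:
    """Multiplicative change in odds for a difference of delta_points in SNP score.

    In a logistic model the log odds change by beta * delta, where
    beta = ln(odds ratio per unit), so the odds change by exp(beta * delta).
    """
    beta = math.log(odds_ratio)
    return math.exp(beta * delta_points)


# Two participants whose SNP scores differ by 5 percentage points.
five_point_multiplier = odds_multiplier(5)
print(round(five_point_multiplier, 2))  # 2.39
```

Under these assumptions, a five-point difference in the percentage of risk alleles corresponds to roughly 2.4 times the odds of severe disease, all else being equal.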
The effect of age was only evident in the group aged 70 years and over, and while gender was not statistically significant (P=0.26), it was retained because it was one of the three variables in the base model to which other variables were added. Ethnicity showed a 43% increase in risk for non-whites but was only marginally statistically significant (P=0.06). The AB blood type was protective (P=0.007), but the protective effect of blood type A and the increased risk for blood type B were not statistically significant (P=0.10 and P=0.41, respectively). Table 2 also shows the odds per adjusted standard deviation for the final model. This allows direct comparisons of the strength of the associations for each variable, regardless of the scales on which they were measured. The SNP score was, by far, the strongest predictor followed by respiratory disease and age 70 years or older. Sensitivity analyses including those with no linked hospital records did not change the conclusions presented here ( with age and gender (χ²=57.97, df=1, P<0.001). The full model had an AUC of 0.786 (95% CI, 0.763 to 0.808) and was a 28% improvement over the model with clinical factors only (χ²=39.54, df=1, P<0.001), a 59% improvement over the SNP score (χ²=71.94, df=1, P<0.001), and a 111% improvement over the model with age and gender (χ²=113.67, df=1, P<0.001). Figure 2 illustrates the difference in the distributions of the COVID-19 risk scores in cases and controls. The median score was 3.35 for cases and 0.90 for controls, with inter-quartile ranges of 6.70 and 1.34, respectively.
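The text does not state how the percentage improvements in AUC were computed. One reading that roughly reproduces the reported figure for the age and gender comparison is to measure each AUC as its excess over the chance value of 0.5. The Python sketch below works through that reading; the formula is an assumption, not the authors' stated method.

```python
def improvement_over_chance(auc_new: float, auc_reference: float, chance: float = 0.5) -> float:
    """Percentage improvement in discrimination, measured above chance.

    Assumed formula: 100 * ((auc_new - chance) / (auc_reference - chance) - 1).
    """
    if auc_reference <= chance:
        raise ValueError("reference AUC must exceed the chance level")
    return 100.0 * ((auc_new - chance) / (auc_reference - chance) - 1.0)


# Full model (AUC 0.786) versus age and gender only (AUC 0.635), as reported.
full_vs_age_gender = improvement_over_chance(0.786, 0.635)
print(round(full_vs_age_gender, 1))  # 111.9
```

With the rounded AUCs this gives about 112%, close to the reported 111%; the small gap would be consistent with the original calculation using unrounded values, although that cannot be confirmed from the text.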
Sixteen per cent of cases and 53% of controls had COVID-19 risk scores of less than 1, and 18% of cases and 25% of controls had scores ≥1 and <2. COVID-19 risk scores ≥2 were more common in cases than in controls, with 13% of cases and 9% of controls having scores ≥2 and <3, 8% of cases and 4% of controls having scores ≥3 and <4, and 45% of cases and 9% of controls having scores ≥4. Figure 3 shows that the distribution of the COVID-19 risk score in the whole UK Biobank is similar to that for the controls in Figure 2b. The median COVID-19 risk score in the whole UK Biobank was 1.32 and the inter-quartile range was 1.80. Thirty-eight per cent of the UK Biobank have COVID-19 risk scores of less than 1, while 29% have scores ≥1 and <2, 13% have scores ≥2 and <3, 6% have scores ≥3 and <4, and 14% have scores of ≥4.

One of the main issues of the COVID-19 pandemic is that of susceptibility to severe disease. We have shown that a comprehensive risk prediction test that quantifies the varying effects of clinical risk factors and a SNP risk score has an AUC of 0.786 and improves risk discrimination of severe COVID-19 by 111% compared with a model using age and gender (P<0.001). Examination of the odds per adjusted standard deviation ( SNP score is the strongest risk factor for severe COVID-19. While the SNP score explains more variance in disease severity than all of the other risk factors in the model combined, the full model discriminates better than the clinical factors alone or the SNP score alone (both P<0.001). The strong associations observed in the model consisting of just age and gender (Table S4, Supplementary Appendix) are attenuated by the inclusion of other risk factors.
This is due to the comorbidities in the full model being more prevalent in older people and in men, and it is the comorbidities, not age and gender, that are associated with severe disease. Relying on age and gender alone to determine risk of severe COVID-19 will unnecessarily classify healthy older people as being at high risk and will fail to accurately quantify the increased risk for younger people with comorbidities.

Our study does have some limitations. We used source of test result as a proxy for severity of disease. Therefore, there is considerable opportunity for misclassification of disease severity, but this would be likely to attenuate the magnitude of the associations. Townsend deprivation score, BMI and current smoking status were taken from the baseline assessment data and may not represent the participants' current status. This may have contributed to these variables not being statistically significant. Until mid-May, testing for COVID-19 in the UK was limited to those who had recognisable symptoms and were essential workers, contacts of known cases, hospitalised or had returned from overseas. 22 Therefore, many asymptomatic or very mild cases from the first wave of the pandemic will not have been identified in this dataset. Nevertheless, our results remain applicable to those who develop symptoms that warrant medical attention.

While the vast majority of UK Biobank participants are at low or only slightly elevated risk of severe COVID-19 (Figure 3), we can identify those who are likely to be at substantially increased risk.
Our risk prediction test for severe COVID-19 in people aged 50 years or older has great potential for wide-reaching benefits in managing the risk for essential workers, in healthcare settings and in workplaces that seek to re-open safely. The test will also enable individuals to make informed choices based on their personal risk. However, key to understanding the performance of our risk prediction test will be validation in independent data sets, work that we are planning to undertake in the near future.

We wish to thank Mr Lawrence Whiting for his invaluable expertise in the management of large data files from the UK Biobank.

This preprint is made available under a CC-BY-NC-ND 4.0 International license.
Implications for COVID-19 triage from the ICNARC report of 2204 COVID-19 cases managed in UK adult intensive care units
Rapid risk assessment - coronavirus disease 2019 (COVID-19) in the EU/EEA and the UK - ninth update
Global economic prospects
How mental health care should change as a consequence of the COVID-19 pandemic
Risk factors for intensive care unit admission and in-hospital mortality among hospitalized adults identified through the U.S. Coronavirus Disease
Factors associated with COVID-19-related death using OpenSAFELY
The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
Genomewide association study of severe Covid-19 with respiratory failure
UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age
The UK Biobank resource with deep phenotyping and genomic data
UK Biobank makes health data available to tackle COVID-19
LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants
SNPnexus: assessing the functional relevance of genetic variation to facilitate the promise of precision medicine
Assessment of clinical validity of a breast cancer risk model combining genetic and clinical information
A genome-wide association study identifies protein quantitative trait loci (pQTLs)