key: cord-0873698-9zaz9bdm authors: Kosmicki, J. A.; Horowitz, J. E.; Banerjee, N.; Lanche, R.; Marcketta, A.; Maxwell, E.; Bai, X.; Sun, D.; Backman, J.; Sharma, D.; O'Dushlaine, C.; Yadav, A.; Mansfield, A.; Li, A.; Mbatchou, J.; Watanabe, K.; Gurski, L.; McCarthy, S.; Locke, A.; Khalid, S.; Chazara, O.; Huang, Y.; Kvikstad, E.; Nadkar, A.; O'Neill, A.; Nioi, P.; Parker, M. M.; Petrovski, S.; Runz, H.; Szustakowski, J.; Wang, Q.; Jones, M.; Balasubramanian, S.; Salerno, W.; Shuldiner, A.; Marchini, J.; Overton, J.; Habegger, L.; Cantor, M.; Reid, J.; Baras, A.; Abecasis, G. R.; Ferreira, M. A. title: Genetic association analysis of SARS-CoV-2 infection in 455,838 UK Biobank participants date: 2020-11-03 journal: medRxiv DOI: 10.1101/2020.10.28.20221804 sha: 9e0574157f0b4e3a7a5e4a54b4fea6495a98fedf doc_id: 873698 cord_uid: 9zaz9bdm

Background. Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) causes Coronavirus disease-19 (COVID-19), a respiratory illness with influenza-like symptoms that can result in hospitalization or death. We investigated human genetic determinants of COVID-19 risk and severity in 455,838 UK Biobank participants, including 2,003 with COVID-19. Methods. We defined eight COVID-19 phenotypes (including risks of infection, hospitalization and severe disease) and tested these for association with imputed and exome sequencing variants. Results. We replicated prior COVID-19 genetic associations with common variants in the 3p21.31 (in LZTFL1) and 9q34.2 (in ABO) loci. The 3p21.31 locus (rs11385942) was associated with disease severity amongst COVID-19 cases (OR=2.2, P=3x10^-5), but not risk of SARS-CoV-2 infection without hospitalization (OR=0.89, P=0.25). We identified two loci associated with risk of infection at P<5x10^-8, including a missense variant that tags the epsilon 4 haplotype in APOE (rs429358; OR=1.29, P=9x10^-9).
The association with rs429358 was attenuated after adjusting for cardiovascular disease and Alzheimer's disease status (OR=1.15, P=0.005). Analyses of rare coding variants identified no significant associations overall, either exome-wide or with (i) 14 genes related to interferon signaling and reported to contain rare deleterious variants in severe COVID-19 patients; (ii) 36 genes located in the 3p21.31 and 9q34.2 GWAS risk loci; and (iii) 31 additional genes of immunologic relevance and/or therapeutic potential. Conclusions. Our analyses corroborate the association with the 3p21.31 locus and highlight that there are no rare protein-coding variant associations with effect sizes detectable at current sample sizes. Our full analysis results are publicly available, providing a substrate for meta-analysis with results from other sequenced COVID-19 cases as they become available. Association results are available at https://rgc-covid19.regeneron.com

... as per-sample gVCFs. These gVCFs are aggregated with GLnexus into a joint-genotyped, multi-sample VCF (pVCF). SNV genotypes with read depth less than seven (DP < 7) and indel genotypes with read depth less than ten (DP < 10) are changed to no-call genotypes. After the application of the DP genotype filter, a variant-level allele balance filter is applied, retaining only variants that meet either of the following criteria: (i) at least one homozygous variant carrier or (ii) at least one heterozygous variant carrier with an allele balance at or above the cutoff (AB >= 0.15 for SNVs and AB >= 0.20 for indels).

Identification of low-quality variants from exome-sequencing using machine learning.
Briefly, we defined a set of positive control and negative control variants based on: (i) concordance in genotype calls between array and exome sequencing data; (ii) Mendelian inconsistencies in the exome sequencing data; (iii) differences in allele frequencies between exome sequencing batches; (iv) variant loadings on 20 principal components derived from the analysis of variants with a MAF<1%. The model was then trained on 30 available WeCall/GLnexus site quality metrics, including, for example, allele balance and depth of coverage. We split the data into training (80%) and test (20%) sets. We then performed a grid search with 5-fold cross-validation on the training set and applied the model with highest accuracy to the test set. Out of 15 million variants in the exome target region, 1 million (6.5%) were identified as low-quality and excluded from the analysis. Similarly, we identified and removed 6 million out of 21 million variants (28.6%) in the buffer region.

medRxiv preprint doi: https://doi.org/10.1101/2020.10.28.20221804; this version posted November 3, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Gene burden masks. Briefly, for each gene region as defined by Ensembl [26], genotype information from multiple rare coding variants was collapsed into a single burden genotype, such that individuals who were: (i) homozygous reference (Ref) for all variants in that gene were considered homozygous (RefRef); (ii) heterozygous for at least one variant in that gene were considered heterozygous (RefAlt); (iii) and only individuals that carried two copies of the alternative allele (Alt) of the same variant were considered homozygous for the alternative allele (AltAlt).
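The collapsing rule above can be sketched as follows. This is a minimal illustration, assuming each genotype is encoded as an alternative-allele count (0, 1 or 2) with None for a no-call. The function name `collapse_burden`, the encoding, the treatment of no-calls, and the precedence given to AltAlt when an individual carries both a homozygous and a heterozygous variant are illustrative assumptions, not details stated in the text.

```python
from typing import Optional, Sequence


def collapse_burden(genotypes: Sequence[Optional[int]]) -> Optional[str]:
    """Collapse one individual's genotypes across a gene's rare variants.

    Each entry is the alternative-allele count (0, 1 or 2) at one variant,
    or None for a no-call genotype (hypothetical encoding).
    """
    # Assumption: no-call genotypes are ignored when collapsing.
    called = [g for g in genotypes if g is not None]
    if not called:
        # Assumption: with nothing called, the burden genotype is missing.
        return None
    # Rule (iii): two copies of the Alt allele at the SAME variant -> AltAlt.
    # Assumption: AltAlt takes precedence over any heterozygous calls.
    if any(g == 2 for g in called):
        return "AltAlt"
    # Rule (ii): heterozygous for at least one variant -> RefAlt. Several
    # heterozygous calls at different variants still collapse to RefAlt.
    if any(g == 1 for g in called):
        return "RefAlt"
    # Rule (i): homozygous reference at every called variant -> RefRef.
    return "RefRef"
```

Under these assumptions, an individual with heterozygous calls at two different variants in the same gene collapses to RefAlt, because only two Alt copies at a single variant yield AltAlt.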
We did not phase rare variants; compound heterozygotes, if present, were considered heterozygous (RefAlt). We did this separately for four classes of variants: (i) predicted loss of function (pLoF), which we refer to as an "M1" burden mask; (ii) pLoF or missense ("M2"); (iii) pLoF or missense variants predicted to be deleterious by 5/5 prediction algorithms ("M3"); (iv) pLoF or missense variants predicted to be deleterious by 1/5 prediction algorithms ("M4"). The five algorithms used to predict deleterious missense variants were SIFT [27], PolyPhen2 (HDIV), PolyPhen2 (HVAR) [28], LRT [29], and MutationTaster [30]. For each gene, and for each of these four classes, we considered five separate burden masks, based on the frequency of the alternative allele of the variants that were screened in that class: <1%, <0.1%, <0.01%, <0.001% and singletons ...

... of interest are also available, with LD calculated using the respective source genetic datasets.

The data resource supporting the COVID-19 Results Browser is built using a processed version of the raw association analysis outputs. Using the RGC's data engineering toolkit based in Apache Spark and Project Glow (https://projectglow.io/), association results are annotated, enriched and partitioned into a distributed, columnar data store using Apache Parquet. Processed
Parquet files are registered with AWS Athena, enabling efficient, scalable queries on unfiltered association result datasets. Additionally, "filtered" views of associations significant at a threshold of p-value < 0.001 are stored in AWS RDS Aurora databases for low-latency queries to service primary views of top associations. APIs into RDS and Athena are managed behind the scenes such that results with a p-value > 0.001 are pulled from Athena as needed.

... were also observed in analyses stratified by ancestry group (Table 4).

We performed ancestry-specific GWAS for eight COVID-19-related phenotypes, using imputed variants available for a subset of 455,838 individuals (Table 5). These phenotypes captured a spectrum of disease severity, from COVID-19 cases who did not require hospitalization to those with severe disease (respiratory support or death). Association results are publicly available at https://rgc-covid19.regeneron.com and the main findings are summarized below. The genomic inflation factor (λGC) was close to 1 for most analyses (Supplementary Table 1 ...

... Table 2). Therefore, we tested if the association between the APOE locus and susceptibility to COVID-19 could be confounded by AD or CAD case-control status. When both diseases were added as
covariates to the model, we found that the association with rs429358 was significantly attenuated (OR=1.15; 95% CI=1.04-1.26; P=0.005). These results suggest that the association between rs429358 in APOE and COVID-19 risk likely arose because of the enrichment of AD and CAD amongst COVID-19 cases.

The second locus was on chromosome 19p13.11, also associated with the phenotype ...

Genome-wide significant associations in trans-ancestry meta-analysis. Seven of the eight phenotypes were tested in two or more ancestries. For these, we combined results across ancestries using a fixed-effects meta-analysis, but no new loci were identified at P<5x10^-8.

We tested the association between the same eight COVID-19-related phenotypes and exome sequencing variants available for a subset of 424,183 individuals from the UKB study. We tested both single variants and a burden of rare variants in protein-coding genes (see Methods).

Exome-wide association results. The λGC for common variants (MAF>0.5%) was close to 1 for most analyses (Supplementary Table 3), while for rare variants (MAF<0.5%) we observed a considerable deflation of test statistics, caused by a large proportion of variants having a MAC of ...
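The genomic inflation factor quoted in these results is conventionally computed as the median observed test statistic divided by the median expected under the null. The sketch below follows that convention, assuming each association p-value can be mapped back to a chi-square statistic with one degree of freedom. The function name `genomic_inflation` is hypothetical, and the text does not state how the reported values were actually calculated.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Return median(observed chi-square) / median(expected chi-square).

    Assumption: every p-value comes from a two-sided test equivalent to a
    1-degree-of-freedom chi-square test.
    """
    std_normal = NormalDist()
    chi2_stats = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        # A two-sided p-value maps to z = Phi^-1(p / 2); z squared is the
        # corresponding 1-df chi-square statistic.
        z = std_normal.inv_cdf(p / 2.0)
        chi2_stats.append(z * z)
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

Values near 1 indicate well-calibrated test statistics. Values below 1 correspond to the deflation described for rare variants, and values above 1 indicate inflation.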
Previous GWAS reported an association between risk of SARS-CoV-2 infection and common variants in the 3p21.31 locus [20-22]. We confirmed this association and further showed that this locus affects disease severity but not (or less so) risk of infection. We note, as have others, that the lead variant rs35652899 is in high LD with a lead expression quantitative trait locus (eQTL) for SLC6A20 in lung tissue [39]. The SLC6A20 gene encodes SIT1, a proline transporter expressed in the small intestine, lung, and kidney [40]. SIT1 expression and function are increased via interaction with angiotensin-converting enzyme 2 (ACE2), which is the SARS-CoV-2 receptor [41]. One intriguing hypothesis is that increased expression of SLC6A20 in the gastrointestinal tract, lung or kidney might promote viral uptake, thus leading to increased risk of severe disease due to pathology in these tissues. Other candidate genes in the region include LZTFL1, which ...

... reproducible in independent studies. First, we found no difference in the representation of blood types among COVID-19 cases and controls (not shown).
Second, although we did observe a directionally consistent and nominally significant association between risk of infection and the published lead variant, when we combined results from the UK Biobank with those from the discovery cohort [20], the association with this variant did not reach genome-wide significance (not shown). Third, we found no evidence for an association between this locus and disease severity. Therefore, it remains unclear whether variants in the ABO locus represent bona fide risk factors for COVID-19.

... between risk of SARS-CoV-2 infection and a variant that tags the ε4 haplotype in APOE. Common variants in APOE have been previously associated with SARS-CoV-2 infection, independent of CAD, dementia and other comorbidities [46]. However, in contrast to these findings, we found that the association with APOE was significantly attenuated after adjusting for AD and CAD. Similar results were obtained after conditioning on AD alone (not shown). This suggests that the observed association between risk of SARS-CoV-2 infection and APOE in our analysis of the UK Biobank was, at least partly, confounded with AD status.

We also identified a putative new association between common variants on chromosome 19p13.11 and risk of SARS-CoV-2 infection. However, this locus was not associated with increased risks of hospitalization or severe disease amongst COVID-19 positive individuals. Replication in independent studies is required to validate the association between 19p13.11 and risk of SARS-CoV-2 infection.
Lastly, we analyzed exome sequence data for a subset of 424,183 individuals in the UK Biobank to test the association between COVID-19 phenotypes and rare variants not captured by array genotyping or imputation. We found no associations at P<5x10^-8 with pLoF variants, missense variants or in gene-burden analyses. We then concentrated on 81 genes of interest, including 14 genes related to interferon signaling [14, 23], 36 genes in two GWAS loci [20] and 31 additional genes of immunologic relevance and/or therapeutic potential. After correcting for the number of tests performed, there were no significant associations between the COVID-19 hospitalization phenotype and a burden of rare deleterious variants in any of these genes. We are expanding our analysis of exome sequence data to include additional studies and will update results accordingly.

At the outset of the pandemic, testing for SARS-CoV-2 was restricted to symptomatic individuals and often performed exclusively at inpatient/outpatient care sites. Thus, this current analysis is likely weighted toward cases with demonstrable COVID-19 symptoms or clinical presentation. Broader analysis of seropositive individuals who were asymptomatic or had infections mild enough to resolve at home will be critical to identify genetic factors that might protect from severe disease, particularly among high-risk groups with comorbidities.
Regardless, further genetic studies across ancestry groups will shed more light on human genetic risk factors associated with susceptibility to SARS-CoV-2 and may point to pathways and approaches for the treatment of COVID-19.
...                            10 (10.8)  6 (12)   4 (9.5)   994 (9.4)
Heart Failure, n (%)           5 (5.4)    3 (6)    2 (4.7)   167 (1.6)
Type 2 Diabetes, n (%)         25 (27.1)  16 (32)  9 (21.4)  2308 (21.9)
Chronic kidney disease, n (%)  6 (6.5)    6 (12)   ...
... involved in the etiology of SARS-CoV-2, encode therapeutic targets or have been implicated in other immune or infectious diseases through GWAS.
A Novel Coronavirus from Patients with Pneumonia in China
The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
Clinical Characteristics of Coronavirus Disease 2019 in China
Asymptomatic and Presymptomatic SARS-CoV-2 Infections in Residents of a Long-Term Care Skilled Nursing Facility - King County
Presumed Asymptomatic Carrier Transmission of COVID-19
Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
... FUT2) gene provides resistance to symptomatic norovirus (GGII) infections
Genomewide Association Study of Severe Covid-19 with ...

All authors/contributors are listed in alphabetical order.

Contribution: All authors contributed to securing funding, study design and oversight. All authors reviewed the final version of the manuscript.

Sequencing and Lab Operations
... O. performed and are responsible for sample genotyping ... conceived and are responsible for laboratory automation ...

Contribution: All authors contributed to the development and validation of clinical phenotypes used to identify study subjects and ...

... performed and are responsible for analysis needed to produce exome and genotype data. G.E. and J.G.R. provided compute infrastructure development and operational support. S.B. and J.G.R. provide variant and gene annotations and their functional interpretation of variants ...

Colm O'Dushlaine, ... group. J.B. identified low-quality variants in exome sequence data using machine learning ... prepared the analytical pipelines to perform association analyses ...

Immune, Respiratory, and Infectious Disease Therapeutic Area Genetics
... helped define COVID-19 phenotypes, interpret association results and led the manuscript writing group