key: cord-1034037-pa3ju50t authors: Jagoda, Evelyn; Marnetto, Davide; Montinaro, Francesco; Richard, Daniel; Pagani, Luca; Capellini, Terence D. title: Regulatory dissection of the severe COVID-19 risk locus introgressed by Neanderthals date: 2021-06-14 journal: bioRxiv DOI: 10.1101/2021.06.12.448149 sha: e97b241fc15a599d9138e5227706703b63e613f0 doc_id: 1034037 cord_uid: pa3ju50t Individuals infected with the SARS-CoV-2 virus present with a wide variety of phenotypes ranging from asymptomatic to severe and even lethal outcomes. Past research has revealed a genetic haplotype on chromosome 3 that entered the human population via introgression from Neanderthals as the strongest genetic risk factor for the severe COVID-19 phenotype. However, the specific variants along this introgressed haplotype that contribute to this risk and the biological mechanisms that are involved remain unclear. Here, we assess the variants present on the risk haplotype for their likelihood of driving the severe COVID-19 phenotype. We do this by first exploring their impact on the regulation of genes involved in COVID-19 infection using a variety of population genetics and functional genomics tools. We then perform an locus-specific massively parallel reporter assay to individually assess the regulatory potential of each allele on the haplotype in a multipotent immune-related cell line. We ultimately reduce the set of over 600 linked genetic variants to identify 4 introgressed alleles that are strong functional candidates for driving the association between this locus and severe COVID-19. These variants likely drive the locus’ impact on severity by putatively modulating the regulation of two critical chemokine receptor genes: CCR1 and CCR5. These alleles are ideal targets for future functional investigations into the interaction between host genomics and COVID-19 outcomes. Since its emergence in late 2019, the SARS-CoV-2 virus has infected more than 160 million 2! people worldwide and claimed more than 3 million lives (WHO, weekly report May 25, 2021). ! 4! 2020) that constitute a well-studied representative sample of the broader Estonian population as 1! sampled by the Estonian Biobank (EGCUT) (Leitsalu et al 2015) . These samples also have 2! available whole blood RNA-sequence data which contributed to eQTLGen, a broad whole blood 3! eQTL analysis study (Vosa et al. 2018) . By utilizing genomes that were part of the eQTL study 4! population, we can be assured that the associations between alleles and gene expression is 5! accurate, as differential linkage disequilibrium (LD) between alleles in different populations can 6! decrease the efficacy of using eQTL data from one population on another. We initially conducted the Sprime scan (Browning et al. 2018 ) using the 423 Estonians as the 9! ingroup population along with 36 African samples from the Simons Genome Diversity Project 10! (SGDP) with no evidence of European admixture (Mallick et al. 2016 ) as an outgroup (Table 11 ! S1). From this scan, we identified 175,550 likely archaically introgressed alleles across 1,678 12! segments (Table S2) . Following Browning et al. (2018) , we then identified segments as 13! confidently introgressed from Neanderthals if they had at least 30 putatively archaically 14! introgressed alleles with a match rate to the Vindija Neanderthal genome (Prüfer et al. 2017) 15! greater than 0.6 and a match rate with the Denisovan genome (Meyer et al. 2012 ) less than 0.4 16! (Browning et al. 2018) . In total, we identified 693 such segments (Table S3) , including the 17! segment containing the COVID-19 severity haplotype on chromosome 3 (see above). 18! 19! We next used the U and Q95 scan, which specifically identifies regions of introgression showing 20! evidence of positive selection (Racimo et al. 2017) . Using Africans from the 1000 genomes 21! project as an outgroup (1000 Genomes Project Consortium, 2015), we found 493 such regions 22! (Table S4) . We did not detect the introgressed COVID-19 severity haplotype in our population 23! ! 5! via this method. This suggests that the COVID-19 severity associated segment, while likely 1! introgressed from Neanderthals based on its detection in our Sprime scan and via the work of 2! others (Zeberg and Paabo 2020), was not under positive selection in the Estonian population. This is consistent with the previously reported lower frequency (8%) of the haplotype in 4! Europeans relative to South Asian populations in which the haplotype is at higher frequency 5! (30%) (Zeberg and Paabo 2020) . However, U and Q95 scans do detect this region in South Asian 6! populations (Racimo et al. 2017; Jagoda et al. 2018) , supporting positive selection on this 7! haplotype in South Asian, but not European populations. We next examined which alleles in these putatively Neanderthal introgressed regions detected 10! using these two genome wide scans also are cis-and trans-eQTLs in the eQTLGen whole blood 11! dataset (Vosa et al. 2018) . From the U and Q95 data, we identified 684 cis-eQTLs across 250 12! 40kb windows (Table S5 ). There were no trans-eQTLs detected in this set. From the Sprime 13! data, we found 27,342 cis-eQTLs from 318 segments along with four trans-eQTLs from three 14! segments (Table S6 and S7) . 15! 16! Refinement of the severe COVID-19 associated introgressed segment 17! In our Sprime scan, we identified an introgressed region containing the haplotype defined by 18! Zeberg and Paabo (2020) as both introgressed and associated with increased risk of COVID-19- 19! severity. The overall introgressed region as detected in our Estonian population spans ~811kb likely a product of incomplete lineage sorting, which is detected as seemingly introgressed tracts 1! of significantly shorter length (Huerta-Sanchez et al. 2014 ). To examine how the introgressed segment may be affecting COVID-19 severity, we began by 4! examining the LD structure within the segment and identified four major blocks defined as 5! minimum pairwise LD between Sprime-identified variants within a block (min r2 = 0.34) (SFig 6! 1). We labeled these blocks as "A" from rs13071258 to rs13068572 (chr3:45843242-46177096), 7! "B" from rs17282391 to rs149588566 (chr3:46179481-46289403), "C" rs71327065 to 8! rs79556692 (chr3:46483630-46585769) , and "D" from rs73069984 to rs73075571 9! (chr3:46593568-46649711) ( Figure 1A ; SFig 1). All Sprime alleles in the A block are 10! significantly (p < 5*10 -8 ) associated with increased risk for COVID-19 severity (The COVID-19 11! Host Initiative et al. 2021) , with the median p-value being 2.32 * 10 -26 and median effect size 12! being 0.42 ( Figure 1B) . The B block also harbors many alleles (81.2%) significantly associated 13! with COVID-19 severity, with the median p-value being 1.94*10 -9 and median effect size being 14! 0.28 ( Figure 1B ). In the "C" and "D" block no alleles are significantly associated with COVID- 15! 19 severity, suggesting that the most likely causal variants for the COVID-19 severity 16! association are found within the A or B blocks ( Figure 1B ). 17! 18! All the 361 Sprime-identified introgressed variants act as eQTLs in the whole blood (Vosa et al. 19! 2018) including for many genes that are relevant to COVID-19 infection. Strikingly, of the four 20! trans-eQTLs identified genome-wide in our Sprime scan regions in Estonians, two were located 21! on the introgressed COVID-19 severity haplotype. These two variants, rs13063635 and 22! rs13098911, have 11 and 33 response genes, respectively (Table S7) . We examined whether 23! ! 7! these response genes have any relevance to COVID-19 infection and found that 3 (27%) and 13 1! (39%) of the response genes for rs13063635 and rs13098911, respectively, are differentially 2! expressed in at least one experiment in which a lung related cell-line or tissue was infected with 3! COVID-19 or other related infections (Table S8 ). These results suggest that these two trans- 4! eQTLs may affect the lung response to COVID-19 in a way that could contribute to differential 5! severity in host response. Furthermore, all 361 variants, including the two trans-eQTLs, act as 6! cis-eQTLs in whole blood, altering the expression of 14 response genes: CCR1, CCR2, CCR3, 7! CCR5, CCR9, CCRL2, CXCR6, FLT1P1, LRRC2, LZTFL1, SACM1L, SCAP, 8! TMIE. Of these genes, 7 are chemokine receptor genes (CCR1, CCR2, CCR3, CCR5, CCR9, 9! CCRL2, CXCR6) , which are likely linked to the segment's association with COVID-19 severity. There is support in particular for an association between CCR1 and CCR5 expression and 11! COVID-19 phenotypes. For example, elevated CCR1 expression has been demonstrated in 12! neutrophils and macrophages from patients with critical COVID-19 illness (Chua et al. 2020) , in 13! biopsied lung tissues from COVID-19 infected patients (Table S8) , as well as in Calu3 cells 14! directly infected with COVID-19 (Table S8) . CCR5 expression is also elevated, though to a 15! lesser degree than CCR1, in macrophages of patients with critical COVID-19 illness (Chua et 16! al.) . Notably, some ligands for CCR1 and CCR5 (CCL15, CCL2 and CCL3) also show over- 17! expression in these patients (Chua et al.) . CCRL2, LZTFL1, SCAP, and SACM1L are also 18! differentially expressed in at least one experiment that measures differential expression of genes 19! in lung tissues and related cell lines infected with COVID-19 or other viruses that stimulate 20! similar immune responses (Table S8) . These results in general support a role for the introgressed 21! variants along this haplotype contributing to the COVID-19 severe phenotype by affecting the 22! expression of genes that play a role in host COVID-19 response. Intriguingly, when considering the effect of the cis-eQTLs for CCR1 across the entire segment, 2! we find that the majority of alleles along the introgressed haplotype within the A block are 3! associated with its down-regulation (average Z score = -12.3) ( Figure 1C ). On the other hand, the 4! majority of alleles within the B and C blocks are associated with CCR1 up-regulation (average Z 5! scores = 7.1 and 10.2, respectively) ( Figure 1C ). It is important to note that these eQTL effects 6! are determined based on whole blood from non-infected, healthy patients (Vosa et al. 2018) . When considered in the context that severe COVID-19 phenotype is characterized by increased 8! expression of CCR1 (Chua et al. 2020) , these risk-associated alleles having different directions 9! of effect suggest that a complex change to the CCR1 regulatory landscape driven by alleles 10! across the introgressed segment may be contributing to the disease phenotype. When we consider 11! CCR5 expression, it shows a more consistent pattern in which the majority of alleles within the A-C blocks are associated with its down-regulation. This result is interesting as CCR5 expression 13! in patients with severe COVID-19 illness is higher than those with more moderate cases (Chua et 14! al.) . However, given the strong LD within each of these segments, discerning the direct 15! connection between one or more alleles driving these regulatory changes and the molecular and 16! phenotypic signatures of severe COVID-19 remains difficult. 2! with!boundaries!of!the!four!LD!blocks!(ACD).!(b)!Severe!COVIDC19!GWAS!effect!sizes!from!the! 3! release!5!of!the!COVIDC19!Host!Initiative!dataset!(2021), To independently assess the regulatory impact of the alleles on this COVID-19 risk haplotype, 22! we employed a Massively Parallel Reporter Assay (MPRA) to investigate which alleles on the 23! introgressed haplotype directly affect gene expression. Alleles which have the ability to 24! modulate gene expression in this reporter assay are candidate putatively functional alleles that 25! may drive the association with COVID-19 severity by altering the expression of genes that 26! facilitate the biological response to . To ensure that we tested any potential risk the introgressed haplotype that have a confirmed Neanderthal-specific origin, but also alleles 1! along the introgressed haplotype that were either already present in the human population when 2! the haplotype was introgressed, or arose anew in humans (i.e., human-derived alleles) on the 3! introgressed haplotype following its introgression. After filtering for SNPs falling within simple 4! repeat regions (Benson et al. 1999 ), which are not compatible with the MPRA (Tewhey et al. We conducted this assay in K562 cells, a leukemia cell line that displays multipotent 11! hematopoetic biology, which allows for comparison between the MPRA data and the eQTLs 12! identified on whole blood samples (see above). Furthermore, K562 cells can be induced into 13! immune cell fates highly relevant to the COVID-19 severity phenotypes including monocyte, 14! macrophage, and neutrophils (Tabilio et al. 1983; Sutherland et al. 1986; Butler et al. 2014) . Moreover, as K562 cells robustly grow and are transfectable using MPRA reagents, they permit 16! the rapid, repeated acquisition of large numbers of cells, as observed in prior MPRAs (Ulirsch et 17! al. 2016; Ernst et al. 2016 ). Finally, the availability of other published datasets generated on 18! K562 cells (e.g. chromatin ChIP-seq data), allows for comparison between MPRA results, which 19! are episomal in nature, and the endogenous behavior of the genome in the same cell type. However, we do note that MPRA results will also be limited as they will not directly reflect the 21! response of alleles in the endogenous genome and within the in vivo tissues in which the COVID-19 response occurs. We therefore also integrated the MPRA results with datasets 23! ! 12! derived on endogenous immune tissues/cells to help improve our ability to identify biologically 1! relevant candidate driver variants. control sequences in this experiment was significantly correlated with their expression in the 13! source MPRA (Pearson's R = 0.59 , p = 0.0058) ( Figure 2C ). Any deviation between positive 14! control sequence activity in this assay and the source MPRA for the two control sets is likely due 15! to additional regulatory information in this assay for which the tested sequences are 270bp 16! compared with 170bp in the source MPRA. Moreover, for our positive control set we observed 17! that 95% of control sequences displayed activity, whereas only 14% of the negative control 18! sequences displayed activity ( Figure 2D ). Of the 613 experimental variants (1226 alleles) tested, 327 (53%) variants were within sequences 1! found to have detected effects on reporter gene expression (i.e., they are considered "active" or 2! cis-regulatory elements (CREs)) in the context of either the allele on the introgressed haplotype 3! or via its alternative variant ( Figure 2D ) (Table S9 ). Consistent with other MPRA studies 4! (Tewhey et al. 2016 , Ulirsch et al. 2016 Uebbing et al. 2021) , most active sequences showed 5! relatively small effects, with only 17.1% of active sequences showing a LFC greater than 2 6! ( Figure 3A ). To confirm that these active CRE sequences reflect endogenous K562 biology, we Figure 3B ). We next defined as "expression modulating variants" (emVars) those variants exhibiting a 14! significant difference of expression between their two allelic versions using a multiple 15! hypothesis corrected p-value less than 0.01. Using this approach, we identified 20 emVars 16! among the 613 variants we tested (Table 1, Figure 1E , Figure 2B ). Consistent with previous 17! MPRA studies (Tewhey et al. 2016 , Ulirsch et al. 2016 , the effect sizes most emVars detected 18! here are relatively modest, with only 1 emVar having an absolute LFC greater than 2 (Table 1, Figure 3C ). Because the sample size is quite small (20), CREs containing emVars do not show 20! significant enrichment within any endogenous K562 functional annotations. However, compared 21! with tested sequences that do not contain emVars, CREs containing these emVars trend towards 22! being over-represented within poised promoters, weak enhancers, DHS, insulator, repressive 23! marks, and for depletion in heterochromatin regions (See Figure 3D) . 24! ! 1! 1! ! 2! 1! We next examined the 20 emVars for additional evidence of a role in the regulatory mechanisms 2! potentially linked with COVID-19 response. Particularly, we looked for emVars that are (1) 3! significantly (p < 5*10 -8 ) associated with COVID-19 severity, (2) are eQTLs for a gene with Of the 20 emVars, we found four that meet all of these criteria (Table 1, Figure 1 ). Based on the 11! combination of eQTL and Hi-C data, the variant rs35454877 is most likely implicated in the 12! down-regulation of CCR5, while variants rs71327024, rs71327057 and rs34041956 appear to be 13! involved in the regulation of CCR1. Of these, rs35454877 and rs71327024 fall within the LD Block A, which shows the strongest GWAS associations for COVID-19 severity and are 150kb 15! and 230kb downstream of rs35044562, the originally defined tag SNP for this GWAS signal 16! (Zeberg and Paabo 2020) . Furthermore, data from other GWAS studies shows that these four 17! variants are significantly (p < 5*10 -8 ) associated with other phenotypes that could relate to the 18! COVID-19 severity phenotype including: "Monocyte count", "Granulocyte percentage of 19! myeloid white cells", and "Monocyte percentage of white cells" (Astle et al. 2016 Following our research approach, we re-examined a previously identified adaptively introgressed 2! segment in Eurasians within the context of the genome-wide signatures of introgression. We identified 613 variants within the introgressed region (chr3:45843242-46654616) and these were 4! tested for whether they are potential drivers of the association this region exhibits with increased 5! COVID-19 severity. Using MPRA, we tested these variants in a multipotent immune-related Further mining our MPRA results in concert with datasets on the epigenomic and transcriptional 13! environment of immune cells from other functional genomics sources, here we highlight four 14! emVars that have particularly strong evidence of acting as putatively causal variants and whose 15! archaic alleles are strongly implicated with CCR1 (rs71327024, rs71327057, rs34041956) and 16! CCR5 regulation (rs35454877). 17! 18! We caution that while our experimental design was optimized for detecting cis-eQTLs variants (Table S8) , 5 (45%) and 14 (41.2%) of the response genes for the two trans-eQTLs in 1! this locus rs13063635 and rs13098911, respectively, were detected as differentially expressed in 2! at least one in-vitro experiment, although neither of these variants were detected as emVars in 3! this MPRA. Therefore, we also urge additional functional studies to consider the effects of these 4! trans-eQTLs and, in general, to replicate our findings in lung epithelial and other cell types from 5! which some of the expression evidence we here build on are derived. regulators of CCR1 all showed down-regulation. Furthermore, alleles in Block A of the 13! introgressed haplotype, which exhibit the strongest GWAS associations with COVID-19 14! severity, all act as down-regulating eQTLs for CCR1 in healthy whole blood samples. We 15! hypothesize that this difference between the direction of effect of the alleles in the healthy in vivo 16! and reporter assay in vitro episomal condition relative to in the disease state reflects the fact that 17! these alleles contribute to the risk of severe COVID-19 by destabilizing the regulatory 18! mechanism of CCR1 and CCR5, such that they have decreased expression in the healthy form, 19! but are hyper-expressed upon infection. Additional work needs to be done to further explore this 20! potential mechanism as well as to uncover what therapeutic roots may be undertaken to mitigate 21! this mechanism, should it be found to be accurate. datasets, and performed different-expression analyses for COVID-19 and related RNA-seq 16! datasets, L.P. ran the population genetics analyses, provided interpretation of results and 17! participated in manuscript editing. T.D.C. conceptualized the project, provided supervision for all 18! experiments and analyses, and participated in manuscript writing and editing. The data supporting the findings of this study, as presented in all figures (Figures 1-3) , including (SFig 1) , are available within Tables S1-9 included in this publication as 23! well as from the corresponding author upon reasonable request. We have deposited all MPRA 24! data in Geo Omnibus (GSE176233). The following secure token has been created for Reviewers 25! to access the raw datasets: XXXX. We downloaded the Sprime software from https://faculty.washington.edu/browning/sprime.jar European admixture (Mallick et al. 2016 ) as an outgroup (Table S1 ). Following Browning et al. (2018), we then summed the number of Sprime alleles per segment, and for segments greater 9! than or equal to 30 alleles, we calculated the match rate to the Vindija Neanderthal genome 10! (Prüfer et al. 2017 ) and the Denisovan genome (Meyer et al. 2012) . Segments with greater than 11! 0.6 match rate to the Neanderthal genome and less than 0.4 match rate to the Denisovan genome, 12! were considered introgressed by Neanderthals. In total, we identified 693 such segments (Table 13! S3), including the segment containing the COVID-19 severity haplotype on chromosome 3 (see 14! above). U and Q95 17! Concerning U and Q95, following Racimo et al. (2017) , for every 40kb window within the 18! genome, we calculated the U score as the number of SNPs per 40kb window which had <1% (Table S1 ), had >20% frequency in 423 Estonians, and are homozygous in the Vindija Neanderthal genome (Prufer et al. 2017) . We also calculated the Q95 score as the 95% quantile validateMappings". All other parameters were left to defaults. Resulting quantification files were 8! loaded into R (version 3.6.1) (R Core Team 2017) via the tximport library (version 1.14.0) 9! (Soneson et al. 2015) with the "type" option set to "'salmon". Transcript quantifications were 10! summarized to the gene level using the corresponding hg19 transcriptome GTF file mappings 11! obtained from ENSEMBL. Count data were subsequently loaded into DESeq2 (version 1.26.0) 12! (Love et al. 2014) using the "DESeqDataSetFromTximport" function. For subsequent 13! differential-expression analysis, a low-count filter was applied prior to library normalization, 14! wherein a gene must have had a count greater than five in at least three samples (in a given 15! dataset) in order to be retained. For tenOever datasets, differential expression analysis was which is also a time-course dataset, was implemented as described previously (Banerjee et al. identified for the cis-and trans-eQTLs described in this study. We used a MPRA to determine which of these 613 variants within the introgressed segment fall 5! within active cis-regulatory elements (CREs) and whether they modulate reporter gene 6! expression relative to the other variant at the same position. We conducted this assay in K562 7! cells, a leukemia cell line that displays multipotent hematopoetic biology and which allows for 8! comparison between MPRA data and eQTL datasets derived from whole blood samples. Furthermore, K562 cells can be induced into cell fates associated with the COVID-19 10! phenotypes including monocyte and macrophage and neutrophils (Butler et al. 2014 ). Each variant was tested in the context of 270bp of the endogenous sequence centered around 13! each variant. For the SPrime alleles, if there is another allele within the span of this 270bp 14! sequence that is highly linked (r2 > 0.8) in the Estonian population and at least in 9 of the 10 15! 1KG populations it was included in the 270bp sequence. This 270bp sequence will be hereafter 16! referred to as the "tested sequence". Additionally, we included 44 control sequences from a past 17! MPRA experiment performed in K562 cells (Jagoda et al. 2021) with the 22 strongest up- 18! regulating sequences from this MPRA serving as positive controls and the 22 sequences with 19! smallest magnitude of effect on expression serving as negative controls. In the original assay, 20! these control sequences were 170bp in length, here we extended the sequences to create 270bp 21! sequences centered around the original 170bp sequence. This difference in length could account 22! for any potential regulatory discrepancy between the original experiment and this one. In total, containing a GFP open reading frame, minimal promoter and partial 3' UTR was cloned into the 13! mpra∆orf library to make the final mpra:gfp library. For each of four biological replicates, 40ug of mpra:gfp vector pool was then transfected into 10 16! million K562 cells using electroporation with the Lonza 4D-Nucleofector following the 17! manufacturer's protocol. After 24hrs, cells were collected and flash frozen in liquid N 2 . Closely 18! following Tewhey and colleagues (2016) All MPRA data analysis steps were conducted following Tewhey and colleagues (2016) . The 250bp paired end reads from the sequencing of the mpra∆orf library were merged using Flash 4! v.1.2.11 (Magoč and Salzberg, 2011) . Merged amplicon sequences were then filtered for quality 5! control such that sequences were kept if (1) there was a perfect match of 10bp on the left or right passed through these filters were aligned back to the expected sequence pool using Bowtie2 v. 2.3.4.1 with the --very-sensitive flag. Alignments that had less than 95% perfect matching with 10! the expected sequence and any alignment which had a mismatch at the variant position were 11! removed. Barcodes that matched to more than one expected sequence are unusable and therefore 12! were also removed. 13! 14! Because of our small library size, we were able to get a very high barcode yield. All oligo 15! sequences were represented and tagged with a wide diversity of barcodes, with the median 16! unique barcodes per oligo being 18,812 ( Figure S2A ). Only 5 oligos had fewer than 100 unique 17! barcodes ( Figure S2B) , with the fewest being tagged by 27 unique barcodes. This large number 18! of unique barcodes lead to extremely high reducibility between replicates (Figure 2A) such that reads were only kept if they had a maximum levenshtein distance of 4 with the constant 3! sequence within the 3' UTR of the GFP as well as a perfect match with the two base pairs 4! adjacent to the barcode. If the sequence passed through these filters, the barcodes were then 5! matched back to the oligos based on the information from the mpra∆orf library sequencing 6! described above. The counts for each barcode were summed for each oligo. This summation of 7! the counts per barcode reduces the noise that could be derived from any individual barcode 8! having a functional effect. all 4 cDNA samples and all 4 plasmid samples were passed in Deseq2 and sequencing depth was 13! normalized using the median-of-rations method (Love et al. 2014) . We then used Deseq2 to 14! model the normalized read counts for each oligo as a negative binomial distribution (NB). Deseq2 then estimates the variance for each NB by pooling all oligo counts across all the 16! samples and modeling the relationship between oligo counts and the observed dispersion across 17! all the data. It then estimates the dispersion for each individual oligo by taking this observed 18! relationship across all the data as a prior a performing a maximum posteriori estimate of the 19! dispersion for each oligo. Therefore, the bias for the dispersion estimate for each oligo is greatly 20! reduced because it relies on pooled information from all other oligos. We then used Deseq2 to 21! estimate whether an oligo sequence had an effect on transcription by calculating the log fold 22! change (LFC) between the oligo count in the cDNA replicates compared with its count in the 23! ! 11! plasmid pool. We tested whether this LFC constituted a significant difference of expression 1! using Wald's test and required a stringent Bonferroni corrected p-value of less than 0.01 for a 2! significant result. If an oligo sequence had a significant LFC with either allele, the sequence is 3! considered "active". Finally, to determine which variants are expression modulating, for oligos 4! which were determined to be active, we used Deseq2 to calculate the fold change between the 5! two versions of the oligo sequences with Wald's test to calculate the p-values. p-values were 6! then corrected using the Benjamini-Hochberg test to correct for multiple hypothesis testing. Significance was defined stringently as a multiple hypothesis corrected p-value of <0.01. 8! 9! emVar Prioritization and intersection of data from other functional sources 10! To identify which of the 20 variants identified by the MPRA as emVars are the best candidates 11! for contributing to an increased risk of severe COVID-19, we the MPRA data with data from 12! other sources specifically: 13! 14! GWAS data 15! As described above, GWAS data is from the most recent release (release 5) from the COVID-19 16! Host Genetics Institute (The COVID-19 Host Genetics Initiative, Ganna A 2021). If the p-value 17! for an emVar was less than 5 * 10 -8 (or -log10(p-value) > 7.3), it was considered significant and 18! passed through this prioritization step. 19! 20! significant contacts of targeted promoters. This dataset was generated from promoter-capture 1! assays across a number of different tissues and cell-types; given our particular interest in immune 2! cell regulation, we considered only those significant interactions (reported p-value < 0.01) 3! present in samples from lymphoblasts (GM12878.ADS and GM19240.ADS), spleen 4! (STL001.SX1 and STL003.SX3) and thymus (STL001.TH1) samples. Interacting regions, which 5! may indicate putative cis-regulatory elements, were intersected with our defined emVars using 6! bedtools intersect. If an emVar falls within a contact site for any gene, this is reported in Table 7! (1). For our prioritization, if an emVar is both within a contact site for a gene with relevance to 8! COVID-19 infection, particularly CCR1, which is differentially expressed in some SARS-COV-9! 2 infection studies analyzed here (Table S8 ) and by others (Chua et al. 2020) , and CCR5 which 10! reported as is differentially expressed in other studies (Chua et al. 2020) , and the emVar is also 11! an eQTL for this same gene (Vosa et al. 2018) , that supports its prioritization ( Figure 1, Table 1 ). Candidate cis-Regulatory Elements (CCREs) by Encode To further validate emVars for biological relevance, we downloaded all 926,535 human cCREs 15! from https://screen.encodeproject.org/ (ENCODE Project Consortium, 2020). cCREs are DNAse 16! hypersensitivity sites that are further supported by additional evidence of cis-regulatory activity 17! in the form of either histone modifications (H3K4me3 and H3K27ac) or CTCF-binding data. To intersect these human cCREs with our emVar data, we first used LiftOver (Hinrichs et al. 2! 2006) to convert the our emVar coordinates from GRCh37 to GRCh38 and then used Bedtools 3! intersect to search for emVars falling within cCREs. These intersections are reported in Table 1 . For emVars that already passed through the prior prioritization steps that overlapped a cCRE, we 5! then examined the tissue level information on the cCRE on the web browser 6! (https://screen.encodeproject.org/). emVars were prioritized if they are within cCREs that had 7! cCRE annotations in a cell at least one immune-related "class a" cell line, which is a cell line for 8! which data on all four makers (DNAse, H3K4me3, H3K27ac, CTCF-binding) is available. These 9! results are displayed in Figure 1D and Table 1 . 10! 11! 12! The Allelic Landscape of Human Blood Cell Trait 1! Variation and Links to Common Complex Disease CoV-2-Infection-Induced Activation of Type I Interferon Responses Tandem repeats finder: A program to analyze DNA sequences Drives Development of COVID-19 Analysis of Human Sequence Data 14! Reveals Two Pulses of Archaic Denisovan Admixture Human Cell-Based Artificial Antigen-Presenting Cells 18! for Cancer Immunotherapy COVID-19 severity correlates with airway epithelium-immune cell 1! interactions identified by single-cell analysis Comprehensive Transcriptomic Analysis of COVID-19 Blood, Lung, and 6! Airway Chromatin-state discovery and genome annotation with ChromHMM Genome-scale 11! high-resolution mapping of activating and repressive nucleotides in regulatory regions The UCSC Genome Browser Database: update He, 21! by introgression of Denisovan-like DNA Ensembl variation resources Database Disentangling Immediate Adaptive 8! Introgression from Selection on Standing Introgressed Variation in Humans Neanderthal adaptively introgressed 12! genetic variants regulate human immune genes in vitro A compendium of promoter-centered long-range chromatin interactions in the 17! human genome Moderated estimation of fold change and dispersion for 1! RNA-seq data with DESeq2 FLASH: fast length adjustment of short reads to improve genome 4! assemblies The Simons Genome Diversity Project: 300 genomes from 8! 142 diverse populations A high-coverage genome sequence from an archaic 15! Denisovan individual Differences in local population history at 17! the finest level: the case of the Estonian population Kingsford Salmon provides fast and bias-aware 20! quantification of transcript expression A High-Coverage Neandertal Genome from Vindija 1! Cave in Croatia (Prufer,2017) R: A Language and Environment for Statistical Computing Signatures of archaic adaptive introgression in 6! present-day 10! Cherry JM. ENCODE data at the ENCODE portal Differential analyses for RNA-seq: transcript-level 14! estimates improve gene-level inferences ! Differentiation of K562 leukemia cells along erythroid, macrophage, and megakaryocyte 18! lineages Myeloid and megakaryocytic 21! properties of K-562 cell lines Direct 1! identification of hundreds of expression-modulating variants using a multiplexed reporter 2! assay The COVID-19 Host Genetics Initiative, a global 4! initiative to elucidate the role of host genetic factors in susceptibility and severity of the 5! SARS-CoV-2 virus pandemic Mapping the human genetic 8! architecture of COVID-19 by worldwide meta-analysis Expanded encyclopaedias 11! of DNA elements in the human and mouse genomes A global 14! reference for human genetic variation Massively Parallel 18! Discovery of Human-Specific Substitutions That Alter Enhancer Activity Systematic Functional Dissection of Common Genetic Variation Affecting Red Blood Cell Traits Clinical features of 3! covid-19 Unraveling the polygenic 5! architecture of complex traits using blood eQTL meta-analysis. bioRxiv Coronavirus disease (COVID-19) Weekly Epidemiological Update -25 SREBP-Dependent Lipidomic Reprogramming as a Broad-Spectrum 12! Antiviral Target The major genetic risk factor for severe COVID-19 is inherited from 15! Neanderthals Yuxin Hou, et al. 17! "Competing Endogenous RNA Network Profiling Reveals Novel Host Dependency 18! Factors Required for MERS-CoV Propagation Clinical course and risk factors for 22! mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort