key: cord-0324469-wa48fqhl authors: Parikh, V. N.; Ioannidis, A. G.; Jimenez-Morales, D.; Gorzynski, J. E.; De Jong, H. N.; Liu, X.; Roque, J.; Cepeda-Espinoza, V. P.; Osoegawa, K.; Hughes, C.; Sutton, S. C.; Youlton, N.; Joshi, R.; Amar, D.; Tanigawa, Y.; Russo, D.; Wong, J.; Lauzon, J. T.; Edelson, J.; Montserrat, D. M.; Kwon, Y.; Rubinacci, S.; Delaneau, O.; Cappello, L.; Kim, J.; Shoura, M. J.; Raja, A. N.; Watson, N.; Hammond, N.; Spiteri, E.; Mallempati, K. C.; Montero-Martin, G.; Christle, J.; Kirillova, A.; Seo, K.; Huang, Y.; Zhao, C.; Moreno-Grau, S.; Hershman, S.; Dalton, K. P.; Zhen, J.; Kamm, J.; Bhatt, K. title: Deconvoluting complex correlates of COVID19 severity with local ancestry inference and viral phylodynamics: Results of a multiomic pandemic tracking strategy date: 2021-08-08 journal: nan DOI: 10.1101/2021.08.04.21261547 sha: 712523603472da445f4eb73136b3dcb57c843dfe doc_id: 324469 cord_uid: wa48fqhl The SARS-CoV-2 pandemic has differentially impacted populations of varied race, ethnicity and socioeconomic status. Admixture mapping and local ancestry inference represent powerful tools to examine genetic risk within multi-ancestry genomes independent of these confounding social constructs. Here, we leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from 1,327 nasopharyngeal swab residuals and integrate them with digital phenotypes from electronic health records. We demonstrate over-representation of individuals possessing Oceanian and Indigenous American ancestry in SARS-CoV-2 positive populations. Genome-wide-association disaggregated by admixture mapping reveals regions of chromosomes 5 and 14 associated with COVID19 severity within African and Oceanic local ancestries, respectively, independent of overall ancestry fraction. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. We further present summary data from a multi-omic investigation of human-leukocyte-antigen (HLA) typing, nasopharyngeal microbiome and human transcriptomics that reveal metagenomic and HLA associations with severe COVID19 infection. This work demonstrates the power of multi-omic pandemic tracking and genomic analyses to reveal distinct epidemiologic, genetic and biological associations for those at the highest risk. The SARS-CoV-2 pandemic has differentially impacted populations of varied race, ethnicity and socioeconomic status. Admixture mapping and local ancestry inference represent powerful tools to examine genetic risk within multi-ancestry genomes independent of these confounding social constructs. Here, we leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from 1,327 nasopharyngeal swab residuals and integrate them with digital phenotypes from electronic health records. We demonstrate over-representation of individuals possessing Oceanian and Indigenous American ancestry in SARS-CoV-2 positive populations. Genome-wide-association disaggregated by admixture mapping reveals regions of chromosomes 5 and 14 associated with COVID19 severity within African and Oceanic local ancestries, respectively, independent of overall ancestry fraction. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. We further present summary data from a multi-omic investigation of human-leukocyte-antigen (HLA) typing, nasopharyngeal microbiome and human transcriptomics that reveal metagenomic and HLA associations with severe COVID19 infection. This work demonstrates the power of multi-omic pandemic tracking and genomic analyses to reveal distinct epidemiologic, genetic and biological associations for those at the highest risk. Two central questions from the COVID-19 pandemic remain unresolved: who is at risk of severe disease, and why? Genome-wide-association-studies (GWAS) from the COVID-19 Host Genetics Initiative and others have identified up to 13 genetic loci associated with COVID19 infection, hospitalization and critical illness. 1, 2, 3 Among these loci is a chromosome 3 haplotype that entered the human population from Neanderthals, suggesting that genetic ancestry plays an important role in susceptibility and severity in SARS-CoV-2 infection. 4 At the same time, epidemiologic studies have shown that comorbidities, sex and socioeconomic status are strongly associated with infection prevalence and disease severity. [5] [6] [7] [8] Several groups have reported higher incidence of COVID19 infection and higher disease severity among Hispanic/Latino and African American racial and ethnic groups. 5, 7 Because social constructs of race and ethnicity covary with ancestry-related genetic markers, such associations may confound the study of COVID19 host genetic susceptibility. Genetic ancestry inference along the genome represents a powerful tool to examine genetic risk within the context of multi-ancestry genomes. These analyses are independent of the socioeconomic factors associated with race and ethnicity, due to the random shuffling of genomic ancestry segments, even between siblings, via recombination. Further, viral phylodynamics, host immunity (e.g., HLA typing), and metagenomics are important potential contributors to disease severity that can vary with ancestry and social factors and can be clarified with this omics-based approach. To build the complex database necessary for these investigations, residual viral transport media (VTM) from SARS-CoV-2 clinical diagnostic tests were randomly prospectively collected from March 2020 to July 2020 and linked to structured clinical information from the electronic health record. 938 SARS-CoV-2 positive and 389 SARS-CoV-2 negative NP swab residuals were sequenced from a total of 892 (571 positive and 321 negative) patients ( Figure 1A&B ). Host whole genome sequencing was aligned and called using methods for low pass data 9-11 (mean of mean coverages: 2.56X±2.50X(SD), Figures 1C) , followed by phasing and imputation with GLIMPSE ( Figure S1 ). 12 Shotgun RNAseq of initial samples yielded high coverage of much of the viral genome for samples with clinical test CT values <35 regardless of RNA yield, which improved with primer-based capture (Figure 1D and S1D). For digital phenotype abstraction, we generated a COVID19 clinical severity score based on the ordinal scale proposed by the World Health Organization (Table S1) . Clinical data were obtained through the STAnford Research Repository (STARR), specifically the STRIDE data management system, which is populated from patients' clinical and biospecimen data. 13 Severity score was calculated on the date of sample collection and daily for one month before and indefinitely after (Figures 1E&F and S2) . We used genetic ancestry inference to identify subpopulations highly impacted by the COVID19 pandemic. Self-reported race and ethnicity re-demonstrates prior reports of over-representation of minority race and ethnic populations within the US amongst COVID19 positive patients ( Figure S3 ). Based on genetic ancestry inference, a higher proportion of individuals of Oceanian (Pacific Islander) and Indigenous American genetic ancestry exists in COVID19 cases as compared to negative controls (Fig 2A, 44% vs. 26% for individuals having Indigenous American genetic ancestry, associated with Hispanic populations (χ 2 = 99, p<1e-10), and 4.6% vs. 2.3% for individuals having Oceanian ancestry, associated with Pacific Islander origins (χ 2 = 13.9, p=2e-4)). Evaluation of temporal trends of genetic ancestry in case positivity reveals early predominance of European ancestry followed by a significant increase in Indigenous American ancestry after May 2020. These findings are recapitulated by self-reported ethnicity, as the majority of COVID19 cases between May and July 2020 self-identified as Hispanic/Latino in the medical record, an ethnicity often associated with mixed Indigenous American and European genetic ancestry (Fig 2B) . We also investigated the role of viral phylodynamics in COVID19 severity and its potential interaction with individual ancestry and disease severity. Consensus viral genomes (10X coverage at >99% of the genome) were recovered for 255 samples from unique, unrelated individuals. The estimated time to the most recent common ancestor of observed samples is December 11, 2019 with a 95% Bayesian CI of (2019-10-27, 2020-01-12). However, the phylogenetic reconstruction ( Figure 2C ) reveals an early introduction in the area between late December 2019 and early January of 2020 with several independent introductions later in February 2020. While about 37% of the infected individuals in the sample have Indigenous American ancestry, there is no evidence of exclusive transmission amongst individuals of this ancestry. Other genetic ancestries also do not sort with clades, though a single clade from early in the pandemic had fewer Hispanic individuals, consistent with the first wave prior to May 2020, in which European genetic ancestry individuals were enriched ( Figure 2C) . We also tested the hypothesis of association between viral lineages and disease severity. No association was found between specific clades and severity score at the time of NP swab ( Figure 2C ). After adjusting for age, sex and BMI (known correlates of disease severity [14] [15] [16] ) together with overall genetic ancestry proportion, we assessed association of genetic ancestry at a given genomic position with the COVID19 severity score as an ordinal outcome (admixture mapping). Because our captured case population was enriched for non-European ancestry groups, we were able to perform admixture mapping for six ancestries (African, Native American, Oceanian, South Asian, East Asian, and European/West Asian). This analysis revealed loci in chromosomal regions of African and Oceanic ancestry that met genome-wide significance, determined as previously described by Shriner et al. 17 (Figure 3) . In segments of African ancestry, a region of chromosome 5 (5:34675449-35163719, GRCh38.p13) was associated with increased disease severity (p=0.03 after multiple test correction), while in segments of Oceanian ancestry, a region of chromosome 14 (14:44073392-44149995) was also associated with high severity COVID19 (p=1.2e-07 after correction). Two SNPs in these regions (rs13167486 and rs4608241 respectively) were reported in the same genome-wide interaction analysis of pathological hallmarks of Alzheimer's Disease 18 and map to two long noncoding (lnc)RNAs, SPRY4-AS1 and LINC02307. These lncRNAs are predicted to bind a set of miRNAs 19 whose empirically confirmed target genes are enriched in brain, as well as epithelium (e.g., lung). 20, 21 LncRNAs have an important role in antiviral immunity both via their effect in immune cells as well as direct interaction with the viral genome. 22, 23 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 8, 2021. ; We next established a resource of summary statistics for COVID19 severity versus host genotype by genetic ancestry, host HLA type, and metagenomic alignments (https://covid-omics.org/results), and have contributed host genetic summary data and viral consensus sequences to the COVID-19 Host Genetics Initiative 3 and GISAID, 24 respectively ( Figure 4A ). Using this resource, we explored the potential contribution of the nasopharyngeal microbiome and HLA-type as biological determinants of COVID19 severity. A UMAP plot of microbiome species abundance shows clustering largely independent of severity ( Figure 4B) ; however, a regression of species abundance against COVID19 severity (controlling again for age and BMI) revealed enrichment of Paracoccus yeei transcripts in high severity cases, a bacterium that causes opportunistic infections in critically ill, 25 organ transplant, 26 and dialysis patients, 27, 28 indicating an association with immune compromise and severe illness ( Figure 4C , p =7e-04 after Bonferroni correction). The HLA-B*07:02 allele (common prototype allele for the serotype B7) was associated with elevated risk of high severity score (OR 2.7 [1.4, 5.1], p=2.9e-03), whereas the HLA-C*15:02 allele (common prototype allele for the serotype Cw15) was associated with risk reduction (OR 0.12 [0.02, 0.82], p=1.41e-02) ( Figure 4D ). HLA-B*07:02 presents epitopes from the SARS-CoV-2 N gene and Orf1ab, 29 and the HLA-C*15:02 allele contains two distinctive amino acid substitutions at residues 113 and 116 located within the peptide binding groove. HLA-C*15:02 was associated with milder disease in the first SARS epidemic, 30 and is predicted to bind a SARS-CoV-2 Spike protein epitope. 31 These results represent a substantial effort to assemble host and viral genomic, transcriptomic and digital clinical data from a diverse cross-section of the socioeconomic, racial and ethnic groups hit hardest by the COVID19 pandemic. We show that ancestry inference can be used to track changes in the affected population in real-time, demonstrating that Hispanic/Latino groups (associated with Indigenous American genetic ancestry) were disproportionately affected during a second pandemic wave. This is consistent with the model that this second wave was driven not by introduction from travelers (likely the source of the first wave), but by economic pressure on essential service workers to leave their homes and family units, enabling viral spread. 32 Phylodynamic overlay on this at-risk population further supports this conclusion, demonstrating that viral clades did not differentially affect ancestral groups, nor did they confer differential disease severity during six months of prospective enrollment. Thus, the impact of introduction of viral variants on community spread was likely less than that of exposure related to essential services work. In addition to the use of ancestry inference to track the impact of the pandemic on ancestral subpopulations, genomic regions associated with COVID19 severity in the context of local African and Oceanian ancestries highlight potentially novel pathobiology. Both regions are near lncRNAs, noncoding transcripts that can interact with other small noncoding RNAs to alter viral immunity and host gene expression, one of which is expressed in lung epithelium, consistent with potential biological significance to SARS-CoV-2 infection. 19, 21 Admixture mapping and local ancestry disaggregation were necessary to reveal these markers, which would likely otherwise be masked by social and economic determinants of severity that disproportionately affect these ancestral populations. As the populations at highest risk of severe outcomes in a pandemic (due in large part to health and socioeconomic inequities) are not proportionally represented in existing datasets, the development of our real-time data collection strategy from clinical swab residuals was critical to assessing the relevance of ancestry-specific genetic variation. 33 We present a resource combining summary host and viral genomic, metagenomic and transcriptomic data with highly evolved digital phenotype abstraction from extant EHR data to help deconvolve genetic environmental and social factors while tracking spread across the community. This system can be applied in real-time to model individual and population trajectories in future emerging global infectious diseases (https://covidomics.org/results), and can help illuminate underlying biology and medical insights relevant to diverse populations that can be otherwise missed or ignored due to confounding with the social and economic factors that are also associated with race and ethnicity in diverse populations. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 8, 2021. ; (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 8, 2021. ; Genetic ancestry admixture of individuals with positive versus negative COVID19 tests in the present study. Individuals with Indigenous American ancestry are overrepresented in cases, whereas controls show more European and South Asian genetic ancestry. (B) Self-reported (top) and genetic ancestry (bottom) of enrolled COVID19+ individuals over time reveals disproportionate representation of Hispanic/Latino ethnicity and Indigenous American ancestries during summer pandemic wave, whereas the first wave is seen to have predominantly affected non-Hispanic individuals and individuals of European genetic ancestry. (C) Phylogenetic reconstruction of SARS-CoV-2 sequences. Tip colors correspond to the inferred genetic ancestry of the infected hosts, whose consensus SARS-CoV-2 sequences were isolated and used for inferring the viral phylogeny. Horizontal lines at the bottom of the phylogeny indicate host severity scores corresponding to the tips of the phylogeny. Severity score codes are displayed in Table S1 . Genetic ancestry admixture proportions All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 8, 2021. ; Figure 3 . Admixture mapping shows association of ancestry along the genomes of COVID+ patients with severity score. Ancestry-specific risk loci found in Chromosome 5 and 14 in African and Oceanian ancestries, respectively, were found to be associated with the same functional mechanism. Genetic ancestry associations were first corrected for overall ancestry proportions of patients, for BMI, and for age. Lower black dotted-lines indicate less stringent ancestry-specific multiple test correction for a p-value of 0.05, using the method of Shriner et al., while higher, more stringent blue dashed lines indicate naive Bonferroni corrections for the same. A critical task is to determine for every sampled patient the disease severity from the Electronic Health Records (EHR). To accomplish this task, we used as the "COVID-19 Clinical Severity Scale" an adapted version of the "Ordinal Scale for Clinical Improvement" proposed by the World Health Organization in the COVID-19 Therapeutic Trial Synopsis (Draft February 18, 2020, Table S1 ). This scale categorized the COVID-19 severity according to the level of care and oxygen support. Scores 1 to 2 include patients not requiring supplemental oxygen support or hospitalization. However, the WHO definition of these scores were modified due to their vague scope. Thus, score "1", originally described as "no limitation of activities", was modified to "asymptomatic patient", and score "2", from "limitation of activities'' to "symptomatic patient" (symptoms were extracted and curated from EHR billed diagnoses). Scores 3 to 4 include patients hospitalized, with score 4 assigned only to those requiring non-invasive supplemental oxygen (oxygen mask). Scores 5 to 7 are defined as "severe disease" based on level of oxygen support. Thus, "score 5" is for patients requiring high flow oxygen and "score 6" mechanical ventilation. "Score 7" includes critically ill patients requiring, in addition to ventilation, the administration of specific medications (pressors), dialysis, or extracorporeal membrane ventilation (ECMO). A custom algorithm was written that abstracted digital phenotypes from each chart (see Table S1 , "EHR annotations"). First, SARS-CoV-2 positive status was confirmed based on clinical test reports abstracted from the EHR. SARS-CoV-2 negative patients were assigned a score of zero. For SARS-CoV-2 positive patients, starting with the highest score (8, death) and working down, if criteria were met, the individual was assigned that score. If no clinical notes were available for data abstraction, then a severity score was not assigned and these individuals were not included in severity score based analyses. We calculated the score for any given date and assigned the maximum value according to the EHR annotations defined for every score (Supplementary Table 1 ). For example, patients with annotations for both ventilation and the administration of pressors received a score "7" for that day. Clinical data were obtained through the STAnford Research Repository (STARR), a Stanford Medicine's approved resource for working with clinical data for research purposes extracted from the Epic database management system used by the Stanford hospitals. Specifically, we queried the STRIDE data management system, which is populated from patients' observational clinical, research, and biospecimen data 13 . 35 ). For samples collected after May 2020, SARS-CoV-2 ARTIC V3 amplicon libraries were made from extracted total nucleic acid for whole genome sequencing using previously reported protocols [1] . Briefly, 3 ul of total nucleic acid was used as input for a randomly primed cDNA synthesis reaction. This cDNA served as input for 30 cycles of amplification with ARTIC V3 primers (github repo: articnetwork/artic-ncov2019), and was then diluted 1:100 before tagmentation. Adaptor tagmentation was performed using homebrew Tn5, and 8 cycles of index PCR was performed using unique dual barcode Nextera indices (Detailed protocol: https://protocols.io/view/artic-neb-tagmentation-protocolhigh-throughput-wh-bt66nrhe). Final libraries were pooled at equal volumes and cleaned at 0.7x (SPRI: Sample) using SPRIselect beads. Library was sequenced on Illumina Novaseq SP platform in a paired-end 2 x 150 cycle run.An incubation step with 1:10 dilution of FastSelect (Qiagen) reagent was included between the RNA fragmentation and first strand synthesis steps of the library prep to deplete highly abundant host rRNA sequences present in each sample. Equimolar pools (n=160-384 samples) of the resulting individual dual-barcoded library preps were subjected to paired-end 2 x 150bp sequence analysis on an Illumina NovaSeq 6000 (S2 or equivalent flow cell) to yield approximately 50 million reads per sample. For SARS-CoV-2 genomes, FASTQ sequences were aligned to the SARS-CoV-2 reference genome NC_045512.2 using minimap2. 36 Non-SARS-CoV-2 reads were filtered out with Kraken2, 37 using an index of human and viral genomes in RefSeq (index downloaded from https://genexa.ch/sars2bioinformatics-resources/). Spiked primers for viral enrichment were trimmed from the ends of short reads using ivar. 38 Finally, a pileup of the aligned reads was generated with samtools 39 , and consensus genomes were called with ivar. The full pipeline used is publicly available on Github (https://github.com/czbiohub/sc2-illumina-pipeline). All viral consensus sequences were uploaded to the GISAID database (https://www.gisaid.org/). Host and metagenomic RNA alignment was performed using STAR run against a combined index of the human reference genome grch38, SARS-CoV2 (SARSCoV2_NC_045512.2), and ERCC spikeins. STAR parameters were chosen to avoid bias towards GTAG eukaryotic splice signatures for both the viral RNA and host RNA analyses. Metagenomic classification of reads unmapped to both SARS-CoV2 and human was performed using KrakenUniq (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0). KrakenUniq parameters (>=100 kmers and duplication <=less kmers ) were chosen to avoid false positives. From the filtered KrakenUniq output, an abundance Variant Calling, Imputation, PCA, Kinshiship BAM files were used for an initial calling with bcftools v1.9 mpileup. 40 To account for the low-coverage sequencing we used the GLIMPSE algorithm v1.0 for imputation and phasing. 12 Briefly, this algorithm uses a reference set of haplotypes (1000 genomes project samples in our case) to compute genotype likelihoods using a Gibbs sampling procedure. The imputed data were filtered for low imputation scores (INFO>0.8), and were then merged with a reference set that contained samples from: (1) the 1000 genomes project, (2) the Human Genome Diversity Project (HGDP), 41 and (3) the Simons Genome Diversity Project (SGDP). 42 While merging these data, we set minor allele count thresholds of 5 for our data and 20 for the reference set (e.g., MAC>4 using bcftools), and a stringent call rate threshold (--geno 0.01 in PLINK2). 43, 44 The resulting VCF was loaded into PLINK2 v2.00a3LM using the following flags: dosage=DS, --import-dosage-certainty 0.8. These merged data had 4,111,339 autosomal variants that survived the filters above. PLINK2 was then used for LD pruning (--indeppairwise 500 10 0.1) and PCA (--maf 0.01 --pca). We also extracted the kinship matrix of our samples using the King algorithm (--make-king in PLINK2). 45 Genetic Ancestry Inference and assignation Genetic ancestry was determined by running supervised local ancestry inference (RFMix v2.03) 46 on the phased patient genomes using a training reference panel of single ancestry samples selected from the 1000 genomes project, HGDP, and SGDP via unsupervised genetic clustering (ADMIXTURE) 47 at K = 7. Only individuals with greater than 0.95 assignment to one of the seven unsupervised clusters in this analysis were used as references for RFMix. The unsupervised cluster labels--African, East Asian, South Asian, Oceanian, European, West Asian, and Indigenous American--were chosen to reflect the biogeographic origin of the reference samples assigned to each cluster. Local genetic ancestry assignments along the genome were then summed to create overall genetic ancestry proportions for each sample which were used for barplots, covariates for regression analyses, and for making individual genetic ancestry assignments. Individual genetic ancestry labels (e.g. for determination of enrichment in cases vs. controls (Figure 2A) , association with viral clades ( Figure 2C ) and controlling HLA associations with severe disease by genetic ancestry) were assigned based on these overall proportions via the following decision sequence: some Oceanian ancestry (Pacific Islanders) > 5%, some Indigenous American ancestry > 10%, West Asian > 50%, South Asian > 50%, East Asian > 50%, European > 50%, African > 50%. For individuals meeting none of these criteria, and ancestry label was assigned consisting of the two predominant ancestries (e.g. East Asian and European in Figure 2C ). Admixture mapping association analyses were used to regress the residual of severity of COVID symptoms for each patient--after correcting for associations with overall genetic ancestry proportion, BMI, sex, and age--against the local ancestry of each particular window of the genome for that patient. 48 With the genome subdivided into 19,474 windows for local genomic ancestry assignment, and assuming complete independence between each, a naive Bonferroni corrected p-value of 2.57*10^-6 is obtained for genome-wide significance at p=.05; however, the genomic ancestry of neighboring, linked genomic windows is not independent and depends upon the characteristic length of each ancestry segment distribution, itself a function of the time since admixture in each population. A less stringent multiple-test correction factor that incorporates this distribution was empirically determined by considering the spectral density evaluated at frequency zero of an autoregressive model of local ancestries 17 As the success of the DNA sequencing is dependent on the initial target amplification and the subsequent library preparation, the following changes were made to the manufacturer's protocol: 1) increased input DNA volume to 8.6 µl while maintaining the manufacturer's recommended multiplex PCR protocol per sample; 2) eluted DNA in 12 µl of suspension buffer after the initial amplicon purification, and proceeded to the library preparation without normalization process; 3) increased the number of thermal cycles to 17 in the final DNA library amplification; 4) eluted DNA fragments in 22 µl for DNA sequencing. 500 µl of 1.3 pM DNA sequencing library was loaded into a MiniSeq Mid Output Kit (300-cycles) (FC-420-1004), and sequenced using MiniSeq DNA sequencer (Illumina Inc., San Diego, CA). A total of 429 subjects (301 cases, 128 SARS-CoV-2 negative controls) yielded interpretable sequence reads to generate HLA genotypes. We supplemented these samples with sequencing of buffy coat or whole blood collected from high severity COVID19 patients collected from hospitalized patients (n=193). Fastq files were automatically imported into the TypeStream Visual NGS Analysis Software Version 2.0 upon the completion of DNA sequencing, and bioinformatically processed for DNA sequence assembly and HLA genotype assignments with IPD IMGT/HLA Database release version 3.39.0. 49 We modified the software setting so that a maximum of 1.5 million sequences or 750,000 paired-end sequences are used for the sequence assembly and HLA allele assignments. We visually inspected the HLA genotype calls by the software, and made corrections as needed. The approved HLA genotype results were exported in Histoimmunogenetics Markup Language (HML) format, 50 and generated comma separated value (CSV) reports for HLA genotypes, HLA serotypes including Bw4 and Bw6, KIR ligands (C1 and C2) and imputed HLA haplotypes. 51, 52 Subjects were grouped in three categories (Negative: SS_MAX = 0; Mild: SS_MAX = 1 -3; Severe: SS_MAX = 4 -8), and organized in six broad ancestry groups [European (EUR), Hispanic (HIS), Asian (ASI), African American (AFA), Native American (NAM) and Native Hawaiian/Pacific Islander (HPI) ] based on self-reported ethnicity in clinical records. When self-reported ethnicity was not available, genetic ancestry calculated from the low pass WGS in this study was used as described above. We converted the genetic ancestry information to self-reported medical record ethnicity format as follows: European and West Asia => EUR; some Indigenous American and Hispanic => HIS; East Asian and South Asian = ASI; African => AFA; fully Indigenous American => NAM; Oceanian => HPI. We compared the distribution of both HLA serotypes and alleles from COVID19 positive individuals with low disease severity (maximum severity score 1-3, n=336) to those with high disease severity (maximum severity score 4-8, n=94). HLA serotype and allele frequencies were calculated in both Mild and Severe groups, and Odd Ratio (OR: Mild vs. Severe) and p-values were calculated for each serotype and allele using Bridging ImmunoGenomic Data-Analysis Workflow Gaps (BIGDAWG). 53 Cochran-Mantel-Haenszel (CMH) tests 54 were subsequently performed for all observed HLA-A, -B, -C, -DRB1, -DQB1 and -DPB1 serotypes and alleles across three major ethnic groups (EUR, HIS and ASI) using the "mantelhaen.test" function in stats R package. Subjects with AFA, NAM and HPI ethnic groups were excluded from CMH tests, because we had only 6, 1 and 5 subjects, respectively, that yielded HLA genotypes. For Bayesian inference of the viral phylogeny, we assumed the Extended Bayesian Skyline Plot 55 prior on the effective population size and coalescent prior on the phylogeny, a fixed molecular clock with a uniform prior distribution centered at 8x10 -4 substitutions per site per year as done in 56 . We assumed the HKY mutation model 57 with default hyperparameter priors in the BEAST2 software 58 . We ran a Markov chain Monte Carlo chain to approximate the posterior distribution of the model parameters for 20 million iterations and thinned every 5000 iterations. The first 10% of samples were discarded as burn-in. We used Tracer 59 to assess the convergence and confirm that the effective sample size (ESS) was >120 for all parameters (except in 15% of effective population size parameters, estimations not shown). Finally, we used TreeAnnotator 60 to summarize the phylogeny posterior distribution and generated the maximum clade credibility tree of Figure S3 . To test the association between clade composition and binary traits, we used the R package treeSeg. 61 Genetic mechanisms of critical illness in COVID-19 Genomewide Association Study of Severe Covid-19 with Respiratory Failure & COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19 The major genetic risk factor for severe COVID-19 is inherited from Neanderthals Hospitalization and Mortality among Black Patients and White Patients with Covid-19 SARS-CoV-2 Positivity Rate for Latinos in the Community-Level Factors Associated With Racial And Ethnic Disparities In COVID-19 Rates In Massachusetts: Study examines community-level factors associated with racial and ethnic disparities in COVID-19 rates in Massachusetts Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study Low coverage whole genome sequencing enables accurate assessment of common variants and calculation of genome-wide polygenic scores Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants Scaling accurate genetic variant discovery to tens of thousands of samples Efficient phasing and imputation of low-coverage sequencing data using large reference panels STRIDE--An integrated standardsbased translational research informatics platform Coronavirus Disease 2019 (COVID-19) in Italy Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China Joint ancestry and association testing in admixed individuals Genome-wide interaction analysis of pathological hallmarks in Alzheimer's disease DIANA-LncBase v2: indexing microRNA targets on non-coding transcripts Analysis of the Human Tissue-specific Expression by Genome-wide Integration of Transcriptomics and Antibody-based Proteomics miRTarBase: a database curates experimentally validated microRNA-target interactions Interplay between SARS-CoV-2 and human long non-coding RNAs Diverse roles of long non-coding RNAs in viral diseases Data, disease and diplomacy: GISAID's innovative contribution to global health First comprehensively documented case of Paracoccus yeei infection in a human Case of Paracoccus yeei infection documented in a transplanted heart Paracoccus yeei as a cause of peritoneal dialysis peritonitis in the United Kingdom Paracoccus yeei: a new unusual opportunistic bacterium in ambulatory peritoneal dialysis Unbiased Screens Show CD8 T Cells of COVID-19 Patients Recognize Shared Epitopes in SARS-CoV-2 that Largely Reside outside the Spike Protein Human-Leukocyte Antigen Class I Cw 1502 and Class II DR 0301 Genotypes Are Associated with Resistance to Severe Acute Respiratory Syndrome (SARS) Immunoinformatic approach to assess SARS-CoV-2 protein S epitopes recognised by the most frequent MHC-I alleles in the Brazilian population Racial/Ethnic Disparities In COVID-19 Exposure Risk, Testing, And Cases At The Subcounty Level In California Pan-ancestry exome-wide association analyses of COVID-19 outcomes in 586,157 individuals Sample Pooling as a Strategy to Detect Community Transmission of SARS-CoV-2 Metagenomic sequencing with spiked primer enrichment for viral diagnostics and genomic surveillance Minimap2: pairwise alignment for nucleotide sequences Improved metagenomic analysis with Kraken 2 An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar The Sequence Alignment/Map format and SAMtools A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data Insights into human genetic variation and population history from 929 diverse genomes The Simons Genome Diversity Project: 300 genomes from 142 diverse populations Second-generation PLINK: rising to the challenge of larger and richer datasets PLINK: a tool set for whole-genome association and population-based linkage analyses Robust relationship inference in genome-wide association studies RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference Enhancements to the ADMIXTURE algorithm for individual ancestry estimation Admixture mapping reveals evidence of differential multiple sclerosis risk by genetic ancestry The IPD and IMGT/HLA database: allele variant databases Histoimmunogenetics Markup Language 1.0: Reporting next generation sequencing-based HLA and KIR genotyping Six-locus high resolution HLA haplotype frequencies derived from mixed-resolution DNA typing for the entire US donor registry HLA Haplotype Validator for quality assessments of HLA typing Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline Statistical aspects of the analysis of data from retrospective studies of disease Bayesian inference of population size history from multiple loci The origin and early spread of SARS-CoV-2 in Europe Dating of the human-ape splitting by a molecular clock of mitochondrial DNA BEAST 2: a software platform for Bayesian evolutionary analysis Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7 Testing for dependence on tree structures Specific Class I HLA Supertypes but Not HLA Zygosity or Expression Are Associated with Outcomes following HLA-Matched Allogeneic Hematopoietic Cell Transplant: HLA Supertypes Impact Allogeneic HCT Outcomes The HLA-C*15:02 allele is common in some Native American tribes from North, Central and South America. The alleles of HLA-C can be classified into the C2 and C1 groups according their interaction with the inhibitory receptors KIR2DL1 and KIR2DL2/3 present in NK cells, respectively; the allele HLA-C*15:02 belongs to the C2 group. The current study found no significant associations of other HLA-C alleles of the C2 group or of other HLA ligands of KIR receptors suggesting that the broad KIR/HLA ligand interactions are not major determinants of disease course. We could not confirm the risk reduction in the prototype allele HLA-B*15:01 for the serotype B62, because we observed other that determines what type of peptide residue at the second position is anchored. A recent study in outcomes of hematopoietic stem cell transplantation found that the HLA-B62 supertype was associated with decreased transplant-related mortality in the entire patient cohort. 62 It can be speculated that HLA-B62 may play a role in immune response; the associations identified in the present study lend support to the hypothesis that peptide presentation by alleles of the B62 serotype in addition to the HLA-C*15:02 allele may contribute to beneficial outcomes in SARS-CoV-2 Disease. Supplementary Table 1