key: cord-0314677-kmvdzil1 authors: Yoo, S.; Garg, E.; Elliott, L. T.; Hung, R. J.; Halevy, A. R.; Brooks, J. D.; Bull, S. B.; Gagnon, F.; Greenwood, C. M.; Lawless, J. F.; Paterson, A. D.; Sun, L.; Zawati, M. H.; Lerner-Ellis, J.; Abraham, R. J.; Birol, I.; Bourque, G.; Garant, J.-M.; Gosselin, C.; Li, J.; Whitney, J.; Thiruvahindrapuram, B.; Herbrick, J.-A.; Lorenti, M.; Reuter, M. S.; Liu, S.; Allen, U.; Bernier, F. P.; Biggs, C. M.; Cheung, A. M.; Cowan, J.; Herridge, M.; Maslove, D. M.; Modi, B. P.; Mooser, V.; Morris, S. K.; Ostrowski, M.; Parekh, R. S.; Pfeffer, G.; Suchowersky, O.; Taher, J.; Turvey, S. E.; Upton, J.; Wa, title: HostSeq: A Canadian Whole Genome Sequencing and Clinical Data Resource date: 2022-05-10 journal: nan DOI: 10.1101/2022.05.06.22274627 sha: aeebe8da52a8a84e82a96b0c15ce1564b4043a7b doc_id: 314677 cord_uid: kmvdzil1 HostSeq was launched in April 2020 as a national initiative to integrate whole genome sequencing data from 10,000 Canadians infected with SARS-CoV-2 with clinical information related to their disease experience. The mandate of HostSeq is to support the Canadian and international research communities in their efforts to understand the risk factors for disease and associated health outcomes and support the development of interventions such as vaccines and therapeutics. HostSeq is a collaboration among 13 independent epidemiological studies of SARS-CoV-2 across five provinces in Canada. Aggregated data collected by HostSeq are made available to the public through two data portals: a phenotype portal showing summaries of major variables and their distributions, and a variant search portal enabling queries in a genomic region. Individual-level data is available to the global research community through a Data Access Agreement and Data Access Compliance Office approval. Here we provide an overview of the collective project design along with summary level information for HostSeq. 
We highlight several statistical considerations for researchers using the HostSeq platform regarding data aggregation, sampling mechanism, covariate adjustment, and X chromosome analysis. In addition to serving as a rich data source, the diversity of study designs, sample sizes, and research objectives among the participating studies provides unique opportunities for the research community.
Following exposure to SARS-CoV-2 (the virus that causes COVID-19), some individuals remain disease- or symptom-free while others develop a spectrum of symptoms from mild to severe with the potential for fatal outcomes (1). This variability in response to exposure suggests that susceptibility is mediated at least in part by host genetic factors (2). Genetic factors have been associated with acquisition and severity of other viral infections (3-7), including SARS-CoV-1 (8, 9). Initial work also suggests a role for host genetics in SARS-CoV-2. However, given the novelty of the SARS-CoV-2 virus and the challenges of identifying genetic contributors in a changing environment (2), the specific variations contributing to infection susceptibility and illness severity remain largely unknown (10-13). These variations could include rare or common variations of any form including single nucleotide and structural variations. In 2020, several countries launched efforts to identify the genetic factors affecting COVID-19 outcomes to support diagnostics, therapy and vaccine development. However, Canada was not poised to do so because, although population-based cohorts exist (14, 15), a national whole genome sequencing cohort broadly consented for research and translation, and linked to rich clinical and public health data, did not exist at the onset of the global pandemic. Here we describe the development of this national platform to address pressing questions concerning COVID-19 and other health outcomes in Canada.
In April 2020, as part of the Canadian pandemic response, Genome Canada (a not-for-profit organization funded by the Government of Canada) launched the Canadian COVID-19 Genomics Network (CanCOGeN) (16). CanCOGeN established a coordinated pan-Canadian network of studies in collaboration with Canada's national platform for genome sequencing and analysis (CGEn). Beginning June 2020, CGEn developed HostSeq: a national databank of independent clinical and epidemiological studies enrolling SARS-CoV-2-infected participants across Canada. The goal of HostSeq is to create a data repository with whole genome sequencing and harmonized clinical information, including comorbidities, for 10,000 Canadians. With the launch of HostSeq, investigators can now begin to address questions of genetic susceptibility to SARS-CoV-2 infection and outcomes from the Canadian perspective. The approvals in place to link HostSeq to other local, provincial or national data resources expand the utility of the resource, including for the study of genetic susceptibility in relation to future implications of SARS-CoV-2 infection. Further, summary statistics from association studies of HostSeq have been contributed to, and are aligned with, international efforts including the COVID-19 Host Genetics Initiative (HGI) (17) and the COVID Human Genetic Effort (https://www.covidhge.com/). Most importantly, we have established the research project infrastructure necessary for future pan-Canadian genome sequencing studies. In this resource paper introducing the HostSeq Databank, we present its design characteristics, high-level analytic considerations pertaining to it, and the research opportunities this rich resource provides.
HostSeq (Figure 1) is a project representing a consortium of investigator-initiated SARS-CoV-2-related research studies across Canada.
Each partner study was required to adhere to core consent elements (Table S1 ), contribute blood (or in rare cases saliva) samples for whole genome sequencing, and provide clinical information using a standardized case report form (Table S2) . Within these studies, eligible participants include individuals of any age with a positive SARS-CoV-2 test performed by any Health Canada approved method. In some studies, suspected cases with clinically assessed COVID-19-related symptoms but without a positive test diagnosis were also included. Within the primary studies, each participant consented to use of their whole genome sequence for future research (18) . Participants also consented to the update, linkage and collection of their data from medical records and charts, as well as from administrative databases, and the deposition of data in a cloud-based, access-controlled databank which can be shared with approved researchers including international and commercial researchers. Additionally, participants had the option to consent to be re-contacted for updates or additional health information, or to invitations to participate in new research. Informed consent was obtained from individuals at each of the participating study sites. For the HostSeq Databank, approval was sought from the study's Research Ethics Board (REB) for inclusion in HostSeq. The HostSeq Databank shares data with the global research community following review and approval by the HostSeq-independent Data Access Compliance Office (DACO). The DACO verifies that the proposed research has REB approval from their host institution and conforms to HostSeq's REB-approved SARS-CoV-2 and other health outcome research. DACO-approved researchers sign inter-institutional legal agreements, which outline how the shared data is to be used, stored, and privacy protected. 
All HostSeq samples undergo whole genome sequencing in a standardized fashion at one of the three CGEn nodes: Toronto (The Centre for Applied Genomics at The Hospital for Sick Children), Montréal (McGill Genome Centre at McGill University), and Vancouver (Canada's Michael Smith Genome Sciences Centre) on the Illumina NovaSeq 6000 platform at 30X depth. Prior to sequencing, quality assurance is performed at multiple stages throughout the process (19). Concordance of the genotyping pipeline among sequencing sites is verified using the Ashkenazi trio set from the Genome in a Bottle Consortium (20).
1 Aspects of graphics acquired from Wikimedia Commons.
This preprint (which was not certified by peer review) was posted May 10, 2022. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Sequenced samples are analyzed jointly using an in-house pipeline encoded in Nextflow and containerized using Docker. The Genome Reference Consortium human build 38 (GRCh38 assembly version GCA_000001405.15) reference genome that includes the alternative HLA decoy genes is used. Genomes are processed following the Best Practices guidelines of the Genome Analysis ToolKit (GATK v4.2.0). This includes alignment of sequences to the reference genome, and the genotyping of each sample individually followed by joint-calling of all genotypes together (21). Sequences are aligned to the reference genome using Burrows-Wheeler Aligner (BWA v0.7.17) (22) and sorted with Picard tools (v2.25.0), and bases are recalibrated using the Base Quality Score Recalibration (BQSR) of GATK. GATK HaplotypeCaller is used in DRAGEN mode on diploid samples for short variant discovery. Aligned sequences are thus converted to genomic Variant Call Format (gVCF) files, which are then filtered and imported to a GATK GenomicsDB for joint-calling using the GATK GenotypeGVCFs tool.
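The per-sample and joint-calling steps described above can be sketched as a sequence of GATK invocations. The following is only an illustrative outline, not the HostSeq pipeline itself (which is implemented in Nextflow): the helper names, file paths and the single interval are hypothetical, and other steps described in the text (alignment, base recalibration, gVCF filtering) are omitted. The functions only assemble argument lists and do not run any tool.

```python
from typing import List


def haplotypecaller_cmd(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Per-sample short variant discovery, emitting a gVCF in DRAGEN mode."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", out_gvcf,
        "-ERC", "GVCF",    # emit reference confidence for later joint-calling
        "--dragen-mode",   # DRAGEN-GATK mode of HaplotypeCaller
    ]


def genomicsdb_import_cmd(gvcfs: List[str], workspace: str, interval: str) -> List[str]:
    """Import per-sample gVCFs into a GenomicsDB workspace for one interval."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd


def genotype_gvcfs_cmd(reference: str, workspace: str, out_vcf: str) -> List[str]:
    """Joint-call all samples stored in the GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-O", out_vcf,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    ref = "GRCh38.fa"
    samples = ["s1", "s2"]
    for s in samples:
        print(" ".join(haplotypecaller_cmd(ref, f"{s}.bam", f"{s}.g.vcf.gz")))
    gvcfs = [f"{s}.g.vcf.gz" for s in samples]
    print(" ".join(genomicsdb_import_cmd(gvcfs, "hostseq_db", "chr21")))
    print(" ".join(genotype_gvcfs_cmd(ref, "hostseq_db", "joint.vcf.gz")))
```

In practice the import and joint-calling steps would be repeated over many intervals and the results merged; a workflow manager handles that scatter and gather.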
We perform HLA Class I typing using OptiType software (v1.3.1) (23); perform housekeeping with bcftools (v1.11) and samtools (v1.14) (24) ; check for sample contamination using VerifyBamID2 (v2.0.1) (25) ; check agreement between reported sex-at-birth and sex chromosome composition using PLINK software (v1.90) (26) ; and predict ancestry admixture (27) and relatedness (28) using Genetic Relationship and Fingerprinting software (GRAF v2.4). We use PLINK (v2.00) (29) for genetic analysis, R (3.6.3) (30) for data analysis, and snakemake (v7.2.1) (31) for internal workflow management. Additionally, we compare the genetic principal components of HostSeq with the 1000 Genomes Project reference populations (32, 33) following the guidelines of plinkQC (34) . Samples are excluded based on the following checks ( Figure S1 ): (i) genotyping call rate < 95%, (ii) sex chromosome composition and reported sex-at-birth mismatch, (iii) samples identified as duplicates, (iv) possibly mislabelled samples, (v) sample contamination rate > 3%, and (vi) mean coverage < 10. The whole genome sequence data are readily available in joint VCF format (aligned sequences can be made available on request). As of April 6, 2022, 13 participating studies contributed data and biospecimens to HostSeq (Table S3) . Although all 13 studies continue collecting clinical information, 6 have completed their participant recruitment. To date, we have harmonized data from 12 of those studies. The participating studies are predominantly prospective SARS-CoV-2 studies based in hospitals, and are seeking to identify genetic factors that contribute to varying COVID-19 outcomes. Here we summarize characteristics of the 12 harmonized studies. 
Three studies (genMARK, Alberta Childhood COVID-19 Cohort ("AB3C"), and Genomic Determinants of COVID-19 ("GD-COVID")) are using a case-control design, in which laboratory-confirmed COVID-19 cases are matched with controls (see Table S3 for matching factors and control eligibility). One study, the Quebec COVID-19 Biobank ("BQC19"), collected clinical data and biospecimens from 12 hospitals in Quebec (35). The remaining studies are case-cohorts with patients who have either a confirmed or a suspected diagnosis of COVID-19. From these studies, the HostSeq Databank includes data from study subjects on demographics, comorbidities, and the assessment and treatment provided for COVID-19. Study-specific sample sizes currently range from 9 to 3,498. In total, the 13 studies have contributed 6,098 clinical records and submitted 9,097 samples (Table 1). We anticipate 1,000 more samples by Summer 2022. With the exception of two studies that have recruitment across multiple provinces (CANCOV, CONCOR-Donor; n=2,202), most studies are province-specific: six studies in Ontario (GENCOV, GenOMICC, SCB, LEFT-GEN, genMARK, Understanding Immunity to Coronaviruses; n=2,539), one in Quebec (BQC19; n=3,498), two in Alberta (AB3C, AB-HGS; n=262) and two in British Columbia (GD-COVID, Host Factors; n=596). Table S3 summarizes their research objectives and study designs. Detailed information for each study is also provided on the CGEn website (https://www.cgen.ca/hostseq-studies-2).
Clinical data from the participating studies are systematically harmonized by the HostSeq team in an ongoing process. In the first stage, we verify the raw data by checking for missingness, consistency, inadmissible values, and aberrant values across the variables.
In the second stage, we harmonize the data guided by a set of common definitions and rules, including application of uniform classification, coding, and measurement units specified in the HostSeq Codebook (and available through the HostSeq Phenotype Portal described below in HostSeq Data Portals). Any potential data errors detected in the harmonization process are communicated with the participating study teams and resolved by follow-up clarifications. The results discussed in this section are based on approximately 60% of the total expected cohort size of 10,000 participants. Although completeness varies across studies, we have achieved over 70% completeness of key variables capturing demographics, comorbidities, healthcare use, and patient outcome. Among the 6,098 currently available samples, HostSeq has 53.4% females and 44.1% males (and the remaining 2.5% are missing reported sex-at-birth), with an overall mean age (at recruitment) of 49.4 years. Distributions of sex and age vary across the studies (Table S4) . Apart from studies including pediatric participants (AB3C, SCB), mean age in the studies ranges from 36.7 years (genMARK) to 61.0 years (GenOMICC). Underlying health conditions are collected in all studies, but using a variety of collection methods (medical chart reviews, participant surveys, and patient interviews). A total of 24 comorbidity variables across cardiovascular, respiratory, immunological, neurological systems, and other pathologies are collected in HostSeq. Distributions of comorbidities across the studies are available through the HostSeq Phenotype Portal. While approximately half of the HostSeq participants were hospitalized and half were assessed in outpatient or community settings, the proportion of hospitalized versus non-hospitalized patients varied substantially across the studies. 
In all but one study (GenOMICC), participants presented predominantly with mild or moderate symptoms and did not require admission to intensive care units or invasive ventilation support. Of the hospitalized patients, 54.2% were discharged home, 15.6% were transferred to other hospitals or healthcare settings (e.g., rehabilitation centers or long-term care facilities) and 10.6% were reported deceased (Table 2).
HostSeq provides public access to two data portals: (1) the Phenotype Portal shows summaries for the major variables of the HostSeq harmonized clinical data; and (2) the Variant Search Portal enables queries in a genomic region to see all variants and their alleles identified in the HostSeq genomes. Both portals are static platforms that are updated periodically when a new release version of their respective data is available. The HostSeq Phenotype Portal (https://hostseq.ca/phenotypes.html) provides information for clinical variables at aggregate and study-specific levels. Users can access variables by category (e.g., demographics, comorbidities, complications) and view their distributions (categorical variables are presented as boxplots, and numerical variables are presented as histograms and violin plots). Displays are limited to variables with ≥ 70% completeness. Researchers can also find links to the HostSeq study protocol and up-to-date data dictionaries on this portal.
The HostSeq Variant Search Portal (https://hostseq.ca/dashboard/variants-search) allows for queries of the HostSeq genetic data. The primary querying functionality is supported by the CanDIG-server (36), a platform enabling federated querying of genomics data. Beacon APIs (37) from the Global Alliance for Genomics and Health (GA4GH) are also built in to allow HostSeq to join the federated Beacon network. Users can query information about a specific allele of interest. Information about the variants that can be queried includes their position and alleles and the respective internal frequencies of the alleles (minor allele frequencies are reported if they exceed 0.1). All columns in the table can be sorted and filtered.
Results reported in this section are based on an interim joint-called set of 3,063 HostSeq genomes, of which 2,992 passed all quality checks (see Methods), and may not be reflective of the final HostSeq Databank. Our predicted population structure covers five major ancestry groups (Figures 2 and S2-3; 73% European, 6% Latin American, 7% East Asian, 6% South Asian, 5% African, and approximately 2% uncategorized) and closely matches self-reported ancestries (where available). Additionally, there are 64 and 77 pairs of first- and second-degree relationships, respectively. Currently HostSeq provides 114.2 million short variants consisting of single nucleotide variants and indels. We report HLA Class I haplotypes for three loci (HLA-A, HLA-B and HLA-C) with bi-allelic typing at 4-digit resolution (allele group with specific alleles). The numbers of unique alleles for HLA-A, HLA-B and HLA-C are 64, 130 and 45, respectively (the most common alleles per locus are HLA-A*02:01, HLA-B*07:02 and HLA-C*07:01).
HostSeq provides unique opportunities to explore genetics among SARS-CoV-2-positive individuals in Canada, and it facilitates organizational governance and oversight for researchers in Canada and beyond. Even though the participating studies in HostSeq are heterogeneous with different designs and objectives (Table 3 and Table S3), HostSeq is an opportunity to leverage that diversity to address research questions. Several issues need to be considered when analysing HostSeq data in a given research context. For example: (1) whether data from different studies should be analysed separately or combined (and how to combine those data); (2) the selection strategies used by the contributing studies to recruit participants; (3) adjustment of covariates for association tests with genetic variants; and (4) the details of X chromosome analysis.
Whether an investigator's research question would be best answered by within-study comparisons or analyses including multiple studies will require careful consideration of participant ascertainment criteria. For example, comorbidities might be analyzed within-study then combined via a meta-analysis to account for differences in study designs among the contributing studies. In contrast, for the disease severity indicated by hospitalization duration, it may be appropriate to jointly analyze the subset of studies that focus on in-patient recruitment. Table 3 provides details for the recruitment aspects that may frame such research questions.
For example, to compare the genetics of hospitalized patients to non-hospitalized patients within the same study, data from AB3C, BQC19, CANCOV, GENCOV, genMARK, LEFT-GEN and SCB could be used. To compare ICU patients to non-ICU hospitalized patients, Host Factors, BQC19, CANCOV, GENCOV and SCB could contribute. Given the heterogeneity of the studies in HostSeq, the best approach for certain outcomes may be to analyse relevant studies individually. The feasibility of combining estimates or test results from separate studies, as in meta-analyses, depends on whether the individual studies measure and estimate the same features. The appropriateness of a joint analysis of participant data from multiple studies in an overarching model (perhaps with inclusion of study effects) also depends on whether the studies measure those same features. Although the combination of study-level estimates or tests can be as efficient as joint analysis in large samples (38), meta-analysis of summary data can be less efficient in finite samples. When individual data are available, joint analysis is recommended, incorporating sparse-data methods for variants with low minor allele counts and outcomes with low prevalence (39, 40). Furthermore, with study or environmental factors and other sources of heterogeneity, joint analysis can exploit gene-environment interaction (41) and give insight into sources of within- and between-study variation. Given the dynamic nature of the COVID-19 pandemic, temporal and spatial variation within and between studies is another source of heterogeneity that is challenging and deserves consideration. Studies with prolonged recruitment and wide variation in dates of infection may allow such factors to be examined.
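As a simple illustration of combining study-level estimates, the sketch below pools per-study effect estimates for a single variant with fixed-effect inverse-variance weighting. This is one common meta-analysis scheme chosen here for illustration, not a method prescribed for HostSeq; the function name and the numeric inputs are hypothetical, and the approach assumes that every study estimates the same underlying effect.

```python
import math
from typing import Sequence, Tuple


def inverse_variance_meta(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float]:
    """Fixed-effect inverse-variance pooling of per-study estimates.

    betas: per-study effect estimates (e.g., log odds ratios)
    ses:   their standard errors
    Returns the pooled estimate and its standard error.
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / se**2 for se in ses]          # precision of each study
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, math.sqrt(1.0 / total)


if __name__ == "__main__":
    # Two hypothetical studies with equal precision.
    beta, se = inverse_variance_meta([0.2, 0.4], [0.1, 0.1])
    print(f"pooled beta = {beta:.3f}, se = {se:.4f}")
```

A random-effects model, or a joint analysis with study effects as discussed above, would be the natural alternative when between-study heterogeneity is expected.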
When looking across the participating HostSeq studies, it may be of interest to examine changes in the profiles of recruited patients as the seropositivity rates and vaccination rates changed with time across Canada and as treatments changed and improved (for example, by combining HostSeq data with serological studies).
Most of the participating studies are designed to include individuals who tested positive for SARS-CoV-2 at a participating institution or individuals who volunteered to donate blood and previously had a positive test. For such participants, it can be difficult to specify exactly what population they represent. To reduce bias and improve interpretation of results, the processes by which individuals join a given study need to be considered (42). Here, we interpret bias relative to the effect of a variable (genetic or otherwise) in a target population. If an analysis is to involve an outcome variable (e.g., hospitalized versus not hospitalized), a genetic variable of interest and some additional covariates, then the validity of standard statistical methods is linked to how the sample inclusion depends on the outcome. Such dependence occurs in response-selective designs in which individuals are included in a study according to the values of an outcome (43-45). Except for the simple case-control setting, weighting or conditional estimation is needed to avoid estimation bias of the genetic association. Such methods require estimation or specification of the probability of being selected for inclusion. We encourage analyses that address study sample selection mechanisms.
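To make the weighting idea concrete, the following minimal sketch computes an inverse-probability-weighted odds ratio between carrier status and a binary outcome. It assumes that each participant's selection probability is known or has already been estimated, which in practice is the difficult part; the record layout and function name are hypothetical, and only a point estimate is produced (valid inference would additionally need a variance estimator that accounts for the weights).

```python
from typing import Iterable, Tuple

# (carrier of the variant, outcome observed, probability of being selected)
Record = Tuple[bool, bool, float]


def ipw_odds_ratio(records: Iterable[Record]) -> float:
    """Odds ratio from a 2x2 table whose cells are sums of 1/selection-probability."""
    table = {
        (True, True): 0.0,
        (True, False): 0.0,
        (False, True): 0.0,
        (False, False): 0.0,
    }
    for carrier, outcome, prob in records:
        if not 0.0 < prob <= 1.0:
            raise ValueError("selection probability must be in (0, 1]")
        # Each sampled participant stands in for 1/prob members of the target population.
        table[(carrier, outcome)] += 1.0 / prob
    denom = table[(True, False)] * table[(False, True)]
    if denom == 0.0:
        raise ValueError("odds ratio undefined: an off-diagonal cell is empty")
    return table[(True, True)] * table[(False, False)] / denom


if __name__ == "__main__":
    # Hypothetical sample in which selection depends on both outcome and carrier status.
    sample = [
        (True, True, 0.5),
        (True, True, 0.5),
        (True, False, 0.25),
        (False, True, 1.0),
        (False, False, 0.5),
    ]
    print(ipw_odds_ratio(sample))
```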
Methods to account explicitly for selection conditions are similar to methods used for the analysis of secondary traits in case-control studies (46, 47). From a methodological standpoint, we also encourage studies of bias and Type 1 error control when standard analyses are used (such as unweighted logistic regression). When the selection mechanism is not easily described, comparison of study samples to population or administrative data may provide insights. Finally, as HostSeq includes various ancestries, care must be taken to avoid confounding through population stratification (for example, by use of stratification, mixed models, and genetic principal components). This issue, like the issues related to the heterogeneity of participating studies, is not unique to HostSeq and arises in most collaborative multi-center or consortium-based research.
The choice of adjustment covariates in tests for association of outcome with a genetic variant is context dependent and open to discussion in many settings (48, 49). In testing for genetic associations with COVID-19 outcomes, one strategy would be to adjust for factors such as age and sex that may affect selection or the outcome in question but are not associated with the genetic variant (unless it is on the sex chromosomes, as discussed below in Sex difference and X Chromosome Analyses). We must also consider whether to adjust for factors such as comorbidities, which may be related both to the outcome and to the variant. This is of particular importance for severe COVID-19: in the ICU, 1-year mortality outcomes increase with each additional week spent in ICU, each decade in age, and each additional comorbid illness in the Charlson score (50). From a causal perspective, adjusting for multiple covariates without a clear conceptual framework could lead to adjustment for variables that lie on the causal pathway (51).
If there is a causal link from variant to outcome that passes through such a variable, then researchers could choose to test for either direct or indirect effects of the variant. As part of the process of learning about genetic effects on COVID-19 outcomes, we encourage analyses both with and without adjusting for such factors. For the discovery stage in genetic association studies, power considerations are important. There have been suggestions that adjusting for too many covariates decreases power (49, 52), and that two-phase strategies of genome-wide screening by simple analysis followed by targeted in-depth modelling are adequate and efficient. However, this is an area for which further study is warranted.
COVID-19 displays sexual dimorphism with greater severity in males (53-55). In addition to environmental exposures and sex-specific autosomal genetic effects, it is reasonable to hypothesize that some X chromosomal variants play a role in COVID-19 outcomes. Indeed, one gene on the X chromosome, the angiotensin-converting enzyme 2 (ACE2, Xp22.2), has been reported to be important in SARS-CoV-2 infection and genetic analysis has demonstrated association evidence with ACE2 variants (17). However, all published GWAS of SARS-CoV-2 susceptibility or COVID-19 severity, to the best of our knowledge, use the traditional genotype coding (0, 1 and 2 for a female; 0 and 2 for a male) that assumes X-inactivation through a dosage compensation model (i.e., with alleles in the non-pseudo-autosomal regions being expressed exactly half of the time in genetic females) (56).
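The two male coding conventions at issue can be written down explicitly. The sketch below is illustrative only: the function name and arguments are hypothetical, it applies to non-pseudo-autosomal X-linked variants, and it simply switches between the traditional coding (males 0 or 2, assuming X-inactivation) and the alternative coding (males 0 or 1, appropriate when a gene escapes X-inactivation).

```python
def code_x_genotype(alt_alleles: int, sex: str, escapes_inactivation: bool = False) -> int:
    """Additive genotype code for a non-pseudo-autosomal X-linked variant.

    alt_alleles: number of alternate alleles carried (0-2 for females, 0-1 for males)
    sex: "female" or "male" (genetic sex)
    escapes_inactivation: if True, code hemizygous males as 0/1 instead of 0/2
    """
    if sex == "female":
        if alt_alleles not in (0, 1, 2):
            raise ValueError("females carry 0, 1 or 2 alternate alleles")
        return alt_alleles
    if sex == "male":
        if alt_alleles not in (0, 1):
            raise ValueError("males are hemizygous and carry 0 or 1 alternate allele")
        # Under X-inactivation, one male allele is treated as equivalent to two female alleles.
        return alt_alleles if escapes_inactivation else 2 * alt_alleles
    raise ValueError("sex must be 'female' or 'male'")


if __name__ == "__main__":
    print(code_x_genotype(1, "male"))                             # traditional coding
    print(code_x_genotype(1, "male", escapes_inactivation=True))  # escape coding
    print(code_x_genotype(1, "female"))
```

Because the true inactivation status of a given locus is usually unknown, fitting a model under each coding and comparing or averaging the results is one way to see how sensitive an association is to this choice.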
Yet, it has been reported that close to one-third of the X chromosome genes can escape X-inactivation (57, 58); if so, the genotype of a male should be coded 0 and 1 by convention. To robustly deal with X-inactivation uncertainty we recommend the use of recent methods for genetic analysis of SARS-CoV-2 related research questions such as model averaging and selection (59, 60) and an easy-to-implement regression model (61). Rare X-chromosome variant analysis (62, 63) and X-inclusive polygenic risk scores also require careful consideration and further research.
People living in Canada are insured under single-payer health care systems administered at the provincial or territorial level. These systems broadly cover physician and hospital services, as well as procedures. This provides a unique opportunity to conduct passive follow-up to understand the short-term and long-term outcomes related to SARS-CoV-2 infection. Administrative health data are generated through patient contact with the health care systems and maintained in multiple databases that, with the appropriate approvals, can be linked using a unique encoded identifier to study-specific, patient-level data (including genetic data). These data are administrative or procedural (e.g., surgeries, emergency department visits, hospital visits, comorbidities, routine medical exams), clinical (e.g., prescription medications, cancer screening), laboratory (e.g., blood measurements), social (e.g., education, income), and environmental (e.g., rurality, walkability, food insecurity, exposure to air pollution). The participant informed consent used by HostSeq allows for linkage to these data, transforming the HostSeq dataset into a longitudinal study. Specifically, linkage to administrative provincial data will provide: 1) a
The copyright holder for this preprint this version posted May 10, 2022. retrospective, longitudinal account of medical histories, health system utilization and diagnoses; and 2) prospective, longitudinal follow-up tracking the natural history of SARS-CoV-2 infection including multisystem inflammatory syndrome in children (MIS-C) and Long COVID, identifying new diagnoses (e.g., diabetes, cancer), long-term health outcomes (e.g., premature mortality), and health resource utilization. Linkage of the HostSeq study samples to provincial administrative data offers opportunities to collect additional data on risk factors and longitudinal outcomes, and opportunities to extend genetic association analyses. Administrative data can also facilitate evaluation of the representativeness of study samples and inform future study design. The limitations of HostSeq data for investigation of specific scientific questions depend on limitations of the relevant participant studies. In addition, investigations that involve combining data or results from separate participant studies may require assumptions about comparability or heterogeneity; such assumptions should be scrutinized. This infrastructure can also be used for future epidemics. The unique features of the HostSeq project highlighted here present novel opportunities to develop, evaluate, and apply statistical methods that contribute to the understanding of genetic associations with COVID-19-related morbidity and mortality, as well as other phenotypes. The augmentation and linkage of the HostSeq questionnaire and genetic databank with other data resources is made possible by broad and flexible consent and will generate a dynamic population-based resource. This will allow for study of a broad range of research questions and sustained productivity over the years to come. Whole genome sequencing of the sample and the ongoing collection of clinical data from participant's medical records/chart, administrative databases, etc. 
International sharing of genetic and clinical data
Future health research on COVID-19 and other health outcomes
Use of genetic and clinical data for commercial purposes
Sharing of genetic and clinical data through a controlled-access mechanism
Storage of genetic and clinical data in the HostSeq Databank, on centralized Canadian cloud servers
Indefinite storage of genetic and clinical data
Not possible to withdraw data that has already been distributed and used
Low risk that the participant could be re-identified in the future
Optional re-contact to update personal information, provide new health information, or to be invited to participate in new research projects.

To identify the characteristics of the antibody response that result in maintained immune response and better patient outcomes; to determine impact of genetic differences on COVID-19 infection severity and immune response; to determine impact of different viral strains on antibody response and patient outcomes.
All outpatients (seen in ER and assessment centres) with mild symptoms as well as hospitalized patients with severe symptoms, recruited from six hospitals in Ontario.

To identify genetic determinants of severe, life-threatening COVID-19 infections.

To determine B cell and antibody immunity to coronaviruses; to identify candidate targets of the virus for vaccine development; to identify novel neutralizing antibodies to 2019-nCoV that can be used in therapeutics; to understand the innate immune response against 2019-nCoV; to identify host factors that play critical roles in developing strong immunity to coronaviruses.
Individuals 18 years old or older. Those with blood sample hemoglobin < 100 g/L, who are pregnant, or who have hemophilia or a platelet count < 50,000/µL are excluded.

To establish a central biobank of biological samples and related core patient-level data that will serve as a single point of access that supports a unified and collaborative approach to clinical and translational research related to the current COVID-19 pandemic and future works.
All patients presenting to the Hospital for Sick Children with a suspected or confirmed diagnosis of COVID-19, and family members of participants as applicable.

Identify host genetic susceptibility factors associated with requirement for hospitalisation due to COVID-19.
Family members who did not require hospitalisation were recruited as controls.
Confirmed SARS-CoV-2 infection (by provincial nucleic acid test)
Required admission to hospital (either medical units or critical care)
Age < 60 years
Unvaccinated participants only
Omicron variant cases were excluded
Figure S1. Quality of HostSeq genomes. (A) Missing rate < 5%. (B) Contamination rate < 3%. (C) Mean coverage > 10.

Figure S3. Genetic distance scores of HostSeq genomes. The four genetic distance (GD1-4) scores from GRAF-pop (see Methods) represent the distance of each genome from several reference populations and are used to predict ancestry. Barycentric coordinates of GD1 and GD2 are used to predict admixture proportions of African, East Asian and European ancestries.
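The barycentric-coordinate idea in the Figure S3 caption can be illustrated with a minimal sketch. It assumes each of the three reference ancestries sits at one vertex of a triangle in the (GD1, GD2) plane. The vertex positions, the function names, and the clipping of slightly negative weights are illustrative assumptions, not values or behaviour taken from GRAF-pop.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def barycentric(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Barycentric weights (wa, wb, wc) of point p relative to triangle abc."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("reference vertices are collinear")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def admixture_proportions(gd1: float, gd2: float,
                          vertices: Dict[str, Point]) -> Dict[str, float]:
    """Map a (GD1, GD2) point to proportions over three reference ancestries.

    Assumption: weights below zero (points just outside the triangle) are
    clipped to zero and the remainder renormalised to sum to one.
    """
    labels = list(vertices)
    if len(labels) != 3:
        raise ValueError("exactly three reference vertices are required")
    weights = barycentric((gd1, gd2), *(vertices[k] for k in labels))
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    return {k: w / total for k, w in zip(labels, clipped)}


# Hypothetical vertex positions, chosen only to make the example concrete.
REFERENCE_VERTICES: Dict[str, Point] = {
    "African": (0.0, 0.0),
    "East Asian": (1.0, 0.0),
    "European": (0.5, 1.0),
}

if __name__ == "__main__":
    print(admixture_proportions(0.5, 0.0, REFERENCE_VERTICES))
```

A genome lying exactly on a vertex receives weight one for that ancestry, and a genome on an edge is a mixture of the two ancestries at its endpoints. The actual reference positions and any handling of points outside the triangle should be taken from the GRAF-pop documentation.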
COVID-19 signs, symptoms and severity of disease: A clinician guide
Statistical power in COVID-19 case-control host genomic study design
CCR5Δ32 mutation and HIV infection: Basis for curative HIV therapy
Genetic susceptibility to human norovirus infection: An Update
Role of interleukin 28-B in the spontaneous and treatment-related clearance of HCV infection in patients with chronic HBV/HCV dual infection
Regulatory T cells inhibit T cell proliferation and decrease demyelination in mice chronically infected with a coronavirus
MERS-CoV infection in humans is associated with a pro-inflammatory Th1 and Th17 cytokine profile
Association of human-leukocyte-antigen class I (B*0703) and class II (DRB1*0301) genotypes with susceptibility and resistance to the development of severe acute respiratory syndrome
Association of HLA class I with severe acute respiratory syndrome coronavirus infection
Genetic mechanisms of critical illness in COVID-19
Whole genome sequencing reveals host factors underlying critical COVID-19
COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19
Human genetic and immunological determinants of critical COVID-19 pneumonia
Addressing privacy concerns in sharing viral sequences and minimum contextual data in a public repository during the COVID-19 pandemic
COVID-19 Host Genetics Initiative, Ganna A. Mapping the human genetic architecture of COVID-19: An update. medRxiv
Modeling consent in the time of COVID-19
A distributed whole genome sequencing benchmark study
Extensive sequencing of seven human genomes to characterize benchmark reference materials
Genomics in the cloud: Using Docker, GATK, and WDL in Terra
Fast and accurate short read alignment with Burrows-Wheeler transform
OptiType: Precision HLA typing from next-generation sequencing data
Twelve years of SAMtools and BCFtools
Ancestry-agnostic estimation of DNA sample contamination from sequence reads
PLINK: A tool set for whole-genome association and population-based linkage analyses
GRAF-pop: A fast distance-based method to infer subject ancestry from multiple genotype datasets without principal components analysis
Quickly identifying identical and closely related subjects in large databases using genotype data
Second-generation PLINK: Rising to the challenge of larger and richer datasets
R: A language and environment for statistical computing
Sustainable data analysis with Snakemake
Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes. bioRxiv
The 1000 Genomes Project Consortium. A global reference for human genetic variation
The Biobanque québécoise de la COVID-19 (BQC19) - A cohort to prospectively study the clinical and biological determinants of COVID-19 clinical trajectories
Federated network across Canada for multi-omic and health data discovery and analysis
Federated discovery and sharing of genomic data using Beacons
Proper analysis of secondary phenotype data in case-control association studies
Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants
Relative efficiency of using summary versus individual data in random-effects meta-analysis
Exploiting gene-environment interactions to detect genetic associations
Collider bias undermines our understanding of COVID-19 disease risk and severity
Analysis of sequence data under multivariate trait-dependent sampling
Semiparametric methods for response-selective and missing data problems in regression
Efficient association mapping of quantitative trait loci with selective genotyping
Genome-wide association scans for secondary traits using case-control samples
A flexible copula-based approach for the analysis of secondary phenotypes in ascertained samples
Biased estimates of treatment effect in randomized experiments with nonlinear regression and omitted covariates
Including known covariates can reduce power to detect genetic effects in case-control studies
One-year outcomes in survivors of the Acute Respiratory Distress Syndrome
Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals
Adjusting for heritable covariates can bias effect estimates in genome-wide association studies
Male sex identified by global COVID-19 meta-analysis as a risk factor for death and ITU admission
Sex differences in susceptibility, severity, and outcomes of coronavirus disease 2019: Cross-sectional analysis from a diverse US metropolitan area
Sex differences in severity and mortality from COVID-19: Are males more vulnerable?
Testing and estimation of X-chromosome SNP effects: Impact of model assumptions
Landscape of X chromosome inactivation across human tissues
Optimal tests for rare variant effects in sequencing association studies
Selection of X-chromosome inactivation model
Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study
The X factor: A robust and powerful approach to X-chromosome-inclusive whole-genome association studies
Pooled association tests for rare genetic variants: A review and some new results
Rare-variant association analysis: Study designs and statistical tests

We are very thankful for the support we received from Genome Canada, Innovation, Science and Economic Development Canada, SickKids Foundation, Toronto Academic Health Science Network (TAHSN), CFI, CIHR, COVID-19 Immunity Task Force (CITF), Ministry of Colleges and Universities COVID-19 Rapid Research Fund, Michael Smith Health Research BC, PHCRI Early Career Award, Genome Alberta and the Alberta Children's Hospital Foundation, and internal support from the Hotchkiss Brain Institute and Department of Clinical Neurosciences (University of Calgary). We also wish to express our gratitude to all HostSeq project participant studies and the individual participants within these studies for their contributions.