key: cord-151532-mpv2wegm
authors: Peng, Kerui; Safonova, Yana; Shugay, Mikhail; Popejoy, Alice; Rodriguez, Oscar; Breden, Felix; Brodin, Petter; Burkhardt, Amanda M.; Bustamante, Carlos; Cao-Lormeau, Van-Mai; Corcoran, Martin M.; Duffy, Darragh; Guajardo, Macarena Fuentes; Fujita, Ricardo; Greiff, Victor; Jonsson, Vanessa D.; Liu, Xiao; Quintana-Murci, Lluis; Rossetti, Maura; Xie, Jianming; Yaari, Gur; Zhang, Wei; Lees, William D.; Khatri, Purvesh; Alachkar, Houda; Scheepers, Cathrine; Watson, Corey T.; Hedestam, Gunilla B. Karlsson; Mangul, Serghei
title: Diversity in immunogenomics: the value and the challenge
date: 2020-10-20
journal: nan
DOI: nan
sha:
doc_id: 151532
cord_uid: mpv2wegm

With the recent advent of high-throughput sequencing technologies, and the associated new discoveries and developments, the fields of immunogenomics and adaptive immune receptor repertoire research are facing both opportunities and challenges. Most immunogenomics studies have been conducted in cohorts of European ancestry, restricting the ability to detect and analyze variation in human adaptive immune responses across populations and limiting their applications. By leveraging biological and clinical heterogeneity across different populations in omics data and expanding the populations that are included in immunogenomics research, we can enhance our understanding of human adaptive immune responses, promote the development of effective diagnostics and treatments, and eventually advance precision medicine.

analyses [2]. This limits the discovery of genetic diversity contributing to Mendelian diseases and the exploration of associations between genetic variants and trait variation across populations. In recent years, awareness of the limited generalizability of findings across populations has grown, motivating the inclusion of diverse, multiethnic populations in large-scale genomic studies [3,4].
For example, large-scale genomic studies in individuals of African descent identified novel single nucleotide polymorphisms (SNPs) clinically associated with warfarin dosing that had not been found in Europeans [5,6]. Whole-genome sequencing in individuals of African descent [7-9] and whole-exome sequencing in a southern African population [10] Additional efforts have been made through international collaborations to establish reference genome datasets and recommendations for research in diverse populations, including the GenomeAsia 100K Project, the Human Heredity and Health in Africa (H3Africa) initiative, and the Clinical Genome Resource (ClinGen) Ancestry and Diversity Working Group (ADWG) [13-16]. Khatri and colleagues discovered the 3-gene signature for diagnosis of tuberculosis based on transcriptome profiles of participants from 11 countries [17], a finding that was generalized to patient populations from every inhabited continent within 3 years [18,19]. Most importantly, the 3-gene signature has been clinically translated into a point-of-care test by Cepheid [20]. The inclusion of diverse populations in genomic studies has demonstrated benefits in the discovery and interpretation of gene-trait associations. Similarly, greater diversity in immunogenomics research will enable the discovery of novel genetic traits associated with immune system phenotypes that are common across populations. Broader inclusion of diverse populations may also enable researchers to address genetic heterogeneity in the context of translational research and clinical drug development, possibly revealing clinically relevant genomic signatures that are more prevalent in some populations than in others.
Immunogenomics is a field in which genetic information at different levels of biological organization (epigenetics, transcriptomics, metabolomics, cells, tissues, and clinical data) is characterized and used to understand the immune system and immune responses. Here Table 1 ). RNA-seq has traditionally been used to profile entire cell populations rather than to amplify specific regions. Given the complexity of the TCR and BCR genomic loci, the accurate determination of germline immune receptor genes from bulk RNA-seq or whole-genome sequencing has proved challenging [45]. Several computational methods show promise [46-48], but the mapping rate and accuracy remain to be improved. Additionally, a wide-scale comparison is needed between results obtained from methods for deriving germline receptor genes from RNA-seq studies, those obtained from established methods such as targeted PCR and sequencing of genomic DNA and the sequencing and assembly of bacterial artificial chromosome (BAC) and fosmid clones [49], and those from more recent methods such as inference from AIRR-seq repertoires [50]. Many population genetic differences have been observed in genomics studies, and immunogenomics is no exception [51-53]. The current public databases of adaptive immune receptor germline genes are essential for AIRR-seq analysis and immunogenomics studies. However, the most widely used reference database for immunogenetics data, the international ImMunoGeneTics information system (IMGT) [63], lacks a comprehensive set of human TCR and BCR alleles representing diverse populations worldwide. The same issue exists in HLA databases: over 70% of rare HLA variants from populations in Oceania and West Asia were found to be absent in the 1000 Genomes Project panel [64]. Further uncertainty arises because descriptions of sample populations in databases are often based on self-identified geography or ethnicity rather than on genetic ancestry.
However, progress has been made to address this issue in immunogenomics studies. For example, the AIRR Community, an international community formed to promote high standards in adaptive immune receptor repertoire research, introduced the Open Germline Receptor Database (OGRDB) in September 2019 as a resource platform for germline gene discovery and validation from AIRR-seq data, to enrich the IMGT database [65]. Collaborators in our team also created VDJbase, a platform for the inferred genotypes and haplotypes from AIRR-seq data [66]. These efforts provide opportunities to infer genetic ancestry. Nevertheless, the majority of available germline sequences either lack population-level annotations or are biased toward samples of European descent. We argue that this shortcoming must be addressed through focused efforts that seek to include more diverse populations in immunogenomics research. As an interdisciplinary group with expertise in biomedical and translational research, population biology, computational biology, and immunogenomics, we wish to raise awareness about the value of including diverse populations in AIRR-seq and immunogenetics research. In the areas of genetic disease research and cancer genomics, enhanced genetic diversity has led to demonstrable insights [67,68]. However, the field of immunogenomics has yet to benefit from a similar growth in diversity. At the current stage of the global COVID-19 pandemic, numerous vaccine trials are underway in many countries worldwide, offering opportunities to investigate genetic factors in vaccine responses. Yet this will require careful clinical study designs that can effectively address confounding factors such as environmental and socio-economic differences. HIV-1 [71], Zika [72], and SARS-CoV-2 [73]. We expect that vaccine and infection outcomes can also be shaped by genetic variability, including specific effects driven by immune-related genes [58].
Here, we make several recommendations for increasing diversity in immunogenomics research. First, we propose that the community should make a greater effort to include underrepresented populations in AIRR-seq and immunogenomics studies. Already, those who have conducted AIRR-seq in populations of non-European descent have uncovered evidence for extensive germline diversity. For example, in a study of South African HIV patients, Scheepers and colleagues discovered 123 immunoglobulin heavy chain variable region (IGHV) alleles that were not represented in IMGT [54]. This finding informed HIV vaccine design by clarifying the IGHV profile in the South African population. In a study in the Papua New Guinea population, one novel IGHV gene and 16 IGHV allelic variants were identified from AIRR-seq data [55]. These discoveries of novel alleles indicate the need to generate population-based AIRR-seq datasets. We do not recommend generalizing AIRR-seq findings to populations that are underrepresented in research, because missing variation and a lack of validation limit our ability to leverage AIRR-seq datasets in biomedical applications. Therefore, increasing population diversity in immunogenomics studies can lead to improvements in a wide range of applications, including drug discovery and development and vaccine design and development. Promoting precision medicine for underrepresented populations and improving predictions for treatment outcomes will become more feasible in the future with broader participation and inclusion.
Second, we argue that existing genomic datasets could be leveraged to augment IG/TCR germline databases and to inform the interpretation of AIRR-seq and immunogenomics studies across populations. Extracting population immunogenomics information from existing genomic datasets could be an effective strategy, as could the careful use of non-targeted sequencing data (e.g., RNA-seq) to examine the genetic diversity of samples.
Ancestry-associated genetic markers in short-read genome sequencing may help overcome the difficulties of relying on sample metadata in AIRR-seq datasets. This may also be time-efficient relative to waiting for the availability of sufficiently diverse AIRR-seq datasets. Researchers have used paired-end RNA-seq data from The Cancer Genome Atlas (TCGA) to infer complementarity-determining region 3 (CDR3) sequences of tumor-infiltrating T cells [74], and have applied a computational method to RNA-seq data from the Genotype-Tissue Expression (GTEx) Consortium to profile immunoglobulin repertoires [48]. Similar ideas could be adapted to the direct prediction of allelic variants from short-read genomic sequence data [75,76]. However, challenges need to be overcome, including the high levels of copy number variation and segmental duplication in the BCR and TCR loci, and the need for protocols to validate novel allelic variants gleaned from short-read sequencing data [45,77].
Finally, we see a need for additional infrastructure and expertise in regions and countries with populations underrepresented in research, and for enhanced collaborations between countries, both of which are critical to minimizing global health disparities. Online training sessions that are customized for conducting immunogenomics research in diverse populations would be beneficial to the biomedical community, perhaps especially in those regions. The content of these trainings might include participant recruitment strategies with a commitment to outreach and education to increase participation, sample collection methods, steps for running sequencing experiments on-site or in collaboration with other academic institutions or commercial companies, uploading sequencing data to appropriate repositories, and performing bioinformatics analyses. Virtual learning platforms for bioinformaticians have been established by members of our group, and these could be leveraged to provide such trainings [79].
Our interdisciplinary group consists of leading researchers from 13 countries, including the US, Canada, Norway, France, Sweden, Russia, the UK, Israel, China, South Africa, Chile, Peru, and French Polynesia. We share concerns about the lack of diversity in immunogenomics and embrace the need for engaging our combined efforts to tackle this challenge. To spearhead the enterprise of fostering diversity in the field, we have formed this task force with the aim of developing a global consortium on diversity in immunogenomics. This consortium will seek to promote inclusive, international, and interdisciplinary research, supported by transparent and open-source learning materials and datasets, with the goal of enhancing representation of diverse populations in immunogenomics.

The Missing Diversity in Human Genetic Studies
Don't ignore genetic data from minority populations
The road ahead in genetics and genomics
A scientometric review of genome-wide association studies
Genetic variants associated with warfarin dose in African-American individuals: a genome-wide association study
Novel CYP2C9 and VKORC1 gene variants associated with warfarin dosage variability in the South African black population
Assembly of a pan-genome from deep sequencing of 910 humans of African descent
The African Genome Variation Project shapes medical genetics in Africa
Whole-genome sequencing for an enhanced understanding of genetic variation among South Africans
Whole-Exome Sequencing Reveals Uncaptured Variation and Distinct Ancestry in the Southern African Population of Botswana
Multi-ancestry genome-wide gene-smoking interaction study of 387,272 individuals identifies new loci associated with serum lipids
Research capacity. Enabling the genomic revolution in Africa
The GenomeAsia 100K Project enables genetic discoveries across Asia
H3Africa: current perspectives
The clinical imperative for inclusivity: Race, ethnicity, and ancestry (REA) in genomics
Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis
Assessment of Validity of a Blood-Based 3-Gene Signature Score for Progression and Diagnosis of Tuberculosis, Disease Severity, and Treatment Response
Concise whole blood transcriptional signatures for incipient tuberculosis: a systematic review and patient-level pooled meta-analysis
Diagnostic accuracy study of a novel blood-based assay for identification of TB in people living with HIV
The science and medicine of human immunology
Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing
High-throughput sequencing of the zebrafish antibody repertoire
Individual variation in the germline Ig gene repertoire inferred from variable region gene rearrangements
PD-1 blockade induces responses by inhibiting adaptive immune resistance
Linking T-cell receptor sequence to functional phenotype at the single-cell level
Commonality despite exceptional diversity in the baseline human antibody repertoire
Augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires
IGHV1-69 polymorphism modulates anti-influenza antibody repertoires, correlates with IGHV utilization shifts and varies by ethnicity
Multi-Donor Longitudinal Antibody Repertoire Sequencing Reveals the Existence of Public Antibody Clonotypes in HIV-1 Infection
Aberrant B cell repertoire selection associated with HIV neutralizing antibody breadth
Exploiting B Cell Receptor Analyses to Inform on HIV-1 Vaccination Strategies. Vaccines (Basel) 8
Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire
B cells, plasma cells and antibody repertoires in the tumour microenvironment
High-Throughput Immunogenetics Reveals a Lack of Physiological T Cell Clusters in Patients With Autoimmune Cytopenias
Analysis of the B cell receptor repertoire in six immune-mediated diseases
T cell receptor β repertoires as novel diagnostic markers for systemic lupus erythematosus and rheumatoid arthritis
Clonally expanded CD8 T cells patrol the cerebrospinal fluid in Alzheimer's disease
Identification of Subject-Specific Immunoglobulin Alleles From Expressed Repertoire Sequencing Data
Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity
IMPre: An Accurate and Efficient Software for Prediction of T- and B-Cell Receptor Germline Genes and Alleles from Rearranged Repertoire Data
De novo Inference of Diversity Genes and Analysis of Non-canonical V(DD)J Recombination in Immunoglobulins
Automated analysis of immunosequencing datasets reveals novel immunoglobulin D genes across diverse species
A Novel Framework for Characterizing Genomic Haplotype Diversity in the Human Immunoglobulin Heavy Chain Locus
Ultrasensitive detection of TCR hypervariable-region sequences in solid-tissue RNA-seq data
Assembly-based inference of B-cell receptor repertoires from short read RNA sequencing data with V
Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing
Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation
Inferred Allelic Variants of Immunoglobulin Receptor Genes: A System for Their Evaluation, Documentation, and Naming
Variability in TRBV haplotype frequency and composition in Caucasian, African American, Western African and Chinese populations
Sequence diversity, natural selection and linkage disequilibrium in the human T cell receptor alpha/delta locus
Sequence variation in the human T-cell receptor loci
Ability To Develop Broadly Neutralizing HIV-1 Antibodies Is Not Restricted by the Germline Ig Gene Repertoire
Genomic screening by 454 pyrosequencing identifies a new human IGHV gene and sixteen other new IGHV allelic variants
Novel substitution polymorphisms of human immunoglobulin VH genes in Mexicans
The immunoglobulin heavy chain locus: genetic variation, missing data, and implications for human disease
The Individual and Population Genetics of Antibody Immunity
Exploring the pre-immune landscape of antigen-specific T cells
The distribution of HLA DQ2 and DQ8 haplotypes and their association with health indicators in a general Danish population
KIR haplotypes are associated with late-onset type 1 diabetes in European-American families
IMGT®, the international ImMunoGeneTics information system® 25 years on
Immune diversity sheds light on missing variation in worldwide genetic diversity panels
OGRDB: a reference database of inferred immune receptor genes
VDJbase: an adaptive immune receptor genotype and haplotype database
Williams-Beuren syndrome in diverse populations
Whole-Genome Sequencing Reveals Elevated Tumor Mutational Burden and Initiating Driver Mutations in African Men with Treatment-Naïve, High-Risk Prostate Cancer
The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
Genomewide Association Study of Severe Covid-19 with Respiratory Failure
Identification of a CD4-Binding-Site Antibody to HIV that Evolved Near-Pan Neutralization Breadth
Recurrent Potent Human Neutralizing Antibodies to Zika Virus in Brazil and Mexico
Structural basis of a shared antibody response to SARS-CoV-2
Landscape of tumor-infiltrating T cell repertoire of human cancers
Worldwide genetic variation of the IGHV and TRBV immune receptor gene families in humans
Correction: A Database of Human Immune Receptor Alleles Recovered from Population Sequencing Data
Comment on 'A Database of Human Immune Receptor Alleles Recovered from Population Sequencing Data'
Living in an adaptive world: Genomic dissection of the genus Homo and its immune response
How bioinformatics and open data can boost basic science in countries and universities with limited resources

Diseases of the National Institutes of Health under Award Number U01AI136677. We thank Dr. Nicky Mulder for the valuable comments that greatly improved the manuscript.