key: cord-0848812-9dm80o0p authors: Scott, Jamie K.; Breden, Felix title: Ms for Current Opinion in Systems Biology Special Issue on Systems Immunology The AIRR Community as a model for FAIR stewardship of big immunology data date: 2020-10-10 journal: Curr Opin Syst Biol DOI: 10.1016/j.coisb.2020.10.001 sha: 4656a6bfe4d750ff403d3f76e64bf0e990375ada doc_id: 848812 cord_uid: 9dm80o0p

Systems biology involves network-oriented, computational approaches to modeling biological systems through analysis of big biological data. To contribute maximally to scientific progress, big biological data should be FAIR: findable, accessible, interoperable and reusable. Here, we describe high-throughput sequencing data that characterize the vast diversity of B- and T-cell clones comprising the adaptive immune system (AIRR-seq data) and their contribution to our understanding of COVID-19. We describe the accomplishments of the “adaptive immune receptor repertoire” (AIRR) Community, a grass-roots network of interdisciplinary laboratory scientists, bioinformaticists, and policy wonks, in creating and publishing standards, software and repositories for AIRR-seq data based on the FAIR principles.

High-throughput sequencing of the diverse receptors of adaptive immunity, namely the B-cell receptors (BcRs) of B lymphocytes and the antibodies (Abs) secreted by their plasma-cell progeny, and the T-cell receptors (TcRs) of T lymphocytes, has allowed characterization of adaptive immune receptor repertoires (AIRRs) at the level of single clones (see Box 1 for Glossary). AIRR-sequence (AIRR-seq) data portray the frequencies of B- and T-cell clones in a lymphocyte population, and, when coupled with phenotypic data, can reflect the dynamics of clonal differentiation during an immune response.
AIRR-seq data can marry well with other types of big biological data that characterize immune responses (i.e., flow cytometric/CyTOF data, RNA-sequencing (RNA-seq) data, metabolomic, proteomic and microbiome data and digitized microscopic data); together these data types are contributing to systems-level analyses of immune responses. However, for such analyses to be practical, and for data from different sources to be comparable, big data and their metadata should be standardized; the software used to analyze them should complement those standards; and repositories used to store such data should be federated into a virtual commons with gateway functions to allow simultaneous querying of the entire commons for defined data and/or associated metadata. The universal adoption of such standards would go a long way toward enabling large-scale, multidimensional analyses.

The AIRR Community (AIRR-C) was established to bring these features to AIRR-seq data [2]. The AIRR-C comprises a network of volunteer bioinformaticists, laboratory scientists, and policy experts from academia and industry who support open science [3] insofar as possible, and are working together to establish standards for reporting [4], and storing and sharing [

Journal Pre-proof

domains of the heavy chains of BcRs/Abs, and the β or δ chains of TcRs), and one joining (J) gene segment. Thus, each genetic locus that encodes a TcR or BcR chain contains clusters of multiple, different "germline" V, (D) and J gene segments from which a single V, (D) and J gene segment is used to produce one of the two chains of a heterodimeric BcR/Ab or TcR.
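The arithmetic behind choosing one segment from each germline cluster can be sketched in a few lines of Python. This is only an illustration: the segment counts are hypothetical placeholders rather than values from the text, the names `SEGMENT_COUNTS`, `chain_combinations` and `paired_combinations` are invented here, and the calculation covers segment choice and chain pairing only.

```python
from math import prod

# Hypothetical numbers of germline gene segments per locus.
# These are placeholders for illustration, not values from the text.
SEGMENT_COUNTS = {
    "heavy": {"V": 40, "D": 20, "J": 6},  # chain built by V-D-J recombination
    "light": {"V": 30, "J": 5},           # chain built by V-J recombination
}

def chain_combinations(segments):
    """Number of distinct segment combinations for a single chain."""
    return prod(segments.values())

def paired_combinations(counts):
    """Combinations available once the two chains of the heterodimer pair."""
    return prod(chain_combinations(seg) for seg in counts.values())

heavy = chain_combinations(SEGMENT_COUNTS["heavy"])
light = chain_combinations(SEGMENT_COUNTS["light"])
paired = paired_combinations(SEGMENT_COUNTS)
print(heavy, light, paired)
```

Even with these modest placeholder counts, the product grows quickly because the choices multiply rather than add.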
Besides the combinatorial diversity provided by recombination of the V, (D) and J gene clusters, and by combining the two chains of the heterodimer, diversity is also produced at the junctions between VD, DJ and VJ joins, in a complex process called "imprecise joining" that can even add "non-templated" nucleotides at random to these junctions. The two junctions spanning a V-D-J recombination and the single junction of a V-J recombination encode the most diverse region of the variable domain in each chain of a BcR/Ab or TcR, its third complementarity determining region (CDR3). The CDR3 of each chain of a BcR/Ab or TcR is also responsible for making the dominant contacts with antigen; thus, these regions, which encode the greatest sequence diversity, are also positioned in the receptor to make the greatest contribution to its specificity and affinity for antigen.

Taken together, the potential diversity of the BcR/Ab and TcR repertoires is vast, far larger than the number of T and B cells in one's body. This enormous receptor diversity allows the adaptive immune system to respond with incredible specificity to a limitless variety of antigens. With the right co-stimulation, foreign antigens will "select" "naïve" B- and T-cell clones from their respective repertoires by virtue of binding directly to their BcRs, or as peptide fragments in complex with MHC to their TcRs, respectively. These selected B- and T-cell clones then divide and differentiate into effector cells (e.g., Ab-secreting plasma cells or cytotoxic, helper or regulatory T cells) to carry out the effector functions of adaptive immunity. Some descendants of these activated clones will also become long-lived memory B and T cells that await future encounters with the same antigen; they form the basis of immunological memory.
Thus, adaptive immune responses are (i) initiated via selection by antigen of B- and T-cell clones in a repertoire whose BcRs or TcRs bind antigen, and (ii) mediated by these cells' expansion and differentiation into effector and memory cells. Note that a final level of diversification occurs during B-cell responses: with T-cell help, the recombined V(D)J genes of BcRs (but not TcRs) accumulate somatic mutations; antigen then selects "stronger-binding BcRs" out of the pool of mutants of a selected B-cell clone in a process called "affinity maturation", which mediates the development of high-affinity Abs.

With the advent of high-throughput cDNA sequencing came its modification to characterize adaptive immune receptor (i.e., BcR/Ab and/or TcR) repertoires (i.e., AIRRs). AIRR-seq data can be generated from high-throughput sequencing of bulk or single B cells, Ab-secreting plasma cells, or T cells by: (i) amplification and bulk sequencing of the V(D)J rearrangements from a single chain of a BcR/Ab or TcR from cDNA or genomic DNA prepared from whole-cell populations that are often isolated based on cell phenotype (e.g., cell-surface markers and size) [9,10], or (ii) amplifying and sequencing the "paired" V(D)J rearrangements from cDNAs encoding both chains of a BcR/Ab or TcR from single cells [11, 12, 13]. Thus, clonal expansions can be identified from AIRR-seqs appearing at high frequency, a hallmark of "immunodominant" clones in an antigen-specific response, and when coupled with cell phenotype data (from flow cytometry and/or RNA-sequencing) can provide insights into the effector and memory functions of those clones. The lineages of somatically mutated B cells arising from a single clone can also be deduced from AIRR-seq data.
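As a rough illustration of how clonal expansions might be read off AIRR-seq annotations, the sketch below groups toy records by V gene, J gene and junction and reports clones above a frequency threshold. Everything in it is an assumption made for illustration: the junction strings are invented, the field names `v_call`, `j_call` and `junction_aa` are used as plausible annotation columns, the 0.5 threshold is arbitrary, and defining a clone by identical V, J and junction is a simplification that ignores somatic mutation within B-cell lineages.

```python
from collections import Counter

# Toy annotated rearrangements. The junction strings are invented and the
# field names are assumed annotation columns, used here for illustration.
rearrangements = [
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction_aa": "CARDGYW"},
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction_aa": "CARDGYW"},
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction_aa": "CARDGYW"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ6", "junction_aa": "CAKSSGW"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ4", "junction_aa": "CARTTVW"},
]

def clone_key(record):
    """Simplified clone definition: same V gene, J gene and junction."""
    return (record["v_call"], record["j_call"], record["junction_aa"])

def clone_frequencies(records):
    """Fraction of all sequences that belong to each clone."""
    counts = Counter(clone_key(r) for r in records)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def expanded_clones(frequencies, threshold):
    """Clones at or above the frequency threshold, most frequent first."""
    hits = [k for k, f in frequencies.items() if f >= threshold]
    return sorted(hits, key=lambda k: frequencies[k], reverse=True)

frequencies = clone_frequencies(rearrangements)
dominant = expanded_clones(frequencies, threshold=0.5)
print(dominant)
```

In real data, the threshold and the clone definition would both need to be chosen with the sequencing depth and cell population in mind.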
And finally, the sequences of an individual's "germline" V, D and J gene segments can be inferred from the AIRR-seq repertoires

The AIRR Community

The AIRR-C comprises an interdisciplinary group of several hundred bioinformaticists, laboratory and computational scientists and policy experts who are dedicated to the development of methods and standards for the generation, analysis and sharing of AIRR-seq data following FAIR principles [6]. Its vision is to promote a community of AIRR-seq data generators and users who share its core values of collaboration, inclusivity, transparency, and data and materials sharing. To achieve these goals, the AIRR-C communicates the resources it develops through publications, meetings and workshops, including standard formats for AIRR-seq data and metadata storage and analysis, computational tools, data repositories and websites.

Much of the work of the AIRR-C is performed by its Working Groups (WGs), which consist of AIRR-C members and other interested persons who are organized around a common goal and meet virtually (Box 2

In summary, AIRR-seq data have made significant contributions to our understanding of the immune system, and they promise to lend an important dimension to high-dimensional systems approaches in immunology research. In future, progress in systems immunology would be enhanced greatly by the collective efforts of researchers, institutions, national research-funding agencies, journals, scientific societies and industry to implement FAIR data practices, and open-source practices where possible, for all types of big biological data. The COVID-19 pandemic has created an urgent need for critical diagnostics, vaccines and therapeutics, and the scientific community has stepped up in sharing data freely to meet that need in a timely fashion.
However, given the current lack of standards for sharing many types of big biological data, impediments exist to their general usability.

Big biological data: Digital data characterized by (i) large volume, (ii) collection at high velocity via high-throughput technologies, (iii) origination from variable sources and, thus, (iv) the need for a means of verifying data quality.

FAIR data principles: Making big biological data findable, accessible, interoperable and reusable.

Systems immunology: Uses systems-biology approaches to model dynamic networks characterizing the immune system and its responses from multidimensional and big biological data.

Antigen: A molecule or complex that generates a specific immune response by binding to an antibody, B-cell receptor or T-cell receptor.

Adaptive immune responses: Immune responses that are initiated by antigen-mediated selection of B- and/or T-cell clones, and that mediate antigen-specific effector functions and memory.

Immunological memory: A state of the immune system that produces a rapid response on re-exposure to antigen (e.g., a virus), after it has been cleared from the organism.
Barcode-enabled sequencing of plasmablast antibody repertoires in rheumatoid arthritis
In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire
Tracking the immune response with single-cell genomics
Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity
Identification of subject-specific immunoglobulin alleles from expressed repertoire sequencing data
VDJbase: an adaptive immune receptor genotype and haplotype database
CoV-2 T cells with epitope-T-cell receptor recognition models
The immunoglobulin heavy chain locus: genetic variation, missing data, and implications for human disease
Inferred allelic variants of immunoglobulin

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.