key: cord-0882961-wyg5ahtc authors: Zhang, Yanfang; Xu, Qingxian; Zeng, Huikun; Wang, Minhui; Zhang, Yanxia; Lan, Chunhong; Yang, Xiujia; Zhu, Yan; Chen, Yuan; Wang, Qilong; Tang, Haipei; Zhang, Yan; Wu, Jiaqi; Wang, Chengrui; Xie, Wenxi; Ma, Cuiyu; Guan, Junjie; Guo, Shixin; Chen, Sen; Chang, Changqing; Yang, Wei; Wei, Lai; Ren, Jian; Yu, Xueqing; Zhang, Zhenhai title: SARS-Cov-2-, HIV-1-, Ebola-neutralizing and anti-PD1 clones are predisposed date: 2020-08-18 journal: bioRxiv DOI: 10.1101/2020.08.13.249086 sha: 15d6e95127790774289f3f17c4a4363b8d9adc1f doc_id: 882961 cord_uid: wyg5ahtc Antibody repertoire refers to the totality of the highly diversified antibodies within an individual that cope with the vast array of possible pathogens. Despite this extreme diversity, antibodies of the same clonotype, namely public clones, have been discovered among individuals. Although some public clones could be explained by antibody convergence, public clones in the naïve repertoire and virus-neutralizing clones from uninfected people have also been discovered. These findings indicate that public clones might not occur at random and might exert essential functions. However, the frequencies and functions of public clones in a population have never been studied. Here, we integrated 2,449 Rep-seq datasets from 767 donors and discovered 5.07 million public clones; ~10% of the repertoire is public in the population. We found 38 therapeutic clones among 3,390 annotated public clones, including anti-PD1 clones in healthy people. Moreover, we also revealed clones neutralizing SARS-CoV-2, Ebola, and HIV-1 viruses in healthy individuals. Our results demonstrate that these clones are predisposed in the human antibody repertoire and may exert critical functions during particular immunological stimuli and consequently benefit the donors. We also implemented RAPID, a Rep-seq Analysis Platform with Integrated Databases, which may serve as a useful tool for others in the field.
Keywords: antibody repertoire, public clone, neutralizing antibody, therapeutic antibody, analysis platform

Background

Antibody is a critical immunoglobulin complex consisting of two identical heavy and two identical light chains. Each chain is encoded by selectively recombining one of the various germline gene fragments, namely variable [...]

[...] from published data repositories or generated in our lab. These datasets contain samples from different genders, various tissues, immune statuses, and age spans, and were generated via different amplification strategies. Thus, they provide a rich source of reference for analyzing and comparing antibody repertoires. A systematic analysis pipeline applying exactly the same criteria yielded 7.12 billion reads and 306 million clones, making the datasets comparable to each other [18].
The therapeutic monoclonal antibodies (mAbs) were downloaded from the Thera-SAbDab database, which contains 521 therapeutic mAbs of different types at various stages. The 88,059 known antibodies were downloaded from multiple data repositories and carefully annotated via a natural language processing method as well as manual checking (Supp. Fig. 1 and Materials and Methods). These annotations included antibody sequences of different chain types as well as antibodies that bind to and neutralize viruses, are associated with particular diseases, etc. The raw sequences, CDR3s, descriptions, and sample metadata were systematically extracted and stored in a relational database as well as in FASTA format files when necessary. A user-friendly interface for searching particular terms, antibodies, and CDR3s (Supp. Fig. 2) was also provided (https://rapid.zzhlab.org/).

The data analysis module allows users to process their own data through a versatile pipeline. Apart from the general low- and high-level analyses, RAPID also provides some helpful features as described below (Fig. 1b). 1) Customizing the germline reference. 2) Customizing reference datasets; users can freely select one or more datasets in the platform as references for cross-comparison purposes. 3) Automatic antibody annotation; the CDR3s from the input dataset are automatically compared to the CDR3s in the data repository on RAPID and annotated where applicable. 4) Downloadable figures and analysis results. Thus, any researcher can upload their datasets, cross-compare them to 2,449 datasets with 306 million clones, all therapeutic antibodies, and 88,059 known antibodies, and retrieve the relevant information.
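The automatic annotation step (feature 3 above) can be illustrated with a minimal sketch. It assumes that annotation is an exact match on the CDR3 amino acid sequence; the function names, record layout, and toy sequences below are hypothetical and are not taken from the RAPID code base.

```python
from collections import defaultdict


def build_cdr3_index(known_antibodies):
    """Map each CDR3 amino acid sequence to the annotations of known antibodies carrying it."""
    index = defaultdict(list)
    for cdr3, annotation in known_antibodies:
        index[cdr3.upper()].append(annotation)
    return index


def annotate_clones(input_cdr3s, index):
    """Return {cdr3: [annotations]} for input CDR3s that exactly match a known CDR3.

    Exact matching is an assumption made for this sketch.
    """
    hits = {}
    for cdr3 in input_cdr3s:
        key = cdr3.upper()
        if key in index:
            # De-duplicate and sort so the output is deterministic.
            hits[cdr3] = sorted(set(index[key]))
    return hits


# Hypothetical toy records: (CDR3 amino acid sequence, annotation).
known = [
    ("CARDYYGSSYFDYW", "therapeutic: example mAb A"),
    ("CARGGWELLDYW", "neutralizing: example virus B"),
    ("CARDYYGSSYFDYW", "known antibody: example entry C"),
]
index = build_cdr3_index(known)
result = annotate_clones(["CARDYYGSSYFDYW", "CARSSGWYFDLW"], index)
```

Indexing the reference CDR3s once in a dictionary keeps each lookup constant-time, which matters when an uploaded dataset is compared against hundreds of millions of clones.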
With the thorough antibody collections and a versatile analysis platform, we believe RAPID will be helpful for the large cadre of scientists who need to analyze antibody repertoire data, as we demonstrate in this study.

With this unprecedented dataset, we started an in-depth inspection of the public clones. Here, we defined public clones as antibodies with the same CDR3 amino acid sequence that are present in more than one donor. Even with this stringent criterion, we discovered 5,077,372 public clones. Typically, the public clones represent ~1.23 to 100 percent of each individual repertoire with a peak value of 10.46 percent (Fig. 2a). Moreover, 65 public clones occur in more than 100 individuals, with one clone shared by 196 individuals (25.55% of the total donors) (Supp. Fig. 3a and b). Thus, the population-level study helped us find more public clones and highly frequent ones. As 96.86% of the public clones are from PBMCs (Supp. Fig. 3c), we compared the SHMs and clone fractions of the clones in different groups. The SHMs for naïve, memory, and plasma groups were comparable between public clones and total clones. The public clones of PBMCs and unknown samples displayed SHMs intermediate between those of naïve and non-naïve clones (Fig. 2b, upper panel). For clone fractions, the public clones from PBMCs and unknown samples were lower than their counterparts (Fig. 2b, lower panel), indicating that they are inactivated. On the other hand, about half of the public clones were of the IgM isotype (Fig. 2c, lower panel). Therefore, it is reasonable to speculate that the majority of these public clones were acquired from naïve and lowly mutated memory B cells.

We also observed that different V and J gene combinations can yield the same CDR3s. As expected, the diversity of V gene usage for public clones increased when clones are shared by more donors (Supp. Fig. 4).
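The public-clone definition used here (the same CDR3 amino acid sequence present in more than one donor) and the count of distinct V genes per shared CDR3 can be sketched as follows. This is an illustrative sketch only: the record layout, function names, and toy data are hypothetical, and the actual pipeline may differ.

```python
from collections import defaultdict


def find_public_clones(records, min_donors=2):
    """Group records by CDR3 and keep CDR3s seen in at least `min_donors` donors.

    records: iterable of (donor_id, cdr3_aa, v_gene) tuples (hypothetical layout).
    Returns {cdr3_aa: {"donors": set, "v_genes": set}}.
    """
    donors = defaultdict(set)
    v_genes = defaultdict(set)
    for donor_id, cdr3_aa, v_gene in records:
        donors[cdr3_aa].add(donor_id)
        v_genes[cdr3_aa].add(v_gene)
    return {
        cdr3: {"donors": donors[cdr3], "v_genes": v_genes[cdr3]}
        for cdr3 in donors
        if len(donors[cdr3]) >= min_donors
    }


def public_fraction(records, donor_id, public):
    """Fraction of one donor's distinct CDR3s that are public."""
    own = {cdr3 for d, cdr3, _ in records if d == donor_id}
    return len(own & set(public)) / len(own) if own else 0.0


# Hypothetical toy data; the last record repeats a clone within the same donor.
records = [
    ("donor1", "CARDYW", "IGHV3-23"),
    ("donor2", "CARDYW", "IGHV1-69"),
    ("donor2", "CAKGGW", "IGHV3-23"),
    ("donor3", "CARDYW", "IGHV3-23"),
    ("donor3", "CTRSSW", "IGHV4-34"),
    ("donor1", "CARDYW", "IGHV3-23"),
]
public = find_public_clones(records)
```

Counting donors with sets means that repeated observations of a CDR3 within one donor do not make a clone public; only sharing across donors does.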
However, when normalized to the maximum theoretical diversity with a particular number of V genes (see Materials and Methods), this diversity slightly decreased as the number of sharing donors increased, indicating a recombination preference of V genes (Fig. 2c, upper panel). Careful examination of the V and J genes that formed the same CDR3s showed that the same J gene was always preferred, while the V gene was more replaceable among individuals (Fig. 2d and Supp. Fig. 5). Nevertheless, the substitution rates of V genes were not completely determined by their sequence similarity (Fig. 2d). This result suggested that J genes might affect the CDR3s more than V genes do.

Taking advantage of the rich antibody information integrated in this study, we tried to annotate these public clones. In total, 3,390 public clones were annotated by three antibody databases, namely the known antibody database, Thera-SAbDab, and a coronavirus-neutralizing antibody collection incorporating 459 mAbs from CoV-AbDab [42], 28 mAbs from Kreer et al. [8], and 19 mAbs from Liu et al. [10] (Fig. 2e). We found that 3,349 out of 3,390 clones shared the same CDR3 amino acid sequences with known antibodies targeting specific antigens or [...]

Also, many of the therapeutic antibody clones found in healthy people, which are used for treatment of diseases with the top causes of death in the world (Table 1), prevailed in the population (Fig. 3, a [...] Evolocumab targeting PCSK9 is used to treat coronary disorders, stroke, and hypercholesterolaemia. Stroke alone caused 5.78 million deaths worldwide in 2016 [43], and this clone was found in the repertoires of 108 donors (14.1% of the total of 767). The CDR3 of the anti-PD1 antibody Camrelizumab, a treatment for various cancers, was found in the repertoires of 14 donors. Ramucirumab targeting KDR and Enfortumab targeting PVRL4 were also found in 23 and 49 donors, respectively.
According to the percent identities to therapeutic antibodies, most of the antibodies from the same clonotype separated into at least two groups (Fig. 3b and Supp. Fig. 6). Detailed inspection revealed that multiple V genes were involved in the recombination, which again supported the diversity of V genes within clones (Fig. 2d). These antibodies might serve as therapeutic alternatives for the same disease.

The upper heatmap stands for the composition of samples for these annotated public clones. Samples were divided into 6 groups including allergy, autoimmune, cancer, pathogen, healthy, and others.

[...] Thus, we concluded that they are predisposed in a population.

We then set out to explore the maturation pathways of these neutralizing clones by analyzing the phylogenetic trees of each clone built via DNAMLK (Fig. 4a-d). The Ebola-neutralizing clones exhibited high maturation rates with IgG. Interestingly, the three SARS-CoV-2 clones demonstrated various levels of SHMs. While the overall SHM rate for the MT658807 clone is lower than 2.5%, some of the antibodies in the MT658819 and 1-20 clones displayed more than 5% mutations. Previous studies reported generally lower SHM for SARS-CoV-2-neutralizing antibodies, but some clones with more mutations were still identified and [...]

To validate this similarity, we performed pair-wise structure comparison among antibodies neutralizing these three viruses. As shown in Fig. 4e, the RMSD scores of clones targeting the same antigen were much lower than those of clones targeting different antigens. Thus, the high similarity of Rep-seq retrieved antibodies to neutralizing antibodies is reliable.

Inspired by the existence of anti-PD1 clones in mouse and rat, we scrutinized the Rep-seq datasets of four different species, namely Macaca fascicularis, Macaca mulatta, Mus musculus, and Rattus norvegicus.
We found 4 SARS-CoV-1-neutralizing and 18 therapeutic clones in at least one species. Taken together, we believe these clones are not randomly generated but purposely selected and predisposed in vertebrates' repertoires.

Public clones are a specific fraction of antibodies among individuals that we know little about. By integrating the largest antibody dataset to date, population-level analyses discovered millions of public clones, which represent ~10% or a higher fraction of each individual's repertoire. However, compared to the superb diversity of the antibody repertoire, the current dataset might still be smaller than required. We believe that when more datasets are integrated, more public clones will be revealed. This is understandable since, although somatic recombination may generate numerous antibodies, the majority of them are eliminated during the negative selection process in the bone marrow. Consequently, the once private repertoire might be public [45].

How often can we find these public clones with critical functions in an individual? Are they predisposed in everyone's repertoire? The current data seem to support that only some people possess them. However, we found that sequencing depth is critical for public clone identification, as many more public clones were observed in datasets with very high depth. Currently, only a few hundred thousand to a few million reads are captured in general. Compared to the theoretical number of B cells in the sample and the depth needed to identify a clone confidently, many more sequencing reads are needed. As most of the therapeutic mAbs target proteins of conserved genes such as PDCD1, another helpful practice in finding functional public clones might be comparing antibody repertoires between human and other vertebrates.
The finding of clones that can bind to PDCD1 or neutralize SARS-CoV-2, Ebola, and HIV-1 viruses demonstrated that public clones might be important for the donor's health. Discovering the functionalities of the vast majority of other public clones would therefore be critical for a deep understanding of the humoral immune system. The major challenge in this regard is the lack of the paired light chain. The techniques of paired heavy and light chain sequencing invented in the Georgiou lab [46] and single-cell repertoire sequencing [47] showed great potential in solving this problem.

We will update RAPID along with the accumulation of Rep-seq datasets generated by others and our lab. We believe more public clones will be identified and their functions will be illustrated along this path.

Rep-seq datasets enrollment

The method used to enroll published and in-house Rep-seq datasets was described in Yang et al. [18]; please refer to it for detailed information. The re-analysis pipeline of these Rep-seq datasets was also included in that paper.

Five open access antibody databases, named abYsis (http://abysis.org/) [48] [...] Supplementary Table 3). These keywords were selected from descriptions of the antibody sequences stored in the databases of abYsis, bNAber, HIV-DB, EMBLIG, and IMGT/LIGM-DB. Furthermore, antibodies from 7 databases were pooled together and de-duplicated according to the nucleotide sequence of the variable region. In the end, disease information for antibodies from EMBLIG, ENA, IMGT/LIGM-DB, and NCBI was annotated by TaggerOne [52] based on the description, title, and abstract of the sequences. The sequences from abYsis were annotated as "NA", as no annotation information could be downloaded. The sequences from HIV-DB and bNAber were annotated as HIV infections.

The web interface is implemented in Hyper Text Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript (JS).
It is a single page application based on the JS framework React.js and uses the React component library Ant Design to unify the design style. The back end of the website uses Nginx as the HTTP and reverse proxy server, implements the business logic in Node.js, uses MySQL to manage data, and uses RabbitMQ to process the analysis task queues. Furthermore, real-time notification of task progress relies on WebSocket.

Firstly, if regions from FR1 to FR4 were reported by MiXCR, we simply joined them together as the variable region. For sequences whose FR1 to FR4 regions were not completely reported by MiXCR, we extracted them using our algorithm: I) Reads that could not be merged by MiXCR were discarded; II) The beginning of the variable region was acquired by pairwise alignment between the germline reference of V genes and the column named "targetSequence" reported by MiXCR (the function pairwise2.align.localms from Biopython was used with match, mismatch, gap open, and gap extension scores of 2, -3, -5, and -2); III) If the column named "refPoints" in MiXCR recorded the region of FR4, we used it instead of aligning "targetSequence" to the J gene to find the end of FR4.

Each phylogenetic tree was generated from the nucleotide sequences of variable regions for antibodies sharing the same CDR3 sequence with MT658807, MT658819, 1-20, MK901823, and KU760937. In addition, the validated neutralizing antibody and its germline V allele, which was set as the root, were also included. Alignments were performed using Clustal W 2.1, and the trees were fitted using DNAMLK, the maximum likelihood program with a molecular clock in PHYLIP 3.698 [54]. Lastly, these phylogenetic trees were displayed and annotated by iTOL [55].

As some Rep-seq datasets were amplified by multiplex PCR, variable regions for these sequences were not complete.
Thus, sequences that had lost several bases at the beginning of FR1 owing to the primer set design were padded with germline sequences from IMGT. Sequences for validated antibodies were downloaded from NCBI. In-frame variable regions were used to predict their structures by Repertoire Builder [56]. Then PyMOL was used to calculate RMSD to compare the similarity of antibody structures.

Diversity in the CDR3 Region of VH Is Sufficient for Most Antibody
Infectious disease antibodies for biomedical applications: A mini review of immune antibody phage library repertoire
Phage display-derived human antibodies in clinical development and therapy
The Generation of Antibody Diversity
Maturation and Diversity of the VRC01-Antibody Lineage over 15 Years of Chronic HIV-1 Infection
High-Throughput Mapping of B Cell Receptor Sequences to Antigen Specificity
Multi-Donor Longitudinal Antibody Repertoire Sequencing Reveals the Existence of Public Antibody Clonotypes in HIV-1 Infection
Longitudinal Isolation of Potent Near-Germline SARS-CoV-2-Neutralizing Antibodies from COVID-19 Patients
Potent Neutralizing Antibodies against SARS-CoV-2 Identified by High-Throughput Single-Cell Sequencing of Convalescent Patients' B Cells
Potent neutralizing antibodies directed to multiple epitopes on SARS-CoV-2 spike
Characterization of the B Cell Receptor Repertoire in the Intestinal Mucosa Tumor-Infiltrating Lymphocytes in Colorectal Adenoma and Carcinoma
Aberrant B cell repertoire selection associated with HIV neutralizing antibody breadth
Memory B Cells that Cross-React with Group 1 and Group 2 Influenza A Viruses Are Abundant in Adult Human Repertoires
Vaccine-Induced Antibodies that Neutralize Group 1 and Group 2 Influenza A Viruses
Human Responses to Influenza Vaccination Show Seroconversion Signatures and Convergent Antibody Rearrangements
High frequency of shared clonotypes in human B cell receptor repertoires
Commonality despite exceptional diversity in the baseline human antibody repertoire
Large-scale Analysis of 2,152 dataset reveals key features of B cell biology and the antibody repertoire
Convergent Antibody Signatures in Human Dengue
Differential Expression of IgM and IgD Discriminates Two Subpopulations of Human Circulating IgM+IgD+CD27+ B Cells That Differ Phenotypically, Functionally, and Genetically
HIV-1 broadly neutralizing antibody precursor B cells revealed by germline-targeting immunogen
Thera-SAbDab: the Therapeutic Structural Antibody Database
Tools for fundamental analysis functions of TCR repertoires: a systematic comparison
IgBLAST: an immunoglobulin variable domain sequence analysis tool
IMonitor: A Robust Pipeline for TCR and BCR Repertoire Analysis
IMSEQ - a fast and error aware approach to immunogenetic sequence analysis
LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins
MiXCR: software for comprehensive adaptive immunity profiling
RTCR: a pipeline for complete and accurate recovery of T cell repertoires from high throughput sequencing data
TRIg: a robust alignment pipeline for non-regular T-cell receptor and immunoglobulin sequences
MiTCR: software for T-cell receptor sequencing data analysis
THE IMGT WEB PORTAL FOR IMMUNOGLOBULIN (IG) OR ANTIBODY AND T CELL RECEPTOR
Decombinator: a tool for fast efficient gene assignment in T-cell receptor sequences using a finite state machine
TCRklass: A New K-String-Based Algorithm for Human and Mouse
PIRD: Pan Immune Repertoire Database
ASAP - A Webserver for Immunoglobulin-Sequencing Analysis Pipeline
VDJServer: A Cloud-Based Analysis Portal and Data Commons for Immune
BRepertoire: a user-friendly web server for analysing antibody repertoire data
Antigen Receptor Galaxy: A User-Friendly, Web-Based Tool for Analysis and Visualization of T and B Cell Receptor Repertoire Data
Vidjil: A Web Platform for Analysis of High-Throughput Repertoire Sequencing
bNAber: database of broadly neutralizing HIV antibodies
CoV-AbDab: the Coronavirus Antibody
World Health Organization, WHO methods and data sources for global causes of death
Longitudinal Analysis of the Human B Cell Response to Ebola Virus Infection
Private Antibody Repertoires Are Public
A facile technology for the high-throughput sequencing of the paired VH:VL and TCRbeta:TCRalpha repertoires
Massively parallel single-cell B-cell receptor sequencing enables rapid discovery of diverse antigen-reactive antibodies
abYsis: Integrated Antibody Sequence and Structure-Management, Analysis, and Prediction
IMGT/LIGM-DB, the IMGT comprehensive database of immunoglobulin and T cell receptor nucleotide sequences
The European Nucleotide Archive in 2019
TaggerOne: joint named entity recognition and normalization with semi-
Clustal W and Clustal X version 2.0
PHYGUI): adapting the functions of the graphical user interface for the PHYLIP package
Interactive Tree Of Life (iTOL) v4: recent updates and new developments
Repertoire Builder: high-throughput structural modeling of B and T cell receptors

This study was supported by the National Natural Science Foundation of China (NSFC) (31771479), NSFC Projects of International Cooperation and Exchanges of NSFC (61661146004), and the Local Innovative and Research Teams Project of Guangdong Pearl River Talents Program (2017BT01S131). We thank Jun Chen from the MOE Laboratory of Biosystems Homeostasis & Protection and Innovation Center for Cell Signaling Network, College of Life Sciences, Zhejiang University, for the valuable comments, discussions, and suggestions.

The authors declared no competing financial interests.

Fig. 1 Workflow of known antibody database construction. The first two boxes record the total number of sequences downloaded from 7 databases in GenBank and FASTA formats.
Each processing step applied to the sequences is marked near the arrow between intermediate results.

Supp. Fig. 4 The number of V genes for public clones shared by different numbers of donors. The number of donors of public clones is discrete.

Supp. Fig. 5 The substitution frequencies of J genes with the same CDR3aa among different donors. The darker the color, the higher the substitution frequency.

Supp. Fig. 6 Identity of variable regions from FR1 to FR3 between therapeutic antibodies and public clones. The X-axis shows the divergence from the germline reference and the Y-axis shows the sequence identity. Different V genes are filled in different colors. Titles of subfigures, separated by forward slashes, include the INN identifier of the therapeutic antibody, the number of samples and donors with that CDR3aa, and the target of the therapeutic antibody. Dots of therapeutic antibodies are larger than those of clones identified from Rep-seq datasets. Subfigures are sorted according to the number of donors.

Supp. Fig. 8 Maturation pathway of clones with the same CDR3aa as MT658807. Variable region sequences with the same CDR3aa as MT658807 were extracted and compared with MT658807 and its germline reference. The germline reference was chosen as the root of the phylogenetic tree and MT658807 is marked by an arrow. The cluster map contains four layers, from inner to outer: similarity of sequences (sequences extracted from the same donor are marked with the same color), V gene family, isotype, and somatic hypermutation rate.

Supp. Fig. 9 Overlap of public clones shared by other species.