key: cord-0129289-cl53y3ga authors: Montague, Zachary; Lv, Huibin; Otwinowski, Jakub; DeWitt, William S.; Isacchini, Giulio; Yip, Garrick K.; Ng, Wilson W.; Tsang, Owen Tak-Yin; Yuan, Meng; Liu, Hejun; Wilson, Ian A.; Peiris, J. S. Malik; Wu, Nicholas C.; Nourmohammad, Armita; Mok, Chris Ka Pun title: Dynamics of B-cell repertoires and emergence of cross-reactive responses in COVID-19 patients with different disease severity date: 2020-07-14 journal: nan DOI: nan sha: b9658559855d3391cf186ffc64eb3afdc5000fe3 doc_id: 129289 cord_uid: cl53y3ga COVID-19 patients show varying severity of the disease ranging from asymptomatic to requiring intensive care. Although a number of monoclonal antibodies against SARS-CoV-2 have been identified, we still lack an understanding of the overall landscape of B-cell receptor (BCR) repertoires in COVID-19 patients. Here, we used high-throughput sequencing of BCR repertoires collected over multiple time points during an infection to characterize statistical and dynamical signatures of the B-cell response to SARS-CoV-2 in 19 patients with different disease severities. Based on principled statistical approaches, we determined differential sequence features of BCRs associated with different disease severity. We identified 34 significantly expanded rare clonal lineages shared among patients as candidates for a specific response to SARS-CoV-2. Moreover, we identified natural emergence of a BCR with cross-reactivity to SARS-CoV and SARS-CoV-2 in a number of patients. Overall, our results provide important insights for development of rational therapies and vaccines against COVID-19. The novel coronavirus SARS-CoV-2, which causes the severe respiratory disease COVID-19, has now spread to 216 countries and caused more than ten million infections with a mortality rate around 5% (WHO, 2020) . COVID-19 patients show varying disease severity ranging from asymptomatic to requiring intensive care. While epidemiological and clinical data report that many factors such as age, gender, genetic background, and preexisting conditions are associated with disease severity, host immunity against the virus infection is the crucial component of controlling disease progression (Ellinghaus et al., 2020; Guan et al., 2020; McKechnie and Blish, 2020; Vabret et al., 2020; Wu et al., 2020a) . Shedding light on signatures of a protective immune response against SARS-CoV-2 infections can help understand the nature of COVID-19 and guide the therapeutic developments as well as vaccine design and assessment. Adaptive immunity is considered as one of the core protective mechanisms in humans against infectious diseases. A vast diversity of surface receptors on B-and T-cells enables us to recognize and counter new or repeated invasions from a multitude of pathogens (Janeway et al., 2005; Nielsen and Boyd, 2018) . In particular, antibodies produced by B-cells can provide long-lasting protection against specific pathogens through neutralization or other antibody-mediated immune mechanisms (Janeway et al., 2005) . During the early phase of an infection, antigens of a pathogen are recognized by a group of naïve B-cells, which then undergo affinity maturation in a germinal center through somatic hypermutation and selection. The B-cell receptors (BCRs) of mature B-cells can strongly react to infecting antigens, resulting in B-cell stimulation, clonal expansion, and ultimately secretion of high affinity antibodies in the blood (Burnet, 1959; Burnet, 1960; Cyster and Allen, 2019) . The specificity of a BCR is determined by a number of features such as V-, (D-) or J-gene usage, length and sequence composition of the HCDR3 region. It has been found that SARS-CoV-2-specific IgG antibodies can be detected in plasma samples of COVID-19 patients starting from the first week after the onset of symptoms . These antibodies bind to different antigens including spike protein and nucleoprotein as well as other structural or non-structural proteins (Hachim et al., 2020) . In addition, multiple studies have isolated SARS-CoV-2-specific B-cells from COVID-19 patients and determined their germline origin (Brouwer et al., 2020; Cao et al., 2020; Chi et al., 2020; Ju et al., 2020; Robbiani et al., 2020; Rogers et al., 2020; Seydoux et al., 2020a; Seydoux et al., 2020b; Wu et al., 2020b; Zost et al., 2020) . However, we still lack a comprehensive view of the patients' entire BCR repertoires during SARS-CoV-2 infections. Antibody repertoire sequencing has advanced our understanding of the diversity of adaptive immune repertoires and their response to pathogens (Boyd et al., 2009; Georgiou et al., 2014; Kreer et al., 2020; Robins, 2013) . A few studies have performed BCR repertoire sequencing to characterize the statistical signatures of the immune response to SARS-CoV-2 (Galson et al., 2020; Nielsen et al., 2020; Schultheiß et al., 2020) . However, these studies have limited data on the dynamics of BCR repertoires, which could otherwise provide significant insight into specific responses to the infection. In this study, we have established a principled statistical approach to study the statistics and dynamics of BCR repertoires and to characterize the immune responses in 19 COVID-19 patients with different disease severities. Importantly, by combining information from the statistics of sequence features in BCR repertoires, the expanding dynamics of clonal lineages during infection, and sharing of BCRs among COVID-19 patients, we identified 34 clonal lineages that are potential candidates for a response to SARS-CoV-2. Moreover, we identified cross-reactive responses to SARS-CoV in some of the COVID-19 patients and a natural emergence of a previously isolated SARS-reactive antibody (Pinto et al., 2020) in three patients. We obtained total RNA from the PBMC isolated from 19 patients infected with SARS-CoV-2 and three healthy individuals (see Methods and Tables S1, S2 for details). The patients showed different severity of symptoms, forming three categories of infected cohorts: 2 patients with mild symptoms, 12 patients with moderate symptoms, and 5 patients with severe symptoms. Specimens from all patients, except patient 19, were collected over two or more time points during the course of the infection. The sampled time points for all patients are indicated in Fig. 1 and Tables S1. IgG heavy chains of B-cell repertoires were sequenced by next-generation sequencing, and the statistics of the collected BCR read data from each sample are shown in Table S1 . Statistical models were applied to analyze the length of the HCDR3 region, IGHV-or IGHJ-gene usage, and insertion/deletion patterns as well as the expansion of specific clonal lineages (Fig. 1 ). We first aimed to investigate whether cohorts with different disease severities can be distinguished by molecular features of their B-cell repertoires. Since sequence features of immune receptors (e.g. HCDR3 length, V-or J-gene usage, and insertion/deletion patterns) are often associated with their binding specificity, we used statistical methods to compare these features both at the level of clonal lineages, including the inferred receptor sequence of lineage progenitors (Figs. 2, S2), and unique sequences (Figs. S1, S2) in a repertoire. Lineage progenitors of IgG repertoires are closest to the ensemble of naïve receptors in the periphery. Features of lineage progenitors reflect receptor characteristics that are necessary for activating and forming a clonal lineage in response to an infection. Statistics of unique sequences in a repertoire, on the other hand, contain information about the size of the circulating lineages. Importantly, these two statistical ensembles are relatively robust to PCR amplification biases that directly impact read abundances (see Methods for error correction and processing of sequence reads). IGHV genes cover a large part of pathogen-engaging regions of BCRs, including the three complementarity-determining regions HCDR1, HCDR2, and a portion of HCDR3. Therefore, we investigated if there are any differences in V-gene usage across cohorts, which may indicate preferences relevant for response to a particular pathogen. We found that the variation in V-gene usage among individuals within each cohort was far larger than differences among cohorts ( Fig. 2A) . Data from unique sequences also indicated large background amplitudes due to vast differences in the sizes of lineages within a repertoire (Fig. S1) . Similarly, IGHJ-gene usage was also comparable across different cohorts (Fig. S2 ). Our results suggest that the SARS-CoV-2 V-gene specific responses are highly individualized at the repertoire level. HCDR3 is part of the variable chain of B-cell receptors and is often a crucial region in determining specificity. Importantly, HCDR3 is highly variable in its sequence content and length due to insertion and deletion of sequence fragments at the VD and DJ junctions of the germline receptor. Therefore, differential characteristics of the HCDR3 sequence in BCR repertoires of different cohorts can signal preferences for sequence features specific to a class of antigens. We find that HCDR3s in COVID-19 patients with moderate and severe symptoms are significantly longer than in the healthy control, both at the level of clonal lineages (Fig. 2B ) and unique sequences in each repertoire (Fig. S1B) . The difference between HCDR3s length in healthy individuals and patients with mild symptoms is insignificant (p-values for significance of these deviations are given in the corresponding Figure captions ). Recent studies have also identified propensity for longer HCDR3s in BCR repertoires of COVID-19 patients (Galson et al., 2020; Nielsen et al., 2020; Schultheiß et al., 2020) . We find that the longer HCDR3 length in patients with moderate and severe symptoms is primarily due to insertion of larger sequence fragments and also shorter deletions at the VD junction (Figs. 2C-D, S1C-D). Interestingly, the insertion/deletion patterns at the DJ junctions did not show a substantial difference across cohorts ( Fig. S2C -F). Longer sequence insertions in HCDR3 can introduce more sequence diversity at the repertoire level. Quantifying sequence diversity of a B-cell repertoire, however, is very sensitive to the sampling depth in each individual. Despite progress in the quality of high-throughput repertoire sequencing (RepSeq) techniques, sequenced BCRs still present a highly under-sampled view of the entire repertoire. To characterize the diversity of repertoires and the statistics of sequence features that make up this diversity, we inferred principled models of repertoire generation and selection for the entry of receptors into the periphery (Methods) (Elhanati et al., 2014; Marcou et al., 2018) . To do so, we first used data from unproductive B-cell receptors to infer the highly non-uniform baseline model that characterizes the probability gen (σ) to generate a given receptor sequence (Elhanati et al., 2014; Marcou et al., 2018) (Fig. 1 and Methods). This baseline model reflects sequence biases in generating BCRs in the bone marrow. The functional, yet pathogen-naïve BCRs that enter the periphery experience selection through processes known as central tolerance (Janeway et al., 2005) . In addition, the inferred progenitors of clonal lineages in the IgG repertoire have undergone antigen-dependent selection that led to expansion of their clonal lineage in response to an infection. These two levels of selection make sequence features of functional lineage progenitors distinct from the pool of unproductive BCRs that reflects biases of the generation process prior to any selection. To identify these distinguishing sequence features, we inferred a selection model for lineage progenitors (Methods). We characterized the probability to observe a clonal lineage ancestor in the periphery as post (σ) ∼ gen (σ) ' (:features * ( (+) , which deviates from the inferred generation probability of the receptor gen (σ) by selection factors . ( ) (Isacchini et al., 2020a; Isacchini et al., 2020b; Sethna et al., 2020) . These selection factors . ( ) depend on sequence features, including IGHV-gene and IGHJ-gene usages, HCDR3 length, and amino acid preferences at different positions in the HCDR3 (Methods) (Elhanati et al., 2014; Isacchini et al., 2020a; Isacchini et al., 2020b; Marcou et al., 2018; Sethna et al., 2020) . The distribution of the log-probability log 34 post (σ) for the inferred progenitors of clonal lineages observed in individuals from different cohorts is shown in Fig. 3A . We find an overabundance of BCR lineages with progenitors that have a low probability of entering the periphery (i.e., a lower post (σ)) in COVID-19 patients compared to healthy individuals (Fig. 3A) . Moreover, we estimated the diversity of the repertoires in each cohort by evaluating the entropy of receptor sequences generated by the respective repertoire models (see Methods). In particular, diverse repertoires that contain B-cell lineages with rare receptors (i.e., those with a lower post (σ)), should have larger entropies. Based on this analysis, we find that immune repertoires are more diverse in COVID-19 patients compared to healthy individuals ( Fig. 3A and Methods). Specifically, the entropy (i.e., diversity) of BCR repertoires grew with severity of the disease, from 38.7 bits in the healthy cohort to 40.5 bits in the mild cohort, to 41.6 bits in the moderate cohort, and to 41.3 bits in the severe cohort (see Methods). Selection factors . ( ) determine the deviation in preferences for different sequence features of BCRs in each cohort, including their HCDR3 length and composition, and IGHV-gene usages. A comparison of selection factors among cohorts can characterize their distinctive sequence features. Consistent with our previous analysis of lineage characteristics in Fig. 2A , we find that selection factors associated with V-gene usage are strongly correlated across cohorts (Fig. 3B ). However, we find a weaker correlation for selection factors associated with preference for HCDR3 length between healthy individuals and the cohorts of COVID-19 patients (Fig. 3B) , which is consistent with our observation in Fig. 2B . Moreover, there are only weak correlations for HCDR3 amino acid preferences between healthy individuals and the cohorts of COVID-19 patients (Fig. 3B) . Interestingly, the HCDR3 composition was most distinct between patients with moderate symptoms and the healthy cohort, with 73% correlation for length and 65% correlation for amino acid preferences (Fig. 3B ). The HCDR3 composition was most similar between patients with moderate and severe symptoms, with 92% correlation for length and 89% correlation for amino acid preferences (Fig. 3B ). Taken together, HCDR3 represents a molecular feature that is distinguishable among cohorts. Expansion of BCR clonal lineages over time as a proxy for response to SARS-CoV-2. Next, we examined the dynamics of BCR repertoires in the COVID-19 patients. The binding level (measured by OD450 in ELISA assays) of both IgM and IgG antibodies against the receptor-binding domain (RBD) or N-terminal domain (NTD) of SARS-CoV-2 increased in most of the COVID-19 patients in our study over the course of their infection (Figs. 4A, S3). We expected that the increase of OD450 binding level is associated with activation of specific B-cells, resulting in an increase in mRNA production of the corresponding BCRs. Detecting expansion of specific clonal lineages is challenging due to subsampling of the repertoires. In fact, only a limited overlap of BCR lineages was found if we simply compared the data between different time points or between replicates of a repertoire sampled at the same time point (Fig. S4 ). To identify expanding clonal lineages, we examined lineages only in patients whose plasma showed an increase in binding level (OD450) to the RBD of SARS-CoV-2 and compared the sequence abundance of those lineages that appear in two or more time points (Figs. 4A, S3 and Methods). Using a hypothesis test with a false discovery rate of 7.5%, as determined by analyzing replicate data (Methods, Fig. S5 ), we detected significant expansion of clonal lineages within all investigated patients. The results reflect a dynamic repertoire in all patients, ranging from 5% to 15% of lineages with significant expansion and large changes in sequence abundances over time (Figs. 4, S6) . The expanding lineages had comparable HCDR3 length to the rest of the repertoire in COVID-19 patients (Fig. S6 ). Moreover, we observed expanding lineages to show V-gene preferences comparable to those of previously identified antibodies against SARS-CoV-2 (RBD). This includes the abundance of IGHV4-59, IGHV4-39, IGHV3-23, IGHV3-53, IGH3-66, IGHV2-5, and IGHV2-70 (Brouwer et al., 2020; Ju et al., 2020; Pinto et al., 2020; Rogers et al., 2020) . However, it should be noted that these preferences in V-gene usage among expanding lineages are comparable to the overall biases in V-gene usage within patients, and expanded lineages roughly make up 25% of lineages with a given V gene (Fig. 4C ). Therefore, Our results suggest that the overall response to SARS-CoV-2 is not driven by only a specific class of IGHV gene. Despite the vast diversity of BCRs, we observe a significant number of identical progenitors of BCR clonal lineages among COVID-19 patients (Fig. 5 ) and among healthy individuals (Fig. S7) . Previous work has also identified sharing of BCRs among COVID-19 patients, which was interpreted by the authors as evidence for large-scale convergence of immune responses (Galson et al., 2020; Nielsen et al., 2020; Schultheiß et al., 2020) . Although BCR sharing can be due to convergent response to common antigens, it can also arise from convergent recombination leading to the same receptor sequence (Elhanati et al., 2018; Pogorelyy et al., 2018a) or simply from experimental biases. Therefore, it is imperative to formulate a null statistical model to identify the outliers among shared BCRs as candidates for common responses to antigens. Convergent recombination defines a null expectation for the amount of sharing within a cohort based on only the underlying biases for receptor generation within a repertoire (Elhanati et al., 2018; Pogorelyy et al., 2018a) (Methods) . Intuitively, sharing is more likely among commonly generated receptors (i.e., with a high post (σ)) and within cohorts with larger sampling (Methods). Importantly, rare receptors (i.e., with a low post (σ)) that are shared among individuals in a common disease group can signal commonality in function and a response to a common antigen, as previously observed for TCRs in response to a yellow fever vaccine (Pogorelyy et al., 2018b) , CMV and diabetes (Pogorelyy et al., 2018a) . We used the receptors' probabilities post (σ) to assess the significance of sharing by identifying a probabilistic threshold to limit the shared outliers both among the COVID-19 patients (dashed line in Fig. 5 ) and the healthy individuals (dashed line in Fig. S7 ). Out of a total of 33,660 (unique) progenitors of clonal lineages (Fig. 5A) , we found 5,112 progenitors to be shared among at least two individuals; 152 of the 5,112 were classified as rare, having a probability of occurrence below the indicated threshold (dashed line) in Fig. 5B . Moreover, we found that 439 lineages shared a common sequence ancestor in at least two individuals and have expanded in at least one of the individuals ( Fig. 5C -D). 34 of these shared, expanding lineages stemmed from rare naïve progenitors (below the dashed line in Fig. 5B , D). The sharing of these rare expanding BCRs among COVID-19 patients indicates a potentially convergent response to SARS-CoV-2 (Table S3) . Interestingly, we found that 20% of receptors in these rare shared, expanding lineages contain multiple cysteines in their HCDR3s, in contrast to only 10% of the receptors in the whole repertoire. Such sequence patterns with cysteine pairs in the HCDR3 have been associated with stabilization of the HCDR3 loop by forming disulfide bonds with particular patterns and spacings of the cysteines (Lee et al., 2014; Prabakaran and Chowdhury, 2020) . Disulfide bonds in the HCDR3 can decrease the conformational flexibility of the loop, thus decreasing the entropic cost of binding to improve the affinity of the receptor (Almagro et al., 2012) . The significantly larger fraction of multi-cysteine HCDR3s among the candidate SARS-CoV-2 responsive receptors (p-value = 10 -9 based on binomial sampling) indicates an underlying molecular mechanism for developing a potent response to SARS-CoV-2. Previous studies have identified a number of antibodies responsive to the RBD of SARS-CoV-2 (Cao et al., 2020; Pinto et al., 2020; Robbiani et al., 2020) . In repertoires of the COVID-19 patients, we found that several HCDR3s matched with SARS-CoV-2-specific monoclonal antibodies (mAbs) that were previously isolated in other studies (Table S3) . Specifically, three distinct SARS-CoV-2-specific mAbs were found to be close in sequence to HCDR3s in our data, with only one amino acid difference, and one of these mAbs also had a matching V gene (Table 1) . Among these, we found a class of antibodies containing multiple cysteines in their HCDR3s, with the potential to form disulfide bonds to enhance their affinity to antigens and paratope diversity (Almagro et al., 2012; Lee et al., 2014; Prabakaran and Chowdhury, 2020) . In addition, we found that two patients had exact HCDR3 matches to a previously identified antibody (S304) that has cross-reactivity to SARS-CoV and SARS-CoV-2 (Pinto et al., 2020) . We also observed in one patient a HCDR3 with only one amino acid difference to this antibody (Table 1) . Importantly, the plasma in these patients showed a substantial binding level (OD450) to SARS-CoV (Fig. S3 ), which indicates a possibility of cross-reactive antibody responses to SARS-CoV and SARS-CoV-2. It should be noted that all of the previously verified antibodies that we identify in the patients' repertoires have a relatively high probability post (σ) of generation and entry to the periphery (indicated by yellow diamonds in Fig. 5B ). This is not surprising as it is very unlikely to observe rare BCRs (with small post (σ)) to be shared in multiple individuals. Overall, our results are encouraging for vaccine development since they indicate that even common antibodies can confer specific responses against SARS-CoV-2. A repertoire of immune receptor sequences presents a unique snapshot of the history of immune responses in an individual (Boyd et al., 2009; Georgiou et al., 2014; Kreer et al., 2020; Robins, 2013) , and the changes in a repertoire during an infection can signal specific responses to pathogens (Horns et al., 2019; Nourmohammad et al., 2019) . Identifying signatures of a functional response to a given pathogen from the pool of mostly unspecific BCRs collected from the blood is challenging-it is a problem of finding a needle in a haystack. Therefore, principled statistical inference approaches are necessary to extract functional signal from such data. Here, we characterized the B-cell repertoire response to SARS-CoV-2 in COVID-19 patients with different disease severity by combining evidence from the overall statistics of repertoires together with dynamics of clonal lineages during infection and the sharing of immune receptors among patients. At the repertoire level, we showed that HCDR3 of BCRs in COVID-19 patients are significantly longer than HDCR3 in healthy individuals, and the amino acid composition of this receptor region varies among cohorts of patients with mild, moderate, and severe symptoms. Moreover, we observed largescale sharing of B-cell receptors among COVID-19 patients, consistent with previous findings in COVID-19 patients (Galson et al., 2020; Nielsen et al., 2020; Schultheiß et al., 2020) . Sharing of receptors among individuals can signal common immune responses to a pathogen. However, BCR sharing can also be due to convergent recombination leading to the same receptor sequence or other experimental biases that influence statistics of shared sequences. These statistical nuances can substantially sway conclusions drawn from the sharing analysis and, therefore, should be carefully accounted for. Here, we established a null expectation of BCR sharing due to convergent recombination by inferring a model of receptor generation and migration to the periphery and used this null model to identify sequence outliers. Our analysis identified a subset of rare BCRs shared among COVID-19 patients, which appears to signal convergent responses to SARS-CoV-2. In addition to the statistics of repertoires, we observed that the activity of many B-cell lineages (i.e., mRNA production) in COVID-19 patients significantly increases during infection, accompanied by an increase in the binding level (OD450) of the patients' sera to the RBD and NTD of SARS-CoV-2. Dynamics of clonal lineages during an infection provide significant insights into the characteristics of responsive antibodies (Horns et al., 2019; Nourmohammad et al., 2019) . By taking advantage of data collected at multiple time points in most patients, we identified expanded lineages shared among patients and found 34 clonal lineages that are candidates for a specific response specific to SARS-CoV-2 antigens. Further analysis would be required to determine the reactivity of these antibodies against the RBD or NTD of SARS-CoV-2. Identifying antibodies with cross-reactive neutralization abilities against viruses in the SARS family is of significant interest. Interestingly, in nine patients, we see a substantial increase in the binding level (OD450) of their sera to SARS-CoV epitopes during the course of COVID-19 infection. Moreover, in three patients, we identify a BCR identical to the heavy chain of antibody S304 (Pinto et al., 2020) , which was previously isolated from a patient who recovered from a SARS-CoV infection. This antibody was shown to be moderately cross-reactive to both SARS-CoV and SARS-CoV-2, and our results further indicate a possibility for such cross-reactive antibodies to emerge naturally in response to SARS-CoV-2 (Brouwer et al., 2020; Lv et al., 2020; Rogers et al., 2020) . Taken together, our findings provide substantial insight and strong implications for devising vaccines and therapies with a broad applicability against SARS-CoV-2. Sample collection and PBMC isolation. Specimens of heparinized blood were collected from the RT-PCR-confirmed COVID-19 patients at the Infectious Disease Centre of the Princess Margaret Hospital, Hong Kong. The study was approved by the institutional review board of the Hong Kong West Cluster of the Hospital Authority of Hong Kong (approval number: UW20-169). All study procedures were performed after informed consent was obtained. Day 1 of clinical onset was defined as the first day of the appearance of clinical symptoms. The severity of the COVID-19 cases was classified based on the adaptation of the Sixth Revised Trial Version of the Novel Coronavirus Pneumonia Diagnosis and Treatment Guidance. The severity of the patients was categorized as follows: Mild -no sign of pneumonia on imaging, mild clinical symptoms; Moderate -fever, respiratory symptoms and radiological evidence of pneumonia; Severe -dyspnea, respiratory frequency >30/min, blood oxygen saturation 93%, partial pressure of arterial oxygen to fraction of inspired oxygen ratio <300, and/or lung infiltrates >50% within 24 to 48 hours; Critical -respiratory failure, septic shock, and/or multiple organ dysfunction or failure or death. The blood samples were first centrifuged at 3000 xg for 10 minutes at room temperature for plasma collection. The remaining blood was diluted with equal volume of PBS buffer, transferred onto the Ficoll-Paque Plus medium (GE Healthcare), and centrifuged at 400 xg for 20 minutes. Peripheral Blood Mononuclear Cells (PBMC) samples were then collected and washed with cold RPMI-1640 medium for three times. The isolated PBMC samples were finally stored at cell freezing solution (10% DMSO + 90% FBS) and kept in -80 o C until used. RNA extraction and reverse transcription. Total RNA was extracted from 5 × 10 9 PBMC using the RNeasy Mini isolation kit (Qiagen) according to the manufacturer's protocol. Reverse transcription of the RNA samples was performed using the Proto-Script® II Reverse Transcriptase kit (New England Biolabs, NEB) with random hexamer primers according to the manufacturer's protocol. The thermal cycling conditions were designed as follows: 25 o C for 5 minutes, 42 o C for 60 minutes, and 80 o C for 5 minutes. The resulting cDNA samples were stored in 80 o C freezer before PCR was performed. Amplification of B cell repertoire from the samples by PCR. The cDNA samples were used as a template to amplify the antibody IgG heavy chain gene with six FR1-specific forward primers and one constant region-specific reversed primer using the Phusion® High-Fidelity DNA Polymerase. The primer sequences were the same as previously described (Wu et al., 2015) ; primer sequences are listed in Table S2 . The thermal cycling conditions were set as follows: 98 o C for 30 seconds; 30 cycles of 98 o C for 10 seconds, 58 o C for 15 seconds, and 72 o C for 30 seconds; and 72 o C for 10 minutes. Then 10 ng of the PCR product was used as a template for the next round of gene amplification with samplespecific barcode primers. The thermal cycling conditions were set as follow: 98 o C for 3 min; 30 cycles of 98 o C for 10 seconds, 58 o C for 15 seconds, and 72 o C for 15 seconds; and a final extension at 72 o C for 10 min using Phusion® High-Fidelity DNA Polymerase. The PCR product was purified by QIAquick Gel Extraction Kit (Qiagen), and quantified by NanoDrop Spectrophotometers (Thermofisher). Protein expression and purification. The receptor-binding domain (RBD, residues 319-541) and Nterminal domain (NTD, residues 14 to 305) of the SARS-CoV-2 spike protein (GenBank: QHD43416.1) as well as the RBD (residues 306-527) and NTD (residues 14-292) of SARS-CoV spike protein (GenBank: ABF65836.1) were cloned into a customized pFastBac vector (Lv et al., 2020; Wec et al., 2020) . The RBD and NTD constructs were fused with an N-terminal gp67 signal peptide and a C-terminal His6 tag. Recombinant bacmid DNA was generated using the Bac-to-Bac system (Life Technologies, Thermo Fisher Scientific). Baculovirus was generated by transfecting purified bacmid DNA into Sf9 cells using FuGENE HD (Promega, Madison, US) and subsequently used to infect suspension cultures of High Five cells (Life Technologies) at a multiplicity of infection (moi) of 5 to 10. Infected High Five cells were incubated at 28 °C with shaking at 110 rpm for 72 h for protein expression. The supernatant was then concentrated using a Centramate cassette (10 kDa molecular weight cutoff for RBD, Pall Corporation, New York, USA). RBD and NTD proteins were purified by Ni-NTA Superflow (Qiagen, Hilden, Germany), followed by size exclusion chromatography and buffer exchange to phosphate-buffered saline (PBS). A 96-well enzyme-linked immunosorbent assay (ELISA) plate (Nunc MaxiSorp, Thermo Fisher Scientific) was first coated overnight with 100 ng per well of purified recombinant protein in PBS buffer. The plates were then blocked with 100 µl of Chonblock blocking/sample dilution ELISA buffer (Chondrex Inc, Redmon, US) and incubated at room temperature for 1 h. Each human plasma sample was diluted to 1:100 in Chonblock blocking/sample dilution ELISA buffer. Each sample was then added into the ELISA plates for a two-hour incubation at 37°C. After extensive washing with PBS containing 0.1% Tween 20, each well in the plate was further incubated with the anti-human IgG secondary antibody (1:5000, Thermo Fisher Scientific) for 1 hour at 37°C. The ELISA plates were then washed five times with PBS containing 0.1% Tween 20. Subsequently, 100 µL of HRP substrate (Ncm TMB One; New Cell and Molecular Biotech Co. Ltd, Suzhou, China) was added into each well. After 15 min of incubation, the reaction was stopped by adding 50 µL of 2 M H2SO4 solution and analyzed on a Sunrise (Tecan, Männedorf, Switzerland) absorbance microplate reader at 450 nm wavelength. BCR processing and computational procedures BCR preprocessing. For initial processing of the raw reads, we used pRESTO (version 0.5.13) (Vander Heiden et al., 2014) to assemble paired-end reads, remove sequences with a mean quality score less than 30, mask primer subsequences, and collapse duplicate sequences into unique sequences. The small fraction of paired-end reads that overlapped were assumed to be anomalous and were discarded from the analysis. Additionally, after preprocessing with pRESTO, we discarded unique reads that contained ambiguous calls (N's) in their receptor sequence. We performed two rounds of error correction on sequences that passed the quality control check. In the first round, we clustered singletons and other low-frequency sequences into larger sequences if they were similar in sequence. The intent of this round was to correct for sequencing errors (e.g. from reverse transcription of mRNA to cDNA) that caused large abundance clones to be split into many similar sequences. We used two parameters: Δ ; = 1.0, the marginal Hamming distance tolerance per decade in log-ratio abundance (each log10 unit allowing Δ ; additional sequence differences), and Δ > = 1.0, the marginal abundance tolerance of clusterable sequences per decade in log-ratio abundance (each log10 unit allowing abundance Δ > higher as clusterable). For example, a sequence with abundance 3 and a Hamming distance away from a higher abundance sequence with abundance A was absorbed into the latter only if ≤ Δ ; log 34 > C > D and 3 ≤ Δ > log 34 We used the output of this first round as input for the second round of error correction, in which we more aggressively target correction of reverse transcriptase errors. In the second round, we used two different parameters to assess sequence similarity: thresh = 2.0, the Hamming distance between sequences, and thresh = 1.0 , the ratio of sequence abundances. A sequence with abundance 3 and a Hamming distance away from a sequence of larger abundance A was absorbed into the latter only if ≤ thresh and the ratio of the sequence abundances was greater than thresh , i.e. > C > D ≥ thresh . This round of error correction allows much larger abundance sequences to potentially be clustered than is possible in the first round. For both of the above steps, we performed clustering greedily and approximately by operating on sequences sorted by descending abundance, assigning the counts of the lower abundance sequence to the higher abundance one iteratively. After error correction, the sequences still contained a large number of singletons, i.e. sequences with no duplicates (Table S1 ). We discarded these singletons from all analyses that relied on statistics of unique sequences (i.e., the results presented in Figs. S1, S2C,E,F). For each individual, error-corrected sequences from all timepoints and replicates were pooled and annotated by abstar (version 0.3.5) (Briney and Burton, 2018) . We processed the output of abstar, which included the estimated V gene/allele, J gene/allele, location of the HCDR3 region, and an inferred naïve sequence (germline before hypermutation). Sequences which had indels outside of the HCDR3 were discarded. We partitioned the sequences into two sets: productive BCRs, which were in-frame and had no stop codons, and unproductive BCRs, which were out-of-frame. Unproductive BCRs. Due to a larger sequencing depth in healthy individuals, we were able to reconstruct relatively large unproductive BCR lineages. Unproductive sequences are BCRs that are generated but, due to a frameshift or insertion of stop codons, are never expressed. These BCRs reside with productive (functional) BCRs in a nucleus and undergo hypermutation during B-cell replication and, therefore, provide a suitable null expectation for generation of BCRs in immune repertoires. To identify BCR clonal lineages, we first grouped sequences by their assigned V gene, J gene, and HCDR3 length and then used single-linkage clustering with a threshold of 85% Hamming distance. A similar threshold has been suggested previously by (Gupta et al., 2017) to identify BCR lineages. Clusters of size smaller than three were discarded from our analysis, size being defined as the sum of the amount of unique sequences per time point within a lineage. For each cluster, there may have been multiple inferred naïve sequences, as this was an uncertain estimate. Therefore, the most common naïve sequence was chosen to be the naïve progenitor of the lineage. When the most common naïve sequence of a productive lineage contained a stop codon, the progenitor of the lineage was chosen iteratively by examining the next most common naïve sequence until it did not contain any stop codons. If all inferred naïve sequences in a productive lineage had a stop codon, that lineage was discarded from the analysis. Table S1 shows the statistics of constructed clonal lineages in each individual. We used IGoR (version 1.4) (Marcou et al., 2018 ) to obtain a model of receptor generation. This model characterized the probability of generation gen (σ) of a receptor dependent on the features of the receptor, including the V, D, and J genes and the deletion and insertion profiles at the VD and DJ junctions. To characterize the parameters of this model, we trained IGoR on the progenitors of all unproductive lineages, regardless of size, pooled from all cohorts. For consistency with our receptor annotations based on abstar, we used abstar's genomic templates and the HCDR3 anchors of abstar's reference genome as inputs for IGoR's genomic templates and HCDR3 anchors. We used SONIA (version 0.27) (Sethna et al., 2020) to infer a selection model for progenitors of productive clonal lineages. The SONIA model evaluated selection factors to characterize the deviation in the probability post (σ) to observe a functional sequence in the periphery from the null expectation, based on the generation probability gen (σ): post (σ) = 3 G gen (σ) ' (:features * ( (+) , where is the normalization factor and . (σ) are selection factors dependent on the sequence features . These sequence features include V-gene and J-gene usages and HCDR3 length and amino acid composition (Sethna et al., 2020) . In our analysis, we used the SONIA left-right model with independent V-and J-gene usages (Sethna et al., 2020) . We used the output from IGoR as the receptor generation model for SONIA. We trained four cohort-specific SONIA models on progenitors of all productive lineages, regardless of size, pooled from all individuals within a cohort. 150 epochs and 500,000 generated sequences were used to train each SONIA model. Fig. 3 shows the distributions for the probabilities of observing productive receptors sampled from each cohort post (σ) and the correlation of feature-specific selection factors . among cohorts. Characterizing repertoire diversity. We quantified the diversity of each cohort by evaluating the entropy of receptor sequences in each cohort. Estimates of entropy can be influenced by how each distribution is sampled. To avoid such sampling biases, we used the inferred IGoR and SONIA models to generate 500,000 synthetics receptors based on each of the cohort-specific models. We evaluated cohort entropies as the expected log-probabilities to observe a functional sequence in the respective cohort: = − ∑ post (σ) log post (σ) + . For comparison, we also evaluated the entropy estimated on the repertoire data in each cohort, which showed a similar pattern to the estimates from the generated cohorts (in the main text). Specifically, the entropy of BCR repertoires estimated from the data follows: 39.8 ± 0.3 bits in healthy individuals, 41.9 ± 0.7 bits for patients in the mild cohort, 42.6 ± 0.3 bits for patients in the moderate cohort, and 42.8 ± 0.5 for patients in the severe cohort. The error bars indicate the standard error due to differences among individuals within a cohort. Clonal Lineage expansion. We studied clonal lineage expansion of BCR repertoires in individuals that showed an increase in the binding level (OD450) of their plasma to SARS-CoV-2 (RBD) during infection (Figs. 4A, S3): patients 2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14. Other individuals showed no increase in IgG binding to SARS-CoV-2 (RBD), either due to already high levels of binding at early time points or to natural variation and noise (Fig. S3 ). Our expansion test compared two time points. Therefore, for individuals with three time points, we combined data from different time points such that the separated times coincided with larger changes in binding levels (OD450). Specifically, we combined the last two time points for patients 2 and 7 and the first two time points for patient 9. In addition, we combined replicates at the same time point and filtered out small lineages with size < 3, where size was defined as the sum of the amount of unique sequences per time-point within a lineage. To test for expansion, we compared lineage abundances (i.e., total number of reads in a lineage) between early and late time points. Many lineages appeared only in one time point due to the sparse sampling of clonal lineages and the cells that generate them (Fig. S4 ). Therefore, we tested for expansion only for lineages that had nonzero abundances at both time points. Our expansion test relied on comparing the relative abundance of a given lineage with other lineages. However, due to primer-specific amplification biases, abundances were not comparable between reads amplified with different primers. Therefore, in our analysis we only compare a lineage with all other lineages that were amplified with the same primer. We applied a hypergeometric test (Fisher's exact test) to characterize significance of abundance fold change for a focal lineage. A similar method was used to study clonal expansion in TCRs (DeWitt et al., 2015) . For each focal clonal lineage (in a given individual), we defined a 2 × 2 contingency matrix , where Y early and Y late are the abundances of the focal lineage at the early and late time, and /Y early and /Y late are the total abundances of all reads (with the same primer) minus those from lineage at the early and late times. The ratio ( Y late / Y early )/( /Y ]^_` / /Ỳ^a ]b ) describes the fold change, or odds ratio, of lineage relative to the rest of the reads in the same primer group. Based on the contingency matrix , one-sided p-values for Fisher's exact test were calculated using the "fisher.test" function in R version 4.0. Fold change and p-values are shown in Fig. S6A . To determine a significance threshold for the Fisher's exact test, we examined the replicate data from samples collected from the same time point in each individual because we did not expect any significant expansion among replicates. We calculated the expansion test on pairs of replicates (Fig. S5B,C) , and we compared the empirical cumulative distributions of the time point and replicate expansion data (Fig. S5D,E) (Storey, 2002; Storey and Tibshirani, 2003) . We chose a p-value threshold of 10 cd44 , where there were 12.3 as many significant expansions as in the replicate data, and therefore the false discovery rate was approximately 1/(1 + 12.3) = 0.075. The probability that receptor σ is shared among a given number of individuals due to convergent recombination can be evaluated based on the probability to observe the receptor in the periphery post (σ), the size of the cohort , and the size of the repertoire (sequence sample size) . First, we evaluated the probability ρ(σ; ) that receptor σ with probability post (σ) appears at least once in a sample of size , The probability that receptor σ is shared among individuals out of a cohort of individuals, each with a (comparable) sample size , follows the binomial distribution, We aimed to identify shared receptors that were outliers such that their probability of sharing is too small to be explained by convergent recombination or other biases in the data. To do so, we identified the receptors with the smallest sharing probabilities share and found a threshold of post (dashed lines in Fig. 5 and Fig. S7 ) at the 1% quantile of share in the data. Specifically, since share is a function of post and (number of individuals sharing), for each we solved for post such that share = , and tuned the constant such that only 1% of the data lay below share . This was a conservative choice to identify the rare shared outliers in the data. BCR repertoire data can be accessed through: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA645245 All codes for data processing and statistical analysis can be found at: https://github.com/StatPhysBio/covid-BCR infection (shown in center). We distinguished between productive receptors and unproductive receptors that had frameshifts due to V(D)J recombination. In each patient, we constructed clonal lineages for productive and unproductive BCRs and inferred the naïve progenitor of the lineage (Methods). Bottom: 1. We inferred a model (Marcou et al., 2018) to characterize a null probability for generation of receptors gen ( ), trained on the set of unproductive inferred naïve BCRs. We inferred a selection model (Isacchini et al., 2020b) to characterize the deviation from the null among inferred naïve productive BCRs using the probability of entry to the periphery post ( ) and characterize selection factors . ( ) dependent on sequence features to differentiate between cohort statistics. 2. Based on temporal information of sampled BCRs, we identified clonal lineages that showed significant expansion during infection. 3. We identified progenitors of clonal lineages shared among individuals and assess the significance of these sharing statistics based on the probabilities to find each receptor in the periphery. We also identified previously characterized monoclonal antibodies (mAbs) specific to SARS-CoV-2 and SARS-CoV. IGH94-39 IGH94-59 IGH96-1 IGH94-34 IGH93-23 IGH94-4 IGH94-61 IGH93-7 IGH93-74 IGH93-30 IGH92-5 IGH94-31 IGH94-38-2 IGH93-21 IGH93-48 IGH94-30-4 IGH95-51 IGH93-9 IGH93-30-3 IGH93-33 IGH91-18 IGH93-53 IGH93-11 IGH92- Single amino acid mutations among HCDR3 of verified antibodies are in cyan. The complete list of verified antibodies is given in Table S3 . Figure S3 . ELISA binding assays for IgG and IgM repertoires against SARS-CoV-2 and SARS-CoV. Plasma binding levels (measured by OD450 in ELISA) against RBD, NTD, and S2 epitopes of SARS-CoV-2 and against RBD and NTD epitopes of SARS-CoV are shown. As seen in binding assays, many individuals developed a cross-reactive response to SARS-CoV-2 and SARS-CoV. Some individuals showed no increase in IgG binding to SARS-CoV-2 RBD due to already high levels at sampling time or natural variation. For the expansion analysis (Fig. 4) , we analyzed only individuals whose IgG repertoires showed an increase in binding to SARS-CoV-2 (RBD): 2, 3, 4, 5, 6, 7, 9, 10, 11, 13, and 14 . Histogram bin size is 0.5. The counts in each bin are scaled such that the maximum is equal to one for each column. The numbers above each column indicate the total number of sequences in the respective column. Sharing of rare lineages with log 34 post below the dashed line is statistically significant (see Methods). Ppost below the dashed line in Fig. 5 ) that exhibit lineage expansion in at least one individual is shown. These receptors are indicated by red diamonds in Fig. 5 . The 34 rare expanding lineage progenitors shown here are shared among four to 12 COVID-19 patients. Characterization of a high-affinity human antibody with a disulfide bridge in the third complementarity-determining region of the heavy chain Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing Massively scalable genetic analysis of antibody repertoires Potent neutralizing antibodies from COVID-19 patients define multiple targets of vulnerability The clonal selection theory of acquired immunity Immunity as an aspect of general biology Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput single-cell sequencing of convalescent patients' B cells A neutralizing human antibody binds to the N-terminal domain of the spike protein of SARS-CoV-2 B cell responses: cell interaction dynamics and decisions Dynamics of the cytotoxic T cell response to a model of acute viral infection Quantifying selection in immune receptor repertoires Predicting the spectrum of TCR repertoire sharing with a data-driven model of recombination Genomewide association study of severe Covid-19 with respiratory failure Deep sequencing of B cell receptor repertoires from COVID-19 patients reveals strong convergent immune signatures The promise and challenge of high-throughput sequencing of the antibody repertoire Clinical characteristics of coronavirus disease 2019 in China Hierarchical clustering can identify B cell clones with high confidence in Ig repertoire sequencing data Beyond the spike: identification of viral targets of the antibody response to SARS-CoV-2 in COVID-19 patients Signatures of selection in the human antibody repertoire: selective sweeps, competing subclones, and neutral drift SOS: online probability estimation and generation of T and B cell receptors Generative models of T-cell receptor sequences Immunobiology: the immune system in health and disease Human neutralizing antibodies elicited by SARS-CoV-2 infection Exploiting B cell receptor analyses to inform on HIV-1 vaccination strategies. Vaccines (Basel) 8 Receptor mimicry by antibody F045-092 facilitates universal binding to the H3 subtype of influenza virus Potent neutralizing monoclonal antibodies directed to multiple epitopes on the SARS-CoV-2 spike Cross-reactive antibody response between SARS-CoV-2 and SARS-CoV infections High-throughput immune repertoire analysis with IGoR The innate immune system: fighting on the front lines or fanning the flames of COVID-19? Human adaptive immune receptor repertoire analysis-past, present, and future B cell clonal expansion and convergent antibody responses to SARS-CoV-2 Fierce selection and interference in B-cell repertoire response to chronic HIV-1 Serological assays for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody Method for identification of condition-associated public antigen receptor sequences Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins Landscape of non-canonical cysteines in human VH repertoire revealed by immunogenetic analysis Convergent antibody responses to SARS-CoV-2 in convalescent individuals Immunosequencing: applications of immune repertoire deep sequencing Isolation of potent SARS-CoV-2 neutralizing antibodies and protection from disease in a small animal model Next generation sequencing of T and B cell receptor repertoires from COVID-19 patients showed signatures associated with severity of disease Population variability in the generation and thymic selection of T-cell repertoires Characterization of neutralizing antibodies from a SARS-CoV-2 infected individual Analysis of a SARS-CoV-2-infected individual reveals development of potent neutralizing antibodies with limited somatic mutation A direct approach to false discovery rates Statistical significance for genomewide studies Immunology of COVID-19: current state of the science pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires Broad sarbecovirus neutralizing antibodies define a key site of vulnerability on the SARS-CoV-2 spike protein Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China A noncompeting pair of human neutralizing antibodies block COVID-19 virus binding to its receptor ACE2 Assessment of B cell repertoire in humans Rapid isolation and profiling of a diverse panel of human monoclonal antibodies targeting the SARS-CoV-2 spike protein The authors declare no competing interests.