title: Disease diagnostics using machine learning of immune receptors
authors: Zaslavsky, Maxim E.; Ram-Mohan, Nikhil; Guthridge, Joel M.; Merrill, Joan T.; Goldman, Jason D.; Lee, Ji-Yeun; Roskin, Krishna M.; Cunningham-Rundles, Charlotte; Moody, M. Anthony; Haynes, Barton F.; Pinsky, Benjamin A.; Heath, James R.; James, Judith A.; Yang, Samuel; Blish, Catherine A.; Tibshirani, Robert; Kundaje, Anshul; Boyd, Scott D.
date: 2022-04-28; journal: bioRxiv; DOI: 10.1101/2022.04.26.489314

Abstract: Clinical diagnoses rely on a wide variety of laboratory tests and imaging studies, interpreted alongside physical examination and documentation of symptoms and patient history. However, the tools of diagnosis make little use of the immune system's internal record of specific disease exposures encoded by the antigen-specific receptors of memory B cells and T cells. We have combined extensive receptor sequence datasets with three different machine learning representations of the contents of immune repertoires to develop an interpretive framework, MAchine Learning for Immunological Diagnosis (Mal-ID), that screens for multiple illnesses simultaneously. This approach can already reliably distinguish a wide range of disease states, including specific acute or chronic infections and autoimmune or immunodeficiency disorders, and could contribute to identifying new infectious diseases as they emerge. Importantly, many features of the model of immune receptor sequences are human-interpretable. They independently recapitulate known biology of the responses to infection by SARS-CoV-2 or HIV, and reveal common features of autoreactive immune receptor repertoires, indicating that machine learning on immune repertoires can yield new immunological knowledge.
Modern medical diagnosis relies heavily on laboratory testing for cellular or molecular abnormalities in specimens from a patient, or the presence of pathogenic microorganisms 1, 2. For autoimmune disorders like lupus or multiple sclerosis, diagnosis via a combination of clinical or imaging abnormalities, detection of autoantibodies, and exclusion of other conditions is a lengthy process that can delay treatment 3, 4. Evolution has provided vertebrate animals with immune systems that carry out molecular surveillance for abnormal exposures, using B cells and T cells expressing diverse, randomly generated antigen receptors. In response to viruses, vaccines, and other exposures, the repertoire of B and T cell receptors changes in composition, due to clonal expansion of stimulated cells, introduction of additional somatic mutations into B cell receptor genes, and selection processes that further reshape the immune cell populations. In dysregulated immunity, self-reactive lymphocytes can also clonally proliferate and cause immunological pathologies. Being able to interpret the specificities encoded in a patient's adaptive immune system could allow assessment for many infectious diseases at once, as well as offering insight into autoimmune reactions 5-7. Tracking immune receptor repertoires has already proved useful in diagnosing lymphocyte malignancies and monitoring cancer treatment responses 8, 9. But other clinically relevant information in immune repertoires across the wide variety of infectious disease, autoimmunity, and immunodeficiency disorders has remained largely inaccessible. At issue is the high variability of immune receptor genes due to somatic rearrangement. To overcome this challenge, we hypothesized that the B cell immune response to each disease exhibits distinct and systematic patterns across people that can be identified with machine learning techniques for biological sequence data.
Using systematically collected datasets of B cell receptor (BCR) heavy chain (IgH) sequences from peripheral blood, we identify the presence of infectious and immunological diseases by developing and combining three machine learning models (Figure 1). Previous modeling attempts relied on nearly identical receptor sequence segments as a sign that two people have the same disease 10-13. Instead, we group individuals by inferring broader functional similarities in their immune receptors. We also detect other shared characteristics of immune responses: the extent of class switching of antibody constant regions, the degree of somatic mutation diversification of BCR repertoires, and the effects of selection in distorting quantitative features like immunoglobulin heavy chain complementarity determining region 3 (CDR-H3) length. The machine learning process distinguishes healthy from diseased individuals, viral infections from autoimmune or immunodeficiency conditions, and different pathogen infections from each other - without prior knowledge of pathogenesis. This approach also generates interpretable rankings of disease-specific sequences, revealing that the classifier recapitulates independently discovered biological facts, including identifying neutralizing antibodies to SARS-CoV-2. It also pinpoints V(D)J rearrangement gene segment and receptor sequence length patterns in different diseases that are less well characterized in the literature. Even in patients with an active infectious disease, only a fraction of immune receptors may be devoted to the responsible pathogen. To determine an individual's immune status from BCR sequencing, a diagnostic algorithm must sift through hundreds of thousands of unique sequences to identify the rare specific ones 14. Candidate disease-specific BCRs can be highly variable across individuals.
Exact T cell receptor sequence or motif matches have previously been used to identify individuals exposed to the same disease 10, 15, but B cell receptors show greater sequence diversity due to somatic hypermutation during B cell stimulation.

[Figure 1 caption (partial): (b) We extracted sequence features with an immune language model. Each raw sequence was converted into a numerical vector. The language model can group related sequences between people. We then trained an ensemble of machine learning models to predict disease from immune receptor sequence features. (c) We applied the ensemble of models to predict the disease status of held-out test individuals. (d-f) We interpreted the language representation sequence model: (d) We found suspected antigen-specific immune receptors by ranking sequences according to their predicted disease specificity. (e) We constructed a reference map of immune receptor sequences; each point is one sequence. (f) We visualized a held-out test patient's immune status over time by overlaying their sequences on the reference map. The immune response to disease is visible in blue.]

Here, we use a combination of three models to improve recognition of distinct kinds of disease states, and to identify similar receptor sequences selected for binding to disease-related antigens. Each classifier model extracts different aspects of immune repertoires (Figure 1B). The first model predicts whether a person's IgH repertoire differs from those seen in healthy control individuals without recent immune system stimulation, providing a means of detecting active immune responses, such as those in illness or vaccination. The second predictor identifies groups of highly similar sequences across individuals. The third classifier evaluates a broader proxy for functional similarity, rather than direct sequence identity, to find more loosely related immune receptors with common antigen targets. The three models are then blended into a final prediction of immune status.
The final trained program accepts an individual's collection of IgH sequences from peripheral blood B cells as input, and returns a prediction of the probability that the person has each disease on record (Figure 1C). We applied this approach to cohorts of patients with diagnoses of Covid-19 (n=75), HIV (n=96) 11, Systemic Lupus Erythematosus (SLE, n=17), and Common Variable Immunodeficiency (CVID, n=13) 16, and healthy controls (n=154) - 355 individuals in total. We combined new datasets with ones previously reported, all collected with a standardized protocol of IgH amplicon sequencing from peripheral blood B cells. The non-Covid-19 cohorts were collected before the emergence of SARS-CoV-2. To evaluate whether our proposed strategy can generalize to new immune repertoires, patients were strictly separated into training and testing sets; we trained separate models for each cross-validation group and report averaged classification performance. As described below, we also tested and excluded the possibility that demographic differences between cohorts could explain diagnosis accuracy. Details of the three models follow.

Global immune status prediction: The first machine learning model predicts whether an individual's IgH repertoire composition differs from healthy blood donors. Previously, we identified class-switched IgH sequences with low somatic hypermutation (SHM) frequencies in acute Ebola or Covid-19 cases, consistent with naive B cells recently having class-switched during the response to infection 17-19. We trained a lasso linear model with each isotype's proportion of mutated sequences and IGHV gene/IGHJ gene counts as features. These features may also capture BCR repertoire changes accumulated in chronic conditions. Certain IGHV gene segments may be more prevalent among antigen-responding V(D)J rearrangements than in the general population of immune receptors.
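The repertoire-composition featurization described above (per-isotype mutated-sequence proportions plus IGHV/IGHJ gene usage) can be sketched as follows. The dictionary fields, gene vocabularies, and normalization are illustrative assumptions rather than Mal-ID's exact implementation; the resulting vector would be passed to a lasso (L1-regularized) linear classifier.

```python
from collections import Counter

# Toy vocabularies for illustration only; the real model uses the full
# sets of isotypes and IGHV/IGHJ genes observed in the data.
ISOTYPES = ["IgD/M", "IgG", "IgA"]
V_GENES = ["IGHV1-24", "IGHV3-53", "IGHV4-34"]
J_GENES = ["IGHJ4", "IGHJ6"]

def repertoire_features(sequences):
    """sequences: list of dicts with 'isotype', 'v_gene', 'j_gene', 'mutated' (bool).
    Returns one flat feature vector for a repertoire."""
    feats = []
    # Proportion of somatically mutated sequences, computed per isotype
    for iso in ISOTYPES:
        subset = [s for s in sequences if s["isotype"] == iso]
        feats.append(sum(s["mutated"] for s in subset) / len(subset) if subset else 0.0)
    # Normalized IGHV / IGHJ gene usage counts
    v_counts = Counter(s["v_gene"] for s in sequences)
    j_counts = Counter(s["j_gene"] for s in sequences)
    n = len(sequences) or 1
    feats += [v_counts[g] / n for g in V_GENES]
    feats += [j_counts[g] / n for g in J_GENES]
    return feats

repertoire = [
    {"isotype": "IgG", "v_gene": "IGHV3-53", "j_gene": "IGHJ6", "mutated": False},
    {"isotype": "IgG", "v_gene": "IGHV3-53", "j_gene": "IGHJ4", "mutated": True},
    {"isotype": "IgD/M", "v_gene": "IGHV4-34", "j_gene": "IGHJ4", "mutated": True},
]
```

Each repertoire thus reduces to a fixed-length vector regardless of how many sequences it contains, which is what makes a simple linear model applicable.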
As antigen-binding B cells become clonally expanded, the distribution of IGHV gene usage across the repertoire can change 17, 18. Other groups have piloted immune status classification using deviations in V(D)J recombination gene segment usage from a healthy baseline 20, 21. The classifier distinguished 201 Covid-19, HIV, SLE, and CVID patients from 154 healthy individuals with an area under the Receiver Operating Characteristic curve (AUC) score of 0.91. AUC is the likelihood that the model ranks a randomly chosen positive example over a negative example - representing whether the classifier tends to assign high probability to the correct class and low probability to incorrect classes 22. (Other performance metrics are provided in Supplementary Table 1.)

Convergent clustering of antigen-specific sequences by edit distance: The second classifier detects highly similar CDR3 amino acid sequences shared between individuals with the same diagnosis, as we and others have previously reported 11, 15, 23. The CDR-H3s are the highly variable heavy chain regions that often determine B cell receptor binding specificity. We clustered CDR-H3 sequences with the same IGHV gene, IGHJ gene, and CDR-H3 length, and at least 85% sequence identity - allowing for B cell receptor variability created by somatic hypermutation. As in previous work 11, we selected clusters enriched for sequences from subjects with a particular disease. These clusters represent exemplar sequences predictive of a specific disease across individuals. A new sample's sequences can then be assigned to nearby clusters. To predict immune status, we used the number of disease-specific clusters each specimen falls into as features in a lasso linear model. This CDR-H3 sequence clustering approach predicted patient disease state with 0.82 AUC - when it was able to make a prediction. 1.4% of samples did not have any sequences meeting the clustering parameters and edit-distance criteria that defined the clusters.
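The convergent clustering criteria described above (shared IGHV gene, IGHJ gene, and CDR-H3 length, with at least 85% CDR-H3 amino acid identity) can be sketched as follows. The greedy single-linkage assignment here is a simplification of the published clustering procedure, not a reproduction of it.

```python
from collections import defaultdict

def cdr3_identity(a, b):
    """Fraction of matching positions; sequences have equal length by construction."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def convergent_clusters(seqs, min_identity=0.85):
    """seqs: list of (v_gene, j_gene, cdr3) tuples.
    Sequences may only cluster together when they share IGHV gene, IGHJ gene,
    and CDR-H3 length; within such a group, a sequence joins the first cluster
    containing a member with >= min_identity CDR-H3 identity (single linkage)."""
    groups = defaultdict(list)
    for v, j, cdr3 in seqs:
        groups[(v, j, len(cdr3))].append(cdr3)
    clusters = []  # list of (group_key, [member CDR-H3s])
    for key, cdr3s in groups.items():
        for cdr3 in cdr3s:
            for cluster in clusters:
                if cluster[0] == key and any(
                    cdr3_identity(cdr3, m) >= min_identity for m in cluster[1]
                ):
                    cluster[1].append(cdr3)
                    break
            else:
                clusters.append((key, [cdr3]))
    return clusters

seqs = [
    ("IGHV3-53", "IGHJ6", "ARDYYGMDVW"),
    ("IGHV3-53", "IGHJ6", "ARDYYGMDIW"),  # 1 mismatch of 10 -> 90% identity
    ("IGHV3-53", "IGHJ6", "ARDAAAAAVW"),  # too dissimilar -> own cluster
]
```

Counting, per specimen, how many disease-enriched clusters its sequences fall into then yields the lasso features described in the text.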
The classifier abstained from making any prediction for these challenging samples.

Language model feature extraction from B cell receptor sequences: Amino acid edit distance may not be an optimal measure of receptor similarity. CDR-H3 sequences encode complex three-dimensional structures, and small sequence changes can cause important structural changes, while different structures with divergent primary amino acid sequences can bind the same target antigen 24, 25. While disease-associated receptors may have lexically dissimilar sequences, they may still share the function of binding to the same target. The third classifier in our framework aims to map primary amino acid sequences into a lower-dimensional space that better captures functional similarities, not just the lexical proximity represented by edit distance. Extending beyond prior research that used amino acid biochemical properties, like charge and polarity, rather than edit distance alone to find receptor groups 26-29, we extracted a putative functional representation of B cell receptors. To do so, we used UniRep, one of many self-supervised protein language models shown to learn functional properties for prediction tasks with an approach borrowed from natural language processing 30, 31. Much as words are the building blocks arranged by grammatical rules to convey meaning, protein sequences are built from amino acids composed in an order compatible with polypeptide chain folding, assuming a structure that can carry out functions like binding to another molecule or catalyzing a chemical reaction. UniRep was trained to predict randomly masked amino acids using the unmasked amino acids in the remaining sequence context of each protein. This requires learning the short- and long-range relationships between different regions of a sequence, analogous to learning natural language phrases and grammar rules to anticipate the next word in a sentence.
To achieve its task, the UniRep recurrent neural network compresses each sequence into an internal, low-dimensional embedding, capturing traits that allow accurate reconstruction. If the final model can successfully un-mask protein sequences, the compression and decompression have extracted fundamental features that summarize the input sequences. UniRep's internal representation was shown to encode fundamental properties like structural classes 30. UniRep was originally trained on over 20 million proteins from many organisms 30. We hypothesized that by creating a version specialized for immune receptor proteins, we might obtain improved representations for IgH repertoire classification. We continued UniRep's training procedure, but this time to better reconstruct masked B cell receptor sequences (Methods). While prior autoencoder models have enabled classification of clusters of similar sequences 32, 33, our fine-tuned language model approach combines knowledge of global patterns in proteins from many domains of life with the specific intricacies of BCR variation. For the disease classification task, we used the low-dimensional embedding learned by the fine-tuned language model to transform each IgH sequence into a 1900-dimensional numerical feature vector, regardless of the IgH sequence length. Then we trained a lasso linear model that maps receptor sequence vectors to disease labels. By aggregating each sequence's predicted class probabilities using a trimmed-mean calculation, we arrived at patient-level predictions of specific disease exposures. We chose the trimmed mean because it is a central estimate robust to noisy contamination by the many background, non-disease-related sequences that are part of any repertoire. The language model embedding strategy achieved 0.84 AUC classification performance with no abstentions, demonstrating that the language model learns the syntax of immune receptor sequences, despite their enormous diversity.
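The trimmed-mean aggregation from per-sequence class probabilities to a patient-level prediction can be sketched as follows; the trim fraction and data structures are illustrative assumptions, not Mal-ID's actual settings.

```python
def trimmed_mean(values, trim=0.1):
    """Mean after discarding the lowest and highest `trim` fraction of values,
    making the estimate robust to outlying probabilities from the many
    background, non-disease-related sequences in a repertoire."""
    vals = sorted(values)
    k = int(len(vals) * trim)
    kept = vals[k : len(vals) - k] if k else vals
    return sum(kept) / len(kept)

def patient_prediction(seq_probs, trim=0.1):
    """seq_probs: one dict per sequence, mapping disease label -> predicted
    probability from the per-sequence lasso model. Returns aggregated
    patient-level probabilities per disease label."""
    labels = seq_probs[0].keys()
    return {lab: trimmed_mean([p[lab] for p in seq_probs], trim) for lab in labels}

# Example: one extreme outlier (0.1) is trimmed away before averaging
probs = [{"Covid-19": p} for p in [0.9, 0.8, 0.85, 0.1, 0.95]]
```

A plain mean over the same five values would be dragged down by the outlier; the trimmed mean ignores it, which is the robustness property motivated in the text.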
Because this classifier starts with a predictor for individual receptors, then aggregates sequence calls into a patient-level prediction 26, 34, it allows interpretation of which sequences matter most for prediction of each disease - a major advantage over black-box methods. Below, we confirm that sequences prioritized by our predictor are enriched for disease-specific and antigen-specific IgH.

Ensemble: Finally, we combined all three classifiers - the "healthy or not" global repertoire composition, CDR-H3 sequence clustering, and language model representation strategies - into an ensemble immune status predictor (Supplementary Figure 1). We call this adaptive immune receptor analysis framework MAchine Learning for Immunological Diagnosis (Mal-ID). By blending the probabilistic outputs from multiple classifiers trained with different strategies, the meta-model exploits each predictor's strengths and can resolve mistakes 35. (As with the other models, we trained a separate meta-model for each cross-validation group.) This ensemble approach distinguished five specific disease states in 414 specimens from 355 individuals with 0.96 AUC, compared to 0.82 AUC achieved by the CDR-H3 clustering model alone (Supplementary Table 1). When the ensemble model made a final decision for a given specimen by selecting the disease state (class) with the highest predicted probability, the model achieved 85% accuracy across all held-out test sets (Figure 2A). In practice, the precise threshold on the predicted probability of each disease state can be tuned to disease prevalence and the desired tradeoff between precision and recall. The 85% of correct classifications included specimens from all disease types, even unusual immune states like CVID. The CVID cohort was limited to challenging patients measured to have moderate levels of immune disruption 16, and the HIV cohort included individuals with widely varying breadth of neutralizing antibodies.
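The ensemble step can be sketched under simplifying assumptions: here the base classifiers' probabilistic outputs are blended with fixed weights and the highest-probability class is selected, whereas Mal-ID trains a meta-model on the stacked base-model outputs per cross-validation group. The abstention rule based on the gap between the top two class probabilities is also illustrative.

```python
def blend(base_model_probs, weights=None):
    """base_model_probs: one dict per base classifier, mapping disease label
    -> predicted probability. Returns blended per-label probabilities.
    Equal weights are an assumption standing in for the trained meta-model."""
    weights = weights or [1 / len(base_model_probs)] * len(base_model_probs)
    labels = base_model_probs[0].keys()
    return {lab: sum(w * p[lab] for w, p in zip(weights, base_model_probs))
            for lab in labels}

def final_call(probs, min_gap=0.0):
    """Select the disease state with the highest blended probability.
    Optionally abstain (return None) when the top-two probability gap is
    below min_gap, mirroring the low-confidence behavior described next."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_gap:
        return None
    return ranked[0][0]

# Three hypothetical base-model outputs for one specimen
p1 = {"Covid-19": 0.7, "Healthy": 0.3}
p2 = {"Covid-19": 0.6, "Healthy": 0.4}
p3 = {"Covid-19": 0.8, "Healthy": 0.2}
```

Raising `min_gap` trades coverage for reliability: more specimens are abstained on, but the remaining calls are higher-confidence.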
Of the 15% of misclassified repertoires, 1.4% were abstentions carried over from the CDR-H3 sequence clustering component of the model ensemble. In the remaining ~14% of classification mistakes, the ensemble model tended to have low confidence in its prediction (Figure 2B). Allowing the strategy to abstain from inconclusive predictions is important to make the diagnostic robust for challenging real-world cases.

[Figure 2 caption (partial): (b) Model prediction confidence for correct versus incorrect predictions, as measured by the difference between the top two predicted class probabilities. A higher difference implies that the model is more certain in its decision to predict the winning disease label, whereas a low difference suggests that the top two possible predictions were a toss-up. (c) Sequences in our data convergent to known SARS-CoV-2 binders (orange) were ranked significantly higher than other sequences by the model (one-sided Wilcoxon rank-sum test, U-statistic = 2.770e8, p <= 1e-153). Non-overlapping sequences (blue) may include additional SARS-CoV-2 binders not yet identified in the literature.]

To study whether extraneous covariates were confounding our results, we investigated whether we could distinguish the age, sex, or ancestry of healthy immune receptor repertoires. By training new classifiers to predict these variables, we found that the sex of a healthy individual could not accurately be determined from IgH sequences (Supplementary Table 2, Supplementary Figure 3C). However, IgH sequences did carry a weak signal of ancestry, with 0.78 AUC predictive power (Supplementary Figure 3D). This signal may have been amplified because many individuals with African ancestry in the cohorts we studied live in Africa and have potentially different environmental exposures. Healthy IgH sequence repertoires also carried a signal of age. Previous studies have tracked immune aging in gene expression, cytokine levels, and immune cell type frequencies 36, 37.
We also see a modest effect of aging through immune receptor sequences. When we dichotomized age as under or over 50 years old to cast this continuous variable as a classification problem, the prediction model achieved 0.70 AUC (Supplementary Figure 3B) . However, the signatures of age detected by the classifier may correspond to different background or environmental exposures for people over 50 versus younger individuals. For instance, circulating influenza virus types have changed following successive pandemics. The first influenza strain someone is exposed to generates a bias in their influenza responses thereafter, likely by forming memory B cell pools with specificities related to early virus exposures 38 . When we instead divided age into groups by decade, the model achieved only 0.60 AUC and abstained from prediction on almost a fifth of individuals (Supplementary Figure 3A) . This worse performance suggests that more granular aging differences are challenging to disentangle at the sequence level with the number of participants, the age ranges, and the cell sampling and sequencing depth in this study. Also, our study was restricted to somatically hypermutated IgD/IgM and class switched IgG/IgA isotypes, reflecting the populations of B cells that are shaped by antigenic stimulation and selection. Studying naive B cells may reveal additional age, sex, or ancestry effects. We also investigated whether subtle demographic differences between cohorts drove our disease classification results. For example, the age medians and ranges of the cohorts were: HIV (median 31 years, range 19-64); SLE (median 37 years, range 21-71); healthy controls (median 45 years, range 17-87); Covid-19 (median 48 years, range 21-90); CVID (median 50 years, range 23-76). The percentage of females in each cohort was 46% (CVID), 47% (healthy controls), 52% (Covid-19), 64% (HIV), and 93% (SLE). The prevalence of females in our SLE cohort matches general epidemiology 39 . 
The ancestries and geographical locations of participants also differed between cohorts. Most notably, at least 89% of individuals with HIV were from Africa 11. 84% of individuals known to have Hispanic/Latino ancestry were in the Covid-19 cohort, and 74% of Caucasians were healthy controls. To show that demographic metadata are insufficient to predict disease in our dataset, we attempted to predict disease state from age, sex, or ancestry alone, without using sequence patterns at all. No classifier exceeded 0.78 AUC, far lower than the 0.946 AUC when we retrained the Mal-ID sequence prediction ensemble on individuals with known demographics only (Supplementary Table 3, Supplementary Figure 3E-H). Finally, we also retrained the ensemble disease classification meta-model with age, sex, and ancestry effects regressed out from the ensemble feature matrix. After this correction, classification performance on the 273 individuals with full demographic information available dropped slightly from 0.946 AUC to 0.891 AUC (Supplementary Table 3, Supplementary Figure 3I-J). The small decrement in performance after decorrelating sequence features from demographic covariates suggests that age, sex, and ancestry effects have, at most, a modest impact on disease classification.

We designed our machine learning framework to identify biologically interpretable features of the immunological conditions we studied, not just provide a black-box classifier. To assess the ties between the accurate machine learning classification and known biology, we examined which sequences contributed most to predictions of each disease. For example, we ranked all sequences from Covid-19 patients by the predicted probability of their relationship to SARS-CoV-2 immune response using the Mal-ID language model classifier.
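The sequence-ranking step used for interpretation can be sketched as follows; the per-sequence identifier and probability structures are illustrative assumptions.

```python
def rank_sequences(seq_probs, disease):
    """seq_probs: list of (sequence_id, {disease label: predicted probability})
    pairs, one per sequence from a patient cohort. Returns sequence IDs sorted
    from highest to lowest predicted probability of association with `disease`,
    so the top of the list holds candidate antigen-specific receptors."""
    return [sid for sid, probs in
            sorted(seq_probs, key=lambda sp: sp[1][disease], reverse=True)]

# Hypothetical per-sequence outputs from the language model classifier
cohort_seqs = [
    ("seq_a", {"Covid-19": 0.2}),
    ("seq_b", {"Covid-19": 0.9}),
    ("seq_c", {"Covid-19": 0.5}),
]
```

Summaries over this ranking, such as which IGHV genes or CDR-H3 lengths dominate its top entries, are what the following paragraphs compare against independently isolated antibodies.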
In discriminating between different diseases, sequences highly prioritized for Covid-19 prediction included IGHV gene segments seen in independently isolated antibodies with strong SARS-CoV-2 binding. For example, IGHV3-53, IGHV3-9, and IGHV2-70 have been implicated in spike protein receptor-binding domain binding 40, 41, and were highly ranked (Figure 3B). So was IGHV1-24, found in N-terminal domain-directed antibodies 42. We observed a similar pattern for HIV rankings. IGHV4-34, an IGHV gene previously described in HIV-specific B cell responses - with unusually high somatic hypermutation frequencies in individuals producing broadly neutralizing antibodies 11 - was ranked highly by the model (Figure 3C). IGHV4-38-2 was also highly ranked for HIV prediction, and was prevalent among HIV-specific B cells in another analysis 43. However, IGHV4-38-2 gene usage is significantly more common in African populations in our data (Supplementary Figure 4), as in prior literature 44. The model may have especially prioritized the IGHV4-38-2 gene because our HIV cohort is predominantly of African ancestry. Other IGHV genes flagged by the model are not stratified by ancestry (Supplementary Figure 4). The model's prioritization of IGHV5-51 for SLE prediction matches prior data 45. But in this classification, previously reported IGHV genes, like IGHV4-34, IGHV4-39, and IGHV4-59 46, were not distinctive for SLE prediction (Figure 3D). IGHV4-34, in particular, has been reported to be expressed at higher frequencies in SLE patients 47. The language model classifier instead made greater use of IGHV4-34 for identifying individuals with HIV or CVID.
The predictor's low ranking of IGHV4-34 as an SLE-defining characteristic may seem at odds with its importance in the literature, but this discrepancy may reflect differences between secreted antibody protein concentrations and peripheral blood B cell IGHV gene usage, and the fact that increased usage of a particular IGHV gene segment may be seen in several clinical conditions. Clinical tests for the presence of IGHV4-34 antibodies have 70% sensitivity and 95% specificity for SLE 47. In other words, seven of ten SLE patients are expected to have IGHV4-34 serum antibodies, while only five in 100 healthy individuals may have them 47. Because IGHV4-34 antibodies are notable for their inherent autoreactivity and are expressed at decreased frequencies in healthy people, but are not necessarily a specific marker of lupus, they may be a general marker of several immune disorders 11, 48, 49. Similar reasons may explain why IGHV4-39 and IGHV4-59, two other genes highlighted in prior SLE literature 46, were ranked highly for HIV and CVID prediction. While any one of these V genes on its own is insufficient to diagnose lupus, a clearer picture emerges in combination with the other IGHV genes and specific sequence patterns prioritized by the machine learning model. The IGHV5-51 gene prioritized for SLE was also highly ranked for CVID prediction (Figure 3A). Surprisingly, the Mal-ID language model predictor assigned a very high CVID ranking to IGHV5-10-1, which is part of a haplotype variant that can replace IGHV3-9 and IGHV1-8 44. The IGHV5-10-1 gene has also been associated with other conditions with autoimmunity, such as autoimmune hepatitis and celiac disease, where IGHV5-51 has also been implicated 50, 51. Together with the increased usage of IGHV5-10-1 in CVID patients, this may reflect a predisposition to CVID related to germline variation in the immunoglobulin heavy chain locus. The sequence model's rankings also favored certain CDR-H3 lengths.
This was notable because there is no direct input into the model of raw CDR-H3 sequences or their lengths; all UniRep embedding vectors provided as input to the model have identical sizes, regardless of original sequence lengths. Shorter CDR-H3 lengths were favored by the model for the chronic diseases SLE and HIV (Figure 3C-D), consistent with our previous observations about selection for B cell receptors with shorter CDR-H3 segments in HIV 11. On the other hand, sequences with longer CDR-H3 lengths were favored by the sequence model for Covid-19 class prediction (Figure 3B). These prioritized sequences could reflect B cell clones recently derived from naive B cells that have not yet had time to undergo selection, which favors shorter CDR-H3 lengths in memory B cells (Supplementary Figure 5). To prevent isotype sampling artifacts from driving disease predictions, we designed the sequence model to apply balanced weights to all isotypes. As a result, all isotypes were included among model-prioritized sequences for prediction of each disease (Supplementary Figure 6). For Covid-19 prediction, IgG sequences played a slightly bigger role than other isotypes, as may be expected in this infectious disease 18, 52-54. The other models used in the Mal-ID ensemble were also designed not to be influenced by isotype sampling amounts. The repertoire composition model quantifies each isotype group separately, and the convergent clustering approach is blind to isotype information. To be sure that differences in isotype proportions between patient cohorts were not sufficient to predict disease, we also trained a separate model to predict disease from a sample's isotype balance alone - with no sequence information provided. The isotype-proportions model achieved only 0.69 AUC, far lower than Mal-ID's 0.96 AUC disease classification performance. Therefore, Mal-ID's classification is robust to data artifacts like isotype proportions.
Only a small minority of peripheral blood B cell receptor sequences from Covid-19 patients are directly related to the antigen-specific immune response to SARS-CoV-2. Other naive and memory B cells continue to circulate even during acute illness 55. The 0.96 AUC performance suggests that the ensemble model addresses this "needle in the haystack" issue. We inspected the sequences selected by our language model classifier to assess how important sequences are prioritized. We matched Covid-19 patient sequences to their nearest neighbors in the CoV-AbDab database of antibodies known to bind the virus 56, allowing up to 15% amino acid mismatches in CDR-H3 arising from somatic hypermutation and other sources of clonal diversity. The dataset of sequences known to bind to SARS-CoV-2 was collected by orthogonal experimental methods, such as direct isolation of B cells that bind the SARS-CoV-2 receptor binding domain (RBD), followed by BCR sequencing 57. Unlike our global repertoire scans of a limited number of patients, the external database includes larger source cohorts, meaning it may contain more Covid-19 response types than our dataset. The database is also biased towards potential therapeutic antibodies identified by isolating spike antigen-specific B cells. Despite these differences, we found sequences in our data with high sequence identity (shared IGHV and IGHJ genes, and over 85% amino acid identity in CDR-H3) to known SARS-CoV-2 antigen-specific antibodies - comprising over 10% of known binding antibodies in the CoV-AbDab database, covering all major epitopes and IGHV genes (Supplementary Figure 2A-B). 64% of the matching sequences in our dataset were IgG sequences, followed by IgD/M (19%) and IgA (7%), with the final 10% seen in multiple isotypes. This IgG dominance pattern reflects how sequences that class switched to IgG were stimulated by antigen, and is consistent with the isotype relationships examined above.
As a negative control, we repeated the process with sequences from healthy subjects. Healthy subject-originating sequences matched 6.8% of CoV-AbDab clusters in total. Nearly 92% of the matched healthy control sequences came from the IgD/M isotypes, most likely representing naive B cells that could mount a response should SARS-CoV-2 enter the body. Matches were rare: 0.09% of unique Covid-19 patient sequences from our dataset matched any CoV-AbDab cluster, along with 0.008% of unique healthy subject sequences. This order of magnitude difference is expected because antigenic stimulation of the IgG population in Covid-19 patients results in clonal expansion and diversification through somatic hypermutation. In support of the biological relevance of Mal-ID's top-ranked sequences, many were independently validated to bind SARS-CoV-2. Covid-19 patient-originating sequences that overlapped with this known-binder database were assigned significantly higher ranks by the prediction model (Figure 2C, Supplementary Figure 2C-D). 84% of matches occurred in the top half of ranked sequences. These binding relationships were not known to the classifier at training time, and the CoV-AbDab sequences were not used in the model. The concordance between automatically-prioritized sequences and experimentally-validated, disease-specific sequences from separate cohorts suggests the language model classifier learned meaningful rules that recapitulate biological knowledge gained during the extraordinary international research effort in response to the COVID-19 pandemic. Additionally, sequences targeting the SARS-CoV-2 RBD made up a smaller portion of the overlapping datasets than they comprise of the full CoV-AbDab database (Supplementary Figure 2B). This difference suggests that while RBD-targeting receptors have been prioritized in experimental studies 58, the immune response for other epitopes is of equal or greater importance for disease classification.
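The "top half of ranked sequences" enrichment summary above can be computed as follows; the one-sided Wilcoxon rank-sum test reported in the text is the formal version of this comparison, and the ID-based structures here are illustrative.

```python
def top_half_fraction(ranked_ids, matched_ids):
    """ranked_ids: sequence IDs sorted from highest to lowest model-predicted
    disease specificity. matched_ids: the subset of IDs that matched the
    external database of known binders. Returns the fraction of matched
    sequences falling in the top half of the ranking; 0.5 is the value
    expected under no enrichment."""
    half = set(ranked_ids[: len(ranked_ids) // 2])
    matched = set(matched_ids)
    return len(matched & half) / len(matched)

# Hypothetical ranking of ten sequences; three of four database matches
# land in the top half, i.e. an enrichment above the chance level of 0.5
ranking = list(range(10))
matches = [0, 1, 2, 8]
```

Values well above 0.5, like the 84% reported in the text, indicate that database-matched binders concentrate among the model's most disease-specific sequences.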
To evaluate further disease-specific insights from Mal-ID, we developed a novel immune repertoire visualization to convey disease status at a glance. From the training set, we created a reference two-dimensional UMAP layout using receptors that the Mal-ID language model classifier learned to confidently separate into distinct groups by immune state (Figure 4A). Since this supervised UMAP is conditioned on the disease labels assigned to sequences, any visual distortions created by the reduction to two dimensions are less likely to bias against the disease classes. We then overlaid sequences that were held out from the training set onto the reference UMAP visualization. For example, we visualized the Mal-ID language model's interpretation of therapeutic monoclonal antibodies against SARS-CoV-2 59 . Six of eight antibodies were projected into an immune receptor region specific to the Covid-19 response (Supplementary Figure 7). Since Mal-ID was trained on sequence pseudo-labels created from each individual's global disease type, we do not expect perfect sequence-level classification, despite successful repertoire-level disease predictions. One of the mis-projected antibodies (bebtelovimab) had low-confidence sequence predictions (less than 1% difference between the top two predicted class probabilities), and is distinct from the others because it maintained efficacy against circulating Omicron variants of concern 60 . With the same visualization technique, we can project repeated samples from a held-out test set patient onto the reference map, enabling monitoring of immune repertoire composition over time. Unlike the therapeutic mAb visualization of only eight sequences in total, patient repertoires contain a multitude of high-confidence and low-confidence sequences for disease prediction. Sequences with low-probability predictions by Mal-ID can be excluded to focus the visualization on BCRs likely to be disease-specific. 
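The low-confidence filter described above (excluding sequences whose top two predicted class probabilities differ by less than 1%) can be sketched as follows; `confident_mask` is a hypothetical helper name, not from the paper.

```python
import numpy as np

def confident_mask(probs: np.ndarray, min_gap: float = 0.01) -> np.ndarray:
    """Keep sequences whose top two predicted class probabilities differ
    by at least `min_gap` (e.g. 1%); low-confidence sequences are
    dropped before plotting so the map highlights BCRs likely to be
    disease-specific."""
    sorted_probs = np.sort(probs, axis=1)
    gap = sorted_probs[:, -1] - sorted_probs[:, -2]
    return gap >= min_gap

probs = np.array([
    [0.70, 0.20, 0.10],    # confident: top-two gap is 0.50
    [0.34, 0.335, 0.325],  # low confidence: top-two gap is 0.005
])
mask = confident_mask(probs)
```

Only rows where `mask` is true would be projected onto the reference UMAP layout.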
As one example Covid-19 patient's infection and immune responses progressed, the visualization revealed their collection of immune receptors shifting from the Healthy/Background region (green) at day 5 after onset of symptoms, into a Covid-19 area (blue) by day 12 (Figure 4B-C).

Pathogenic exposures shape the immune system's collection of antigen-specific adaptive immune receptors, forming a record of past and present illnesses. The pathogenic immune responses of autoimmune diseases and the impaired immunity of immunodeficiency conditions are also associated with distinctive alterations in the receptors expressed by B cells and T cells. We applied a three-part machine learning analysis framework to well-characterized disease datasets derived from over 350 individuals with five distinct immunological states, classifying immune responses of study participants with performance of 0.96 AUC. Faced with highly diverse repertoires containing hundreds of thousands of unique sequences, the Mal-ID ensemble of classifiers learned disease-specific patterns and chose meaningful sequences for prediction of viral infections, an autoimmune disorder, and an immune deficiency. These signatures of disease and specific pathogens overrode more modest differences detectable between individuals differing by sex, age, or ancestry. More importantly, the model's interpretability enables testing hypotheses about antigen-specific human immune cell receptors in different illnesses. One key innovation in this study is the development of a language model that translates immune receptors into compressed vectors capturing important properties. The disease predictor is not a black box; we can trace its decisions back to the original sequences. We assessed which sequences drove predictions of each disease, in an effort to identify and validate antigen-specific receptors. 
The highest priority sequences for Covid-19 disease prediction matched virus-binding sequences previously identified from other patients in a variety of studies spanning different laboratories and countries 56 . We also visualized immune response changes during disease progression, highlighting the potential of longitudinal immune repertoire monitoring. The language model we developed for B cell receptor sequences could be applied to further tasks beyond identifying disease exposures. This approach should also be amenable to detecting evidence of more distant prior exposures to pathogens or other immune stimuli, using data from memory immune cell receptor repertoires. Conventional serology tests may only be positive for recent infections or vaccinations, should antibody levels wane soon after exposure. Memory B and T cells can be longer-lived, so an immune repertoire deconvolution strategy may detect distant exposures in individuals who have become seronegative. Further analysis of past exposures could shed light on why some patients are more susceptible to conditions such as the lingering post-infection symptoms of "long Covid", or could help test hypotheses implicating prior viral infections in the initiation of autoimmune diseases 61, 62 . It is possible that the model will fail to detect disease exposures in the distant past if very low frequencies of specific B or T cell receptors are present at the time of testing, but such negative findings could have clinical relevance: memory B cell frequencies that are too low to detect may correlate with susceptibility to re-infection. While the proof of concept in this study provides promising results, it is limited to five immune states and cohorts of only several hundred individuals. The Mal-ID framework appears to capture fundamental principles of immune responses, but additional testing in separate clinical cohorts will be needed to further assess its generalizability. 
The biological validation against known SARS-CoV-2 binders already revealed some limits to the model's grasp of the ultra-high-dimensional receptor repertoire space: 16% of sequences that matched the external database had language-model-assigned ranks under 50% (Figure 2C). Since our specificity group detection approach is focused on common response patterns shared across individuals, it is less likely to be able to interpret immune responses unique to a single individual.

We assembled immune receptor repertoires from 75 Covid-19, 96 chronic HIV-1, 17 Systemic Lupus Erythematosus (SLE), and 13 Common Variable Immunodeficiency (CVID) patients, along with 154 healthy controls. We excluded mild Covid-19 cases and samples prior to seroconversion. Additionally, we only included SLE patients who had multiple auto-antibodies detected. These filters limited model training data to peak-disease specimens to improve our chances of learning patterns for the disease-specific minority of receptor sequences, which we can then search for in more challenging mild cases. However, we wanted to avoid creating an artificially simple classification problem from filtering to trivially separable immune states. To this end, we restricted our CVID cohort to "group II": only patients without an extreme reduction in isotype-switched B cells, as defined in our previous work, were included 16 . We reasoned that these intermediate CVID cases would be challenging to differentiate from other immune states, whereas CVID "group I" patients may be easy to identify as extreme outliers. (Additionally, we had insufficient group I patients for a thorough analysis.) For a similar reason, our HIV cohort included patients regardless of whether they generated broadly neutralizing antibodies to HIV. 
Had we instead restricted our analysis to HIV-infected individuals who produce broadly neutralizing antibodies, we might have created a more easily separable HIV class, due to the unusual characteristics of those antibodies 11 . Across these diverse immune states, over 10 million B cell receptors were sampled, PCR amplified with immunoglobulin gene primers, and sequenced as previously described 11 . Briefly, we amplified each immunoglobulin heavy chain isotype in a separate PCR reaction using random hexamer-primed cDNA templates, and performed paired-end Illumina MiSeq sequencing. To reduce the potential for batch effects, data collection followed a consistent protocol. We annotated V, D, and J gene segments with IgBLAST v1.3.0, keeping only productive rearrangements 63 . Using IgBLAST's identification of mutated nucleotides, we calculated the fraction of the IGHV gene segment that was mutated in each sequence; this is the somatic hypermutation rate (SHM) of that B cell receptor. We restricted our dataset to CDR-H3 segments with eight or more amino acids; otherwise, the CDR-H3 clustering method below might group short but unrelated sequences. Then we grouped nearly identical sequences within the same person into clones. For each individual, we pooled all nucleotide sequences from all specimens (including specimens from different timepoints) across all isotypes, and ran single-linkage clustering, requiring clustered sequences to have matching IGHV genes, IGHJ genes, and CDR-H3 lengths, and at least 90% CDR-H3 sequence identity by string substitution distance 11 . We kept only class-switched IgG or IgA isotype sequences, and non-class-switched but antigen-experienced IgD or IgM sequences with at least 1% SHM. By restricting the IgD and IgM isotypes to somatically hypermutated BCRs, we excluded unmutated cells that had not been stimulated by antigen and are thus irrelevant for disease classification. 
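A minimal sketch of the clone-grouping step, using a simple union-find in place of the actual single-linkage implementation: sequences are comparable only when IGHV gene, IGHJ gene, and CDR-H3 length all match, and are linked when CDR-H3 identity is at least 90%. Field names are illustrative.

```python
from itertools import combinations

def hamming_identity(a: str, b: str) -> float:
    # Fraction of matching positions; callers guarantee equal lengths.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def clone_groups(seqs: list, min_identity: float = 0.90) -> list:
    """Single-linkage clone grouping: link two sequences when IGHV gene,
    IGHJ gene, and CDR-H3 length all match and CDR-H3 identity is at
    least `min_identity`; connected components are the clones."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        a, b = seqs[i], seqs[j]
        if (a["v"] == b["v"] and a["j"] == b["j"]
                and len(a["cdr3"]) == len(b["cdr3"])
                and hamming_identity(a["cdr3"], b["cdr3"]) >= min_identity):
            parent[find(i)] = find(j)  # union the two components

    return [find(i) for i in range(len(seqs))]

seqs = [
    {"v": "IGHV3-30", "j": "IGHJ4", "cdr3": "ARDYYYGMDV"},
    {"v": "IGHV3-30", "j": "IGHJ4", "cdr3": "ARDYYYGMDW"},  # 9/10 identity: same clone
    {"v": "IGHV1-2",  "j": "IGHJ4", "cdr3": "ARDYYYGMDV"},  # different IGHV gene
]
roots = clone_groups(seqs)
```

Sequences sharing a root label belong to the same clone; single linkage means one intermediate sequence can bridge two otherwise-distant members.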
The selected non-naive IgD and IgM receptor sequences were combined into an IgM/D group. Finally, we deduplicated the dataset: for each sample from a patient, we kept one copy of each clone per isotype, choosing the sequence with the highest number of RNA reads. On average, any two patients had less than 0.001% sequence overlap, underscoring the enormous diversity of B cell receptor sequences. We divided individuals into three stratified cross-validation folds, each split into a training set and a test set (Supplementary Figure 8). Stratified cross-validation preserved the globally imbalanced disease class distribution in each fold. We also carved out a validation set from each training set, used for several tasks described below: language model fine-tuning, classifier hyperparameter optimization, and ensemble metamodel training. In all folds, we observed less than 0.1% of sequences shared between any pair of the train, validation, and test sets. Since any single repertoire contains many clonally related sequences, but is very distinct from other people's immune receptors, we placed all sequences from an individual person into exactly one of the training, validation, or test sets, rather than dividing a patient's sequences across the three groups. Otherwise, the prediction strategies evaluated here could appear to perform better than they actually would on brand-new patients: given the chance to see part of someone's repertoire during training, a prediction strategy would have an easier time scoring other sequences from the same person in a held-out set. Had we not avoided this pitfall, models might also have overfit to the particularities of training patients. We created IgG, IgA, and IgM/D summary feature vectors for each specimen. 
For a specimen's subset of sequences belonging to each isotype, we first calculated the median sequence somatic hypermutation rate and the proportion of sequences that are somatically hypermutated (with at least 1% SHM). We also tallied IGHV gene and IGHJ gene usage, counting each clone once. To account for different total clone counts across samples, we normalized total counts to sum to one per specimen. Then we log-transformed and Z-scored the matrix representing how counts are distributed across V-J gene pairs. Finally, we performed a PCA to reduce the count matrix to fifteen dimensions. All transformations were computed on each training set and applied to the corresponding test set. We fit a binary lasso (L1-regularized) linear model on the 51-dimensional (17 features × 3 isotypes) vectors from each specimen to predict whether a specimen was from a healthy control or from an actively ill patient 64 . Features were standardized to zero mean and unit variance. We repeated this feature engineering and model training procedure on each cross-validation fold separately, then combined results from all test folds. All analyses were performed and plotted with python v3.7.10, numpy v1.19.1, pandas v1.2.5, scipy v1.6.2, scikit-learn v1.0, umap-learn v0.5.1, matplotlib v3.3.2, and seaborn v0.11.2.

We performed single-linkage clustering on CDR-H3 sequences from B cells with identical IGHV genes, IGHJ genes, and CDR-H3 lengths, as described previously 11 . Nearest-neighbor clusters were iteratively merged if all cross-cluster pairs had high sequence identity (at least 85%), as measured by string substitution distance. Filter to disease-specific clusters: We kept clusters with sequences from three or more individuals, as long as at least 80% of those individuals were positive for some disease. For each remaining predictive cluster, we created a cluster centroid: a single consensus sequence. 
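The repertoire-composition model's V-J gene count featurization described above (per-specimen normalization, log transform, z-scoring, and PCA to fifteen dimensions) can be sketched as follows; `vj_count_features` is an illustrative helper, not the paper's code.

```python
import numpy as np
from sklearn.decomposition import PCA

def vj_count_features(counts: np.ndarray, n_components: int = 15,
                      eps: float = 1e-8) -> np.ndarray:
    """counts: (specimens x V-J gene pairs) clone counts, each clone
    counted once.  Normalize each specimen's counts to sum to one,
    log-transform, z-score each column, then reduce to `n_components`
    dimensions with PCA.  In the real pipeline these transforms are fit
    on the training set only and applied unchanged to the test set."""
    freqs = counts / counts.sum(axis=1, keepdims=True)   # per-specimen normalization
    logged = np.log(freqs + eps)                         # log transform
    z = (logged - logged.mean(axis=0)) / (logged.std(axis=0) + eps)  # z-score columns
    return PCA(n_components=n_components).fit_transform(z)

counts = np.random.default_rng(0).integers(1, 100, size=(20, 30)).astype(float)
vj_features = vj_count_features(counts)
```

Concatenating these fifteen dimensions with the two SHM summaries yields the 17 features per isotype described above.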
Recall that each cluster member is a clone from which only the most abundant sequence was sampled. Rather than having each cluster member contribute equally to the consensus centroid sequence, contributions at each position were weighted by clone size: the number of unique BCR sequences originally part of each clone. Compute feature vector for each specimen: Sequences from a sample were then matched to these predictive cluster centroids. In order to be assigned, a sequence must have the same IGHV gene, IGHJ gene, and CDR-H3 length as the candidate cluster, and must have at least 85% sequence identity with the consensus sequence representing the cluster's centroid. After assigning sequences to clusters, we counted cluster memberships across all sequences from each specimen. We found these cluster memberships for training set samples, then computed a feature vector for each specimen. A specimen's score for a particular disease was defined as the number of disease-predictive clusters into which some sequences from the specimen were matched. This featurization captures the presence or absence of convergent immunoglobulin sequences, without regard for isotypes. Fit and evaluate model: Features were standardized, then used to fit a linear logistic regression with L1 regularization and balanced class weights (inversely proportional to input class frequencies), using scikit-learn v1.0. The featurization and model were fitted on each training set and applied to the corresponding test set. Predicted labels from all test sets were then concatenated for accuracy evaluation. We abstained from prediction if a sample had no sequences fall into a predictive cluster. Abstentions hurt accuracy scores, but were not included in the AUC calculation, since no predicted class probabilities are available for abstained samples. Fewer than 1.5% of samples resulted in abstention (Supplementary Table 1 ). 
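The centroid-matching featurization can be sketched as follows. Field names and the per-disease score definition follow the description above, but this is an illustrative reimplementation, not the paper's code.

```python
def cdr3_identity(a: str, b: str) -> float:
    # Fraction of matching positions between equal-length CDR-H3 strings.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def specimen_disease_scores(specimen_seqs: list, centroids: list,
                            min_identity: float = 0.85) -> dict:
    """A specimen's score for a disease is the number of that disease's
    predictive cluster centroids matched by at least one of the
    specimen's sequences (same IGHV gene, IGHJ gene, and CDR-H3 length,
    and at least 85% identity to the consensus sequence)."""
    scores = {c["disease"]: 0 for c in centroids}
    for c in centroids:
        for s in specimen_seqs:
            if (s["v"] == c["v"] and s["j"] == c["j"]
                    and len(s["cdr3"]) == len(c["cdr3"])
                    and cdr3_identity(s["cdr3"], c["cdr3"]) >= min_identity):
                scores[c["disease"]] += 1
                break  # count each predictive cluster at most once
    return scores

centroids = [
    {"v": "IGHV3-30", "j": "IGHJ4", "cdr3": "ARDGGYYGMDV", "disease": "Covid-19"},
    {"v": "IGHV1-2",  "j": "IGHJ6", "cdr3": "ARVVPAAMDYW", "disease": "HIV"},
]
specimen = [{"v": "IGHV3-30", "j": "IGHJ4", "cdr3": "ARDGGYYGMDV"}]
scores = specimen_disease_scores(specimen, centroids)
```

These per-disease counts form the feature vector passed to the L1-regularized logistic regression; a specimen matching no cluster yields all-zero features, corresponding to an abstention.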
We combined the CDR-H1, CDR-H2, and CDR-H3 segments of each receptor sequence, then embedded the concatenated amino acid strings with the UniRep neural network, using the jax-unirep v2.1.0 implementation 65 . A final 1900-dimensional vector representation was calculated by averaging UniRep's hidden state over the original protein's length dimension 30 . To embed sequences, we used either the default UniRep weights, or weights fine-tuned on a subset of each cross-validation fold's training set -yielding a total of three fine-tuned models, one per fold. We chose the weights that minimized cross-entropy loss on a subset of the held-out validation set. (We used subsets here due to computational resource constraints.) The fine-tuning procedure was unsupervised. Besides the raw CDR1+2+3 sequence, no disease or other class labels were provided during fine-tuning. As a result, the fine-tuned language models are specialized to B cell receptor patterns, but not hyper-specialized to the disease classification problem. They can be applied to other immune sequence prediction tasks. The analysis pipeline for classifying disease with language model embeddings of sequences is complex, but necessarily so because it aggregates individual sequence data to generate patient-level predictions. Sequence-level disease classifier: First, we trained lasso models to map sequences to disease labels -one model per fold. As input data, we used UniRep embeddings (standardized to zero mean and unit variance), along with categorical dummy variables representing the IGHV gene and isotype of each sequence. Making predictions for individual sequences before aggregating to a patient-level prediction has interpretation benefits, but the two-stage approach introduces a new challenge. The available ground truth data associates patients, not sequences, with disease states. We do not know which of their sequences are truly disease related. 
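A sketch of the pooling step, with random numbers standing in for the language model's per-residue hidden states; the example CDR segments are illustrative.

```python
import numpy as np

def pooled_embedding(per_residue_states: np.ndarray) -> np.ndarray:
    """Average a (length x 1900) matrix of per-residue hidden states over
    the length dimension, giving one fixed-size vector per receptor
    regardless of sequence length."""
    return per_residue_states.mean(axis=0)

# The language model's input is the concatenated CDR-H1 + CDR-H2 + CDR-H3 string:
cdr1, cdr2, cdr3 = "GFTFSSYA", "ISYDGSNK", "ARDGGYYGMDV"
sequence = cdr1 + cdr2 + cdr3

# Random numbers stand in for UniRep's hidden states here; with jax-unirep,
# this length-averaged vector is returned directly as the first element of
#   h_avg, h_final, c_final = jax_unirep.get_reps([sequence])
states = np.random.default_rng(0).normal(size=(len(sequence), 1900))
embedding = pooled_embedding(states)
```

The resulting 1900-dimensional vector, plus IGHV gene and isotype dummy variables, is what the sequence-level classifier consumes.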
To train the individual-sequence-level model, we provided noisy sequence labels derived from patient global immune status. But this transfer creates very noisy labels: even at the peak-disease timepoints in our dataset, disease-specific immune receptor patterns nevertheless represent just a small subset of a patient's vast immune receptor repertoire. Our approach must account for unreliable sequence labels and choose the right subset of sequences to make a patient-level decision. We used highly regularized statistical models equipped to withstand the noisy training labels created by transferring patient labels to the sequence-level prediction task. Models were trained with the scikit-learn v1.0 implementation of logistic regression with lasso regularization and multinomial loss, using balanced class weights and default hyperparameters. The lasso's L1 penalty encouraged sparsity among the ~2000 input features 64 . Because isotype use varies from person to person, we trained the sequence-level model with isotype weights to account for this imbalance. Aggregate sequence predictions to specimen prediction: Since we have no true sequence labels, we cannot evaluate classification performance for the sequence-level classifier. Instead, we aggregated sequence predictions into a patient specimen-level prediction. Using the predicted disease class probabilities for each sequence belonging to a specimen, we computed the trimmed mean for each class across the sequences. That is, we removed the top and bottom 10% of outlying scores, then computed the mean of the remainder, weighing sequences inversely proportional to their isotype's overall usage in the specimen. (This way, minority isotype signal is not drowned out.) Then we renormalized disease class probabilities to sum to one for each specimen. 
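The isotype-weighted trimmed-mean aggregation can be sketched as follows; `aggregate_specimen` is a hypothetical name for an illustrative reimplementation.

```python
import numpy as np

def aggregate_specimen(probs: np.ndarray, isotypes: list,
                       trim: float = 0.10) -> np.ndarray:
    """probs: (sequences x classes) predicted probabilities for one
    specimen.  Per class, drop the top and bottom 10% of sequence scores
    (trimmed mean), weight the remaining sequences inversely to their
    isotype's frequency in the specimen so minority isotypes are not
    drowned out, then renormalize the class scores to sum to one."""
    counts = {iso: isotypes.count(iso) for iso in set(isotypes)}
    weights = np.array([1.0 / counts[iso] for iso in isotypes])
    n, n_classes = probs.shape
    cut = int(np.floor(n * trim))
    agg = np.empty(n_classes)
    for c in range(n_classes):
        keep = np.argsort(probs[:, c])[cut:n - cut]  # trim outlying scores
        agg[c] = np.average(probs[keep, c], weights=weights[keep])
    return agg / agg.sum()

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=10)        # 10 sequences, 3 disease classes
isotypes = ["IgG"] * 7 + ["IgA"] * 2 + ["IgM/D"]  # imbalanced isotype usage
specimen_probs = aggregate_specimen(probs, isotypes)
```

The output is a single probability vector per specimen, robust to outlying sequence scores and to isotype imbalance.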
Tune class decision thresholds: To complete this specimen-level classifier based on aggregating sequence predictions, we tuned class decision thresholds against the held-out validation set. Specifically, we reweighted class probabilities to optimize the Matthews correlation coefficient, a classification performance metric that remains meaningful under class imbalance 66 . Without reweighting, the winning label for each specimen is simply the class with the highest predicted probability. If a class's probabilities are reweighted by ⅕, for example, the model must be five times more confident before choosing that class label. Importantly, these weights were applied only in the choice of a final predicted label for each specimen. This procedure affected the confusion matrix, accuracy, and other metrics based on predicted labels, but did not change the AUC. Evaluate classifier: Finally, we evaluated the specimen-level, sequence-prediction-aggregating predictor on the test set. Each test specimen's sequences were scored, then combined with a trimmed mean as above. The resulting disease class probabilities for each specimen were reweighted by the global class weights found above, to arrive at final predicted specimen labels. Ground truth specimen disease status is known, so we could evaluate classification performance here, unlike at the sequence-level prediction stage. After training the healthy-vs-sick, CDR-H3 clustering, and language model embedding and aggregation models on each fold's training set, we combined the classifiers with an ensemble strategy. For each fold, we ran all trained base classifiers on the validation set, and concatenated the resulting predicted class probability vectors from each base model. We carried over any specimen abstentions from the CDR-H3 clustering model (the other models do not abstain). 
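The decision-threshold tuning described above can be sketched as an exhaustive search over a small per-class weight grid; the actual optimization procedure may differ.

```python
import numpy as np
from itertools import product
from sklearn.metrics import matthews_corrcoef

def tune_class_weights(val_probs: np.ndarray, val_labels: np.ndarray,
                       grid=(0.2, 0.5, 1.0, 2.0)) -> np.ndarray:
    """Search per-class probability reweightings on the validation set,
    keeping the weight vector whose resulting predicted labels maximize
    the Matthews correlation coefficient.  Weights change only the final
    label choice, never the probabilities themselves, so AUC is unaffected."""
    n_classes = val_probs.shape[1]
    best_w, best_mcc = np.ones(n_classes), -1.0
    for w in product(grid, repeat=n_classes):
        preds = np.argmax(val_probs * np.array(w), axis=1)
        mcc = matthews_corrcoef(val_labels, preds)
        if mcc > best_mcc:
            best_w, best_mcc = np.array(w), mcc
    return best_w

val_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
val_labels = np.array([0, 0, 1, 1])
weights = tune_class_weights(val_probs, val_labels)
```

At test time, each specimen's probability vector is multiplied by the tuned weights before taking the argmax to pick the final label.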
Finally, we trained a multiclass linear support vector machine (SVM) to map the combined predicted probability vectors to validation set specimen disease labels. The model was trained in a "one-vs-rest" fashion. We evaluated this metamodel on the held-out test set. We repeated the above process to predict age, sex, or ancestry instead of disease. Input data was limited to healthy controls to avoid learning any disease-specific patterns. To cast this as a classification problem, age was discretized either into deciles or as a binary "under 50 years old" / "50 or older" variable. Notably, the oldest individual among the healthy control participants was 87 years old, with only five people over 70. Therefore, our data do not assess IgH repertoire changes at more extreme older ages. We excluded any healthy individuals over 70 years old, as well as those under 20 years old, from the analysis. For each of the three tasks, we trained a sequence-level lasso-regularized linear model starting from language model embeddings, and aggregated sequence predictions to specimen predictions as above. Similarly, we trained a CDR-H3 edit distance-based clustering model for each task, and an overall repertoire composition model. In this case, the repertoire composition model did not predict whether a specimen was overall healthy or sick; instead it predicted the age, sex, or ancestry class directly (we optimized multinomial loss). Finally, the three trained models were combined into a new ensemble meta-model for each prediction problem. We note that we did not explicitly introduce data from allelic variant typing in germline IGHV, IGHD, or IGHJ gene segments into our models, but such data could be expected to increase detection of ancestry in such datasets. We retrained the entire Mal-ID disease-prediction set of models on the subset of individuals with known age, sex, and ancestry. (As above, we excluded any individuals over 70 years old or under 20 years old.) 
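The ensemble step can be sketched as follows, with random probability vectors standing in for the three base models' validation-set outputs; `LinearSVC` trains one linear SVM per class (one-vs-rest) by default.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for the three base models' predicted class probabilities on
# the validation set (one row per specimen, one block of columns per model).
rng = np.random.default_rng(0)
n_specimens, n_classes = 60, 5
base_probs = [rng.dirichlet(np.ones(n_classes), size=n_specimens) for _ in range(3)]
X_meta = np.hstack(base_probs)        # concatenated probability vectors
y_val = base_probs[0].argmax(axis=1)  # toy labels tied to one base model

# One-vs-rest linear SVM metamodel, fit on validation-set disease labels.
meta = LinearSVC(max_iter=10000).fit(X_meta, y_val)
preds = meta.predict(X_meta)
```

Fitting the metamodel on the validation set, rather than the training set the base models saw, keeps its learned combination weights honest; the held-out test set then gives an unbiased evaluation.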
Additionally, we regressed out those demographic variables from the feature matrix used as input to the ensemble step. Specifically, we fit a linear regression for each column of the feature matrix, predicting the column's values from age, sex, and ancestry. The feature matrix column was then replaced by the fitted model's residuals. This procedure orthogonalizes, or decorrelates, the metamodel's feature matrix from age, sex, and ancestry effects. We regressed out covariates at the metamodel stage because it is a specimen-level, not sequence-level, model, and age/sex/ancestry demographic information is tied to specimens rather than sequences. Separately, we also trained models to predict disease from age, sex, or ancestry information alone, encoded as categorical dummy variables; no sequence information was provided as input. The best-performing model in each case varied among a linear SVM, a linear logistic regression with elastic net regularization, and a random forest. In each test set, we scored Covid-19 patient-originating sequences with the sequence-level classifier based on language model embeddings. Predicted Covid-19 class probabilities were combined for all sequences across folds. Some sequences were seen in multiple people, appearing in more than one test fold and thus receiving a different predicted probability from each fold's model. We deduplicated these sequences by keeping the copy with the highest predicted disease class probability, to capture how disease-related the sequence could be. Sequences were then ranked by their predicted probability, and ranks were rescaled to the range 0 to 1, with 1 corresponding to the highest original probability. We repeated this process for the other diseases. Using these ranked sequence lists, we examined the relationship between rank and sequence properties such as CDR-H3 length, isotype, and IGHV gene segment. For the IGHV gene usage comparison (Figure 3), IGHV genes with very low prevalence were removed. 
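The residualization of the metamodel feature matrix described at the start of this passage can be sketched as a per-column linear regression whose residuals replace the original feature values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regress_out(features: np.ndarray, covariates: np.ndarray) -> np.ndarray:
    """Replace each feature-matrix column with the residuals of a linear
    regression of that column on the demographic covariates (age, sex,
    ancestry dummies), leaving a matrix decorrelated from those effects."""
    residuals = np.empty_like(features, dtype=float)
    for j in range(features.shape[1]):
        fit = LinearRegression().fit(covariates, features[:, j])
        residuals[:, j] = features[:, j] - fit.predict(covariates)
    return residuals

age = np.arange(20, dtype=float).reshape(-1, 1)  # stand-in covariate matrix
features = np.hstack([3.0 * age + 1.0,           # column fully explained by age
                      np.random.default_rng(0).normal(size=(20, 1))])
residualized = regress_out(features, age)
```

A column fully explained by the covariates becomes all zeros, so the metamodel cannot lean on demographic signal when predicting disease.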
To set a prevalence threshold, we found the greatest proportion each IGHV gene ever comprises of any cohort, and took the median of these proportions (Supplementary Figure 9). IGHV genes that failed to make up at least 0.5% of any disease cohort were removed. Specifically, this half of the IGHV genes was filtered out: V1-45, V1-58, V1-68, V1-f, V1/OR15-1, V1/OR15-2, V1/OR15-3, V1/OR15-4, V1/OR15-5, V2-26, V2-70D, V3-16, V3-19, V3-22, V3-35, V3-38, V3-43D, V3-47, V3-52, V3-64D, V3-71, V3-72, V3-73, V3-NL1, V3-d, V3-h, V3/OR16-10, V3/OR16-13, V3/OR16-8, V3/OR16-9, V4-28, V4-55, V4/OR15-8, V5-78, V7-81, VH1-17P, VH1-67P, VH3-41P, VH3-60P, VH3-65P, and VH7-27P. Most IGHV genes remaining after this filter had consistent, balanced prevalence across cohorts (Supplementary Figure 10). We downloaded the Jan 31, 2022 version of CoV-AbDab 56 , filtering to antibody sequences known to bind SARS-CoV-2 (including weak binders). Further, we selected sequences from human patients or human antibody libraries, and removed any IGHV genes never present in our dataset, as those sequences could never be matched. We clustered the remaining SARS-CoV-2 binders from CoV-AbDab, requiring identical IGHV genes, IGHJ genes, and CDR-H3 lengths and at least 95% sequence identity; groups of related sequences were thereby combined and replaced by a consensus sequence. Then we found overlapping sequences between our dataset and CoV-AbDab. First, when sequences in our dataset originating from different isotypes shared the same IGHV gene, IGHJ gene, and CDR-H3 sequence, we kept the copy with the highest predicted Covid-19 probability, in order to assess the strength of a sequence's relationship to the disease. Then we assigned each sequence originating from a Covid-19 patient in our dataset (from any isotype) to the nearest CoV-AbDab cluster centroid, provided they had the same IGHV gene, IGHJ gene, and CDR-H3 length and at least 85% sequence identity. 
Iterating over sequences in model-ranked order, starting with highest confidence sequences, we counted the cumulative number of matches to a cluster in the known binder database. Finally, we calculated enrichment of these observed counts versus expected hits if sequences were ordered at random. The number of draws to sample a certain number of known binders without replacement from a pool of sequences follows the negative hypergeometric distribution. With N total sequences containing n