key: cord-0615675-e2iarlhb
authors: Chau, Kevin Ng; George, Jason T.; Onuchic, Jos'e N.; Lin, Xingcheng; Department, Herbert Levine Physics; University, Northeastern; Physics, Center for Theoretical Biological; University, Rice; Chemistry, Department of; Technology, Massachusetts Institute of; Department, Bioengineering
title: Contact map dependence of a T cell receptor binding repertoire
date: 2021-12-30
journal: nan
DOI: nan
sha: 60be09abe980edfb3b3e3bd1027c9947d3e2c7ba
doc_id: 615675
cord_uid: e2iarlhb

The T cell arm of the adaptive immune system provides the host protection against unknown pathogens by discriminating between host and foreign material. This discriminatory capability is achieved by the creation of a repertoire of cells each carrying a T cell receptor (TCR) specific to non-self antigens displayed as peptides bound to the major histocompatibility complex (pMHC). The understanding of the dynamics of the adaptive immune system at a repertoire level is complex, due to both the nuanced interaction of a TCR-pMHC pair and to the number of different possible TCR-pMHC pairings, making computationally exact solutions currently unfeasible. To gain some insight into this problem, we study an affinity-based model for TCR-pMHC binding in which a crystal structure is used to generate a distance-based contact map that weights the pairwise amino acid interactions. We find that the TCR-pMHC binding energy distribution strongly depends both on the number of contacts and the repeat structure allowed by the topology of the contact map of choice; this in turn influences T cell recognition probability during negative selection, with higher variances leading to higher survival probabilities. In addition, we quantify the degree to which neoantigens with mutations in sites with higher contacts are recognized at a higher rate.

One of the major components of the human immune system consists of a large repertoire of T lymphocytes (or T cells). Each T cell carries a particular T cell receptor (TCR) capable of binding to a specific antigen in the form of a peptide (p) displayed by major histocompatibility complex (MHC) molecules (shortened as pMHC) on the surface of host cells [1] [2] [3] [4] . The activation of the T cell response depends on the strength [5] , and possibly kinetics [6] , of this TCR-pMHC binding [7, 8] . A typical repertoire of a healthy individual consists of ∼ 10 7 distinct clonotypes, each with a unique TCR [9] . A growing body of research has been focused on understanding the systems-level interactions between the T cell repertoire and its recognition of peptide landscapes indicating foreign or cancer threats.

A critical feature of a properly functioning immune system is its ability to discriminate healthy cells of the host from those infected by pathogens, reacting to the latter ones while tolerating the former ones. In order to achieve the aforementioned discrimination, T cells must survive a rigorous selection process in the thymus before being released into the bloodstream. The first step in this process, called positive selection, ensures that TCRs in thymocytes (developing T cells) can adequately interface with pMHCs. Positive selection occurs in the thymic cortex, where cortical epithelial cells present self-peptides to thymocytes. As long as a thymocyte is able to interface with some presented pMHC, it receives a survival signal and migrates inward to the thymic medulla. This step ensures that the thymocyte has a properly functioning TCR, a rare event as only 5% of thymocytes survive this step. In the inner medulla, they encounter thymic medullary epithelial cells. Here, surviving immature T cells are again presented with a diverse collection of ∼ 10 4 self peptides [10, 11] representing a variety of organ types. T cells binding too strongly to any self peptide die off in a process known as negative selection [12, 13] .

As already pointed out, a key ingredient in the aforementioned process as well as in any subsequent recognition of a foreign antigen by a T cell is the molecular interaction of the TCR and the pMHC molecules. Crystal structures of TCR bound to pMHC show that the interface of the TCR-pMHC interaction is complex, with TCR complementarity determining regions 1 and 2 (CDR1 and CDR2, respectively) primarily binding to the MHC molecule, whereas the CDR3 complex mainly contacts the peptide in the MHC's cleft [14, 15] . The CDR3 complex is comprised of two loops, CDR3α and CDR3β; Baker et al. showed these loops can exhibit spatial and molecular flexibility during the TCR-pMHC binding process [16] ; moreover, the same TCR can bind to different pMHCs [17] , for example to a pMHC with point-mutated peptide [15] . This can involve subtle changes in the CDR3 complexes' spatial conformation. It is clear then that the intricacies of the TCR binding to the pMHC as a dynamic process remain as yet to be fully understood.

In lieu of a complete first-principles understanding, several groups have pioneered the idea of employing relatively simple models so as to get a sense of how negative selection affects the T cell repertoire. In the original set of models, TCRs and peptides were represented as strings of amino acids (AAs) which interacted in a manner that did not incorporate any structural information. In one such set of models, each AA in the pMHC binding pocket interacted with and only with the complementary AA in the TCR CDR3 complex. This interaction was described by either one or a set of 20x20 matrices [18] [19] [20] [21] [22] . These works indeed have provided a framework for describing how selection shapes the discrimination ability of the T cell repertoire, and have been applied to understanding HIV control [23] and for assessing the detectability of cancer neoantigens [21] . In a more recent study, Chen et al. [24] introduced nonuniform interaction profiles that translated into some AAs in the TCRs having a more pronounced effect in pMHC recognition, but did not consider how these non-uniformities could vary between TCRs, as shown by existing crystal structures.

In this paper, we introduce the idea of a crystalstructure dependent contact map that weights the binding energies based on the distance separating the residues on the AAs. A contact map can be thought of as a specific template for a class of TCR-pMHC interactions, which then will yield an actual binding energy once we specify the specific AA strings on the two molecules. To focus attention on the role of the contact map, we use a simple random energy model which assigns a fixed random energy to each of the possible AA pairs. Our model, described in detail below, can be thought of a more realistic version of the the Random Interaction Between Cell Receptor and Epitope (RICE) model [21] , in which contact map effects were simply assumed to decorrelate pair energies at different sites along a uniform binding surface.

The paper is structured as follows. In section II, we present the model description along with how crystalstructure dependent contact maps are created and also discuss the choice of energy matrix in the model. In Section III, we analyze how the variance of the TCR-pMHC binding energy PDF is impacted by the choice of contact map, including the roles of the total number of contacts and the topology of the contact map. We then present two applications of the model that are affected by the choice of contact map: in Section IV we focus on the negative selection recognition probability, and in Section V we discuss the point-mutant recognition probability by T cells that have survived negative selection. We present our closing remarks in Section VI.

Our goal is to analyze a model of negative selection in which the TCR-pMHC interaction exhibits antigenspecificity of T cells dependent both on the AA occurrence and on the spatial conformation of TCR and pMHC, while retaining enough simplicity so that it can be studied analytically and with feasible computations. We represent a TCR t via its CDR3 loops in form of a sequence of k t AAs, t = {t(i)} kt i=1 , and a pMHC q as a sequence of k q AAs, q = {q(j)} kq j=1 . A symmetric energy coefficient matrix of size 20 × 20, E = (E nm ), has entries E nm that represent the pairwise binding coefficients between AAs n and m. The binding energy contributions are then assumed to be the product of a contact map W = (W ij ), containing the weights W ij for the interaction between t and q in a given structure, and the coefficient corresponding to the amino acid interaction. In detail,

where U c represents the contribution of the TCR's CDR1 and CDR2 complexes interacting with the MHC molecule, as discussed in [18] [19] [20] 23] . This form of the binding energy in (1) explicitly separates the effects on CDR3-pMHC interaction due to spatial configuration from the effects due to the rest of the pair-dependent factors, assigning the former ones to W and coarsely accounting for the latter ones in E. The particular choices for the contact map W will depend on the specific TCR-pMHC being used as a template. Also, this formulation does not pre-suppose any specific choice for E. We discuss in detail specific choices of E and W in the sections below.

Crystal structures of TCRs bound to pMHCs show a variety of spatial configurations. Each one of these can be thought of as defining a binding template which can be used to determine the energy of a set of possible pairs. In general, we expect there to be a small number of possible templates, as a specific template would presumably be valid for a subset of all pairs; even then, we must necessarily ignore the small structural changes seen between the same TCR-pMHC systems that differ e.g. by a single AA mutation [15, [25] [26] [27] . We expect, based on a recent computational study [28] , that this approach will be reasonable if we stick to a fixed MHC allele, as structures with different alleles can look very different. We will see this directly in Fig. 1 below. In the calculations reported in this paper, we typically restrict ourselves to one template.

To derive a contact map from a crystal structure, we utilize the associative memory, water mediated, structure and energy model (AWSEM) [29] , developed in the context of protein folding. We use the position of C β (C α in the case of glycine) atoms to characterize the position of the residues of the AAs in both the TCRs and pMHCs, and to use AWSEM's negative-sigmoid switching function as the screening weight W ij in computing the interaction energy

Here, r ij is the distance separating the residues at positions i and j, r max acts like a cutoff and is the inflection point of W ij after which the function vanishes rapidly for r ij > r max , and η controls how rapidly this vanishing occurs. We use crystal structures (see Fig.  1a ) of TCR bound to pMHC deposited in the Protein Data Bank to determine a list of AAs in the TCR t and in the pMHC q, and to calculate each distance r ij , i = 1, · · · , k t , j = 1, · · · , k q . We then compute the corresponding weights W ij from (2) and construct the contact map W = (W ij ). Given that both CDR3α and CDR3β loops of the TCR interface with the peptide, we construct a separate contact map for each of these CDR3loop-pMHC interactions.

To show how the proposed screening weight given by (2) derives from different TCR-pMHC crystal structures, we choose r max = 9.5 A and η = For simplicity, we will refer to specific crystal structures by their PDB IDs unless further details need to be more precisely mentioned about the TCR or the pMHC. Note that 3QIB and 3QIU represent different TCRs bound to the same pMHC complex, whereas 3QIU and 3QIW represent the same TCR bound to two pMHCs that differ by a single AA mutation in the peptide sequence. In addition, 3QIB, 3QIU, and 3QIW share the same mouse MHC class-II restriction and indeed the same I-E k MHC-II allele, whereas the 5C0A TCR-pMHC system is presented on the human HLA A * 02 MHC class-I allele.

As defined here, contact maps are sensitive to the choice of distance cutoff. Clearly, the number of contacts in a contact map for a given crystal structure increases with increasing r max values. The contact map of 3QIB's CDR3α-pMHC interface is plotted at four different r max values, from 6.5 to 9.5 A in 1 A increments, while keeping η = 1 A −1 fixed (see SI Fig. S1 ). The contact profile gradually forms with ever-increasing number of contacts from about 5 AA pairs in contact at r max = 6.5 A, to about 22 AA pairs in contact at r max = 9.5 A. For the remainder of this paper, all contact maps are calculated with r max = 9.5 A and η = 1 A −1 .

The contact maps in figure 1b correspond to CDR3α-pMHC interfaces (top row) and CDR3β-pMHC interfaces (bottom row) from crystal structures 3QIB, 3QIU, 3QIW, and 5C0A. The contact profiles of CDR3α-pMHC are different from the CDR3β-pMHC contact profiles, as these parts of the TCR contact different residues on the displayed peptide. The contact maps consistently represent the physical proximity of a particular CDR3 loop to a specific portion of the pMHC, as can be seen in 3QIB's crystal structure shown in figure 1a, wherein the CDR3α loops primarily contact AAs 2-8, whereas CDR3β loops primarily contact AAs 7-12. The detailed differences among the first three contact maps do capture slight changes in position-dependent interfacing, even when comparing contact maps for the same TCR bound to two pMHCs diverging by peptide single-AA mutation. Different weights of, for example, position pairs (i, j) = (4, 4), (4, 8) , (6, 4) and (7, 6) are observed when comparing contact maps of 3QIU and 3QIW in figure 1b (coordinates in AA pairs are labeled as (i, j) for t(i) and q(j)). But, clearly from a more coarse-grained perspectives, these three can be considered to fall within one template. Conversely, the fourth map is very different, as should be expected because it is based on a different MHC molecule. Our conclusion is that we can use a single map for a class of possible parings and thereby learn about a significant set of contributors to the T cell repertoire. We include more contact maps from other crystal structures in the SI to further support our findings (Figs. S2-4). In the remainder of this paper, we will explore the segment of the repertoire that depends on one template and its corresponding contact map, and determine how the features of that map affect repertoire properties.

As discussed above, we propose for the recognition of an antigen by a T cell an affinity-based criterion in which the TCR-pMHC binding energy U (t, q) given in (1) equates to recognition (evasion) if U (t, q) is above (below) a particular energy threshold U n . Thus, we need to specify a symmetric energy coefficient matrix E = (E nm ). The first example of matrix choice was one based primarily on hydrophobicity, as developed by Miyazawa-Jerningan (MJ) [30] and used in studies of thymic selection [18, 24] . More recent efforts have focused on developing immune-specific energy matrices [31] . A recent study [28] used machine learning to derive the optimal matrix separating strong from weak binders within a single contact map template; this optimization approach would lead to a different such matrix for each assumed template. Here, our interest is in the role of the contact map and so we have opted for the expedient choice of a random model where all matrix elements are chosen to be independent, mean-zero, unit-variance normally distributed random variables, E mn ∼ N (µ = 0, σ 2 = 1). Note the assumption that the n-m interaction coefficient has the same value independently of the AAs' location in the TCR or the pMHC sequences. Thus, our model is distinct from the RICE approach [21] which assumed that the spatial location of the amino acid directly affected the energy coefficient.

The TCR-pMHC binding energy U (t, q) is the indicator of the affinity between a T cell and an antigen. When assuming the pairwise AAs interaction energies to be independent Gaussian random variables, U (t, q) in (1) becomes a weighted sum of these variables with weights given by the contact map W. Hence, U (t, q) is also a normally distributed random variable, and since its mean is automatically zero, knowledge of the variance σ 2 tq of its PDF allows us to fully characterize how U (t, q) varies as we vary the particular realization of E. The contactmap dependence of U (t, q) has a two-fold impact on the variance of its PDF when compared to the case of the addition of equal variance random variables (as in the RICE approach from [21] ). On one hand, the total number of non-vanishing contacts W ij given by the contact map directly determines the number of random energies E ij contributing to U (t, q), thus increasing σ 2 tq as the number of non-vanishing W ij 's increases. On the other hand, the particular repeat structure of AAs in the TCR sequence and in the pMHC sequence also influences σ 2 tq , as a particular pair of AAs that appears multiple times in the energy summation gives rise to a variance increase. In this section we explore how the variance of the PDF of U (t, q) depends on the two aforementioned factors.

Before proceeding, we must discuss various statistical ensembles of interest here. So far, we have focused on varying the coefficient matrix, thus generating an ensemble values for each specific t, q. However, we imagine that the biophysical problem is defined by a fixed E, which may be chosen (as done here) in a random fashion but, as mentioned above, may be learned from the data as done in other work [28] . Thus, we are actually interested in the distribution of binding energies as we vary either the peptide (fixing the TCR), the TCR (fixing the peptide) or both, as these are what is necessary to determine the effects of negative selection. To see how to determine these distributions, we return to the basic equation

where we have limited ourselves to one class of MHC molecule and hence U c becomes an irrelevant constant. Also, we will assume for the purpose of our analysis that W ij is either 0 or 1; this is true for all but a very small number of possible pairs. Finally, we will assume take the distribution over AA to be uniform, although it might be useful in future work to use the known AA distribution in the human proteome. With these number of assumptions, the mean value of U (t, q) sampled over peptide sequence and/or TCR sequence constrained to have no repeats is just the sample mean of drawing a number of values from a mean zero, variance σ 2 Gaussian distribution. This number is very much peaked around zero. Similarly, the mean value of U 2 will be strongly peaked around the variance times the contact number N c . Perhaps not surprisingly, these are the same answers we get when averaging over E; in other words, as long as we average over sufficient numbers of sequence choices, the results for all choices of coefficient matrices are the same; see the SI (Section S8) for a more complete discussion.

Let us now extend this analysis to the more general case. We introduce the following notation: A pair repeat structure is denoted as C p = (l r1 1 , l r2 2 , · · · , l r N N ), with r i · l i = N C , where l i denotes the number of times an amino acid pair is repeated in different contacts and r i denotes how many such l i repetitions there are. For example, for a total of 20 contacts, if there are three contacts with the same AA pair and two set of two contacts with the same AA pair, this would be denoted as C p = (3, 2 2 , 1 13 ). An extension of the previous argument allows us to determine the most likely value of the mean energy and its variance, averaged over all possible peptide and TCR sequences that do not change the class. The mean is still zero and the variance now becomes

Again, this is exactly the same as the result obtained when averaging over energy coefficient matrices. A more precise version of this correspondence is presented in the SI (S5 and S6). If one wants to find the total variance, we have to average over different choices of C weighted by their respective probabilities of occurrence given the assumed uniform distribution of residue choice.

It is clear from the previous analysis that the variance in the binding energy distribution increases with N C the total number of contacts. It is easy to see from the above that there are bounds on the total variance

The lower bound comes from the case where all pairs are distinct whereas the upper bound arises from assuming that all contacts are the same AA pair, i.e. C = (N C ). From the size of the AA alphabet |A|, the total number of AA pairs (irrespective of ordering) is M = |A|+1

. Now, we have just seen that the precise value of the variance depends on the exact repeat structure of the peptide (q) and TCR (t) AA sequences, together with the contact map. In the case where we wish to obtain the variance of the PDF obtained by varying both t and q, we can obtain a useful approximation of this variance by ignoring the exact configuration of W and instead simply counting the number of times each of the M AA pairs is selected with equal probability, where there are N c total opportunities. In this case, the number of times each AA pair is realized follows a multinomial distribution, and the variance can be calculated from the second moment of this distribution as

See the SI (Sections S5 and S6) for a detailed derivation. In figure 2a the variances computed by simulation for the CDR3α-pMHC interfaces of 3QIB, 3QIU, 3QIW, and 5C0A (top row of Fig. 1b ) are presented along with the predicted variance from (6). As we can see, this approximation captures the basic dependence on the total number of contacts. In the SI (Fig. S5) , we provide further evidence for this result by considering the effects of varying the cutoff used in the definition of the contact matrix.

B. Variance depends on the repeat structures of the TCR and pMHC AA sequences

If we are looking for the distribution of energies for a fixed TCR sequence, there is no simple formula that can encompass the dependence of the variance on the exact TCR sequence and on the exact contact map. As already mentioned, we have to find the variance for different possible repeat structures and then weight them appropriately by their occurrence probability. Specifically,

Where N R is the total number of different possible structures.

We would like to work out a specific and relatively simple example to illustrate how this works. To simplify the analysis, we focus on the 3QIB CDR3α-pMHC contact map W α 3QIB in figure 1b (top left) , and assume that the TCR is a constant sequence of a single repeated AA t = (t 1 , t 1 , t 1 ...). Note that this makes labelling of repeat motifs dependent on the pMHC's primary sequence only. In W α 3QIB , only 7 AAs in t and 7 AAs in q make significant contacts, so the effective lengths are k t = k q = 7.

We will break down the problem of computing the terms in this sum as follows: We will first focus on the probable configurations of the peptide by itself and consider how the different sites are chosen. Drawn from a |A| = 20 AA alphabet, there are N = 15 different repeat configurations of length 7; when randomly generating AA sequences, the four most likely repeat configurations C q,1 = (2, 1 5 ), C q,2 = (1 7 ), C q,3 = (2 2 , 1 3 ), and C q,4 = (3, 1 4 ) (in the section above, C is the repeat structure of the TCR-pMHC pairing, whereas C q,n (n = 1, · · · , N ) here indicate the repeat structure only of the pMHC), cover about p c = 96.66% of the AA sequence space. A complete breakdown of these probabilities can be found in SI Table 1 . We thus truncate the sum in (7) to the pairings that can be obtained from these leading order structures. Now, each peptide configuration can give rise to a set of different possible pairing structures, depending on the specific non-vanishing elements of the contact matrix. These then need to be averaged together (with proper weighting). This somewhat complicated calculation is presented in the SI (section S6) and is carried out by using the self-averaging property to allow for computing the average over different realizations of the energy coefficient matrix; no rounding to 0 or 1 for the values W ij is made in this calculation and the results to follow. Finally, we obtain σ t (p c ) = 9.7833σ, and extrapolating this value to approximate the full analytical value in (7), we get

This estimation has relative error of 0.6% as compared to the simulated value of the standard deviation, the blue plot in figure 2b . The simulated PDFs related to the four most likely repeat structures are also shown in figure 2b. It is worth noting that in (7) the contributions of higher values of variances are dominated by the even faster vanishing of the corresponding probabilities. For reference, the standard deviation for this contact map ranges from σ 2 = 9.0761 for C q,2 = (1 7 ) to σ 15 = 21.4090 for C q,15 = (7); whereas the probabilities are p 2 = 30.52% and p 15 = 1.56 × 10 −6 %, respectively.

Negative selection trains the naïve T cell repertoire to avoid host cells by eliminating T cells that bind too strongly to any of the self-peptides. We now wish to consider the effects on the post-selection repertoire due to incorporating crystal-structure motivated contact maps into the negative selection process.

We focus on determining the negative selection recognition probability as a function of the energy survival threshold U n . For a T cell to survive negative selection, it must not bind strongly i.e. U < U n to any of the selfselecting pMHCs it encounters during selection. This is described by the probability that the maximum of the TCR-pMHC binding energies, max{U (t, q i )} Nq i=1 , resulting from a T cell t undergoing negative selection against a repertoire, Q = {q i } Nq i=1 of N q self-pMHCs, is below the threshold U n [21] . This recognition probability is thus a monotonically decreasing function that gradually transitions from 1 to 0 with ever increasing values of U n . For a fixed TCR, the scale of the transition correlates with a typical value of σ 2 t . Averaging this over different TCRs will give rise to a width that strongly correlates with the number of contacts, as suggested by the phenomenological relationship given above and verified in the SI.

We simulate negative selection for various CDR3-pMHC interfaces (contact maps), using fixed randomly generated TCR and pMHC repertoires and 16 zero-mean, unit-variance randomly generated energy matrices E. In figure 3 , we show the recognition probability averaged over energy matrices E for seven different simulations, four of them using contact maps 3QIB, 3QIU, 3QIW, and 5C0A; along with a 7 × 7 identity-matrix contact map case, as well as the original RICE model, and a 7 × 7 contact map with all unit entries case simulating the scenario where all AAs in t are interacting with all AAs in q. At a given U n , the recognition probability is higher for those contact maps with higher σ 2 , giving a higher probability for a pair of t and q to bind strongly enough and thus for t to face deletion. Interestingly, the data in the figure show directly that, similar to what we argued earlier, the recognition probability curve for a single realization is quite accurately given by the average over energy matrices.

One of the motivations to model negative selection is to understand how the rejection of T cells that detect selfpeptides negatively impacts the chances that T cells can detect tumor neo-antigens; after all, these neo-antigens are typically just one mutated amino acid away from a self-peptide sequence. We therefore turn to the probability that a T cell (t) that has survived negative selection is able to recognize an antigen (q) whose primary sequence differs by only one AA from a self-peptide (q) included in the negative-selecting repertoire (Q). We call such antigen a point-mutant. In general, this probability for fixed T cell is defined viã (8) where we have averaged over all possible point-mutants with non-trivial contacts. Here Q denotes the selecting repertoire of N q peptides, one of which is q. In the limiting case where t has not undergone negatively selection (N q = 0), equation (8) reduces to the recognition probability of a randomly generated antigen. Another extreme case corresponds to t negatively trained only on q (N q = 1) where the point mutant position has k contacts, resulting in

where F k (x) and f k (x) denote the distribution function and density function of mean-zero normal random variables with variance σ 2 k (see SI Section S7 for a full derivation). We expect that for relatively small N q , it is unlikely that any of the peptides in the training set will be close enough to q orq to help distinguish the two binding energies; hencep 1 should be a reasonable approximation to D t . This agreement should decrease as N q increases. The accuracy of this approximation is explored in SI Fig. S8 . More generally, we ran a set of simulations with varying sizes N q = {10 2 , 10 3 , 10 4 } to assess the detection of q by a T cell trained to evade q. We used the CDR3α-pMHC interface of 3QIB (top left of Fig. 1b) as contact map for the simulations for simplicity. Figure 4a shows the simulated point-mutant recognition probabilities as a function of T cell negative selection survival probability at three different sizes of the selecting repertoire. At lower (resp. higher) values of negative selection survival probability, i.e. when negative selection is more (resp. less) stringent during T cell maturation, a mature T cell's sense of an antigen resembling self-antigens is rather strict (resp. lenient), therefore, almost (resp. hardly) any deviation from this criterion caused by pointmutations triggers recognition of the point-mutant by the T cell; this results in higher (resp. lower) point-mutant recognition probability at lower (resp. higher) T cell negative selection survival probability.

Next, we compare the results at different N q . This is a bit tricky, because fixing the negative selection probability leads to different thresholds U n at different training set sizes. This accounts for a large part but not all of the difference in the curves seen in Fig. 4a ; see SI Fig  S8. By increasing the size of the negative-selecting repertoire N q , a mature T cell's sense for self-antigen resemblance broadens; thus leading to higher tolerance (less detectability) for point-mutants at higher N q values.

Another feature impacting point-mutant recognition probability that stems from incorporating contact maps Fig 1b) in use, pMHC-AAs in high-contact sites are in contact with 5 TCR-AAs, whereas pMHC-AAs in sparse-contact sites are in contact with only 1 TCR-AA; and when picking random sites to mutate, the number of peptide-AAs that a given TCR-AA can contact ranges from 1 to 5.

into the model, pertains to the site in the pMHC sequence of the mutated AA. As can be seen in the contact maps in Fig. 1b , some pMHC AAs make more significant contacts with TCR AAs than other pMHC AAs. In the case of 3QIB's CDR3α-pMHC contact map (top left of Fig. 1b) , the number of non-vanishing contacts for a particular pMHC AA ranges from 1 (sparse-contact site) to 5 (high-contact site), with an averaged 3.06 TCR AAs in contact by the 7 pMHC AAs with non-vanishing contacts. Accordingly, a point-mutantq with its mutation occurring in a sparse-contact site (resp. high-contact site) bears higher (resp. lower) resemblance with the non-mutant q for a T cell. This effect clearly should impact the point-mutant recognition probability, with high-contact site point-mutants having higher recognition probability than their sparse-contact counterparts, and point-mutants with randomly chosen mutation sites having recognition probability somewhere in between the aforementioned two. We investigated this idea by running three simulations as explained in the paragraph above, but with the additional constraint that in each round of simulations the mutated site was: one, always a high-contact site; two, always a sparse-contact site; and three, randomly chosen. The negative-selection repertoire was fixed at N q = 10 4 . The point-mutant recognition probability of these simulations are shown in Fig.  4b and exhibit agreement with the expected behavior.

The aforementioned RICE framework cannot adequately distinguish high contact sites from sparse ones on either the TCR or pMHC amino acid sequences. RICE's prediction for neo-epitope recognition probability therefore represent fixed estimates for a typical 'one-contact' mutation. On the other hand, this new approach enables a quantitative estimate of this obvious dependence. This aligns with previous strategies calling for mutations to target TCR-facing peptide amino acids; see for example [32] , [33] VI. CONCLUSIONS In this manuscript, we considered the role of a nontrivial contact map acting as a template for the explicit interactions between the TCR and pMHC AA sequences. This approach is a compromise between making an arbitrary rule as to how these sequences interact (for example, assuming only diagonal coupling as done in previous models) or using measured crystal structure for each considered pair, an obvious impossibility for anything resembling a large repertoire undergoing negative selection. The formulation isolates contributions from spatial conformation of CDR3 loops and pMHC complexes into these contact maps, while remaining features are encapsulated in energy coefficient matrices. Although all the analysis here was done using randomly generated energy matrices, serving as a baseline "toy" model, the methodology is not restricted to such a choice, and other energy matrices such as the hydrophobicity-driven MJ matrix [30, 34] , or data-driven matrices [28] can be used instead.

We observed that the inclusion of contact maps gave rise to several features impacting the variance of the TCR-pMHC binding energy: a density-related one, as the number of non-vanishing contacts correlates with increased variance; and a topology-related one, in which the repeat structure of the AAs in CDR3-loops' and in pMHC-complexes' sequences also skews the variance, with additional repeats correlating with increased variance. These changes in variance also affect negative selection recognition probabilities, with larger variances driving higher recognition probabilities. The proposed generalization is therefore useful for characterizing the distributional behavior of TCR systems with a relatively fixed contact structure. Given that even at fixed MHC allele there are likely to be several distinct spatial conformations that can give rise to effective binding, a full treatment of the repertoire should include finding the set of templates that gives rise to the largest possible binding for the sequences under consideration. This extension will be reported on elsewhere.

Another influence of the topology of the contact map manifest in the recognition probability of point-mutated antigens by T cells that have been negatively selected. Here, some pMHC-AAs have higher number of nonvanishing contacts with TCR-AAs, that upon mutation make the antigen to be perceived more like foreign by the T cells than when mutating pMHC-AAs with fewer non-vanishing contacts; this results in higher recognition probability of high-contact site point-mutants. Conversely, this notion can provide at least some information about which mutations in a previously detected peptide could prevent the detection of an evolved virus by mem-ory T cells generated in an earlier infection. Data to this effect is now becoming available in the context of COVID-19-specific T cells in never infected individuals resulting from prior responses to other endemic coronaviruses [35] .

As seen here, the problem of dissecting the generation and functioning of the post-selection T cell repertoire is incredibly complex, even utilizing a number of vastly simplifying assumptions. The full problem requires attention to biases in the generation of the naïve repertoire [36] , inclusion of a set of different MHC alleles for different individuals, a better handle on the statistical properties of the negative selection training set, and of course the full range of molecular biophysics effects that contribute to binding energy and on-off kinetics. These cannot all be included in any useful theoretical model. By isolating and improving our understanding of the effects of specific contact geometries, we hope to build intuition for how different aspects of this complex system contribute to different functional aspects of the full T cell arm of adaptive immunity.

Clonal evolution in relapsed acute myeloid leukaemia revealed by wholegenome sequencing

The ipd-imgt/hla database -new developments in reporting hla variation

Neoantigen landscape dynamics during human melanoma-t cell interactions

Force-dependent transition in the t-cell receptor β-subunit allosterically regulates peptide discrimination and pmhc bond lifetime

The case for absolute ligand discrimination: modeling information processing and decision by immune t cells

T-cell-receptor affinity and thymocyte positive selection

How t cells 'see' antigen

A direct estimate of the human α β t cell receptor diversity

How diverse should the immune system be?

Theories and quantification of thymic selection

Explaining high alloreactivity as a quantitative consequence of affinity-driven thymocyte selection

Positive and negative selection of the t cell repertoire: what thymocytes see (and don't see)

Identification of the cognate peptide-mhc target of t cell receptors using molecular modeling and force field scoring

Structural basis of specificity and crossreactivity in t cell receptors specific for cytochrome c-iek

Structural and dynamic control of t-cell receptor specificity, cross-reactivity, and binding mechanism

How a single t cell receptor recognizes both self and foreign mhc

How the thymus designs antigenspecific and self-tolerant t cell receptor sequences

Thymic selection of t-cell receptors as an extreme value problem

Statistical mechanical concepts in immunology

Effects of thymic selection on t cell recognition of foreign and tumor antigenic peptides

Is t cell negative selection a learning algorithm?

Effects of thymic selection of the t-cell repertoire on hla class i-associated control of hiv infection

How nonuniform contact profiles of t cell receptors modulate thymic selection outcomes

Hotspot autoimmune t cell receptor binding underlies pathogen and insulin peptide cross-reactivity

Crossreactivity of a human autoimmune tcr is dominated by a single tcr loop

A molecular basis for the t cell response in hla-dq2.2 mediated celiac disease

Rapid assessment of t-cell receptor specificity of the immune repertoire

Awsem-md: Protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing

Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation

Development of immune-specific interaction potentials and their application in the multi-agent-system vaccimm

Tcr contact residue hydrophobicity is a hallmark of immunogenic cd8+ t cell epitopes

Rational optimization of tumor epitopes using in silico analysis-assisted substitution of tcr contact residues

Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading

Sars-cov-2-reactive t cells in healthy donors and patients with covid-19

Statistical inference of the generation probability of t-cell receptors from sequence repertoires

The authors would like to thank Dr. Michael E. Birnbaum for fruitful discussion on systems-level TCRantigen specificity. This work was supported by National Science Foundation (NSF) grant NSF PHY-2019745 (Center for Theoretical Biological Physics).