key: cord-0289228-jhginti1
authors: Karnaukhov, Vadim K.; Shcherbinin, Dmitriy S.; Chugunov, Anton O.; Chudakov, Dmitriy M.; Efremov, Roman G.; Zvyagin, Ivan V.; Shugay, Mikhail
title: Predicting TCR-peptide recognition based on residue-level pairwise statistical potential
date: 2022-02-19
journal: bioRxiv
DOI: 10.1101/2022.02.15.480516
sha: 0e3f8b73f26a81d4db1501c505b6b18737dcf865
doc_id: 289228
cord_uid: jhginti1

T cell receptor (TCR) recognition of foreign peptides presented by major histocompatibility complex (MHC) proteins is a crucial step in triggering the adaptive immune response. Prediction of TCR:peptide recognition is important for many clinically relevant problems: prediction of cross-reactivity of TCRs used in adoptive T-cell-based therapies, identification of targets for antigen-specific therapies of autoimmune disorders and cancer, vaccine design. In this work, we describe a novel computational method that ranks TCR-peptide pairs based on statistical estimation of binding energy to select the most likely cognate epitopes. Our method starts from a structure of a peptide-MHC complex with the TCR of interest (either experimentally derived or based on a homology model) and a set of candidate epitopes and generates a ranked list of these peptides with predicted recognition probabilities. In our method interaction energy between TCR and peptide residues is estimated using statistical potential TCRen which we derived via statistical analysis of amino acid contact preferences in TCR-peptide interfaces from crystal structures of TCR-peptide-MHC complexes. We demonstrate excellent performance of our method for two tasks related to TCR-peptide recognition: 1) discrimination between cognate epitopes and unrelated peptides; 2) discrimination between real and mocked TCR-peptide-MHC complexes. Comparison of TCRen with potentials reflecting the rules of general protein-protein interaction and protein folding reveals the distinctive features of TCR-peptide interactions, such as intrinsic asymmetry of the interface, a complex interplay between different physico-chemical properties of contacting residues and lower impact of hydrophobic interactions. Finally, we demonstrate the use of our method for identification of cancer neo-epitopes recognized by tumor-infiltrating lymphocytes.

T cells are the key actors of cell-mediated adaptive immune response able to distinguish between self and non-self antigens presented on the cell surface. Specificity of a T cell is determined by its T-cell receptor (TCR), which recognises peptide antigens presented by major histocompatibility complex (MHC) proteins. MHC class I proteins are expressed in almost all cell types and specialized on presentation of peptides from degraded cytoplasmic proteins for inspection of intracellular status by CD8+ T cells; MHC class II proteins are expressed mainly on specialized antigen-presenting cells and present peptides from endocytosed or phagocytosed extracellular proteins for CD4+ T cells. The majority of T cells have αβTCR, the heterodimer composed of α and β chains, both contacting antigen (peptide-MHC complex) mainly by three loops called complementarity-determining regions (CDRs). CDR1 and CDR2 are germline-encoded, whereas CDR3 is formed during T cell maturation via V(D)J-recombination process by random selection of variable (V), diverse (D), and joining (J) gene segments and random nucleotide deletions and insertions on the segment junctions. As a result of the V(D)J-recombination, repertoire of TCRs in each individual is very diverse (more than 10 8 different T cell clones with unique TCRs 1,2 ) which enables recognition of a huge number of potential antigenic peptides and provides defence from different pathogens and cancer.

To date, more than 200 crystal structures of TCR-peptide-MHC complexes are deposited in Protein Data Bank (PDB) 3 . Most of these structures share some common features: diagonal binding mode with α-chain contacting with N-terminal part of peptide and β-chain -with C-terminal part of peptide; CDR3s mainly interact with peptide and CDR1s and CDR2s -with MHC.

Prediction of TCR specificity (that is which peptides are recognised by the given TCR) is critical for many clinically relevant problems: identification of targets for antigen-specific therapies of autoimmune disorders and cancer, prediction of cross-reactivity in TCR-based adoptive T cell therapies, vaccine design, TCR repertoire-based diagnostics. In recent years several sequence-based approaches for prediction of TCR specificity was proposed 4, 5 , utilizing the idea that TCRs which recognise the same antigen have similar sequences. Evidently, these approaches are dependent on experimental data about antigen-specific TCRs and thus can hardly be applied to TCRs with low sequence similarity to TCRs with characterised antigen specificity, as well as for epitopes without identified specific TCRs. Structure-based methods may help to solve these limitations. Approaches for prediction of structure of TCR-peptide-MHC complex have been proposed on the basis of homology modeling 6 or flexible docking 7 . Also in some studies the possibility of structure-based prediction of TCR-peptide interactions was investigated 8,9 . However, limited success was achieved in these works, that may result from using general-class energy functions for protein modeling (such as Rosetta 10 and ZDOCK 11 ) rather than energy potentials specialized for TCR-peptide-MHC complex modeling. The TCR-peptide interface significantly differs from general protein-protein interfaces, that comes from biological functions of TCRs which require that each TCR should be specific and cross-reactive at the same time 12 . Typical values of TCR affinity to their cognate peptide-MHC ligands lie within the range of~1-100 μM that is several orders of magnitude higher than typical values of affinity in protein-protein complexes 13 . Reaching of specificity in such a low affinity range requires interaction mediated by a number of medium-strength contacts rather than by salt bridges, hydrogen bonds and hydrophobic interactions [14] [15] [16] . These features make description of TCR-peptide recognition using conventional approaches (such as MMPBSA or universal scoring functions such as Rosetta) problematic and inspire development of methods specifically focused on prediction of TCR-peptide interactions 17 .

In this work, we describe a novel computational method that ranks TCR-peptide pairs based on statistical estimation of binding energy to select the most likely cognate epitopes. Our approach employs calculation of a statistical potential (called TCRen, from "TCR energy") that describes the interaction energy between TCR and peptide residues. We derived TCRen potential via statistical analysis of amino acid contact preferences in TCR-peptide interfaces from a non-redundant set of crystal structures of TCR-peptide-MHC complexes (Figure 1 , upper panel). Our method starts from a structure of the peptide-MHC complex with the TCR of interest-either experimentally derived or based on a homology model-then extracts a TCR-peptide contact map, estimates the TCRen score for all candidate peptides by convolving the contact map with TCRen pairwise energies, and finally, generates a ranked list of peptides based on this score (Figure 1, lower panel) . Such residue-level scoring is optimal for practical applications in which hundreds or thousands of TCR-peptide pairs are being screened, as it makes the method more robust against inaccuracies in structures and does not require the computationally expensive refinement procedures that are necessary when all-atom scoring functions are used.

Importantly, in contrast to the conventional protein-protein interfaces, the interface of TCR-peptide interaction is essentially asymmetric: TCR repertoire is shaped by V(D)J-recombination and selection processes for minimization of risk of potential self-reactivity, while immunopeptidome is shaped by MHC presentation machinery 19 . Consequently, TCRen that is derived from the natural TCR-peptide-MHC complexes inherits the asymmetric form.

In the first part of the work we describe the derivation of TCRen. Further we assess TCRen ability to discriminate between cognate TCR epitopes and unrelated peptides and between real and mock TCR-peptide-MHC complexes, and compare its performance with other residue-level pairwise potentials, atom-level Rosetta scoring function and MMPBSA approach. Then we perform an analysis of TCRen correlation with physicochemical properties of contacting residues that reveals specific features of TCR-peptide recognition compared to other protein-protein interactions. Finally, we propose a pipeline utilizing TCRen for prediction of cancer neo-epitopes recognised by tumor-infiltrating lymphocytes.

Derivation of TCRen potential is schematically represented in Figure 1, Table 2 ). The used non-redundant database includes human and mouse TCR-peptide-MHC complexes with both MHC-I and MHC-II. Although the majority of TCR contacts with peptide are formed by CDR3α/β loops, a significant number of contacts are formed by CDR1α/β and CDR2α/β loops (Supplementary Figure 1) , so for derivation of TCRen we used all the contacts between TCRs and peptides. For all structures in the non-redundant set we extracted pairwise amino acid contacts between TCR and peptide and summarised the number of occurences of contacts between each of 400 possible amino acid pairs (with explicit assignment which of the residues corresponds to peptide or TCR) across all the structures. Based on these summary contact counts we calculated amino acid contact preferences in TCR:peptide interface, that reflect which residue pairs are found in contact more or less frequently than it is expected from random contact formation. TCRen values for energy of interaction between TCR and peptide residues were calculated from these contact preferences in the assumption that the more energetically favorable residue pair is, the more frequently (with normalization to baseline frequency of contacting amino acids) it is found in real structures.

We considered 3 reference states, which represent different ways to estimate the baseline number of contacts between the given amino acids in the absence of any pairing preferences: "contact-fraction", which considers only residues which actually form contacts 20 , "mole-fraction", which considers all residues in peptides and central parts of CDR3 regions, and "normalized mole-fraction", which accounts for different propensity for forming contacts for residues in different positions of peptides and CDR3s (see details in the Methods section). As was discussed in the Introduction, TCR:peptide interface is asymmetric that results in asymmetric form of TCRen. To investigate the importance of introducing asymmetry we also considered the symmetric form of TCRen, derived from the matrix of TCR-peptide contacts computed regardless of which of the contacting residues come from TCR or peptide. So, in this work we compared the performance of 6 versions of TCRen: symmetric and asymmetric ones with 3 reference states for each. These potentials will further be referred to in the form TCRen.X.Y, when X is either "s" for symmetric or "a" for asymmetric forms, Y is either "cf" for contact-fraction or "mf" for mole-fraction or "nmf" for normalized mole-fraction reference states (notations are summarised in Supplementary Figure 2A) .

For comparison we considered pairwise potentials from AAindex database 21 . Since most of these potentials are highly correlated with each other (Supplementary Figure 3A) , we selected a representative set of 8 potentials derived from amino acid contact preferences either in general protein-protein interfaces ("MOOG990101", "KESO980101") or in protein folding ("MIYS990106", "BRYS930101", "MICC010101", "THOP960101", "TOBD000102", "VENM980101").

Pairwise correlation coefficients ( ) between entries of different versions of TCRen and representative potentials from AAindex are presented as a correlation matrix in Supplementary Figure 3B . Note that for symmetric and asymmetric potentials that use the same ≈ 0. 6 reference state, for potentials of the same type (symmetric or asymmetric) with different ≈ 0. 9 reference states, and no statistically significant correlation exists between any of AAindex potentials and any of TCRen versions. The upper panel describes statistical analysis of contact preferences in TCR-peptide interactions. A set of non-redundant (in terms of CDR3α, CDR3β, and peptide sequences) TCR-peptide-MHC complexes (n = 144, which included structures with MHC class I and II from mouse and human) is selected from all available TCR-peptide-MHC crystal structures (n = 239). Contacts between TCR and peptide residues are identified and summarized (total number of contacts = 3,106). A 20 × 20 matrix of contact odds is then derived by normalizing contact frequencies. Rows and columns in this matrix respectively correspond to amino acids from the peptide and TCR. Cells are colored according to TCRen values. The lower panel outlines the bioinformatic pipeline, starting with a structure of the TCR-peptide-MHC complex-either experimentally derived or from homology modeling-followed by TCR-peptide contact map extraction and convolution with TCRen potential, leading to the final TCRen score. Our method outputs a ranked list of candidate peptides according to their estimated TCR binding energy.

TCRen for identification of cognate TCR epitope First we assessed the performance of TCRen in the task of finding the cognate epitope of the given TCR (that is the peptide contained in the crystal structure of the corresponding TCR-peptide-MHC complex) among a set of unrelated peptides. For each structure in the non-redundant set of TCR-peptide-MHC complexes we extracted a TCR-peptide contact map and calculated TCRen score for the cognate peptide. Then we generated 1000 unrelated peptides (mismatched from the cognate epitope in all non-anchor positions) and calculated TCRen score for each of them (see Methods and Figure 1 , lower panel for details). As a metric of performance we used "cognate epitope rank" -the fraction of unrelated peptides which have lower values of energy of interaction with TCR than the cognate epitope. The lower the cognate epitope rank is, the more accurate distinction between the cognate epitope and unrelated peptides is achived. For a random classifier percentage cognate epitope rank is close to 50%. Figure 2B , where distribution of cognate epitope rank for all TCR-peptide-MHC structures is plotted for each potential. The best results are obtained for TCRen.a.cf: median value of cognate epitope rankis 3%. Notably, for all reference states asymmetric potentials had higher performance than corresponding symmetric versions, likely reflecting the asymmetric nature of TCR-pMHC interaction.

On the same benchmark we compared the performance of TCRen and other pairwise potentials (Figure 2A) . For potentials from AAindex median value of cognate epitope rank is in the range 32-55% which is significantly worse than for TCRen. TCRen.a.cf potential also has the best performance when comparing all complexes with cognate epitopes against all complexes with unrelated epitopes without segregation based on PDB template with , both Since TCRen.a.cf has the best performance in this benchmark, in further sections we mainly considered this version and refer to it as TCRen. Distributions of percentile ranks of peptide-TCR interaction energies for cognate epitopes among a set of unrelated peptides, as calculated for all structures in a non-redundant TCR-peptide-MHC dataset using TCRen (TCRen.a.cf; red) or pairwise potentials from the AAindex database (grey). The latter were derived from amino acid contact preferences in general protein-protein interfaces (MOOG990101, KESO980101) or in protein cores (MIYS990106, BRYS930101, MICC010101, THOP960101, TOBD000102). See Methods for details of the benchmarking experiment. B. ROC curves for distinguishing all complexes with cognate epitopes versus all complexes with unrelated peptides using TCRen (red) or potentials from AAindex (grey). C. Performance of TCRen (red) and pairwise potentials from AAindex database (grey) in discriminating a set of peptides recognized by TCRs 2B4, 226, and 5cc7, which were previously identified from yeast display screening, and unrelated peptides. Top shows sequence logos for the yeast display hits, bottom shows ROC curves for 2B4, 226, and 5cc7. TPR = true positive rate, FPR = false positive rate.

Assessing TCRen performance on an independent validation set of recently deposited PDB structures Figure  5AB) . For majority of these cases TCRen had high performance, e.g. for structures containing cancer neoepitopes (derived from KRAS (6ULR), HHAT (6P64) and p53 (6VQO)), tumor-associated antigens (from cancer testis antigen NY-ESO-1 (6RP9), breast cancer antigen NY-BR-1 (6R2L), melanoma antigen gp100 (6VM7)), autoantigen from fibrinogen β (6V13 and 6V0Y), viral peptide from EBV (6VMX) and others (Supplementary Figure 5A) .

Notably, among these newly deposited structures there were three structures with TCRs recognizing peptides from SARS-CoV2 (7N1E, 7N1E, 7N6E). For two of them, 7N1E (with "RLQ" epitope) and 7N1F (with "YLQ" epitope) TCRen can accurately distinguish cognate epitopes from unrelated peptides (Supplementary Figure 5B) . For the 7N6E case (with "YLQ" epitope) TCRen performance is worse that could be explained by a high negative contribution to TCR-peptide interaction energy from contact between glutamine (Q) in TCR CDR1α and Q in the epitope (Supplementary Table 3 ). Q-Q is considered by TCRen as highly unfavorable, as it is a very rare contact that occurred only once in the structures used for TCRen derivation, while Q occurs relatively frequently both in peptides and in peptide-contacting positions of TCR in these structures (see equations in "Derivation of TCRen" section in Methods). Inspection of other cases for which TCRen fail to identify cognate epitope also revealed that all these structures contain contacts that are extremely rare and are not found or occurred only once in the TCRen derivation set and these contacts have the highest unfavorable impact to the interaction energy (Supplementary Table 3 ). However, if in 7N6E case we consider only contacts between peptide and CDR3α/β (which are regarded to be the most important for TCR recognition and comprise the majority of TCR-peptide contacts 16 ), TCRen performance is significantly improved (Supplementary Figure 5B) . This case raises the question about the interplay between CDR3-mediated and germline-encoded specificity (e.g. governed by TRAV12-1 22 ), which would be peculiar to investigate in future studies.

We further tested the ability of TCRen.a.cf to distinguish all epitopes for TCRs, for which dozens of different recognized peptides were identified using yeast display screening. In this benchmark we considered 3 different TCRs (2B4, 226, 5cc7) from the work of Birnbaum et al. 23 and used the TCRen version, derived without inclusion of structures containing 2B4, 226, 5cc7 and TCRs with similar sequences. ROC-curves for distinction between recognized and unrelated peptides are shown in Figure 2C . For all the 3 cases TCRen demonstrated excellent performance with ROC AUC 0.98, 0.84 and 0.74 for 2B4, 226, 5cc7, respectively.

TCRen for discrimination between real and mock TCR-peptide-MHC complexes Next we assessed performance of potentials in distinguishing real TCR-peptide-MHC complexes from the ones obtained by shuffling of TCRs and pMHCs in existing crystal structures (see Methods section for details). Among different versions of TCRen in this benchmark the best results were demonstrated by TCRen.a.cf ( Figure 3A , Supplementary Figure 2D ), which also had the best performance in distinction between cognate epitope and unrelated peptides as described above (Supplementary Figure 2C) . Pairwise potentials from AAindex database had performance in this benchmark close to that of a random classifier ( Figure 3A) . Accuracy of discrimination between real and shuffled complexes increases when using total energy of peptide interaction with TCR (both with α-and β-chains) rather than energy of interaction with separate chains (Figure 3B ). There is no correlation between calculated peptide-TCRα and peptide-TCRβ energies of interaction ( Figure 3C ) so impact of TCR α-and β-chains to interaction may be considered as complementary.

shuffled TCR-peptide-MHC complexes using TCRen.a.cf (TCRen; red) or pairwise potentials from the AAindex database (grey). B. ROC-curves for discrimination between real and shuffled TCR-peptide-MHC complexes using TCRen when considering only TCRα (TRA) or TCRβ (TRB) or both TCRα and TCRβ (TRA+TRB) chains. C. Absence of correlation between energies of TCRα-peptide (X axis) and TCRβ-peptide (Y axis) interactions calculated using TCRen potential.

In the previous sections we compared TCRen performance with other residue-level pairwise contact potentials. However, more fine-grained atom-level scoring functions (such as Rosetta 10 ) and approaches based on physical calculations (such as MMPBSA 24 ) exist and are widely used for assessment of protein-protein interactions.

First we compared the robustness of TCRen and Rosetta calculations. For this purpose we examined correlation betweenof TCR-peptide interaction scores calculated for one static structure (obtained using fixbb protocol) and averaged across 1000 frames of 100-ns molecular dynamics (MD) simulations of corresponding TCR-peptide-MHC complexes (Supplementary Figure 6 ). For TCRen excellent correlation between these values is observed that highlights robustness of residue-level TCRen potential to inaccuracies in modeled structures that allows its use without computationally expensive MD simulations. For Rosetta dG values calculated from MD trajectories were significantly lower than the ones of static structures in all the cases and no correlation between them was observed, that is explained by fitting of TCR and peptide interfaces during MD simulation. Thus, calculation of Rosetta dG for structures obtained using fixbb protocol is unreliable, and using Rosetta potential for scoring of peptide-TCR interactions requires generation of an ensemble of TCR-peptide-MHC complexes, e.g. using MD simulations.

In view of the lack of robustness for Rosetta, for comparison of TCRen and Rosetta performance we carried out MD simulations and averaged TCR-peptide energy of interaction across the trajectory. We also used the obtained trajectories for MMPBSA calculations. We assessed performance of TCRen, Rosetta and MMPBSA in identification of peptides recognized by DMF5 TCR. For this TCR a number of recognised epitopes was identified by yeast display 18 and these peptides are clustered to two groups with different motifs: one with a hydrophobic core (GIGI in positions 4-7) and the second with a charged core (DRG in positions 4-6). We performed MD simulations of DMF5 TCR-peptide-MHC complexes with different peptides from 3 groups: experimentally confirmed epitopes with either "GIGI" or "DRG" motif and unrelated peptides that share neither of these motifs. For each peptide we calculated trajectory-averaged energy of interaction with TCR using either TCRen or Rosetta or MMPBSA. Only TCRen reported lower values of energy for both "GIGI" and "DRG" groups of peptides compared to unrelated peptides. Rosetta scoring function assigned lower energy values for "DRG" but not for "GIGI" peptides, while MMPBSA approach identified no statistically significant differences between unrelated and recognised peptides from both groups (Figure 4 ). TCRen correlation with physicochemical properties of contacting amino acids As TCRen was derived based on amino acid contact preferences in TCR:peptide interface, it should reflect the principles of TCR-peptide recognition. We performed the analysis of correlation between TCRen entries ( Figure 5A ) and physicochemical properties of contacting amino acids to shed light on the specific nature of TCR-peptide interactions. For comparison we used the potential derived by Miyazawa and Jernigan 25 (further referred to as MJ; Figure 5B ), representing the rules governing protein folding, and the potential from the work of Keskin et al. 26 (further referred to as Keskin; Figure 5C ), reflecting preferences in general protein-protein interactions.

To investigate regularity patterns in TCRen, MJ and Keskin potentials we first grouped non-polar, polar and charged amino acids and compared values for contacts between two non-polar residues, one non-polar and one polar residue, one non-polar and one charged residue, etc. (Figure 5D ). MJ demonstrates clear clustering of non-polar, polar and charged residues ( Figure 5B ) with the best values for contacts of two non-polar residues and the worst values for contacts of two charged residues ( Figure 5D ). Keskin potential follows in general the same patterns as MJ but with more favorable energy values for contacts of oppositely charged residues ( Figure 5E ) and contacts of glutamine and arginine with other amino acids ( Figure  5C ). For TCRen some clustering of amino acids with similar properties (e.g. positively charged R and K, polar T, N and S, hydrophobic W, V, M, F) is observed (Supplementary Figure 7) , however, generally it doesn't have a simple regular pattern analogous to that of MJ and Keskin.

Further we examined correlation between TCRen, MJ and Keskin values and physicochemical properties of contacting residues represented as 5 Atchley factors (AF) 27 , which were derived from a multivariate analysis with dimension reduction of 494 properties of amino acids ( Figure  5G ). MJ and Keskin entries to a greater extent correlate with polarity/hydrophobicity AF1 of individual amino acids. On the opposite, TCRen correlates mostly with the products of AFs of contacting residues that implies that TCRen accounts for interaction of physicochemical properties of contacting amino acids.

We also performed eigenvalue decomposition of TCRen, MJ and Keskin matrices (see Supplementary Note 1, Supplementary Figure 8 ) that also demonstrated mainly hydrophobic nature of driving forces of protein folding 28 and general protein-protein interactions and more complex rules of TCR-peptide interactions. Keskin potential. D-E. Comparison of TCRen and MJ entries grouped by properties of the contacting amino acids. Groups in D.: "nonpolar" -leucine, isoleucine, valine, methionine, phenylalanine, tryptophan, alanine, glycine, proline; "polar" -serine, threonine, asparagine, glutamine, histidine, cysteine, tyrosine; "charged" -arginine, lysine, aspartic acid, glutamic acid). In E. the group of "charged & charged" is decomposed to "+ charged" (arginine, lysine) and "-charged" (aspartic acid, glutamic acid). F. Correlation of TCRen, MJ and Keskin entries with AFs of contacting amino acids. Suffixes ".TCR" and ".peptide" denote residues from TCR and peptide, respectively. The first column corresponds to correlation coefficients of matrix entries with AFs of TCR residues, the last row -to correlation coefficients of matrix entries with AFs of peptide residues, and all other cells (marked with dashed rectangles) -to correlation coefficients between matrix entries and products of AFs, the first of which referred to residue from TCR and the second -to the residue from peptide. MJ and Keskin entries better correlate with properties of individual amino acids, while TCRen -with products of AFs of contacting amino acids. AF1: polarity/hydrophobicity, AF2: secondary structure preferences, AF3: molecular volume, AF4: codon diversity, AF5: electrostatic charge (see also Supplementary Table 4 ).

Tumor-infiltrating lymphocytes (TILs) recognition of neo-epitopes (peptides from mutated proteins) presented by cancer cells is an important part of anti-tumor immunity. Prediction of neo-epitopes recognised by host TILs is critical for development of efficacious neo-epitope-based therapeutic vaccines 29, 30 . Approaches based solely on neo-epitope sequences currently have very limited success 31 and accounting for information about TCR sequences of TILs, which recognise immunogenic neo-epitopes, may improve prediction accuracy.

In previous sections we demonstrated that TCRen can accurately distinguish between cognate TCR epitopes and unrelated peptides. So TCRen may be useful for identification of neo-epitopes recognized by TILs. The overall scheme of the proposed pipeline is presented in Figure 6A . The key stage of this approach is assessment of TCRen score for each of candidate neoepitopes based on a model of TCR-peptide-MHC complex (e.g., obtained using TCRpMHCmodels server 6 ).

We performed a proof-of-concept demonstration of this approach using recently published data on several TCRs of TILs from patients with recurrent advanced epithelial ovarian cancer 32 . We focused on TCR 302TIL recognising neoepitope HHAT p8F (KQWLVWLFL) in HLA-A*02:06 context as only for this TCR the crystal structure of TCR-peptide-MHC complex with cognate peptide-MHC is available 33 . In general case models rather than crystal structures of TCR-peptide-MHC complexes of interest are used for prediction of TCR-peptide recognition (Figure 6A) , thus availability of crystal structure is important for benchmarking purposes as it allows to separate between inaccuracy coming either from modeling errors or from TCRen performance.

For the same patient from which TILs with TCR 302TIL were isolated, 20 candidate neo-epitopes were predicted based on exome sequencing of tumor and normal tissues and NetMHC 34 predictions of binding to HLA-A*02:06. We performed homology modeling of the TCR-peptide-MHC complex for TCR 302TIL using TCRpMHCmodels server 6 (corresponding crystal structure was not used as a template; see Methods for details). Using the resulting model we assessed TCRen score of TCR-peptide interaction for each of 20 candidate neoepitopes and ranked them according to this value. Cognate neoepitope HHAT p8F , had the best predicted TCRen score (Figure 6B ). This result perfectly demonstrates the perspective of TCRen for identification of neoepitopes recognized by TILs. We also performed ranking of the candidate neoepitopes using the crystal structure of 302TIL TCR-peptide-MHC complex. In this case TCRen also had an excellent performance, ranking the cognate neoepitope as Top1 ( Figure 6C) . Interestingly, the set of TCR-peptide residue contacts was not identical for the crystal structure and the homology model with~60% of contacts shared. The fact that cognate peptide got high rank using both for the crystal structure and the homology model, may support the idea that TCR-peptide recognition is robust to small inaccuracies in TCR-peptide-MHC structure that is important in early stages of TCR-pMHC interaction. TCRen values for energy of interaction of TCR 302TIL with candidate neo-epitopes. Contact map for the 302TIL TCR was taken from either a homology model (B) or the actual crystal structure (PDB ID: 6UK4) (C). The cognate peptide recognized by TCR 302TIL is highlighted in red. Note that five out of the top seven peptides with the best TCRen score as well as the peptide with the worst score are matched when ranking based on both the homology model and the PDB structure.

TCR recognition of peptides presented by MHC molecules on the surface of cells is a critical event in triggering and realization of adaptive immune response. Recent advances in high-throughput sequencing allowed deep and accurate profiling of TCR repertoires 35 , and emulsion-based techniques allow for the massive pairing of TCR alpha and beta chain repertoires 36, 37 . However, a major challenge still exists in linking sequences of TCRs and their specificities.

In this study we propose a method for prediction of TCR:peptide recognition. We provide the pipeline which requires a structure of TCR-peptide-MHC complex (e.g. obtained by homology modeling in TCRpMHCmodels server 6 ) and a list of candidate peptides and outputs a ranked list of these peptides according to their binding probabilities. Given the structure of TCR-peptide-MHC complex, the pipeline first extracts all pairwise contacts between peptide and MHC and then estimates the energy of TCR-peptide interaction for all candidate peptides using a statistical potential named TCRen.

TCRen was derived based on amino acid contact preferences in TCR:peptide interface of TCR-peptide-MHC crystal structures obtained to date. These preferences may be calculated in the assumption of either asymmetry (residues of the same amino acid in TCR and in peptide side are regarded as different residue types) or symmetry of peptide:TCR interface. Asymmetric form of the potential can in part account for multi-body interactions and differences in the environment of residues in CDRs and peptides. We compared both approaches and found that asymmetric potential performs significantly better. This result is in line with the intrinsic asymmetry of TCR-peptide interface. Since TCRs with highly hydrophobic CDR3s may be strongly cross-reactive due to non-specific hydrophobic interactions, amino acid composition of TCR:peptide interfaces is characterized by depletion in hydrophobic and enrichment in polar amino acids in CDR3 regions of TCRs compared to immunopeptidome (Supplementary Figure  9) . Better performance of asymmetric potential in our study is in concordance with the recent work of Brenke et al. 38 which demonstrated superior performance of asymmetric potential compared to symmetric one in the task of antibody-antigen docking and linked these results to enrichment of aromatic residues (tyrosine, phenylalanine and tryptophan) in CDR regions of antibodies.

We showed that TCRen can accurately distinguish between the cognate epitope and unrelated peptides, both for a wide set of TCRs from PDB for which either one or a set of recognized peptides is known, as well as between real and mock TCR-peptide-MHC complexes. On the same benchmarks we demonstrated that other approaches developed for more general tasks related to protein-protein interactions (including residue-level pairwise potentials from AAIndex database 21 , all-atom Rosetta scoring function and MMPBSA approach) fail to predict TCR:peptide recognition. The superior performance of TCRen is explained by accounting for the special nature of TCR:peptide recognition, characterized by lower impact of hydrophobic interactions and complex interplay between different properties of contacting amino acids. Due to residue-level contact definition, TCRen is not sensitive to fine details of TCR-peptide-MHC structure. This is highlighted by high correlation of TCRen energies calculated from one static structure and averaged across 100-ns MD simulations (for Rosetta scoring function dG for a single structure and MD-averaged didn't correlate at all).

Finally, we demonstrate that TCRen can be used for identification of neo-epitopes recognised by tumor-infiltrating lymphocytes.

The major limitation of the TCRen is the small number of available crystal structures of TCR-peptide-MHC complexes accumulated to date. That resulted in low counts for some rare pairs of amino acids (e.g. the ones including cysteine or histidine residues) and consequently high uncertainty for TCRen values for these pairs. So an investigator should carefully examine results obtained for TCR-peptide interfaces which include rare pairs of contacting residues. Increase in the number of available TCR-peptide-MHC crystal structures will provide more statistics for rare amino acid pairs. However, for majority of residue pairs the used non-redundant database has enough statistics for calculation of corresponding values with high precision. Validation in a holdout set of newly deposed TCR-peptide-MHC structures demonstrates that TCRen can identificate epitopes even for TCRs that share no homology to TCRs used for TCRen derivation.

The accuracy of TCRen is dependent to a large extent on correctness of structure of TCR-peptide-MHC complex used for extraction of peptide-TCR contacts. So in practical applications of TCRen for identification of epitopes of TCRs for which no crystal structures are available, the accuracy of TCR-peptide-MHC models is critical. If templates with a close sequence of CDR regions are available, homology modeling may be used, in other cases flexible docking approaches should be used 7 . We expect that TCRen potential may also facilitate improvement of TCR-peptide-MHC modeling algorithms, being incorporated in the stage of rescoring of decoys obtained by homology modeling or docking and/or in sampling procedure.

We suppose that TCRen may be used in development of novel approaches for prediction of TCR specificity and cross-reactivity with applications in antigen-specific therapy in cancer and autoimmunity and vaccine design.

Derivation of TCRen potential and calculating TCRen score For all structures in the non-redundant set of TCR-peptide-MHC complexes, we extracted contacts between TCR and peptide residues. Residues were considered to be in contact if any two atoms of these residues were separated by a distance ≤ 5 Å. The number of contacts that CDR and framework regions of TCR form with peptides in different crystal structures is shown in Supplementary Figure 1 , the number of contacts that residues in specific positions in CDRs of different lengths form with peptides is shown in Supplementary Figure 10 .

Derivation of TCRen potential is based on the assumption that the distribution of frequencies of pairwise amino acid contacts can be described using the inverse Boltzmann relationship:

where is a residue from TCR, is a residue from the peptide, is the observed ( , ) probability of contacts between these residues in the database, and is the expected ( , ) probability of contacts between these residues when assuming random pairing of amino acids. and are calculated using the following equations:

where is the number of contacts between residues and observed in the # ( , ) database plus pseudocount 1, and , where summation is carried out through all 20 amino acids denoted with indices and .

In this work we also considered two other reference states ("mole-fraction" and "normalized mole-fraction") and corresponding symmetric versions of TCRen. Equations used for their derivation are described in detail in Supplementary Methods.

The interaction energy between TCR and peptide in the given complex is estimated as:

where summation is carried out through all residues in the peptide and all residues in the TCR; if residues and are in contact and otherwise. δ( , ) = 1 = 0

Cognate epitope identification among unrelated peptides For each structure in the non-redundant set of TCR-peptide-MHC complexes, we extracted a contact map: a list of contacts between particular residues in the TCR and peptide with residue positions indicated (see Figure 1 for an example).

For each entry in our non-redundant database of TCR-peptide-MHC structures, we composed a list of 1,000 unrelated peptides with the same length as the cognate epitope, but which were mismatched from the cognate epitope in all non-anchor positions. The second and last residues were considered anchors, as they are responsible for MHC binding. For each unrelated peptide, we generated contact maps from the corresponding crystal structure and substituted residues of the cognate epitope with residues located at the same positions in the unrelated peptide.

For each peptide (both cognate epitope and unrelated ones) we calculated TCRen score using equation 4. Then for each structure in our non-redundant set we calculated coganate epitope rank -the fraction of unrelated peptides which have TCRen score lower (better) than the score of the cognate epitope.

In the database of TCR-peptide-MHC complexes we found pairs of structures which have the same lengths of CDR3α, CDR3β and peptide, but mismatched in at least 3 positions in each of these 3 regions. Shuffled complexes were constructed by combining TCRs and pMHC from different complexes in these pairs. Contact map for a shuffled complex was taken from the original crystal structure with TCR included in the shuffled complex. For this benchmark we used only complexes with MHC-I as peptides in complexes with MHC-II may accommodate different binding registers that makes comparison of contact maps in different complexes ambiguous.

Molecular dynamics simulations of DMF5 TCR-peptide-HLA-A02 complexes MD simulations were performed for DMF5-peptide-HLA-A02 complexes with 42 different peptides from 3 groups ("GIG", "DRG" and "random") using two different starting structures. This yields 84 systems, for each 100 ns MD simulations were carried out in at least 3 replicas for "GIG" and "DRG" peptides and 2 replicas for "random" peptides (a total simulation time was 17,8 μs). Sequences for "GIG" and "DRG" peptides were taken from the work by Riley et al. 39 , both groups consist of 5 peptides for each of them recognition by DMF5 TCR in HLA-A02 context was confirmed using SPR and INFγ release assay. 32 "Random" peptides were taken from IEDB database 40 with the following conditions: they are experimentally determined to be HLA-A*02:01 binders, derived from human proteins and share neither "GIG" nor "DRG" motif. Starting structures for MD simulations were constructed based on PDB crystal structure templates (6AM5 for "standard" register, 6AMU for "shifted" register) with corresponding peptide sequence introduced by "mutate_residue" PyRosetta protocol.

Preparation of the system and MD simulations were carried out using GROMACS v.5.1.4 software with Amber99sb-ildn force field (see Supplementary Methods for details).

Rosetta Interface analyser module was used for calculation of the binding energy (dG_separated) according to Rosetta scoring function 10 between TCR and peptide. Molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) was applied for estimation of binding energy between TCR and peptide-MHC using g_mmpbsa tool 24 . 100-ns MD trajectories were splitted to 1000 frames, for each frame calculated using Rosetta scoring function or Δ MMPBSA and then was averaged over all the frames. Δ

Sequences of 302TIL α-and β-chains were taken from the crystal structure PDB ID: 6UK4 (from Devlin et al. 13 ). The model of TCR-peptide-MHC complex for 302TIL with the random peptide VVVVVVVVV (chosen because valine has an intermediate value of side-chain volume amongst the 20 canonical amino acids) and HLA-A*02:06 was obtained using the TCRpMHCmodels server 14 , which performs homology modeling based on an internal database of TCR-peptide-MHC templates. Templates selected by TCRpMHCmodels for TCR-peptide-MHC modeling are shown in Supplementary Figure 11A . Superimposition of the resulting homology model and crystal structure 6UK4 is shown in Supplementary Figure 11B .

Sequences of candidate neo-epitopes were from Bobisse et al. 12 . In this study, 20 peptides were identified that were predicted to bind to HLA-A*02:06 and include non-synonymous somatic mutations. For both the homology model and the crystal structure, we derived contact maps, and for each of the 20 candidate neo-epitopes, we estimated the energy of TCR-peptide interaction based on this contact map using TCRen as described above. Sequences of candidate neo-epitopes and the corresponding TCRen scores of TCR-peptide interaction are shown in Supplementary 

For each pair of TCR-peptide-MHC crystal structures from PDB we calculated Levenstein distance between their CDR3α, CDR3β and peptide sequences (d CDR3α , d CDR3β and d peptide , respectively) and the overall sequence distance between these two structures were calculated as:

Then we performed hierarchical clustering of PDB structures based on this measure with cutting the resulting tree at the height 6. This procedure yielded in 135 clusters.

If selection of a non-redundant set is based only on sequence information, this may result in filttering out structures with similar TCRs recognising similar epitopes using different mechanisms (see, for example, structures with PDB ID 1NAM and 2OL3; 5M02 and 6G9Q). To include such cases in our database, we performed additional clustering based on contact information. For each structure we extracted all TCR-peptide contacts and represented them as a 400-mer vector containing information about the number of contacts of residues of the given type (i.e. if we label amino acids from A to W as 0 to 19, the contact between i-th and j-th amino acid has an index of i * 20 + j). For each pair of structures we calculated the Manhattan (rectilinear) distance between these 400-mer vectors normalized on the sum of lengths of these vectors. Then we performed hierarchical clustering of PDB structures based on this measure with cutting the resulting tree at the height 0.7. This procedure yielded in 91 clusters.

Finally, we divided all the PDB structures to groups containing structures assigned to the same clusters according to both sequence-based and contact-baseded clustering. We select a single representative from each group to include in the nonredundant set of TCR-peptide-MHC structures (n = 144).

We also explored other approaches for calculating TCRen. We considered three different reference states-contact-fraction, mole-fraction, and normalized mole-fraction-and derived symmetric and asymmetric versions for each of these. In the symmetric version, we considered contacts without assignment of which molecule type (peptide or TCR) the contacting residues come from:

.

That resulted in and consequently .

In the asymmetric version, we explicitly considered chain types corresponding to both interacting residues:

.

In general cases, this resulted in:

and consequently

and are also calculated separately for peptide and TCR, and equation (3) for ( ) ( ) asymmetric potential may be rewritten as:

Definitions of the reference states, which represent different approaches for calculation of , were the following: The reason for introducing a normalized mole-fraction reference state is the different propensities of forming contacts for residues at different positions in peptides and CDRs of different lengths. As can be seen from Supplementary Figure 10 , residues that are close to the center of peptide or CDR3 tend to form more contacts. However, substantial variation in the number of contacts for given peptide and CDR3 positions exists between structures, so the observation that a residue in the center of the peptide or CDR3 does not form contacts with its counterpart in the given structure may indicate that all available contacts are energetically disfavored and that it is more favorable to avoid any contact.

In calculating the mole fractions of residues in TCRs, we considered only residues in CDR3, with trimming of the first three and the last four residues. This is because these residues in CDR3 almost never contact the peptide in available TCR-peptide-MHC crystal structures (Supplementary Figure 10) . Although CDR1α/β and CDR2α/β often form a number of contacts with the peptide (Supplementary Figure 1) , position-specific interaction patterns are not regular for these regions, and the median number of contacts is 0 for almost all residue positions in CDR1α/β and CDR2α/β (Supplementary Figure 10) Figure 4) are underlined. To assess intrinsic regularity of TCRen potential, we performed eigenvalue decomposition of the corresponding matrix. As this method is applicable only to symmetric matrices, we performed eigenvalue decomposition of TCRen.s.cf matrix, which also has excellent performance in all benchmarks ( Table 1) and thus may be considered as appropriate for representation of rules of TCR:peptide interactions. Eigenvalue decomposition of asymmetric TCRen.a.cf matrix can be performed after a simple transformation and its analysis results in qualitatively the same conclusions as reported further in this section (see Supplementary Note 1).

Eigenvalue decomposition allows to represent all entries of the matrix as:

where are eigenvalues and are elements of corresponding eigenvectors. λ α α,

If some eigenvalues prevail and have higher absolute values compared to other eigenvalues, corresponding eigenvectors may be considered as reflecting some properties of residues that govern interaction (in case of TCRen -TCR-peptide recognition).

Eigenvalue spectrum of TCRen.s.cf matrix and correlation coefficients of eigenvectors and AFs are shown in Supplementary Figure 8 , left panel. For comparison, eigenvalue decomposition was performed also for MJ and Keskin matrices (before analysis they were modified by subtracting mean values (Supplementary Figure 8, central and right panels) . In the < >)

case of MJ as little as two eigenvalues prevail, both with a strong negative correlation with polarity, that points to the hydrophobic nature of driving forces of protein folding 28 . The form of the eigenvalue spectrum of Keskin matrix is similar to that of MJ matrix, with two dominating eigenvectors negatively correlated with polarity. The notable difference between Keskin and MJ eigenvalue spectrums is that Keskin prevailing eigenvectors also correlate with AF2 (related to secondary structure preferences) and absolute values of other eigenvectors of Keskin are a bit higher compared to MJ matrix, that explains higher correlation of Keskin entries with AF2 of contacting amino acids ( Figure 6D ) compared to MJ. However, both entries and prevailed eigenvectors of Keskin have the highest correlation with hydrophobicity AF1, so the major driving force of general protein-protein interactions may be also considered as hydrophobic.

Eigenvalue spectrum of TCRen.s.cf ( Figure 6E) is much more complex than that of MJ and Keskin and a higher number of eigenvalues prevail. Top eigenvectors correlate with different properties of contacting residues: AF4 for eigenvector 1, AF1 for eigenvectors 2 and 3, AF5 for eigenvectors 18 and 19. This analysis suggests that different properties of contacting residues contribute to specificity of TCR-peptide recognition. Although hydrophobicity is important for this process (that is evident from correlation of TCRen.a.cf entries with products of AF1s of contacting residues in Figure 6B and correlation of TCRen.s.cf eigenvectors 2 and 3 with AF1 in Figure 6E ) it is not as dominant as in protein folding and protein-protein interaction and other properties (such as electric charge) are also important for TCR-peptide interaction.

As eigenvalue decomposition may be performed only for symmetric matrices, a simple transformation of TCRen.a.cf matrix was made: residues of the same type in TCR and 20 × 20 peptide were considered as distinct entities that resulted in a symmetric matrix of size 40 × 40 with zero values in cells for which both dimensions correspond to residues from the same molecule (TCR or peptide). Eigenvalue spectrum of this matrix is shown in Figure  40 × 40 SN1A. To examine the relation between eigenvectors and properties of amino acids, correlation coefficients of each eigenvector with 5 AFs were computed separately for residues in TCR and peptide (Figure SN1B-C) . Top eigenvectors correlate with different properties of contacting residues and this pattern slightly differs for residues in TCR and peptide side (AF1, AF3, AF4 and AF5 for TCR residues; AF1, AF2 and AF4 for peptide residues; see Figure SN1B -C). 

Diversity and clonal selection in the human T-cell repertoire

Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling

The Protein Data Bank

Identifying specificity groups in the T cell receptor repertoire

Quantifiable predictive features define epitope-specific T cell receptor repertoires

TCRpMHCmodels: Structural modelling of TCR-pMHC class I complexes

A flexible docking approach for prediction of T cell receptor-peptide-MHC complexes

Genome-wide structural modelling of TCR-pMHC interactions

Identification of the cognate peptide-MHC target of T cell receptors using molecular modeling and force field scoring

The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design

ZDOCK: an initial-stage protein-docking algorithm

Why must T cells be cross-reactive?

A structure-based benchmark for protein-protein binding affinity

How the thymus designs antigen-specific and self-tolerant T cell receptor sequences

Emerging Concepts in TCR Specificity: Rationalizing and (Maybe) Predicting Outcomes

T cell antigen receptor recognition of antigen-presenting molecules

ATLAS: A database linking binding affinities with structures for wild-type and mutant TCR-pMHC complexes

Antigen Identification for Orphan T Cell Receptors Expressed on Tumor-Infiltrating Lymphocytes

HLA binding of self-peptides is biased towards proteins with specific molecular functions

Use of pair potentials across protein interfaces in screening predicted docked complexes

AAindex: amino acid index database, progress report

Germ line-governed recognition of a cancer epitope by an immunodominant human T-cell receptor

Deconstructing the peptide-MHC specificity of T cell recognition

Open Source Drug Discovery Consortium & Lynn, A. g_mmpbsa--a GROMACS tool for high-throughput MM-PBSA calculations

Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation

Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions

Solving the protein sequence metric problem

Nature of Driving Force for Protein Folding: A Result From Analyzing the Statistical Potential

Neoantigen vaccine: an emerging tumor immunotherapy

Therapeutic vaccines for cancer: an overview of clinical trials

Key Parameters of Tumor Epitope Immunogenicity Revealed Through a Consortium Approach Improve Neoantigen Prediction

Sensitive and frequent identification of high avidity neo-epitope specific CD8 + T cells in immunotherapy-naive ovarian cancer

Structural dissimilarity from self drives neoepitope escape from immune tolerance

NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11

Using T Cell Receptor Repertoires to Understand the Principles of Adaptive Immune Recognition

Single Cell T Cell Receptor Sequencing: Techniques and Future Challenges

Pairing of T-cell receptor chains via emulsion PCR

Application of asymmetric statistical potentials to antibody-protein docking

T cell receptor cross-reactivity expanded by dramatic peptide-MHC adaptability

The Immune Epitope Database (IEDB): 2018 update

The starting structure was placed in a rectangular box filled with explicit TIP3P water with size allowing the distance between the complex and the box boundaries to be no less than 1.2 nm. Then the system was neutralized by addition of a necessary number of Na + ions, minimized by a steepest descent method and heated to 298 K with positional restraints for all heavy atoms. MD replicas were performed from the structures obtained after heating using different seeds for initial velocities of atoms. In order to prevent rotation of the complex leading to its interaction with periodic images we used positional restraints on Cα-atoms of MHC residues 182 and 197 along the two smaller axes of the box with force constants 10-times less than the ones used in the heating stage. These residues were chosen as they are located in MHC α3-domain so these restraints should have no impact on TCR-pMHC interface behavior. The obtained trajectories were visually checked on absence of interactions of the complex with its periodic images. Use of rectangular box with imposed positional restraints resulted in~5-time reduction of computational time of MD compared to simulation of the same systems in dodecahedronic box.

Supplementary Figure 1 . Distribution of contacts between complementarity determining (CDR) and framework (FR) regions of TCR with peptide in TCR-pMHC crystal structures of the non-redundant set used for TCRen derivation.

Summary of notations for different versions of TCRen potential. All versions are called as TCRen.X.Y, when X is either "s" for symmetric or "a" for asymmetric forms, Y is either "cf" for contact-fraction or "mf" for mole-fraction or "nmf" for normalized mole-fraction reference states. B. Distributions of cognate epitope ranks for different versions of TCRen in cognate/unrelated peptides benchmark (benchmark setup is the same as in Figure 2A , see Methods for the details). Median value for each potential is marked. C. ROC-curves for distinction between all complexes with cognate epitopes and all complexes with unrelated peptides using different versions of TCRen. D. ROC-curves for discrimination between real and shuffled TCR-peptide-MHC complexes using different versions of TCRen (benchmark setup is the same as in Figure 3A , see Methods for the details). 

Supplementary Table 1 . Non-redundant set of crystal structures of TCR-peptide-MHC complexes from PDB used in the work for derivation of TCRen potential.