key: cord-0939172-idp4odj6
authors: Gao, Ang; Chen, Zhilin; Amitai, Assaf; Doelger, Julia; Mallajosyula, Vamsee; Sundquist, Emily; Segal, Florencia Pereyra; Carrington, Mary; Davis, Mark M.; Streeck, Hendrik; Chakraborty, Arup K.; Julg, Boris
title: Learning from HIV-1 to predict the immunogenicity of T cell epitopes in SARS-COV-2
date: 2021-03-15
journal: iScience
DOI: 10.1016/j.isci.2021.102311
sha: 06af1f66a46ff2b1af9626dfc33687f98bdef162
doc_id: 939172
cord_uid: idp4odj6

We describe a physics-based learning model for predicting the immunogenicity of Cytotoxic-T-Lymphocyte (CTL) epitopes derived from diverse pathogens including SARS-CoV-2. The model was trained and optimized on the relative immunodominance of CTL epitopes in Human Immunodeficiency Virus infection. Its accuracy was tested against experimental data from COVID-19 patients. Our model predicts that only some SARS-CoV-2 epitopes predicted to bind to HLA molecules are immunogenic. The immunogenic CTL epitopes across all SARS-CoV-2 proteins are predicted to provide broad population coverage, but those from the SARS-CoV-2 spike protein alone are unlikely to do so. Our model also predicts that several immunogenic SARS-CoV-2 CTL epitopes are identical to seasonal coronaviruses circulating in the population and such cross-reactive CD8+ T cells can indeed be detected in prepandemic blood donors, suggesting that some level of CTL immunity against COVID-19 may be present in some individuals prior to SARS-CoV-2 infection.

It is generally thought that an effective vaccine will likely be required to bring the pandemic of Coronavirus Disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, under control. Thus, a myriad of efforts to develop vaccines that may protect against infection by SARS-CoV-2 have been launched (Akst, 2020). To date, 3 vaccines have received emergency use authorization (EUA) from the FDA in the US. These vaccines primarily elicit protective antibody responses against the spike protein of SARS-CoV-2. How durable the protection conferred by these and other vaccine candidates will be remains to be shown.

While current vaccines and vaccine candidates are primarily geared towards induction of neutralizing antibody responses, a potential role for SARS-CoV-2-specific T cells in protection against and control of infection has been proposed (Le Bert et al., 2020; Braun et al., 2020; Rydyznski Moderbacher et al., 2020). Indeed, there is evidence from other coronaviruses, like SARS-CoV, which caused Severe Acute Respiratory Syndrome (SARS) in 2003, that the antibody response elicited in infected patients was protective but relatively short-lived (Liu et al., 2006; Mo et al., 2006; Tang et al., 2011). In contrast, T cell responses were more durable (Channappanavar et al., 2014; Fan et al., 2009; Tang et al., 2011). For example, Fan et al. (2009) showed that most patients who recovered from SARS-CoV infection had memory T cell responses directed against the virus 4 years after recovery. Tang et al. (2011) showed that, 6 years after recovery, SARS-CoV patients did not have significant amounts of virus-specific circulating antibodies, but had significant memory T cell responses compared to healthy controls. Furthermore, a critical role for virus-specific memory T cells in broad and long-term protection against SARS-CoV infection has been elucidated in animal models (Channappanavar et al., 2014; Zhao et al., 2010). The nucleocapsid (N), membrane (M), and envelope (E) proteins of SARS-CoV-2 are over 90% conserved compared to SARS-CoV, and the spike (S) protein is 76% similar (Ahmed et al., 2020). Given this similarity between SARS-CoV-2 and SARS-CoV, it is worth exploring the development of vaccines that may also elicit protective T cell responses.

Multiple recent studies have focused on discovering potential epitopes of SARS-CoV-2 that can elicit T cell responses. Ahmed et al. (2020) and Grifoni et al. (2020a) tried to identify peptides of SARS-CoV-2 that have high sequence identity with SARS-CoV epitopes. However, only a small number of SARS-CoV peptides that are experimentally known to elicit T cell responses in humans are shared by SARS-CoV-2. Moreover, these shared peptides are associated with a limited set of HLA molecules, thus providing poor coverage of the global population. Ahmed et al. (2020) and Prachar et al. (2020), among other groups (Campbell et al., 2020; Nerli and Sgourakis, 2020), also identified SARS-CoV-2 peptides that are capable of binding to HLA molecules, either based on MHC binding assay results or bioinformatic methods. By doing this, they identified a large pool of SARS-CoV-2 peptides that are associated with diverse HLA molecules, which cover a broad cross-section of the global population. However, binding to HLA molecules does not imply that the peptide epitope will elicit an immunogenic T cell response in humans (Yewdell, 2006). Predicting the immunogenicity of T cell epitopes in humans with given HLA types in a reliable way is a major challenge. The ability to address this challenge will significantly aid the design of vaccines that aim to elicit protective T cell responses against diverse pathogens, with SARS-CoV-2 being one example.

We aimed to systematically determine the relative immunogenicity of CTL epitopes in people with diverse HLA alleles by developing a physics-based learning algorithm. We exploited the fact that short peptides, about 9-11 amino acids in length, are not unique to the organism from which they are derived (Butler et al., 2013; Kosmrlj et al., 2008). Therefore, the information obtained from data on epitope-specific CTL responses against one pathogen can be used to learn CTL immunogenicity patterns against diverse other pathogens. To train and optimize our method, we first studied a viral infection, Human Immunodeficiency Virus (HIV), for which well-defined data on the relative immunodominance of CTL epitopes are available. The results suggest that our method is more accurate than the immunogenicity prediction tool publicly available on the Immune Epitope Database (IEDB, http://tools.iedb.org/immunogenicity) (Calis et al., 2013; Vita et al., 2019).

We then applied our algorithm to identify immunogenic SARS-CoV-2 peptide epitopes and tested our approach by experimentally quantifying CTL responses towards a selection of predicted immunodominant optimal peptides in patients with COVID-19. Our model predicts that only a fraction of the peptide epitopes that are known to bind different HLA molecules are likely to be immunogenic. However, the set of immunogenic peptides still provides broad coverage of the global population. Given the low mutability of the SARS-CoV-2 virus so far, these results suggest that a whole proteome immunogen may be able to elicit potent T cell responses in diverse individuals. We also predict that the immunogenic CTL epitopes contained in the spike protein of SARS-CoV-2 (the immunogen in most current vaccine formulations) are unlikely to provide broad population coverage, since these spike epitopes are associated with a limited number of HLA alleles. Finally, several predicted immunogenic peptide epitopes derived from the SARS-CoV-2 proteome are identical to those contained in the four human coronaviruses of low pathogenicity (HCoV) that regularly circulate in the human population. Indeed, we found evidence for such cross-reactive CTL responses in pre-pandemic blood, suggesting that HCoV-specific memory CTL responses present in a subset of the population may be able to target SARS-CoV-2 epitopes.

Our results provide a useful guide for the design of second-generation COVID-19 vaccines that aim to elicit CTL immunity and will also inform other investigators about the likely dominant T-cell responses they may wish to test in patients with specific HLA types. More importantly, upon further testing, validation and elaboration, our approach for predicting immunogenicity of CTL epitopes may be useful for diverse infectious pathogens, including those that will undoubtedly emerge in the future.

Model development and training against clinical data

CTLs bind to short peptides, about 9-11 amino acids in length, displayed in complex with HLA class I molecules. Such short motifs do not contain any long-range information about the genome of the organism from which they are derived (Butler et al., 2013; Kosmrlj et al., 2008). Therefore, the ability to predict relative immunogenicity of peptide epitopes derived from the genome of one virus in persons with a given HLA allele is likely to allow prediction of immunogenicity of epitopes derived from another virus' proteome.

Our model for immunogenicity of CTL peptide epitopes is inspired by studies aimed at predicting the immunogenicity of cancer neo-antigens for immunotherapy (Luksza et al., 2017). We wish to predict the peptide immunodominance hierarchy in people with different HLA genes. For that purpose, we define a "CTL immunogenicity metric", $A(s, h)$, where $s$ is the sequence of the peptide whose relative immunogenicity we wish to predict for a person with the HLA allele $h$. This metric is the product of three terms, and is written as follows:

$$A(s, h) = B(s, h) \times S(s, \mathrm{pathogen}) \times S(s, \mathrm{self}) \qquad (1)$$

Each of the terms above reflects a different physical phenomenon. The binding term, $B(s, h)$, is a measure of the probability that the peptide $s$ can be processed, bound to, and displayed by HLA molecule $h$. Machine learning approaches have been trained on many measurements of peptide presentation by different HLA molecules, and a resulting method, netMHCpan4.0 (Vanessa and Nielsen, 2017), can make reasonable estimates of $B(s, h)$ for many alleles as the percentile rank of the elution-ligand score. We next posit that whether or not a peptide is targeted by human CTLs should correlate with how similar its sequence is to peptides derived from diverse pathogens that are known to elicit a CTL response in humans (listed in the IEDB database (Vita et al., 2019); Methods). The term $S(s, \mathrm{pathogen})$ in Eq. 1 is the sequence similarity of peptide $s$ to these pathogen-derived peptides. We define $S(s, \mathrm{pathogen})$ mathematically as the number of pathogenic peptides whose alignment score with $s$ is larger than a threshold value, $a_{\mathrm{pathogen}}$:

$$S(s, \mathrm{pathogen}) = \sum_{p \,\in\, \mathrm{pathogen}} \Theta\left(|s, p| - a_{\mathrm{pathogen}}\right) \qquad (2)$$

Here, $p$ is a pathogenic peptide in the database, $|s, p|$ is the alignment score of $s$ and $p$, which is determined by the BLOSUM62-based Smith-Waterman alignment method (as used by Luksza et al. (2017)), and $\Theta$ is the step function. A higher alignment score means that the biochemical properties of the two peptides are more similar to each other.
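The counting step in Eq. 2 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the local alignment is a plain Smith-Waterman recursion with a linear gap penalty, a toy match/mismatch score stands in for a BLOSUM62 lookup so that the block stays self-contained, and the function names, gap penalty, and example peptides are purely illustrative.

```python
from typing import Callable, Iterable


def toy_score(x: str, y: str) -> int:
    """Stand-in for a BLOSUM62 lookup (assumption): +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def smith_waterman_score(a: str, b: str,
                         score: Callable[[str, str], int] = toy_score,
                         gap: int = -4) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                0,                                        # restart the local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


def similarity_count(peptide: str, references: Iterable[str], threshold: float,
                     score: Callable[[str, str], int] = toy_score) -> int:
    """Number of reference peptides whose alignment score with `peptide`
    is strictly larger than `threshold` (the step-function sum)."""
    return sum(1 for ref in references
               if smith_waterman_score(peptide, ref, score) > threshold)
```

The same function would serve for the self-peptide count by passing a list of human-derived peptides and the corresponding threshold. Swapping `toy_score` for a real substitution-matrix lookup changes the scale of the scores, so any threshold used with the toy score is not comparable to the fitted thresholds in the model.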

T cells develop in the thymus, where they are exposed to HLA-bound peptides derived from the host's proteome. For a thymocyte to mature into a peripheral T cell, it must bind to at least one of these peptides with an affinity that exceeds a threshold value, and not bind to any of them with an affinity that exceeds a higher threshold value (Daniels et al., 2006). In past studies (Butler et al., 2013; Kosmrlj et al., 2008, 2009; Košmrlj et al., 2010; Stadinski et al., 2016), we developed a mechanistic understanding of how thymic development shapes the pathogen reactivity of the T cell repertoire in an organism. Extending these studies leads us to the seemingly counter-intuitive conclusion that T cells that bind to human peptides more strongly will also be likely to bind more strongly to pathogen-derived peptides. One simple way to understand this is by examining data in mice, which show that more self-reactive T cells are statistically enriched in more hydrophobic amino acids at residues that contact the HLA-bound peptides (Stadinski et al., 2016). This is because TCR binding to HLA-bound peptides creates an interface from which water must be partially expelled, so hydrophobic amino acids are more likely to favor formation of such an interface. However, this argument applies to both self and pathogen-derived peptides. Therefore, statistically, TCRs that bind more avidly to self-peptides should also bind more avidly to pathogen-derived peptides. There is some experimental evidence supporting this prediction (Mandl et al., 2013). Based on these considerations, we include the term $S(s, \mathrm{self})$ in Eq. 1, which is the biochemical sequence similarity of peptide $s$ to peptides derived from the human proteome. These peptides are also gathered from the IEDB database (Vita et al., 2019) (Methods). Similar to Eq. 2, $S(s, \mathrm{self})$ is defined as the number of self-peptides whose alignment score with $s$ is larger than a threshold value, $a_{\mathrm{self}}$:

$$S(s, \mathrm{self}) = \sum_{p' \,\in\, \mathrm{self}} \Theta\left(|s, p'| - a_{\mathrm{self}}\right) \qquad (3)$$

Here, $\Theta$ is the step function and $p'$ now denotes a self-peptide.

We used Eqs. 1-3 to train a predictor of the immunodominance hierarchy of peptides targeted by CTLs in humans with different HLA alleles. The two parameters in our model are the alignment-score thresholds for self and pathogenic peptides, $a_{\mathrm{self}}$ and $a_{\mathrm{pathogen}}$, which are determined by fitting the model to the training data (Methods). Given experimental measurements on the immunodominance hierarchy of peptides derived from pathogens in humans with different HLA alleles, we constructed a binary classifier based on the immunogenicity metric $A(s, h)$. A peptide with $A(s, h)$ larger than a threshold is classified as dominant and the others as non-dominant. We trained and tested our model for $A(s, h)$ as a predictor of immunodominance using experimental data on HIV peptides targeted by humans with different HLA alleles (Methods).
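The product form of the metric and the thresholding step can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the binding term is supplied by the caller (how a netMHCpan percentile rank is mapped onto it is not specified here), and the cutoff in any real use would come from fitting to training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideFeatures:
    """Inputs to the metric for one (peptide, HLA allele) pair (hypothetical container)."""
    binding: float        # binding term; its derivation from netMHCpan output is an assumption left to the caller
    n_pathogen_like: int  # number of pathogen-derived peptides above the pathogen threshold
    n_self_like: int      # number of self-peptides above the self threshold


def immunogenicity_metric(f: PeptideFeatures) -> float:
    """Product of the binding term and the two similarity counts."""
    return f.binding * f.n_pathogen_like * f.n_self_like


def classify(f: PeptideFeatures, cutoff: float) -> str:
    """Binary call: 'dominant' if the metric exceeds the cutoff, else 'non-dominant'."""
    return "dominant" if immunogenicity_metric(f) > cutoff else "non-dominant"
```

Because the metric is a product, a peptide with no sufficiently similar pathogen-derived or self-peptide scores zero regardless of how well it binds, which is one way to see why binding alone does not determine the prediction.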

We systematically assembled data on HIV-1-specific CTL responses, as determined by gamma interferon (IFN-γ) enzyme-linked immunospot (ELISPOT) assay, against a panel of up to 457 peptides including previously described optimal HIV-1 epitopes as defined in the Los Alamos National Laboratory HIV epitope database (www.hiv.lanl.gov) and epitope variants (Methods). Data were available from multiple cohorts of HIV-1-infected individuals at different stages of the infection, and subsets of the data have been reported previously (Pereyra et al., 2014; Streeck et al., 2009). In total, optimal-epitope-specific CD8+ T-cell data were available from 1102 individuals, including 619 individuals during acute and early infection, and 483 individuals during chronic infection, of which 321 were considered spontaneous HIV controllers with median plasma HIV RNA levels < 2,000 copies/ml. For the majority of individuals, the peptides used for T-cell response assessment were selected based on the individual's HLA class I genotype. However, 314 individuals with chronic infection had been tested against 267 optimal epitopes, irrespective of the individual's HLA class I alleles. An average of 7 (range, 0 to 42) epitope-specific CTL responses were detected in the primary-infection cohort, while HIV-1-specific CTL responses against an average of 20 epitopes (range, 0 to 95) were detected in chronically infected individuals. For our analysis, HLA class I restricted CTL responses were considered only if the respective HLA allele was shared by at least 20 individuals in the data set. Table S1 in the supplemental material summarizes the frequencies of recognition for tested HIV-1-specific CTL epitopes in the respective cohorts.

We used the percentage of patients with a given HLA allele $h$ responding to a given HIV peptide $s$, denoted as $R(s, h)$, as the metric of immunodominance. The peptides that elicit a response in more than 25% of tested patients with a given HLA allele were labelled as dominant and the others as non-dominant. Repeated 10-fold cross-validation was performed to train and test the model (Methods).
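The labelling rule and the repeated 10-fold splitting can be sketched in a few lines of standard-library Python. The number of repeats, the random seed, and the function names are assumptions made for illustration; only the 25% cutoff and the use of 10 folds come from the text.

```python
import random
from typing import Iterator, List, Tuple


def label_dominant(response_fraction: float, cutoff: float = 0.25) -> int:
    """1 if strictly more than `cutoff` of tested patients responded, else 0."""
    return 1 if response_fraction > cutoff else 0


def repeated_kfold(n_items: int, k: int = 10, repeats: int = 3,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for `repeats` independent shuffles,
    each split into k folds. `repeats` and `seed` are illustrative choices."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
        for held_out in range(k):
            test = folds[held_out]
            train = [idx for f, fold in enumerate(folds)
                     if f != held_out for idx in fold]
            yield train, test
```

In each split the model parameters would be fitted on the training indices only and the classifier evaluated on the held-out fold, so that reported performance reflects peptides not seen during fitting.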

The performance of the $A(s, h)$-based classifier on the test sets is summarized as Receiver Operating Characteristic (ROC) curves (Figure 1A-B). For the HIV acute infection group, the classifier has an Area Under the Curve (AUC) score of approximately 0.71 for the ROC curve. For the chronic infection group, the classifier has an AUC score of approximately 0.66. The superior performance in the acute infection dataset can be explained by the fact that, as HIV infection progresses, the virus mutates to escape CTL responses, and as a result less immunodominant peptides are targeted by CTLs in the chronic phase (Streeck et al., 2009). The performance of the current model is compared to a T cell epitope immunogenicity prediction model developed by Calis et al. (2013), which is publicly available in IEDB. Our model shows superior performance as measured by the AUC (0.71 vs 0.57 for the acute group, 0.66 vs 0.34 for the chronic group; Figure 1A-B). We also evaluated the importance of each of the three terms of $A(s, h)$ (the binding term, the term representing similarity to pathogenic peptides, and the term representing similarity to human peptides) by constructing partial models with one or two terms removed from $A(s, h)$. The same training and testing procedures were repeated for these partial models. For both the acute and the chronic groups, the partial models show less predictive power than the full model as measured by the values of the AUC (Figure 1C-F). The "binding only" bar graphs correspond to the predictions from netMHCpan4.0, which is commonly used. Our full model outperforms netMHCpan4.0 as measured by the values of the AUC, a point that is also evident from the ROC curves corresponding to our model and netMHCpan4.0 (Figure 1G and 1H). For example, the values of the AUC for our full model and netMHCpan4.0 for the acute cohort are 0.71 and 0.64, respectively.
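For readers who want to reproduce this kind of comparison, the AUC of a scoring function can be computed directly from its rank interpretation: it is the probability that a randomly chosen dominant peptide receives a higher score than a randomly chosen non-dominant one, with ties counted as one half. The sketch below is a generic implementation of that definition, not the authors' evaluation code, and the function name is hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for dominant, 0 for non-dominant. Ties contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a score that carries no ranking information, and a value below 0.5 means the score ranks non-dominant peptides above dominant ones more often than not.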

Only a fraction of SARS-CoV-2 peptides that bind to HLA molecules are immunogenic

Many research groups have identified peptides derived from SARS-CoV-2 that can bind to HLA molecules (Ahmed et al., 2020; Campbell et al., 2020; Grifoni et al., 2020a; Nerli and Sgourakis, 2020; Prachar et al., 2020). Two different approaches were employed. In one approach, peptides that bind to different HLA molecules were identified based on experimental assays (Ahmed et al., 2020; Prachar et al., 2020). In the other approach, bioinformatic tools were used to identify peptides that bind to HLA molecules (Campbell et al., 2020; Grifoni et al., 2020a; Nerli and Sgourakis, 2020). We used our trained classifier to predict the immunogenicity of peptides that were determined experimentally to bind to different HLA molecules, as reported by Ahmed et al. (2020) and Prachar et al. (2020). Our classifier can also be easily applied to the peptides reported by other groups.

Ahmed et al. (2020) identified 187 SARS-CoV peptides that were suggested by HLA binding assays to bind to diverse HLA class I molecules and were identical in SARS-CoV-2. We further screened these peptides using our classifier (Methods) and found that only 74 of them are predicted to be immunogenic (Table 1 and Supplementary Table S2). These predicted peptides are associated with 33 different HLA alleles. Standard methods predict that this would enable coverage of 98.8% of the global population (i.e., 98.8% of the global population has at least one of these alleles) and 99.2% of the US population (Methods).
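The coverage calculation itself is described in the Methods, which are not reproduced here. As a rough illustration of how such a number can arise, the sketch below uses one common simplification: Hardy-Weinberg proportions within each HLA locus and independence between loci (linkage disequilibrium is ignored). The function name and any allele frequencies passed to it are hypothetical, and this is not claimed to be the procedure used in the paper.

```python
from typing import Dict, Iterable


def population_coverage(covered_freqs_by_locus: Dict[str, Iterable[float]]) -> float:
    """Fraction of individuals carrying at least one covered allele.

    covered_freqs_by_locus maps a locus name (e.g. "A", "B", "C") to the
    population allele frequencies of the covered alleles at that locus.
    Assumes Hardy-Weinberg proportions and independence between loci.
    """
    uncovered = 1.0
    for locus, freqs in covered_freqs_by_locus.items():
        total = sum(freqs)
        if total < 0.0 or total > 1.0 + 1e-9:
            raise ValueError(f"allele frequencies at locus {locus} must sum to [0, 1]")
        total = min(total, 1.0)
        # Both chromosomes must carry a non-covered allele at this locus.
        uncovered *= (1.0 - total) ** 2
    return 1.0 - uncovered
```

For example, with covered alleles summing to a frequency of 0.5 at each of two loci, the chance that an individual carries none of them is 0.25 × 0.25, giving a coverage of 0.9375. This illustrates how a moderately sized allele set spread over several loci can reach coverage above 90%.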

The same analysis was performed for the 152 SARS-CoV-2 peptides identified by Prachar et al. (2020), which were also verified by HLA binding assays to be strong binders to diverse HLA class I molecules. Our classifier predicted that 98 of them are immunogenic (Table 2 and Supplementary Table S2). They are associated with 10 different HLA alleles, which cover 94% of the global population and 93.2% of the US population. These two sets of predicted immunogenic peptides can be combined, which gives a total of 162 immunogenic peptides associated with 37 different HLA alleles (Supplementary Table S2). These HLA alleles cover 99.6% of the global population and 99.7% of the US population. On average, each HLA allele is associated with 7.3 immunogenic peptides. Recall that the immunogenic peptides predicted by our model are defined as those that elicit a response in more than 25% of the population with the associated HLA allele. With more than 7 immunogenic peptides associated with each HLA allele, it is likely that immunogenic CTL responses will be present in most people with the corresponding allele.

Currently most SARS-CoV-2 vaccines only contain the spike protein of the virus as the immunogen (Akst, 2020) . Thus, we wanted to test whether the immunogenic peptides from the spike protein alone can elicit CTL responses in a large portion of the population. Among the combined set of 162 predicted immunogenic peptides that we identified, 22 belong to the spike protein of the virus, and they are associated with 16 HLA alleles (Supplementary Table S2 ). These 16 HLA alleles cover 92.3% of the global population and 93.5% of the US population. However, on average each HLA allele is only associated with 1.8 immunogenic peptides. This relatively low number indicates that it is likely that most people with a particular allele will not mount immunogenic CTL responses. Therefore, including various viral proteins in the vaccine immunogen may be necessary in order to achieve broad coverage of CTL responses in a given population.

Testing predictions against experimental data on immunogenicity of long SARS-CoV-2 peptides

Peng et al. (2020) tested 423 overlapping 15- to 18-mer SARS-CoV-2 peptides for CD4+ and CD8+ T-cell responses using blood samples from 42 recently recovered COVID-19 patients. IFN-γ ELISPOT assays and intracellular cytokine staining were used to test for responses. Confirmed CD8+ T-cell responses were detected against 7 peptides in a number of patients. We calculated our CTL immunogenicity metric, $A(s, h)$, for all 9- to 11-mers contained in these long peptides in persons with the HLA types of the positively tested patients (Table 3). If the value of $A(s, h)$ for at least one 9- to 11-mer contained in a positively tested 15- to 18-mer exceeded the threshold for immunodominance in our model, we predicted a positive CTL response. Encouragingly, we correctly predicted 5 out of the 7 positively tested peptides to be immunogenic. Our predictions were incorrect for two long peptides that registered positive responses in one patient each. The small number of confirmed patients tested for each long peptide precluded meaningful calculation of statistical significance.
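The rule for long peptides amounts to scanning every 9- to 11-mer window and asking whether the best window clears the immunodominance threshold. A minimal sketch follows; the `metric` callable stands for the immunogenicity metric evaluated for a fixed HLA allele, and both it and the function names are placeholders rather than the authors' code.

```python
from typing import Callable, Iterator


def windows(long_peptide: str, min_len: int = 9, max_len: int = 11) -> Iterator[str]:
    """All contiguous sub-peptides of length min_len..max_len."""
    for k in range(min_len, max_len + 1):
        for start in range(len(long_peptide) - k + 1):
            yield long_peptide[start:start + k]


def predict_long_peptide(long_peptide: str,
                         metric: Callable[[str], float],
                         cutoff: float) -> bool:
    """Positive if the largest metric value over all windows exceeds the cutoff."""
    best = max((metric(w) for w in windows(long_peptide)), default=float("-inf"))
    return best > cutoff
```

A 15-mer contains 7 + 6 + 5 = 18 such windows and an 18-mer contains 10 + 9 + 8 = 27, so each long peptide is scored many times per HLA allele before the maximum is taken.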

In order to determine if individuals with COVID-19 indeed mount CTL responses against optimal epitopes that our model predicted to be immunodominant, we collected PBMCs from 28 individuals with SARS-CoV-2 PCR-confirmed infection. We selected a total of 108 individual 9-mer SARS-CoV-2 peptides for the HLA alleles that were most frequently expressed in this cohort, namely HLA A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, B*08:01 and C*04:01. The peptides were selected based on their predicted immunogenicity hierarchy, i.e., according to the values (amplitude) of our immunogenicity metric, $A(s, h)$. Figure 2A summarizes the mean prediction amplitude of all peptides across all tested alleles, with peptide 1 being the peptide with the highest prediction amplitude for each allele. Peptides with very low predicted immunogenicity values were excluded from this analysis. T-cell responses to these peptides were quantified by IFN-γ ELISPOT assay following PBMC expansion with anti-CD3 antibody in IL-2-containing media. Overall, we detected responses for each HLA allele, except for C*04:01 (only four peptides of this allele were tested). While some peptides were not targeted at all, several induced responses in all patients with the corresponding HLA allele (Figure 2B). As expected, no responses were seen against predicted HLA-restricted peptides in patients without the corresponding allele (Supplementary Fig. S1). Mean IFN-γ+ T-cell response magnitude per peptide across individuals with identical HLA ranged from 33 to 987 SFU/10^6 PBMCs (Figure 2C). Overall, the breadth of CTL responses differed by HLA, with A*01:01-restricted epitopes being more frequently targeted than peptides restricted by other alleles (Figure 2D). Interestingly, one individual who expressed 4 of the tested alleles showed a CTL response to at least 1 of the predicted peptides for each allele. Orf1ab was the most frequently targeted viral protein, followed by N and S (Figure 2E).
One individual (with allele C*04:01) did not have any CTL responses against the respective peptides but had detectable CTL responses against overlapping peptide (OLP) pools for N and S, while two other individuals had no detectable SARS-CoV-2-specific responses at all. On the other hand, 5 of the 7 patients who did not respond to the N OLP pool, as well as 10 of the 14 patients who did not respond to the S OLP pool, showed IFN-γ+ T-cell responses to one or more of our predicted peptides. This suggests an advantage of testing optimal epitopes predicted by our algorithm, rather than pools of overlapping peptides, in the detection of antigen-specific CTL responses in COVID-19 patients.

To further analyze these data, for each peptide, we compared the fraction of patients that positively responded to the peptide to its metric of immunogenicity, $A(s, h)$, predicted by our model. For the 108 peptides tested, we found a Pearson correlation value of 0.43 between the predicted value of $A(s, h)$ and the fraction of positively responding patients (Figure 3A). We then grouped peptides according to their restricting HLA types, and compared the fraction of positively responding patients for each HLA type to the mean value of $A(s, h)$ predicted by our model for the grouped peptides. This comparison was characterized by a Pearson correlation value of 0.82 (Figure 3B). Interestingly, peptides tested for the HLA type A*01:01 were predicted to be most immunodominant, with a large mean value of $A(s, h)$ (4.9 × 10^5), and patients with this HLA type had the highest fraction of positively responding individuals. This result is consistent with the data showing that peptides restricted by HLA A*01:01 were most frequently targeted (Figure 2D). Taken together, these data show a reasonable positive correlation between model predictions and patient data. Moreover, using the model we were able to rapidly identify several highly targeted optimal peptides across several alleles that induced responses in more than 50% of tested patients (Figure 2F), further validating our approach.
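Both comparisons reduce to a Pearson correlation, first over individual peptides and then over per-allele means. The sketch below shows the two steps with standard-library Python; the function names are illustrative and no patient data are included.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def mean_by_group(values: Sequence[float],
                  groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Mean of `values` within each group label (e.g. restricting HLA allele)."""
    buckets = defaultdict(list)
    for v, g in zip(values, groups):
        buckets[g].append(v)
    return {g: sum(vs) / len(vs) for g, vs in buckets.items()}
```

Averaging within alleles before correlating removes peptide-to-peptide scatter, which is consistent with the grouped correlation (0.82) being higher than the per-peptide one (0.43). With only eight alleles tested, the grouped value rests on few points.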

There is significant overlap between immunogenic CTL epitopes in SARS-CoV-2 and less pathogenic human coronaviruses

Unlike SARS-CoV-2, which causes severe respiratory disease, other less pathogenic coronaviruses circulating in the human population usually only cause mild diseases (like the common cold). Four common human coronaviruses (HCoV), HCoV-229E (NC_002645.1), NL63 (NC_005831.2), OC43 (NC_006213.1) and HKU1 (NC_006577.2), are responsible for 10-30% of upper respiratory tract infections in adults (Paules et al., 2020). Given that memory T cell responses are likely induced in at least a fraction of the human population infected by these coronaviruses, we wanted to explore whether such memory responses could theoretically be induced or expanded following infection with SARS-CoV-2. Thus, we employed our classifier to identify common immunogenic peptide epitopes between HCoV and SARS-CoV-2. We first gathered a set of 38 HLA class I alleles that represent more than 99% of the world population. We then applied our classifier to all possible overlapping 8- to 11-mers in the proteome of SARS-CoV-2 and determined that there are 2311 potentially immunogenic CTL epitopes associated with those 38 HLA alleles. We then further determined the unique set of immunogenic peptides that were common between SARS-CoV-2 and the four common HCoV. We found 46 shared immunogenic peptides, which are associated with 31 HLA alleles (Supplementary Table S3). These HLA alleles cover 98.6% of the global population and 99.0% of the US population. On average, each of these alleles is associated with 5.6 immunogenic peptides. Given this level of overlap in immunogenic epitopes between HCoVs and SARS-CoV-2, one can hypothesize that CTL memory responses elicited by past infection with common coronaviruses could respond to SARS-CoV-2 infection.
We demonstrate this directly by an ex vivo assessment of pre-pandemic PBMCs using one of our SARS-CoV-2 peptides predicted to be both immunogenic and conserved among coronaviruses in a peptide-MHC tetramer (HLA-A*02:01/Orf1ab 4725-4733). We find that SARS-CoV-2-reactive CD8+ T cells can indeed be detected in unexposed individuals (mean frequency of ~8.3 × 10^-6, n=8) (Figure 4A-B, Supplementary Fig. S2). The frequency of SARS-CoV-2 (Orf1ab 4725-4733)-specific T cells in unexposed individuals, however, was significantly lower than that of CD8+ T cells reactive to influenza virus (HLA-A*02:01/M1 58-66; mean frequency of ~4.2 × 10^-4, n=8) (Figure 4A-B). Phenotypic characterization using CCR7 and CD45RA staining indicates that these SARS-CoV-2 (Orf1ab 4725-4733)-specific T cells predominantly display a memory phenotype (Figure 4C), further suggesting that they had been exposed previously to other HCoVs. We also accounted for the fact that each individual may have been infected by only a subset of the four less pathogenic coronaviruses. So, to determine a lower bound, we determined the immunogenic epitopes that are shared between SARS-CoV-2 and each of the four less pathogenic HCoV. The results are presented in Table 4. On average, 19 epitopes are common with any one of the less pathogenic human coronaviruses. The results shown in Table 4 and the experimental data described above lead us to believe that some fraction of the human population has memory T cell responses that may target immunogenic SARS-CoV-2 CTL epitopes and provide some measure of protection. Indeed, several studies have shown that CD4+ and also CD8+ T-cell responses against SARS-CoV-2 peptides were detectable in pre-pandemic blood donors using OLP mega-pools (Le Bert et al., 2020; Grifoni et al., 2020b).
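Finding peptides shared between proteomes is an exact-match problem: enumerate every 8- to 11-mer of each proteome and intersect the resulting sets with the peptides predicted to be immunogenic. The following sketch shows that step on its own, with hypothetical function names; protein sequences would have to be loaded separately (for example from the RefSeq accessions listed above), and the immunogenicity filter is assumed to have been applied already.

```python
from typing import Dict, Iterable, Set


def kmers(protein: str, min_len: int = 8, max_len: int = 11) -> Set[str]:
    """All contiguous peptides of length min_len..max_len in one protein."""
    return {protein[i:i + k]
            for k in range(min_len, max_len + 1)
            for i in range(len(protein) - k + 1)}


def proteome_kmers(proteins: Iterable[str]) -> Set[str]:
    """Union of k-mers over all proteins of a proteome."""
    result: Set[str] = set()
    for protein in proteins:
        result |= kmers(protein)
    return result


def shared_per_virus(immunogenic: Set[str],
                     other_proteomes: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """For each other virus, the predicted-immunogenic peptides that occur
    verbatim in its proteome."""
    return {name: immunogenic & proteome_kmers(proteins)
            for name, proteins in other_proteomes.items()}
```

Taking the union of the per-virus sets gives the peptides shared with at least one of the four HCoVs, while the individual set sizes correspond to the per-virus comparison used for the lower bound.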

In this work we developed a physics-based learning algorithm that aims to predict the CTL immunogenicity of peptides in humans. By physics-based learning algorithm, we mean that machine learning is performed to determine the parameters in a model that is rooted in the underlying biophysics of T cell responses to antigen, rather than a generic classifier. The model applies to diverse pathogens and requires information only on the HLA alleles in an individual. A significant number of bioinformatics tools exist for modeling peptide binding to MHC molecules, such as the NetMHC algorithm 4.0 (doi: 10.1093/bioinformatics/btv639) and are available within the IEDB epitope analysis toolbox (IEDB). A number of efforts have aimed to develop bioinformatic tools to characterize the sequences of TCRs in human repertoires, and to follow how particular clones evolve in response to viral infections, thus aiming to characterize how specific TCRs react to particular infections (Minervina et al., 2020; Murugan et al., 2012) .

By analyzing sequences of TCRs, Glanville et al. and Dash et al. (Dash et al., 2017; Glanville et al., 2017; Pogorelyy et al., 2019) developed sequence-similarity-based clustering algorithms that cluster TCRs with shared sequence motifs that are likely to exhibit similar epitope specificity. However, estimating the immunogenicity of particular epitopes in humans with a particular HLA type has been challenging. Calis et al. proposed a model to predict the immunogenicity of new peptide-MHCs (pMHC) using large sets of immunogenic and non-immunogenic pMHC data (Calis et al., 2013). Here, we chose a different approach and developed our model based on biophysical considerations. Because short peptides do not carry information about the genome of origin, our model should be applicable to peptides derived from diverse pathogens. Our model was then trained and validated against a large data set of experimentally quantified T-cell responses in HIV-infected individuals. The model results in improved performance in predicting immunogenicity compared to publicly available models, such as netMHCpan4.0 and that of Calis et al. (2013).

Many groups have identified SARS-CoV-2 peptides that are able to bind to HLA molecules (Ahmed et al., 2020; Campbell et al., 2020; Grifoni et al., 2020a; Nerli and Sgourakis, 2020; Prachar et al., 2020), using either experimental assays or bioinformatic tools. We screened these peptides for immunogenicity using our algorithm. Specifically, we studied the peptides suggested by Ahmed et al. (Ahmed et al., 2020) and Prachar et al. (Prachar et al., 2020), but our model can be applied to filter peptides suggested by other groups. Our results suggest that only a fraction of the peptides that bind to HLA molecules are likely to be immunogenic.

We tested the predictive capability of our model against two sets of experimental data: one set was reported by Peng et al. (Peng et al., 2020), and the other was obtained by us using samples from patients infected with SARS-CoV-2. Despite the lack of a large patient cohort, we found satisfactory correlations between model predictions and experimental data for immunogenicity in patients. Taken together, the tests against data from patients infected with HIV and SARS-CoV-2 suggest that our immunogenicity prediction method is reasonably accurate and will add to the existing models that are currently available.
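As a minimal sketch of how such a comparison can be scored, the snippet below applies the rule used for the Peng et al. data: a peptide counts as predicted immunogenic in a patient when the largest predicted amplitude over that patient's epitope-HLA combinations exceeds the immunogenicity threshold, and the true positive rate is the fraction of positively tested patients for which this holds. The amplitude table, the threshold value, the peptide label, and the function names are hypothetical placeholders chosen for illustration; they are not outputs of our model.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical predicted amplitudes keyed by (peptide, HLA allele).
# In practice these values would come from the immunogenicity model.
Amplitudes = Dict[Tuple[str, str], float]


def max_amplitude(peptide: str, hla_alleles: List[str], amplitudes: Amplitudes) -> float:
    """Largest predicted amplitude of a peptide over one patient's HLA alleles."""
    return max((amplitudes.get((peptide, hla), 0.0) for hla in hla_alleles), default=0.0)


def true_positive_rate(
    peptide: str,
    responders: List[List[str]],
    amplitudes: Amplitudes,
    threshold: float,
) -> Optional[float]:
    """Fraction of positively tested patients in whom the peptide is predicted immunogenic.

    `responders` holds one list of HLA alleles per patient who tested positive.
    Returns None when there are no positively tested patients.
    """
    if not responders:
        return None
    hits = sum(
        1 for alleles in responders
        if max_amplitude(peptide, alleles, amplitudes) > threshold
    )
    return hits / len(responders)


# Toy data (assumed values, for illustration only).
toy_amplitudes = {
    ("PEPTIDE_1", "HLA-A*01:01"): 0.8,
    ("PEPTIDE_1", "HLA-A*24:02"): 0.1,
}
toy_responders = [["HLA-A*01:01", "HLA-B*07:02"], ["HLA-A*24:02"]]
tpr = true_positive_rate("PEPTIDE_1", toy_responders, toy_amplitudes, threshold=0.5)
```

With these toy numbers, the first patient carries an allele whose predicted amplitude exceeds the threshold and the second does not, so one of the two positive responses is recovered.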

The combined set of SARS-CoV-2 peptides that we predict to be immunogenic among known HLA binders provides broad coverage of the global population. We note that CTL escape mutations in SARS-CoV-2 have thus far been uncommon (Ahmed et al., 2020). Therefore, determining the mutational vulnerabilities of the virus in order to focus CTL responses on special epitopes, as has been done for HIV (Abdul-Jawad et al., 2016; Ahmed et al., 2019; Barton et al., 2016; Dahirel et al., 2011; Ferguson et al., 2013; Gaiha et al., 2019; Hayton et al., 2014; Létourneau et al., 2007; Louie et al., 2018; Mann et al., 2014; Shekhar et al., 2013), is likely not necessary. Whole-proteome immunogens should suffice in a vaccine that aims to elicit potent CTL responses that provide broad population coverage.

Since most SARS-CoV-2 vaccines, under EUA or in development, use only the spike protein as the immunogen, we also analyzed whether peptides from the spike alone can yield broad CTL coverage over the global population. Based on our analysis, the immunogenic spike peptides alone are unlikely to provide such broad coverage from the standpoint of CTL responses. Therefore, to obtain broad CTL coverage, an immunogen that includes other SARS-CoV-2 proteins (e.g., ORF, N, etc.) might be necessary. This is potentially significant if antibody responses to SARS-CoV-2 prove not to be durable, as reported for SARS-CoV.
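To make the notion of population coverage concrete, the following sketch shows one common way to estimate it from HLA allele frequencies. It assumes Hardy-Weinberg proportions within each HLA locus and independence between loci, which ignores linkage disequilibrium. It is not necessarily the exact procedure behind the coverage numbers reported here, and the allele frequencies are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def population_coverage(allele_freqs: Dict[str, float], covered_alleles: Iterable[str]) -> float:
    """Estimated fraction of individuals carrying at least one covered HLA allele.

    Assumptions (not taken from the source): Hardy-Weinberg proportions within
    each locus and statistical independence between loci.
    """
    covered_per_locus = defaultdict(float)
    for allele in set(covered_alleles):
        locus = allele.split("*")[0]  # e.g. "HLA-A" from "HLA-A*02:01"
        covered_per_locus[locus] += allele_freqs[allele]

    uncovered = 1.0
    for total in covered_per_locus.values():
        if total > 1.0:
            raise ValueError("allele frequencies at a locus sum to more than 1")
        # Probability that neither of the two chromosomes carries a covered allele.
        uncovered *= (1.0 - total) ** 2
    return 1.0 - uncovered


# Invented frequencies, for illustration only.
toy_freqs = {"HLA-A*02:01": 0.25, "HLA-A*01:01": 0.15, "HLA-B*07:02": 0.10}

# Alleles restricting predicted immunogenic peptides from the whole proteome
# versus from a single protein (both sets are hypothetical).
whole_proteome = population_coverage(toy_freqs, ["HLA-A*02:01", "HLA-A*01:01", "HLA-B*07:02"])
single_protein = population_coverage(toy_freqs, ["HLA-A*02:01"])
```

In this toy setting, restricting the immunogen to peptides presented by a single allele lowers the estimated coverage, which is the qualitative effect discussed above for spike-only immunogens.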

With regard to common human coronaviruses, which have likely infected substantially more individuals than SARS-CoV-2 despite the current pandemic, our model predicts that there is overlap between the immunogenic CTL epitopes of these viruses. Indeed, several groups have now described cross-reactive T cell responses in prepandemic donors (Le Bert et al., 2020; Grifoni et al., 2020b), and we also found effector-memory CTL responses to predicted cross-reactive epitopes in blood samples collected from individuals prior to 2019. Here, we were able to identify single optimal peptides, providing more granularity in the cross-reactive T cell specificity. Furthermore, the cross-reactive CTLs that we found exhibited a memory phenotype. This suggests that memory CTLs directed against less pathogenic coronaviruses could target immunogenic SARS-CoV-2 epitopes upon infection. Clinical outcomes and the course of disease during SARS-CoV-2 infection are extremely heterogeneous, ranging from asymptomatic disease to death (Fu et al., 2020). Whether pre-existing HCoV-specific memory T cells actually play a disease-modifying or even protective role needs to be determined, but our model now provides the most likely immunodominant epitope-specific CTL responses to focus on.
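The search for epitopes shared with common coronaviruses can be illustrated with a short sketch that looks for predicted immunogenic peptides occurring verbatim in the proteins of another virus. Exact string matching is an assumption made here for simplicity, and the sequences, virus labels, and function name are hypothetical toy inputs rather than real coronavirus data.

```python
from typing import Dict, List


def shared_peptides(candidates: List[str], proteomes: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each virus, list the candidate peptides found verbatim in any of its proteins."""
    return {
        virus: [pep for pep in candidates if any(pep in protein for protein in proteins)]
        for virus, proteins in proteomes.items()
    }


# Toy inputs (invented sequences, for illustration only).
toy_candidates = ["LLVAGGSRD", "GHIKLMNPQ", "YYYYYYYYY"]
toy_proteomes = {
    "virus_A": ["MKTLLVAGGSRDNQWEYFHIP"],
    "virus_B": ["ACDEFGHIKLMNPQRSTVWY", "MKTLLVAGGSRDNQWEYFHIP"],
}

shared = shared_peptides(toy_candidates, toy_proteomes)
counts = {virus: len(peptides) for virus, peptides in shared.items()}
```

The per-virus counts correspond to the kind of summary given in the first column of Table 4; the HLA alleles associated with the shared peptides would then feed into a population coverage estimate.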

Although validated against HIV data and also tested against SARS-CoV-2 CTL response data, our model has not yet been tested experimentally in larger COVID-19 patient cohorts or for other viral infections. It is important to further validate, and potentially elaborate, the model by testing it against experimental data for diverse viruses. More data will also help improve the model. Currently the model contains two parameters, which are the cutoff thresholds for similarity to self peptides and to pathogenic peptides, respectively. These two parameters are used for all HLA alleles. However, it is known that peptides bound to different alleles can use different peptide residues to make primary contacts with the TCR. A model with allele-specific similarity cutoff thresholds might therefore further improve the performance. This will require training our model against more extensive datasets. Since short peptides derived from the proteome do not carry long-range information about the pathogen, our model, if further validated and elaborated, will be a valuable and simple tool for rapid identification of immunogenic CTL epitopes contained in diverse pathogens. The availability of such a tool will aid many applications pertinent to infectious diseases, including new pandemic-causing pathogens that will undoubtedly emerge in the future.
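One way the proposed allele-specific extension could be organized is sketched below: a pair of global cutoff thresholds serves as the default, and individual HLA alleles override it once enough training data are available for them. The class names, field names, and numerical values are hypothetical, and the sketch does not show how the thresholds enter the model itself.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Cutoffs:
    """Similarity cutoff thresholds (values used below are placeholders)."""
    self_similarity: float      # cutoff for similarity to self peptides
    pathogen_similarity: float  # cutoff for similarity to pathogenic peptides


@dataclass
class CutoffTable:
    """Global default cutoffs with optional allele-specific overrides."""
    default: Cutoffs
    per_allele: Dict[str, Cutoffs] = field(default_factory=dict)

    def for_allele(self, hla: str) -> Cutoffs:
        # Fall back to the global cutoffs when no allele-specific fit exists.
        return self.per_allele.get(hla, self.default)


# Placeholder numbers, for illustration only.
table = CutoffTable(
    default=Cutoffs(self_similarity=0.5, pathogen_similarity=0.5),
    per_allele={"HLA-A*24:02": Cutoffs(self_similarity=0.4, pathogen_similarity=0.6)},
)

cutoffs_a24 = table.for_allele("HLA-A*24:02")  # allele-specific override
cutoffs_a01 = table.for_allele("HLA-A*01:01")  # falls back to the default
```

Keeping the global pair as a fallback means that alleles with sparse data continue to use the current two-parameter model.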

Variable accuracy in predicting immunogenic peptides was observed across alleles; for example, predictions for the HLA-A*01:01 allele were more robust than for other alleles, such as HLA-A*24:02, where the measured peptides did not seem to match the predicted hierarchy. We can only speculate that this reflects the variable amount of data per HLA allele in the HIV data set that was used to build and train the model. Training the model against more extensive datasets and adjusting the cutoff thresholds per HLA allele will likely help to improve the prediction accuracy. Furthermore, for this proof-of-principle analysis we did not experimentally explore CTL responses against SARS-CoV-2 peptides with very low prediction scores to confirm that these peptides were indeed not targeted in vivo. While we predicted cross-reactive peptide responses with HCoV, these were only detected using tetramers at very low levels in unexposed individuals. In order to reliably establish the existence and phenotype of these predicted responses, a larger set of HLA/peptide tetramers needs to be evaluated, potentially utilizing tetramer enrichment strategies.

Lead contact: Information and requests for resources should be directed to, and will be fulfilled by, the Lead Contact, Boris Julg (bjulg@mgh.harvard.edu).

This study did not generate new unique reagents.

All relevant data are available from the Lead Contact upon reasonable request. The code for the computational tool is publicly available at https://github.com/andy90/immunogenicity_predictor.

the Ragon Institute of MGH, MIT, & Harvard. This project has been funded in part with federal funds from the Frederick National Laboratory for Cancer Research, under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. This research was supported in part by the Intramural Research Program of the NIH, Frederick National Lab, Center for Cancer Research.

Project conceptualization and planning were performed by A.G., A.K.C., and B.J.; HIV and SARS-CoV-2 ELISpot data were generated by Z.C., E.S., F.P.S., and H.S.; HLA data were generated by M.C.; model development and validation were done by A.G., A.A., J.D., and A.K.C.; V.M. generated the tetramer data under the supervision of M.M.D.; and the manuscript was written by A.G., Z.C., A.K.C., V.M., M.M.D., and B.J.

The authors declare no competing interests.

Table 4. Shared immunogenic peptides between SARS-CoV-2 and four common low-pathogenicity human coronaviruses.

The first column shows the number of shared immunogenic peptides between SARS-CoV-2 and each of the four viruses. The second column shows the number of HLA alleles associated with those peptides. The third and fourth columns show the population coverage of those HLAs for the US and the world, respectively. The fifth column shows the average number of immunogenic peptides associated with each HLA. The last row of the table shows the average of all these quantities.

Table 4. The list of predicted immunogenic peptides that are shared between SARS-CoV-2 and four common coronaviruses.

Increased valency of conserved-mosaic vaccines enhances the breadth and depth of Epitope recognition

Sub-dominant principal components inform new vaccine targets for HIV Gag

Preliminary Identification of Potential Vaccine Targets for the COVID-19 Coronavirus (SARS-CoV-2) Based on SARS-CoV Immunological Studies

COVID-19 Vaccine Frontrunners

Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable

SARS-CoV-2-specific T cell immunity in cases of COVID-19 and SARS, and uninfected controls

SARS-CoV-2-reactive T cells in healthy donors and patients with COVID-19

Predicting population coverage of T-cell epitope-based diagnostics and vaccines

Quorum sensing allows T cells to discriminate between self and nonself

Properties of MHC Class I Presented Peptides That Enhance Immunogenicity

Prediction of SARS-CoV-2 epitopes across 9360 HLA class I alleles

Virus-Specific Memory CD8 T Cells Provide Substantial Protection from Lethal Severe Acute Respiratory Syndrome Coronavirus Infection

Coordinate linkage of HIV evolution reveals regions of immunological vulnerability

Thymic selection threshold defined by compartmentalization of Ras/MAPK signalling

Quantifiable predictive features define epitope-specific T cell receptor repertoires

Characterization of SARS-CoV-specific memory T cells from recovered individuals 4 years after infection

Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design

Clinical characteristics of coronavirus disease 2019 (COVID-19) in China: a systematic review and meta-analysis

Structural topology defines protective CD8+ T cell epitopes in the HIV proteome

Identifying specificity groups in the T cell receptor repertoire

A Sequence Homology and Bioinformatic Approach Can Predict Candidate Targets for Immune Responses to SARS-CoV-2

Targets of T Cell Responses to SARS-CoV-2 Coronavirus in Humans with COVID-19 Disease and Unexposed Individuals

Immunobiology of the human MHC: 13th International Histocompatibility Workshop and Congress

Safety and tolerability of conserved region vaccines vectored by plasmid DNA, simian adenovirus and modified vaccinia virus Ankara administered to human immunodeficiency virus type 1-uninfected adults in a randomized, single-blind phase I trial

How the thymus designs antigen-specific and self-tolerant T cell receptor sequences

Thymic selection of T cell receptors as an extreme value problem

Effects of thymic selection of the T cell repertoire on HLA-class I associated control of HIV infection

Design and pre-clinical evaluation of a universal HIV-1 vaccine

Two-Year Prospective Study of the Humoral Immune Response of Patients with Severe Acute Respiratory Syndrome

Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies

A neoantigen fitness model predicts tumour response to checkpoint blockade immunotherapy

T Cell-Positive Selection Uses Self-Ligand Binding Strength to Optimize Repertoire Recognition of Foreign Antigens

The Fitness Landscape of HIV-1 Gag: Advanced Modeling Approaches and Validation of Model Predictions by In Vitro Testing

Primary and secondary anti-viral response captured by the dynamics and phenotype of individual T cell clones

Longitudinal profile of antibodies against SARS-coronavirus in SARS patients and their clinical significance

Statistical inference of the generation probability of T-cell receptors from sequence repertoires

Structure-based modeling of SARS-CoV-2 peptide/HLA-A02 antigens

Simultaneous detection of many T-cell specificities using combinatorial tetramer staining

Coronavirus Infections-More Than Just the Common Cold

Broad and strong memory CD4+ and CD8+ T cells induced by SARS-CoV-2 in UK convalescent COVID-19 patients

HIV Control Is Mediated in Part by CD8+ T-Cell Targeting of Specific Epitopes

Detecting T cell receptors involved in immune responses from single repertoire snapshots

COVID-19 Vaccine Candidates: Prediction and Validation of 174 SARS-CoV-2 Epitopes

Antigen-Specific Adaptive Immunity to SARS-CoV-2 in Acute COVID-19 and Associations with Age and Disease Severity

Spin models inferred from patient-derived viral sequence data faithfully describe HIV fitness landscapes

Hydrophobic CDR3 residues promote the development of self-reactive T cells

Human Immunodeficiency Virus Type 1-Specific CD8+ T-Cell Responses during Primary Infection Are Major Determinants of the Viral Set Point and Loss of CD4+ T Cells

Lack of Peripheral Memory B Cell Responses in Recovered Patients with Severe Acute Respiratory Syndrome: A Six-Year Follow-Up Study

Design and use of conditional MHC class I ligands

NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data

The Immune Epitope Database (IEDB): 2018 update

Confronting Complexity: Real-World Immunodominance in Antiviral CD8+ T Cell Responses

T Cell Responses Are Required for Protection from

This work was supported by the National Institutes of Health (AI138790 to B.J. and AI057229 to M.M.D.), NSF grant # PHY-2026995 to A.G. and A.K.C. A.G., B.J. and A.K.C. were supported by

al, 2020). The first column shows the tested peptide sequences and the second column shows the patient index, where patients are counted from the top to the bottom row for each peptide in Table 2 of Peng et al. Columns 3 to 5 show the epitope-HLA combination from the positively tested patients that results in the largest predicted amplitude. A peptide is predicted to be immunogenic if this maximum amplitude is larger than the threshold for immunogenicity in our model. The last column shows the true positive rate of the predictions for each peptide. The total number of