key: cord-0868665-myrnj26v authors: Hamelin, David J.; Fournelle, Dominique; Grenier, Jean-Christophe; Schockaert, Jana; Kovalchik, Kevin; Kubiniok, Peter; Mostefai, Fatima; Duquette, Jérôme D.; Saab, Frederic; Sirois, Isabelle; Smith, Martin A.; Pattijn, Sofie; Soudeyns, Hugo; Decaluwe, Hélène; Hussin, Julie; Caron, Etienne title: The mutational landscape of SARS-CoV-2 variants diversifies T cell targets in an HLA supertype-dependent manner date: 2021-06-03 journal: bioRxiv DOI: 10.1101/2021.06.03.446959 sha: ba9a44ec24c503674085c6d9a90650f0eae1e54a doc_id: 868665 cord_uid: myrnj26v The rapid, global dispersion of SARS-CoV-2 since its initial identification in December 2019 has led to the emergence of a diverse range of variants. The initial concerns regarding the virus were quickly compounded with concerns relating to the impact of its mutated forms on viral infectivity, pathogenicity and immunogenicity. To address the latter, we seek to understand how the mutational landscape of SARS-CoV-2 has shaped HLA-restricted T cell immunity at the population level during the first year of the pandemic, before mass vaccination. We analyzed a total of 330,246 high quality SARS-CoV-2 genome assemblies sampled across 143 countries and all major continents. Strikingly, we found that specific mutational patterns in SARS-CoV-2 diversify T cell epitopes in an HLA supertype-dependent manner. In fact, we observed that proline residues are preferentially removed from the proteome of prevalent mutants, leading to a predicted global loss of SARS-CoV-2 T cell epitopes in individuals expressing HLA-B alleles of the B7 supertype family. In addition, we show that this predicted global loss of epitopes is largely driven by a dominant C-to-U mutation type at the RNA level. These results indicate that B7 supertype-associated epitopes, including the most immunodominant ones, were more likely to escape CD8+ T cell immunosurveillance during the first year of the pandemic. 
Together, our study lays the foundation to help understand how SARS-CoV-2 mutants shape the repertoire of T cell targets and T cell immunity across human populations. The proposed theoretical framework has implications in viral evolution, disease severity, vaccine resistance and herd immunity.

INTRODUCTION

[...] (Figure S1). We found a total of 13,780 mutations identified in at least 4 SARS-CoV-2 genomes/individuals from GISAID, including 1,721 unique amino acid mutations in the S protein, with D614G as the most frequent one (94%) (Korber et al., 2020) (Table S1 and Table S2). Next, we implemented a bioinformatics pipeline to assess the impact of these mutations on HLA binding for 620 unique SARS-CoV-2 HLA class I epitopes that were recently reported to trigger a CD8+ T cell response in acute or convalescent COVID-19 patients (Quadeer et al., 2020; Tarke et al., 2020) (see Methods). On average, we found that the predicted binding affinity of 181 of these SARS-CoV-2 epitopes (30%) for common HLA-I alleles was reduced by ~100-fold (Table S3 and Figure S1). It is also apparent that mutations negatively impacted the HLA binding affinity of 56 (31%) and 19 (10%) CD8+ T cell epitopes located in the immunodominant S and N proteins, respectively (Figure 1A,B). Notably, a gap in the N protein, composed of a serine-rich region, is associated with higher mutation rate and a marked lack of predicted T cell epitopes and response (Figure 1B). Epitopes located in the RBD vaccine locus were also impacted by mutations (Figure 1C).

Loss of epitope binding for commonly expressed HLA class I molecules was validated in vitro for a subset of representative SARS-CoV-2 epitopes (Figure S2).
Of relevance, we found that the common D614G mutation in the S protein is linked to a 15-fold decrease in the binding affinity for the mutated HLA-A*02:01 epitope YQGVNCTEV when compared to the reference/unmutated epitope YQDVNCTEV (Figure S2A,B). Interestingly, our analysis also identified a mutation in the HLA-B*07:02-restricted N105 epitope SPRWYFYYL, which is one of the most immunodominant SARS-CoV-2 epitopes (Ferretti et [...] the presence of two previously reported CD8+ T cell mutated epitopes (i.e. GLMWLSYFI → GFMWLSYFI, found in 38 genomes; and MEVTPSGTWL → MKVTPSGTWL, found in 23 genomes), which were shown to lose binding to HLA-A*02:01 and -B*40:01, respectively, in addition to disrupting epitope-specific CD8+ T cell responses in COVID-19 patients (Figure S3) (Agerer et al., 2021). Together, these results demonstrate that mutations driving the global genomic diversity of SARS-CoV-2 can drastically disrupt HLA binding of clinically relevant CD8+ T cell epitopes, including epitopes encoded by the immunodominant S and N antigens, therefore affecting epitope-specific T cell responses in [...]

In addition to mutations leading to a loss of HLA epitope binding, we identified a significant number of mutations predicted to enhance the presentation of peptides by their respective HLA molecules, leading to a 'Gain' of binding (Figure S4). Because the unmutated epitopes are predicted to be non-HLA binders, these mutations were not searched against the list of known validated epitopes, which consists of strong HLA-binding reference epitopes. Whether SARS-CoV-2 mutations predicted to increase HLA epitope binding can enhance T cell responses to control the virus in COVID-19 patients remains to be determined experimentally.
While analysing the impact of the mutational landscape of SARS-CoV-2 on validated CD8+ T cell epitopes, we observed that specific mutation types were over-represented while others were under-represented (Figure S2C,D). For instance, we found that 31% of the mutated epitopes involved the removal of a proline residue (Figure S2C,D), leading to the hypothesis that such biases could originate from biases in the proteome of SARS-CoV-2 mutants. To further investigate whether specific amino acid mutational biases could be observed globally in the proteome of SARS-CoV-2 mutants, we asked whether certain amino acid residues were preferentially removed from, or introduced into, the global proteomic diversity of SARS-CoV-2, thereby potentially diversifying CD8+ T cell epitopes in a systematic manner.

To test this, we computed all residue substitutions (amino acid removed and introduced) found in SARS-CoV-2 proteomes and calculated Global Residue Substitution Output (GRSO) values, i.e. the % difference in overall amino acid composition for individual amino acids (see Methods for details). GRSO values were computed for mutations found at various frequencies in GISAID (i.e. found in only 1 genome, 2 to 100 genomes, 100 to 1000 genomes and > 1000 genomes) (Figure 2). Interestingly, distinct mutational patterns at the amino acid level were observed amongst mutations detected in more than 100 genomes/individuals (Figure 2), referred to in this study as 'prevalent mutations' (see Methods and Table S2). Amongst those mutations, the amino acids alanine (A), proline (P) and threonine (T) were preferentially removed by 10.2% (p = 1.2x10^-13), 9.1% (p = 1.6x10^-15), and 10.5% (p = 1.3x10^-14), respectively.
In contrast, phenylalanine (F), isoleucine (I), leucine (L) and tyrosine (Y) were preferentially introduced by 13.4% (p = 2.0x10^-17), 15.2% (p = 2.4x10^-17), 4.3% (p = 6.3x10^-11) and 5.0% (p = 7.0x10[...] mutations that were detected in 2 to 100 individuals appeared significantly more neutral, with none of the mutational patterns enriched above the selected cut-off values (fold change > 4; p-value < 1x10^-11). Thus, our results show that specific amino acid residues were preferentially removed or introduced in the proteome of SARS-CoV-2 mainly by prevalent mutations. Therefore, we introduce the notion that the global diversity of SARS-CoV-2 proteomes is shaped by specific amino acid mutational biases. Such biased amino acid composition generated by prevalent mutations may have a systematic impact on epitope processing and presentation to shape SARS-CoV-2 T cell immunity in human populations. To address this systematic impact, all downstream analyses described in this study were performed on the set of 1,933 prevalent mutations (>100 genomes) listed in Table S2.

Prominent removal of proline residues leads to a predicted global loss of epitopes presented by HLA-B7 supertype molecules

The association of peptides with the binding groove of HLA molecules largely relies on the presence of anchor residues, also known as peptide binding motifs (Falk et al., 1991). Hundreds of different peptide binding motifs have been reported over the last decades (Gfeller and Bassani-Sternberg, 2018). Overlapping binding motifs are qualified as "HLA supertypes" on the basis of their main anchor specificity (Greenbaum et al., 2011; Sidney et al., 2008). Of relevance here, proline acts as a critical anchor residue at position P2 for epitopes presented by HLA-B7 (B7) supertype molecules, which include a wide range of commonly expressed HLA-B alleles in humans, i.e.
HLA-B*07, -B*15, -B*35, -B*42, -B*51, -B*53, -B*54, -B*55, -B*56, -B*67 and -B*78 (Sidney et al., 2008). In fact, the B7 supertype covers ~35% of the human population (Francisco et al., 2015). Hence, we reasoned that the global removal of proline residues observed in the proteome of prevalent SARS-CoV-2 mutants (Figure 2) could drastically compromise T cell epitope binding to B7 supertype molecules, thereby potentially interfering with SARS-CoV-2 T cell immunity in a relatively large proportion of the human population.

Due to the preferential removal of proline by prevalent mutations, we investigated the extent to which proline residues were substituted at anchor binding position P2 and, consequently, resulted in loss of epitopes presented by B7 supertype molecules. To answer this, we performed the following four steps: (i) We applied NetMHCpan 4.1 (Reynisson et al., 2020) using the reference and mutated SARS-CoV-2 genomes to generate a list of all possible reference/mutated peptide pairs (8-11 mers) predicted to bind 16 common HLA-B types that belong to the B7 supertype family (Figure S5B). (ii) We analyzed all reference/mutated peptide pairs, along with their differential predicted binding affinities, to quantitatively identify HLA strong binder (SB) to non-binder (NB) transitions [(SB) NetMHCpan %rank < 0.5 to (NB) NetMHCpan %rank > 2]. (iii) We categorized all peptide pairs based on the mutation type (amino acid X → amino acid Y) and the position of the mutation within the peptide sequence. (iv) Lastly, we quantified the number of reference/mutated peptide pairs and the associated fold-change in predicted binding affinity for each category. Our results show that prevalent mutations predicted to impact the presentation of peptides by the B7 supertype are dominated by P→L (p = 8.6x10^-35) and P→S (p = 3.4x10^-24) substitutions at anchor residue position P2 (Figure 3A,B).
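The peptide enumeration in step (i) can be illustrated with a short sketch. This is a minimal Python illustration and not the pipeline used in the study: the function name `peptide_pairs`, the toy protein sequence and its alanine flanks are hypothetical, and the NetMHCpan prediction step is not reproduced. The sketch lists every 8- to 11-mer window that covers a substituted residue, together with the position of the substitution within each peptide (the information needed for step (iii)).

```python
def peptide_pairs(reference, position, new_residue, lengths=range(8, 12)):
    """Return (reference_peptide, mutated_peptide, position_in_peptide) tuples
    for every window of the given lengths covering the 0-based `position`.
    position_in_peptide is 1-based, so 2 corresponds to anchor position P2."""
    mutated = reference[:position] + new_residue + reference[position + 1:]
    pairs = []
    for k in lengths:
        first = max(0, position - k + 1)          # leftmost window covering the site
        last = min(position, len(reference) - k)  # rightmost window that still fits
        for start in range(first, last + 1):
            pairs.append((reference[start:start + k],
                          mutated[start:start + k],
                          position - start + 1))
    return pairs


# Hypothetical toy protein: the N105 epitope SPRWYFYYL flanked by alanines.
toy_protein = "AAASPRWYFYYLAAA"
pairs = peptide_pairs(toy_protein, position=4, new_residue="S")  # P -> S
for ref_pep, mut_pep, pos in pairs:
    if pos == 2 and len(ref_pep) == 9:
        print(ref_pep, "->", mut_pep, "at P2")
```

In this toy case the 9-mer pair SPRWYFYYL/SSRWYFYYL is reported with the substitution at P2. Each reference/mutated pair would then be scored for the HLA alleles of interest with a binding predictor.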
Reference/mutated peptide pairs from these categories were the most abundant, with > 250 mutated peptides per category (Figure 3C). P→L and P→S mutations resulted, on average, in a 61-fold reduction in predicted HLA binding affinity for a representative set of clinically validated CD8+ T cell epitopes (Figure 3D).

In addition to the dominant P→S/L substitution type, other P→X substitutions were observed. Interestingly, analysis of mutations found in the Pangolin B.1.1.7 variant (January 2021) showed that the P681H mutation found in the Spike protein led to disrupted association of the reference epitope SPRRARSVA for several HLA-B7 types. In fact, the P-to-H substitution resulted in a strong loss of epitope binding predicted for 7/16 HLA-B types tested. Thus, our results strongly suggest that biased substitutions of proline residues in the proteome of SARS-CoV-2 shape the repertoire of epitopes presented by the B7 supertype, including epitopes encoded by the genome of the B.1.1.7 variant. This finding led us to propose that mutation biases found in SARS-CoV-2 may contribute to CD8+ T cell epitope escape in a B7 supertype-dependent manner.

The mutational landscape of SARS-CoV-2 enables disruption or enhancement of epitope presentation in an HLA supertype-dependent manner

We found that specific amino acid residues were preferentially removed (proline, alanine and threonine) or introduced (isoleucine, phenylalanine, leucine and tyrosine) in SARS-CoV-2 proteomes (Figure 2). Importantly, most of these amino acids act as key epitope anchor residues for multiple HLA class I supertypes (Figure S5). For instance, phenylalanine and tyrosine are key anchor residues for all known A*24 alleles of the A24 supertype family, whereas proline is known to play a critical role in the anchoring of epitopes to alleles of the B7 supertype family (Figure 4).
Therefore, one would expect the introduction of phenylalanine and tyrosine in SARS-CoV-2 proteomes to facilitate peptide presentation by A24, whereas the removal of proline would disrupt peptide presentation by B7. With this concept in mind, we hypothesized that the distinct amino acid mutational biases found throughout prevalent SARS-CoV-2 mutations could systematically mold epitope presentation in an HLA supertype-dependent manner.

In order to compare supertypes to each other, we generated a 'Gain/Loss plot' for each supertype assessed (Figure 4C). Gain/Loss plots were generated by computing the number of mutations that resulted in 'Gain' or 'Loss' of epitopes for representative class I alleles selected for each supertype (see Methods for details). 'Gain' was assigned for mutated epitopes that were predicted to transition from non-HLA binders (NetMHCpan %rank > 2) to strong HLA binders (NetMHCpan %rank < 0.5), whereas 'Loss' was assigned for mutated epitopes that were predicted to transition from strong HLA binders to non-HLA binders. Surprisingly, our analysis shows that most supertypes preferentially gain new epitopes as a result of SARS-CoV-2 mutations: A1 (p = 4.5x10^-11), A2 (p = 0.001), A24 (p = 1.0x10^-26), B8 (p = 2.4x10^-14), B27 (p = 2.5x10^-6). Interestingly, preferential loss of epitopes was only shown to be statistically significant for the B7 supertype (p = 0.0012). We attribute the relatively weak statistical significance obtained for the B7 supertype to the presence of isoleucine and phenylalanine (preferentially introduced in SARS-CoV-2 proteomes; see Figure 2) at anchor residue P9 for certain HLA types (namely HLA-B*51:01 and HLA-B*53:01) (Figure 4A). In fact, omitting motifs containing isoleucine or phenylalanine increased the significance of epitopes lost versus gained (p = 2.6x10^-7) (Figure 4C).
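The 'Gain'/'Loss' assignment described here is a simple threshold rule on predicted %rank values, and it can be sketched as follows. This is an illustrative Python sketch only: the function names and the toy %rank values are hypothetical, and the values are not NetMHCpan output. Only the two thresholds (strong binder below 0.5, non-binder above 2) come from the text.

```python
from collections import Counter

SB_MAX_RANK = 0.5  # strong binder: %rank below 0.5
NB_MIN_RANK = 2.0  # non-binder: %rank above 2


def classify_transition(ref_rank, mut_rank):
    """Classify a reference/mutated peptide pair from its two %rank values."""
    if ref_rank < SB_MAX_RANK and mut_rank > NB_MIN_RANK:
        return "Loss"   # strong binder -> non-binder
    if ref_rank > NB_MIN_RANK and mut_rank < SB_MAX_RANK:
        return "Gain"   # non-binder -> strong binder
    return "Other"      # weak binders or no category change


def gain_loss_counts(rank_pairs):
    """Tally transitions over (ref_rank, mut_rank) pairs for one allele or supertype."""
    return Counter(classify_transition(ref, mut) for ref, mut in rank_pairs)


# Made-up %rank pairs, for illustration only.
toy_pairs = [(0.1, 5.0), (4.0, 0.3), (0.4, 1.5), (0.2, 3.1)]
counts = gain_loss_counts(toy_pairs)
print(counts["Gain"], counts["Loss"])
```

Pairs that move between strong and weak binding, or between weak binding and non-binding, fall into neither category under this rule, which keeps the tallies restricted to clear-cut transitions.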
Together, our results show that the amino acid mutational biases that characterize the global diversity of SARS-CoV-2 proteomes can positively or negatively affect binding affinities of mutated epitopes for a wide range of HLA class I molecules in a supertype-dependent manner.

The C-to-U point mutation bias largely drives diversification of SARS-CoV-2 T cell epitopes

Next, we sought to better understand the genetic determinants that drive the association between epitope presentation and the amino acid mutational biases found in the SARS-CoV-2 population. To this end, we analyzed the abundance of all the possible nucleotide mutation types (i.e. A-to-C, A-to-G, A-to-U, C-to-A, C-to-G, C-to-U, etc.). This analysis indicates that C-to-U is the most common mutation type (43%), followed by G-to-U (28%), as well as A-to-G, G-to-A and U-to-C (from 9.7% to 11.6%) (Figure S6A [...]

Next, we aimed to determine the contribution of these different nucleic acid mutation types to the global mutational pattern observed at the amino acid level in Figure 2. To do so, we generated simulated population samples of 1000 SARS-CoV-2 genomes using SANTA-SIM (Jariani et al., 2019), applying various extents of mutational biases corresponding to the two most common mutation types observed (i.e. C-to-U and G-to-U). The resulting simulated viral populations were then analyzed to elucidate the global amino acid mutational pattern engendered by these simulated nucleic acid point mutation biases, and whether they recapitulate the observed patterns. Indeed, our data show that the mutational pattern resulting from the simulated C-to-U bias very closely mimicked the mutational pattern observed in the real-life dataset (Figure 5A).
Namely, the in silico introduction of a C-to-U mutation bias resulted in the preferential removal of alanine, proline, and threonine, by 6.7% (p = 5.1x10^-11), 6.9% (p = 1.2x10^-11) and 8% (p = 4.8x10^-12), respectively, as well as the introduction of isoleucine and phenylalanine by 8.2% (p = 1.3x10^-8) and 5.2% (p = 4.3x10^-11), respectively (Figure 5A). The G-to-U mutation bias also contributed to the introduction of isoleucine and phenylalanine (Figure S6). Together, these results show that the predominant C-to-U point mutations largely contribute to shaping the global proteomic diversity of SARS-CoV-2.

Given the significant impact of the C-to-U point mutation bias on the amino acid content of SARS-CoV-2 proteomes, we reasoned that C-to-U could be the main driver shaping the repertoire and diversification of SARS-CoV-2 T cell targets in human populations, including targets presented by the particularly interesting B7 supertype molecules. To investigate this, we used all the SARS-CoV-2 CD8+ T cell epitopes that were experimentally validated using peripheral blood mononuclear cells (PBMC) of acute and convalescent COVID-19 patients (Quadeer et al., 2020; Tarke et al., 2020) and matched them with their corresponding nucleic acid sequence found in reference/mutated genome pairs. We then calculated the frequency of the various mutation types (i.e. A-to-C, A-to-G, A-to-U, C-to-A, C-to-G, C-to-U, etc.) coding for the mutated form of those clinically validated CD8+ T cell epitopes. Importantly, we found that C-to-U and G-to-U were the two main mutation types leading to mutated epitopes, both accounting for 37% of all mutation types amongst prevalent mutations (>100 individuals) (Figure 5B).
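The frequency calculation for nucleotide mutation types can be sketched in a few lines of Python. This is a hedged illustration and not the study's code: the function name `mutation_type_fractions` and the toy sequences are hypothetical, the sequences are assumed to be pre-aligned and of equal length, and every observed difference is counted once per sequence (how occurrences were weighted in the study is not reproduced here).

```python
from collections import Counter


def mutation_type_fractions(reference, variants):
    """Fraction of each substitution type (e.g. 'C-to-U') among all
    single-nucleotide differences between aligned variants and the reference."""
    counts = Counter()
    for sequence in variants:
        for ref_base, alt_base in zip(reference, sequence):
            if ref_base != alt_base:
                counts[f"{ref_base}-to-{alt_base}"] += 1
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Toy RNA sequences, for illustration only.
toy_reference = "ACGU"
toy_variants = ["AUGU", "ACUU", "AUGU"]
fractions = mutation_type_fractions(toy_reference, toy_variants)
for kind, fraction in sorted(fractions.items()):
    print(kind, round(fraction, 2))
```

Restricting `variants` to the coding sequences of mutated epitopes would give the per-epitope breakdown described in the text, under the same assumptions.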
Most strikingly, 62% of the prevalent mutations predicted to disrupt the presentation of epitopes by HLA alleles of the B7 supertype were found to derive from the C-to-U mutation type (Figure 5B). These results strongly suggest that the dominant C-to-U point mutation bias found amongst prevalent SARS-CoV-2 mutants has the potential to significantly contribute to shaping the repertoire of SARS-CoV-2 T cell epitopes in B7 supertype individuals across human populations.

Collectively, our study leads us to propose the model that C-to-U editing enzymes play a fundamental role in shaping the mutational landscape dynamics of SARS-CoV-2 CD8+ T cell targets in humans (Figure 5C), and hence, may contribute to molding T cell immunity against COVID-19 at the population level.

[...] In the case of a novel virus such as SARS-CoV-2, such a relationship remains to be established and is outside the scope of our work. Here, we reasoned that an alternative approach to interrogating SARS-CoV-2 epitope-associated variants is to investigate the global genomic and proteomic diversity of SARS-CoV-2 for any outstanding mutational biases, and then to assess the relationship between such biases and epitope presentation for a broad set of HLA alleles. In other words, in this study, we did not seek to understand how viral mutations are shaped by T cell immunity, but rather to understand how mutational biases in SARS-CoV-2 may have shaped T cell immunity at the population level during the first year of the pandemic. This approach was possible thanks to an unprecedented number of SARS-CoV-2 genome sequences available for downstream analysis. Our approach is universal and could be applied to other epidemic or pandemic viruses in the future, given the development of distinct, prevalent mutational biases.
Importantly, our global approach has led to several striking conclusions to help understand how the increasing genomic diversity of SARS-CoV-2 may shape T cell immunity in human populations. Our findings have important implications that are discussed below in the context of disease severity, viral evolution and vaccine resistance.

In this study, we found that prevalent SARS-CoV-2 mutations are governed by defined mutational patterns, with C-to-U being a predominant mutation type, as previously shown by [...] the C-to-U mutation bias in SARS-CoV-2 genomes has a remarkably close relationship with the observed amino acid mutational biases, indicating that C-to-U mutations largely contribute to the global proteomic diversity of SARS-CoV-2. Most importantly, we show that this mutational bias leads to the preferential substitution of proline residues with leucine or serine residues in the P2 anchor position of SARS-CoV-2 CD8+ T cell epitopes, and hence, drastically compromises epitope binding to B7 supertype molecules, which cover ~35% of the human population (Francisco et al., 2015). Therefore, the C-to-U mutational bias observed amongst prevalent mutants may partially disrupt SARS-CoV-2 T cell immunity in a very significant proportion of the human population. Notably, this impact of C-to-U mutations on B7-dependent epitope escape was somewhat predictable. In fact, proline residues originate from codons that are highly rich in C whereas serine and leucine residues originate from codons that are rich in both C and U. One could therefore predict, at least to some extent, that a strong C-to-U bias would lead to proline-to-leucine or proline-to-serine substitutions. Thus, this study highlights the impact of viral mutational biases and codon usage in shaping the diversity of CD8+ T cell targets.
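The codon argument can be checked directly against the standard genetic code. The following Python sketch is an illustration added for clarity, not part of the study's analysis; the function name `c_to_u_outcomes` is hypothetical, and each codon is weighted equally, which ignores SARS-CoV-2 codon usage. It enumerates every single C-to-U change in the codons of a given amino acid and tallies the residue that results.

```python
from collections import Counter
from itertools import product

# Standard genetic code in RNA alphabet, codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def c_to_u_outcomes(amino_acid):
    """Tally the residues produced by every single C-to-U change
    in every codon of `amino_acid` (one-letter code)."""
    outcomes = Counter()
    for codon, encoded in CODON_TABLE.items():
        if encoded != amino_acid:
            continue
        for index, base in enumerate(codon):
            if base == "C":
                mutated = codon[:index] + "U" + codon[index + 1:]
                outcomes[CODON_TABLE[mutated]] += 1
    return outcomes


for residue in "PAT":
    print(residue, dict(c_to_u_outcomes(residue)))
```

For proline (codons CCN), a C-to-U change at the first codon position always gives serine (UCN) and a change at the second position always gives leucine (CUN); only the third-position change in CCC is synonymous. The same enumeration gives valine for alanine and mostly isoleucine for threonine.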
This being said, it is important to realize that we do not make the claim that the presence of proline-to-leucine or proline-to-serine mutations in the SARS-CoV-2 proteomes depends on patients being B7 supertype-positive, or that the B7 supertype drives the evolution of proline-to-leucine/serine mutations. We do, however, demonstrate that the prevalent mutations currently in circulation are enriched for proline-to-leucine/serine, and our in silico predictions suggest that the high occurrence of this mutation type leads to widespread hindrance of epitope presentation in B7 supertype-positive individuals.

A key question to address is to what extent the C-to-U bias drives SARS-CoV-2 evolution and adaptation over the course of the ongoing pandemic. As proposed by others, the most likely explanation for the observed C-to-U bias is the action of the host-mediated RNA-[...] very much in line with our findings. Indeed, we showed that amino acid mutation biases in SARS-CoV-2 proteomes generally positively affect epitope binding for various HLA class I supertypes, and most strikingly for A24, whereas B7 is the only supertype negatively affected by the mutation biases given the marked loss of proline residues in SARS-CoV-2 proteomes. Together, our results raise the important hypothesis that host-mediated RNA editing systems shape the repertoire of SARS-CoV-2 T cell epitopes in a positive and negative HLA-dependent manner.

Another question is whether populations of B7 supertype individuals represent an advantageous reservoir for the virus to evolve toward more transmissible variants. As the genetic diversity of the SARS-CoV-2 population continues to increase, and as new variants emerge, our global analysis suggests that the probability for SARS-CoV-2 epitopes to escape CD8+ T cell immunosurveillance is much higher in B7 individuals compared to A24 individuals.
In fact, a slower T cell response dynamic to control SARS-CoV-2 infection in B7 individuals may offer a selective advantage for the virus to evolve. In this regard, we noted that the B.1.1.7 variant lost the B7 supertype-associated epitope SP/HRRARSVA as a result of a proline-to-histidine substitution. [...] Saharan Africa) (http://www.allelefrequencies.net/top10freqs.asp) may provide insights into this concern. As new variants of concern continue to emerge and as new epitope data are continuously being generated (Grifoni et al., 2021), another interesting avenue would be to study the mutational patterns of those emerging variants and assess whether and how the potential loss of B7-associated epitopes in those specific variants impacts T cell response in infected patients. Understanding the impact of losing several subdominant B7-associated epitopes versus one single immunodominant epitope could also be investigated in the context of those variants. In this regard, particular attention was given in our study to the B*07:02-restricted N105 epitope SPRWYFYYL. This epitope is of high interest as its immunodominance was experimentally demonstrated in many independent studies (Ferretti et [...] P→S at P2 of this epitope (SPRWYFYYL → SSRWYFYYL). Its occurrence was predicted to result in the complete abrogation of binding of the epitope to B*07:02, thereby likely reducing the breadth of the immune response in individuals carrying this mutation. As such, we advise the community to carefully monitor this mutation in subsequent months. Moreover, it is also possible that B7 individuals respond less efficiently to the currently available vaccines, as genetic variants promoting B7 escape might favorably emerge in the future. The B7 supertype could therefore potentially represent a biomarker of vaccine resistance.
In summary, our study shows that mutation biases in the SARS-CoV-2 population diversify the repertoire of SARS-CoV-2 T cell targets in humans in an HLA supertype-dependent manner. Hence, we provide a foundational model to help understand how SARS-CoV-2 may continue to mutate over time to shape T cell immunity at a global population scale. The proposed process will likely continue to influence the evolution and diversification of SARS-CoV-2 lineages as the virus is under tremendous pressure to adapt in response to mass vaccination.

Our analyses focused on class I molecules, for which predictors are established to be more accurate in comparison with class II. HLA-C and non-classical HLA were not included in this study. Predictions were performed on the most common HLA class I alleles, and rare HLA alleles were not included. The study was performed using the GISAID dataset available on December 31st, 2020, i.e. the first year of the pandemic, before mass vaccination. Our epitope binding results rely on in silico predictions using a method that has been widely benchmarked, but is designed to predict peptide presentation rather than immunogenicity. Follow-up experiments would need to be performed to further validate the proposed model. Priority follow-up studies are 1) to investigate T cell response to SARS-CoV-2 mutants in large cohorts of B7 supertype-positive versus negative patients, and 2) to determine the direct role of APOBEC family proteins in modulation of SARS-CoV-2-specific T cell immunity. Moreover, this study lays the foundation to understand the [...] Table S1. The 1,933 prevalent mutations observed in more than 100 genomes are also clearly shown in Table S2.

[...] binding affinity as well as eluted ligand data to produce a likelihood score for a peptide to be an eluted ligand for the indicated HLA types.
The likelihood score consists of a percentile rank (%rank) wherein predicted (weak) binders obtain a %rank below 2.0, whereas strong binders (SB) obtain a %rank below 0.5. Using this ranking system, only mutation-containing peptides where the mutated and/or the reference peptide were ranked as SB were considered for further analyses. [...]

We simulated SARS-CoV-2 genomes with SANTA-SIM, using the consensus sequence Wuhan-Hu-1 as input sequence, available at https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3. Each simulation was run with a population size of 10,000 individual viral sequences evolving for 1000 generations, and analyses were conducted on random samples of 1,000 viral sequences. [...]

[...] globally in at least 4 GISAID entries were analysed together. Preferential introduction or removal of amino acids was determined by comparing the overall amino acid composition in reference residues vs mutated residues throughout the mutation pool, resulting in a percentage difference in amino acid composition. As such, for amino acid X, the % difference was calculated according to the following formula: [...]

This analysis took into consideration the number of unique mutations. Therefore, to consider mutational biases in the context of mutation frequencies, the analysis described above was conducted separately for mutations occurring in a single GISAID entry (expected to be enriched for errors); 2-10 GISAID entries; 11-99 GISAID entries; and 100 or more GISAID entries. As a negative control, the SANTA-SIM algorithm was used to simulate the neutral evolution of 1000 SARS-CoV-2 genomes (baseline simulations, N = 10 replicates). This control was used to calculate the statistical significance of the observed biases, by way of a one-sample t-test.
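Because the formula itself did not survive in this text, the following Python sketch shows only one plausible reading of the described calculation and should not be taken as the authors' definition. It assumes that the % difference for amino acid X is the share of X among mutated (introduced) residues minus the share of X among reference (removed) residues, over a pool of unique mutations. The function names, the toy mutation list and the toy baseline values are all hypothetical. The second function computes the standard one-sample t statistic for a set of baseline replicate values against one observed value.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def composition_difference(mutations):
    """ASSUMED definition: percentage-point difference between the composition
    of introduced residues and removed residues, per amino acid.
    `mutations` holds one (reference_residue, mutated_residue) pair per unique mutation."""
    mutations = list(mutations)
    total = len(mutations)
    removed = Counter(ref for ref, _ in mutations)
    introduced = Counter(mut for _, mut in mutations)
    residues = set(removed) | set(introduced)
    return {aa: 100.0 * (introduced[aa] - removed[aa]) / total for aa in residues}


def one_sample_t(baseline_values, observed):
    """One-sample t statistic: baseline replicate mean versus the observed value."""
    n = len(baseline_values)
    return (mean(baseline_values) - observed) / (stdev(baseline_values) / sqrt(n))


# Toy pool of unique amino acid substitutions, for illustration only.
toy_mutations = [("P", "L"), ("P", "S"), ("A", "V"), ("T", "I")]
differences = composition_difference(toy_mutations)
print(differences["P"], differences["L"])

# Toy baseline replicates for one residue, compared with an observed value.
t_statistic = one_sample_t([1.0, 2.0, 3.0], observed=0.0)
```

Under this reading, negative values mark residues that are preferentially removed and positive values mark residues that are preferentially introduced, and the values sum to zero across all residues.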
Reference/mutated peptide pairs for which the differential predicted binding affinities led to transitions from strong HLA binder (SB) to non-HLA binder (NB) [(SB) NetMHCpan %rank < 0.5 to (NB) NetMHCpan %rank > 2] or from NB to SB were identified, catalogued and analyzed as described above. Binding affinities were predicted for representative HLA types from several [...]

[...] were translated into protein sequences and analyzed for the identification of any amino acid mutational bias. Amino acid residues (x-axis) that were removed and introduced in SARS-CoV-2 variants are presented by negative and positive % difference in overall amino acid composition (GRSO values; y-axis), respectively. Analysis of mutational biases was performed for mutations occurring at various frequencies: 1 genome (blue line), 2 to 100 genomes (orange line), 100 to 1000 genomes (green line) and more than 1000 genomes (red line). Simulations of neutral evolution (random mutations) were performed using the SANTA-SIM algorithm and serve as a control for assessing the statistical significance of the observed pattern for individual amino acid residues. The dotted red lines show the cutoff values (fold change > 4; p-value < 1x10^-11) that were used to define the residues that were preferentially removed or introduced (asterisk).

[Figure panel labels: Simulated dataset; Real-life dataset (Fig. 2)]

Figure 5. The C-to-U point mutation bias largely drives the diversity of SARS-CoV-2 proteomes and CD8+ T cell epitopes. (A) Comparison of global amino acid mutational patterns generated from real-life versus simulated SARS-CoV-2 genomes. Amino acid residues (x-axis) that were removed and introduced in real-life versus simulated SARS-CoV-2 are presented by negative and positive % difference in overall amino acid composition (GRSO values; y-axis), respectively. Evolution of SARS-CoV-2 was simulated by introducing various extents of C-to-U biases, i.e. x1, x15 and x20 (n = 10).
The red line shows the pattern obtained from mutations identified in more than 100 SARS-CoV-2 genomes, related to

Anchor residues are located at P2 and P9. Pale orange and green squares cover amino acid residues that are preferentially introduced (F, I, L, Y) and removed (A, P, T) in SARS-CoV-2 proteomes, respectively. Representative supertypes used in this study are shown by an asterisk. Epitope binding motifs were extracted from NetMHCpan Motif Viewer (http://www.cbs.dtu.dk/services/NetMHCpan/logos_ps.php).
(A) Histograms showing the number of unique mutations identified for each mutation type (A-to-C, A-to-G, etc.) after simulating the evolution of SARS-CoV-2 genomes through the introduction of different C-to-U bias values (x4 to x20) using the SANTA-SIM software.
Simulated (black squares) and real-life/observed prevalent mutations found in more than 100 genomes (red square) at the nucleotide level are shown. (B) Comparison of global amino acid mutational patterns generated from simulated versus real-life/observed SARS-CoV-2 genomes. Various extents of C-to-U (top) and G-to-U (bottom) biases were

Figure S1C,D) to one of 12 highly common HLA types queried (color coded) due to a mutation.