key: cord-0829684-g2y63b5v
authors: Pretti, M. A. M.; Galvani, R. G.; Vieira, G. F.; Bonomo, A.; Bonamino, M. H.; Boroni, M.
title: Class I HLA allele restricted antigenic coverage for Spike and N proteins is associated with divergent outcomes for COVID-19
date: 2020-06-05
journal: nan
DOI: 10.1101/2020.06.03.20121301
sha: efe2cd737e873a8dd47179cdbe7ba1abdccee7a4
doc_id: 829684
cord_uid: g2y63b5v

The world is dealing with one of the worst pandemics ever. SARS-CoV-2 is the etiological agent of COVID-19, which has already spread to more than 200 countries. However, infectivity, severity and mortality rates are not the same in all countries. Here we investigate the landscape of potential HLA-I A and B restricted SARS-CoV-2-derived antigens and how different populations in the world are predicted to respond to those peptides considering their HLA-I distribution frequencies. Clustering of HLA-A and HLA-B allele frequencies partially separates most countries with the lowest number of deaths per million inhabitants from the other countries. We further correlated the patterns of in silico predicted strong binder peptides with epidemiological data. The number of deaths per million inhabitants correlated inversely with the antigen coverage of peptides derived from viral protein S, while a direct correlation was observed for those derived from viral protein N, highlighting a potential risk group carrying HLAs associated with the latter. In addition, we identified 7 potential antigens bearing at least one amino acid of the small insertion that differentiates SARS-CoV-2 from previous coronavirus strains. We believe these data can contribute to the search for peptides with the potential to be used in vaccine strategies.

Since December 2019 the world has been facing one of the worst pandemics of all time, caused by a novel Betacoronavirus (Severe Acute Respiratory Syndrome Coronavirus 2, SARS-CoV-2). To date, more than 5.2 million cases and 337 thousand deaths have been confirmed. However, the infection does not strike all nations equally when considering multiple epidemiological parameters (1). The viral genome of this etiological agent consists of structural protein genes such as Spike (S), Envelope (E), Membrane (M), and Nucleocapsid (N) and non-structural protein genes (e.g. ORF1ab) (2). Their functions have been extensively reviewed elsewhere (2,3). Once in the human body, SARS-CoV-2 infects cells using the surface molecules ACE2 and TMPRSS2, which are highly expressed in the lungs.
The infection is recognized by macrophages and alveolar epithelial cells that initiate a pro-inflammatory response, which may trigger acute respiratory distress syndrome among other symptoms (4-7). SARS-CoV-2 viral proteins have been shown to elicit both cellular and humoral immune responses (8). One of the most promising viral proteins for vaccine development is the S protein, due to its accessibility to antibodies and pivotal role in the infection (9). Nevertheless, the antibody response appears first against the N protein, the most abundant protein in coronaviruses, and a few days later against the S protein (8,9). The ability to trigger an adaptive immune response relies greatly on the ability to present antigens through the class I and II Human Leukocyte Antigen (HLA) molecules. Peptides derived from different SARS-CoV-2 proteins are recognized by CD4 and CD8 T cells from COVID-19 convalescent patients (10). The allelic distribution of the highly polymorphic HLA genes across different countries may affect the presentation of SARS-CoV-2-derived peptides among different populations (11) and, as a result, distinct profiles of disease susceptibility and severity are observed (1). Therefore, these characteristics must be taken into account when considering epitope-based vaccines, since target epitopes must not only bind to the HLA molecule but also prove to be immunogenic in order to promote a functional response (12 [...]

This version posted June 5, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

[...] (Figure S2A).
However, when we evaluated cases per million inhabitants, IRL was the leading country in the rank, followed by SGP, USA, ITA, SWE, and POR, while BRA, THA, CHN, INA, TUN, and JPN were the countries with the lowest ratios (Figure S2B). The USA also led the rank in number of COVID-19 deaths, followed by GBR, ITA, FRA, BRA, and DEU, while the countries with the lowest number of deaths were NZL, OMA, SGP, SEN, GHA, and TUN (Figure S2C). When analyzing the number of deaths per million inhabitants, ESP was the country with the highest ratio, followed by ITA, GBR, FRA, SWE, and HOL, while THA, GHA, SEN, CHN, MAL, and INA were the countries with the lowest ratios (Figure S2D). The USA also had the highest number of serious cases, followed by BRA, FRA, GBR, DEU, and ESP, while the countries with the lowest number of serious cases were MAR, UAE, TUN, GHA, SEN, and ARM (Figure S2E). On the other hand, the USA also led in the number of recovered patients, followed by ESP, ITA, BRA, CHN, and RUS, while BUL, TUN, SGP, SEN, ROM, and GRE were the countries with the lowest numbers of recovered patients (Figure S2F).

To evaluate whether the distribution of HLA allele frequencies among populations could explain the epidemiological data, we performed an unsupervised hierarchical clustering using HLA-A and HLA-B frequencies for each distinct population; the populations were divided into quartiles according to each epidemiological parameter. Clustering of HLA allele frequencies among populations best explained deaths per million inhabitants. Therefore, we chose deaths per million inhabitants as the main metric, since it has also been used as a proxy for disease outcome (16,17).
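As an illustration of the quartile split described above, the following minimal Python sketch assigns each country to Q1-Q4 by deaths per million inhabitants. The country codes and values are hypothetical placeholders, the function name `assign_quartiles` is ours, and the inclusive quantile method is an assumption, since the text does not state how quartile boundaries were computed. The hierarchical clustering itself is not reproduced here.

```python
from statistics import quantiles


def assign_quartiles(deaths_per_million):
    """Map each country code to Q1..Q4 by its deaths-per-million value."""
    values = list(deaths_per_million.values())
    # Three cut points that split the observed values into four groups.
    # method="inclusive" is an assumption of this sketch.
    cut1, cut2, cut3 = quantiles(values, n=4, method="inclusive")
    labels = {}
    for country, value in deaths_per_million.items():
        if value <= cut1:
            labels[country] = "Q1"  # fewest deaths per million
        elif value <= cut2:
            labels[country] = "Q2"
        elif value <= cut3:
            labels[country] = "Q3"
        else:
            labels[country] = "Q4"  # most deaths per million
    return labels


# Hypothetical values for illustration only, not the study data.
example = {"AAA": 1.0, "BBB": 5.0, "CCC": 20.0, "DDD": 80.0,
           "EEE": 150.0, "FFF": 300.0, "GGG": 450.0, "HHH": 600.0}
labels = assign_quartiles(example)
```

The resulting labels could then be used to color the leaves of a dendrogram built from the HLA-A and HLA-B frequency vectors.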
We observed two branches enclosing 8 of 11 countries that belong to the first quartile (Q1, green, fewest deaths per million), while countries with the worst outcome (Q4, red, most deaths per million) tended to cluster together (Figure 1) [...]

This is a provisional file, not the final typeset article

In order to study the binding of SARS-CoV-2 peptides to different HLA alleles, we predicted the affinities of HLA alleles (33 HLA-A and 78 HLA-B, Supplementary File 2) for each potential SARS-CoV-2-derived peptide using the netMHCpan tool (18). The 58,470 predicted 8-11 mer peptides were derived from the complete genome of SARS-CoV-2 deposited at the NCBI (BetaCoV/Wuhan/IPBCAMS-WH-01/2019, see Methods). Peptides were categorized as Non-Binders (NB, %Rank ≥ 2; 38,413 peptides), Strong Binders (SB, %Rank < 0.5; 5,785 peptides) and Weak Binders (WB, 0.5 ≤ %Rank < 2; 12,731 peptides) based on the %Rank definition by the developers, which accounts for the probability that a peptide binds the HLA given a pool of natural ligands. The predicted affinity value (in nM, where lower values indicate tighter binding) increases as the qualitative classification assigned by the program weakens: it is lowest for SB (median of 180.6 nM), intermediate for WB (median of 1,575.9 nM) and highest for NB (median of 36,442.2 nM, Figure S3A). In order to define which group of peptides we should work with, we compared the predicted in silico affinities with affinities of viral- [...] an IC50 threshold. A similar pattern was observed for the experimental affinities, which were lower for Pos-H (median of 10.1 nM) and higher for Pos-L peptides (median of 1,430 nM; Wilcoxon, p < 0.0001, Figure S3B). Moreover, we also predicted the binding affinities of 18,522 unique peptide/HLA pairs retrieved from the IEDB whose capacity to elicit a TCR response was tested through 34,872 T cell assays.
They are qualitatively classified as Positive (Pos, 2,705) or Negative (Neg, 15,817) on the basis of in vitro assays such as ELISPOT and chromium release. A significant difference in the affinities was observed when comparing Pos (median of 49.1 nM) with Neg (median of 188.0 nM) assays (Wilcoxon, p < 0.0001, Figure S3C). It is important to mention that a difference in affinity (Wilcoxon, p < 0.0001) was observed for the predicted data (Figure S3D) but no overall difference in binding affinities was observed for IEDB HLA-A and B alleles (Figure S3E). Fifty-seven percent of the HLA alleles assessed in the in vitro binding assays selected from the IEDB are shared with the HLAs used for predictions in our analysis. In summary, the predicted affinities of peptides classified as strong, weak or non-binders are comparable to the experimental affinities from ligand assays at the IEDB, showing that SB peptides have affinities equivalent to those of the vast majority of peptides experimentally proven to bind the HLA (Pos-H and Pos-I, Figures S3A-B), and thus probably elicit an immune response. Therefore, we decided to work only with the SB peptides in the downstream analyses.

Across the 111 analyzed alleles, a total of 5,785 unique SB peptides were identified after the removal of 9 SB peptides that matched, with 100% identity, sequences found in the human proteome using a BLAST search. The number of SB peptides potentially presented by each HLA allele is depicted in Figure 2. The alleles presenting the highest number of SB peptides are HLA-A*11:01 (11,385 peptides) and HLA-B*35:01 (11,174 peptides).

We confirmed that the number of SB peptides increases with the number of HLA alleles for a given population by performing a Pearson correlation between these two variables (R=0.96, p-value < 0.01, Figure S4).
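The %Rank thresholds used to label peptides can be written down directly. The sketch below is a minimal illustration, assuming the prediction output has already been parsed into (peptide, allele, %Rank) tuples; the tuple layout, the function names and the example records are hypothetical and do not reproduce the netMHCpan output format.

```python
def classify_binder(percent_rank):
    """Label a peptide/HLA pair using the %Rank thresholds quoted in the text."""
    if percent_rank < 0.5:
        return "SB"  # strong binder: %Rank < 0.5
    if percent_rank < 2.0:
        return "WB"  # weak binder: 0.5 <= %Rank < 2
    return "NB"      # non-binder: %Rank >= 2


def keep_strong_binders(records):
    """Return only the (peptide, allele, %Rank) records classified as SB."""
    return [rec for rec in records if classify_binder(rec[2]) == "SB"]


# Hypothetical records for illustration only (placeholder peptide names).
records = [
    ("PEPTIDEA", "HLA-A*11:01", 0.12),
    ("PEPTIDEB", "HLA-B*35:01", 1.40),
    ("PEPTIDEC", "HLA-A*11:01", 7.50),
]
strong = keep_strong_binders(records)
```

Only the records kept by `keep_strong_binders` would enter the downstream coverage analyses in this sketch.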
Even though this correlation could be expected, it is important to emphasize that the number of SB peptides restricted to a given HLA allele varies greatly and should affect the antigen coverage of a given population.

[...] peptides derived from the protein N (420 aa, Figure S5A).

The use of HLA-restricted epitopes capable of binding with substantial affinity to the HLA alleles provides a means of addressing population coverage related to HLA polymorphism and different HLA binding specificities. We therefore calculated the percentage of individuals from each population predicted to respond to the class I SARS-CoV-2-derived SB epitope set on the basis of HLA genotypic frequencies and HLA binding data, using the population coverage calculation tool available through the IEDB. This allowed us to obtain the number of epitope/HLA combinations that cover each population at a given rate (Table S2, [...] positive correlation was observed for the N protein (p=0.021, R=0.37) (Figure 4). Consistently, other significant correlations were also found when considering different epidemiological parameters, which are depicted in Figures S8-11.
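To make the idea of population coverage concrete, the sketch below computes the fraction of individuals expected to carry at least one HLA allele that presents an epitope from a given set. It is a simplified illustration and not the IEDB implementation: it assumes Hardy-Weinberg proportions within each locus and independence between the HLA-A and HLA-B loci, and the allele frequencies shown are hypothetical.

```python
def locus_coverage(binding_allele_freqs):
    """Fraction of diploid individuals carrying >= 1 binding allele at one locus.

    Assumes Hardy-Weinberg proportions (an assumption of this sketch).
    """
    f = min(sum(binding_allele_freqs), 1.0)
    # Probability that neither of the two chromosomes carries a binding allele
    # is (1 - f)^2; coverage is the complement.
    return 1.0 - (1.0 - f) ** 2


def population_coverage(freqs_by_locus):
    """Combine per-locus coverage, assuming the loci are independent."""
    uncovered = 1.0
    for freqs in freqs_by_locus.values():
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Hypothetical frequencies of epitope-binding alleles in one population.
example = {"A": [0.20, 0.10], "B": [0.15]}
coverage = population_coverage(example)
```

With these placeholder frequencies, roughly 65% of individuals would carry at least one presenting allele.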
In fact, a stronger association with disease outcome was found when evaluating the ratio between the populations' AUC predicted for the S and N proteins versus the number of deaths per million inhabitants (R=-0.56, p=0.00028, Pearson correlation, Figure 5). This result highlights the opposite roles associated with potential recognition and presentation of peptides derived from these two SARS-CoV-2 proteins in the outcome of COVID-19. To explore this recognition pattern further, we calculated the frequency of SB peptides derived from the S or N proteins by dividing the number of SB peptides by the respective protein length, as in Table S1, to account for the difference in protein sizes. A ratio was then calculated by dividing the SB frequency for protein S by the SB frequency for protein N. The majority of HLA-A and B alleles have a ratio greater than 1, meaning that they preferentially bind SB peptides derived from the S rather than the N protein (Figure S12A [...]

We highlighted the SARS-CoV-2-derived SB peptides generated within each ORF or protein (left Y-axis, Figure 6A) as well as the number of HLA-I alleles capable of presenting them (right Y-axis, Figure 6A). We observed regions capable of generating an elevated number of unique SB peptides (black bars, top) in comparison to others, and the ability of each of these peptides to bind to several HLA-I alleles (blue dots, bottom), which can help in selecting vaccine candidates. We found 40 unique SB peptides falling within the receptor-binding motif and 7 SB peptides derived from the 4 amino acid insertion (19) in the S1/S2 cleavage site (highlighted in purple in Figure 6B).
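The length-normalized S/N ratio described above amounts to a short calculation per HLA allele, sketched here in Python. The function names and the peptide counts and lengths in the example are hypothetical placeholders chosen for easy arithmetic; they are not values from the study.

```python
def sb_frequency(n_strong_binders, protein_length):
    """Number of SB peptides per amino acid of the source protein."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return n_strong_binders / protein_length


def s_to_n_ratio(sb_s, len_s, sb_n, len_n):
    """Ratio of length-normalized SB frequencies (protein S over protein N)."""
    freq_n = sb_frequency(sb_n, len_n)
    if freq_n == 0:
        # No SB peptide from N for this allele: the ratio is unbounded.
        return float("inf")
    return sb_frequency(sb_s, len_s) / freq_n


# Hypothetical counts and lengths for a single allele.
ratio = s_to_n_ratio(sb_s=50, len_s=1000, sb_n=10, len_n=400)
```

A ratio above 1, as in this example, would indicate an allele that preferentially presents S-derived strong binders once protein size is taken into account.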
The 7 SB peptides derived from the insertion vary greatly in their range of affinities and in the number of HLA binders (Table 1) [...]

In this work, we investigated the distinct susceptibility associated with HLA diversity profiles among populations facing SARS-CoV-2. We applied multiple in silico approaches to predict the populations' response to SARS-CoV-2 infection and correlated these findings with the number of deaths per million and other epidemiological parameters. We suggest that a positive population outcome relies partially on the ability to present peptides derived from the S protein, while peptides derived from the N protein correlate with unfavorable outcomes.

Our analyses depict the potential population patterns of recognition of class I HLA-restricted SARS-CoV-2 peptides. The HLA allele repository allowed us to carefully select populations, prioritizing blood donor registries over anthropology studies to better represent a country. Since the most relevant genes for CTL responses are HLA-A and B, and considering that HLA-C data are not available for several populations, we have not accounted for HLA-C (20). A threshold of 0.75 for the [...] have more targets for the S than for the N protein (3). This same report indicates that several immunogenic HLA-I restricted peptides are derived from the S protein, in line with our findings that S peptide coverage has a positive impact on COVID-19 outcome. These results suggest that CD8-mediated responses against S-derived peptides could be one of the mechanisms associated with more favorable COVID-19 outcomes.

Features of the SARS-CoV-2 S protein are responsible for unique characteristics of this new virus.
A recent report indicated differences in the protein S sequence regarding the receptor binding motif and a four amino acid insertion when compared with bat and previous strains (19). We found unique SB peptides falling within the receptor binding motif sequence, as well as SB peptides including the 4 amino acid insertion in the S1/S2 cleavage site. This sequence is considered to be related to the high virulence of SARS-CoV-2, and HLA-I restricted peptides derived from this region could help mount differential immune responses between SARS-CoV-2 and other coronaviruses. On the other hand, the lack of potential peptides derived from the GIVNNTVYDPLQPELDSFKEELDKYFKN region is somewhat intriguing, and exploring the functional features of this segment and the evolutionary forces that could have shaped its low immunogenicity may shed light on the relevance of these findings.

The scarcity of N protein-derived immunogenic HLA-I restricted peptides reported by Grifoni and colleagues (2020) resembles the low proportion of N protein-derived peptides found in our analysis. This reinforces the concept that N-derived peptides may not be involved in functional and productive anti-SARS-CoV-2 responses, since our correlation analyses indicate that populations with greater N coverage have higher numbers of deaths per million inhabitants. Protein N is highly expressed and has a high immunogenic potential, so that antibodies against it are the first found in the serum of infected patients (22).
Since the homology between the N proteins of SARS-CoV-2 and SARS-CoV is over 90% (13) and the SARS-CoV-2 peak viral load occurs earlier (23), the immunogenicity of SARS-CoV-2 protein N is believed to be similar to that of SARS-CoV, with an even faster response. However, when investigating T lymphocytes from convalescent patients, a lower frequency of targets associated with protein N was observed (3), indicating that these peptides are either less immunogenic or could lead to the depletion of reacting cells by some T cell hyperactivation mechanism. If this is the case, N-derived peptide-based hyperactivation could be related to the cytokine storm, a process largely associated with the most severe cases of COVID-19 (4,24,25). The number of deaths per million inhabitants showed a significant positive correlation only with N coverage, reinforcing the possibility that N peptides lead to a more aggressive disease. This deregulation of the immune system has been shown to occur in other viral infections and could be dependent on CD8 T lymphocytes (26).

Finally, we highlighted some important viral regions predicted to be presented by a vast number of HLA alleles, which can help in understanding the response to viral peptides and in peptide vaccine design. More importantly, we suggest that peptide-based vaccination strategies should rely mainly on the S protein due to its negative correlation with the number of deaths per million inhabitants and its role in the infection. On the other hand, these strategies must avoid using the N protein, since it was the only region showing a positive correlation with the number of deaths per million. Last, we identified potential antigens derived from the 4 amino acid insertion of SARS-CoV-2 that is absent from previous strains and could serve as a guide for SARS-CoV-2-specific vaccine development.
The epidemiological data used in this analysis (number of total cases, total deaths, total recovered, serious/critical cases, total cases per 1 million inhabitants and total deaths per 1 million inhabitants) were retrieved from the Worldometer database (www.worldometers.info/coronavirus, accessed at 01:18 PM, May 17, 2020). This database collects data from official reports, directly from governments' communication channels or indirectly, through local media sources when deemed reliable. The 50 countries with the highest deaths per million inhabitants and more than 1 thousand total cases were evaluated.

Human Leukocyte Antigen (HLA) I alleles for the A and B genes with at least 4 digits and their frequencies in the selected populations were retrieved from the Allele Frequency Net Database (27) between 23/04/2020 and 07/05/2020. Larger datasets that were not part of anthropological studies or minority ethnic groups were preferred (Table S1). The allele frequencies of the most frequent HLA alleles were added up to reach a minimum coverage of 0.75 for each population. Allele frequencies with resolution greater than 4 digits were combined, e.g. for BRA: B*07:02:01 and B*07:02:03 became B*07:02.

[...] values for "Quantitative.measurement" and entries with ">", "≤" or "<" in "Measurement.Inequality" were excluded. Also, we retrieved T cell assays using the same parameters except for the positive assays only.

The translated proteome from SARS- [...]
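The allele-frequency preparation described in the methods above (combining higher-resolution alleles into their 4-digit form, then adding up the most frequent alleles until a coverage of 0.75 is reached) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are ours, the frequencies are hypothetical, and applying the cumulative threshold to one gene at a time is an assumption, since the text does not say whether the 0.75 coverage was computed per gene.

```python
from collections import defaultdict


def collapse_to_two_fields(allele_freqs):
    """Sum frequencies of alleles sharing the same 4-digit (two-field) name."""
    collapsed = defaultdict(float)
    for allele, freq in allele_freqs.items():
        gene, fields = allele.split("*")
        # Keep only the first two colon-separated fields, e.g. 07:02:01 -> 07:02.
        two_field = gene + "*" + ":".join(fields.split(":")[:2])
        collapsed[two_field] += freq
    return dict(collapsed)


def most_frequent_until(allele_freqs, min_coverage=0.75):
    """Pick alleles in decreasing frequency until min_coverage is reached."""
    selected, total = [], 0.0
    ranked = sorted(allele_freqs.items(), key=lambda kv: kv[1], reverse=True)
    for allele, freq in ranked:
        if total >= min_coverage:
            break
        selected.append(allele)
        total += freq
    return selected, total


# Hypothetical HLA-B frequencies for one population.
raw = {"B*07:02:01": 0.30, "B*07:02:03": 0.10, "B*35:01": 0.25,
       "B*44:03": 0.20, "B*15:01": 0.15}
collapsed = collapse_to_two_fields(raw)
selected, total = most_frequent_until(collapsed)
```

In this example the two B*07:02 subtypes are merged before ranking, so the merged allele becomes the most frequent one and three alleles suffice to pass the threshold.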
Pearson correlation between the S/N AUC ratio and deaths per million inhabitants. Colors represent the countries inside the confidence interval with a higher (red) or lower (blue) number of deaths. Names are shown for those that are discussed in the text.

[...] Binding Domain; S1/S2, S1/S2 cleavage site; S2', S2' cleavage site.

Table 1. Strong binder peptides including some of the 4 amino acid insertion in protein S, which are highlighted in bold.

Update (Live): Cases and Deaths from COVID-19 Virus
Allele frequency net: a database and online repository for immune gene frequencies in worldwide populations
Correlation between universal BCG vaccination policy and reduced morbidity and mortality for COVID-19: an epidemiological study. medRxiv

There is no funding associated with this study.