key: cord-0770517-vc62b5qi
authors: Barquera, Rodrigo; Collen, Evelyn; Di, Da; Buhler, Stéphane; Teixeira, João; Llamas, Bastien; Nunes, José M.; Sanchez‐Mazas, Alicia
title: Binding affinities of 438 HLA proteins to complete proteomes of seven pandemic viruses and distributions of strongest and weakest HLA peptide binders in populations worldwide
date: 2020-06-11
journal: HLA
DOI: 10.1111/tan.13956
sha: 3a2920c65fe606dd2726bc14ccb728e2b710e562
doc_id: 770517
cord_uid: vc62b5qi

We report detailed peptide‐binding affinities between 438 HLA Class I and Class II proteins and complete proteomes of seven pandemic human viruses, including coronaviruses, influenza viruses and HIV‐1. We contrast these affinities with HLA allele frequencies across hundreds of human populations worldwide. Statistical modelling shows that peptide‐binding affinities classified into four distinct categories depend on the HLA locus but that the type of virus is only a weak predictor, except in the case of HIV‐1. Among the strong HLA binders (IC(50) ≤ 50), we uncovered 16 alleles (the top ones being A*02:02, B*15:03 and DRB1*01:02) binding more than 1% of peptides derived from all viruses, 9 (top ones including HLA‐A*68:01, B*15:25, C*03:02 and DRB1*07:01) binding all viruses except HIV‐1, and 15 (top ones A*02:01 and C*14:02) only binding coronaviruses. The frequencies of strongest and weakest HLA peptide binders differ significantly among populations from different geographic regions. In particular, Indigenous peoples of America show both higher frequencies of strongest and lower frequencies of weakest HLA binders. As many HLA proteins are found to be strong binders of peptides derived from distinct viral families, and are hence promiscuous (or generalist), we discuss this result in relation to possible signatures of natural selection on HLA promiscuous alleles due to past pathogenic infections. Our findings are highly relevant for both evolutionary genetics and the development of vaccine therapies. However they should not lead to forget that individual resistance and vulnerability to diseases go beyond the sole HLA allelic affinity and depend on multiple, complex and often unknown biological, environmental and other variables.

The pandemic of the new severe acute respiratory syndrome coronavirus SARS-Cov-2 emerged in East Asia at the end of 2019 and spread across the world in a couple of months, totalizing more than 6.5 million confirmed cases and almost 400 000 deaths as of 5 June 2020 (https://covid19.who.int/). In this context, it has become crucial to get a better understanding of the mechanisms that govern our immune defences against SARS-Cov-2, a highly contagious and dangerous pathogen. The HLA classical molecules play a crucial role in our adaptive immunity 1-3 by presenting small pathogen-derived peptides at the surface of infected cells (in addition to selfpeptides that are continuously displayed at the cell surface). The HLA-peptide complex is then recognised by CD8+ or CD4+ T lymphocytes (a mechanism called Tcell restriction), which triggers an immune response. Pathogenic peptides are bound to a specific peptide-binding region (PBR), which forms a beta-pleated sheet floor bordered by two α-helices at the extracellular distal end of the HLA proteins, and is characterised by a very high level of amino acid variation due to the huge polymorphism of the DNA exons that encode this part of the molecule, that is, exons 2 and 3 for HLA Class I molecules restricted by CD8+ cytotoxic T lymphocytes (CTLs), and exon 2 for HLA Class II molecules restricted by CD4+ helper T-cells. Actually, both Class I and Class II molecules are encoded by several genes, the genomic variation of which represents altogether several thousands of different HLA alleles most often differing from each other at many single nucleotide sites (SNPs). 4, 5 Because of this remarkable genetic variation, which is unique in the human genome and thought to represent signatures of long-term balancing selection maintaining advantageous functional diversity, [6] [7] [8] [9] the molecules encoded by different HLA alleles display distinct physicochemical properties that motivated tentative alleles classification into supertypes. [10] [11] [12] These properties determine unequal levels of affinity to different pathogenic peptides and make them present such peptides efficiently or not. The HLA genetic profile of an individual may thus partly influence the strength of the immune response to an invading pathogen because the encoded HLA molecules may exhibit distinct peptide-binding properties. Moreover, as HLA alleles exhibit variable regional frequencies worldwide, 8, 9, 13 the proportion of HLA molecules displaying different peptide affinities for a given pathogen may also vary between populations. To address this issue, it is not only necessary to understand putative differences between populations in terms of immune protection, but also to have a better functional characterisation of the whole HLA polymorphism spectrum for the benefit of future vaccine developments.

Recently developed computational tools that integrate data from in vitro or mass spectrometry assays allow the prediction of peptide-binding affinities of HLA molecules, as reviewed in. 14 Such methods are mostly used to identify viral epitopes that could be considered as good candidates for peptide-based vaccines, for example, against HIV-1, 15 Ebola virus 16 and SARS-CoV-2. 17, 18 In addition to epitope identification, HLA peptide-binding predictions may be useful for population and evolutionary genetics research to understand the behaviour of specific HLA alleles in pathogen-rich environments and investigate whether such alleles might be submitted to pathogen-driven selective pressures in human evolution. [19] [20] [21] In this context, the analysis of infectious agents belonging to distinct families is expected to bring significant working hypotheses.

In this study, we used a bioinformatic approach to characterise binding affinities between 438 HLA proteins (311 Class I and 127 Class II) and the full set of 9-mer (for Class I) and 13-mer (for Class II) peptides than can be derived from the complete SARS-CoV-2 proteome. We then explored the global allele frequency distributions of the strongest and weakest HLA binders of these viral peptides through statistical modelling to identify putative differences among populations. We performed the same analyses and compared the results with SARS-CoV-2 for six other viruses: SARS-CoV-1 and MERS-CoV, which belong to the same beta-coronavirus family as SARS-CoV-2; H1N1, H3N2 and H7N9, which represent three different influenza A virus subtypes also responsible for a highly contagious respiratory illness (flu); and the lentivirus HIV-1 of the acquired immune deficiency syndrome (AIDS).

Our results showed significant differences among Class I and Class II HLA molecules in their capacity to present SARS-CoV-2 peptides at distinct affinities levels (strong, regular, weak and non-binder), a greater proportion of strongest binders being found among HLA-A proteins. However, the binding affinity profiles predicted for SARS-CoV-2 are not unique as they are very similar to those predicted for all other viruses, to the exception of HIV-1. Most interestingly, the frequencies of strongest and weakest HLA binders differ among populations from different geographic regions. In particular, Indigenous Americans show unique peptide-binding patterns that might represent past signatures of selection acting on several promiscuous HLA alleles due to ancient pathogenic infections.

We used a large database of HLA allele frequencies in world populations (with alleles defined at the secondfield level of resolution, third and fourth-field levels being recoded to second-field) including data from both the literature (1992-2017) and reports of the 11th to 16th International HLA and Immunogenetics Workshops (IHIWs). For each of the different loci (HLA-A, -B, -C, -DRB1, -DQA1 and -DQB1), the dataset comprises between 158 and 374 typed samples, classified according to the hlanet.eu guidelines, 22 into 10 sub-continental regions, that is, Sub-Saharan Africa (SAF), North Africa (NAF), Europe (EUR), South-West Asia (SWA), North-East Asia (NEA), South-East Asia (SEA), Australia (AUS), Oceania (OCE), North America (NAM) and South America (SAM). The number of populations per locus and region and the detailed list of populations are provided in Tables S1 and S2. Note that to avoid terms with possible negative connotations, we will use the most generally accepted term Indigeneous peoples to name the descendants of the earliest known inhabitants of a region, hence Indigeneous Australians and Indigeneous Americans will replace the commonly used Australian Aborigines and Amerindians (and other trivial names), respectively.

All HLA-A, -B, -C, and -DRB1 alleles that were observed in at least five populations worldwide (according to our database of allele frequencies), that is 92 HLA-A, 164 HLA-B, 55 HLA-C and 94 HLA-DRB1 were selected to assess the peptide-binding affinity of their corresponding proteins HLA-A, HLA-B, HLA-C and HLA-DR, respectively, the latter representing the HLA-DRA/DRB1 dimer as HLA-DRA is here considered monomorphic. For HLA-DQA1 and -DQB1, we selected all possible allele combinations represented in the NetMHCIIpan 23 method, that is, 33 HLA-DQA1/DQB1 proteins, hereafter named HLA-DQ. Therefore, a total of 438 different HLA proteins were analysed (Table S3 ).

To assess the HLA-peptide-binding affinity predictions, we used the whole proteome of six respiratory viruses, including three coronaviruses important for public health (severe acute respiratory syndrome coronaviruses 1 [SARS-CoV-1] and 2 [SARS-CoV-2] and Middle East respiratory syndrome-related coronavirus [MERS-CoV]) and three Influenza A viruses with pandemic behaviour (Influenza A virus subtypes H1N1, H3N2 and H7N9, reported to have a high pandemic potential 24 ). We further included the human immunodeficiency virus type 1 (HIV-1) as an outlier for respiratory viruses to contrast our results. For each virus we used the following proteins and strains (all these correspond to complete proteomes of the corresponding viruses) 25 :

Replicase polyprotein 1ab of isolates BJ01, BJ02, BJ03, BJ04, CUHK-Su10, CUHK-W1, Frankfurt 1, GD01, GZ50, HKU-39849, HSR 1, Shanghai LY, Shanghai QXC, Sin2500, Sin2677, Sin2679, SZ16, SZ3, Taiwan,  Taiwan TC1, Taiwan TC2, Taiwan TC3, Tor2, TW1,  TWC, TWH, TWJ, TWK, TWS 

The translation of the complete genome of the isolate Wuhan-Hu-1 (as reported in the NCBI Reference Sequence: NC_045512.2).

The replicase polyprotein 1ab of isolate United Kingdom/ H123990006/2012 (UniprotKB: K9N7C7).

The hemagglutinin (HA) and neuraminidase (NA) of the strain A/Mexico/InDRE4114/2009 (UniprotKB: C5MQJ6 and C5MQL2, respectively), the nucleoprotein (NP) of strain A/New York/1682/2009 (UniprotKB: C5E522), the matrix protein (M1) of strain A/Nagano/RC1/2009 (UniprotKB: D4QF89), the Matrix protein 2 (M2) and the nuclear export protein (NEP) of strain A/USA:Albany/ 12/1951 (UniprotKB: A4U7A7 and A4U7B1, respectively), the non-structural protein 1 (NS) of strain A/ Hickox/1940 (UniprotKB: Q0HD54), the polymerase acidic protein (PA), the RNA-directed RNA polymerase (RDRP) and the polymerase basic protein 2 (PB2) of strain A/Puerto Rico/8/1934 (UniprotKB: P03433, P03431 and P03428, respectively) and the Protein PB1-F2 (PB1-F2) of strain A/USA:Phila/1935 (UniprotKB: A4GCM8). 

We predicted the peptide-binding affinity of each HLA protein to all possible overlapping 9-mer (for HLA Class I) and 13-mer (for HLA Class II) peptides (the most commonly bound by these proteins, respectively) derived from all viral proteins and strains listed above. The peptide-HLA-binding affinity predictions were run using the Immune Epitope Database (IEDB) and Analysis Resource virtual machine image. 26, 27 We used the prediction algorithm from NetMHCpan v. 4.0 28 for Class I alleles and NetMHCIIpan v. 3.2 23 for Class II alleles, since these methods include all alleles described in the Table S3 . We classified the binding predictions, or binding kind, as strong (IC 50 ≤ 50 nM), regular (50 nM < IC 50 ≤ 500 nM) and weak (500 nM < IC 50 ≤ 5000 nM) binders for Class I, and strong (IC 50 ≤ 50 nM), regular (50 nM < IC 50 ≤ 1000 nM) and weak (1000 nM < IC 50 ≤ 5000 nM) binders for Class II, following the recommendations by the authors. 26, 27 Any peptide-binding prediction affinity above 5000 nM was considered as a non-binder. We validated our results against those obtained using the ANN method 29 and the NN-align-2.3 (netMHCII-2.3) method 23 for smaller subsets of Class I and Class II alleles respectively. These methods yield prediction affinites with higher accuracy, 30 but were not used for this study as they only include a fraction of the alleles analysed (data not shown).

3.1 | HLA strongest and weakest binders of SARS-CoV-2 peptides in populations worldwide Allele frequencies of population samples were added and collapsed into the four binding kinds (strong, regular, weak and non-binder). The variation of these frequencies was graphed by locus and region to identify putative patterns. Statistical modelling was used to confirm and formalise the patterns identified. Linear modelling was used to obtain estimates of the associations between the regions and the loci for each of two extreme binding kinds retained, that is, strongest (strong binder for at least 100 SARS-CoV-2-derived peptides) and weakest (weak or non-binder for more than 99% of the total set of SARS-CoV-2 derived peptides). Potential heteroscedasticity issues due to uneven sample distributions among geographic regions were addressed using mixed models 31 and the results were consistent with those of the linear model. A single model including binding kind as a third predictor was considered and provided similar results but, because three-way interactions were necessary to report the model, we preferred splitting the data set according to binding kind to simplify the presentation of results.

In order to analyse the binding repertoires for all viruses, we recoded the absolute counts of bound peptides into proportions to obtain comparable quantities. Strongest binders were thus defined as strong binders for at least 1% of the total set of peptides per virus, and weakest binders (as was performed for SARS-CoV-2 alone) as weak or non-binders for 99% (or greater) of them. Patterns were sought through graphical representations and formalised by means of linear modelling. Issues with heteroscedasticity were handled by rank transforming the proportions. The model was further confirmed using robust regression, a procedure that iteratively reweighted the observations in inverse proportion of its residuals, 32 to tame the impact of outliers.

All the reported statistical analyses were performed using R version 3.4.4 33 in a ×86_64-pc-linux-gnu (64-bit) platform. The 438 HLA molecules analysed in this study bind different numbers of SARS-CoV-2 peptides with each of the four kinds of binding affinities (strong, regular, weak or non-binding) (Data S1), with the proportions of bound peptides also varying among loci (Table 1 and Figure 1 ). The average proportion of SARS-CoV-2 peptides predicted to bind HLA molecules with strong affinity is below 1% (varying between 0.01% for HLA-DQ and 0.7% for HLA-A). The average proportion of peptides that bind with either regular or weak affinity is also low for Class I molecules (<2% and <8%, respectively) but substantially higher (6%-21% and 38%-40%, respectively) and with a much larger variance (eg, 0.03%-54.4% and 2.8%-55.3%, respectively, for HLA-DR) for Class II. The vast majority of peptides (at least >74%, and on average >90%) do not bind HLA Class I molecules, whereas larger variances are again observed for HLA Class II (eg, 9.6%-97.2% for HLA-DR).

Among HLA Class I proteins, only one HLA-A molecule (1.1%) is never classified as a strong binder (# of bound peptides = 0) and as many as 17 molecules (18.5%) are strong binders for more than 100 peptides ("strongest" binders, see below), while these proportions are reversed for HLA-B and HLA-C (18.3% and 20% of never strong binders and 3% and 0% of strongest binders, respectively) ( Table 2 ). For HLA Class II, almost half (47.9%) of HLA-DR and as many as 88.9% of HLA-DQ proteins are never strong binders and the proportions of strongest binders is moderate for HLA-DR (6.4%) and null for HLA-DQ.

Very few HLA molecules are never regular binders (the highest proportion, 9.1%, being observed at HLA-C). However, a greater proportion of HLA-A molecules (62%) are often regular binders compared to HLA-B (25%) and HLA-C (49.1%) although the great majority of regular binders are found among Class II molecules (97.9% of HLA-DR and 73.5% of HLA-DQ).

Each HLA molecule binds weakly or does not bind at least one peptide (the number of peptides is never 0 in these categories). Most HLA Class II (>97%) and a large proportion of HLA-C (56.4%) bind weakly more than 500 peptides, compared to HLA-A (31.5%) and HLA-B (26.2%). However, HLA-B displays the greatest proportion of proteins (57.3%) that bind weakly or do not bind the main bulk (>99%) of SARS-CoV-2 peptides, followed by HLA-C (38.2%), HLA-A (22.8%), HLA-DQ (20.6%) and HLA-DR (2.1%).

Overall, HLA-A proteins appear to be better binders of SARS-CoV-2 peptides than the other HLA Class I Note: The complete list of alleles with the number of peptides bound at different affinity levels is given in Data S1. and weakest binders. Strongest binders were those predicted to bind at least 100 viral peptides with strong affinity and weakest binders were those predicted to bind weakly or not at all to more than 99% of viral peptides. A total of 28 HLA were classified as strongest (Table 3) and 144 as weakest (Table 4 ) according to these criteria.

Among the strongest HLA-A binders, A*02:11 and A*02:22 are particularly successful as they bind more than 200 peptides with high affinity and are also weak or non-binders for the lowest proportion of peptides (<93% 

HLA-C proteins display weaker binding properties compared to HLA-A and -B, as none of them bind more than 100 peptides with high affinity (HLA-C*03:02 is the top strongest binder with 99 peptides). The weakest binders are C*01:03, C*07:04, C*07:11, C*18:01, C*18:02 and C*04:04, all of which either bind weakly or do not bind 100% of peptides; they are also never classified as either strong or regular, except in one case for C*04:04. C*18:01 shows moderate frequencies (rarely above 10%) in a few sub-Saharan African populations and C*04:04 reaches 20% in a single Sioux population from North America.

Among HLA-DR proteins, DRB1*01:01 is strong binder for as many as 719 peptides, followed by DRB1 

We developed an interactive tool (https://hla-net.eu/sarscov-2/) to visualise the population frequencies of HLA alleles in relation to the ability of their corresponding proteins to bind SARS-CoV-2 peptides at different affinity levels. This tool was built using R Shiny Package (version 1.4.0) and runs on the hla.net.eu server maintained at the Anthropology Unit of the University of Geneva. It allows one to select one or more HLA alleles per locus, per geographic region and per kind of binding (strong, regular, weak or non-binder), and in each case a continuous slider allows choosing a cut-off for the number of viral peptides bound (or not bound) to the corresponding molecules (default value 10% of peptides per locus). Three outputs are provided for each set of selected alleles: a global map (two for HLA-DQ, that is, for DQA1 and DQB1, respectively) showing their frequencies in all populations in the form of pie charts; box plots showing the frequencies of these alleles in each of the 10 geographic regions; and a table providing information on all population samples used in the study including detailed allele frequencies. This tool has been implemented in the -net.eu bioinformatic platform (http://hla-net.eu) first developed within the scope of the EU-funded HLA-NET BM0803 Action. 22, 36 We plotted the cumulative frequency distributions of the strongest (red dots in Figure 2 ) and weakest (blue dots in Figure 2 ) binders in each population at each locus, except HLA-DQ, which is not represented because it involves two polymorphic loci and no such joint frequencies were available (as we do not have information on populations' genotypes, we do not know the frequencies of DQ heterodimers, this is why we could not report DQ results in relation to population frequencies). This revealed notable differences both among the loci and geographical regions (Figure 2 HLA Class I molecules are mostly involved in the presentation of viral peptides and CD8+ CTL restriction, whereas HLA Class II molecules present antigenic peptides to CD4+ T-helper cells, which triggers differentiation of antibody-producing B cells. For that reason, we also plotted the averaged cumulative frequencies of HLA Class I (A + B) strongest and weakest binders separately from those of HLA Class II (DRB1) for the same subset of 124 populations (7 SAF, 6 NAF, 26 EUR, 7 SWA, 16 NEA, 17 SEA, 5 AUS, 25 OCE, 10 NAM and 5 SAM, respectively) tested at these three loci (Figure 2 bottom) . On average, strongest binders are less frequent than weakest binders for A + B, although weakest binders' frequencies sometimes show larger variances. All Indigenous Americans again display the highest frequencies of strongest and the lowest frequencies of weakest binders. The plot of HLA Class II (DRB1) frequencies clearly distinguishes sub-Saharan Africa, which displays the highest frequencies of weakest binders, and contrasts SAF, NAF, EUR, SWA, and NEA from SEA, AUS, OCE, NAM, SAM regions due to higher frequencies of strongest binders in the former.

We tested simultaneously the effects of several parameters, that is, HLA locus (HLA-A, We tried many simplifications (either automatic, via stepwise regression or handmade) of the complete maximal model (the model including all variables and their interactions) by grouping some regions together, but the resulting models were significantly worse. As our initial model presented some heteroscedasticity, not unexpected given the uneven number of samples per region, we resorted to mixed models using the samples as a random effect. The complete maximal mixed model could not be simplified without significant loss and the relative magnitudes of almost all the coefficients remained the same. We thus concluded that the structure presented by the data was relevant as the (linear) model retained explains 85% and 95% of the total variance of the frequency of strongest and weakest binders, respectively (Table 5) .

Both kinds of binding show common patterns of significant differences between Locus A (taken as reference) and Locus B (P < .01) but not Locus DR (P < .05 only for weakest). Region SAF (taken as reference) is significantly different from AUS (P < .05 for strongest binders and P < .01 for weakest), OCE (P < .01), NAM (P < .01) and SAM (P < .01), with particularly high frequency increases of strongest binders (>30%) in NAM and SAM and marked frequency decreases of weakest binders (>11%) in AUS, NAM and SAM. Region EUR shows a 10.9% significant increase (P < .01) in the frequency of strongest binders compared with SAF, while SWA and NEA show marginally significant differences (P < .1) and only a 5% increase, while for weakest binders no significant differences are observed for these regions.

The pattern of significant interactions is split, with opposite significance for strongest and weakest binders, to the exceptions of LocusDR:RegionNAF, LocusDR: RegionEUR, LocusB:RegionSWA, LocusDR:RegionSWA, LocusB:RegionNEA, LocusB:RegionSEA and LocusB: RegionOCE that present similar patterns for strongest and weakest binders.

According to the retained models for both kinds of binding affinities, allele frequencies of strongest and weakest HLA SARS-CoV-2 peptide binders thus depend both on the HLA locus and the geographic region, although not in an additive way, therefore explaining the numerous interactions that appear as statistically significant.

Using the same methods and set of alleles as was done for SARS-CoV-2, we performed peptide-binding predictions for peptides derived from SARS-CoV-1, MERS-CoV, H1N1, H3N2, H7N9 and HIV-1 (Data S2-S7). Overall, the patterns displaying the percentages of bound peptides are very similar for the seven viruses (Figure 3 ), but we also note relevant differences between the three viral families (coronaviruses, influenza viruses and the immunodeficiency virus). Among strong binders, the three coronaviruses bind a greater range of peptides than the three influenza, and the range of bound peptides is lowest for HIV-1. Regular binders show analogous ; the second column shows asterisks indicating the significance level of a test for the coefficient being zero (no effect); and the third column presents in parentheses the values of the standard errors associated with the coefficients. *P < .1; **P < .05; ***P < .01. differences among the virus families although with a greater contrast for HIV-1 at HLA-DR. The ranges observed for non-binders are also globally slightly reduced for the three influenza viruses compared with coronaviruses and for HIV-1 compared with the other two viral families.

We then looked at the classification of HLA proteins as strongest and weakest binders for each virus. In order to make the data comparable among viruses that do not display the same proteome lengths, we took a minimal threshold of 1% of peptides bound with high affinity (instead of an absolute value of 100 used before for the SARS-CoV-2 analyses) to classify HLA molecules as strongest binders. The criterion to define weakest binders remained the same as was used for the SARS-CoV-2 analysis (ie, weak or non-binder for more than 99% of viral peptides).

Among the total set of 65 HLA molecules predicted to be strongest binders for at least one virus, 16 (Table S4) . Also, among the 187 HLA molecules found to be the weakest binders for at least one virus, 121 were the weakest binders for all viruses, 25 only for HIV-1 and the remaining 41 for other combinations.

The majority of HLA proteins are thus not specific binders of SARS-CoV-2 or even coronavirus peptides but are generalist binders for viral pathogens of different families. We did not identify any strongest binder for HIV-1 alone at this threshold. In addition, a significant number (25) of the weakest binders are HIV-1specific, although the majority (121) is weakest for all viruses (Table S4) We tried many simplifications (either automatic, via stepwise regression, or handmade) of the complete maximal model (ie, the model including all variables and their interactions) by grouping together some viruses or kinds of binding, but the resulting models were significantly worse. As our initial model presented heteroscedasticity, we restarted the modelling using a non-parametric approach by replacing the proportion of bound peptides with their ranks. The model could not be simplified without significant loss. In addition, to further assess the model and reduce the effects of outliers, we used robust regression and again the maximal complete model could not be simplified, with the relative magnitudes of almost all the coefficients remaining the same. We thus concluded that the structure presented by the data was relevant as the retained model explained 90% of the total variance.

According to the retained model, both the kind of binding and the HLA locus and their interactions are highly significant (Table 6 ). This contrasts with a weak effect due to the virus (null for coronaviruses and with moderate ranks and significances for influenza viruses), except for HIV-1, which shows much higher ranks as well as strong and highly significant interactions with all kinds of bindings.

In this study, we considered a total set of 438 Class I and Class II proteins differing from each other by the amino acid sequence of their PBR. We have identified which Note: The dependent variable is the rank of the proportion of bound peptides. The left column (terms) lists all the independent variables and their interactions. For the retained model, the first column displays the coefficients of the model, that is, the differences in average ranks between the group defined by each term and the reference (Locus: A; Virus: cov2; Kind: regular, grouped on the constant term); the second column shows asterisks indicating the significance level of a test for the coefficient being zero (no effect); and the third column presents in parentheses the values of the standard errors associated with the coefficients. *P < .1; **P < .05; ***P < .01.

HLA molecules are predicted to bind all possible 9-mer (for Class I) and 13-mer (for Class II) peptides (> 7000) derived from the complete SARS-CoV-2 proteome, and we have classified them according to the proportions of peptides that they are expected to bind with different kinds of affinity (IC 50 ), i.e. strong, regular, weak or nonbinding. We have also explored the global distributions of the strongest and weakest HLA binders by using a large dataset of HLA frequencies estimated in 158-374 populations (depending on the locus) from 10 geographic regions worldwide and by using statistical modelling to detect possible patterns. We then complemented these analyses by using the complete proteomes of six additional viruses, two of them belonging to the same coronavirus family (SARS-CoV-1 and MERS-CoV), three of them being involved in another, very common, respiratory disease, that is, flu (H1N1, H3N2 and H7N9) , and the last one being the main causal pathogen of AIDS (HIV-1). We have finally compared the results obtained for the seven viruses to identify possible similarities or differences in the ability of HLA Class I and Class II proteins to present their derived peptides, and in the worldwide distribution of their strongest and weakest binders. To our knowledge, this is the first study providing a comprehensive analysis of HLA peptide-binding predictions for such a large set of highly infectious and (potentially) pandemic viruses in relation to such an extensive database of HLA-typed population samples. We are also fully confident that our results differ from what we would expect by chance, as they were fully replicated by using two independent algorithms to run the predictions (as mentioned in Material & Methods) and by running independent analyses on multiple viruses for which we found similar results within each viral family.

Our first observation is that HLA molecules, independent of the locus, are predicted to bind a limited proportion of all possible SARS-CoV-2 derived peptides with high affinity (on average 0.01% for HLA-DQ to 0.7% for HLA-A). The large majority of them (on average > 90%) do not bind Class I molecules, whereas more even proportions of regular (6.1%-21.3%), weak (38.1%-40.1%) and nonbinders (38.1-55.7%) are found among Class II proteins. Of course, we do not know, in reality, how many viral peptides may trigger an immune response among the total set of theoretical ones that we have derived in silico from the SARS-CoV-2 proteome (and further on from that of the other viruses). Nevertheless, we can confidently expect a lower number and the proportions that we have found may thus actually be much higher. Also, because we chose a very low IC 50 (≤50) and thus a very high affinity threshold to characterise peptide bindings as strong, we expect that peptide presentations by the HLA molecules that we have classified as strongest binders (IC 50 ≤ 50 for many peptides) are able to trigger efficient CD8+ and/or CD4+ immune responses. In support to our hypothesis, bioinformatic predictions combined to in vitro experimental testing and in vivo immunogenicity testing in HLA transgenic mice showed that Class I alleles displaying a higher number of predicted binders with higher-binding affinities are associated with higher magnitude of T-cell responses. 37 Peptide-binding predictions for HLA Class II molecules are also highly relevant to explore potential responses to viral infections such as SARS-CoV-2, not only in view of the crucial role of CD4+ T-helper cells in CD8+ T cell differentiations and in the production of neutralising antibodies, but also because of increasing evidence that CD4 + cytotoxic T lymphocytes may act in concert with CD8 + CTLs during viral infections thanks to a dual recognition of peptides through HLA Class I and II. 38 We thus believe that the inclusion, in our study, of both Class I and Class II peptide-binding predictions brings crucial information for the development of peptide-based vaccines, 12 although immunogenicity would need to be validated experimentally. 17, 18 Interestingly, different proportions of HLA strongest binders were seen among the loci that we analysed (up to 18.5% of HLA-A but only 6.4% of HLA-DRB1, 3% of HLA-B and no HLA-C nor HLA-DQ molecules), and other differences were found for regular and weak binders. The contrasts observed among the HLA loci may be related, at least in part, to the diverse functions that their proteins assume for immunity. First, the greater proportion of HLA Class I strongest binders may be explained by the more decisive role of these molecules in viral infections although, as stated above, Class II molecules are also essential in particular to the development of sustained, long-term humoral responses that may play a vital role in terms of vaccination and herd immunity. In addition, the major difference observed among the three Class I loci is in line with both the greater promiscuity of HLA-A proteins in peptide binding 39 (see also below), which explains why more HLA-A proteins present large numbers of peptides than HLA-B, and the greater involvement of HLA-C in KIR interactions, 40 which suggests that the peptide-binding function of HLA-C molecules could be less efficient compared with that of HLA-A and HLA-B 41 or fine-tuned differently to also accommodate peptide selectivity by KIR molecules on NK cells. 42 The strength of the immune function is also influenced by the expression levels of HLA molecules 43 -which is affected by many factors 44 -and may explain why HLA-C molecules, the abundance of which are highly variable at the cell surface, 45 here exhibited worse peptide-binding affinities.

Besides these locus-specific effects, a relevant observation of our study is that the HLA-binding patterns that we predicted for SARS-CoV-2 peptides are not unique to this virus. Indeed, we found almost identical peptidebinding patterns ( Figure 3 ) and many common HLA strongest binders (Table S4) for the other two coronaviruses SARS-CoV-1 and MERS-CoV, which could be explained by their (relatively) high level of genome-wide sequence identity (about 79% and 50%, respectively 46 ) with SARS-CoV-2. The three influenza viruses H1N1, H3N2 and H7N9 behave somewhat differently, showing slightly lower percentages of strong or regular bindings to HLA and by sharing fewer strongest binders (although still a substantial number). Our statistical model also revealed that, overall, the variety of respiratory virus (ie, coronaviruses or influenza) has little effect on the HLA peptide-binding patterns (according to Table 6 , no statistical significance is ever observed for coronaviruses, and heterogeneous significances for influenza viruses).

By contrast, the patterns observed for HIV-1 reveal that a lesser proportion of peptides derived from this non-respiratory virus binds HLA molecules with either strong or regular affinity (the difference being particularly pronounced for regular bindings), which is highly significant according to our statistical model (Table 6) . Also, although 16 HLA proteins are found to be strongest binders for all viruses including HIV-1 (Table S4) , this virus stands out by showing the greatest proportion of weakest binders (of 187 weakest binders, 154 are shared with others viruses and 25 are unique to HIV-1). Overall, these results suggest that adaptive immune responses driven by HLA are less efficient towards HIV-1 than towards respiratory viruses. In the same way, HLA proteins that are usually considered as conferring protection against HIV-1, that is, B*57:01, B*57:02, B*57:03, B*58:01, B*27:05 and B*27:02, 47 bind between 0 (B*27:02) and 16 (B*58:01) HIV-1 derived peptides (ie, 0%-0.6%) with high affinity, which is quite low compared with 48 peptides (1.7%) presented by the strongest HLA binder found for HIV-1, B*15:03 (which is actually the strongest HLA-B binder for all viruses). On the other hand, our definition of strongest binders relies on two different estimates considered simultaneously, that is, a strong affinity (IC 50 ≤ 50) and a large number of peptides bound, which prevents us from identifying more specialist alleles that would bind very few but key viral peptides with strong affinity, as might be the case for some of the alleles listed above. Moreover, another limitation of our study is that we may have missed some strong or regular HLA binders of peptides having different lengths than those that we used for our predictions. Indeed, while most HLA Class I ligands are 9-mer peptides, their lengths typically vary between 8 and 12 amino acids in relation to different HLA allele clusters (eg, A*01:01 and A*03:01 often present longer peptides), 48 and slightly shorter or longer peptides may sometimes display better affinities. This is the case for the 11-mer KAFSPEVIPMF epitope derived from the p24 capsid Gag HIV-1 protein ("KF11" p24 Gag 162-172), 49 which binds HLA-B*57 molecules with much better stability than shorter peptides 50 (see also could thus be attributed to a very specific affinity to a few conserved peptides (likely of different lengths than those that we tested), rather than a large affinity to many diverse regions of the viral proteome. This is supported by the idea that many (but not all) HLA-B proteins would be more fastidious (ie, specific) whereas many (but not all) HLA-A would be more promiscuous (ie, generalist) at presenting pathogenic peptides. 39 This agrees with our result that HLA-A (mostly A*02, which can be considered as highly generalist) molecules form a majority representation among the HLA Class I strongest binders shared by the seven viruses that we have analysed. As a consequence of the promiscuous peptide-binding behaviour of many HLA proteins that we disclose in the present study, some alleles that have been claimed as strongest and weakest binders of SARS-CoV-2 so far 51 are not unique to this virus. This is the case, for example, for HLA-B*15:03 and B*46:01 (the latter having previously been considered to confer susceptibility to SARS-CoV-1 disease by comparing severe cases to controls, 52 as recently reviewed 53 ), which are in our top list of strongest and weakest binders, respectively, for SARS-CoV-2 (in agreement with Reference 51 ), but also for the other six viruses that we have analysed. Therefore, we propose that these alleles do not confer specific protection or vulnerability to SARS, as recently suggested, 51 but more widely to different diseases. However, it is important to stress that weakest binders defined by the current work might still act as regular or strongest binders in the context of infections by other viruses not tested in this study or by other kinds of pathogens (ie, bacteria, fungi or parasites). Furthermore, weakest binders could also play a crucial role by providing more specific but significant advantages to their carriers against new virulent strains appearing in a population.

Two unexpected results also emerged from our study regarding the global distribution of strongest and weakest HLA binders to SARS-CoV-2 peptides in human populations. The first one is the opposite pattern observed for the two loci HLA-A and -B. Indeed, the cumulative frequency of strongest binders is higher for HLA-A and lower for HLA-B in most regions of the world, while the reverse is observed for weakest binders (Figure 2 ). The fact that HLA-B is more polymorphic than HLA-A 4 (164 HLA-B and 92 HLA-A alleles defined at the second field level of resolution were considered in this study) probably explains why the cumulative frequencies of weakest binders are much greater for HLA-B. However, this explanation does not hold for strong binders. Actually, the high cumulative frequencies of HLA-A strongest binders are principally due (but not only, see below) to HLA-A*02:01, an allele which is frequent almost everywhere in the world, whereas most of the strongest HLA-B binders are rare.

The second, and probably the most remarkable result, is the dual observation of particularly high and low cumulative frequencies of, respectively, strongest and weakest HLA binders in Indigenous populations from North and South America. These two independent patterns were highly significant ( Table 5) and not observed in any other geographic region (Figure 2 , see loci A + B combined). Among the strongest binders, A*02:01 is common in most regions of the world but reaches especially high frequencies (sometimes up to 50%) in Indigenous Americans and is classified as strongest binder for the three coronaviruses analysed in this study (Table S4) ; A*02:06, the strongest binder for all seven viruses, is rare globally, slightly more common in North-East Asia and sometimes very frequent in America where it reaches 20%-30% in some Mexican populations; A*68:01 is rarely above 5% globally but reaches frequencies of about 15%-20% in Indigenous populations from North America (Mixtec and Seri), and is strongest binder for all viruses except HIV-1; A*02:22, also strongest binder for all viruses, is virtually absent or very rare in the world except in some Indigenous populations from Venezuela (Bari, 6.5%) and Brazil (Terena, 15%); A*24:03, strongest binder for all coronaviruses, is another rare allele that is observed at 10% to 11% frequency in Brazil and Argentina. Other strongest binders are also found in other regions (eg, A*02:03, reaching 17% and B*15:25, reaching 15%-40% in Yami-in South-East Asia; A*02:11, reaching 9%-16% in India; or B*15:03, reaching 11% in sub-Saharan Africa) but the cumulative frequencies of strongest binders in these populations (except Yami) are always lower than in Indigenous Americans.

We reported many strongest HLA binders that are at high frequencies in multiple Indigenous American populations that are not necessarily close geographically nor related to each other. This is in contrast to other regions of the world where populations underwent similar strong bottlenecks and/or rapid genetic drift, such as in Taiwan, Australia and Oceania. Therefore, the patterns observed in the Americas might be insufficiently explained by demography alone. Remarkably, weakest HLA binders are also less frequent in Indigenous Americans (as opposed to other populations where frequencies for both strongest and weakest binders overlap), which again represents an independent result that might not be easily explained by demography. Instead, it seems plausible that strongest binders were positively selected (eg, through soft selective sweep) from the standing genetic variation, 21,54-56 by conferring protective effects against some (undefined) pathogens, although the formal testing necessary to confirm our hypothesis is beyond the scope of this study. A possible explanation is the European colonisation of the Americas five centuries ago, as it introduced new infectious diseases (eg, smallpox 57 ), which many historical records claim to have been a key factor in the decimation of Indigenous American populations. Here, as the great majority of strongest HLA binders that we have identified are not specific to a given virus among the seven that we have compared (many of them are even strongest binders for all these viruses, including HIV-1), the frequency patterns that we observe today in Indigenous Americans might be the result of selective pressures increasing the frequencies of promiscuous strong HLA binders (such as HLA-A*02:01) and decreasing the frequencies of weak HLA binders already present in these populations. We note that the HLA region harbours the highest levels of advantageous genetic diversity maintained by balancing selection and/ or recombination events for, potentially, millions of years. [58] [59] [60] [61] [62] Previous studies already suggested that high frequency HLA alleles could have been positively selected in first American populations because they would have conferred some selective advantage. 63, 64 Interestingly, recent HLA sequencing of 50 exomes of a continuous population from North-West America dating from before and after European contact (ancient DNA) identified a strong signal of negative selection at the HLA-DQA1 gene, 65 which shows that potential selective pressures on HLA genes may also be traced by other approaches.

By contrast, strongest HLA-DRB1 binders appear to be more frequent in Africa, Europe, South-West Asia and North-East Asia than in South-East Asia, Oceania, Australia and North and South America ( Figure 2 ). Some of these alleles, for example, DRB1*13:01 and DRB1*13:02, are frequent in all the regions where they are observed, while others are less evenly distributed, for example, DRB1*01:01 in Europe and Asia, DRB1*11:02 in Africa and Europe and DRB1*13:04 in West Africa. 66, 67 These results might indicate that, in addition to HLA-A, promiscuous HLA-DRB1 molecules may have been selected for by playing a protective role to endemic (eg, parasitic) diseases in populations from diverse geographic regions, as proposed for HLA-DRB1*12:02 in China. 68 Selection would have been most likely to occur if such populations were submitted to high pathogen diversity, as has been recently suggested. 20 Finally, sub-Saharan Africans display higher proportions of weakest HLA-DRB1 binders, which might be protective to other diseases (ie, strongest binders for another pathogen) or simply evolve neutrally or under the influence of different selective pressures. This fits with the known versatile evolution of HLA genes that are submitted to different kinds of selection. 21, 69, 70 The evolutionary history of the HLA region is probably particularly complicated in Africa given a potentially higher burden of infectious diseases. Importantly, our study provides a different conclusion to that recently drawn by Nguyen et al., 51 who stated that there is no correlation between HLA allele frequencies in populations and allele ability to bind SARS-CoV peptides. As SARS-CoV viruses appeared extremely recently, 71, 72 it seems clear that natural selection did not have enough time to induce allele frequency changes in populations, as potentially many generations are necessary to substantially change allele frequencies over time, depending on the selection coefficient and the population size. A more reasonable explanation for the associations that we do observe in the present study is that most of the strong HLA binders of coronavirus peptides are also strong binders of other pathogens, and hence are likely to be generalist (or promiscuous) strong binders that probably underwent selection in the past.

Thanks to an extensive analysis of peptide-binding predictions across multiple HLA genes, multiple infectious pathogens and multiple populations worldwide, the present study makes it possible to consider both HLA population variation and HLA evolution in a different light. First, the observed peptide-binding patterns are compatible with current knowledge on HLA protein function and diversity, which differ among the loci. Our results also underline the promiscuous behaviour of HLA proteins (especially HLA-A), which are able to bind peptides of various pathogens, even from distinct families, with high affinities. Finally, the global frequency distribution of HLA alleles coding for the strongest and weakest peptide binders predicted by our analyses indicates potential signatures of selective events occurring throughout humans history, although future studies are needed to confirm this hypothesis. It is important to note, however, that the characterisation of HLA proteins as strongest and weakest binders of pathogenderived peptides, as presented in this study, relies on computer-based binding affinity predictions with no experimental validation nor immunogenicity testing. Our results should thus be taken with care until combined bioinformatic (also needing improved predictive algorithms) and experimental approaches can be performed. 14, 53, 73 Moreover, although some protective or susceptibility markers to infectious diseases may be observed at varying frequencies across populations from different geographic regions of the world, the resistance and vulnerability of individuals to such diseases are multifactorial phenomena that cannot be determined by single genetic markers as they strongly depend on multiple, complex and often unknown biological (in a broad sense), environmental and other factors. This is important to remember in the context of global coronavirus outbreaks where all people may be highly vulnerable. However, this study demonstrates that knowledge on (or at least estimation of) individual epitope binding can be embedded into a population context to provide powerful clues about population and individual susceptibilities to human viral infections, at least as a crucial informed first step towards formulating working hypotheses that can be tested epidemiologically or experimentally.

This work was supported by the Swiss National Foundation for Scientific Research (grants #31003A_144180 and #310030_188820) and the EU-funded COST Action HLA-NET (BM0803) to ASM. RB is supported by the Max Planck Society. EC is supported by the Australian Government Research Training Program Stipend (RTPS). JT is supported by an Australian Research Council (ARC) Discovery Indigenous Project (IN180100017). BL is supported by an ARC Future Fellowship (FT170100448). We also thank David Roessli for his technical help, and we are most grateful to two anonymous reviewers for their useful and constructive comments on a previous version of this manuscript.

The Immune System

The HLA Complex in Biology and Medicine: A Resource Book

The HLA system. First of two parts

IPD-IMGT/HLA Database

The HLA genomic loci map: expression, interaction, diversity and disease

Advantageous diversity maintained by balancing selection in humans

The molecular descent of the major histocompatibility complex

Balancing selection and heterogeneity across the classical human leukocyte antigen loci: a meta-analytic review of 497 population studies

HLA DNA sequence variation among human populations: molecular signatures of demographic and selective events

HLA supertype variation across populations: new insights into the role of natural selection in the evolution of HLA-A and HLA-B polymorphisms

Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism

Classification of human leukocyte antigen (HLA) supertypes

Allele frequencies database

Predicting antigen presentation-what could we learn from a million peptides? Front Immunol

Exploring T & B-cell epitopes and designing multi-epitope subunit vaccine targeting integration step of HIV-1 lifecycle using immunoinformatics approach

Conserved peptide vaccine candidates containing multiple Ebola nucleoprotein epitopes display interactions with diverse HLA molecules

In silico identification of vaccine targets for 2019-nCoV

Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies

Divergent allele advantage at human MHC genes: signatures of past and ongoing selection

Pathogen diversity drives the evolution of generalist MHC-II alleles in human populations

The HLA-B landscape of Africa: signatures of pathogen-driven selection and molecular identification of candidate alleles to malaria protection

HLA-net 2013 collaboration. The HLA-net GENE[RATE] pipeline for effective HLA data analysis and its application to 145 population samples from Europe and neighbouring areas. Tissue Antigens

Improved methods for predicting peptide binding affinity to MHC class II molecules

Summary of Influenza Risk Assessment Tool (IRAT) Results. Pandemic Influenza (Flu). CDC. 2019

UniProt: a worldwide hub of protein knowledge

Peptide binding predictions for HLA DR, DP and DQ molecules

A consensus epitope prediction approach identifies the breadth of murine T(CD8 +)-cell responses to vaccinia virus

NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data

Gapped sequence alignment using artificial neural networks: application to the MHC class I system

Systematically benchmarking peptide-MHC binding predictors: from synthetic to naturally processed epitopes

Mixed-Effects Models in S and S-PLUS

Modern Applied Statistics with S-Plus

R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing

The Austroasiatic Munda population from India and its enigmatic origin: a HLA diversity study

HLA epitopes: the targets of monoclonal and alloantibodies defined

Strategies to work with HLA data in human populations for histocompatibility, clinical transplantation, epidemiology and population genetics: HLA-NET methodological recommendations

HLA class I alleles are associated with peptide-binding repertoires of different size, affinity, and immunogenicity

Fighting viral infections and virus-driven tumors with cytotoxic CD4+ T cells

Generalists and specialists: a new view of how MHC class I molecules fight infectious pathogens

MHC class I molecules and KIRs in human history, health and survival

HLA class I molecular variation and peptide-binding properties suggest a model of joint divergent asymmetric selection

Missing or altered self: human NK cell receptors that recognize HLA-C

Influence of HLA-C expression level on HIV control

Factors affecting HLA expression: a review

Structural and regulatory diversity shape HLA-C protein expression levels

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

A review of HLA allele and SNP associations with highly prevalent infectious diseases in human populations

The length distribution and multiple specificity of naturally presented HLA-I ligands

Novel, cross-restricted, conserved, and immunodominant cytotoxic T lymphocyte epitopes in slow progressors in HIV type 1 infection

HLA-B57-restricted cytotoxic T-lymphocyte activity in a single infected subject toward two optimal epitopes, one of which is entirely contained within the other

Human leukocyte antigen susceptibility map for SARS-CoV-2

Association of HLA class I with severe acute respiratory syndrome coronavirus infection

HLA studies in the context of coronavirus outbreaks

Soft sweeps: molecular population genetics of adaptation from standing genetic variation

The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation

Human population structure and the adaptive response to pathogen-induced selection pressures

Smallpox and the Native American

AIDS in chimpanzees: the role of MHC genes

Pinpointing a selective sweep to the chimpanzee MHC class I region by comparative genomics

Multiple instances of ancient balancing selection shared between humans and chimpanzees

Limited MHC class II gene polymorphism in the West African chimpanzee is distributed maximally by haplotype diversity

Long-term balancing selection in LAD1 maintains a missense trans-species polymorphism in humans, chimpanzees, and bonobos

Tracking human migrations by the analysis of the distribution of HLA alleles, lineages and haplotypes in closed and open populations

HLA diversity, differentiation, and haplotype evolution in Mesoamerican Natives

A time transect of exomes from a Native American population before and after European contact

Deciphering the fine nucleotide diversity of full HLA class I and class II genes in a well-documented population from sub-Saharan Africa. HLA

Mandenka from Senegal: next generation sequencing typings reveal very high frequencies of particular HLA class II alleles and haplotypes. HLA

The adaptive change of HLA-DRB1 allele frequencies caused by natural selection in a Mongolian population that migrated to the south of China

Excess of deleterious mutations around HLA genes reveals evolutionary cost of balancing selection

How pathogens drive genetic diversity: MHC, mechanisms and misunderstandings

Emergence of a novel coronavirus, severe acute respiratory syndrome coronavirus 2: biology and therapeutic options

Viral evolution and the emergence of SARS coronavirus

Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes

Binding affinities of 438 HLA proteins to complete proteomes of seven pandemic viruses and distributions of strongest and weakest HLA peptide binders in populations worldwide

The authors have declared no conflicting interests.

The data that support the findings of this study are available in the Supplementary Materials and from the corresponding authors upon request.