key: cord-0985911-i0haoky6 authors: Bose, Tungadri; Pant, Namrata; Pinna, Nishal Kumar; Bhar, Subhrajit; Dutta, Anirban; Mande, Sharmila S. title: Does immune recognition of SARS-CoV2 epitopes vary between different ethnic groups? date: 2021-09-21 journal: Virus Res DOI: 10.1016/j.virusres.2021.198579 sha: 11f83009eefd40c791d568caa4202fd229796f6f doc_id: 985911 cord_uid: i0haoky6 The SARS-CoV2 mediated Covid-19 pandemic has impacted humankind at an unprecedented scale. While substantial research efforts have focused on understanding the mechanisms of viral infection and developing vaccines/ therapeutics, factors affecting the susceptibility to SARS-CoV2 infection and manifestation of Covid-19 remain less explored. Given that the Human Leukocyte Antigen (HLA) system is known to vary among ethnic populations, it is likely to affect the recognition of the virus, and in turn, the susceptibility to Covid-19. To understand this, we used bioinformatic tools to probe all SARS-CoV2 peptides which could elicit a T-cell response in humans. We also tried to answer the intriguing question of whether these potential epitopes were equally immunogenic across ethnicities, by studying the distribution of HLA alleles among different populations and their share of cognate epitopes. Results indicate that the immune recognition potential of SARS-CoV2 epitopes tends to vary between different ethnic groups. While South Asians are likely to recognize a higher number of CD8-specific epitopes, Europeans are likely to identify a higher number of CD4-specific epitopes. We also hypothesize and provide clues that the newer mutations in SARS-CoV2 are unlikely to alter the T-cell mediated immunogenic responses among the studied ethnic populations.
The work presented herein is expected to bolster our understanding of the pandemic, by providing insights into the differential immunological response of ethnic populations to the virus as well as by gauging the possible effects of mutations in SARS-CoV2 on the efficacy of potential epitope-based vaccines through evaluating ∼40000 viral genomes. The ongoing outbreak of severe acute respiratory syndrome (SARS), commonly known as Covid-19, has already infected over 221 million people and has led to the death of 4.57 million people worldwide ("WHO Coronavirus Disease (COVID-19) Dashboard," n.d.). SARS-CoV2, a newly identified lineage B Betacoronavirus, has been held responsible for this global pandemic (Letko et al., 2020). While the current pandemic is unprecedented in scale as compared to the earlier coronavirus outbreaks (Pathan et al., 2020), they are quite alike considering the ongoing quest for an effective preventive/ treatment regime to counter the disease (Ayukekbong et al., 2020). A significant amount of research has been directed at understanding various facets of the pathogenesis. These include viral attachment and entry (Harvey et al., 2021; Hoffmann et al., 2020; Letko et al., 2020), formation of the coronavirus replication/ transcription complex (Y. Krichel et al., 2020; Slanina et al., 2021), replication, transcription and translation of viral proteins (Hillen et al., 2020; Satarker and Nampoothiri, 2020; Wang et al., 2020), virion assembly and release (Kumar et al., 2020; Li et al., 2020) and the commonalities and differences of this disease with seasonal flu and previous SARS infections (Song et al., 2020; Xu et al., 2020). In spite of these efforts, factors affecting susceptibility to SARS-CoV2 infection and manifestation of Covid-19 are yet to be properly understood and demand more attention.
It may be noted that SARS-CoV2's rate of genomic mutation is similar to that of most RNA viruses (De Maio et al., 2021; Mercatelli and Giorgi, 2020; Pathan et al., 2020), and recent evidence seems to suggest that the mortality rates among Covid-19 patients may be associated with the mutation profile of the infecting virus (Toyoshima et al., 2020; "WHO | Variant analysis of SARS-CoV-2 genomes," n.d.). The mortality rate has been perceived to be particularly high for certain SARS-CoV2 variants which have been designated as Variants of Concern (VOC) (Challen et al., 2021; Grint et al., 2021). While a pre-existing medical condition is likely to increase the risk of severity of Covid-19 infection (CDC, 2020), yet another factor influencing the susceptibility to the disease could be the genetic makeup of an individual. The Human Leukocyte Antigen (HLA) system, a major determinant of our ability to detect and neutralize an invading pathogen, is encoded by the Major Histocompatibility Complex (MHC) genes located on chromosome 6. This system has been shown to play a crucial role in the manifestation and outcome of infection (Amoroso et al., 2021; Correale et al., 2020; Langton et al., 2021; Malkova et al., 2021; Pisanti et al., 2020; Shkurnikov et al., 2021). MHCs, which are further categorized into classes I and II, are highly polymorphic and are known to vary significantly among individuals of different ethnicities. The outcome of an infection event is therefore dependent on both the genotype of the virus as well as the host cell surface (MHC) molecules destined to present viral antigenic peptides to the human T-cell receptor (TCR) of T-lymphocytes (also called killer T-cells or CD8-positive cytotoxic T-cells) and T-helper cells (also called CD4-positive T cells) (Murray and McMichael, 1992; van Montfoort et al., 2014).
In this work, we have investigated how the genetic variations across ethnicities are likely to influence the ability of their immune system to recognize the virus in a timely manner, and in turn, their susceptibility to Covid-19. To this end, state-of-the-art bioinformatic tools were used to (a) identify the probable antigenic peptides on the SARS-CoV2 proteome, and (b) identify HLA alleles which could recognize and present these epitopes to the T-cells, along with the prevalence of these alleles in different ethnic groups. In addition, 40,342 fully sequenced SARS-CoV2 genomes (which were isolated from patients across the globe) were analyzed to probe the possible effect of viral genetic variations on antigenic recognition. Whether the variations in the viral genome over time are likely to change the susceptibility of an ethnic group to infection was also evaluated. In this context it may be noted that, given the computational challenges of identifying B-cell epitopes with adequate confidence, the current study did not focus on this aspect of human immunity. Results presented in this work also assume additional importance when viewed in the context of a recent publication which highlighted the prospect and benefits of considering non-spike proteins for future Covid-19 vaccine designs (Peng et al., 2020). One of the key mechanisms of identification of an invading pathogen by the host immune system involves MHC class-I and MHC class-II proteins presenting the pathogenic protein fragments (epitopes) to the CD8-positive cytotoxic T-cells and CD4-positive T-helper cells respectively. Given this, efforts were first made to identify SARS-CoV2 epitopes that could be recognized by the HLA allelic variants. A total of 505 epitopic regions from all the proteins encoded by the SARS-CoV2 reference genome (GenBank Accession no. MN908947) could be identified (see Materials and Methods).
Of these, 487 epitopic regions which qualified our criteria for further analysis were predicted to bind to 180 HLA allelic variants (Supplementary Table 1) with reasonable confidence (see Materials and Methods). These comprised 391 CD8 (MHC class-I) and 96 CD4 (MHC class-II) epitopes. From the list of predicted epitopes and the corresponding HLA alleles, it was observed that some of the HLA alleles could recognize a higher number of SARS-CoV2 epitopes and thus might play a more significant role in the immune response. Of the 155 MHC class-I associated HLA alleles that were predicted to be involved in the antigenic recognition of SARS-CoV2 proteins in (CD8-positive) cytotoxic T-lymphocyte cells, the highest number of epitopes were identified by HLA-A*02:11, HLA-B*15:17, HLA-A*24:03, HLA-A*26:02 and HLA-A*68:01 (50, 41, 33, 33 and 26 epitopes respectively). In contrast, among the 25 MHC class-II associated HLA alleles involved in the antigen recognition of SARS-CoV2 proteins in (CD4-positive) T-helper cells, HLA-DRB1*01:01, HLA-DRB1*15:01, HLA-DRB1*15:06 and HLA-DRB1*01:02 were predicted to present the highest number of epitopes (28, 22, 22 and 11 epitopes respectively). It was also noted that a few of the epitopes could be recognized by multiple HLA alleles (Supplementary Table 1). It therefore appeared that the potential of a population/ ethnic group to cope with Covid-19 infection could be determined by their MHC gene pool. The diversity in the allelic makeup of 82 different ethnic groups constituting seven super-populations was studied using data available from the Allele Frequency Net Database (AFND) and the 1000 genomes project (TGP) (see Materials and Methods and Supplementary Table 2). Only those HLA alleles which occurred with a frequency ≥ 0.01 in at least one of the ethnicities and were predicted to recognize one or more SARS-CoV2 epitopes were considered for the presented analyses.
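The allele selection rule stated above (frequency ≥ 0.01 in at least one ethnicity, and at least one predicted epitope) can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function name `select_alleles`, the data layout, and every allele, group and epitope identifier below are hypothetical placeholders, and the frequencies are invented toy values.

```python
def select_alleles(freq_by_group, epitopes_by_allele, min_freq=0.01):
    """Return alleles that reach min_freq in at least one group
    and are predicted to recognize one or more epitopes."""
    selected = set()
    for freqs in freq_by_group.values():
        for allele, freq in freqs.items():
            # An allele with no predicted epitope is skipped regardless of frequency.
            if freq >= min_freq and epitopes_by_allele.get(allele):
                selected.add(allele)
    return sorted(selected)


# Toy input (hypothetical names and values, not data from the study).
freq_by_group = {
    "GROUP_A": {"ALLELE_1": 0.020, "ALLELE_2": 0.004, "ALLELE_3": 0.300},
    "GROUP_B": {"ALLELE_1": 0.000, "ALLELE_2": 0.012, "ALLELE_4": 0.003},
}
epitopes_by_allele = {
    "ALLELE_1": {"epi_a", "epi_b"},
    "ALLELE_2": {"epi_c"},
    "ALLELE_4": {"epi_d"},
}

selected = select_alleles(freq_by_group, epitopes_by_allele)
print(selected)
```

In this toy case ALLELE_3 is frequent but has no predicted epitope, and ALLELE_4 has an epitope but never reaches the threshold, so neither is retained.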
In terms of MHC class-I associated HLA alleles (henceforth termed as MHC-I alleles), Middle East and Africans (MEA) showed the highest richness (Fig. 1). Amerindians (AMR) and Oceanians (OCN) had the least MHC-I allelic diversity. In contrast, while Europeans (EUR) and Africans (AFR) demonstrated the highest MHC class-II diversities, South Asians (SAS) and OCN were noted to be the least diverse (Fig. 1). The diversity of both MHC-I and MHC-II alleles was found to vary to a large extent among ethnicities comprising the AMR super-population. In general, the HLA allele richness among super-populations was seen to be more diverse in case of MHC-I as compared to MHC-II (Supplementary Table 3 and Methods). In addition to this, certain MHC-I alleles specific to each of the super-populations (except SAS and OCN) were also found. MEA lacked 13 MHC-I and 2 MHC-II alleles that were present among all other super-populations. As expected, intra-population variations existed and the occurrence of each of the (SARS-CoV2 associated) MHC-I and MHC-II alleles was not found to be uniform among the samples constituting the ethnic groups (Supplementary Table 3). To account for this, the frequency of occurrence of every MHC allele in each of the ethnic groups was computed (see Materials and Methods). Results obtained were used to construct heat maps (Supplementary Figs. 1 and 2) for gauging the distribution of the potent MHC alleles which could play a role in the immune response against SARS-CoV2. While some of the alleles were seen to be more frequent across ethnicities, it was interesting to note that the MHC-II alleles with the highest antigen recognition capability (like HLA-DRB1*01:01 and HLA-DRB1*15:06) were seen to be less common across ethnicities and in most cases were also sparsely distributed even among samples from the same ethnicity.
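Allele richness per super-population, and the alleles specific to a single super-population, can be tallied as in the sketch below. It assumes richness is simply the number of distinct alleles observed at or above a frequency cutoff in any constituent ethnicity; the function names, the cutoff handling and the toy identifiers are hypothetical and are not taken from the study's Methods.

```python
def observed_alleles(freq_by_ethnicity, superpop_of, min_freq=0.01):
    """Map each super-population to the set of alleles seen at >= min_freq
    in at least one of its ethnicities."""
    observed = {}
    for ethnicity, freqs in freq_by_ethnicity.items():
        superpop = superpop_of[ethnicity]
        observed.setdefault(superpop, set()).update(
            allele for allele, freq in freqs.items() if freq >= min_freq
        )
    return observed


def private_alleles(observed):
    """Alleles found in exactly one super-population."""
    result = {}
    for superpop, alleles in observed.items():
        others = set().union(*(a for s, a in observed.items() if s != superpop))
        result[superpop] = alleles - others
    return result


# Toy input (hypothetical names and values).
freq_by_ethnicity = {
    "eth1": {"A1": 0.05, "A2": 0.02},
    "eth2": {"A2": 0.10, "A3": 0.001},
    "eth3": {"A2": 0.03, "A4": 0.20},
}
superpop_of = {"eth1": "SP1", "eth2": "SP1", "eth3": "SP2"}

observed = observed_alleles(freq_by_ethnicity, superpop_of)
richness = {sp: len(alleles) for sp, alleles in observed.items()}
private = private_alleles(observed)
print(richness, private)
```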
In summary, noticeable variability in the potential of the HLA alleles to recognize and present the SARS-CoV2 epitopes to the immune cells was observed. We subsequently investigated if this could have a bearing on the level of antigenic recognition among different ethnic groups. The overall trends observed with respect to the diversity of the predicted CD8 and CD4 epitopes were comparable to those of their corresponding alleles (MHC-I and MHC-II alleles respectively) across ethnicities (Fig. 1). However, certain subtle differences, specifically with respect to their relative richness, were observed. For example, the change in relative richness (number of observed alleles or epitopes for an ethnicity) from the South Asian (SAS) to the East Asian (EAS) super-population in the plots associated with MHC-I alleles and the corresponding CD8 epitopes was apparent. Similarly, a change in the richness of MHC-II alleles and CD4 epitopes from the Oceanian (OCN) to the EAS super-population was also visible. The richness of the antigenic recognition potential among Amerindian (AMR) and OCN ethnicities was seen to be quite diverse, particularly for CD8 epitopes (Supplementary Table 3). Further, 324 CD8 and 96 CD4 epitopes were found to be common across the seven super-populations (Table 4). While CD4 epitopes were seen to be equally recognized among all super-populations, the CD8 epitopes were found to be differentially recognized. In line with the MHC-I alleles, EAS, EUR and OCN were found to have the highest overlap in terms of recognizing CD8 epitopes. The trends in recognition of epitopes by different alleles present across various ethnicities provided further insights (Supplementary Figs. 4 and 5). The antigenic peptide FLLPSLATV (Epitope_1 from nsp6) appeared to be the most recognized CD8-specific epitope across ethnicities. Notably, this epitope could be recognized by 15 HLA allelic variants, the highest among all the CD8-specific SARS-CoV2 antigens recognized in this study (Supplementary Figs.
4 and Supplementary Table 1). However, for both the CD8 and CD4 epitopes, there were no observable correlations between the epitopes which were recognized by a higher number of HLA variants and those which were most common across ethnicities. For example, while the CD4-dependent antigenic peptides ESPFVMMSAPPAQYE and TQEFRYMNSQGLLPP (Epitope_94 and Epitope_123 respectively) were recognized by five HLA variants each, the maximum among the CD4-specific SARS-CoV2 antigens recognized in this study (Supplementary Fig. 5 and Supplementary Table 1), Epitope_123 was not as common across different ethnicities as Epitope_94. The above observations indicate that there can be certain discernible differences at an overall population level, with respect to the CD4 and CD8 cell mediated immune response against SARS-CoV2. The frequencies of occurrence of the 342 HLA alleles (of the classes HLA-A, HLA-B, HLA-C, HLA-DRB1 and HLA-DQB1) constituting AFND and TGP were computed for each of the 82 ethnic groups (Supplementary Table 5). Based on this, the overall epitope recognition potential of the ethnicities and super-populations was computed (see Materials and Methods). Supplementary Figs. 6 and 7 depict the average count of epitopes recognized by individuals representing super-populations and ethnic groups respectively. East Asians (EAS), Africans (AFR) and Oceanians (OCN) showed low potentials to recognize both CD8 and CD4 epitopes. In contrast, ethnicities comprising the European (EUR) and South Asian (SAS) super-populations showed high potentials to recognize all forms of SARS-CoV2 epitopes. The Peruvians from Lima (PEL) exhibited an interesting pattern with extremely low CD4, but very high CD8 epitope recognition potential. Based on the above observations, we further probed if there were any statistically significant differences between the epitope recognition potentials of the various ethnicities and super-populations (see Materials and Methods).
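The exact computation of the epitope recognition potential is given in Materials and Methods and is not reproduced here. As an illustration only, one plausible way to estimate the average count of epitopes recognized by an individual from allele frequencies is sketched below. It assumes Hardy-Weinberg proportions (two independent allele draws per locus) and independence between loci; these assumptions, the function name `recognition_potential` and all identifiers and values are hypothetical and should not be read as the authors' formula.

```python
from collections import defaultdict


def recognition_potential(allele_freq, alleles_by_epitope):
    """Expected number of epitopes recognized by a random diploid individual.

    For each epitope, the probability of carrying at least one recognizing
    allele is 1 - prod_over_loci (1 - p_locus)^2, where p_locus is the summed
    frequency of recognizing alleles at that locus (assumed model).
    """
    expected = 0.0
    for alleles in alleles_by_epitope.values():
        per_locus = defaultdict(float)
        for allele in alleles:
            locus = allele.split("*")[0]  # e.g. "HLA-A" from "HLA-A*toy1"
            per_locus[locus] += allele_freq.get(allele, 0.0)
        p_none = 1.0
        for p in per_locus.values():
            p_none *= (1.0 - min(p, 1.0)) ** 2
        expected += 1.0 - p_none
    return expected


# Toy input (hypothetical alleles, epitopes and frequencies).
allele_freq = {"HLA-A*toy1": 0.5, "HLA-B*toy1": 0.5}
alleles_by_epitope = {
    "epi_1": ["HLA-A*toy1"],
    "epi_2": ["HLA-A*toy1", "HLA-B*toy1"],
    "epi_3": ["HLA-C*toy1"],  # allele absent from this population
}

potential = recognition_potential(allele_freq, alleles_by_epitope)
print(potential)
```

Unlike a plain frequency-weighted sum over alleles, this form counts an epitope once even when several alleles carried by the same individual can recognize it.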
The p-value for the one-way ANOVA test among all the super-populations was seen to be less than 2e-16 at the 95% confidence level, thereby implying a significant difference in the epitope recognition potential of at least one of the seven super-populations. Indeed, the results obtained from the t-test (see Materials and Methods) indicated substantial differences in the recognition potentials of both CD8 and CD4 epitopes among super-populations as well as ethnicities, except between the AMR and MEA super-populations (Supplementary Table 6). Further, to check for any major differences in the epitope recognition potentials between individuals of different ethnicities, a Principal Component Analysis (PCA) was performed (see Materials and Methods). The AFR ethnicities were found to be the most distinct among all super-populations (Fig. 3). Further, ethnicities from SAS and EAS formed more compact clusters when compared to ethnicities from AMR and EUR, with ethnicities from AMR being the most dispersed. The epitope recognition potentials of the Mexican and the USA groups appeared to be distinct from the rest of the Amerindians. Aborigine Australians (AUS) and the Malay from Singapore (SGM) were found to cluster more closely with ethnicities of other super-populations. AUS and SGM clustered with ethnicities of the AMR and EAS super-populations respectively. While the prevalence of specific HLA alleles in the gene pool of a population or an ethnic group provides an overall idea about how well recognized the SARS-CoV2 epitopes would be in that population, the immune response of each of the individuals would be independently governed by their own allelic make-up. The above results provide an average estimate of (CD4 and CD8 cell mediated) immune sensitivity to SARS-CoV2 for individuals belonging to a population and highlight inter-ethnic differences therein.
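For readers less familiar with the test, the F statistic underlying a one-way ANOVA can be computed as in the following sketch. The groups stand in for per-individual epitope recognition potentials of different super-populations; the values are invented toy numbers, the function name is hypothetical, and the p-value (which requires the F distribution) is deliberately left out.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy samples (hypothetical values, three groups of three individuals).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

In practice a library routine such as `scipy.stats.f_oneway` would typically be used, as it also returns the p-value; a significant F only indicates that at least one group differs, which is why pairwise tests follow.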
In addition to the HLA makeup, yet another factor which could determine the fate of the infection process is the genetic variation in the SARS-CoV2 genomes. Genetic variation could lead to epitope modifications, which in turn, might alter the binding affinity/ recognition of the viral epitopes by MHC-I/II alleles. To account for this factor, the epitope signatures of 40,342 SARS-CoV2 genomes were generated (see Materials and Methods) based on the 487 epitopes predicted for the reference genome and their 4194 variants, henceforth referred to as variants of "reference" epitopes (VREs) (Table 1 and Supplementary Table 7). Apart from the original 487 epitopes, only 25 of the 4194 VREs were seen to be present in more than 0.5% of the studied SARS-CoV2 genomes (more frequent VREs or MFVREs). Moreover, some of the variants were found to exclusively co-occur among a sub-set of strains isolated from certain geographies, most prominently among those isolated from India or the United Kingdom. Even more significantly, a particular variant (SEVGPEHSL at position 376 on ORF1a) was noted to be absent only among the SARS-CoV2 genome sequences isolated and sequenced in Iran (Supplementary Table 7). It was further observed that while the MFVREs exhibited a mixed pattern of altered immunological behaviors, the human MHC-I/II alleles have a lower binding affinity/ recognition potential for a large proportion (2072 out of 4169) of the less frequent VREs (LFVREs) (Table 1). While it would be interesting to understand the effect of these variations in the viral genome on the antigenic recognition potentials among different ethnicities, no major effects were expected, given that only 25 of the variants (MFVREs) were present in more than 0.5% of the studied SARS-CoV2 genomes.
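The split of VREs into MFVREs and LFVREs by the 0.5% prevalence cutoff can be expressed as a short sketch. The genome total (40,342) and the cutoff come from the text; the function name, the variant labels and the per-variant genome counts are hypothetical toy values.

```python
def classify_variants(genome_counts, n_genomes, threshold=0.005):
    """Split variant epitopes into more frequent (MFVRE) and less frequent
    (LFVRE) sets according to the fraction of genomes carrying them."""
    more_frequent, less_frequent = [], []
    for variant, count in genome_counts.items():
        if count / n_genomes > threshold:
            more_frequent.append(variant)
        else:
            less_frequent.append(variant)
    return sorted(more_frequent), sorted(less_frequent)


N_GENOMES = 40342  # number of genomes analyzed in the study

# Toy counts (hypothetical): number of genomes in which each variant occurs.
genome_counts = {"VAR_A": 30000, "VAR_B": 150, "VAR_C": 250}

mfvres, lfvres = classify_variants(genome_counts, N_GENOMES)
print(mfvres, lfvres)
```

With 40,342 genomes the cutoff corresponds to roughly 202 genomes, so VAR_C (250) is classed as more frequent and VAR_B (150) as less frequent.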
Even in the hypothetical scenario wherein every population/ ethnicity is simultaneously exposed to all the 40,342 viral variants, the individual antigenic recognition potential would be driven by the originally identified 487 epitopes, and not by the LFVREs. Supplementary Table 8 lists the HLA alleles which could recognize the 25 MFVREs or their corresponding reference epitopes. The most frequently observed MFVREs were STVFPLTSF and TVFPLTSFGPLVR (variants of Epitope_461 and Epitope_479 respectively), both from the NSP12 coding region of the SARS-CoV2 genome and present in over 73% of the studied genomes. In case of STVFPLTSF, the allele HLA-B*15:17 was found to recognize both the reference as well as the variant epitope. In addition to this, HLA-A*26:02 could recognize the variant epitope, but not the reference epitope. HLA-A*26:02 has the maximum observed frequency among the Japanese in Tokyo (EAS_JPT) (allele frequency 2.4%), but at the same time is not observed in 74 of the 82 sub-populations studied (Supplementary Table 8). So, one may expect that the variant epitope escaping recognition by HLA-A*68:01 could be a contributor to an altered immune response. However, HLA-A*68:01 has earlier been reported as one of the strongest binders of most viruses (Barquera et al., 2020) and it is likely that the loss of recognition of a single potential epitope may not influence its overall binding affinity to the virus. Considering the continuously evolving genome of SARS-CoV2, which entails selection/ retention of specific epitope variants over time in the newly emerging genomes, an analysis was performed to check if there were any temporal changes in the susceptibility to Covid-19 in any of the ethnicities (see Materials and Methods). Even in this case (Fig.
4), it was noted that temporal variations in the SARS-CoV2 genome, while resulting in certain changes in its epitope signature, did not appreciably alter the epitope recognition ability among ethnic groups, at least at a population level (also see Supplementary Fig. 8). Since ethnic groups (and people from the same geography) are known to predominantly encode certain HLA gene variants, it may be perceived that the SARS-CoV2 epitope recognition potential of an ethnic group is not likely to change owing to the temporal variations in the viral genome. Further, it will be appreciated that although the said changes in the viral genomes might alter the biological functioning of the viral genes, they remain largely inconsequential with respect to immune determination in the host. There have been three major coronavirus associated SARS outbreaks in the last 20 years. While there has been extensive research regarding the pathophysiology of these viruses (Y. Harvey et al., 2021; Hillen et al., 2020; Hoffmann et al., 2020; Krichel et al., 2020; Kumar et al., 2020; Letko et al., 2020; Li et al., 2020; Satarker and Nampoothiri, 2020; Slanina et al., 2021; Song et al., 2020; Wang et al., 2020; Xu et al., 2020), as yet we have limited insight into the factors affecting susceptibility to SARS-CoV2 infection and manifestation of Covid-19. Some scientists have opined that the mortality rate in Covid-19 could be linked to the genomic profile of the infecting virus (Challen et al., 2021; Grint et al., 2021; Toyoshima et al., 2020; "WHO | Variant analysis of SARS-CoV-2 genomes," n.d.), which seems to mutate at a rate similar to most RNA viruses (De Maio et al., 2021; Mercatelli and Giorgi, 2020; Pathan et al., 2020). Further, an underlying/ pre-existing medical condition might be considered to be an added risk towards increasing the severity/ complications associated with Covid-19 infection (CDC, 2020).
In addition, the host genetic makeup could play a decisive role in disease manifestation. (Reynisson et al., 2020). It may be noted that most of the other available tools and web-services for predicting MHC-I and MHC-II epitopes (Jespersen et al., 2017; Paul et al., 2016; Saha and Raghava, 2004; Zhang et al., 2009) were not equipped to handle the entire HLA allelic set that had been reported in the AFND and TGP. Further, most of the other tools also did not have a provision to identify variable length MHC-I epitopes. Consequently, some of the earlier studies associated with in silico identification of SARS-CoV2 epitopes (Abdelmageed et al., 2020; Dong et al., 2020; Lin et al., 2020; Naz et al., 2020; TOPUZOĞULLARI et al., 2020) are likely to have suffered from the mentioned limitations associated with these tools (Jespersen et al., 2017; Paul et al., 2016; Saha and Raghava, 2004; Zhang et al., 2009). In other words, these studies might not have arrived at a list of epitopes as comprehensive as the one reported in the current study. Further, studies following a combinatorial approach (H.-Z. Zaheer et al., 2020) are also expected to miss some of the epitopes reported in this study. It may however be noted that, like most previous studies, the current study refrained from predicting potential B-cell epitopes due to the computational complexities involved in the process, and this may be considered a limitation of this exercise. Most B-cell epitopes are conformational in nature and comprise discontinuous amino acid stretches which are difficult to identify (with adequate confidence) from genomic information alone (Wang et al., 2011). Table 3 ) comprising these super-populations. While the super-populations were largely seen to cluster in terms of the MHC-I genes encoded in the genome, such patterns could not be observed for MHC-II genes (Supplementary Figs. 1-2).
In addition, certain HLA alleles were found to be more efficient in recognizing SARS-CoV2 epitopes as compared to others. Given the skewed distribution of these HLA alleles among a few of the ethnic groups, contrasting characteristics were observed with respect to the ratio of the number of epitopes to the number of HLA alleles involved in their recognition, even among ethnicities from the same super-population (Supplementary Fig. 3). The role played by HLA genes (differentially occurring among ethnicities/ super-populations) in the SARS-CoV2 immune response is under-explored relative to other aspects affecting the manifestation and outcome of the Covid-19 infection. Some of the available reports are discussed herein in the light of our results. Our results concur with the report on the susceptibility of genotypes to predisposition of coronavirus infection (Malkova et al., 2021). The genotypes with lower susceptibility (viz., HLA-A*02:02, B*15:03, C*12:0) were seen to recognize a higher number of epitopes than the high susceptibility genotypes (viz., HLA-A*25:01, B*46:01, C*01:02). The HLA-C*04:01 allele has been reported to be associated with an increased risk of intubation and severe clinical symptoms of Covid-19 ("Increased risk of severe clinical course of COVID-19 in carriers of HLA-C*04:01 -EClinicalMedicine," n.d.). We found that this allele could recognize only two SARS-CoV2 epitopes, thereby hinting at a potential vulnerability to Covid-19 infection among individuals harboring this gene. HLA-C*04:01 was found to be most frequent among the African (AFR), followed by the Amerindian (AMR) and Oceanian (OCN) super-populations. We were intrigued by the contradictory literature evidence on the relationship of HLA-A*02:01 to the risk of infection (Migliorini et al., 2021; Shkurnikov et al., 2021).
Given its widespread occurrence (one of the most frequent HLA genes in most of the super-populations), we believe additional research is required for ascertaining its relation to the disease outcome. In a study conducted in Italy, HLA-C*06:02 was shown to be better correlated with the patient group, indicating its association with disease severity (Novelli et al., 2020). In this regard, the higher abundance of HLA-C*06:02 among the Middle East and African (MEA) and South Asian (SAS) super-populations assumes importance. Among the MHC-II alleles, HLA-DRB1*03:01 was observed to be the second most frequent allele in AMR and MEA. This allele has been earlier reported to increase the risk of infection (Shen et al., 2021). HLA-DRB1*01:01 and HLA-DRB1*15:01 were observed to be able to recognize the highest number of CD4 epitopes in all seven super-populations. Noticeably, HLA-DRB1*01:01 has been shown to be negatively associated with the mortality rate of hospitalized patients (Romero-López et al., 2021) and was found to be most abundant among the European (EUR) super-population. Yet another study reported a significantly lower frequency of the haplotype DQA1*01:01-DQB1*05:01-DRB1*01:01 among the asymptomatic group when compared to patients with high severity (Langton et al., 2021). In contrast to our observations, this would indicate a diminished capability to identify and counter SARS-CoV2 invasion among individuals harboring HLA-DRB1*01:01. Given that the haplotype information was not available in AFND, the above could not be pursued further. For the same reason, we could not compare our results with some of the other reports which had used HLA haplotype data for drawing inferences (Khor et al., 2021; Pisanti et al., 2020). In this study, an attempt was further made to gauge the variations in the SARS-CoV2 epitope regions resulting from temporal changes in the viral genome and their effect on the epitope recognition potential across ethnicities worldwide.
The SARS-CoV2 genomic variation data which was obtained for different geographies over a six-month period indicated that although the SARS-CoV2 genomic variations altered the overall predicted SARS-CoV2 epitope signature profile (Table 1 and Supplementary Table 7 Table 9 ). For instance, in case of the Alpha (B.1.1.7) variant, out of three epitope variants, two showed a decrease in the number of alleles that could recognize the epitopes, whereas one of the variants showed an increase in the number of alleles that could recognize it. Four out of five variants in Beta (B.1.351) and three out of four variants in Gamma (P.1) were also observed to be recognized by fewer alleles. As for the Delta (B.1.617.2) variant, none of the four variants of the reference epitopes could be recognized by the HLA alleles. It may appear that a reduction in the number of HLA alleles capable of identifying the mentioned epitope variants in the VOCs would aid the virus in evading the host immune system. However, it would be difficult to comment on this aspect given that the set of mutations defining a VOC affects only a handful of predicted epitopes (a total of 15 for the four VOCs taken together) out of the total set of 487 predicted epitopes in the reference genome. Elaborate genetic and epidemiological studies would be required to derive any conclusive evidence on the above aspects. Nonetheless, our study provides initial clues and intriguing insights relevant to host-virus interactions. Moreover, a minimal set of SARS-CoV2 epitopes was identified, which can be recognized by the HLA repertoire of individuals from all ethnicities ( (Peng et al., 2020) . Moreover, the outcome of this study can also aid in disease prognosis and the design of personalized therapy regimes for Covid-19 patients.
The idea of providing personalized therapy to Covid-19 patients, especially those requiring critical care, is already under clinical consideration (Cacciapuoti et al., 2020; Fang and Schooley, 2020; Garcia-Vidal et al., 2020). In addition, results presented herein could also prove beneficial in understanding/ countering a future flu/ pneumonia outbreak involving a similar lineage B Betacoronavirus. Given that the world has already witnessed two such major outbreaks in less than 20 years, it would be prudent to prepare blueprints of a more potent coronavirus vaccine, particularly against the lineage B Betacoronavirus, before the next outbreak. The context and results of the presented work are somewhat aligned with another intriguing aspect of the prevailing pandemic, namely disease severity and case fatality rate (CFR). Appreciable spatial and temporal variations in CFR have been observed across geographies, and suitable explanation(s) for these variations have eluded researchers, given the wide array of possible confounding factors. A key observation in this regard pertains to the economic status of a country. Death rates due to Covid-19 have been observed to be positively associated with GDP (per-capita) in multiple studies (Sorci et al., 2020). Higher death rates despite (expected) better access to healthcare in high-income populations seem outright counterintuitive. While some scientists have tried to explain these observations citing higher life expectancy and consequently an older population in high-income countries, who would be more susceptible to Covid-19, others have hinted at the possibility of the so-called "hygiene hypothesis" at play (Bloomfield et al., 2016; Chatterjee et al., 2021). The results of the current study unravel yet another interesting aspect in this context pertaining to genetic diversity. Severe cases of Covid-19 disease have been seen to be characterized by higher levels of inflammatory cytokines and CD8+ T cell exhaustion.
Reports have also indicated that in milder Covid-19 infections, which do not lead to death or other complications, higher proportions of SARS-CoV2-specific CD8+ T cells have been identified (Peng et al., 2020). Our results indicate that the heavily affected European population (EUR) tends to harbor a larger fraction of MHC-II alleles in its gene pool that are specific to SARS-CoV2 epitopes. On the other hand, the South Asian (SAS) population, which has a relatively lower fatality rate, exhibited a relatively larger proportion of MHC-I alleles specific to SARS-CoV2 epitopes. Whether a larger pool of MHC-I alleles presenting SARS-CoV2 antigens to CD8+ T cells in the SAS population can indeed be linked to the apparently lower fatality rate in this region merits further investigation. A recent study reporting an inverse association between predicted MHC-I epitope load and mortality rates also hinted at this possibility (Wilson et al., 2021). Moreover, any possible relationship between the larger pool of MHC-II alleles in the EUR population and CD4+ T cell activation and cytokine production leading to a disproportionate immune response (a so-called "cytokine storm") will also be intriguing to explore. While the current work is limited by the number of representative individuals genotyped for each population in AFND and TGP, the results nonetheless provide an overview of possible trends in protective immunity and T-cell responses against SARS-CoV2 across different geographies and ethnicities.

Overall, to our knowledge, this is the first account to capture the effects of the evolution of the SARS-CoV2 genome on its potential interactions with human HLA gene products from a global perspective. This assumes immense importance in the context of vaccine development and vaccine efficacy against the newer lineages of SARS-CoV2.
It highlights that understanding the crosstalk of the pathogen with the components of the HLA system could have far-reaching consequences for our coexistence with SARS-CoV2 and other RNA viruses that we may encounter in the future.

It is pertinent to note that the nature of our findings, obtained through a bioinformatic/computational exercise, was dependent on (a) the availability of data in public repositories and (b) the efficiency of the available bioinformatic tools. As a result, insights into some aspects of the immune response could not be obtained, such as whether SARS-CoV2 has any B-cell-specific conformational epitopes, and the expression levels of the MHC alleles in response to Covid-19 infection and their variation, if any, across ethnicities, age groups, co-morbidities, etc. Furthermore, while it would have been ideal to validate the derived conclusions in an experimental setting, such an exercise lies beyond the scope and reach of the current endeavor given the scale of the project, which involved analyses of over 40,000 viral genomes and HLA gene data of nearly 150,000 individuals across 82 ethnicities spread across the globe. Nonetheless, the perspectives presented are interesting and quite pertinent to the ongoing pandemic, and deserve to be shared with the larger scientific community. We hope that a larger collaboration/consortium may be forged in the days to come towards validating the insights derived in silico.

Full-length sequences of SARS-CoV2 proteins (reference sequences) were obtained from NCBI (GenBank Accession no. MN908947). Further, individual non-structural protein sequences were also obtained from NCBI (NCBI Accession no. YP_009725300.1 to YP_009725312.1). These protein sequences were used for predicting antigenic peptides (epitopes) on the viral proteome.
In addition, high-quality, fully sequenced SARS-CoV2 genomes were obtained from GISAID (https://www.gisaid.org/) (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017). Genomic data and corresponding metadata for 40,342 genomes (deposited to GISAID until 11th June 2020) were used in this analysis (see Supplementary File 2 for details).

Table 5 ) belonging to 82 ethnic groups that were arranged into seven super-populations (Supplementary Table 2).

NetMHCpan 4.1a (http://www.cbs.dtu.dk/services/NetMHCpan/index_4.1a.php) and NetMHCIIpan 4.0 (http://www.cbs.dtu.dk/services/NetMHCIIpan/) were used with default parameters for identifying strong and weak binders (i.e., 0.5 and 2 for NetMHCpan and 1 and 5 for NetMHCIIpan) (Reynisson et al., 2020) Table 1 ). In addition, the epitopes corresponding to the less abundant MHC-I and MHC-II alleles in the TGP were also predicted using the above protocol. These data were used for a subset of the analyses (see Fig. 4).

All analyses, computations and visualizations were performed using R version 4.0. The allele frequency distributions across ethnicities (Supplementary Supplementary Table 3 . In addition, the data represented in the heat maps were also used to hierarchically cluster the ethnic groups based on the HLA alleles they encode and the epitopes they could identify (Supplementary Figures 1-2, 4-5). The hclust function from the stats package was used for this purpose. The average epitope count (Supplementary Fig. 6 and 7) for each ethnicity was calculated as the summation of the epitope recognition potentials of all the epitopes identified by the HLA repertoire of that ethnicity. To probe for statistical differences in the average number of epitopes recognized by individuals among the various super-populations, a one-way ANOVA test was conducted after grouping the ethnicities into super-populations.
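The one-way ANOVA comparison described above can be sketched as follows. This is only a minimal illustration in Python (the study itself reports using R); the function name `one_way_anova_f`, the input layout and the numbers in the usage example are hypothetical placeholders, not values from this work. The sketch computes the F statistic and its degrees of freedom only; obtaining a p-value would additionally require the F distribution, which is left out here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for one-way ANOVA.

    groups: dict mapping a group label (e.g. a super-population) to a list
    of observations (e.g. average epitope counts of its ethnicities).
    """
    samples = [values for values in groups.values() if values]
    k = len(samples)
    n_total = sum(len(values) for values in samples)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = mean(v for values in samples for v in values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(values) * (mean(values) - grand_mean) ** 2 for values in samples)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(values)) ** 2 for values in samples for v in values)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical average epitope counts for three super-populations.
    toy = {"EUR": [10, 12, 11], "SAS": [14, 15, 16], "AFR": [9, 8, 10]}
    print(one_way_anova_f(toy))
```

A large F relative to the F distribution with the returned degrees of freedom would suggest that at least one super-population differs in its mean epitope count.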
Additionally, a t-test was performed to test whether the average number of epitopes recognized in a super-population (and/or ethnicity) differed from that of all other super-populations (and/or ethnicities) combined (Supplementary Table 6). The above tests were conducted for the CD8 and CD4 epitopes individually as well as in combination. The p-values were corrected for multiple testing using the Benjamini-Hochberg (BH) method. Further, a Principal Component Analysis (PCA) was performed to check the coherence of the antigen recognition potentials (in terms of the distribution of relevant HLA alleles) among the ethnicities (Fig. 3). The dudi.pca function from the ade4 package was used for this purpose.

The Nextstrain/augur pipeline (accessed on 21st April, 2020 from https://github.com/nextstrain/ncov) was used to align the 40,342 SARS-CoV2 genome sequences downloaded from GISAID with MAFFT (Katoh and Standley, 2013), using Wuhan-Hu-1/2019 as the reference sequence. Conforming to the Nextstrain protocol, 130 nucleotides from the 5'-end and 50 nucleotides from the 3'-end, as well as single nucleotide positions 18529, 29849, 29851 and 29853, were masked in the multiple sequence alignment (MSA). An initial maximum likelihood (raw) phylogenetic tree (GTR model) was built using IQ-TREE (Nguyen et al., 2015) and further refined using default parameters in the pipeline. The tree was then processed to construct a TimeTree having the ancestor of the following two genomes, Wuhan-Hu-1/2019 and Wuhan/WH01/2019, at its root. Finally, using augur's translation step, a translated MSA of all 14 SARS-CoV2 proteins was retrieved. The 487 epitopic regions (w.r.t. the reference genome Wuhan-Hu-1/2019) were cropped out from the translated MSA, and amino acid variations occurring in each epitope across the 40,342 SARS-CoV2 isolates were obtained. A total of 4194 variants of "reference" epitopes (VREs) could be identified from the 487 reference epitopes (see Table 1).
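The cropping of epitopic regions from a translated alignment, as described above, might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the pipeline used in the study: the function name `epitope_variants`, the dictionary-based alignment format, the choice to skip windows containing the ambiguity code `X`, and the toy sequences are all hypothetical.

```python
from collections import Counter


def epitope_variants(aligned_proteins, reference_id, epitope):
    """Count variants of one reference epitope in a translated alignment.

    aligned_proteins: dict mapping isolate id -> aligned protein sequence
    (all sequences of equal length, '-' marking alignment gaps).
    Returns a Counter of variant peptide -> number of isolates carrying it.
    """
    reference = aligned_proteins[reference_id]
    # Alignment columns that hold a residue (not a gap) in the reference.
    columns = [i for i, aa in enumerate(reference) if aa != "-"]
    ungapped = "".join(reference[i] for i in columns)
    start = ungapped.find(epitope)
    if start < 0:
        raise ValueError("epitope not found in the reference sequence")
    first_col = columns[start]
    last_col = columns[start + len(epitope) - 1]

    variants = Counter()
    for isolate, sequence in aligned_proteins.items():
        if isolate == reference_id:
            continue
        window = sequence[first_col:last_col + 1].replace("-", "")
        # Assumption: ambiguous (X) or fully deleted windows are ignored.
        if not window or "X" in window:
            continue
        if window != epitope:
            variants[window] += 1
    return variants


if __name__ == "__main__":
    # Hypothetical toy alignment; 'ref' stands in for the reference isolate.
    toy = {
        "ref": "MKT-AYIAKQR",
        "iso1": "MKT-AYIAKQR",
        "iso2": "MKT-AYVAKQR",
        "iso3": "MKT-AYVAKQR",
        "iso4": "MKT-AXIAKQR",
    }
    print(epitope_variants(toy, "ref", "AYIAK"))
```

Mapping the epitope through the ungapped reference keeps the window correct even when the alignment introduces gap columns upstream of the epitope.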
Further, 4328 VREs for the 505 epitopes that could be recognized by the alleles encoded by any of the individuals (samples) belonging to TGP were also obtained and used for the analysis depicted in Fig. 4.

To evaluate any possible effect of mutations accumulating in the SARS-CoV2 genome over time on immune recognition of the virus among different ethnicities, the mean epitope recognition potential for each individual belonging to an ethnic group was calculated with epitopes predicted from the first 10,000 and the last 10,000 SARS-CoV2 genomes, arranged in chronological order of their collection dates (Fig. 4). For this purpose, the epitope recognition potential of each individual was computed as the sum of the number of CD8 and CD4 epitopes that the individual can recognize using their HLA repertoire. Further, we also tried to estimate the variation in epitope recognition potential within each ethnic group across all 40,342 SARS-CoV2 genomes (Supplementary Figure 8), where the mean epitope recognition potential of the ethnicities for each genome was calculated and arranged in chronological order of collection date (in months). In this case, the epitope recognition potential of an ethnicity for a SARS-CoV2 variant was computed as described earlier (Supplementary Fig. 6).

TB, AD and SSM conceptualized the work and designed the protocol. NP, NKP and SB performed all the analyses. TB, NP, and AD wrote the manuscript. SSM supervised the project and reviewed the manuscript. All authors read and approved the final version of the manuscript.

The presented work is based on SARS-CoV2 genome sequence data retrieved from the GISAID repository (www.gisaid.org).
We also gratefully acknowledge the originating laboratories responsible for obtaining the specimens and the submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative (Supplementary File 2). All submitters of data may be contacted directly via GISAID.

Only those HLA alleles which could recognize the predicted SARS-CoV2 epitopes were considered. The variance explained by each of the first two principal components (PC1 and PC2) is indicated in brackets beside the respective axis labels. Factor loadings of key HLA alleles explaining maximum variance along the first two principal components are also tabulated.

Representation of the average predicted SARS-CoV2 epitope recognition potential of individuals from different populations (ethnic groups), considering different SARS-CoV2 genome variants observed across geographies. For each population (ethnic group), there are two box-plots indicating the average "weighted" epitope recognition potential (see Materials and Methods). For each population, the color-filled box on the left corresponds to the epitopes predicted from the first 10,000 SARS-CoV2 genomes (out of a total of 40,342 genomes analyzed in the study) when arranged in chronological order of their collection dates. The corresponding unfilled box on the right corresponds to the epitopes predicted from the last 10,000 SARS-CoV2 genomes based on their collection dates. Although the SARS-CoV2 genomes may be evolving over time, no observable variation could be noted in the way the epitope repertoire is recognized by individuals at an overall population level. Each data-point on the plot represents the average number of epitopes in a given SARS-CoV2 genome that could be identified by all the individuals in an ethnic population. Dark black lines, for each of the ethnic groups, indicate the mean value of these data-points for the studied 40,342 SARS-CoV2 genomes.
HLA data from the 1000 Genomes Project (TGP) database, which provides individual-specific allelic information, were used in this analysis. A total of 505 predicted epitopes from the reference SARS-CoV2 genome, along with 2716 epitope variants identified across 40,342 genomes and recognized by 292 HLA alleles, were used to calculate the individual epitope recognition potential. Super-population names are abbreviated as AFR-Africans; AMR-Amerindians; EAS-East Asians; EUR-Europeans; SAS-South Asians, and prefixed to each of the population names in the plots, as well as represented with boxes drawn in specific colors. The abbreviations used for names of different populations/ethnic groups are provided in Supplementary Table 2.

Design of a Multiepitope-Based Peptide Vaccine against the E Protein of Human COVID-19: An Immunoinformatics Approach
Immune diversity sheds light on missing variation in worldwide genetic diversity panels
Italian Network of Regional Transplant Coordinating Centers, 2021. HLA and AB0 Polymorphisms May Influence SARS-CoV-2 Infection and COVID-19 Severity
COVID-19 compared to other epidemic coronavirus diseases and the flu
Binding affinities of 438 HLA proteins to complete proteomes of seven pandemic viruses and distributions of strongest and weakest HLA peptide binders in populations worldwide
Time to abandon the hygiene hypothesis: new perspectives on allergic disease, the human microbiome, infectious disease prevention and the role of targeted hygiene
Immunocytometric analysis of COVID patients: A contribution to personalized therapy?
COVID-19 case-fatality rate and demographic and socioeconomic influencers: worldwide spatial regression analysis based on country-level data
COVID-19) [WWW Document
Risk of mortality in patients infected with SARS-CoV-2 variant of concern 202012/1: matched cohort study
Mortality due to COVID-19 in different countries is associated with their demographic character and prevalence of autoimmunity
Bioinformatics analysis of epitope-based vaccine design against the novel SARS-CoV-2. Infectious Diseases of Poverty 9
Emerging coronaviruses: Genome structure, replication, and pathogenesis
HLA-B*44 and C*01 Prevalence Correlates with Covid19 Spreading across Italy
Mutation Rates and Selection on Synonymous Mutations in SARS-CoV-2
Contriving Multi-Epitope Subunit of Vaccine for COVID-19: Immunoinformatics Approaches
VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines
Data, disease and diplomacy: GISAID's innovative contribution to global health
Treatment of COVID-19 - Evidence-Based or Personalized Medicine?
COVID19-Researchers, 2020.
Personalized therapy approach for hospitalized patients with COVID-19
Allele frequency net database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools
Case fatality risk of the SARS-CoV-2 variant of concern B.1.1.7 in England
SARS-CoV-2 variants, spike mutations and immune escape
Structure of replicating SARS-CoV-2 polymerase
Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor
Increased risk of severe clinical course of COVID-19 in carriers of HLA-C*04
BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes
MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability
Age and Sex Are Associated With Severity of Japanese COVID-19 With Respiratory Failure
Processing of the SARS-CoV pp1a/ab nsp7-10 region
Morphology, Genome Organization, Replication, and Pathogenesis of Severe Acute Respiratory Syndrome Coronavirus COVID-19) 23-31
The influence of HLA genotype on the severity of COVID-19 infection
Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nat Microbiol 1-8
Regulation of the ER Stress Response by the Ion Channel Activity of the Infectious Bronchitis Coronavirus Envelope Protein Modulates Virion Release, Apoptosis, Viral Fitness, and Pathogenesis
Epitope-based peptide vaccines predicted against novel coronavirus disease caused by SARS-CoV-2
Immunogenetic Predictors of Severe COVID-19.
Vaccines (Basel) 9
Geographic and Genomic Distribution of SARS-CoV-2 Mutations
Association between HLA genotypes and COVID-19 susceptibility, severity and progression: a comprehensive review of the literature
Antigen presentation in virus infection
Designing Multi-Epitope Vaccines to Combat Emerging Coronavirus Disease 2019 (COVID-19) by Employing Immuno-Informatics Approach
IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies
HLA allele frequencies and susceptibility to COVID-19 in a group of 99 Italian patients
Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool
Time series prediction of COVID-19 by mutation rate analysis using recurrent neural network-based LSTM model
TepiTool: A Pipeline for Computational Prediction of T Cell Epitope Candidates
Broad and strong memory CD4+ and CD8+ T cells induced by SARS-CoV-2 in UK
Correlation of the two most frequent HLA haplotypes in the Italian population to the differential regional incidence of Covid-19
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data
A bioinformatic prediction of antigen presentation from SARS-CoV-2 spike protein revealed a theoretical correlation of HLA-DRB1*01 with COVID-19 fatality in Mexican population: An ecological approach
BcePred: Prediction of Continuous B-Cell Epitopes in Antigenic Sequences Using Physico-chemical Properties
Structural Proteins in Severe Acute Respiratory Syndrome Coronavirus-2.
Archives of Medical Research 51
Identification of risk and protective human leukocyte antigens in COVID-19 using genotyping and structural modeling
Association of HLA Class I Genotypes With Severity of Coronavirus Disease-19
GISAID: Global initiative on sharing all influenza data - from vision to reality
Coronavirus replication-transcription complex: Vital and selective NMPylation of a conserved site in nsp9 by the NiRAN-RdRp subunit
Comparison of Clinical Features of COVID-19 vs Seasonal Influenza A and B in US Children
Explaining among-country variation in COVID-19 case fatality rate
An insight into the epitope-based peptide vaccine design strategy and studies against COVID-19
SARS-CoV-2 genomic variations associated with mortality rate of COVID-19
Understanding MHC Class I Presentation of Viral Antigens by Human Dendritic Cells as a Basis for Rational Design of Therapeutic Vaccines
Prediction of B-cell Linear Epitopes with a Combination of Support Vector Machine Classification and Amino Acid Propensity Identification
[WWW Document
Structural Basis for RNA Replication by the SARS-CoV-2 Polymerase
WHO | Variant analysis of SARS-CoV-2 genomes
COVID-19) Dashboard [WWW Document
Total predicted MHC-I epitope load is inversely associated with population mortality from SARS-CoV-2
Systematic Comparison of Two Animal-to-Human Transmitted Human Coronaviruses: SARS-CoV-2 and SARS-CoV
Increased circulating level of interleukin-6 and CD8+ T cell exhaustion are associated with progression of COVID-19
Anti-COVID-19 multi-epitope vaccine designs employing global viral genome sequences
The PickPocket method for predicting binding specificities for receptors based on receptor pocket similarities: application to MHC-peptide binding

Supplementary Table 1: Epitopes predicted from the SARS-CoV2 reference proteome.
487 epitopes predicted by NetMHCpan and NetMHCIIpan from the SARS-CoV2 reference proteome, along with the corresponding HLA alleles involved in their recognition, are presented in the table. Only those epitopes identified with prediction scores ≥ 0.95 are reported. For each epitope, the number of alleles interacting with it, along with the best scoring allele and the corresponding prediction score, is also mentioned.

Supplementary Table 2: Populations considered in the study. Population names, ethnicity and sample sizes corresponding to the HLA allele data used in the study are provided in this table. The respective data sources (AFND or TGP) are also indicated.

Supplementary Table 3: Observed richness of HLA alleles and epitopes. Table denoting the observed richness of MHC-I and MHC-II alleles among the 82 ethnic populations belonging to seven super-populations, as well as the observed richness of CD8-specific and CD4-specific SARS-CoV2 epitopes recognized by the respective HLA alleles.

Supplementary Table 4: Intersecting and unique sets of HLA alleles and SARS-CoV2 epitopes. (A) Intersecting and unique sets between the MHC-I and MHC-II alleles in the seven super-populations. (B) Intersecting and unique sets between CD8-specific and CD4-specific SARS-CoV2 epitopes recognized by the HLA alleles prevalent in the seven super-populations. Super-population names are abbreviated as AFR-Africans; AMR-Amerindians; EAS-East Asians; EUR-Europeans; SAS-South Asians; MEA-Middle East and Africans; OCN-Oceanians.

Supplementary Table 6: t-test results depicting variations in the SARS-CoV2 epitope recognition potential of ethnic groups and super-populations. (A) Differences in the epitope recognition potential of a specific ethnic group w.r.t. the overall epitope recognition potential of all other ethnic groups taken together are depicted. (B) Differences in the epitope recognition potential of a specific super-population w.r.t. the overall epitope recognition potential of all other super-populations taken together are depicted.
The p-values obtained from the t-tests were corrected for multiple testing using the Benjamini-Hochberg (BH) method. Significant differences are highlighted in red (BH-corrected p-value < 0.05).

Table listing variants of epitopes identified from the reference SARS-CoV2 genome along with their distribution frequencies in the genomes isolated and sequenced in different geographies. Sequences of the reference epitopes and variants, along with their positions in the respective proteins, are also mentioned.

Table listing variants of epitopes which are present in >= 0.5% of sampled genomes. The respective alleles recognizing the reference epitopes (RE) and the variant epitopes (VE) are also presented in the table, and the alleles differentially recognizing the RE and VE are highlighted in bold font. The sub-populations reporting the maximum frequencies for the alleles differentially recognizing RE and VE, as well as the number of sub-populations where they are not reported at all, are also indicated in the table.

Supplementary Table 9: Mutations in epitopic regions observed in variants of concern. Table listing the variants of reference epitopes observed in four variants of concern, viz. Alpha (UK), Beta (South Africa), Gamma (Brazil) and Delta (India). The respective HLA alleles recognizing the variant epitopes are also indicated.

(s) where the epitope has been predicted to be well recognized as well as the VaxiJen scores for the respective epitopes are indicated. Epitopes (and their variants) which were found to be present in at least 10 studied SARS-CoV2 genomes and were recognized by the HLA-system of at least 5% of individuals from any of the seven super-populations are presented in the table. The presence of these peptides in SARS coronavirus and MERS coronavirus proteomes is also indicated.

Supplementary Fig. 4: Distribution of CD8-specific epitopes recognized by the HLA-system among the ethnic groups.
Heat-map depicting the distribution of CD8-specific epitopes that could be recognized by the HLA alleles prevalent among the 82 ethnic groups involved in this study. Both the ethnic groups and the CD8-specific epitopes have been hierarchically clustered. An additional colour-key along the vertical axis indicates the number of human HLA-types capable of recognizing the epitopes. Along the horizontal axis, ethnic groups have been tagged with different colours based on their affiliations to respective super-populations.

Supplementary Fig. 5: Distribution of CD4-specific epitopes recognized by the HLA-system among the ethnic groups. Heat-map depicting the distribution of CD4-specific epitopes that could be recognized by the HLA alleles prevalent among the 82 ethnic groups involved in this study. Both the ethnic groups and the CD4-specific epitopes have been hierarchically clustered. An additional colour-key along the vertical axis indicates the number of human HLA-types capable of recognizing the epitopes. Along the horizontal axis, ethnic groups have been tagged with different colours based on their affiliations to respective super-populations.

Representation of the mean SARS-CoV2 epitope recognition potential among different ethnic groups with respect to SARS-CoV2 genomes sequenced until 11th June 2020, as obtained from GISAID (https://www.gisaid.org/). Each box plot denotes an ethnic population; each data-point on the plot represents the average number of epitopes from SARS-CoV2 genomes (binned according to the month of collection provided by GISAID) that could be identified by individuals in that ethnic population.

Supplementary Fig. 9: Variations observed in epitopes and corresponding proteins in SARS-CoV2 genomes. Plot representing the number of variations in the amino acid sequence of the epitopes and corresponding proteins across 40,342 SARS-CoV2 genomes, normalized by the epitope and protein lengths respectively.
The number of predicted epitopes identified in each individual protein is mentioned alongside the protein name.

Supplementary Fig. 10: Distribution of epitopes shortlisted as vaccine peptides across super-populations. Euler representation of (A) CD8-specific and (B) CD4-specific SARS-CoV2 epitopes recognized by the HLA alleles in the seven super-populations. Epitopes observed in at least 10 SARS-CoV2 genomes and potentially recognized by at least 5% of the individuals representing a super-population have been considered. The CD8-specific and CD4-specific epitopes at the intersection of the seven super-populations could serve as potential vaccine candidates. Super-population names are abbreviated as AFR-Africans; AMR-Amerindians; EAS-East Asians; EUR-Europeans; MEA-Middle East and Africans; OCN-Oceanians; SAS-South Asians.