key: cord-0844891-ayo2na5v authors: Romero‐López, JP; Carnalla‐Cortés, M; Pacheco‐Olvera, DL; Ocampo‐Godínez, JM; Oliva‐Ramírez, J; Moreno‐Manjón, J; Bernal‐Alferes, B; López‐Olmedo, N; García‐Latorre, E; Domínguez‐López, ML; Reyes‐Sandoval, A; Jiménez‐Zamudio, L title: A bioinformatic prediction of antigen presentation from SARS‐CoV‐2 spike protein revealed a theorical correlation of HLA‐DRB1*01 with COVID‐19 fatality in Mexican population: an ecological approach date: 2020-09-28 journal: J Med Virol DOI: 10.1002/jmv.26561 sha: 319f28ff28cc26279263ccfd51440b05bc1e8b08 doc_id: 844891 cord_uid: ayo2na5v

SARS‐CoV‐2 infection is causing a pandemic disease that poses challenging public health problems worldwide. HLA‐based epitope prediction and its association with disease outcomes provide an important basis for treatment design. A bioinformatic prediction of T cell epitopes and their restricted HLA class I and II alleles was performed to obtain immunogenic epitopes and HLA alleles from the spike protein of the SARS‐CoV‐2 virus. A correlation with the predicted fatality rate of hospitalized patients in 28 states of Mexico was also assessed. Here, we describe a set of ten highly immunogenic epitopes, together with different HLA alleles that can efficiently present these epitopes to T cells. Most of these epitopes are located within the S1 subunit of the spike protein, suggesting that this area is highly immunogenic. A statistically significant negative correlation was found between the frequency of HLA‐DRB1*01 and the fatality rate in hospitalized patients in Mexico.

The coronavirus disease was declared a pandemic by the World Health Organization (WHO) in March 2020. 1 It is estimated that by June 10th, 2020 there were over 6.19 million confirmed cases and 370 thousand deaths worldwide. 
COVID-19 is a disease caused by the novel severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2), with a wide range of clinical manifestations, such as fever (88.7%), cough (67.8%), fatigue (38.1%), and acute respiratory distress syndrome (ARDS) in severe cases. 2 Interestingly, the molecular and clinical manifestations of the disease vary between asymptomatic, mildly symptomatic, and severe patients, requiring hospitalization in some cases to prevent fatal outcomes. 3 Currently, the SARS-CoV-2 genome has been characterized as a new betacoronavirus, which shares around 87% genomic identity with the bat-SL-CoVZC45 and bat-SL-CoVZXC21 viruses. 4 A recent analysis by Zhou et al. reported 96.2% identity with BatCoVRaTG13 and 79.5% identity with SARS-CoV. 2, 5 The genomic characterization of the virus not only provides information about its taxonomy and probable origin but also offers opportunities to perform deeper analyses using bioinformatics tools.

The angiotensin-converting enzyme-2 (ACE-2) receptor and the transmembrane serine protease 2 (TMPRSS2) are essential components of the human host for virus entry into the upper respiratory epithelial cells. The virus recognizes ACE-2 through the viral spike glycoprotein (S), and this event leads to virus-cell membrane fusion. 6 The S glycoprotein is a homotrimer of three identical monomers, each of which is divided into two subunits: S1 and S2. The first subunit folds into four domains: A, B, C, and D. The B domain contains a receptor-binding domain (RBD) that recognizes ACE-2; hence, it is important for viral entry. 7 The S2 subunit sequence has two tandem domains, named HR1 and HR2, that play an essential role in viral fusion to the membrane. 8 Furthermore, analysis of the spike protein showed that it is conserved between SARS-CoV and SARS-CoV-2, with 76.3% identity and 87.3% similarity. 
9 Several studies focused on viral diseases have shown that clinical severity is closely associated with individual factors, such as genetic background and immune response. The human leukocyte antigen (HLA) is responsible for antigen presentation to T cells and is, therefore, a key component in initiating the adaptive immune response. The HLA genes are the most polymorphic genes in the human genome, and these polymorphisms influence the ability to present different sets of epitopes to T cells. Some HLA molecules are more efficient than others at presenting certain antigens, which may lead to a better induction of immune responses. This has already been shown for some viral diseases, such as influenza A (H1N1) 10 and HIV. 11 An association has been previously reported between SARS-CoV infection and HLA-B*07:03, 12 HLA-Cw*08:01, 13 HLA-B*46:01, and HLA-B*54:01. Specifically, it has been reported that individuals who are HLA-B*46:01 positive have a higher risk of severe infection, 14 whereas the frequency of HLA-DRB1*03:01 is lower among COVID-19 patients. 12

Mexico is one of the ten countries with the highest mortality, and its number of cases and deaths keeps increasing significantly. 15 So far, the control of the COVID-19 pandemic remains a challenge, with thousands of new cases and deaths reported daily. It is necessary to find prophylaxis and specific treatments to contain this uncontrolled infection and to reduce global morbidity and mortality. The generation of a vaccine that targets this virus remains the primary solution; 19 however, the lack of knowledge regarding the immune response, such as the HLA-virus interactions, makes it a challenging task. Additionally, the genetic variations among different populations and their possible link with responses to SARS-CoV-2 remain unknown. 
In this report, we use bioinformatic tools to analyze which epitopes of the SARS-CoV-2 spike protein are highly immunogenic and able to be presented by HLA class I and II in different populations. We also demonstrate an ecological correlation between HLA allele frequency and the predicted fatality rate in hospitalized patients of 28 Mexican states.

A bioinformatic epitope prediction of the spike glycoprotein was performed. This gave information about the most immunogenic peptide-HLA matches and the HLA alleles that are more likely to present these epitopes efficiently. In addition, an ecological study was carried out to look for correlations between the HLA allele frequencies and the predicted fatality rate of hospitalized COVID-19 patients up to May 29th, 2020.

Bioinformatic analyses were performed to predict HLA class I and II epitopes using the sequence of the Table 1 ). 21 Once the total epitope list was obtained, it was submitted to the T cell class I pMHC immunogenicity predictor server (http://tools.iedb.org/immunogenicity/) to get the immunogenicity score, which is predicted according to the amino acid residues of the peptide. 22 The peptide-HLA pairs with a positive immunogenicity score and a predicted IC50 level lower than the established cut-off (Supplementary Table 2) were chosen from the complete list, 23 considering that the lower the IC50 value, the higher the binding affinity. The ten most immunogenic peptide-MHC combinations from this list were selected. The epitopes for HLA class II molecules were also predicted by submitting the same sequence to the IEDB MHC class II epitope prediction tool (http://tools.iedb.org/mhcii/), using the IEDB recommended 2.2 algorithm and the most common HLA-DP, DQ and DR alleles (Supplementary Table 1). 24 The predicted epitopes with an SMM-predicted IC50 value higher than 50 were excluded and the sequences were ordered by percentile rank. 
25 The MHC-II prediction tools use a core of nine amino acids to predict the best peptide binding affinity, even though class II molecules bind peptides that are 15 amino acids long, so the ten SMM cores with the lowest percentile rank (that is, the highest binding affinity) were selected.

To provide a graphical representation of the epitope locations, we used a structural model of the full-length SARS-CoV-2 spike glycoprotein (ID:6VSB_1_1_1). The full-length SARS-CoV-2 structural model is available at CHARMM-GUI13 COVID19 Archive. 26 The 3D structure was obtained and analyzed using PyMOL® software (Schrödinger LLC. Molecular Graphics System (PyMOL). Version 1·80 LLC, New York, NY. 2015). The online basic local alignment search tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to assess the position of the predicted peptides in the glycoprotein, and the protein sequence was adjusted manually using the PyMOL tools.

We selected 28 states of Mexico considering the homogeneity in epidemiological reports and recorded the allele frequencies of the capital city of each state. All the states were included except for Mexico state, Baja California Sur, and Tamaulipas, because no information was found for them. We used the Allele Frequency Net Database (AFND, http://www.allelefrequencies.net/default.asp), searched for populations in North America's geographical region, and used Mexico's (132) database. The total number of samples reported for these states in the database was 5840. For HLA class I, the subgroup alleles were not reported for 26 of the states. However, Mexico City Mestizo and Veracruz Xalapa did contain subgroup data. In the selection of the class II molecules, the HLA-DPA1*03:01, DPB1*04:02, HLA-DPA1*01:03, DPB1*02:01, HLA-DPA1*02:01, DPB1*01:01, and HLA-DQA1*05:0 alleles were not found in the database of any population. All the frequency data are summarized in Supplementary Table 3, organized by city. 
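The selection rules described above for the class I and class II predictions can be sketched as a small filtering step. This is only an illustrative sketch, not the authors' pipeline: the record types, field names, and function names are hypothetical, the allele-specific class I cut-offs stand in for the values of Supplementary Table 2 (not reproduced here), and the predictions themselves would come from exported IEDB results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassIPrediction:      # hypothetical record for one peptide-HLA class I pair
    peptide: str
    allele: str
    ic50: float              # predicted IC50; lower means stronger binding
    immunogenicity: float    # class I immunogenicity score


@dataclass(frozen=True)
class ClassIIPrediction:     # hypothetical record for one 9-mer core-HLA class II pair
    core: str
    allele: str
    smm_ic50: float          # SMM-predicted IC50
    percentile_rank: float   # lower rank means higher predicted affinity


def select_class_i(predictions, ic50_cutoffs, top_n=10):
    """Keep pairs with a positive immunogenicity score and an IC50 below the
    allele's cut-off, then return the top_n most immunogenic pairs."""
    kept = [
        p for p in predictions
        if p.immunogenicity > 0 and p.ic50 < ic50_cutoffs[p.allele]
    ]
    kept.sort(key=lambda p: p.immunogenicity, reverse=True)
    return kept[:top_n]


def select_class_ii(predictions, max_ic50=50.0, top_n=10):
    """Exclude predictions with an SMM IC50 above max_ic50, then return the
    top_n cores with the lowest percentile rank."""
    kept = [p for p in predictions if p.smm_ic50 <= max_ic50]
    kept.sort(key=lambda p: p.percentile_rank)
    return kept[:top_n]
```

Keeping the cut-offs in a per-allele mapping mirrors the text's reference to an established cut-off table; a single global threshold would be a simplification of that design.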
We used national public data reporting all individuals with a result for SARS-CoV-2 in Mexico to July 8th. The total database contained 684 804 records. We only included records of phase 3 (614 370). We excluded 60 520 patients who were admitted for hospitalization after July 1st to allow the presentation of the outcome "death", since the median time from hospitalization to death was 7 days. Of the 553 850 remaining records, 307 state of birth, and 1 026 were indigenous people. We excluded 448 pregnant women because the immune response is expected to be different. 28 Finally, 9 records were eliminated because the date of death was before the admission date. Hence, our final sample was 71 099 records.

To create a predictive model of the hospitalized fatality rate (number of deaths caused by COVID-19), we performed a stepwise approach with all the variables reported in the SARS-CoV-2 Mexico database in a Table 4 ). Afterward, a Spearman rank correlation was performed between the seven HLA allele frequencies and the risk of death at state level. A p-value <0.05 was considered statistically significant. The analyses were performed in Stata v14 and figures were created using GraphPad Prism version 6.0®.

To assess the best spike protein epitope-HLA class I matches, its sequence was analyzed looking for epitope predictions in the most frequent HLA-A and HLA-B alleles. The ten most immunogenic peptides with higher binding affinity to their restricted HLA are shown in Table 1. Although the most immunogenic peptide in this list is GTHWFVTQR, the match with the highest affinity was between the peptide FIAGLIAIV and HLA-A*02:03. Of note, here we analyzed the most frequent class I A and B alleles, so this analysis reveals epitopes that can be used for vaccine development and the HLA alleles that best present epitopes of this particular protein. 
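For readers unfamiliar with the Spearman rank correlation used in the state-level analysis, the coefficient can be sketched as a Pearson correlation computed on ranks, with tied values sharing their average rank. This is a minimal standard-library sketch and not the authors' Stata code; the function names are hypothetical, the inputs would be one allele frequency and one predicted fatality rate per state, and the significance test (p-value) is not reproduced.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sd_x * sd_y)
```

A negative coefficient from `spearman_rho(allele_frequencies, fatality_rates)` would indicate that states with a higher allele frequency tend to rank lower in predicted fatality rate, which is an association between state-level aggregates rather than between individuals.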
The best epitopes and HLA class II alleles were also predicted, as shown in Table 2.

To track down and illustrate the specific location of the peptides in the SARS-CoV-2 spike glycoprotein, the corresponding 3D model was obtained. In this model, the different predicted epitopes (Table 1 and Table 2) were located in the protein structure considering its subunits and domains (Figure 1). Notably, HLA class I peptides WTAGAAAYY, SANNCTFEY, and YLQPRTFLL (7, 8, and 9) are located in the A domain, which is highly conserved among other coronavirus species, 8 suggesting that these could also be epitopes for other coronaviruses. On the other hand, the class II epitopes FELLHAPAT, VVVLSFELL, FLVLLPLVS, VLSFELLHA, and FTISVTTEI (a, b, c, d, and h) and the HLA class I epitope EVFNATRFA (4) are preferentially found in the B domain.

After factorial analysis, we found a significant negative correlation between the frequency of the HLA-DRB1*01:01 allele and the predicted fatality rate in hospitalized patients (R = -0.44, p-value = 0.02) (Figure 2). No other significant correlations were observed (Table 3).

Determining HLA interactions with epitopes for optimal presentation is crucial for understanding the immunological response to SARS-CoV-2. Here, we present a group of epitopes of the spike protein that Remarkably, our structural analysis of the protein shows a higher abundance of epitopes in the A and B domains of the S1 subunit of the virus, indicating that, if this part of the protein is processed by the host cells, it could represent a highly immunogenic region. In this analysis, we did not look for B cell epitopes in the structure of the protein. We cannot confirm that targeting the presented epitopes would interfere with viral function, as would be the case for neutralizing antibodies. The sequence of the HLA peptide-binding groove determines which epitopes from an antigen are presented to the immune system to elicit an effective response. 
The high rate of polymorphisms in the HLA locus can indicate that different individuals differ in their ability to respond to certain antigens. Furthermore, some HLA alleles can be more efficient in presenting certain antigens, and thus also in protecting from certain infections. 11 Our analysis of the most representative HLA alleles revealed those that present the spike protein antigens of SARS-CoV-2 more effectively; hence, one can hypothesize that their presence in an individual might confer an enhanced ability to defend against the virus. To assess this, we analyzed the frequency of these alleles and their relation to the disease dynamics in different states of Mexico. Although it would be interesting to extrapolate these results to several countries with different epidemiological behavior of the disease, epidemiological reports would be highly heterogeneous, and individual-level data associated with the risk of death would be needed to adjust the fatality rate.

While there is a myriad of factors related to the lethality of the disease, little is known about the involvement of the immune system in this regard. It has been proposed that many patients develop an exaggerated immune response against the infection, accompanied by a cytokine release syndrome 33 or autoinflammatory syndromes. 34 Also, Grifoni et al. showed that T helper cell responses (initiated by HLA class II molecules) seem to be protective against the infection through a strong correlation with the production of virus-specific antibodies, and also that they are highly represented by S-protein specific clones. 35 A significant negative correlation was found between the frequency of the class II HLA-DRB1*01 allele and the fatality rate in hospitalized patients from the states that were included. Notably, this correlation was weak, suggesting that other important factors apart from HLA could be involved in the protection. 
Therefore, it is plausible that the correlation we found, based on bioinformatic predictions, means that these alleles could confer some degree of protection against lethal outcomes of the disease. However, the frequency of this specific allele is low in the different states, so the overall effect on fatality rates might be small. Thus, further experimental studies are needed to reinforce these outcomes.

HLA-DRB1*01 alleles have been previously associated with multiple sclerosis resistance. 36 Nevertheless, their role in the susceptibility to viral diseases remains poorly understood. A recent report demonstrated, using molecular docking, that this molecule can interact with the VYQLRARSV epitope from the ORF-7a protein of the SARS-CoV-2 virus. 37 Our results revealed an ecological negative correlation for this allele and showed that it can present a set of epitopes. Previous reports have identified that this allele can present at least nine epitopes of the M protein and 11 of the N protein (Supplementary Table 4), revealing that this molecule can be highly relevant for SARS-CoV-2 immunity.

A notable characteristic of this study is that we narrowed it to the S protein, which has been the most used target for vaccine development. Considering that we did not include other viral proteins, we made an exhaustive bibliographic review that allowed us to compile a total of 77 T cell epitopes for the M protein and 87 for the N protein that were already evaluated experimentally, and we included an analysis of the HLA alleles used for their prediction. As shown in the Supplementary Table 4

Several limitations need to be acknowledged. First, the association between the frequency of the HLA allele and the fatality rate is ecological and cannot be applied at an individual level. Other studies need to be conducted to explore whether the association persists at an individual level in hospitalized patients. 
Second, the predictive model of the fatality rate was built using only data from hospitalized patients. Given that different comorbidities can lead to hospitalization, we cannot exclude the possibility of collider bias. That is, conditioning the analysis on hospitalization can produce biased associations between the risk factors and the outcome, "fatality rate" in this case. Third, we do not rule out the possibility of misclassification, since the information on comorbidities is self-reported. However, our aim was not to make an inference about individual-level factors of the fatality rate, but rather to create a predictive model that was as unbiased as possible. Fourth, there may be other state characteristics associated with death that are not considered in the model, such as the health infrastructure or the number of available specialized medical staff. Finally, the HLA allele frequencies do not include minorities such as the indigenous population, who might have different HLA allele frequencies.

[Figure caption fragment] shown in red (C) and the suggested peptides for HLA class II in blue. The peptides are marked individually, listed from 1-10 for class I and a-j for class II, corresponding to the immunogenicity Table 1 and Table 2. 
World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19)
Clinical Characteristics of Coronavirus Disease 2019 in China
COVID-19 Illness in Native and Immunosuppressed States: A Clinical-Therapeutic Staging Proposal
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
A pneumonia outbreak associated with a new coronavirus of probable bat origin
SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor
Structural insights into coronavirus entry
Receptor recognition by novel coronavirus from Wuhan: An analysis based on decade-long structural studies of SARS
Immunoinformatics-aided identification of T cell and B cell epitopes in the surface glycoprotein of 2019-nCoV
An increased frequency in HLA class I alleles and haplotypes suggests genetic susceptibility to influenza A (H1N1) 2009 pandemic: A case-control study
HLA-B57/B*5801 Human Immunodeficiency Virus Type 1 Elite Controllers Select for Rare Gag Variants Associated with Reduced Viral Replication Capacity and Strong Cytotoxic T-Lymphocyte Recognition
Association of Human-Leukocyte-Antigen Class I (B*0703) and Class II (DRB1*0301) Genotypes with Susceptibility and Resistance to the Development of Severe Acute Respiratory Syndrome
Epidemiological and genetic correlates of severe acute respiratory syndrome coronavirus infection in the hospital with the highest nosocomial infection rate in Taiwan in 2003
Association of HLA class I with severe acute respiratory syndrome coronavirus infection
Download Today's Data on the Geographic Distribution of COVID-19 Cases Worldwide
Allele frequency net database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools
Genetic diversity of HLA system in six populations from Mexico City Metropolitan Area
HLA class I and class II haplotypes in admixed families from several regions 
of Mexico
Epitope-based vaccine target screening against highly pathogenic MERS-CoV: An In Silico approach applied to emerging infectious diseases
TepiTool: A pipeline for computational prediction of T cell epitope candidates
Comprehensive analysis of dengue virus-specific responses supports an HLA-linked protective role for CD8+ T cells
Properties of MHC Class I Presented Peptides That Enhance Immunogenicity
HLA Class I Alleles Are Associated with Peptide-Binding Repertoires of Different Size, Affinity, and Immunogenicity
Peptide binding predictions for HLA DR, DP and DQ molecules
Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method
Coronavirus Disease 2019 (COVID-19) and pregnancy: what obstetricians need to know
The COVID-19 vaccine development landscape
The starting line for COVID-19 vaccine development
Human leukocyte antigen susceptibility map for SARS-CoV-2
Binding affinities of 438 HLA proteins to complete proteomes of seven pandemic viruses and distributions of strongest and weakest HLA peptide binders in populations worldwide
COVID-19: immunopathology and its implications for therapy
Autoimmune and inflammatory diseases following COVID-19
Targets of T Cell Responses to SARS-CoV-2 Coronavirus in Humans with COVID-19 Disease and Unexposed Individuals
Epitope based vaccine prediction for SARS

The authors did not receive any funding for this study. All authors declare that they have no conflict of interest. The data that support the findings of this study are available in the supplementary material of this article.