key: cord-0845419-8p1agcm2 authors: Grifoni, Alba; Sidney, John; Zhang, Yun; Scheuermann, Richard H; Peters, Bjoern; Sette, Alessandro title: Candidate targets for immune responses to 2019-Novel Coronavirus (nCoV): sequence homology- and bioinformatic-based predictions date: 2020-02-20 journal: bioRxiv DOI: 10.1101/2020.02.12.946087 sha: f6f1ec12291f743de1504003929ea654217f2af6 doc_id: 845419 cord_uid: 8p1agcm2
Effective countermeasures against the recent emergence and rapid expansion of the 2019-Novel Coronavirus (2019-nCoV) require the development of data and tools to understand and monitor viral spread and immune responses. However, little information about the targets of immune responses to 2019-nCoV is available. We used the Immune Epitope Database and Analysis Resource (IEDB) to catalog available data related to other coronaviruses, including SARS-CoV, which has high sequence similarity to 2019-nCoV and is the best-characterized coronavirus in terms of epitope responses. We identified multiple specific regions in 2019-nCoV that have high homology to SARS-CoV. Parallel bioinformatic predictions identified a priori potential B and T cell epitopes for 2019-nCoV. The independent identification of the same regions using two approaches reflects the high probability that these regions are targets for immune recognition of 2019-nCoV. We identified potential targets for immune responses to 2019-nCoV and provide essential information for understanding human immune responses to this virus and for evaluating diagnostic and vaccine candidates.
[...] map corresponding regions in the 2019-nCoV sequences, and predict likely epitopes. We also used validated bioinformatic tools to predict B and T cell epitopes that are likely to be recognized in humans, and to assess the conservation of these epitopes across different coronavirus species.
Coronaviruses belong to the family Coronaviridae, order Nidovirales, and can be further subdivided into four main genera (Alpha-, Beta-, Gamma- and Deltacoronaviruses). Several Alpha- and Betacoronaviruses cause mild respiratory infections and common cold symptoms in humans, while others are zoonotic and infect birds, pigs, bats and other animals. In addition to 2019-nCoV, two other coronaviruses, SARS-CoV and MERS-CoV, caused large disease outbreaks that had high (10-30%) lethality rates and widespread societal impact upon emergence (Fig 1) (3, 4).
The immune response to 2019-nCoV in humans awaits characterization, but human immune responses against other coronaviruses have been investigated. As of January 27, 2020, the IEDB has curated 581 linear and 81 discontinuous B cell epitopes that have been reported in the peer-reviewed literature. In addition, 320 peptides have been reported as T cell epitopes (Table 1A). The vast majority of these epitopes are derived from Betacoronaviruses, and more specifically from SARS-CoV, which alone accounts for over 60% of them. In terms of the host in which the various B and T cell epitopes were recognized (Table 1B), most epitopes (either B or T) were defined in human or murine systems. Notably, all but two of the 417 B and T cell epitopes described in humans are from Betacoronaviruses, with 398 of them coming from SARS-CoV.
Comparison of a consensus 2019-nCoV protein sequence to sequences for SARS-CoV, MERS-CoV and bat-SL-CoVZXC21 revealed a high degree of similarity (expressed as % identity) between 2019-nCoV, bat-SL-CoVZXC21 and SARS-CoV, but more limited similarity with MERS-CoV (Figure 1). In conclusion, SARS-CoV is the virus most closely related to 2019-nCoV for which a significant number of epitopes has been defined in humans (and other species), and that also causes human disease with lethal outcomes.
bioRxiv preprint doi: https://doi.org/10.1101/2020.02.12.946087. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Accordingly, in the following analyses we focused on comparing known SARS-CoV epitope sequences to the 2019-nCoV sequence. We first assessed the distribution of SARS-derived epitopes as a function of protein of origin (Table 1C). In the context of B cell responses, most of the 12 antigens in the SARS-CoV proteome are associated with epitopes, with the greatest number derived from spike glycoprotein, nucleoprotein and membrane protein (Table 1C). The paucity of B cell epitopes associated with the other proteins is likely because, on average, B cell epitope screening studies to date have probed regions constituting less than 20% of each respective sequence, including <1% of the Orf 1ab polyprotein. By comparison, the complete span of the spike glycoprotein, nucleoprotein and membrane protein sequences has been probed at least to some extent in B cell assays. A similar situation was observed in the case of T cell epitopes. Here we only considered epitopes whose recognition is restricted by human (HLA) MHC, since MHC polymorphism typically results in different epitopes being recognized in humans and mice.
B cell epitopes derived from SARS-CoV were mapped back to a SARS-CoV reference sequence using the IEDB's ImmunomeBrowser tool (5). This tool combines all records available along a reference sequence and produces a Response Factor (RF) score that accounts for the positivity rate (how frequently a residue was found in a positive epitope) and the number of records (how many independent assays are reported). Dominant regions were identified as residue stretches where the RF score was ≥ 0.3. Analyses of the spike glycoprotein, membrane protein and nucleoprotein are shown in Figure 2.
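As a rough illustration of the RF thresholding described above, the following Python sketch computes a score from the formula given in the Methods, RF = [r - sqrt(r)]/t, and then extracts contiguous residue stretches whose score is ≥ 0.3. The function names and the per-residue counts are hypothetical toy values chosen for illustration; this is not the IEDB tool's implementation, and the sliding-window smoothing mentioned in the Methods is omitted.

```python
import math


def response_frequency(responders: int, tested: int) -> float:
    """RF = (r - sqrt(r)) / t; returns 0.0 when no donors were tested."""
    if tested <= 0:
        return 0.0
    return (responders - math.sqrt(responders)) / tested


def dominant_regions(scores, threshold=0.3):
    """Return 1-based, inclusive (start, end) stretches with score >= threshold."""
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = pos  # open a new stretch
        elif start is not None:
            regions.append((start, pos - 1))  # close the current stretch
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions


# Toy (responders, tested) counts per residue; not taken from the paper.
counts = [(0, 10), (9, 18), (16, 20), (16, 20), (1, 10), (25, 40), (25, 40)]
scores = [response_frequency(r, t) for r, t in counts]
print(dominant_regions(scores))  # [(2, 4), (6, 7)]
```

Subtracting sqrt(r) penalizes scores that rest on few responders, so a residue with 1 of 10 positive donors scores 0 rather than 0.1.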
In the case of the spike glycoprotein (Fig 2A), we identified five regions of potential interest (residues 274-306, 510-586, 587-628, 784-803 and 870-893), all associated with high immune response rates. Two regions were identified for membrane protein (1-25 and 131-152) (Fig 2B) and three regions were identified for nucleoprotein (43-65, 154-175 and 356-404) (Fig 2C). Table 2 summarizes these analyses, identifying the specific regions associated with dominant B cell responses.
Next, we aligned the SARS-CoV B cell epitope region sequences to the 2019-nCoV sequence to calculate the percentage identity between each of the SARS-CoV dominant regions and 2019-nCoV (Table 2). Of the ten regions identified, six had 90% or more identity with 2019-nCoV, two were between 80-89% identical, and two had lower but still appreciable homology (69% and 78%). Because of the overall high level of sequence similarity between SARS-CoV and 2019-nCoV, we infer that the regions that are dominant in SARS-CoV are highly likely to also be dominant in 2019-nCoV, even if the actual sequences differ.
In a similar analysis, T cell epitopes were also found to be predominantly associated with spike glycoprotein and nucleoprotein (Table 1C). However, in these cases, epitopic regions and individual epitopes were more widely dispersed throughout the respective proteins, which made identification of discrete, dominant epitopic regions more difficult. This outcome is not unexpected, since T cells recognize short peptides generated from cellular processing of viral antigens, and these peptides can be derived from any segment of the protein. Table 3 lists the most dominant SARS-CoV individual epitopes identified to date in humans.
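The percentage identity values reported for aligned regions and epitopes (Tables 2 and 3) can be illustrated with a minimal Python sketch. It assumes the two sequences have already been aligned to equal length, with gaps written as "-", and that identity is expressed relative to the full aligned length; both are assumptions made here, since the text does not state the exact convention. The function name and the two toy sequences are hypothetical and are not epitopes from the paper.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned positions carrying the same residue (gaps never match)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Toy 10-residue sequences differing at one position.
print(percent_identity("ACDEFGHIKL", "ACDEFGHIKV"))  # 90.0
```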
We also aligned the SARS-CoV T cell epitope sequences and calculated for each epitope the percentage identity to 2019-nCoV. For each T cell epitope, Table 3 shows the antigen of origin, the epitope sequence, the homologous 2019-nCoV sequence, and the corresponding percentage of sequence identity. Overall, the nucleocapsid phosphoprotein and membrane-derived epitopes were most conserved (8/10 and 2/3, respectively, had ≥ 85% identity with 2019-nCoV). The Orf1ab and surface glycoprotein epitopes were moderately conserved (3/7 and 10/23, respectively, had ≥ 85% identity with 2019-nCoV), and Orf 3a epitopes were the least conserved.
To define potential B cell epitopes by an alternative method, we used the predictive tools provided with the IEDB Analysis Resource. B cell epitope predictions were carried out using the 2019-nCoV surface glycoprotein, nucleocapsid phosphoprotein, and membrane glycoprotein sequences, which, as described above, were found to be the main protein targets for B cell responses to other coronaviruses. In parallel, we performed predictions for linear B cell epitopes with BepiPred 2.0 (6), and for conformational epitopes with Discotope 2.0 (7) [...]
To predict CD4 T cell epitopes, we used the method described by Paul and co-authors (9), as implemented in the TepiTool resource in IEDB (10). This approach was designed and validated to predict dominant epitopes independently of ethnicity and HLA polymorphism, taking advantage of the extensive cross-reactivity and repertoire overlap between different HLA class II loci and allelic variants. Here, we selected peptides having a median consensus percentile ≤ 20, a threshold associated with epitope panels responsible for about 50% of target-specific responses. Using this threshold we identified 241 candidates in the 2019-nCoV sequence (see Table S3). In previous experiments, we showed that pools based on similar peptide numbers can be generated by sequential lyophilization (11).
These peptide pools (or megapools) incorporate predicted or experimentally validated epitopes and allow measurement of the magnitude and characterization of the phenotype of human T cell responses in infectious disease indications such as Bordetella pertussis, Mycobacterium tuberculosis, Dengue and Zika viruses (11-14). The 2019-nCoV CD4 megapool covers all 10 predicted proteins, with the number of potential epitopes proportional to the size of each protein (Table S4).
In parallel, we also sought to define likely CD8 epitopes. Here a different approach was required, since the overlap between different HLA class I allelic variants and loci is more limited to specific groups of alleles, or supertypes (15). Following a previously validated approach (16), we assembled a set of the 12 most prominent HLA class I alleles, which have been shown to allow broad coverage of the general population, as described in the Methods (see also Table S5). We then performed HLA class I binding predictions using the NetMHCpan 4.0 EL algorithm (17) available at the IEDB. For each allele, we selected the top 1% scoring peptides in the 2019-nCoV sequence, as ranked based on prediction. After eliminating redundancies and nested peptides, we obtained a final "in silico" megapool of 628 unique predicted epitopes. Table S6 lists those unique predicted epitopes per protein, indicating for each their respective HLA restriction(s).
The epitopes identified by homology to the experimentally defined SARS-CoV epitopes shown in Tables 2-3 [...] approaches are presumed to be the most valuable leads. We first compared B cell immunodominant regions identified in SARS-CoV, and mapped to the homologous 2019-nCoV proteins (Table 2), with the predicted linear (Table S2) and conformational (Table S1) B cell epitopes.
Out of the five B cell immunodominant regions from the SARS-CoV spike glycoprotein that were mapped to 2019-nCoV, three regions overlapped with those identified by BepiPred 2.0, and one overlapped with the 809-812 region predicted by Discotope 2.0 (Table S1 and Fig 3). No overlap was observed between the five regions of SARS-CoV membrane protein and nucleoprotein that mapped to 2019-nCoV and those predicted by BepiPred 2.0. As stated above, no Discotope 2.0 prediction was available for those two proteins.
When we compared the SARS-CoV T cell epitopes that mapped to 2019-nCoV (Table 3) with the predicted CD4 and CD8 T cell epitopes (Table S3 and Table S6, respectively), we found that 12 of 17 2019-nCoV T cell epitopes with high sequence identity (≥90%) to SARS-CoV were independently identified by the two methods. Another 7 of 16 epitopes with moderate sequence identity (70-89%), and 6 of 12 epitopes with low sequence identity (<70%), were also identified by both methods. The lack of absolute correspondence is not surprising, given that the experimental data are derived from a skewed set of HLA restrictions (largely HLA A*02:01), and that our HLA class I prediction strategy targeted a more limited set of alleles that were selected to represent the most frequent worldwide variants; at the same time, the class II predictions are expected to cover 50% of the class II responses (18).
In conclusion, the use of available information related to SARS-CoV epitopes in conjunction with bioinformatic predictions points to specific regions of 2019-nCoV that have a high likelihood of being recognized by human immune responses*. The observation that many B and T cell epitopes are highly conserved between 2019-nCoV and SARS-CoV is important. Conservation of protein regions across relatively long evolutionary distances suggests that they are structurally or functionally constrained.
Vaccination strategies designed to target the immune response toward these conserved epitope regions could generate immunity that is not only cross-protective across Betacoronaviruses but also relatively resistant to ongoing virus evolution.
* The corresponding peptide sets are being synthesized and will be made available for use by the scientific community upon request to the LJI team.
T and B cell epitopes for coronaviruses were identified by searching the IEDB at the end of January 2020. Queries were performed broadly for coronaviruses (taxonomy ID no. 11118), selecting positive assays in T cell, B cell and/or ligand contexts. Characteristics of each unique epitope (i.e., species, protein of provenance, positive assay type(s), MHC restriction) were tabulated, as well as the total number of donors tested and corresponding total number of donors with positive responses in B or T cell assays, and as a function of host. Finally, T or B cell assay specific response frequency scores (RF) were calculated broadly (i.e., any host), or for specific contexts (e.g., T cell assays in humans). Specifically, RF = [r - sqrt(r)]/t, where r is the total number of responding donors and t is the total number of donors tested (11). SARS-CoV (tax ID no. 694009) sequence epitope density was visualized with the IEDB ImmunomeBrowser tool (5). To identify contiguous dominant regions, RF scores for each residue were recalculated to represent a sliding 10 residue window.
All full-length protein sequences from SARS-CoV and MERS-CoV were retrieved from ViPR (https://www.viprbrc.org/brc/home.spg?decorator=corona) on 31 January 2020. To exclude sequences of experimental strains, sequences from "unknown," mouse, and monkey hosts were excluded from analysis. Remaining sequences were aligned using the MUSCLE algorithm in ViPR.
Sequences causing poor alignments in a preliminary analysis were removed before computing the final alignment. The consensus protein sequences of each virus group were determined from the final alignments using the Sequence Variation Analysis tool in ViPR. MERS-CoV consensus sequences were selected for use in epitope sequence analysis. Each Wuhan-Hu-1 (MN908947) protein sequence was compared against the consensus protein sequences from SARS-CoV and MERS-CoV and the protein sequences from the closest bat relative (bat-SL-CoVZXC21) using the BLAST algorithm (ViPR; https://www.viprbrc.org/brc/blast.spg?method=ShowCleanInputPage&decorator=corona) to compute the pairwise identity between Wuhan-Hu-1 proteins and their comparison target.
Linear B cell epitope predictions were carried out on three different coronavirus proteins: [...] Fig S1). Epitope prediction was carried out using the ten proteins predicted for the reference [...]
For CD4 T cell epitope prediction, we applied a previously described algorithm that was developed to predict dominant HLA class II epitopes, using a median consensus percentile of prediction cutoff ≤ 20 as recommended (18). For CD8 T cell epitope prediction, we selected the 12 most frequent HLA class I alleles in the worldwide population (20, 21) [...] (17). For each HLA class I allele analyzed, we selected the top 1% of epitopes ranked by prediction score.
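The per-allele selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the study: the function name, the input layout (a dictionary mapping each allele to a list of peptide and score pairs), the convention that a higher score means a better prediction, and the rounding up of the cutoff count are all assumptions made here. The peptide names and scores are synthetic placeholders.

```python
import math


def top_percent_per_allele(predictions, percent=1.0):
    """Keep the best-scoring `percent` of peptides separately for each allele.

    predictions: dict mapping allele name -> list of (peptide, score),
    where a higher score is assumed to mean a stronger prediction.
    """
    selected = {}
    for allele, scored in predictions.items():
        ranked = sorted(scored, key=lambda item: item[1], reverse=True)
        # Round up so that small inputs still yield at least one peptide.
        keep = max(1, math.ceil(len(ranked) * percent / 100.0))
        selected[allele] = [peptide for peptide, _ in ranked[:keep]]
    return selected


# Synthetic example: 200 placeholder peptides with increasing scores.
toy = {"HLA-A*02:01": [(f"PEP{i:03d}", i / 200) for i in range(200)]}
print(top_percent_per_allele(toy))  # {'HLA-A*02:01': ['PEP199', 'PEP198']}
```

If the ranking is based on percentile rank rather than a raw score, lower values are better and the sort order would need to be reversed.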
To generate a final set for synthesis, duplicate peptides (i.e., those selected for multiple alleles) were reduced to a single occurrence, and nested peptides were ensconced within longer sequences, up to 14 residues in length, before assigning the multiple corresponding HLA restrictions for each region.
We thank Erica Ollmann Saphire, Sharon Schendel, and Mitchell Kronenberg for critical reading of the manuscript and numerous helpful suggestions. Support for the work included funding provided through NIH-NIAID contracts 75N9301900065 (AS) and 75N93019C00001 (AS and BP). Additional support was provided through NIH-NIAID contract 75N93019C00076 (YZ and RS).
1. The Immune Epitope Database (IEDB): 2018 update
2. ViPR: an open bioinformatics database and analysis resource for virology research
3. SARS and MERS: recent insights into emerging coronaviruses
4. From SARS to MERS, Thrusting Coronaviruses into the Spotlight
5. ImmunomeBrowser: a tool to aggregate and visualize complex and heterogeneous epitopes in reference proteins
6. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes
7. Reliable B cell epitope predictions: impacts of method development and improved benchmarking
8. SWISS-MODEL: homology modelling of protein structures and complexes
9. A population response analysis approach to assign class II HLA-epitope restrictions
10. TepiTool: A Pipeline for Computational Prediction of T Cell Epitope Candidates
11. Automatic Generation of Validated Specific Epitope Sets
12. Global Assessment of Dengue Virus-Specific CD4+ T Cell Responses in Dengue-Endemic Areas
13. Cutting Edge: Transcriptional Profiling Reveals Multifunctional and Cytotoxic Antiviral Responses of Zika Virus-Specific CD8(+) T Cells
14. Correction: Definition of Human Epitopes Recognized in Tetanus Toxoid and Development of an Assay Strategy to Detect Ex Vivo Tetanus CD4+ T Cell Responses
15. HLA class I supertypes: a revised and updated classification
16. Comprehensive analysis of dengue virus-specific responses supports an HLA-linked protective role for CD8+ T cells
17. NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data
18. Development and validation of a broad scheme for prediction of HLA class II restricted T cell epitopes
19. Immune epitope database analysis resource (IEDB-AR)
20. HLA class I alleles are associated with peptide-binding repertoires of different size, affinity, and immunogenicity
21. New allele frequency database