key: cord-0685971-grnibz2t
authors: Dumonteil, Eric; Herrera, Claudia
title: Polymorphism and selection pressure of SARS-CoV-2 vaccine and diagnostic antigens: implications for immune evasion and serologic diagnostic performance
date: 2020-06-18
journal: bioRxiv
DOI: 10.1101/2020.06.18.158329
sha: a9b9eee804a39fbb7947614413844128cabeff16
doc_id: 685971
cord_uid: grnibz2t

The ongoing SARS-CoV-2 pandemic has triggered multiple efforts to develop serological tests and vaccines. Most of these tests and vaccines are based on the Spike glycoprotein (S) or the Nucleocapsid (N) viral protein. Conservation of these antigens among viral strains is critical to ensure optimum diagnostic test performance and broad protective efficacy, respectively. We assessed N and S antigen diversity from 17,853 SARS-CoV-2 genome sequences and evaluated selection pressure. Up to 6-7 incipient phylogenetic clades were identified for both antigens, confirming early variants of the S antigen and identifying new ones. Significant diversifying selection was detected at multiple sites for both antigens. Some sequence variants have already spread in multiple regions, in spite of their low frequency. In conclusion, the N and S antigens of SARS-CoV-2 are well conserved, but new clades are emerging and may need to be included in future diagnostic and vaccine formulations.

The emergence and rapid spread of a novel Coronavirus, referred to as SARS-CoV-2, is resulting in one of the worst pandemics in the world, causing an unprecedented health and economic crisis. About seven months after the first cases were identified, over 8 million cases have been reported worldwide, with over 400,000 deaths according to the Johns Hopkins Coronavirus Resource Center. The pandemic has triggered multiple efforts to develop serological tests able to detect both acute infections, through virus-specific IgM, and recovered individuals, through virus-specific IgG.
Several immunochromatographic rapid tests are already available (1), and several more will become available in the next few months. Such tools would be critical to increase testing for the accurate and rapid identification of cases and their isolation to limit further transmission of the virus. However, their performance needs to be evaluated, and initial testing suggested variable performance of these tests (1, 2). Test performance relies in part on the antigen used, and its conservation among virus strains circulating in the population being tested. Currently, most of these tests are based on the Spike glycoprotein (S) or the Nucleocapsid (N) viral proteins (1). The receptor-binding domain (RBD) of the S protein, which mediates binding to the angiotensin-converting enzyme 2 (ACE2) receptor in human cells (3), is also widely used as a diagnostic antigen. Similarly, vaccine development efforts have been very intense and a growing number of vaccine candidates are being quickly moved into clinical trials. These are based on different technological platforms, ranging from recombinant proteins to RNA and DNA vaccines and recombinant viral vectors (4, 5). A first RNA vaccine candidate recently completed Phase 1 clinical evaluation, and is expected to move into Phase 2 shortly. Most of these vaccine candidates are based on the viral S protein, or the RBD, as antigen. Multiple potential vaccine epitopes have also been identified in the S as well as in the N viral proteins (6). As for diagnostics, conservation of these vaccine antigens among viral strains is critical to ensure broad protection and avoid immune evasion by the virus. As an RNA virus, SARS-CoV-2 is prone to frequent mutations, in spite of some proof-reading abilities of its RNA polymerase complex (7, 8).
An early assessment of genomic changes in SARS-CoV-2 showed a mutation hot-spot in the viral RNA-dependent RNA polymerase (RdRp), but a few mutations were also detected in other parts of the viral genome, including the N and S proteins (9). The growing availability of a large number of complete genome sequences gathered since the beginning of the pandemic provides a unique tool to assess the extent of viral antigen polymorphisms, and potential selection pressures on these. A first analysis of polymorphisms in the S glycoprotein until early April 2020 identified a handful of variant sites, including D614G, S943P, and possibly L5F and L8V (10). Variant sites V367F, G476S, and V483A were also identified in the RBD. We analyzed here the sequence variation in a broader set of N and S protein sequences, as these proteins represent the main diagnostic and vaccine antigens to date. We examined the implications of the identified sequence variants for vaccine and serological diagnostic performance.

Whole genome sequences from 18,247 SARS-CoV-2 viruses were obtained from GISAID (Supplemental Table 1), covering virus isolates from multiple continents, including Asia, Africa, Europe, Oceania, and America. These sequences ranged from those of the initial human cases in Wuhan, China, in December 2019 up to sequences from May 11, 2020. Viral genome sequences were aligned using MAFFT (11) as implemented in Geneious 11, and alignments were edited to exclude partial or low quality sequences. A final alignment including 17,853 quality sequences was used to construct phylogenetic trees using FastTree (12) for a global analysis of viral diversity across the world. FastTree infers approximately-maximum-likelihood phylogenetic trees. Sequence conservation across the genome alignment was calculated using a sliding window of one in Geneious. Separate analyses were then performed using the S and N genes, as well as the RBD from the S protein (positions 319-540 within the S protein).
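As an illustration of the last two steps, the following Python sketch computes a per-column conservation score (the equivalent of a sliding window of one) and extracts a region such as the RBD by its 1-based coordinates. It is a minimal sketch under stated assumptions, not the Geneious procedure used in this study: conservation is taken here to be the fraction of unambiguous characters matching the most common character in each column, and the function names `column_conservation` and `extract_region` are hypothetical.

```python
from collections import Counter


def column_conservation(aligned_seqs, ignore="-N?"):
    """Per-column conservation of an alignment (window size of one).

    Assumption: conservation is the fraction of unambiguous characters
    that match the most common character in the column. Characters listed
    in `ignore` (gaps and ambiguity codes) are excluded from the count.
    Columns with no usable character are scored 0.0.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        counts = Counter(ch for ch in column if ch not in ignore)
        total = sum(counts.values())
        if total == 0:
            scores.append(0.0)
        else:
            scores.append(counts.most_common(1)[0][1] / total)
    return scores


def extract_region(protein_seq, start, end):
    """Return residues start..end (1-based, inclusive) of an ungapped protein."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return protein_seq[start - 1:end]


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACGT", "ANGT"]  # toy data, not real sequences
    print(column_conservation(toy_alignment))
    # For an ungapped S protein, extract_region(spike, 319, 540) yields 222 residues.
```

The coordinate-based extraction assumes an ungapped reference sequence; in a gapped alignment, residue positions would first need to be mapped to alignment columns.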
For these separate analyses, translated sequences were aligned with the MAFFT algorithm using the BLOSUM62 matrix, and the frequency of variants at each site was calculated. Unique sequences from these proteins were then selected and phylogenetic trees were constructed using FastTree as above. Predicted epitopes from these antigens (6, 13) were mapped in the alignments, as well as glycosylation sites (14), to assess their conservation among viral sequences. Finally, evolutionary selection pressures on the antigens were analyzed using the Fast, Unconstrained Bayesian AppRoximation (FUBAR), as implemented in HyPhy (15), and statistical significance was considered at a threshold of P<0.1.

Analysis of over 17,000 genome sequences confirmed that SARS-CoV-2 is a fast-evolving virus that is rapidly accumulating mutations. Indeed, in the less than five months during which viral sequences have been available, we detected sequence variants scattered throughout the viral genome, rather than clustered in specific genes (Figure 1A and B), and some viruses now circulating in multiple countries have diverged somewhat from some of the isolates initially sequenced in December 2019 in Wuhan, China (Figure 1C). Importantly, some sequence variation could be detected within both the N and S genes. These genes were then analyzed in detail and separately. For the N protein, we included a dataset of 16,656 sequences, and significant sequence diversity was detected, with up to 326 distinct protein sequences. For a clearer assessment of their phylogenetic relationship, these variant sequences were analyzed independently (Figure 2A). Notably, an emerging structure of up to seven incipient clades was found, with sequences from the first viruses from Wuhan, China, included in Clade 1 (Figure 2C). There was no specific geographic clustering of the sequence variants, illustrating the widespread multidirectional spreading of the virus across the world.
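The reduction of a protein dataset to its distinct sequences, and the per-site variant frequencies reported in the tables, can be illustrated with a short Python sketch. This is a sketch under stated assumptions rather than the workflow used in this study: the reference is assumed to be an early Wuhan sequence supplied by the user, ambiguous residues are excluded from the denominator at each site, and the names `unique_sequences` and `site_variant_frequencies` are hypothetical.

```python
from collections import Counter


def unique_sequences(seqs):
    """Return the distinct sequences, keeping first-seen order."""
    return list(dict.fromkeys(seqs))


def site_variant_frequencies(aligned_proteins, reference, ambiguous="X-?"):
    """Percentage of sequences carrying each non-reference residue, per site.

    Returns {position: {"D614G"-style label: percent}} for variable sites only.
    Assumptions: sequences are aligned to `reference` without insertions,
    positions are 1-based, and residues listed in `ambiguous` are dropped
    before frequencies are computed.
    """
    if any(len(seq) != len(reference) for seq in aligned_proteins):
        raise ValueError("sequences must have the same length as the reference")
    results = {}
    for pos, ref_res in enumerate(reference, start=1):
        column = [seq[pos - 1] for seq in aligned_proteins
                  if seq[pos - 1] not in ambiguous]
        if not column:
            continue
        counts = Counter(column)
        variants = {
            f"{ref_res}{pos}{res}": 100.0 * n / len(column)
            for res, n in counts.items()
            if res != ref_res
        }
        if variants:
            results[pos] = variants
    return results


if __name__ == "__main__":
    reference = "MDG"                     # toy reference, not a real protein
    dataset = ["MDG", "MGG", "MDG", "MDX"]
    print(len(unique_sequences(dataset)), "distinct sequences")
    print(site_variant_frequencies(dataset, reference))
```

Excluding ambiguous residues from the denominator is one possible choice; counting them as missing data in a separate category would give slightly lower variant percentages.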
A notable exception to this lack of geographic clustering was observed for Clade 3, which included mostly sequences from Europe. Analysis of sequence variation along the protein sequence indicated that the amino-terminal half of the protein was mostly conserved, except at site 13 and at sites 203-204 (Figure 2B). On the other hand, the carboxy-terminal half of the protein appeared more variable, but this also reflected some sequencing ambiguities. A total of 178/419 (42.5%) sites presented variation in the N protein. This included seven sites with four variant amino acids, seven sites with three variant amino acids, and 13 sites with two variant amino acids that were found under significant diversifying selection pressure (26/419, 6.2%; Table 1). Because of these changes, the N protein is slowly diverging from the sequence of some of the early viruses, belonging to Clade 1, and up to six additional major clades (Clades 2-7) are emerging for the N antigen (Figure 2C). Substitution of D144 by E, H, Y, or N may disrupt a predicted epitope (ALNTPKDHI, 138-146). Importantly, most variants were still found at relatively low frequency among the viral population (0.018 to 0.541%), with only R203X and P13X variants detected at higher frequency (18.108 and 1.589%, respectively; Table 1).

With the exception of the D614G substitution, which has taken over and is now widespread in virus populations across the globe (over 63% of sequences carry this substitution), the other S protein variants under selection still represent a low proportion of viral sequences, ranging from 0.017 to 0.586% (Table 3). A few of these variants likely correspond to limited clusters of infections, as they come from a single geographic region and are grouped in time. This is the case for the G1124V variant, which is limited to 50 cases from Victoria, Australia, between March 20 and 27, 2020. Similarly, the N439K variant is limited to 40 cases from Scotland, identified between March 16 and April 5, 2020.
However, most of the other variants have already spread to multiple countries and regions, such as Q675X, which has been found in Denmark, England, Finland, Iceland, Norway, Scotland, Spain, and the USA over March and April 2020. Similarly, L5F variants have been found in 102 cases from Australia, Belgium, Canada, England, France, Iceland, India, Italy, Japan, Netherlands, Portugal, Scotland, Singapore, Taiwan, Thailand, USA, and Wales, and H49X variants have been found in 36 cases from Australia, China, England, Mexico, Taiwan, and the USA.

As mentioned above, some of the sequence variation affecting the S protein was detected within the RBD, which is a key functional domain of the protein and one of the most used targets for serological diagnostics. We thus analyzed its polymorphism in detail. Sequence analysis of the RBD revealed that it is a highly conserved region of the S protein. Nonetheless, up to 54 RBD sequence variants were identified, again with some significant divergence from the first sequences from Wuhan, China (Figure 5). Importantly, divergence seemed to increase with time as more variants accumulate and become established. A total of seven sites from the RBD were found under significant diversifying selection pressure, and variant sites within the RBD were observed in each of the major clades of the S protein (Table 2). Nonetheless, while possible RBD clades are emerging, these do not match the S protein major clades described above.

Antigen polymorphism in pathogens has the potential to impair serological diagnostic test performance, as well as vaccine efficacy. It is thus of key importance to consider these aspects in serological test and vaccine development, to ensure their usefulness and broad efficacy. This is commonly done for influenza vaccines, for example, which are updated each year based on circulating viral strains, as cross protection among strains is still elusive (16).
We investigated here the sequence diversity of two major antigens of the novel SARS-CoV-2 virus, the N and S proteins. Importantly, a significant level of sequence diversity was detected for both antigens, with incipient clades emerging as multiple sites were found under significant diversifying selection pressure. The N protein, mostly used in serological diagnostic tests (1), had a large number of sequence variants, and 6.2% of its residues were found under diversifying selection. Overall, up to seven major sequence clades have been emerging in recent months for this antigen, and these did not show any geographic clustering. A notable exception was Clade 3 of the N protein, which appeared over-represented in sequences from Europe so far. Importantly, predicted epitopes appeared conserved so far, although more detailed epitope mapping is still needed for this antigen. Nonetheless, N protein variants diverging from the initial sequences from Wuhan, China, are now circulating in most geographic regions. While these changes are so far limited to a relatively small proportion of sequences (23.4%) and may not interfere with protein antigenicity, the inclusion of some of the variants in serological tests would ensure optimum sensitivity of tests, particularly if some of these variants become more frequent.

The S glycoprotein is the main vaccine candidate currently tested in multiple vaccine platforms and formulations (4, 5). Compared to the N antigen, it is more conserved and only 2.5% of its sites were found under diversifying selection pressure. We confirmed the importance of most of the variant sites previously identified in this antigen. These include D614G, S943P, as well as L5F and L8V, and variant sites V367F, G476S, and V483A in the RBD (10). However, multiple additional variants were also identified here, leading to the identification of up to six major emerging clades of the S glycoprotein.
Most of these variants appeared in the past weeks to months and may be slowly replacing viruses with sequences similar to those of the initial isolates from Wuhan, China. Indeed, while most of the variants still have a low frequency in the viral population, several have already spread to multiple countries and regions, where they may reach higher frequencies in the near future if they are successfully transmitted. Importantly, none of the substitutions identified affected the glycosylation pattern of the S protein, and none of the predicted epitopes appears affected. While the functional impact of these variants is unknown, the D614G mutation has been associated with potential increased viral transmission and/or fitness (10), which may explain why it became so frequent. A recent comparison of functional properties of the S proteins with aspartic acid (SD614) and glycine (SG614) confirmed that the SG614 variant has greater infectivity, correlated with less S1 shedding and greater incorporation of the S protein into the pseudovirion (17). Similar functional studies of the additional variants identified here may help evaluate their impact on virus fitness. Future studies will also provide data on whether the different clades identified here are successfully transmitted or go extinct.

While the RBD is particularly well conserved, some sequence variation was also detected in this region within the S glycoprotein, with up to 54 sequence variants. Because these differ by only 1-2 amino acids, the overall antibody recognition of the RBD can be expected to be mostly preserved so far, but some specific epitopes may nonetheless be lost. Also, our phylogenetic analysis suggested that possible clades may be emerging within the RBD as well, and newer sequences may diverge further from the sequence of the initial isolates from Wuhan.
In conclusion, we found that the N and S antigens of SARS-CoV-2 are so far highly conserved, making both suitable antigens for diagnostic and vaccine development. However, some sequence variation is also emerging and 6-7 phylogenetic clades could be identified for both antigens. Some of these sequence variants have already spread in multiple countries and regions, in spite of their low frequency. Sequence variants may arise by random substitutions in the viral genome during replication, but the significant diversifying selection detected at multiple sites in both antigens suggests that immune selection pressure and adaptation to human hosts may be driving some of these changes, which may lead to the establishment of some of these variants. New variants are also likely to emerge with time. The recent identification of potential co-infections with more than one viral strain suggests that recombination could also contribute to the generation of SARS-CoV-2 genetic diversity (18). Therefore, further monitoring of antigen drift over time will be needed to ensure that diverging antigens can be identified in a timely manner and included in future diagnostic and vaccine formulations.

1. Test performance evaluation of SARS-CoV-2 serological assays. medRxiv
2. Using prenatal blood samples to validate COVID-19 rapid serologic tests
3. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein
4. The SARS-CoV-2 Vaccine Pipeline: an Overview
5. Vaccine designers take first shots at COVID-19
6. In silico identification of vaccine targets for 2019-nCoV
7. Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics
8. RNA 3'-end mismatch excision by the severe acute respiratory syndrome coronavirus nonstructural protein nsp10/nsp14 exoribonuclease complex
9. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
10. Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv
11. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution
12. FastTree 2 - approximately maximum-likelihood trees for large alignments
13. Preliminary Identification of Potential Vaccine Targets for the COVID-19 Coronavirus (SARS-CoV-2) Based on SARS-CoV Immunological Studies. Viruses
14. Site-specific glycan analysis of the SARS-CoV-2 spike
15. HyPhy 2.5 - A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Molecular Biology and Evolution
16. Vaccine approaches conferring cross-protection against influenza viruses. Expert Review of Vaccines
17. The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity
18. Characterization of SARS-CoV-2 viral diversity within and across hosts. bioRxiv