key: cord-0964745-3z34pkr7
authors: Lo Presti, Alessandra; Rezza, Giovanni; Stefanelli, Paola
title: Selective pressure on SARS-CoV-2 protein coding genes and glycosylation site prediction
date: 2020-09-21
journal: Heliyon
DOI: 10.1016/j.heliyon.2020.e05001
sha: 8255e044b0fff4be72e3880cf9af2156c2000891
doc_id: 964745
cord_uid: 3z34pkr7

BACKGROUND: An outbreak of a febrile respiratory illness due to the newly discovered Coronavirus, SARS-CoV-2, was initially detected in mid-December 2019 in the city of Wuhan, Hubei province (China). The virus then spread to most countries in the world. As an RNA virus, SARS-CoV-2 may acquire mutations that may be fixed. The aim of this study was to evaluate the selective pressure acting on SARS-CoV-2 protein coding genes. METHODS: Mutations and glycosylation site prediction were analyzed in SARS-CoV-2 genomes (from 464 to 477 sequences). RESULTS: Selective pressure on surface glycoprotein (S) revealed one positively selected site (AA 943), located outside the receptor binding domain (RBD). Mutation analysis identified five residues on the surface glycoprotein, with variations (AA positions 367, 458, 477, 483, 491) located inside the RBD. Positive selective pressure was identified in nsp2, nsp3, nsp4, nsp6, nsp12, helicase, ORF3a, ORF8, and N sub-sets. A total of 22 predicted N-glycosylation positions were found in the SARS-CoV-2 surface glycoprotein; one of them, 343N, was located within the RBD. One predicted N-glycosylation site was found in the M protein and 4 potential O-glycosylation sites in specific protein 3a sequences. CONCLUSION: Overall, the data showed positive pressure and mutations acting on specific protein coding genes. These findings may provide useful information on: i) markers for vaccine design, ii) new therapeutic approaches, iii) mutagenesis experiments to inhibit SARS-CoV-2 cell entry. The negative selection identified in SARS-CoV-2 protein coding genes may help the identification of highly conserved regions useful to implement new diagnostic protocols.

Human coronaviruses (CoV) are enveloped positive-stranded RNA viruses belonging to the order Nidovirales, mostly responsible for upper respiratory and digestive tract infections (Fehr and Perlman, 2005).

An outbreak of a febrile respiratory illness due to the newly discovered Coronavirus (officially named by the World Health Organization as SARS-CoV-2) occurred in mid-December 2019, in the city of Wuhan, Hubei province (China). The virus spread to most countries in all the continents, causing a pandemic event (WHO a; WHO b).

Previous studies have examined SARS-CoV-2 mutations, although they were based on small sample sizes (Benvenuto et al., 2020; Phan, 2020; Tang et al., 2020; Pachetti et al., 2020). A selective pressure analysis covering all the SARS-CoV-2 gene portions and based on a large data set is still lacking.

At the molecular level, amino-acid changes that result in reduced fitness are generally removed by negative selection, whereas changes that increase virus fitness are maintained by positive selection. By contrast, when amino-acid changes neither decrease nor increase fitness, the changes are considered neutral. Thus, it is important to understand which sites evolve under selective pressure, especially in the case of a new pathogen, because the presence of negative or positive selection implies that the sites are functionally important.

Hereby, we report data regarding the selective pressure on SARS-CoV-2 protein coding genes and their glycosylation site prediction on a large number of SARS-CoV-2 genomes (ranging from 464 to 477) downloaded from GenBank (NCBI, https://www.ncbi.nlm.nih.gov/pubmed) and from the GISAID platform (GISAID, https://www.gisaid.org/). We describe the main results of a molecular evolutionary analysis aimed to: i) identify the selective pressure on the SARS-CoV-2 protein coding genes; ii) identify the mutations in SARS-CoV-2 surface glycoprotein (also known as spike glycoprotein) sequences; iii) compare the specific positions of the surface glycoprotein, previously reported to be critical for cross-species and human-to-human transmission in SARS-CoV (Li et al., 2005a,b), among SARS-CoV-2, SARS-CoV and Bat SARS-like sequences; iv) evaluate and predict potential glycosylation sites, as already considered in the case of SARS-CoV (Chakraborti et al., 2005; Zhou et al., 2010).

A total of 500 SARS-CoV-2 sequences (complete and partial sequences) were downloaded from GenBank (NCBI, https://www.ncbi.nlm.nih.gov/pubmed) and the GISAID database (GISAID, https://www.gisaid.org/) to constitute the starting dataset (Table S1), which was geographically and temporally representative and of a size compatible with computational calculation time. For the purpose of selective pressure and mutation analysis, the following protein-coding gene sequence sub-sets were defined, after excluding short sequences or those showing extensive presence of ambiguity codes:

ORF10 (n = 467). All the nucleotide sequence alignments were performed by using the multiple sequence alignment program MAFFT v.7 (Katoh and Standley, 2013) with the Galaxy platform (Galaxy, https://usegalaxy.org/; Afgan et al., 2018) and manually edited with the Bioedit program (Hall, 1999).

The selective pressure analysis was performed on the above reported SARS-CoV-2 protein coding sequence sub-sets through the Datamonkey Adaptive Evolution Server (Delport et al., 2010; Pond and Frost, 2005; Weaver et al., 2018), in order to characterize the SARS-CoV-2 variations and evolutionary dynamics and to identify and localize statistically supported sites under positive and negative selective pressure. If a site is statistically significant for a ratio of non-synonymous to synonymous substitution rates ω > 1, positive (diversifying) selection is inferred, while purifying selection is inferred for ω < 1. Neutrality is inferred for ω = 1. Three models were applied and the results were merged: i) FEL (Fixed Effects Likelihood) uses a maximum-likelihood (ML) approach to infer non-synonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. This method assumes that the selection pressure for each site is constant along the entire phylogeny; ii) FUBAR (Fast, Unconstrained Bayesian AppRoximation) uses a Bayesian approach to infer non-synonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. This method assumes that the selection pressure for each site is constant along the entire phylogeny; iii) SLAC (Single-Likelihood Ancestor Counting) uses a combination of maximum-likelihood (ML) and counting approaches to infer non-synonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. This method assumes that the selection pressure for each site is constant along the entire phylogeny (Pond and Frost, 2005).
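The ω rule described above can be sketched in a few lines of Python. This is only an illustration of the decision rule: the function name `classify_omega` and the tolerance `tol` are assumptions introduced here, and the actual FEL, FUBAR and SLAC methods decide on the basis of statistical tests rather than on a point estimate of ω.

```python
def classify_omega(dn: float, ds: float, tol: float = 1e-9) -> str:
    """Label a site from per-site dN and dS estimates (illustrative only).

    omega = dN / dS; omega > 1 suggests positive selection,
    omega < 1 purifying (negative) selection, omega = 1 neutrality.
    """
    if ds <= 0:
        # omega cannot be computed when the synonymous rate is zero
        return "undefined"
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "positive" if omega > 1.0 else "negative"


if __name__ == "__main__":
    # Hypothetical rate estimates, not values from the study
    for dn, ds in [(2.0, 1.0), (0.5, 1.0), (1.0, 1.0)]:
        print(dn, ds, classify_omega(dn, ds))
```

In practice a site would be reported only if the label is also statistically supported, which is the role of the thresholds given in the following paragraphs.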

The positively selected sites with the corresponding amino acid variations identified in the Italian sequences of our sub-sets were highlighted.

The surface glycoprotein sub-set (gene S) was also analyzed for the identification of mutations.

A p-value < 0.1 for SLAC and FEL, and a posterior probability > 0.90 for FUBAR, were used as statistical support for the amino acid sites found under selection, as previously reported (Lo Presti et al., 2016; Hu et al., 2016; Ebranati et al., 2015; Pond and Frost, 2005c), and these sites were considered candidates for selection. Only the statistically supported selective pressure sites were reported.
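As a minimal sketch of how these thresholds could be applied, the Python below filters per-site results from the three methods and merges them. The input dictionaries (site number mapped to p-value or posterior probability) and the function name `supported_sites` are hypothetical, and taking the union of the three methods is an assumption about what "merged" means; the text does not specify it.

```python
from typing import Dict, Set

P_THRESHOLD = 0.1            # SLAC and FEL cut-off stated in the text
POSTERIOR_THRESHOLD = 0.90   # FUBAR cut-off stated in the text


def supported_sites(
    fel_p: Dict[int, float],
    slac_p: Dict[int, float],
    fubar_posterior: Dict[int, float],
) -> Dict[str, Set[int]]:
    """Return the sites passing each method's threshold and their union.

    Keys of the input dicts are 1-based codon positions (hypothetical layout).
    """
    fel = {site for site, p in fel_p.items() if p < P_THRESHOLD}
    slac = {site for site, p in slac_p.items() if p < P_THRESHOLD}
    fubar = {site for site, pp in fubar_posterior.items() if pp > POSTERIOR_THRESHOLD}
    # Assumption: "merged" is read here as the union of the three site sets
    return {"FEL": fel, "SLAC": slac, "FUBAR": fubar, "any": fel | slac | fubar}


if __name__ == "__main__":
    # Hypothetical values for illustration, not results from the study
    result = supported_sites(
        fel_p={10: 0.03, 20: 0.40},
        slac_p={10: 0.08, 30: 0.09},
        fubar_posterior={20: 0.95, 30: 0.50},
    )
    print(sorted(result["any"]))
```

A stricter reading (sites supported by all three methods) would replace the union with an intersection.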

The positions of the selective pressure sites and mutations in the different SARS-CoV-2 sub-sets are given with respect to the protein products of the SARS-CoV-2 Reference Sequence isolate Wuhan-Hu-1, Accession Number: NC_045512.2, and specifically with respect to the protein_id: YP_009725297.1 (nsp1), YP_009725298.1 (nsp2), YP_009725299.1 (nsp3), YP_009725300.1 (nsp4), YP_009725301.1 (3C-like proteinase), YP_009725302.1 (nsp6), YP_009725303.1 (nsp7), YP_009725304.1 (nsp8), YP_009725305.1 (nsp9), YP_009725306.1 (nsp10), YP_009725312.1 (nsp11), YP_009725307.1 (nsp12), YP_009725308.1 (helicase), YP_009725309.1 (3'-to-5' exonuclease), YP_009725310.1 (endoRNAse), YP_009725311.1 (2'-O-ribose methyltransferase), YP_009724390.1 (surface glycoprotein), YP_009724391.1 (ORF3a), YP_009724392.1 (envelope), YP_009724393.1 (membrane glycoprotein), YP_009724394.1 (ORF6), YP_009724395.1 (ORF7a), YP_009724396.1 (ORF8), YP_009724397.2 (nucleocapsid phosphoprotein) and YP_009725255.1 (ORF10).

The glycosylation patterns of the SARS-CoV-2 surface glycoprotein, M and E protein sequences were analyzed with the N-GlycoSite tool (N-GlycoSite, https://www.hiv.lanl.gov/content/sequence/GLYCOSITE/glycosite.html) to characterize and predict potential N-linked glycosylation sites. Furthermore, potential O-glycosylation sites were predicted in the SARS-CoV-2 protein 3a, surface glycoprotein, E and M protein sub-sets by using NetOGlyc v. 4.0.0.13 software (Steentoft et al., 2013).

Overall, the selective pressure analysis varied considerably across the genes.

The analysis conducted on the nsp1, 3C-like proteinase, nsp10, 3'-to-5' exonuclease, endoRNAse, 2'-O-ribose methyltransferase, E, M, ORF6, ORF7a, and ORF10 sub-sets indicated only negatively selected sites (positions and amino acids reported in Table 1). In contrast, nsp7, nsp8, nsp9 and nsp11 showed neither positive nor negative sites. Table 1 shows five supported positively selected and three negatively selected sites in nsp2, in contrast to nsp3, where a larger number of negatively selected sites (n = 15) and fewer positively selected sites (n = 3) were found.

Selective pressure analysis conducted on nsp4 sub-set revealed one positive and four negative selective sites (Table 1) .

Nsp6 revealed one positive site, 37 (L; F), and two negative sites, 222 (T) and 289 (V).

Selective pressure analysis conducted on nsp12 found three positively selected sites 25 (G; Y); 323 (P; L); 644 (T; M) and eight negative.

Selective pressure analysis on the helicase sub-set indicated two positive sites, 504 (P; L) and 598 (A; S; V), and four negative sites, 337 (R), 521 (V), 547 (T) and 553 (A). The analysis of the SARS-CoV-2 S (surface glycoprotein) protein coding gene sub-set revealed one positive site, 943 (S; P), and 11 negatively selected sites. SARS-CoV-2 binds to ACE2 through the RBD (receptor binding domain) of the spike protein in order to initiate membrane fusion and enter human cells. The positively selected site identified here (AA 943) is located outside the RBD of the spike glycoprotein. Moreover, this site (AA 943), when compared through an alignment with a Bat SARS-like coronavirus sequence (Accession Number: MG772933), corresponded to the amino acid 916S. Only one positively selected site, 99 (A; S; V), was identified in ORF3a. In ORF8, AA position 62 (V; L) was found to be subject to positive selection (Table 1). Finally, the nucleocapsid phosphoprotein sub-set revealed two positive sites, 13 (P; L; S) and 103 (D; Y), and two negative sites, 173 (A) and 274 (F) (Table 1).

The positively selected sites identified in this study were examined in the Italian sequences included in our sub-sets (14 sequences from Italy for the N sub-set; 13 sequences for the nsp2, nsp3, nsp4, nsp6, nsp12, helicase, S, ORF3a and ORF8 sub-sets) in order to monitor the variations.

All the Italian sequences showed the 198V; 248S; 347K; 348S; 559I aminoacid sites in the nsp2 gene and 1454N, 1507A, 1527A in the nsp3 gene with the exception of EPI_ISL_417446 genome showing the 1507E, and EPI_ISL_417446 showing 1527E variation.

In nsp4 all the Italian strains showed amino acid 33M. In nsp6 all the Italian isolates showed 37L except for the genomes EPI_ISL_410546 and EPI_ISL_412974 showing the 37F variation.

All the genomes presented 25G, 644T and 323L in the nsp12 gene, except for three genomes (EPI_ISL_410546, EPI_ISL_410545 and EPI_ISL_412974) presenting 323P. In the helicase protein coding gene all the Italian isolates presented 504P and 598A. At amino acid position 943 of the surface glycoprotein all the Italian genomes showed the amino acid S. The residues 99A and 62V were found in all the Italian genomes for ORF3a and ORF8, respectively. Regarding the N sub-set, all the genomes showed 13P (except sequence Id. EPI_ISL_408068, showing a gap) and 103D.

The detailed results of mutation analysis performed on the surface glycoprotein (S) sub-set alignment are reported in Table 2 and Table S2 .

Overall, 41 AA residues (41/1273), representing 3.2% of the entire surface glycoprotein length, were found to undergo variation, indicating the presence of different variants.
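The proportion quoted above (41/1273) can be reproduced with a short Python sketch, together with one possible way of counting variable positions in a protein alignment. The criterion used here (a column is variable if it contains more than one residue type, ignoring gaps and the ambiguity code X) is an assumption for illustration; the text does not state how variable positions were called, and the function names are hypothetical.

```python
from typing import List, Sequence


def variable_columns(alignment: Sequence[str], ignore: str = "-X") -> List[int]:
    """Return 1-based alignment columns that contain more than one residue type.

    Characters listed in `ignore` (gaps, ambiguity codes) are not counted.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    variable = []
    for col in range(length):
        residues = {seq[col] for seq in alignment if seq[col] not in ignore}
        if len(residues) > 1:
            variable.append(col + 1)
    return variable


def percent_variable(n_variable: int, length: int) -> float:
    """Percentage of variable positions, rounded to one decimal place."""
    return round(100.0 * n_variable / length, 1)


if __name__ == "__main__":
    toy = ["MKDL", "MKGL", "MK-L"]        # hypothetical three-sequence alignment
    print(variable_columns(toy))          # column 3 differs (D vs G)
    print(percent_variable(41, 1273))     # figures reported in the text
```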

The amino acidic position 614 (mutation D -G) has been found most frequently mutated in the sequences of our subset (Table S2) .

Five residues (367, 458, 477, 483 and 491) belonging to the RBD of the surface glycoprotein were subject to variation in the sequences reported in Table 2. Although these amino acid positions are subject to variation, they were not located among the residues that interact with ACE2 in the SARS-CoV RBD and are conserved in SARS-CoV-2, as highlighted in red in Chen et al. (2020). The AA positions 49, 483 and 943 were also among the most frequently mutated in our sub-set (Table 2).

The surface glycoprotein likely must be cleaved at both the S1/S2 and S2' cleavage sites for virus entry, as previously described (Coutard et al., 2020). We investigated the surface glycoprotein sub-set alignment in the AA regions of the protein cleavage sites (SARS-CoV-2 S1/S2 site 1, site 2 and S2'), which appeared conserved in all the sequences of our sub-set. We analyzed the protein alignment of the SARS-CoV-2 surface glycoprotein sub-set, compared to two sequences from SARS-CoV (AAP41037.1 and AAS10463.1) and two Bat SARS-like coronavirus spike proteins (AVP78031.1 and AVP78042.1), focusing on the relevant positions 472 (amino acid L or P), 479 (amino acid N) and 487 (amino acid T or S) of SARS-CoV (Figure 1). These amino acid positions were previously reported (Li et al., 2005a,b) to be critical for cross-species and human-to-human transmission in SARS-CoV. In the comparison of the paired positions in our alignment (Figure 1), differences in the amino acids harbored by SARS-CoV-2 surface glycoprotein sequences were identified, namely 486F, 493Q and 501N (referred to SARS-CoV-2 Accession YP_009724390.1), aligned respectively to the amino acid positions 472, 479 and 487 of SARS-CoV. In the Bat SARS-like coronavirus spike protein sequences, at the paired positions of the same alignment, we found: a gap (paired with position 472 of SARS-CoV and with position 486F of SARS-CoV-2), 470S (referred to Accession Number AVP78031.1 and paired with position 479N of SARS-CoV and 493Q of SARS-CoV-2), and 478V (referred to Accession Number AVP78031.1 and paired with 487T/S of SARS-CoV and 501N of SARS-CoV-2) (Figure 1).
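Finding the "paired" residue of one sequence for a given position in another amounts to converting a residue number into an alignment column and back. The sketch below shows one way to do this in Python for two rows of an existing alignment; the function names and the toy sequences are assumptions for illustration and do not reproduce the alignment in Figure 1.

```python
from typing import List, Optional


def column_to_position(aligned: str, gap: str = "-") -> List[Optional[int]]:
    """For each alignment column, give the 1-based residue number, or None at a gap."""
    positions: List[Optional[int]] = []
    count = 0
    for char in aligned:
        if char == gap:
            positions.append(None)
        else:
            count += 1
            positions.append(count)
    return positions


def paired_position(aligned_a: str, aligned_b: str, pos_a: int) -> Optional[int]:
    """Residue number in B aligned to residue `pos_a` of A (None if B has a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    map_a = column_to_position(aligned_a)
    map_b = column_to_position(aligned_b)
    column = map_a.index(pos_a)  # raises ValueError if pos_a does not exist in A
    return map_b[column]


if __name__ == "__main__":
    # Hypothetical aligned rows, not real spike sequences
    row_a = "MKT--LV"
    row_b = "M-TQQLV"
    print(paired_position(row_a, row_b, 4))  # residue 4 of A pairs with residue 5 of B
    print(paired_position(row_a, row_b, 2))  # residue 2 of A faces a gap in B
```

A `None` result corresponds to the situation described above for the Bat SARS-like sequences, where the column paired with position 472 of SARS-CoV contains a gap.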

A total of 22 predicted N-glycosylation positions were found in SARS-CoV-2 surface glycoprotein sub-set by using N-GlycoSite. The positions, number and fraction of the predicted N-glycosylation sites in the alignment of SARS-CoV-2 surface glycoprotein sub-set were reported ( Figure 2A) . A total of 10087 N-glycosylation sites in 460 sequences have been found (considering that some sequences have deletions). In particular, we found that the sequence Id: EPI_ISL_408978 (derived from a throat swab collected from a 65 years old, female patient from Hubei/ Wuhan) did not have a predicted N-glycosylation site on position 165. We also noted that the sequence Id: EPI_ISL_417439 (derived from an oro-pharyngeal swab from a 38 years old male patient, from Democratic Republic of the Congo/Kinshasa) did not show a predicted N-glycosylation site on position 1074.

In particular, three SARS-CoV-2 predicted N-glycosylation sites, 234N, 343N and 603N, corresponded to the SARS-CoV N-glycosylation sites 227N, 330N and 589N (by exploring the paired alignment positions) (Zhou et al., 2020). Of these sites, one (343N) was located within the SARS-CoV-2 RBD.

Regarding the M protein sub-set, one predicted N-glycosylation position was found for SARS-CoV-2 M sub-set by using N-GlycoSite tool ( Figure 2B ). A total of 467 N-glycosylation sites in 470 sequences have been found (three sequences Id: EPI_ISL_406959, 406960 and 416464 were shorter). The position, graphic, number and fraction of the predicted N-glycosylation sites for M sub-set were reported ( Figure 2B ).

The analysis of the N-glycosylation pattern on the E protein sub-set revealed two potential predicted N-glycosylation sites (AA 48 and 66, Figure 3). A total of 934 N-glycosylation sites in 468 sequences were found. The sequence EPI_ISL_418200 (derived from a 57-year-old male patient from USA/New York/Manhattan) did not show a predicted N-glycosylation site at amino acid position 48. Meanwhile, the N-GlycoSite analysis performed on the protein 3a sub-set showed no predicted N-glycosylation sites for this protein.

In contrast, the results obtained through NetOGlyc 4.0 on the protein 3a sub-set indicated the following four potential O-glycosylation sites, with confidence scores higher than 0.5: amino acid position 32 in the sequence Id number EPI_ISL_416464 (USA); amino acid position 253 in the sequences Id number EPI_ISL_419690 and EPI_ISL_419683 (Spain/Valencia); finally, amino acid position 171 in the sequence Id number EPI_ISL_408978 (Wuhan, China). The O-glycosylation prediction for the SARS-CoV-2 surface glycoprotein sub-set indicated sites 673 (serine), 678 (threonine) and 686 (serine) as the most frequently predicted to be glycosylated in our sub-set (~89-90% of the sequences). Other sites (19T, 22T, 29T, 250T, 349S) were predicted to be O-glycosylated at lower frequency (between 2% and 6% of the sequences) (data not shown). The M and E protein sub-sets were not predicted to be O-glycosylated by NetOGlyc.

This work provides a large-scale genomics analysis towards understanding the selective pressure, mutation and glycosylation patterns of SARS-CoV-2.

Selective pressure analysis on the SARS-CoV-2 nsp2 and nsp3 sub-sets revealed positive selection at five sites in nsp2 and three in nsp3. Nsp2 may have a role in modulating host cell survival, likely by altering the host cell environment (Cromwell et al., 2009; SWISS-MODEL Repository, https://swissmodel.expasy.org/repository/species/2697049; UNIPROT, https://www.uniprot.org/uniprot/?query=taxonomy:2697049). Nsp3 is the papain-like protease that plays an important role in viral genome replication and in antagonizing the host's innate immunity (Dong et al., 2020). In this study, a large set of genomes confirmed some amino acid changes, as previously described, i.e. amino acid change V198I in nsp2 (Pachetti et al., 2020), but described also the occurrence of hotspot

Here, the analysis conducted on the nsp1, 3C-like proteinase, nsp10, 3'-to-5' exonuclease, endoRNAse, 2'-O-ribose methyltransferase, E, M, ORF6, ORF7a, and ORF10 sub-sets indicated only negatively selected sites, suggesting a scenario of purifying selection. By contrast, nsp7, nsp8, nsp9 and nsp11 showed neither positive nor negative sites, indicating that evolution and divergence can be constant across all the evolutionary lineages and that these genes can be considered neutral. These data can help to identify highly conserved regions, useful for implementing new diagnostic protocols.

In this study, for the first time, one positively selected site (33 M; I) in the nsp4 protein was identified. This protein acts in the assembly of virally induced cytoplasmic double-membrane vesicles, essential for viral replication. This finding may imply a genetic "hot-spot" in SARS-CoV-2 viral replication and needs to be further evaluated.

Here, we confirmed the positively selected site at amino acid position 37 (L; F) in nsp6, previously reported on a smaller dataset by some authors (Benvenuto et al., 2020; Pachetti et al., 2020) and also observed in a recent study performed on a large dataset (Mercatelli and Giorgi, 2020). This protein plays a role in the initial induction of autophagosomes from the host endoplasmic reticulum and later limits the expansion of these phagosomes, which are no longer able to deliver viral components to lysosomes (SWISS-MODEL Repository, https://swissmodel.expasy.org/repository/species/2697049).

Interestingly, two additional positively selected sites (25 G-Y; 644 T-M) in the RNA-dependent RNA polymerase (nsp12) were identified, in addition to confirming the residue at position 323 (P; L), previously reported (Pachetti et al., 2020). RNA-dependent RNA polymerase is an optimal target for treatment because of its crucial role in RNA synthesis, the lack of a host homolog, and its high sequence and structural conservation. In particular, Remdesivir has recently been advanced to phase 3 clinical trials for SARS-CoV-2 (Shannon et al., 2020) because it interacts with the active replication site and with the viral genome, thus inhibiting replication. The identification of positively selected sites in the RNA-dependent RNA polymerase could be useful for therapeutic approaches.

We were able to update the evolutionary changes on the helicase by reporting an additional "hot-spot" (598 A-S-V) as well as confirming the previously reported residue at position 504 (P; L) (Pachetti et al., 2020).

The SARS-CoV-2 surface glycoprotein (S) is subjected to both positive and negative selection. Other authors, in agreement with our study, have identified some mutations within the surface glycoprotein (Phan, 2020; Tang et al., 2020; Pachetti et al., 2020; Mercatelli and Giorgi, 2020), confirming that this portion varies most frequently at position 614 (D-G). This mutation is consistent with several hypotheses regarding a fitness advantage, a greater susceptibility to re-infection (with the new G614 form of the virus), a greater infectivity due to its spread, and a probable greater transmissibility with a potential impact on the severity of the disease, as previously reported (Korber et al., 2020). The surface glycoprotein plays a crucial role in the binding of the virus to the host receptor and the subsequent membrane fusion for virus entry. The positive selection identified here is in agreement with the studies conducted on SARS-CoV (Chinese, 2004; Song et al., 2005; Zhang et al., 2006; Tang et al., 2009). We highlighted a positively selected site at position 943 (S; P), located outside the SARS-CoV-2 RBD for ACE2 (Wan et al., 2020; Li et al., 2003; Wrapp et al., 2020), suggesting that it probably does not affect the RBD structure and the binding capacity of the virus to the host cell receptor; it has, however, been linked to a suggestive model of recombination (Korber et al., 2020). Five additional mutations in the surface glycoprotein were reported in this study, which appeared within the RBD, indicating that changes in this portion may occur and should be carefully monitored, given the potential impact on viral binding capacity and infectivity. Among these mutations, the V367 site deserves attention because it is located on the same face as the epitope of CR3022, a neutralizing antibody isolated from a SARS-CoV convalescent patient, though no direct contacts between V367 and CR3022 were observed, and because of a potential interaction with ACE2 (Korber et al., 2020; Yuan et al., 2020).

The 62 (V-L) mutation in ORF8 (Tang et al., 2009 ) was confirmed as positive selected site, furthermore we were able to highlight an additional positive selected residue 99 A-S-V in orf3a. As for the N sub-set, we found two new sites (13 P-L-S and 103 D-Y) subjected to positive selection. This gene has been used in SARS-CoV-2 diagnostic tests. For this reason, it is important to monitor the selective pressure to highlight new variations useful to update, eventually, the diagnostic protocols.

A recent study (Mercatelli and Giorgi, 2020 ) analyzed a large SARS-CoV-2 dataset focusing the attention at single-nucleotide polymorphisms (SNPs). These authors highlighted a massive prevalence of SNPs over short insertion/deletion events (indels) worldwide and in every country. Moreover they reported that the aa-changing SNPs are the most prevalent mutational events in SARS-CoV-2 genomes, supporting our study and confirming the importance to monitor selective pressure and mutations.

Compared to our data, even if the two studies were based on different methodological approaches, we were able to confirm six mutation events as subjected to positive or negative selection, among the mutations that occur most frequently according to Mercatelli and Giorgi (2020) . In addition, we also found the D614G mutation as the most frequent in our surface glycoprotein dataset.

The S protein alignment of SARS-CoV-2, SARS-CoV and Bat SARS-like coronavirus was compared at three critical positions, previously described by Li et al. (2005a,b) as crucial for cross-species and human-to-human transmission in SARS-CoV; the comparison highlighted differences in the amino acids present at these sites.

All three positions were located within the SARS-CoV-2 RBD, the critical determinant of virus-receptor interaction and, therefore, of the viral host range and tropism (Li et al., 2005a,b). A previous study conducted on SARS (Chakraborti et al., 2005) identified some RBD amino acid residues that influence the binding with ACE2 (by expressing mutants and testing their binding to ACE2). A similar procedure could also be hypothesized for SARS-CoV-2, performing the expression of mutants and trying to identify the residues that could significantly reduce the RBD-ACE2 interaction.

SARS-CoV-2 uses a densely glycosylated surface protein to gain entry into host cells. This study identified 22 predicted N-glycosylation positions in the SARS-CoV-2 surface glycoprotein alignment, which must be confirmed by mass spectrometric or biochemical analyses, as already done for SARS-CoV (Chakraborti et al., 2005; Ying et al., 2004; Krokhin et al., 2003). Interestingly, one of the predicted N-glycosylation positions in the SARS-CoV-2 surface glycoprotein is located inside the RBD. Seventeen of the twenty-two predicted N-glycosylation sites had previously been reported in a study conducted on a small sample (Kumar et al., 2020), and some of them were also reported as unique to SARS-CoV-2 compared to SARS-CoV (Vankadari and Wilce, 2020). In this study, we therefore identified some additional predicted N-glycosylation sites of the SARS-CoV-2 spike glycoprotein, suggesting that the virus may use different glycosylation to interact with its receptors, which may underlie the differences in host immunity.

A published article defined the glycomics-informed, site-specific microheterogeneity of 22 N-linked sites (confirming our predicted sites) using a combination of mass spectrometry approaches coupled with evolutionary and variant sequence analyses. These authors have suggested essential roles for glycosylation in mediating receptor binding, antigenic shielding, and potentially the evolution/divergence of these glycoproteins. The 22 predicted N-glycosylation positions investigated here in the spike glycoprotein were also in line with those reported in a previous study (Shajahan et al., 2020), which identified by high resolution mass spectrometry the composition of glycans at 17 out of the 22 SARS-CoV-2 predicted sites of the spike glycoprotein, reporting the remaining five sites as unoccupied. Other authors (Watanabe et al., 2020) have focused attention on the 22 N-linked glycan sites, confirming our prediction results, but they used a site-specific mass spectrometric approach revealing the glycan compositions on a recombinant SARS-CoV-2 S immunogen.

Four SARS-CoV-2 N-glycosylation predicted sites (234N, 343N, 370N and 603N) here identified, corresponded to the following aligned positions of the SARS-CoV N-glycosylation sites (227N, 330N, 357N and 589N) (Zhou et al., 2010) . Mannose-binding lectin (MBL) is an important serum protein in the host's defenses. Zhou et al. (2010) reported the specificity of the site for glycosylation at position N330 (SARS-CoV) in the ability of MBL to inhibit SARS-CoV entry and infection in susceptible cell lines and it could be assumed a similar model for SARS-CoV-2 (Zhou et al., 2010) .

Our study may indicate that site-directed mutagenesis and in vitro studies should be applied in order to clarify whether individual SARS-CoV-2 glycosylation sites are directly involved in DC-SIGN(R)-mediated binding and entry (Zhou et al., 2010) and whether the glycan at 343N, or others reported in this study, is critical in the ability of MBL to inhibit SARS-CoV-2 entry.

The predicted N-glycosylation sites here identified in SARS-CoV-2 M and E sub-set need to be confirmed by experiments and their role better clarified in further studies. The N-glycosylation profile and the absence of O-glycosylation on M protein refer to the SARS-CoV data (Nal et al., 2005) . In contrast, the SARS-CoV E protein is not glycosylated.

The expected O-glycosylation sites must be confirmed through specific experiments, together with their roles. Many different functions have been assigned to oligosaccharide side chains. Carbohydrates have been shown to be important for the folding, structure, stability, and intracellular sorting of proteins and to play a role in evoking immune responses. Our data are in agreement with the O-glycosylation profile of the SARS-CoV 3a protein (Nal et al., 2005), but in contrast with the SARS-CoV surface glycoprotein, which seems not to be O-glycosylated. A previous study (Shajahan et al., 2020) confirmed our O-glycosylation results on the SARS-CoV-2 surface glycoprotein for sites 673, 678 and 686. In contrast, we did not identify O-glycosylation on the surface glycoprotein at sites Thr 323 and Ser 325, but we found predicted O-glycosylation, at lower percentages, at different positions. O-glycosylation of the SARS-CoV-2 surface glycoprotein is also predicted in several recent reports (Andersen et al., 2020). Although the function of these predicted O-linked glycans is unclear, it has been suggested that they create a 'mucin-like domain' capable of protecting SARS-CoV-2 spike protein epitopes or key residues (Bagdonaite and Wandall, 2018). Since some viruses may use mucin-like domains as glycan shields for immunoevasion, further studies and experiments could better clarify the specific role of SARS-CoV-2 spike protein O-glycosylation and whether the predicted sites can be experimentally confirmed.

Limits and possible bias of the study should be mentioned. First, the analysis here presented depends on the genomes available in the database at the time of the last access. Second, the circulation period of the virus can affect the evaluation of the evolution of the virus.

The goal of this study was to identify the evolutionary differences between a large set of SARS-CoV-2 available genomes and to predict their possible implications. The data, which show positive selective pressure and mutations that act on specific gene encoding protein (i.e. surface glycoprotein), could provide markers for vaccine design and/or for therapeutic agents (i.e. nsp12). The negative selection identified in some SARS-CoV-2 protein encoding genes could help to implement new diagnostic protocols. Finally, the identification of specific SARS-CoV-2 glycosylation sites could help to understand the interaction of the virus with its receptor and implement future mutagenesis experiments that are fundamental for strategies aimed at inhibiting the entry of SARS-CoV-2 in the cells.

Alessandra Lo Presti: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Wrote the paper.

Giovanni Rezza, Paola Stefanelli: Conceived and designed the experiments; Analyzed and interpreted the data; Wrote the paper.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses

The proximal origin of SARS-CoV-2

Global aspects of viral glycosylation

Evolutionary analysis of SARS-CoV-2: how mutation of Non-Structural Protein 6 (NSP6) could affect viral autophagy

The SARS coronavirus S glycoprotein receptor binding domain: fine mapping and functional characterization

Structure analysis of the receptor binding of 2019-nCoV

Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China

The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade

Severe acute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complex involved in mitochondrial biogenesis and intracellular signaling

Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology

A guideline for homology modeling of the proteins from newly discovered betacoronavirus, 2019 novel coronavirus(2019-nCoV)

Reconstruction of the Evolutionary Dynamics of A(H3N2) Influenza Viruses Circulating in Italy from

Coronaviruses: an overview of their replication and pathogenesis

Global Initiative on Sharing All Influenza Data

SARS coronavirus replicase proteins in pathogenesis

BioEdit: a user-friendly biological sequence alignment editor and analysis program for windows 95/98/NT

Genetic diversity and positive selection Analysis of classical swine Fever Virus Envelope protein gene E2 in East China under C-Strain vaccination

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

on behalf of the Sheffield COVID-19 Genomics Group, LaBranche CC, and Montefiori DC

Mass spectrometric characterization of proteins from the SARS virus: a preliminary report

Structural, glycosylation and antigenic variation between 2019 novel coronavirus (2019-nCoV) and SARS coronavirus (SARS-CoV). Virus Dis

Structure of SARS coronavirus spike receptor-binding domain complexed with receptor

Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus

Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2

Origin and evolution of nipah virus

Geographic and genomic distribution of SARS-CoV-2 mutations

Differential maturation and subcellular localization of severe acute respiratory syndrome coronavirus surface proteins S, M and E

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

Genetic diversity and evolution of SARS-CoV-2

Datamonkey: rapid detection of selective pressure on individual sites of codon alignments

Not so different after all: a comparison of methods for detecting amino acid sites under selection

Remdesivir and SARS-CoV-2: structural requirements at both nsp12 RdRp and nsp14 Exonuclease active-sites

Deducing the N- and O-glycosylation profile of the spike protein of novel coronavirus SARS-CoV-2

Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human

Precision mapping of the human O-GalNAc glycoproteome through SimpleCell technology

On the origin and continuing evolution of SARS-CoV-2

Differential stepwise evolution of SARS coronavirus functional proteins in different host species

Emerging WuHan (COVID-19) coronavirus: glycan shield and structure prediction of spike glycoprotein and its interaction with human CD26

Receptor recognition by novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS

Site-specific glycan analysis of the SARS-CoV-2 spike

Datamonkey 2.0: a modern web application for characterizing selective and other evolutionary processes

Emergency Committee Regarding the Outbreak of Novel Coronavirus (2019-nCoV) (Press release) Archived from the original on 31

World Health Organization (WHO), 2020b. WHO Director-General's Opening Remarks at the media Briefing on

Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation

A new coronavirus associated with human respiratory disease in China

Proteomic analysis on structural proteins of severe acute respiratory syndrome coronavirus

A highly conserved cryptic epitope in the receptor-binding domains of SARS-CoV-2 and SARS-CoV

Adaptive evolution of the spike gene of SARS coronavirus: changes in positively selected sites in different epidemic groups

Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level

Tracking global patterns of N-linked glycosylation site variation in highly variable viral glycoproteins: HIV, SIV, and HCV envelopes and influenza hemagglutinin

Virus-receptor interactions of glycosylated SARS-CoV-2 spike and human ACE2 receptor

A single asparagine-linked glycosylation site of the severe acute respiratory syndrome coronavirus spike glycoprotein facilitates inhibition by mannose-binding lectin through multiple mechanisms

We gratefully acknowledge the Authors and the Originating and Submitting Laboratories for their sequences and metadata shared through GISAID, on which this research is based. All submitters of data may be contacted directly via www.gisaid.org. We also gratefully acknowledge the Authors and the Originating and Submitting Laboratories for their sequences and metadata shared through the NCBI database. The Acknowledgment Table is reported as Table S1. We gratefully acknowledge Dr. Maria Rita Gismondo, Clinical Microbiology, Virology and Bioemergency, L. Sacco University Hospital, Milan, Italy.

Galaxy platform, 2020, GISAID Platform, Graham et al., 2008, National Center for Biotechnology Information, N-Glycosite, UNIPROT.

The authors declare no conflict of interest.

Supplementary content related to this article has been published online at https://doi.org/10.1016/j.heliyon.2020.e05001.