key: cord-0817791-tp94lmsf authors: Kakakhel, Sehrish; Khan, Hizbullah; Nigar, Kiran; Khan, Asifullah title: Genomic stratification and differential natural selection signatures among human norovirus genogroup II isolates date: 2022-03-23 journal: Arch Virol DOI: 10.1007/s00705-022-05396-9 sha: d18e2b50560d50d0732c1d51b66e43f3e0d750ca doc_id: 817791 cord_uid: tp94lmsf Noroviruses (NoVs), which are members of the family Caliciviridae, are the most common cause of gastroenteritis in humans. Ten NoV genogroups have been reported so far. Of these, genogroup II (GII) is the most prevalent, and it causes serious infections worldwide. The complete genome sequences of NoV GII isolates from different geographical regions were retrieved from the public database. The model-based clustering approach, implemented in the STRUCTURE resource, was employed for assessment of genetic composition. The MEGA X and IQ Tree tools were used for phylogenetic analysis. Genome-wide natural selection analysis was performed using maximum-likelihood-based methods. The demographic features of NoV GII genome sequences were assessed using the BEAST package. All of the NoV GII sequences initially clustered into two main subpopulations at significant K = 2, where the genotype GII.4 samples clearly split from the rest of the genotypes. This indicates a marked genetic distinction between norovirus GII.4 and non-GII.4 samples. Phylogenetic analysis showed the presence of five distinct subclades for genotype GII.2 and seven subclades for GII.4 samples. Several isolates with admixed ancestry were identified that constituted distinct subclusters in the phylogenetic tree. No continental-specific genetic distinctions were observed among the NoV GII samples. Significant genomic signatures of both positive and negative natural selection were identified across the NoV GII genes. A differential pattern of positive selection signals was inferred between the GII.4 and non-GII.4 genotypes. 
The demographic analysis revealed an increase in the effective population size of NoV GII during 2009-2010, followed by a rapid fall in 2015. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00705-022-05396-9.
Noroviruses (NoVs) are the most important pathogens causing viral gastroenteritis in humans, accounting for about 50% of all acute gastroenteritis cases [1, 2]. The World Health Organization (WHO) has estimated that NoV caused 684 million cases and 212,489 deaths worldwide in 2010 [3]. NoV infections are most prevalent in children and the elderly, causing severe symptoms and prolonged shedding [4]. NoV infection spreads among humans through multiple routes,
A multiple sequence alignment (MSA) was performed using Clustal Omega 3 [20]. MEGA X was used to extract parsimony-informative (PI) sites from the alignment. A total of 4069 PI sites were obtained from the aligned data. The LIAN v3.5 tool was employed to test the null hypothesis of linkage equilibrium within the NoV GII genomic data [21]. This program calculates the standardized index of association (I^S_A) to quantify the haplotype-wide linkage in the dataset. In addition, |D'| and r^2 were computed using DnaSP v6.0 [22] to measure linkage disequilibrium (LD). D is the difference between the observed haplotype frequency and the frequency expected in the absence of LD, and |D'| is the absolute value of D normalized by its maximum possible value given the allele frequencies. r^2 is the squared correlation between alleles at two loci [23]. The genetic structure of NoV GII was analyzed using a Bayesian model-based clustering program, i.e., STRUCTURE v2.3.4 [24]. The STRUCTURE program identifies genetically distinct subpopulations in a given dataset based on differences in allele frequencies and probabilistically assigns individuals to subpopulations. STRUCTURE was run under the admixture model with correlated allele frequencies. 
The admixture model accounts for individuals with mixed ancestry and probabilistically allocates such admixed strains to their respective subpopulations [25]. The analysis was performed with a burn-in length of 100,000, followed by 100,000 MCMC iterations with default parameters (i.e., Dirichlet parameter α and allele frequency parameter). Five independent runs were performed for each value of K (1 to 15). The K_opt (optimum number of subclusters) was determined by the Evanno ΔK approach using the STRUCTURE HARVESTER resource [26, 27]. A plot of K vs ΔK was used to determine K_opt. The value of K_opt was confirmed using various combinations of burn-in and MCMC run lengths, including 50,000-50,000, 70,000-70,000, and 100,000-100,000. The NoV GII genetic composition estimates were additionally corroborated by calculation of the fixation index (F_ST), an F-statistic, and by principal component analysis (PCA). The F_ST was calculated by analysis of molecular variance
genogroups I, II, IV, GVIII, and GIX have been reported to be associated with human diseases [10], with GII being most commonly responsible for outbreaks worldwide [11]. Within genogroup II (GII), genotype GII.4 is predominant, and the emergence of new genetic variants has resulted in a pandemic [12]. The major global variants characterized so far include the Sydney_2012, Den Haag_2006, and New Orleans_2009 variants [10, 4]. The predominance of genotype GII.4 has persisted for over two decades due to its fast rate of mutation and evolution [13]. Non-GII.4 genotypes have also caused massive epidemics and transiently surpassed genotype GII.4. These include the recently emerged GII.17 and GII.2 lineages. A novel GII.17 variant, termed the Kawasaki genotype, appeared as the primary cause of outbreaks in some Asian countries and replaced the Sydney_2012 variant [14]. However, among children, genotype GII.3 commonly causes irregular NoV infections [15]. 
The NoV genetic repertoire substantially expands within and between genotypes through recombination events [16]. There are no antiviral medicines or vaccines available so far to combat NoV infection [6, 17]. The complete genome sequences of NoV GII isolates from different geographical regions are available in the public genome sequence repositories. The high prevalence of NoV GII, along with the recent emergence of novel strains, prompted us to examine the complete genome sequences of this genogroup to understand their genetic composition and distinguishing features and the extent of possible genetic admixtures. In addition, natural selection and recombination analyses were performed to understand the possible role of these events in shaping the genetic structure of NoV GII. The findings of the current study, based on genomic features of NoV GII isolates worldwide, may have implications for devising effective vaccines against NoV infection.
Complete genome sequences of human NoV GII isolates were obtained from the Virus Pathogen Resource database hosted by NCBI [18]. Several NoV genome sequences are present in public databases with no genotype information. The genotype information for such sequences was obtained using the NoV automated online genotyping tool (version 2.0) [19]. Genome sequences that were submitted without location, host, or sampling time information were excluded. Finally, a dataset comprising 822 complete genome sequences of NoV GII was generated (Supplementary Table S1).
vpg, protease, RdRp, ORF2, and ORF3. The accuracy of the selection pressure calculation mainly depends upon the quality of the MSA. Therefore, the quality of the MUSCLE-generated MSA was checked using the GUIDANCE server, which identifies the unreliable alignment regions within an MSA using a confidence threshold score of ∼1 [44]. All eight datasets were analyzed separately using different ML-based methods with the default significance threshold (p-value) of 0.1. 
The methods included single-likelihood ancestor counting (SLAC) [45], internal fixed effects likelihood (IFEL) [46], and fixed effects likelihood (FEL) [47], accessible through the Datamonkey web server of the HyPhy package [48, 49]. These three methods identify sites that are under the influence of pervasive positive selection across all the lineages in a phylogenetic tree. The best-fit model was identified using the automated model selection tool on the Datamonkey server. Episodic positive selection signatures were detected using the MEME (mixed-effects model of evolution) method available on the Datamonkey server. Episodic positive selection affects a few lineages even when the majority of the lineages undergo purifying selection [50].
In order to assess the genetic composition using the STRUCTURE program, it is first necessary to evaluate the pattern of linkage among loci. LD is the nonrandom association of alleles at different polymorphic sites. Under free recombination, the value of I^S_A, calculated using LIAN 3.5, is expected to be zero. The I^S_A value obtained for NoV GII sequences was 0.0000 (P < 10^-4, 10,000 replicates), indicating a signal of linkage equilibrium and weak LD. To confirm the low LD, |D'| and r^2 were computed and plotted using DnaSP v5. D' is a normalized measure of LD. The average values of |D'| and r^2 were 0.8206 and 0.0522, respectively, indicating that the loci were weakly linked and that the STRUCTURE program was therefore appropriate for analysis of the NoV dataset. The admixture model implemented in STRUCTURE was run for K = 1 to 13, with five independent simulation runs to confirm the consistency of parameter estimates and the reproducibility of the clusters (see Methodology). A K_opt of 2 was obtained from the plot of K vs. ΔK (Fig. 1A). This
(AMOVA) implemented in ARLEQUIN v3.11 with 1,000 permutations [28]. 
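The Evanno ΔK criterion used to choose K_opt from replicate STRUCTURE runs can be illustrated with a short Python sketch. It assumes the mean-based form of the statistic, ΔK = |L(K+1) - 2L(K) + L(K-1)| / sd[L(K)], where L(K) is the mean ln P(D) over the replicate runs at a given K. The function name and all log-likelihood values below are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} from replicate ln P(D) values per K.

    lnp_by_k maps each K to a list of ln P(D) values, one per replicate run.
    Delta K is defined only for K values that have both neighbours (K-1, K+1).
    """
    mean_l = {k: mean(v) for k, v in lnp_by_k.items()}
    delta_k = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in mean_l or k + 1 not in mean_l:
            continue  # the first and last K have no second difference
        second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        spread = stdev(lnp_by_k[k])  # requires at least two replicates
        if spread > 0:
            delta_k[k] = second_diff / spread
    return delta_k


# Hypothetical ln P(D) values for three replicate runs per K (not study data).
toy_runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-790.0, -792.0, -788.0],
    4: [-785.0, -786.0, -784.0],
}
delta_k = evanno_delta_k(toy_runs)
k_opt = max(delta_k, key=delta_k.get)
print(delta_k, k_opt)
```

Because ΔK needs both neighbouring K values, it cannot be evaluated at the smallest or largest K tested, which is one reason the tested range is usually wider than the expected number of clusters.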
AMOVA partitions the variance at different levels of population subdivision and yields F_ST. PCA was performed using PLINK v1.9 [29], and the results were visualized with the built-in R function "prcomp".
A neighbor-joining (NJ)-based tree was constructed using MEGA X with a minimum of 1,000 bootstrap replicates. A maximum-likelihood (ML) tree was constructed using IQ-TREE [30], employing the GTR + I + G substitution model and ultrafast bootstrap replicates [31]. The tree topology was visualized and annotated using FigTree v1.4.4 [32].
The aligned complete genome dataset was used for the identification of potential recombination events using the seven different methods implemented in the RDP4 package, including RDP [33], GENECONV [34], BOOTSCAN [35], MaxChi [36], CHIMAERA [37], SiSCAN [38], and 3SEQ [39]. A recombination event was considered likely if it was identified by at least three of these methods with a p-value below 0.00001.
Fluctuations in the effective population size of NoV GII with respect to time were inferred for the available isolates using the Bayesian skyline model [40] in BEAST2 [41]. The selection of the best-fit nucleotide substitution model was achieved using jModelTest [42]. GTR+I+G was chosen as the best model of nucleotide substitution. The best clock model was determined using path sampling (PS) and stepping stone sampling (SS), implemented in the BEAST v1.10.4 program, by calculating marginal likelihood values. A relaxed uncorrelated clock model was selected as the best-fit model. The MCMC chain was run for 300 million generations to ensure convergence. Convergence of the MCMC log output files and an effective sample size (ESS) > 200 were verified using the Tracer v1.7 program [43].
A dataset of 538 NoV GII sequences was prepared for natural selection analysis. Potentially recombinant samples were excluded from the analyses to avoid inferential biases. 
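The consensus rule for recombination events described above (support from at least three detection methods at p < 0.00001), together with the exclusion of flagged sequences before selection analysis, can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout, the function names, and the sequence identifiers are hypothetical, and the sketch does not read or reproduce RDP4 output files.

```python
P_THRESHOLD = 1e-5   # p-value cut-off quoted in the text
MIN_METHODS = 3      # minimum number of supporting detection methods


def is_supported(method_pvalues, p_threshold=P_THRESHOLD, min_methods=MIN_METHODS):
    """True if enough methods report the event below the p-value threshold.

    method_pvalues maps a method name to its p-value, or None if the
    method did not detect the event.
    """
    supporting = [m for m, p in method_pvalues.items()
                  if p is not None and p < p_threshold]
    return len(supporting) >= min_methods


def drop_recombinants(sequence_ids, events):
    """Remove every sequence flagged as recombinant by a supported event."""
    flagged = {e["recombinant"] for e in events if is_supported(e["pvalues"])}
    return [s for s in sequence_ids if s not in flagged]


# Hypothetical events and identifiers, for illustration only.
events = [
    {"recombinant": "seq_A",
     "pvalues": {"RDP": 1e-9, "GENECONV": 3e-7, "MaxChi": 2e-6, "3SEQ": None}},
    {"recombinant": "seq_B",
     "pvalues": {"RDP": 1e-8, "GENECONV": 0.02, "MaxChi": None, "3SEQ": 4e-4}},
]
kept = drop_recombinants(["seq_A", "seq_B", "seq_C"], events)
print(kept)
```

In this toy input only seq_A meets the rule, since seq_B is supported by a single method below the threshold.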
A total of eight datasets were generated, corresponding to eight protein coding sequences, i.e., p48, NTPase, p22,
and GII.26 constitute C-1.3b. Likewise, at a K_opt of 5, the GII.2 genotype, which formed the C-1.1 cluster at K_opt = 3, further stratified into two lineages, i.e., C-1.1a and C-1.1b (Fig. 2B). The overall clustering pattern of samples obtained at K_opt of 3, 4, and 5 did not reveal any geography-based distinction among the NoV GII isolates, and genetic stratification was mainly based on genotype identity. The GII.4 samples, initially split at K = 2, stratified further during subsequent Bayesian clustering analysis (Fig. 2). The samples of genotype GII.4 were stratified into two major lineages and five minor lineages (Fig. 2C). The C-2.1 cluster corresponds to the GII.4-Sydney_2012, GII.4-New Orleans_2009, and GII.4-Apeldoorn_2007 strains, whereas the C-2.2 cluster corresponds to the Den Haag_2006b strain. At K = 5, C-2.1 stratified into three lineages, i.e., C-2.1a, C-2.1b, and C-2.1c, while C-2.2 split into two sublineages, i.e., C-2.2a and C-2.2b (Fig. 2D). The C-2.1a subpopulation includes the Sydney_2012 strains, and C-2.1b includes the Sydney_2012 and New Orleans_2009 GII.4 strains, whereas the C-2.1c cluster includes the GII.4 Sydney_2012 samples. The GII.4-Den Haag_2006b strains initially clustered in C-2.2, but this cluster subsequently stratified into two lineages, designated as C-2.2a and C-2.2b (Supplementary Table S2).
PCA was used to confirm the genetic composition and stratification pattern of NoV GII isolates. The first two principal components (PCs) accounted for 26.4% of the total genetic variance, with 9.09% attributed to the first PC and 17.31% to the second PC. The PCs split the GII.4 samples from the non-GII.4 genotypes (Fig. 3). The genotype GII.4, GII.2, GII.3, and GII.12 samples clustered separately, while the GII.26, GII.17, GII.6, and GII.7 samples clustered closely. 
The stratification pattern observed in the PCA plot is consistent with the STRUCTURE results.
ML- and NJ-based phylogenetic analysis produced similar tree topologies. The phylogenetic tree results were examined according to the clustering pattern obtained using STRUCTURE. The NoV GII samples grouped into ten independent clades in the NJ tree (Fig. 4A), which correspond to the ten clusters (C-1.1a, C-1.1b, C-1.2, C-1.3a, C-1.3b, C-2.2a, C-2.2b, C-2.1a, C-2.1b, C-2.1c) obtained by STRUCTURE analysis. Some admixed strains were observed in the phylogenetic tree clade represented by the C-1.b*, C-1.2b*,
revealed a basic stratification of all of the NoV isolates into two subpopulations. In addition, an AMOVA test suggested marked genetic distinction, i.e., F_ST = 0.53293 (P = 0.0000), between the two subpopulations. Cluster 1 (C-1), obtained at a K_opt of 2, comprises all of the NoV genotype samples except GII.4. The GII.4 samples comprised a separate subpopulation (C-2) (Fig. 1B). Several admixed strains were observed in both the C-1 and C-2 clusters obtained at K_opt = 2. This observed genetic stratification of NoV samples was not congruent with the isolates' geographical origin. Further analysis was performed to investigate the genetic stratification within each of the major genetic components. The C-1 cluster was stratified with a significant peak at K_opt = 3, followed by minor peaks at K_opt = 4 and K_opt = 5 (Fig. 2A). The K_opt value of 3 reveals the diversification of C-1 into three further subpopulations/lineages, designated as C-1.1, C-1.2, and C-1.3 (Fig. 2B). C-1.1 consists of genotype GII.2 strains. The UK strains of the GII.2 genotype were observed to be admixed, with significant membership scores ranging from 0.434 to 0.500 for the clusters C-1.2 and C-1.3. Cluster C-1.2 consists of genotype GII.17 strains, while the C-1.3 cluster consists of GII.3, GII.5, GII.6, GII.7, GII.8, GII.12, and GII.13 strains. 
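How membership scores separate clearly assigned isolates from admixed ones can be illustrated with a small Python sketch. It assumes that an isolate is assigned to a cluster only when its highest membership score reaches a cut-off; the cut-off of 0.8, the function name, and the example scores are hypothetical, since the assignment threshold used in the study is not stated in the text.

```python
def classify_by_membership(q_scores, cluster_names, threshold=0.8):
    """Assign an isolate to its best cluster, or label it as admixed.

    q_scores holds one membership score per cluster and should sum to 1.
    threshold is a hypothetical cut-off for a confident assignment.
    """
    if len(q_scores) != len(cluster_names):
        raise ValueError("one score per cluster is required")
    if abs(sum(q_scores) - 1.0) > 1e-6:
        raise ValueError("membership scores should sum to 1")
    best = max(range(len(q_scores)), key=lambda i: q_scores[i])
    if q_scores[best] >= threshold:
        return cluster_names[best]
    return "admixed"


clusters = ["C-1.1", "C-1.2", "C-1.3"]
# Hypothetical membership scores, for illustration only.
print(classify_by_membership([0.95, 0.03, 0.02], clusters))
print(classify_by_membership([0.10, 0.48, 0.42], clusters))
```

An isolate whose ancestry is split roughly evenly between two clusters, as in the second call, falls below any reasonable cut-off and is reported as admixed.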
At K_opt = 4, C-1.3 further stratified into two subclusters (i.e., C-1.3a and C-1.3b) (Fig. 2B). The GII.3 samples constitute C-1.3a, while the samples from genotypes GII.5, GII.6, GII.7, GII.12, GII.13,
Fig. 1 (A) Determination of K_opt for NoV GII. The graph shows a plot of K versus ΔK, which defines the optimum number of clusters (K_opt) in the NoV GII population. K represents the number of clusters, while ΔK is based on the rate of change in the log probability of the data between successive values of K. The plot was generated using large values for the burn-in length (100,000) and the number of MCMC iterations (100,000). The major peak at K = 2 shows that the NoV genetic structure is grouped into two main subpopulations. (B) Estimate of the population genetic structure of NoV at a K_opt of 2 using an admixture model in STRUCTURE software. C-1 comprises genomic entries from genotypes GII.1, GII.2, GII.3, GII.5, GII.6, GII.7, GII.8, GII.12, GII.13, GII.17, and GII.26, and C-2 comprises genotype GII.4 strains [C]
The initial clustering pattern of the phylogenetic tree results is consistent with the STRUCTURE result.
Recombination analysis was performed to confirm the admixed samples observed in STRUCTURE and phylogenetic tree analysis. A total of 40 recombinant strains from 822 sequences of NoV GII were identified, with a threshold of p < 0.00001 (Supplementary Table S3). The STRUCTURE and RDP4 results were congruent for many admixed and recombinant strains, with a few exceptions. Different recombination breakpoints were observed in the GII.4 and non-GII.4 genotypes. In the non-GII.4 genotypes, the majority of recombination breakpoints were detected at the junction of ORF2 and ORF3, while in the GII.4 genotypes, recombination breakpoints were mostly detected in the ORF1 region. A few strains were found to have multiple recombination breakpoints. For instance, the sample MH218571.1 was observed to have undergone three recombination events. 
RDP4 also identified this as a recombinant with probable minor and major parents. Both inter-genotypic and intra-genotypic recombination events were observed in NoV GII genotypes. For example, a Chinese strain (MG745991.1) of genotype GII.2 had undergone intra-genotype recombination with both the major
C-1.3b*, and C-2.1a* clusters. The ML phylogenetic tree revealed that cluster C-1.1 further stratified into five minor clades (Fig. 4B). Contrary to the phylogenetic tree stratification, STRUCTURE failed to split GII.2 into additional lineages and identified only the two main subpopulations among the genotype GII.2 samples (Fig. 2A and B). The ML-based tree inferred a total of 15 clades with high bootstrap (>90%) support (Fig. 4B). Each main clade in the ML tree was observed to stratify into additional small variants/clades. In order to prevent possible bias during phylogenetic inferences, the analysis was performed after filtering out the recombinant sequences. However, a similar clustering pattern was observed in a phylogenetic tree constructed without recombinant sequences, except that the GII.4 Apeldoorn_2007 variant formed a separate lineage (Supplementary Fig. S1).
The signature of episodic positive selection was found in all of the coding genes of NoVs. The MEME method identified a total of 72 codons that had possibly evolved under significant episodic diversifying selection (Supplementary Table S4). Most of these codons are found in the VP1 and RdRp coding genes. The VP2 and NTPase genes have 25 and 11 codons, respectively, that are under selection pressure. The protease
(i.e. MG746023.1) and minor (MG745990.1) parents of the GII.2 genotype, while a Japanese strain (LC209439.1) of genotype GII.2 had undergone inter-genotypic recombination and originated from a GII.2 major parent (LC209463.1) and a minor parent (KJ196283.1) of the GII.4 genotype.
The complete GII NoV genome sequences obtained from the public database spanned a period of almost two decades. 
BSP plot analysis generally showed a consistent pattern of the effective population size of GII NoV. However, a slight increase in the population size was observed from 2009 to
Table S5). In the case of the NS5 (Vpg) protein, codon 127, coding for asparagine (N), is under the influence of episodic positive selection and is mutated to histidine (H) in GII.17 genotype samples. Likewise, in the case of the NS6 (protease) protein, only one codon is under episodic positive selection in GII.4 strains, but seven different codons are under episodic positive
gene has six sites with evidence of episodic positive selection despite the fact that the coding region is comparatively short. The analysis conducted using FEL, IFEL, and SLAC identified limited signals of pervasive positive selection in the non-structural proteins p48, NTPase, p22, vpg, and RdRp (Table 1). However, in the case of the VP1 and VP2 structural protein coding genes, many codons appeared to be under the influence of pervasive positive selection (Table 1). Notably, a large number of codons are evolving under the influence of strong negative selection (Table 2). The evidence of purifying selection indicates a highly adapted phenotype, probably caused by constraints imposed by protein structure and function.
We then investigated whether the NoV GII population clusters and genotypes contain differential or homogeneous natural selection signatures, which highlight the differences in the antigenicity and dispersal pattern of the pathogen. The analysis revealed differential positive selection signatures in the structural and non-structural proteins of NoV GII. In the VP1 protein, 19 distinct sites with features of episodic positive selection were detected, specifically in the GII.4 strains (Supplementary Table S5). Likewise, in the case of VP2 protein, 14 distinct codons had undergone episodic positive selection only in the GII.4 genotypes (Supplementary Table S5). 
Among the nonstructural proteins NS1/2, NS3, NS4, NS5, NS6, and NS7, evidence of differential episodic positive selection was observed among different genotypes. A total of 13 codons in the NS1/2 (p48) gene were found to be under positive natural selection. Among these, six were positively selected in the GII.4 genotype specifically, while the other seven were selected in the non-GII.4 genotype samples (Supplementary Table S5). Codon 44 of NS1/2 codes for serine (S) in the UK isolates of genotype GII.3 and phenylalanine (F) in the Asian GII.3 strains. Similarly, in the NS3 (NTPase) gene, two codons were under positive selection among the GII.4 isolates, while among all non-GII.4 genotypes, 11 distinct sites were under positive selection pressure (Supplementary Table S5). Histidine 224 of the NTPase was substituted by lysine (K) in the GII.6, GII.7, and GII.14 genotypes and by glutamine (Q) in the GII.17 genotype (Supplementary Table S5). A selection signature was observed across seven codons in the NS4 (p22) gene, differentially selected among the GII.4, GII.2,
, have been reported to be prevalent globally [54]. Recombination among NoV strains occurs at high frequency and acts as a major driving force in viral evolution. Recombination allows the virus to increase its genetic fitness and spread in the host population by escaping the host immune response [55]. The admixture in NoV is possibly responsible for the genetic diversification of the C-1.2b and C-1.3b clusters. Likewise, the |D'|, r^2, and I^S_A statistics indicated weak linkage of norovirus GII isolates in the current study, indirectly suggesting a role of recombination in shaping the evolution of norovirus GII isolates.
The BSP plot generated based on markers throughout the genome suggests a generally stable effective population size for the NoV GII isolates originating from human hosts (Fig. 5). However, the BSP plot also implies a rapid increase in the effective population size during 2009-10. 
This increase may have coincided with the large outbreaks and epidemic spread of the GII.4 New Orleans_2009 variant [57]. Likewise, a novel GII.12 strain also emerged during this period and caused several outbreaks [58]. The effective population size fell sharply in 2015, and this is likely to correspond to a gain of host immunity against the dominant NoV variants.
Substantial signals of episodic diversifying selection were observed in all of the proteins, including both the structural and non-structural proteins. However, few pervasive positive selection signals were identified in the VP1 and VPG genes. Xingguang et al. reported a lack of episodic positive selection in genotype GII.2 isolates and suggested genetic drift as a possible mechanism for NoV GII.2 evolution [59]. However, significant positive selection signatures were identified for the GII.2 strains in the current study (Supplementary Tables S4 and S5). This suggests that selection pressure is a possible driving force in GII.2 evolution. Other studies have also shown a small number of positive selection sites in the VP1 protein of NoV GII isolates [52, 60]. The VP1 protein plays a fundamental role in the interaction of NoVs with their host cells and is considered to be a key site for immune recognition and receptor binding. Therefore, this protein could be a potential target for vaccine development [61]. We identified several sites that are under positive selection pressure in the P1, P2, and shell domains of the VP1 protein. Mutations at positions 282 to 395 of VP1 (Supplementary Table S5), which are part of the P2 domain, have been reported to play an important role in
selection in strains of non-GII.4 genotypes. Similarly, the NS7 (RdRp) protein coding gene was also observed to be a target of episodic diversifying selection, and 17 codons are specifically selected in the genotype GII.4. 
Different codons in the NS7 coding gene were found to be under selection pressure in the GII.2, GII.3, and GII.17 genotypes (Supplementary Table S5). Overall, major differential natural selection features were observed between the GII.4 and non-GII.4 genotypes, and marked differential selection signatures were observed among the non-GII.4 samples, including the GII.2, GII.3, and GII.17 genotypes.
A fast evolutionary rate, selection pressure, and recombination act as powerful evolutionary forces that increase the genetic diversity of noroviruses [50]. Owing to their small genome size, high mutation rate, short generation time, and large population size, RNA viruses are suitable models to study evolution in the context of population genetics. Previous studies have focused on specific NoV genotypes, part of the genome, or a specific geographic region [1, 51]. In the current study, we performed a comprehensive genome-wide analysis of NoV GII isolates from different continents to gain a better understanding of their genetic structure, recombination events, and natural selection pattern. The genetic structure analyses in the current study did not reveal any geographically based distinctions among the NoV GII isolates. Due to the high degree of mobility and frequent travel in the modern world, NoV GII isolates might have been disseminated worldwide, and hence, no regional distinctions were observed among the GII isolates. However, Tohma et al. recently reported some non-typeable genotypes of NoV GII circulating in South America that exhibited marked genetic divergence from other NoV genotypes [66]. Genetic structure analysis revealed that the genotype GII.4 strains differ from those of the other NoV GII genotypes (Fig. 1). The stratification of the NoV GII samples into two main subpopulations was also supported by the branching pattern of the phylogenetic tree and by PCA (Figs. 3 and 4). However, Kobayashi et al. 
reported three clusters in the NoV population based on analysis of ORF2. Moreover, additional analysis of GII.4 sequences suggested extra clustering at K = 2 and 5 (Fig. 2C). At K = 5, the GII.4 Sydney_2012 variant stratified into three lineages. This stratification pattern of the Sydney_2012 variant was also reported earlier based on ORF2 gene sequences [53]. We identified admixed strains using the admixture model/linkage model implemented in the STRUCTURE program. The admixture model fails to take into account the physical relationship between loci, and the proportion of
its interaction with human blood group antigens (HBGAs) [62]. The S domain is highly conserved across different genotypes, and the antigenic sites within this domain are mostly cross-reactive [63]. In addition to positive selection, a large number of sites were also found to be under the influence of negative selection, indicating that purifying selection has occurred. In general, positive selection sites may be influenced by immune pressure, leading to escape mutations, whereas negative selection sites may prevent deterioration of antigenic function and structures [64]. The sites under positive selection could provide markers for vaccine design. The identification of negatively selected sites in NoV GII genes might help to identify highly conserved regions that will be useful in new diagnostic protocols [65]. A marked difference in the positive selection signature pattern was observed between the GII.4 samples and the other GII genotypes, and this might have shaped the genetic composition of the GII.4 genotype.
The complete-genome-based population genetic analysis presented here revealed significant differences between the GII.4 genotype and the other NoV GII genotypes, which might be due to specific positive selection signatures. The genetic stratification of GII.4 samples suggests the emergence of additional GII.4 lineages. 
The analysis did not reveal geographical variations in the genetic composition of the NoV GII strains. The data also suggest that recombination and selection pressure are major factors driving the genetic diversification of NoV GII strains and the emergence of new lineages. These findings might be useful for planning effective strategies to combat NoV GII infections.
Genomic diversity and phylogeography of norovirus in China
Visualization by immune electron microscopy of a 27-nm particle associated with acute infectious nonbacterial gastroenteritis
World Health Organization global estimates and regional comparisons of the burden of foodborne disease in 2010
Norovirus Infections and Disease in Lower-Middle- and Low-Income Countries
Characterization of the genomic diversity of norovirus in linked patients using a metagenomic deep sequencing approach
High-resolution cryo-EM structures of outbreak strain human norovirus shells reveal size variations
Deep sequencing of norovirus genomes defines evolutionary patterns in an urban tropical setting
Genomics analyses of GIV and GVI noroviruses reveal the distinct clustering of human and animal viruses
Updated classification of norovirus genogroups and genotypes
Molecular epidemiology and spatiotemporal dynamics of norovirus associated with sporadic acute gastroenteritis during
Norovirus transmission dynamics: a modelling review
Evolutionary dynamics of GII.4 noroviruses over a 34-year period
Changes in norovirus genotype diversity in gastroenteritis outbreaks in
Molecular and evolutionary characterization of norovirus GII.17 in the northern region of Brazil
Comparative evolution of GII.3 and GII.4 norovirus over a 31-year period
Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences
An exact nonparametric method for inferring mosaic structure in sequence triplets
Bayesian coalescent inference of past population dynamics from molecular sequences
BEAST 2: a software platform for Bayesian evolutionary analysis
jModelTest: phylogenetic model averaging
GUIDANCE: a web server for assessing alignment confidence scores
Not so different after all: a comparison of methods for detecting amino acid sites under selection
Adaptation to different human populations by HIV-1 revealed by codon-based analyses
Datamonkey: rapid detection of selective pressure on individual sites of codon alignments
HyPhy: hypothesis testing using phylogenies
Detecting individual sites subject to episodic diversifying selection
Comparative phylogenetic analyses of recombinant noroviruses based on different protein-encoding regions show the recombination-associated evolution pattern
Temporal dynamics of norovirus GII.4 variants in Brazil between
Molecular evolution of the capsid gene in human norovirus genogroup II
Evolutionary and molecular analysis of complete genome sequences of norovirus from Brazil: emerging recombinant strain GII.P16/GII.4. Frontiers in microbiology
Evolution of norovirus
Prevalence and genetic diversity of noroviruses in adults with acute gastroenteritis in
Impact of an emergent norovirus variant in 2009 on norovirus outbreak activity in the United States
The emergence and evolution of the novel epidemic norovirus GII.4 variant Sydney
Genetic characterization of norovirus GII.4 variants circulating in Canada using a metagenomic technique
Virus pathogen database and analysis resource (ViPR): a comprehensive bioinformatics database and analysis resource for the coronavirus research community
An automated genotyping tool for enteroviruses and noroviruses
Clustal Omega for making accurate alignments of many protein sequences
LIAN 3.0: detecting linkage disequilibrium in multilocus data
DNA sequence polymorphism analysis of large data sets
A comparison of linkage disequilibrium measures for fine-scale mapping
Inference of population structure using multilocus genotype data
Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies
Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study
STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method
Arlequin (version 3.0): an integrated software package for population genetics data analysis
PLINK: a tool set for whole-genome association and population-based linkage analyses
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
UFBoot2: improving the ultrafast bootstrap approximation
RDP4: Detection and analysis of recombination patterns in virus genomes
Possible emergence of new geminiviruses by frequent recombination
A modified bootscan algorithm for automated identification of recombinant sequences and recombination breakpoints
Analyzing the mosaic structure of genes
Evaluation of methods for detecting recombination from DNA sequences: computer simulations
Identification of a broadly cross-reactive epitope in the inner shell of the norovirus capsid
Selective pressure on SARS-CoV-2 protein coding genes and glycosylation site prediction
Genome-wide analyses of human noroviruses provide insights on evolutionary dynamics and evidence of coexisting viral populations evolving under recombination constraints
Molecular evolution of human norovirus GII.2 clusters
Static and evolving norovirus genotypes: implications for epidemiology and immunity
Human norovirus proteins: implications in the replicative cycle, pathogenesis, and the host immune response
Integrated urban water cycle management: the UrbanCycle model
The authors acknowledge the National Center of Physics, Islamabad, for providing access to high-performance computing (HPC) for data analysis.
The authors declare that they have no competing interests.
Authors' contributions S.K. and A.K. conceived the research plan. S.K., H.K., and K.N. performed the data analyses. S.K. wrote the initial draft of the manuscript. A.K. supervised the study, critically reviewed the analyses, and finalized the draft preparation.
The study was conducted without any specific funding or financial grant support.
Availability of supplementary data Supplementary data relevant to this study are provided.