key: cord-316968-rowoylge authors: Zhang, Wenjuan; Zhang, Yuan; Zhong, Yang title: Using maximum likelihood method to detect adaptive evolution of HCV envelope protein-coding genes date: 2006 journal: Chin Sci Bull DOI: 10.1007/s11434-006-2118-9 sha: doc_id: 316968 cord_uid: rowoylge Nonsynonymous-synonymous substitution rate ratio (d (N)/d (S)) is an important measure for evaluating selective pressure based on the protein-coding sequences. Maximum likelihood (ML) method with codon-substitution models is a powerful statistic tool for detecting amino acid sites under positive selection and adaptive evolution. We analyzed the hepatitis C virus (HCV) envelope protein-coding sequences from 18 general geno/subtypes worldwide, and found 4 amino acid sites under positive selection. Since these sites are located in different immune epitopes, it is reasonable to anticipate that our study would have potential values in biomedicine. It also suggests that the ML method is an effective way to detect adaptive evolution in virus proteins with relatively high genetic diversity. The basic process of adaptive evolution by natural selection is the replacement of one allele gene by another with a higher fitness in a population. Detecting the adaptive evolution would be helpful for better understanding of bio-evolutionary mechanism and corresponding variation in structure and function [1] . The nonsynonymous-synonymous substitution rate ratio (d N /d S ) is an important indicator of selective pressure at the protein-coding gene, with d N /d S = 1 meaning neutral mutation, d N /d S <1 purifying selection, and d N /d S >1 diversifying positive selection (i.e. adaptive evolution). In comparison with a large amount of neutral mutations and purifying selections, positive selection is rare and hardly detected effectively because it often just occurs on a few of sites or during a short period [1] . In particular, for some protein-coding sequences with high genetic diversity that might result from relatively high mutation rates or long evolutionary history, it is much difficult to infer whether positive selection exists. For example, hepatitis C virus (HCV) is a type of RNA viruses with high mutation rates, and its genotypes have emerged in determining the clinic variation, main features of chronic infection and duration of antiviral therapy. Although the rapid variation of HCV has attracted the attention of virologists and evolution biologists, it is still unclear how HCV evades the host immune response and the mechanism of chronic infection [2] . Envelope glycoproteins E1 and E2 of HCV are involved in virus attaching to the host cell as well as in virus endocytosis and fusion with host membrane. E2 protein contains two highly variable regions called hypervariable regions 1 and 2 (HVR1 and HVR2) and two CD81-binding sites. As one of the receptors of E2 protein [3] , CD81 is a bridge between virus and host cell. HVR1 is implicated in the SCARB1-mediated cell entry. It was reported that despite strong amino acid sequence variability related to strong pressures towards change, the chemicophysical properties and conformation of HVR1 were highly conserved. The conservation of positively charged residues located at specific sequence positions of HVR1 indicates that HVR1 is involved in interactions with negatively charged molecules on host cell surface. This possible interaction probably plays a role in host cell recognition and attachment [4] . HVR2 and CD81-binding sites may be involved in sensitivity and/or resistance to IFN-alpha therapy. It is therefore considered that HCV evades the host immune response through mutation in some amino acid sites of envelope proteins, which result in recognition error during HCV contacting host cell. These mutations will be fixed under pressure driven by host immune system environment and form the adaptive evolution of HCV genomes. Previous studies focused on exploration of the adaptive evolution within an individual HCV subtype. For example, positively selected amino acid sites in the entire coding sequences of HCV subtype1b were identified [5] . The increasing availability of data storing in HCV databases allows us to analyze HCV genome evolution on a large scale of genetic diversity from more quantitative frameworks based on statistical inference [6, 7] . Technologically, a number of advanced methods have been proposed to reconstruct ancestral sequences or estimate the parameters under different substitution models when calculating nonsynonymous-synonymous rate ratio. For example, when all the sites evolve independently in one substitution model, a parsimony approach can be used to infer ancestral sequences and compute substitution numbers of different types. However, estimation of parameters by the parsimony approach may be biased because it does not account for multiple substitutions on one site. The maximum likelihood (ML) method is another way to estimate model parameters and more strict. It is employed not only to select the best phylogenic tree that fit the real data [1] , but also to detect select pressure on sites under different codon substitution models, especially for amino acid sites undergoing positive selection [8] . The purpose of this study is therefore to use the ML method [9] to infer adaptive evolution and positively selected amino acid sites of HCV envelope protein entire coding sequences containing all 6 HCV genotypes. The scientists that expert in the fields of HCV genetic variability and development of HCV sequence databases (such as the Hepatitis Virus Database (Japan), euHCVdb (France) [6] , and Los Alamos (United States) [7] ) meet to re-examine the status of HCV genotype nomenclature. HCV variants can be classified into 6 genotypes representing the 6 genetic groups defined by phylogenetic analysis [10] . The confirmed genotypes with 27 complete HCV genome sequences (18 subtypes) were defined according to the nomenclature stipulated in 2004 Heidelberg conference [10, 11] . The proposal provides the framework by which the HCV databases store and provide access to data on HCV. Considering the computational workload of adaptive evolution detection and statistical significance of the data analysis, this study used the sequences of these 27 complete annotated HCV genomes (Table 1) [6, 10] . The average numbers of amino acid sites for E1 and E2 were 192 and 367, respectively. Amino acid sequences of E1 and E2 proteins were aligned using Clustal X 1.83 [12] [13] [14] . The nucleic acid sequences were aligned according to the protein alignments with Tranalign program in the EMBOSSwin software package [15] . These nucleic acid sequences Table 1 The entire coding sequences of HCV used in this study a) Genotype GenBank Accession No. a) Classification of genotypes is available at http://euhcvdb.ibcp. fr/euHCVdb. were retrieved from the GenBank database (release 155.0). In order to infer a reliable phylogeny of HCV geno/subtypes, we used the alignment result of 27 complete HCV polyprotein coding sequences for evolution tree reconstruction. Neighbor-joining method with Kimura-2 parameter model implemented in MEGA 3.1 was used for phylogenetic analysis [16] [17] [18] . Clade robustness was measured by bootstrap method with 1000 replicates. Nonsynonymous-synonymous substitution rate ratio (ω = d N /d S ) was calculated by site-specific models of codon substitution models according to the results of phylogeny tree and sequences alignment. An ω significantly greater than 1 means that the nonsynonmous mutations are fixed at a higher rate than synonymous mutations and the evolution of this site is driven by positive selection. The model with maximum likelihood ratio is considered as the best model to fit the data. The likelihood-ratio test (LRT) was used to compare twice the log-likelihood differences between two nested models and with a χ 2 distribution to identify the statistics significance. The degrees of freedom (df) used in LRT were equal to the difference in the number of parameters between the two models [19] . We used three pairs of models to form three LRTs: M0 (one-ratio) and M3 (discrete), M1a (nearly neutral) and M2a (positive selection), and M7 (β ) and M8 (β & ω ). The simplest model, M0, assumes one ω for all sites. Model M1a (nearly neutral) allows two classes of sites with 0< ω 0 <1 and ω 1 = 1 in proportions p 0 of conserved sites and p 1 = 1−p 0 of neutral sites, respectively. Based on M1a, M2a (positive selection) adds an additional class of sites with frequency p 2 = 1−p 0 −p 1 and an ω 2 estimated from the data. M3 (discrete) uses an unconstrained discrete distribution to model heterogeneous ω ratios among sites. M7 (β ) assumes a β (p,q)-distribution for 0≤ω≤1. M8 (β & ω ) adds to M7 an extra category, with proportion p 1 of sites with ω 1 , while the rest of sites (at frequency p 0 = 1−p 1 ) have ω from the β (p,q)-distribution between 0 and 1. This model can be compared with M7 to test the presence of positive sites using a likelihood-ratio test (LRT) [19] . In this study, site-specific models were used with Codeml in the PAML 3.14b package [9] . We tested positive selection over sites of coding sequences by comparing twice the log-likelihood differences between M1a vs. M2a and M7 vs. M8 with a χ 2 distribution in the LRT. Phylogenic tree shown in Fig. 1 was consistent with the phylogeny analysis based on complete genome sequences published previously [2] and the one available in http://hcv.lanl.gov/content/hcv-db/Distances/HCV_vari ability.html. The results of identifying positively selected amino acid sites in the coding region of E1 are summarized in Table 2 . The LRTs of adaptive evolution suggested that the model of one ω ratio for all sites (M0) was rejected when compared with model M3 (2δL = 569.66, P<0.01, df = 4). The LRT statistic for comparing M1a (nearly neutral) and M2a (positive selection) showed that M2a did not have precedence over M1a (2δL = 0, df = 2). Indeed, model M2a and M1a had the same likelihood value and the estimations of parameters under these models were similar. In M2a, p 1 and p 2 could be combined into one because ω 2 = 1. In this way, M2a was equivalent to M1a. Therefore, we could not infer positive selection from this comparison. Model M8 was significantly prior to M7 (2δL = 6.24, P<0.05, df = 2). Model M3 provided three proportions of sites, p 0 , p 1 and p 2 with ω ratio of 0.0141, 0.1134, and 0.3231 respectively. It suggested a large proportion of sites (~85%) under strong purifying selection. Another piece of evidence for E1 gene being negatively selected was that ω in M8 was no greater than 1. Models that allow for positively selected sites are M2a and M8 in the three pairs of nested models. However, neither of these two models suggested the existence of sites of E1 protein under positive Darwinian selection. The results of identifying positively selected amino acid sites in the coding region of E2 are also summarized in Table 2 with ω values between 0 and 1. However, different from those in E1 protein, positively selected sites with ω ratios greater than 1 were detected in M2a (ω = 7.3185) and M8 (ω = 4.9507). These sites were 1E, 8N, 14A and 21T which all located in HVR1 of E2 protein. To assess the potential impact of the adaptive mutations, sites of E2 protein under positive selection were mapped onto immune epitope against HCV based on the epitope maps from HCV immunology database (http://hcv.lanl.gov/content/immuno/immuno-main.htm l) [20] (Table 3 ). All of the amino acid sites under adaptive evolution were located in B-cell epitopes of rat. Only one site (21T) was found in T-cell epitopes of human and transgenic mouse. Two 14A and 21T were also located in T-helper epitopes of human. It probably suggested that humoral immune response plays a key role in the immune clearance and exert more selective pressure on HCV replication than cell mediated response. Detecting adaptive evolution is a bioinformatics exploration based on the knowledge of genetics and statistics. d N and d S as well as their ratio ω which measures the selective pressure at the amino acid level provide powerful tools for better understanding of the effect of natural selection on molecular evolution. An ω significantly greater than 1 means that nonsynonymous mutations offer fitness advantages and this lineage (in lineage-specific models) or this critical amino acid site in the protein (in the site-specific models) are considered under positive selection driven by environment. Though ω ratio is a sensitive measure of positive selection, both lineage-specific models and site-specific models may lack power in detecting positive selection if adaptive evolution occurs at a few time points and affects a few amino acids. We need more robust statistic tools to test the hypothesis models [19, 21] . Maximum likelihood method and LRT could help to identify the best codon substitution models to fit the real data, and some models such as M2a and M8 have been successfully used for detecting positive selection. We took HCV envelope glycoprotein as an example to explore the adaptive evolution driven by immune environment pressure of coding sequences of 27 HCV containing 18 geno/subtypes and found that a number of amino acid sites were under positive selection and ML could be employed for identifying the adaptive evolution of RNA virus on a large scale of genetic diversity. Brown et al. [22] cloned HCV E1E2 full-length nucleotide sequences generated from serum samples of 4 chronically infected patients and identified 11 amino acid sites undergoing patient-specific adaptive evolution. In this study, we detected 4 amino acids sites of E2 protein under positive selection. Two of these sites were proved in Brown's work. Note that a region including the N-terminal 27-31 amino acid sites in E2 is known to be the most variable and is called hypervariable region 1 (HVR1) [23, 24] . This region is surface-exposed [25] and has been proposed as a major target of the immune response probably because its hypervariable is correlated with immune evasion [26] [27] [28] . For all of the positively selected amino acid sites located in HVR1 of E2 protein and in some immune epitopes, adaptive evolution of HCV could be the consequence of the environment pressure directly driven by host immune response. Recent studies revealed more information about how HCV escaping from host immune system response, but more comprehensive and careful research should be done to make clear the role of immune evasion in HCV chronically infection and explain the mechanism of HCV evolution involving immunology and virology. In other words, the positively selected sites' location indicated the immunogenicity of these sites and they might be candidate vaccination targets against HCV. The composite vaccines containing these different amino acid residues at the positively selected sites located in immune epitopes would be effective to preventing proliferation of escape mutants [5] . No amino acid sites exhibited positive selection within E1 protein in this study. It was consistent with the report from Brown et al. [22] . The possible reason was that E1 was unlikely to be surface-exposed [29] and not a major target for the host antibody response. It was reported that E1 protein was a poor natural immunogen for humoral response [30] . In other words, E1 protein was not under strong selective pressure of adaptive evolution driven by immune response. Suzuki and Gojobori [5] identified 13 positively selected amino acid sites in the entire coding region of HCV subtype 1b by parsimony method. Four of these sites were located in E1 and three located in E2. It is different from the results obtained from this study. The possible reasons may be: (1) the strategies to reconstruct ancestral sequences are different. Adaptsite, the program employed in Suzuki and Gojobori's work, uses maximum parsimony method to perform reconstruction while Codeml uses a likelihood reconstruction. Thus, the reconstructed ancestral states may be different. In general, the two implementations produce similar results in dataset with high similarity among sequences. However, ML provides a more reliable result when used to analyze small-size dataset with relatively high diversity in sequence similarity [31, 32] . They focused on HCV subtype 1b [5, 33] , and deleted any sequences with gap by pairwise-alignment with reference sequence (HCV-JS) to obtain dataset with highly similar sequences. Our study used sequences containing 18 HCV geno/subtypes with a relatively large scale of genetic diversity; thus the ML method was adopted; (2) the methods to estimate branch length are different. Adaptsite uses a neighbor-joining algorithm to estimate branch length [34] while Codeml uses a codon model M0 to do it; (3) for codons that are neighbors of stop codons, Adaptsite and Codeml count sites differently, e.g. TAC and TAT, Adaptsite counts S = 1 and N = 2, while Codeml gives 0.429 and 2.571; (4) missing data are handled differently. Adaptsite requires dataset without gaps [34] while Codeml implemented in this work allows sequences data with some gaps; and (5) mutation rates in RNA viruses are several orders of magnitude higher than those in DNA based life-forms. By limiting genome size and content, RNA virus genome can avoid deleterious mutation accumulation. It brings virus genomes to be inclined to concerted evolution or parallel evolution and own similar codon substitution patterns [35] . Therefore, it is not difficult to understand why most of amino acid sites undergoing purifying selection of genetic constraints though HCV envelope proteins still possess high mutation rates. Our study focused on a larger scale of genetic diversity than previous work. This study might probably miss particular positively selected amino acid sites of individual subtype but produce more general sites under positive Darwinian selection to all HCV genotypes. The parsimony method of Suzuki and Gojobori [33] and the maximum likelihood method developed by Nielsen and Yang [8] are two widely used methods for detecting natural selection in homologous protein-coding sequences. However, they have their own pros and cons. The former may fail to infer positively selected sites when the branches of the phylogenetic tree are long because the maximum parsimony method is not fit for multiple substitutions. In contrast, multiple substitutions in the Nielsen and Yang method may be corrected by assuming the codon substitution model. Suzuki also attempted to employ ML to modify previous method [36] . Our study showed an application of ML method in detecting adaptive evolution in HCV envelope protein-coding sequences based on 18 geo/ subtypes. It provided an instance for similar work of high diversity homologous genes analysis. Indeed, using ML method to infer adaptive evolution has been an effective strategy for some emerging viruses such as SARS-CoV. In this way we had successfully explored the adaptive evolution of SARS-CoV spike protein [37] . Genetic diversity and evolution of hepatitis C virus--15 years on Functional hepatitis C virus envelope glycoproteins Conservation of the conformation and positive charges of hepatitis C virus E2 envelope glycoprotein hypervariable region 1 points to a role in cell attachment Positively selected amino acid sites in the entire coding region of hepatitis C virus subtype 1b HCVDB: Hepatitis C virus sequences database The Los Alamos HCV sequence database Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene PAML: A program for package for phylogenetic analysis by maximum likelihood Consensus proposals for a unified system of nomenclature of hepatitis C virus genotypes Chronic hepatitis C virus infection: Genotyping and its clinical role A package for performing multiple sequence alignment on a microcomputer The Clustal-X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools Multiple sequence alignment with Clustal X EMBOSS: The european molecular biology open software suite The neighbor-joining method: A new method for reconstructing phylogenetic trees MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences Codon-substitution models for heterogeneous selection pressure at amino acid sites The Los Alamos hepatitis C immunology database Statistical methods for detecting molecular adaptation Evolutionary dynamics of hepatitis C virus envelope genes during chronic infection Hypervariable regions in the putative glycoprotein of hepatitis C virus Variable and hypervariable domains are found in the regions of HCV corresponding to the flavivirus envelope and NS1 proteins and the pestivirus envelope glycoproteins A model for the hepatitis C virus envelope glycoprotein E2 Evidence for immune selection of hepatitis C virus (HCV) putative envelope glycoprotein variants: Potential role in chronic HCV infections Prevention of hepatitis C virus infection in chimpanzees by hyperimmune serum against the hypervariable region 1 of the envelope 2 protein Epitope mapping of antibodies directed against hypervariable region 1 in acute self-limiting and chronic infections due to hepatitis C virus Mutational evidence for an internal fusion peptide in flavivirus envelope protein E Induction of hepatitis C virus E1 envelope protein-specific immune response can be enhanced by mutation of N-glycosylation sites A new method of inference of ancestral nucleotide and amino acid sequences Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods A method for detecting positive selection at single amino acid sites ADAPTSITE: Detecting natural selection at single amino acid sites Error thresholds and the constraints to RNA virus evolution New methods for detecting positive selection at single amino acid sites Reconstruction of the most recent common ancestor sequences of SARS-CoV S gene and detection of adaptive evolution in the spike protein