key: cord-0961450-oununr9g
authors: He, Wei; Wang, Ningning; Tan, Jimin; Wang, Ruyi; Yang, Yichen; Li, Gairu; Guan, Haifei; Zheng, Yuna; Shi, Xinze; Ye, Rui; Su, Shuo; Zhou, Jiyong
title: Comprehensive codon usage analysis of porcine deltacoronavirus
date: 2019-09-16
journal: Mol Phylogenet Evol
DOI: 10.1016/j.ympev.2019.106618
sha: d39f04584293e05c47a5c4039a8c76b69c1f473a
doc_id: 961450
cord_uid: oununr9g

Porcine deltacoronavirus (PDCoV) is a newly identified coronavirus of pigs that was first reported in Hong Kong in 2012. Since then, many PDCoV isolates have been identified worldwide. In this study, we analyzed the codon usage pattern of the S gene using complete coding sequences and complete PDCoV genomes to gain a deeper understanding of their genetic relationships and evolutionary history. We found that during evolution three groups evolved with a relatively low codon usage bias (effective number of codons (ENC) of 52). The factors driving bias were complex. However, the primary element influencing the codon bias of PDCoVs was natural selection. Our results revealed that different natural environments may have a significant impact on the genetic characteristics of the strains. In the future, more epidemiological surveys are required to examine the factors that resulted in the emergence and outbreak of this virus.

Coronaviruses (CoVs) are the causative agents of major diseases in a variety of avian and mammalian species including humans. CoVs belong to the subfamily Orthocoronavirinae of the Coronaviridae, order Nidovirales. The Orthocoronavirinae subfamily is further divided into four genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and the recently identified Deltacoronavirus (Chan et al., 2013; King et al., 2018).
To date, six CoVs have been reported in pigs: transmissible gastroenteritis virus (TGEV), porcine respiratory coronavirus (PRCV), swine enteric alphacoronavirus (SeACoV), porcine epidemic diarrhea virus (PEDV), porcine hemagglutinating encephalomyelitis virus (PHEV), and porcine deltacoronavirus (PDCoV) (Pan et al., 2017; Homwong et al., 2016). PDCoV was first recorded as an emerging enteropathogenic coronavirus in pigs in Hong Kong in 2012 (Chan et al., 2013; Woo et al., 2012), and was thereafter isolated from a swine farm in Ohio, USA in 2014 (Wang et al., 2014a). Since then, PDCoV has been reported in many countries and regions, including the USA, Canada, South Korea, mainland China, Mexico, Japan, Thailand, Viet Nam, and Lao PDR (Lee and Lee, 2014; Suzuki et al., 2018; Saeng-Chuto et al., 2017; Wang et al., 2014b; Ajayi et al., 2018; Perez-Rivera et al., 2019). A previous study showed that global PDCoVs consist of the China lineage, the USA/Japan/South Korea lineage, and the Viet Nam/Laos/Thailand lineage.

PDCoV is an enveloped, positive-sense, single-stranded RNA virus with a genome size of approximately 25.4 kb. The genome includes a 5′UTR, ORF1a/1b, the spike (S), the envelope (E), the membrane (M), nonstructural protein 6 (NS6), the nucleocapsid (N), nonstructural protein 7 (NS7), and a 3′UTR (Lee and Lee, 2014).

The codon usage pattern is an important indicator of genome evolution. Except for methionine and tryptophan, each amino acid can be encoded by more than one codon owing to the redundancy of the genetic code. Codons encoding the same amino acid are known as synonymous codons. Interestingly, codon usage is not random: some codons are used more than others, a phenomenon referred to as codon usage bias (Marin et al., 1989). Codon usage bias has been reported for some RNA viruses. However, the degree of bias varies depending on the specific virus.
For instance, Rubella virus and Rotavirus show strong codon usage biases, whereas Equine infectious anemia virus (EIAV), Ebola virus (EBOV), the N gene of Rabies virus (RABV), and Porcine epidemic diarrhea virus (PEDV) show weak codon usage bias (Belalov and Lukashev, 2013; Yin et al., 2013; Cristina et al., 2015; He et al., 2017). Natural selection, mutation pressure, tRNA abundance, RNA structure, and gene length all contribute to codon usage bias (Jenkins and Holmes, 2003; Parmley and Hurst, 2007; Hershberg and Petrov, 2008; Plotkin and Kudla, 2011). Both the virus and the host can influence codon usage, which likely affects the survival, evolution, fitness, and evasion of host immune defenses by the virus (Li et al., 2018b; He et al., 2019). Indeed, synonymous triplets are not used randomly, and factors such as natural selection and mutational bias can cause synonymous codon usage to diverge (Sharp and Li, 1986). Investigating the codon usage patterns of viruses could provide insights into their molecular evolution and the regulation of viral gene expression, and could assist vaccine design, in which high levels of viral antigen expression are likely to be needed to produce immunity (Butt et al., 2014). Given the recent increase in PDCoV epidemics and the threat to pork production, in the present study we report an exhaustive genome-wide investigation of PDCoV codon usage and evaluate the possible influencing factors.

We retrieved all PDCoV sequences available up to April 2019 from the National Center for Biotechnology Information (NCBI) nucleotide database (http://www.ncbi.nlm.nih.gov). The detailed sequence information (serial number, strain name, accession number, location, and isolation year) for all 159 complete coding sequences of the S gene and 98 complete coding sequences of PDCoV (concatenated in the order ORF1ab-S-E-M-NS6-N-NS7) is given in the supplementary materials (Table S1).
Potential recombination signals were detected using RDP4 (Recombination Detection Program version 4) (Martin et al., 2015) with default settings. Seven methods were chosen for the analysis: RDP, GENECONV, Chimaera, MaxChi, BootScan, SiScan, and 3Seq. Four methods were applied first. Thereafter, the remaining sequences were run again with at least two methods until no recombination signal remained.

Phylogenetic trees were reconstructed in RAxML (v8.2.10).

To study the relationship between the variables and the samples, principal component analysis (PCA), a multivariate statistical method, was applied. PCA is a mathematical transformation that converts correlated variables (here, the relative synonymous codon usage (RSCU) values) into a smaller number of uncorrelated variables (the principal components). Each coding sequence was represented as a 59-dimensional vector, with each dimension corresponding to the RSCU value of one of the 59 synonymous codons, that is, excluding AUG, UGG, and the three stop codons. The parameters used for the PCA were calculated with the program Codon W (http://codonw.sourceforge.net/).

The compositional characteristics of the PDCoV coding sequences of the S gene and complete genomes were calculated. The frequencies of all nucleotides (GC%, AU%, A%, U%, G% and C%) were estimated using BioEdit (http://www.softpedia.com/get/Science-CAD/BioEdit.shtml). The A, C, G, and U frequencies in synonymous codons at different sites (GC1%, GC2%, GC3%, GC12%, A3%, U3%, G3%, C3%, AU3%) of each sequence were computed using CUSP (http://emboss.toulouse.inra.fr/cgi-bin/emboss/cusp) and Codon W (http://codonw.sourceforge.net/).

The relative dinucleotide abundances were computed according to a previously reported method (Karlin and Burge, 1995).
The odds ratio of the observed to the expected frequency of each of the 16 dinucleotides was computed using the equation below:

$$P_{xy} = \frac{f_{xy}}{f_x f_y}$$

where $f_x$ is the frequency of nucleotide X, $f_y$ is the frequency of nucleotide Y, $f_x f_y$ is the expected frequency of the dinucleotide XY, and $f_{xy}$ is the observed frequency of the dinucleotide XY. As a conventional criterion, for $P_{xy} < 0.78$ or $P_{xy} > 1.23$, the XY pair was considered under-represented or over-represented, respectively, compared with a random association of single nucleotides (Butt et al., 2016).

RSCU refers to the relative probability of a specific synonymous codon, which indicates whether the codon usage is influenced by the amino acid composition. Assuming that all synonymous codons of a particular amino acid are used equally, the RSCU value of a codon is the ratio of its observed frequency to its expected frequency. The RSCU is calculated as:

$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \, n_i$$

where $g_{ij}$ is the observed count of the ith codon for the jth amino acid, which has $n_i$ kinds of synonymous codons. RSCU values of 1.0, > 1.0, and < 1.0 represent no bias, positive codon usage bias, and negative codon usage bias, respectively. The RSCU was calculated using MEGA7 (https://www.megasoftware.net/).

The degree of codon usage bias, measured by the ENC, was estimated taking into account the number of amino acids and the gene length. ENC values vary between 20 and 61, with values closer to 20 indicating a high codon usage bias and values closer to 61 indicating a low codon usage bias. The ENC value reflects the preference for particular synonymous codons within a codon family. Highly expressed genes often show a high codon usage bias, whereas poorly expressed genes contain more rare codons and thus a lower codon usage bias.
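As an illustration of the two definitions given earlier in this section, the dinucleotide odds ratio (observed dinucleotide frequency divided by the product of the two single-nucleotide frequencies) and RSCU (observed codon count divided by the count expected under equal usage of the synonymous codons) can be sketched in Python. This is a minimal sketch, not the BioEdit, Codon W, or MEGA7 procedure used in the study. The function names `dinucleotide_odds` and `rscu` are hypothetical, and the sketch assumes an in-frame coding sequence of at least two nucleotides, the standard genetic code, and overlapping dinucleotide counting.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}


def dinucleotide_odds(seq):
    """Return {XY: f_xy / (f_x * f_y)} for dinucleotides whose bases occur in seq."""
    seq = seq.upper().replace("T", "U")
    base_freq = {b: n / len(seq) for b, n in Counter(seq).items()}
    # Assumption: dinucleotides are counted in overlapping windows.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_pairs = sum(pairs.values())
    odds = {}
    for x, y in product(BASES, repeat=2):
        if x in base_freq and y in base_freq:
            odds[x + y] = (pairs[x + y] / total_pairs) / (base_freq[x] * base_freq[y])
    return odds


def rscu(seq):
    """Return {codon: RSCU} for amino acids with >1 synonymous codon present in seq."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if aa == "*" or len(codons) == 1 or total == 0:
            continue  # skip stop codons, Met/Trp, and amino acids absent from seq
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values
```

For example, `rscu("CUUCUUCUC")` gives 4.0 for CUU and 2.0 for CUC, because leucine has six synonymous codons and the expected count of each under equal usage is 3/6 = 0.5.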
Generally, codon usage is considered to show strong bias when the ENC value is less than or equal to 45 (Comeron and Aguade, 1998). We used the following equation to calculate the ENC (Fuglsang, 2006):

$$\mathrm{ENC} = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$

where $\bar{F}_i$ (i = 2, 3, 4, 6) is the average value of $F_i$ for the i-fold degenerate amino acids. The following equation was used to calculate the $F_i$ values:

$$F_i = \frac{n \sum_{j=1}^{i} \left( \frac{n_j}{n} \right)^2 - 1}{n - 1}$$

where n is the total number of occurrences of the codons for that amino acid and $n_j$ is the total number of occurrences of the jth codon for that amino acid.

Figure caption: Relative dinucleotide frequencies among different groups of S gene and complete genomes of PDCoV strains.

ENC-plot analysis is commonly used to determine the factors influencing codon usage bias (e.g., mutation pressure). The ENC values were plotted against the GC3 values (the frequency of guanine or cytosine at the third codon position of synonymous codons, excluding Met, Trp and stop codons) (Karlin and Burge, 1995). When codon usage is constrained only by GC3 mutational bias, the expected ENC value falls on a theoretical curve that gives the expected ENC as a function of GC3. When the actual ENC values of the sequences fall below this standard curve, this suggests that natural selection plays a role in driving codon usage bias (Fuglsang, 2008). The theoretical ENC values in the ENC-plot analysis were calculated as follows:

$$\mathrm{ENC}_{\mathrm{expected}} = 2 + s + \frac{29}{s^2 + (1 - s)^2}$$

where s denotes the frequency of C or G at the third position of synonymous codons (i.e., GC3).

Neutrality analysis (neutral evolution analysis) was carried out to compare and define the effects of natural selection and mutation pressure on the PDCoV codon usage patterns by plotting the GC12s values of synonymous codons against the GC3s values (diagonal analysis).
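The ENC and the expected ENC curve described above can be sketched as follows. This is a minimal illustration of Wright's classic estimator (2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 with the codon homozygosity F), not the exact Fuglsang (2006) procedure or the Codon W implementation used in the study. The function names `enc` and `expected_enc`, the fallback for a missing three-fold class (isoleucine), and the cap at 61 are assumptions made for the sketch.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}


def enc(seq):
    """Effective number of codons for an in-frame coding sequence (Wright-style)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    f_by_class = {2: [], 3: [], 4: [], 6: []}
    for codons in families.values():
        k = len(codons)  # degeneracy class of this amino acid
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met/Trp, or too few observations to estimate F
        homozygosity = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * homozygosity - 1) / (n - 1)
        if f > 0:
            f_by_class[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items() if v}
    # Assumption: if Ile (the only 3-fold amino acid) is missing, average F2 and F4.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if set(mean_f) != {2, 3, 4, 6}:
        raise ValueError("sequence too short to estimate ENC")
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # estimates above 61 are capped at 61


def expected_enc(gc3):
    """Expected ENC when codon usage is shaped only by GC3 composition (s = GC3)."""
    return 2 + gc3 + 29 / (gc3 ** 2 + (1 - gc3) ** 2)
```

A point whose `enc` value lies clearly below `expected_enc` at the same GC3 would, under the interpretation given above, suggest a contribution of natural selection; for instance, `expected_enc(0.5)` is 60.5.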
In the neutrality plot, the regression coefficient is regarded as the mutation-selection equilibrium coefficient, and the slope of the regression line represents the evolutionary rates caused by natural selection pressure and mutation pressure. If all points are distributed along the diagonal and there is no significant difference among the three codon positions, this indicates that there is only weak or no external selection pressure. However, if the regression line is parallel or tilted toward the horizontal axis, this indicates that the correlation between the changes in GC12 and GC3 is very low. Thus, the regression line reflects how the effect of natural selection balances the degree of neutrality (Kumar et al., 2016).

PR2 analysis was used to investigate the effect of selection and mutation pressure on codon usage. A PR2 plot charts genes with the AU deviation [A3/(A3 + U3)] as the ordinate and the GC deviation [G3/(G3 + C3)] as the abscissa. At the center of the plot, both coordinates equal 0.5, which means that G = C and A = U (PR2), and there is no bias between the effects of mutation and selection (substitution rates) (Sueoka, 1996).

After removal of recombinant sequences, 132 S gene sequences and 64 complete genomes were left for further analysis. Phylogenetic analysis of the S gene based on ML (Fig. 1A) and BI (Fig. 1B) trees revealed three individual PDCoV groups: the China, USA-Japan-Korea, and Thailand-Early China-Vietnam groups. We then used these three groups to investigate codon usage and associations. PCA showed that the three groups clustered separately, especially the USA-Japan-Korea group, although several overlaps existed between the USA-Japan-Korea and Thailand-Early China-Vietnam groups (Fig. 2). For whole genomes, the three groups also clustered separately, except for several overlaps between the USA-Japan-Korea and the Thailand-Early China-Vietnam groups.
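The PR2 coordinates defined in the methods above, G3/(G3 + C3) on the abscissa and A3/(A3 + U3) on the ordinate, can be computed with a few lines of Python. This is a minimal sketch; the function name `pr2_point` is hypothetical, and the sketch assumes an in-frame coding sequence and counts the third position of every codon, whereas a stricter PR2 analysis may restrict the counts to particular codon families.

```python
from collections import Counter


def pr2_point(seq):
    """Return (G3/(G3+C3), A3/(A3+U3)) from the third positions of an in-frame CDS."""
    seq = seq.upper().replace("T", "U")
    # Assumption: every codon contributes its third base to the counts.
    third = Counter(seq[i + 2] for i in range(0, len(seq) - len(seq) % 3, 3))
    gc_bias = third["G"] / (third["G"] + third["C"])  # abscissa
    au_bias = third["A"] / (third["A"] + third["U"])  # ordinate
    return gc_bias, au_bias
```

A sequence with balanced third positions, such as `"GCAGCUGCGGCC"`, maps to the center (0.5, 0.5); points away from the center indicate that A ≠ U or G ≠ C at the third codon position.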
The nucleotide U was the most abundant in the S gene, followed by A, C, and G, regardless of the individual phylogenetic group (Table 1). Detailed nucleotide composition is shown in Table S2. The nucleotide composition at the third position of synonymous codons (A3, C3, G3, U3) showed that the frequencies of U3 and A3 were higher than those of C3 and G3. The AU and GC percentages indicated that the coding sequences of PDCoV are AU-rich. Analysis of the first, second, and third codon positions showed that the GC1 values were the highest, followed by GC2 and GC3 (Table S2). The same pattern was identified for whole genomes. Overall, these results illustrated that a relatively large part of the PDCoV coding sequence comprises A and U nucleotides.

All of the PDCoV 18 optimal synonymous codons for the corresponding amino acids of the S gene ended with U (Perez-Rivera et al., 2019) (Table 2). A total of 7 of the 18 preferred codons had RSCU values greater than 1.6 (CUU (L), GUU (V), UCU (S), CCU (P), ACU (T), AGA (R), and GGC (G)). The remaining preferred codons had RSCU values below 1.6, and none of the preferred codons was underrepresented. For whole genomes, U-ended codons were also the preferred codons among the 18 most abundant synonymous codons (Table 2). The RSCU analyses and the nucleotide composition revealed that compositional constraints (the nucleotide U in this case) had the most influence on the selection of the preferred codons.

The relative abundances of the 16 dinucleotides of PDCoV coding sequences were calculated. We found that dinucleotides were not used randomly. None of the dinucleotide relative abundance values corresponded to the theoretical frequency (i.e., 1.0) (Fig. 3, Table 3). Furthermore, in the S gene, CpA (1.29 ± 0.0016) and UpG (1.32 ± 0.008) showed different degrees (marginal or peripheral) of overrepresentation.
Only CpG (0.514 ± 0.011) was underrepresented. For whole genomes, the overrepresented and underrepresented dinucleotides were UpG (1.34 ± 0.002) and CpG (0.59 ± 0.003), respectively.

ENC values were estimated to evaluate the extent of codon usage deviation within the coding sequences of different PDCoV isolates. This analysis showed that PDCoV coding sequences, whether the S coding sequences or whole genomes, were relatively conserved and stable, with a low codon usage bias. The ENC values of the S coding sequences ranged from 52.71 to 52.97, with an average of 52.853 (ENC > 40) (Table 1). The ENC values of complete genome coding sequences were also within the range of the S gene, with no obvious difference among phylogenetic groups.

3.7. Influence of mutation pressure on the PDCoV codon usage pattern

ENC-plot analysis was carried out to reveal the constraint of mutation pressure on the PDCoV codon usage pattern. The values of GC3 were plotted against the ENC values for each phylogenetic group. We found that all points, regardless of group, were concentrated on the left side and close to the expected curve for the S gene (Fig. 4A). For whole genome coding sequences, all the points were also below but close to the standard curve (Fig. 4B).

Neutrality (diagonal) analysis between the GC3s and GC12s values was then used to judge the effects of natural selection and mutation pressure (Fig. 5). In the S gene, the relationships between GC3s and GC12s were calculated for the three phylogenetic groups. The correlation coefficients for the USA-Japan-Korea group, the China group, and the Thailand-Early China-Vietnam group were 0.2017 ± 0.3707, 0.143 ± 0.3942, and 0.1142 ± 0.4873, respectively. Thus, the percentages of constraint attributable to natural selection were 79.83%, 85.7%, and 88.58% for the S gene (Fig. 5A).
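The percentages quoted for the S gene follow from the reported coefficients as 100 × (1 − coefficient), which can be checked with a short sketch. The least-squares slope function is a generic illustration of regressing GC12s on GC3s and is not the authors' script; the function names `neutrality_slope` and `selection_constraint_percent` are hypothetical.

```python
def neutrality_slope(gc3, gc12):
    """Ordinary least-squares slope of GC12 (y) regressed on GC3 (x)."""
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    return sxy / sxx


def selection_constraint_percent(coefficient):
    """Share attributed to natural selection: 100 * (1 - coefficient)."""
    return round(100 * (1 - coefficient), 2)


# Coefficients reported above for the S gene, by phylogenetic group.
reported = {
    "USA-Japan-Korea": 0.2017,
    "China": 0.143,
    "Thailand-Early China-Vietnam": 0.1142,
}
constraints = {g: selection_constraint_percent(c) for g, c in reported.items()}
```

Here `constraints` reproduces the 79.83%, 85.7%, and 88.58% values, with the remainder (for example 20.17% for the USA-Japan-Korea group) read as the relative neutrality attributed to mutation pressure.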
For whole genomes, GC12s and GC3s significantly correlated, with a correlation coefficient of 0.1897 ± 0.387 for the USA-Japan-Korea group, indicating 81.03% constraint from natural selection, or 18.97% relative neutrality of GC3 (where a coefficient of 1 would mean 100% neutrality or 0% constraint) (Fig. 5B). Considering the limited number of sequences in the China and Thailand-Early China-Vietnam groups, these groups were excluded from the whole-genome results. Overall, the above results indicate that mutation pressure acts at all codon positions, but natural selection plays the major role in driving the codon usage bias of PDCoV. In addition, PR2 analysis was carried out (Fig. 6). We found that A ≠ U and C ≠ G for both the S gene and whole genomes, which indicates that mutation pressure and natural selection contributed unequally to shaping the codon usage of PDCoV.

PDCoV is an emerging coronavirus that infects the entire small intestine, especially the jejunum and ileum, causing severe enteritis, diarrhea, and vomiting in piglets. PDCoV was first discovered in Hong Kong, China in 2012 (Woo et al., 2012). At the beginning of 2014, PDCoV was first reported in the USA, and by December 2014 at least 17 US states had confirmed its presence. In recent years, China, South Korea, Thailand, and other Asian countries have suffered from recurrent outbreaks (Lorsirigool et al., 2016; Janetanakit et al., 2016; Dong et al., 2015; Lee et al., 2016). Phylogenetic analysis is widely used to study the evolution of viruses (He et al., 2018; Li et al., 2018a; Su et al., 2016, 2017). Here, we first analyzed the codon usage patterns of the S coding sequences, as well as the whole genome coding sequences, of PDCoVs isolated from around the world to determine the factors driving codon usage, and provided a comprehensive understanding of the characteristics and evolution of PDCoV whole coding genes.
Phylogenetic analysis of the S gene revealed that the sequences clustered into three different groups, similar to a previous study, but with more accuracy because more methods were applied and recombinant sequences were excluded. Additionally, PCA also indicated three potential evolutionary groups. Based on the S coding gene and complete coding genomes, we found a significant preference for A and U nucleotides over G and C. The contents of AU and GC were not equal, and nucleotide usage was inclined towards AU. If the use of a synonymous codon were affected only by mutation pressure, the frequency of U and A nucleotides in the third codon position should be equal to the frequency of G and C (van Hemert et al., 2016). Thus, we can conclude that there was a low bias in the usage of nucleotides in all PDCoV strains. RSCU analysis revealed that PDCoV genomes have a tendency towards U-ending codons. In addition, the relative abundance distribution of the 16 dinucleotides showed that codons and dinucleotides were used unequally and followed certain rules. Dinucleotide abundance influences the codon usage bias in certain organisms, including RNA and DNA viruses (Rothberg and Wimmer, 1981). Dinucleotide biases may derive in part from amino acid changes or from codon usage bias; therefore, we analyzed the dinucleotide composition distribution (Plotkin et al., 2004; Cristina et al., 2015). The translational selection pressure on a dinucleotide is the entropy cost of a given set of constraints that alter the number of dinucleotide occurrences, in this case the amino acid sequence of the given protein and the cost of the codon usage bias (Cristina et al., 2015). Analyses of the frequencies of codons and dinucleotides revealed that translational selection also played a part in the codon usage of PDCoVs. These initial observations prompted further investigation to assess the extent of codon usage bias using ENC analyses.
For PDCoV, the ENC value based on the S gene or complete coding genomes was 52, indicative of slight bias and showing that different PDCoVs are relatively conserved and stable. Previous studies indicated that ENC values correlate negatively with gene expression. Thus, a higher ENC value indicates lower gene expression and lower codon preference. A low codon bias could be explained by the need to adapt towards efficient replication and survival in the host, and to reduce the energy required for virus biosynthesis while avoiding competition with host protein synthesis. When the ENC and GC3 values of PDCoVs were plotted, mutation pressure was revealed as a moderate factor influencing the PDCoV codon usage pattern. According to previous reports, both natural selection and mutation pressure can affect the ENC value, which indicates that estimates of the relative contributions of selection and mutation to the codon usage pattern are not robust (Gu et al., 2004). It is worth mentioning that the codon usage bias of species with A/U-biased genomes is different from that of genomes with a G/C bias. Therefore, a simple ENC-GC3 plot analysis might be misleading. Generally, mutation pressure will always have a role in driving the codon usage of viruses. Here, using neutrality plots, we found that natural selection was a more dominant factor than mutation pressure (Shi et al., 2013). Natural selection can lead to weak codon usage bias while the virus is adapting to the host cells (Matsumoto et al., 2016). PR2 bias plot analysis showed that both natural selection and mutation pressure contributed to the observed codon bias, consistent with the neutrality analysis. In summary, we found that the codon usage of the S gene was similar to that of the complete coding genome. To open new perspectives, further exploration of the function and features of functional genes is warranted.
Here, we found that, to a large extent, the codon usage pattern and the sequence characteristics of PDCoVs were restricted by evolutionary processes. Briefly, PDCoV has a low codon usage bias, which was affected by natural selection, mutation pressure, and dinucleotide abundance. The primary element affecting the PDCoV codon usage pattern was natural selection. Additionally, the results of PCA and phylogenetic analysis were highly consistent, suggesting that codon usage pattern studies can reveal the evolutionary clustering relationships between strains based on their genetic composition. This study suggests that monitoring the updated sequences of this novel, emerging virus would provide clues to better understand viral evolution and the disease.

Herd-level prevalence and incidence of porcine epidemic diarrhoea virus (PEDV) and porcine deltacoronavirus (PDCoV) in swine herds in Ontario
Causes and implications of codon usage bias in RNA viruses
Genome-wide analysis of codon usage and influencing factors in chikungunya viruses
Evolution of codon usage in Zika virus genomes is host and vector specific
Interspecies transmission and emergence of novel viruses: lessons from bats and birds
Extensive homologous recombination in classical swine fever virus: A re-evaluation of homologous recombination events in the strain AF407339
Characterization of the porcine epidemic diarrhea virus codon usage bias
An evaluation of measures of synonymous codon usage bias
Genome-wide analysis of codon usage bias in Ebolavirus
Porcine Deltacoronavirus in Mainland China
Estimating the "Effective number of codons": The Wright way of determining codon homozygosity leads to superior estimates
Impact of bias discrepancy and amino acid usage on estimates of the effective number of codons used in a gene, and a test for selection on codon usage
Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales
Codon usage bias in the N gene of rabies virus
Interspecies transmission, genetic diversity, and evolutionary dynamics of pseudorabies virus
Genetic analysis and evolutionary changes of Porcine circovirus 2
Selection on Codon Bias
Characterization and evolution of porcine deltacoronavirus in the United States
The extent of codon usage bias in human RNA viruses and its evolutionary origin
Dinucleotide relative abundance extremes: a genomic signature
Changes to taxonomy and the International Code of Virus Classification and Nomenclature ratified by the International Committee on Taxonomy of Viruses
Revelation of influencing factors in overall codon usage bias of equine influenza viruses
Detection and Phylogenetic Analysis of Porcine Deltacoronavirus in Korean Swine Farms
Complete genome characterization of Korean porcine deltacoronavirus strain KOR/KNU14-04
Origin, genetic diversity, and evolutionary dynamics of novel porcine circovirus 3
Insights into the genetic and host adaptability of emerging porcine circovirus 3
Genetic analysis and evolutionary changes of the torque teno sus virus
The first detection and full-length genome sequence of porcine deltacoronavirus isolated in Lao PDR
Variation in G + C-content and codon choice: differences among synonymous codon groups in vertebrate genes
RDP4: detection and analysis of recombination patterns in virus genomes
Codon usage selection can bias estimation of the fraction of adaptive amino acid fixations
Discovery of a novel swine enteric alphacoronavirus (SeACoV) in southern China
How do synonymous mutations affect fitness?
First report and phylogenetic analysis of porcine deltacoronavirus in Mexico
Synonymous but not the same: the causes and consequences of codon bias
Tissue-specific codon usage and the expression of human genes
MrBayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space
Mononucleotide and dinucleotide frequencies, and codon usage in poliovirion RNA
Different lineage of porcine deltacoronavirus in Thailand, Vietnam and Lao PDR in 2015
An evolutionary perspective on synonymous codon usage in unicellular organisms
Selective pressure dominates the synonymous codon usage in parvoviridae
RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
Epidemiology, genetic recombination, and pathogenesis of coronaviruses
Epidemiology, evolution, and pathogenesis of H7N9 influenza viruses in five epidemic waves since 2013 in China
Intrastrand parity rules of DNA base composition and usage biases of synonymous codons
Genetic characterization and pathogenicity of Japanese porcine deltacoronavirus
Nucleotide composition of the Zika virus RNA genome and its codon usage
Impact of the biased nucleotide composition of viral RNA genomes on RNA structure and codon usage
Detection and genetic characterization of deltacoronavirus in pigs
Porcine coronavirus HKU15 detected in 9 US States
Discovery of seven novel Mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronaviruses as the gene source of gammacoronavirus and deltacoronavirus
Comprehensive analysis of the overall codon usage patterns in equine infectious anemia virus
Detection and spike gene characterization in porcine deltacoronavirus in China during

The authors declare no competing financial interest.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ympev.2019.106618.