key: cord-0790344-7lqjxxw6
authors: Farsani, Seyed Mohammad Jazaeri; Dijkman, Ronald; Jebbink, Maarten F.; Goossens, Herman; Ieven, Margareta; Deijs, Martin; Molenkamp, Richard; van der Hoek, Lia
title: The first complete genome sequences of clinical isolates of human coronavirus 229E
date: 2012-08-25
journal: Virus Genes
DOI: 10.1007/s11262-012-0807-9
sha: fccba984e421b3df0ba8cae61b62fa6ac4700efa
doc_id: 790344
cord_uid: 7lqjxxw6

Human coronavirus 229E was identified in the mid-1960s, yet only one full-genome sequence is available. This full-length sequence was determined from the cDNA clone Inf-1, which is based on the lab-adapted strain VR-740. Lab adaptation might have resulted in genomic changes, owing to insufficient pressure to maintain the integrity of non-essential genes. We present here the first full-length genome sequences of two clinical isolates. Each encoded gene was compared to Inf-1. In general, few sequence changes were noted, and most could be attributed to genetic drift, since the clinical isolates originate from 2009 to 2010 and VR-740 from 1962. Hot spots of substitutions were situated in the S1 region of the spike gene, the nucleocapsid gene, and the non-structural protein 3 gene, whereas several deletions were detected in the 3′UTR. Most notable was the difference in genome organization: instead of an ORF4A and ORF4B, an intact ORF4 was present in the clinical isolates.

Coronaviruses are a large group of viruses that infect many animal species, including mammals and birds. Coronaviruses are enveloped, plus-strand RNA viruses and belong to the Coronaviridae family. The genomes are linear, nonsegmented, and single-stranded. At 27-31.5 kb in length, coronavirus genomes are the largest of the known RNA viruses. The genomes are polycistronic, generating a nested set of subgenomic RNAs with common 5′ and 3′ sequences [1]. Based on serologic and phylogenetic relationships, coronaviruses are classified into three genera. Alphacoronavirus and Betacoronavirus consist of various mammalian coronaviruses, whereas Gammacoronavirus includes bird viruses [2]. The genus Alphacoronavirus includes transmissible gastroenteritis virus (also referred to as alphacoronavirus I; ICTV 2009), porcine epidemic diarrhea virus, some bat coronaviruses, and the human coronaviruses (HCoVs) NL63 and 229E. In general, HCoV-229E causes the common cold, but occasionally it can be associated with more severe respiratory infections in children, the elderly, and persons with underlying illness [3] [4] [5].

So far only one full genome has been determined for HCoV-229E [6]. This reference sequence was obtained from the infectious HCoV-229E cDNA clone (Inf-1), which is based on the 1973-deposited laboratory-adapted prototype strain of HCoV-229E (VR-740). The prototype strain was originally isolated in 1962 from a medical student with an upper respiratory infection at the University of Chicago [7, 8]. Only limited sequence data have been obtained from current HCoV-229E isolates. Chibo and Birch [5] investigated the evolution of HCoV-229E by sequencing parts of the S and N genes from clinical samples collected between 1979 and 2004. Sequence data from other genomic regions are still lacking, as no full-genome sequence of a non-lab-adapted virus is available.

We propagated several contemporary strains of HCoV-229E on pseudostratified human airway epithelial cell cultures. One of these, clinical strain HCoV-229E 0349 (21050349), was isolated from a respiratory swab collected in the Netherlands in 2010 from a stem cell transplantation recipient who presented in hospital with fever and respiratory infection. The virus was propagated on human airway epithelial cells, as described previously [9]. The apical supernatant was harvested 72 h post-infection by apical washing. A second clinical isolate, J0304, was not cultured; it was obtained from an adult in Italy with symptoms of lower respiratory tract infection and was collected via the GRACE European Network of Excellence [10]. Ethics review committees in each country approved the study, and written informed consent was provided by all study participants.

Full-genome sequencing of the HCoV-229E clinical isolates. Total RNA was extracted from the apical washing of 0349 and from the swab of J0304 (Copan Diagnostics) as described [11]. Reverse transcription was performed at 37°C for 1 h using random hexamers and SuperScript II (Invitrogen). The HCoV-229E Inf-1 reference sequence (accession number NC_002645.1) was used as a scaffold for designing bidirectional PCR primer combinations, amplifying fragments with an average length of 500 bp and a minimum overlap of 80 bp with adjacent primer combinations. Primer sequences are available upon request. Amplification of the fragments was performed with the following thermal cycling profile: 5 min at 95°C; 45 cycles of 95°C for 1 min, 55°C for 1 min, and 72°C for 2 min; followed by a final elongation step of 7 min at 72°C. PCR fragments were visualized by agarose gel electrophoresis with ethidium bromide staining. Positive PCR fragments were sequenced directly in both directions with their forward and reverse primers. Sequencing reactions were performed according to the BigDye Terminator v1.1 protocol (ABI life science). Sequences were analyzed with CodonCode Aligner software (version 3.7.1). Sequences have been submitted to GenBank (JX503060, JX503061).

5′ and 3′ RACE. In order to complete the full-genome sequence with the 5′ and 3′ termini, 5′ and 3′ RACE was performed. The 5′ end was determined with the 5′ RACE kit (Invitrogen) according to the manufacturer's protocol. Gene-specific primers for 5′ RACE PCR amplification were designed to flank approximately 100 nt of the 5′ region. The 3′ end of the HCoV-229E clinical strains was determined with 3′ RACE, with an RT reaction performed using the Oligo-dT-JZH primer and PCR amplification with the JZH primer and a gene-specific primer [9]. The PCR products were excised after agarose gel electrophoresis and purified with the Nucleospin Extract II kit (Macherey-Nagel) according to the manufacturer's protocol. Purified PCR products were cloned into the pCRII-TOPO TA vector (Invitrogen) and transformed into chemically competent E. coli (TOP10 cells, Invitrogen) according to the manufacturer's protocol. Transformants were directly analyzed via colony PCR with T7 and M13Rev primers. PCR products were sequenced as described above.
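
The amplicon layout described above (fragments of about 500 bp, each overlapping its neighbor by at least 80 bp) can be illustrated with a small sketch. This is not the authors' primer-design procedure, which is not described; the function name `tile_amplicons`, the fixed amplicon length, and the rounded genome length of 27,240 nt are assumptions made for illustration only.

```python
def tile_amplicons(genome_length, amplicon_length=500, min_overlap=80):
    """Return 0-based, half-open (start, end) windows that tile a genome.

    Hypothetical helper: every window has the same nominal length and
    consecutive windows overlap by exactly `min_overlap` nucleotides.
    Real primer design would also adjust positions to find suitable primers.
    """
    if min_overlap >= amplicon_length:
        raise ValueError("overlap must be smaller than the amplicon length")
    step = amplicon_length - min_overlap
    windows = []
    start = 0
    while True:
        end = min(start + amplicon_length, genome_length)
        windows.append((start, end))
        if end == genome_length:
            break
        start += step
    return windows


# Assumed genome length of roughly 27,240 nt, as reported for the isolates.
windows = tile_amplicons(27240)
print(len(windows), "amplicons; first:", windows[0], "last:", windows[-1])
```

With these assumed values the scaffold would be covered by a few dozen overlapping fragments, which is why each fragment can be sequenced directly with its own forward and reverse primers.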

Full-genome sequence analysis. The ZCURVE_CoV1.0 program was used to recognize and predict putative protein-coding genes [12, 13]. Phylogenetic analyses (neighbor-joining method) were conducted using MEGA, version 4.02. The identity between the HCoV-229E clinical isolates and the reference sequence of 229E (Inf-1, NC_002645.1), for the full genome and per gene, was investigated by pairwise alignment using BioEdit Sequence Aligner. Simplot (version 3.5.1) was used to draw similarity/distance plots. N- and O-linked glycosylation sites and signal peptide cleavage sites were predicted using the NetNGlyc 1.0, NetOGlyc 3.1, and SignalP 4.0 analysis tools from the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/).
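
As an illustration of what a pairwise identity value expresses, the following minimal sketch computes percent identity from two sequences that have already been aligned. It is not BioEdit's implementation; the function name `percent_identity` and the choice to count a gap opposite a base as a mismatch are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption: columns where both sequences have a gap are ignored,
    and a gap opposite a base counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # uninformative column
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy alignment, not real HCoV-229E sequence.
print(round(percent_identity("ACGT-A", "ACGTTA"), 2))
```

Other tools may treat gaps differently, so identity values are only comparable when the same convention is used throughout.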

The full genomes of HCoV-229E strains 0349 and J0304 consist of approximately 27,240 nt. The GC content is 38.07 % in 0349 and 38.13 % in J0304; this percentage is 38.26 for the reference sequence of the laboratory-adapted virus (Inf-1, NC_002645.1). Few sequence differences were noted between the two clinical isolates: only 135 nucleotide differences, one codon insertion/deletion in the S gene, and a 2 nt insertion/deletion in the 3′ UTR. According to the ZCURVE_CoV1.0 program, the genome organization of 0349 and J0304 is similar to that of the reference sequence, with one major difference. The clinical strains of HCoV-229E have seven putative protein-coding genes, while there are eight in the reference sequence. This difference arises from the fact that the clinical strains have an intact ORF4, whereas the reference has ORF4A and ORF4B. Table 1 shows the nucleotide and amino acid similarities among the ORFs of HCoV-229E strains 0349 and J0304 and the reference sequence. The largest distances (>2 % at nt level) are observed in the non-structural protein 3 (NSP3) gene, the spike gene, the nucleocapsid gene, and the 3′ UTR.
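
The GC percentages quoted above follow from a simple base count. A minimal sketch, assuming that ambiguous symbols such as N are excluded from the denominator (the study does not state how such positions were handled); the function name `gc_content` is illustrative:

```python
def gc_content(sequence):
    """GC content in percent, counting only unambiguous bases (A, C, G, T/U)."""
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Toy sequence, not real HCoV-229E data.
print(gc_content("GGCCAATT"))
```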

Full-genome alignment of 0349 and J0304 with the reference sequence reveals 168 and 175 substitutions, respectively, in the 1a replicase gene, of which 66 are non-synonymous in both. In the 1b replicase gene, there are 84 substitutions in 0349 and 81 in J0304, resulting in 20 and 19 amino acid changes, respectively. Furthermore, one deletion and one insertion were observed. The NSP3 gene encodes the largest non-structural protein, which comprises 1,594 amino acid residues. This multifunctional protein acts as a papain-like protease (PLpro) and also has catalytic activity [14]. In 0349 and J0304, we observed 74 nt substitutions (34 non-synonymous in 0349 and 33 in J0304), a 21 nt insertion at position 286, and a 6 nt deletion at positions 316-321. The NSP4 gene is situated between the genes encoding the two autoproteolytic proteins NSP3 and NSP5. There are 21 nt substitutions in 0349 that cause five amino acid changes. In J0304, there are 19 substitutions, 4 of which are non-synonymous. NSP5 has a proteolytic role with cysteine protease activity. It is very important in viral replication and is, therefore, often referred to as the main protease (Mpro) [14, 15]. The Mpro-mediated processing pathways are well conserved in all coronaviruses, and Mpro cleaves as many as 11 pp1a/pp1b sites to produce a total of 13 mature proteins [16]. Although there are 7 changes at the nucleotide level in 0349, there is only one amino acid difference (E222D). The same holds for J0304, with 10 substitutions that cause only two changes at the amino acid level (F12L and E222D). The NSP6 gene encodes a membrane-spanning protein [14]. Eight nucleotide substitutions are found in 0349 and 7 in J0304, of which one is non-synonymous (V86I). The HCoV-229E genome encodes several small non-structural proteins, such as NSP7 to NSP10, that have RNA-binding activity and are believed to be involved in viral RNA synthesis [14].
There is one substitution in the NSP7 gene in 0349 and 3 in J0304, none of which affects the encoded protein. In NSP8, there are 6 nt substitutions, all synonymous except one (I179T). There are three nucleotide substitutions in NSP9 of 0349 and five in J0304, one of them non-synonymous (T23I). Within the NSP10 gene there are 6 nt substitutions in 0349, including two that change the amino acid sequence (S18A and C57S). Although there are 8 nt substitutions in J0304, only one affects the amino acid sequence (C57S). NSP12, the RNA-dependent RNA polymerase, consists of 927 amino acids. There are 35 nt substitutions in 0349 and 34 in J0304, resulting in 12 and 11 amino acid changes, respectively, in comparison with the reference sequence (N4S, A32V, K131R, E134G, S147N, S233A, M458I, S524L, S757G, G767E, I842V, H906Q). Moreover, there is a difference in the cleavage site between NSP10 and NSP12. In the reference, the cleavage site is TAIQ/SFDN, whereas it is TAIQ/SFDS in both clinical strains.

NSP13 encodes the viral helicase. We noticed 19 nt substitutions in 0349 and 20 in J0304, only three of them causing changes at the amino acid level (N58T, N352T, and I591V). In J0304, there is one more non-synonymous substitution (I475T). The same pattern is observed in NSP14 of 0349, the gene encoding the exoribonuclease: 17 nt substitutions and only three changes in the amino acid sequence (T254N, D372E, and E514D). In J0304, there are 13 substitutions that cause two amino acid changes (D372E and E514D). In NSP15, there are seven substitutions, of which two are non-synonymous (P187H and V307L), and in NSP16 there are 7 nt substitutions in 0349 and 8 in J0304, of which one is non-synonymous (E270D).
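
The per-gene tallies above distinguish nucleotide substitutions from the amino acid changes they cause, written in the form E222D (reference residue, position, new residue). A minimal sketch of that bookkeeping, assuming two gap-free, in-frame coding sequences of equal length and the standard genetic code; the function name `compare_cds` and the toy sequences are illustrative and not taken from the study:

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def compare_cds(ref_cds, alt_cds):
    """Count nucleotide substitutions and list amino acid changes.

    Assumes both sequences are in frame, gap-free, of equal length,
    and contain only unambiguous bases (ambiguous codons raise KeyError).
    """
    ref = ref_cds.upper().replace("U", "T")
    alt = alt_cds.upper().replace("U", "T")
    if len(ref) != len(alt) or len(ref) % 3 != 0:
        raise ValueError("sequences must be equal-length multiples of 3")
    nt_substitutions = sum(1 for r, a in zip(ref, alt) if r != a)
    aa_changes = []
    for i in range(0, len(ref), 3):
        ref_aa = CODON_TABLE[ref[i:i + 3]]
        alt_aa = CODON_TABLE[alt[i:i + 3]]
        if ref_aa != alt_aa:
            aa_changes.append(f"{ref_aa}{i // 3 + 1}{alt_aa}")
    return nt_substitutions, aa_changes


# Toy example: two nucleotide substitutions, one of them non-synonymous.
print(compare_cds("ATGGAACTG", "ATGGACCTA"))
```

Substitutions that leave the codon's amino acid unchanged are synonymous; only the others appear in the returned list. Insertions and deletions would need a proper alignment first and are outside this sketch.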

The S gene encodes two spike protein domains, S1 and S2 (codons 1-560 and 561-1,173, respectively). As in the reference strain, no furin cleavage site is present between the S1 and S2 domains. Alignment of the S genes identified two deletions in both clinical strains and 117 nt substitutions, of which 56 are non-synonymous in 0349 and 57 in J0304.

Most amino acid changes (including 48 amino acid substitutions in 0349 and 49 in J0304) and the deletions are within the S1 domain, especially at positions 223-229, 307-324, 349-358, and 401-411. The deletions in 0349 result in two codon deletions in S1 at amino acid positions 228 and 354. The deletions in J0304 result in three codon deletions in S1 at amino acid positions 228, 354, and 355. Forty-four of the non-synonymous changes and the two codon deletions have been described before by Chibo and Birch [5], suggesting that they are shaped by evolution and the positive selection pressure on S1. Amino acid changes in S1 allow escape from virus-neutralizing antibodies [5].

The receptor of HCoV-229E is human aminopeptidase N, which is recognized by the S1 region. Studies show that the region between amino acids 417 and 517 is important for binding to the receptor [17]. In this region, our clinical isolates have four synonymous and five non-synonymous substitutions. One study indicated that the region between amino acids 278 and 329 is also important for binding aminopeptidase N [18]. In this region, the clinical isolates have 10 non-synonymous substitutions. All of the 15 amino acid changes that might affect receptor binding have been described previously [5]. Despite the amino acid changes in the receptor-binding regions, we have no indications that the clinical isolate 0349 differs in cell tropism from the Inf-1 isolate. Both strains have identical cell tropism in pseudostratified respiratory epithelium cultures (R. Dijkman et al., manuscript in preparation).

The putative S proteins of 0349 and J0304 contain, respectively, 27 and 26 potential N-glycosylation sites upon analysis with the NetNGlyc 1.0 analysis tool from the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/), while the reference sequence contains 24 potential N-glycosylation sites. These 24 are conserved, and there are three extra sites at positions 20, 111, and 488 in isolate 0349 and two extra sites at positions 111 and 488 in J0304. The predicted signal peptide of S in the reference sequence is present at amino acids 1-16, using the SignalP 4.0 tool from the Center for Biological Sequence Analysis. The predicted signal peptide of both clinical isolates is also located at amino acids 1-16, with a potential cleavage site between residues 16 and 17.
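
Potential N-glycosylation sites are conventionally sought at the sequon N-X-S/T, where X is any residue except proline. The sketch below only lists candidate sequons with a regular expression; it is a simplification and does not reproduce the scoring that NetNGlyc applies to each candidate. The function name `find_sequons` and the toy peptide are assumptions for illustration.

```python
import re


def find_sequons(protein):
    """Return 1-based positions of Asn residues in N-X-[S/T] sequons (X != P).

    A lookahead is used so that overlapping sequons are all reported.
    This is a plain motif scan, not a glycosylation predictor.
    """
    pattern = re.compile(r"N(?=[^P][ST])")
    return [match.start() + 1 for match in pattern.finditer(protein.upper())]


# Toy peptide, not a real spike sequence: N2 and N9 sit in sequons,
# N6 does not because it is followed by proline.
print(find_sequons("MNGSANPTNKT"))
```

Comparing the position lists obtained for two protein sequences would show which candidate sites are shared and which are extra, in the spirit of the comparison with the reference sequence above.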

The S2 part contains the heptad repeats (HR1 at codons 777-916; HR2 at codons 1,057-1,105) [19], the transmembrane domain (codons 1,117-1,138), and a cytoplasmic tail. There are only eight amino acid substitutions, one of which is located in the HR1 region at position 871 (T871I). All have been described previously by Chibo and Birch [5]. None of the amino acid changes is located in the transmembrane domain or the cytoplasmic tail.

A phylogenetic analysis that includes our clinical isolates, the reference sequence, and various S sequences from clinical samples collected between 1979 and 2004 provides further evidence of divergence over time, as shown previously (Fig. 1a) [5]. Chibo and Birch presented four phylogenetically distinct groups of S gene sequences whose clustering matched the year of isolation, indicating genetic drift over time. Isolates 0349 and J0304 cluster with the group 4 viruses, a group that contains all GenBank S sequences that have been collected from 1999 onwards.

The ORF4 gene is located between the spike and envelope gene. Its function is unknown, yet studies with HCoV-NL63 accessory protein ORF3, a homolog of 229E-ORF4, revealed that the protein is incorporated into virions and is, therefore, an additional structural protein [13] .

There are major differences between the clinical isolates and the reference sequence. Most important is a 2 nt deletion in the reference strain that interrupts the gene, whereas the two clinical isolates have an intact ORF4, as previously published [20]. The putative full protein is 219 amino acids in size. Besides the insertion/deletion, 17 nt changes are observed (identical in both clinical isolates), resulting in nine amino acid substitutions. The envelope gene encodes a 77 amino acid protein, and the clinical isolates differ from the reference at only 2 nt, one of which results in an amino acid change (V12I).
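
Why a 2 nt deletion interrupts an open reading frame can be shown with a toy example: removing a number of nucleotides that is not a multiple of three shifts the reading frame, and the shifted frame usually runs into a premature stop codon. The sequences below are invented for illustration and are not ORF4 sequences; the function name `translate_orf` is likewise an assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(sequence):
    """Translate from the first base until the first stop codon or the end."""
    seq = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


intact = "ATGGCTAAAGTTGATTGGACTTAA"   # toy ORF encoding seven residues
deleted = intact[:6] + intact[8:]     # same ORF with a 2 nt deletion

print(translate_orf(intact))   # full-length toy product
print(translate_orf(deleted))  # frameshift leads to an early stop
```

In the toy case the deletion truncates the product after three residues, which mirrors, in miniature, how the reference sequence ends up with two shorter ORFs (ORF4A and ORF4B) in place of one intact ORF4.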

The membrane gene consists of 678 nucleotides, encoding a 225 amino acid protein. Both clinical isolates have 10 nt differences with the reference sequence, of which only one is non-synonymous (F82L).

The nucleocapsid protein plays a fundamental role in virus assembly and RNA synthesis [21]. Among the 1,170 nt that encode the N protein, 26 nt substitutions with seven amino acid changes were noted in 0349, and in J0304 there are 25 nt substitutions that cause eight changes at the amino acid level. Three of these changes are located in a hot spot between amino acids 224 and 228. Inspection of N gene sequences that are available in GenBank revealed that six of the seven amino acid changes have been present in the circulating HCoV-229E strains for decades. This includes the 224-228 hot spot, which has been present since as early as 1982 [5]. Phylogenetic analysis shows clustering with the most recently obtained strains (Fig. 1b).

In the coronavirus genome, transcription-regulating sequences (TRSs) are present at the 3′ end of the leader sequence and upstream of each structural gene. For both clinical isolates, the TRS has the core structure UCUCAACU, except for the ORF4 gene, for which the TRS is UCAACU. The TRS core structures for the M and N genes differ by 1 nt (UCUAAACU), a variant that is also found in the reference sequence. All TRSs are exactly the same as in the reference sequence (Volker Thiel, personal communication), including the position of each TRS relative to its adjacent AUG (Table 2).
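
Locating the TRS core motifs named above amounts to an exact substring search on the genomic RNA. A minimal sketch, assuming exact matches only; the function name `find_motif` and the toy sequence are illustrative, and the dictionary labels are shorthand for the three core sequences given in the text.

```python
def find_motif(sequence, motif):
    """Return all 1-based start positions of `motif`, overlaps included.

    DNA input is accepted and converted to RNA (T -> U) before searching.
    """
    seq = sequence.upper().replace("T", "U")
    motif = motif.upper().replace("T", "U")
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(motif, start + 1)
    return positions


TRS_CORES = {
    "common core": "UCUCAACU",
    "M and N genes": "UCUAAACU",
    "ORF4": "UCAACU",
}

# Toy sequence, not real HCoV-229E data.
toy = "GGUCUCAACUAAAUGCCUCUAAACUGGAUG"
for label, core in TRS_CORES.items():
    print(label, core, find_motif(toy, core))
```

Because UCAACU is contained in UCUCAACU, a search for the shorter motif also hits every occurrence of the longer one, so hits need to be interpreted together with their position relative to the downstream AUG.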

Evaluation of the 3′ UTR reveals significant variation. In the clinical isolate 0349, there are 8 nt substitutions and three deletions. In J0304, there are nine substitutions and two deletions. The largest deletion in both isolates is a 38 nt deletion. Furthermore, two short deletions with lengths of 2 and 4 nt are observed in 0349, and a 4 nt deletion in J0304. The effect of such deletions on 3′ UTR structure and function is unknown. For the betacoronaviruses MHV and BCoV, it has been proposed that there are two conserved RNA structures at the upstream end of the 3′ UTR: a bulged stem-loop and an adjacent pseudoknot [22]. There is a highly conserved pseudoknot in all alphacoronaviruses, but no detectable counterpart of the bulged stem-loop in any proximity, upstream or downstream of the pseudoknot [22, 23]. Strain 0349 showed most substitutions (7/8) and deletions in the first 150 nt of the 3′ UTR. Surprisingly, the downstream remainder of the 3′ UTR, in which we observe only 1 substitution, is labeled the hyper-variable region (HVR). The HVR is marked as highly divergent in sequence and structure, even among closely related coronaviruses [24]. The HVR harbors the octanucleotide 5′-GGAAGAGC-3′, which is conserved in all coronavirus 3′ UTRs and is situated around 70-80 nt from the 3′ end of the genome [25]. In HCoV-229E strains 0349 and J0304, this octanucleotide is also conserved, at 73 nt from the 3′ end.
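
The position of the conserved octanucleotide relative to the 3′ end can be checked with a short search. The counting convention is an assumption here, since the text does not define it: the distance is taken as the number of nucleotides from the first base of the motif through the last base of the supplied sequence. The function name `distance_from_3prime` and the toy sequence are illustrative.

```python
def distance_from_3prime(genome, motif="GGAAGAGC"):
    """Distance of the last motif occurrence from the 3' end.

    Assumed convention: count from the motif's first base through the
    final base of `genome`. Returns None if the motif is absent.
    Whether a poly(A) tail is included depends on the input sequence.
    """
    seq = genome.upper().replace("T", "U")
    motif = motif.upper().replace("T", "U")
    start = seq.rfind(motif)
    if start == -1:
        return None
    return len(seq) - start


# Toy sequence, not real HCoV-229E data.
print(distance_from_3prime("AAAAGGAAGAGCUUUU"))
```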

In this study, we report the first full-genome sequences of two non-laboratory-adapted strains of HCoV-229E. Alignment of nucleotide and protein sequences and phylogenetic analysis of the two HCoV-229E strains showed several differences from the reference sequence. Genetic drift was noticed in the spike gene, and the only part of the genome truly affected by lab adaptation is the ORF4 gene.

Table 2 The leader and body TRSs in HCoV-229E clinical isolates (0349, J0304) and laboratory adapted isolate